I don’t think many people understand how essential statistics is to the design of experiments. The very job of a statistician is to design scientific studies and analyze the data they produce. Unfortunately, not everyone can be a statistician, so it falls to anyone interested in science to learn these things themselves. If they don’t take the time to understand study methodology and statistics, they cannot correctly appraise studies. And if they can’t accurately evaluate studies, they shouldn’t be interpreting them publicly and misleading people.
Statistical mistakes are prevalent everywhere, but some claims from the low-carbohydrate community have recently caught my interest, and I thought they were worth clearing up in a blog post, because spreading misinformation helps no one.
One of the first things that caught my eye was in a presentation given by Gary Taubes where he discusses the flaws in a pilot study conducted by Kevin Hall. Taubes says in the video (at 1:00),
“We have to do a pilot study to pilot the experiment. Because we don’t know how to power the experiment because we don’t know what size effects we would expect people to see. So the idea is rather than spending 30 million dollars on having any idea on how many people you’d need in the study to see a certain minimal effect size, we’re going to do a pilot study that’s going to be inherently flawed.”
This is misguided. Pilot studies are small studies that are done for the sake of feasibility. They tell you how an intervention will play out and what shortcomings you might expect in a larger study. The data they produce cannot be used for sample size calculations because they are heavily influenced by noise (random error). That means the standard deviations and effect sizes you get from them will likely be of little use for calculating sample sizes for larger studies: the estimated effect may be much larger or much smaller than the actual effect size (a type M, or magnitude, error).
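To see just how noisy a pilot-sized effect estimate is, here is a small simulation with made-up numbers (a true standardized effect of 0.3 and ten participants per arm; this is an illustration, not a reanalysis of any actual pilot study):

```python
import numpy as np

rng = np.random.default_rng(42)
true_d = 0.3          # true standardized effect (assumed for illustration)
n_pilot = 10          # participants per arm in the hypothetical pilot
n_sims = 10_000

est = np.empty(n_sims)
for i in range(n_sims):
    a = rng.normal(0.0, 1.0, n_pilot)      # control arm
    b = rng.normal(true_d, 1.0, n_pilot)   # treatment arm
    # pooled-SD Cohen's d computed from the pilot data alone
    sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    est[i] = (b.mean() - a.mean()) / sp

print(f"true d = {true_d}")
print(f"5th-95th percentile of pilot estimates: "
      f"{np.percentile(est, 5):.2f} to {np.percentile(est, 95):.2f}")
```

With only ten participants per arm, the pilot’s effect-size estimate ranges from clearly negative to roughly triple the true effect, which is exactly why plugging a pilot estimate into a power calculation is hazardous.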
Rather than use pilot studies to calculate sample sizes, it’s better to determine what is deemed to be the minimally clinically significant effect and go on from there with power calculations.
As Albers & Lakens point out,
“First, one can determine the smallest effect size of interest (SESOI), based on either utility or theoretical arguments, and use the SESOI in an a-priori power analysis. This leads to main studies that have a pre-determined statistical power to detect or reject the smallest effect size that is deemed worthwhile to study. For example, if researchers decide their SESOI is a medium effect size of η2 = .0588 a study with 87 participants in each of two groups will in the long run have a power of 0.9 to detect the SESOI, or reject it in an equivalence test (Lakens, 2017). Choosing a SESOI allows researchers to control their Type II error rate exactly for effect sizes they care about.”
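The numbers in the quote can be roughly reproduced with a standard power calculation. The sketch below uses the usual conversion from η² to Cohen’s f (and then to d for two groups) and a normal-approximation sample-size formula, so it lands near, but not exactly on, the 87 per group the quote reports (the exact t-based calculation gives a slightly larger n):

```python
import math
from scipy.stats import norm

eta2 = 0.0588
f = math.sqrt(eta2 / (1 - eta2))   # Cohen's f ~ 0.25
d = 2 * f                          # ~ 0.5 for a two-group comparison

alpha, power = 0.05, 0.90
z_a = norm.ppf(1 - alpha / 2)      # critical value for two-sided alpha
z_b = norm.ppf(power)              # value for the desired power

# normal-approximation sample size per group for a two-sample test
n = 2 * (z_a + z_b) ** 2 / d ** 2
print(f"n per group ~ {math.ceil(n)}")
```

The point is that every input here (the SESOI, alpha, and power) is chosen on substantive grounds before the main study, rather than estimated from a noisy pilot.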
The next mistake comes from a post in which nephrologist Jason Fung, a proponent of the carbohydrate-insulin hypothesis of obesity and of intermittent fasting, comments on the study done by Kevin Hall and writes,
“Look at how Hall describes the absolutely critical increase in EE. Here’s what he writes “the KD coincided with increased EEchamber (57 ± 13 kcal/d, P = 0.0004) and SEE (89 ± 14 kcal/d, P < 0.0001)” (emphasis mine). Hall is telling you that this was merely a coincidence that patients are all burning an extra 57 calories per day. WTF??? There is nothing coincidental about it. You switched them to a KD. EE increased. The P value of 0.0004 means that there is a 99.96% chance that this is NOT COINCIDENCE. Hall knows this as well as I do. This is basic statistics 101. Hall, a mathematician is surely aware of this.”
His misunderstanding of the word ‘coincide’ isn’t worth dwelling on. His misinterpretation of the p-value, however, is.
The p-value is not the probability that the data arose by random chance, and it certainly does NOT tell us the probability that our results are not a result of random chance. It gives us the probability of obtaining data at least as extreme as what we observed, given that there is no effect, i.e., assuming the results are due to random chance alone.
The definition is again, “if the null hypothesis is true, the p-value is the probability of getting a test statistic at least as extreme as what was observed.”
P-values are a form of long-term error control. They do not give us the probability of a hypothesis given our data. This is not a semantic argument; it is a conceptual one. Many people interpret p-values as if they were Bayesian posterior probabilities, but p-values are, again, tools for long-term error control.
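The definition is easy to check by simulation. Below, both groups are drawn from the same distribution, so the null hypothesis is true by construction; roughly 5% of the t-tests still produce p < 0.05, which is what “long-term error control” means (made-up sample sizes, purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 20_000, 30
pvals = np.empty(n_sims)
for i in range(n_sims):
    # both groups come from the SAME distribution: the null is true
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    pvals[i] = stats.ttest_ind(a, b).pvalue

# under a true null, p-values are uniform: about 5% fall below 0.05
frac = np.mean(pvals < 0.05)
print(f"fraction with p < 0.05: {frac:.3f}")
```

Note what this does not license: observing a single p = 0.0004 does not mean there is a 99.96% chance the effect is real; it only means such data would be rare if there were no effect.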
The next series of mistakes are taken from the blog of Richard Feinman, a biochemist, who’s also a low-carbohydrate proponent and supporter of the carbohydrate-insulin-hypothesis of obesity.
Feinman states in a post,
"You can think of the hazard ratio as similar to an odds ratio which is what it sounds like: the comparative odds of different possible outcomes. The basic idea is that if 10 people in a group of 100 have a heart attack with saturated fat in their diet, the odds = 10 out of 100 or 1/10. "
- This is not what an odds ratio is. Ten events among 100 people is a risk (10/100 = 0.10); the corresponding odds would be 10 to 90 (≈ 0.11). What he describes is a risk ratio, not an odds ratio.
- Hazard ratios are nothing like odds ratios. If anything, they are far more similar to rate ratios, because they are ratios of hazard rates, which involve person-time and represent instantaneous risk.
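The distinction is easy to see with the numbers from the quoted example (the person-time figure below is made up purely to illustrate what a rate involves):

```python
# 10 events among 100 people, as in the quoted example
events, total = 10, 100

risk = events / total                 # 10/100 = 0.10
odds = events / (total - events)      # 10/90  ~ 0.111

print(f"risk = {risk:.3f}, odds = {odds:.3f}")

# a hazard rate involves person-time, not just counts: if those 100
# people contributed, say, 250 person-years of follow-up (a made-up
# number for illustration), the rate would be
person_years = 250
rate = events / person_years          # events per person-year
print(f"rate = {rate:.3f} per person-year")
```

Risk and odds are close when events are rare, but they diverge as events become common, and neither involves follow-up time the way a hazard rate does.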
In the same post, he criticizes the use of meta-analysis, which he admits he had only recently learned of, and states the following:
“There is one important point here. It is a statistical rule that if the 95% CI bar crosses the line for hazard ratio = 1.0 then this is taken as indiction that there is no significant difference between the two conditions, in this case, SFAs or a replacement. Looking at the figure from Jakobsen, we are struck by the fact that, in the list of 15 different studies for two replacements, all but one cross the hazard ratio = 1.0 line; one study found that keeping SFAs in the diet provides a lower risk than replacement with carbohydrate. For all the others it was a wash. At this point, one has to ask why a combined value was calculated. How could 15 studies that show nothing add up to a new piece of information. Who says two wrongs, or even 15, can’t make a right? The remarkable thing is that some of the studies in this meta-analysis are more than 20 years old. How could these have had so little impact? Why did we keep believing that saturated fat was bad?”
Let’s unpack this statement,
“How could 15 studies that show nothing add up to a new piece of information. Who says two wrongs, or even 15, can’t make a right?”
This needs to be addressed in two points.
First, all studies suffer from sampling error and smaller studies will generally have a larger standard error. Pooling these studies will reduce the standard error and give us the ability to detect the effects that we are interested in.
Second, it’s worth remembering that there is no single “right.” When pooling studies in the real world, it often makes sense to use a random-effects model, which averages the observed treatment effects from the individual studies while allowing each study’s true effect to differ, and each observed effect to deviate from its true effect because of sampling error.

A random-effects meta-analysis estimates the average of these underlying study effects; it is not trying to recover one universal true effect. So there is no “right” and “wrong.” That is the assumption of a fixed-effects model, and if the authors had used a fixed-effects model here, they would have been violating its statistical assumptions.
He goes on with this argument against meta-analysis in another post…
“The idea underlying the meta-analysis, however, usually unstated, is that the larger the number of subjects in a study, the more compelling the conclusion. One might make the argument, instead, that if you have two or more studies which are imperfect, combining them is likely to lead to greater uncertainty and more error, not less. I am one who would make such an argument. So where did meta-analysis come from and what, if anything, is it good for?”
This would be true if the meta-analysis used a fixed-effects model, which weights larger studies more heavily because they have less sampling error. In a random-effects model, however, the weights are spread more evenly across studies, because every study’s weight incorporates the between-study variance; the number of participants is no longer the only factor in the equation, and the number of studies also plays a huge role.
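The re-weighting is concrete in the standard DerSimonian-Laird calculation. This sketch uses entirely made-up effect estimates and standard errors (not Jakobsen’s data): one study is much larger than the rest, and you can watch its dominance shrink once the between-study variance τ² enters the weights:

```python
import numpy as np

# hypothetical log-hazard-ratio estimates and standard errors for five
# studies of very different sizes (made-up numbers for illustration)
y  = np.array([0.10, -0.40, 0.50, -0.55, 0.30])
se = np.array([0.05, 0.30, 0.25, 0.35, 0.28])   # first study is much larger

w_fixed = 1 / se**2                              # inverse-variance weights

# DerSimonian-Laird estimate of the between-study variance tau^2
pooled_fixed = np.sum(w_fixed * y) / np.sum(w_fixed)
q = np.sum(w_fixed * (y - pooled_fixed) ** 2)    # Cochran's Q
df = len(y) - 1
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (q - df) / c)

w_random = 1 / (se**2 + tau2)                    # random-effects weights

print("fixed-effect weights: ", np.round(w_fixed / w_fixed.sum(), 3))
print("random-effects weights:", np.round(w_random / w_random.sum(), 3))
```

In this made-up example the large study carries about 89% of the fixed-effect weight but well under half of the random-effects weight, which is exactly the “more balanced” behavior described above.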
I do agree, though, that systematically adding imperfect studies will not help. That’s why it may be worth including only studies that are at low risk of bias and of high quality.
“If all of the studies go in the same direction, you are unlikely to learn anything from combining them. In fact, if you come out with a value for the output that is different from the value from the individual studies, in science, you are usually required to explain why your analysis improved things. Just saying it is a larger n won’t cut it, especially if it is my study that you are trying to improve on.”
Not true at all. Type-S (sign) errors aren’t the only concern; we also need to worry about type-M (magnitude) errors. Whether the studies go in the same direction is not what matters if we’re trying to determine whether a treatment is equivalent or superior to another; for that, we need precision and a reduced standard error.
“Finally, suppose that you are doing a meta-analysis on several studies and that they have very different outcomes, showing statistically significant associations in different directions. For example, if some studies showed substituting saturated fat for carbohydrate increased risk while some showed that it decreased risk. What will you gain by averaging them? I don’t know about you but it doesn’t sound good to me.”
The answer to this, again, is a random-effects model. Much of this seems to stem from a misunderstanding of how meta-analysis and heterogeneity work and the theory behind them.
These are only a few posts I found with clear errors and misunderstandings. I can’t imagine how many more poor interpretations of studies are out there because of misunderstandings of study methods and statistics.