Thanks for the article and the discussion around it.

I want to focus on your conclusion #4 because I think it's where things tend to go astray, in part because it extends what looks like an error in conclusion #2: moving from "an effect was captured" to inferring it came from the intervention. I don't think subjective confidence is enough, but getting beyond that requires recognition, discussion, etc., which effectively means making Bayesian-style priors explicit.

The use of CIs, p-values, etc. all depends on certain assumptions about how the underlying reality actually looks. If, for example, the underlying spread isn't a normal distribution, I don't think any of those assertions are justified, much less hold. But even on their own terms, neither the CI nor the p-value makes a claim about the underlying reality; they make claims about what the results of repeated testing would or should look like.

As an example of a non-normal distribution that I hold near and dear, first-year salaries for lawyers are generally distributed bimodally.

https://www.nalp.org/salarydistrib

For kicks, I think it's possible to imagine pharmacogenetic responses to drugs working that way to some extent, at least in some circumstances (e.g., to keep it simple, responsiveness tracks a particular variant with a discrete spread in the population).
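To make that concern concrete, here is a minimal sketch in Python (with invented numbers that only loosely echo the bimodal shape of the NALP salary data) showing how a mean and its confidence interval can describe a value that almost no individual in a bimodal population actually has:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal population: two salary clusters near $65k and $190k.
# These numbers are illustrative only, not NALP's actual figures.
low = rng.normal(65_000, 8_000, 7_000)
high = rng.normal(190_000, 5_000, 3_000)
population = np.concatenate([low, high])

sample = rng.choice(population, size=200, replace=False)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(f"sample mean: {mean:,.0f}")
print(f"95% CI for the mean: {ci[0]:,.0f} to {ci[1]:,.0f}")

# The CI for the *mean* can be perfectly well behaved (the CLT does not
# require normal data), yet the mean sits in a valley where almost no
# individual salary actually falls -- the inference answers a question
# about an average, not about any typical person.
print("share of salaries falling within that CI:",
      np.mean((population >= ci[0]) & (population <= ci[1])))
```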

Making population-wide inferences from any given RCT's results becomes particularly problematic if the underlying assumptions about the model's ties to "true values" don't hold. And those assumptions should be disclosed and subject to real discussion as to why they are (or are not) believed to hold. Which ends up coming back to Bayesian approaches.

In other words, #4 seems to be asking the statistical tools to do something that they can't do and don't purport to do on their own. If it holds in the real world, it does so because of a number of assumptions that should be empirically tested and/or justified rather than simply assumed to apply.

But I think the point made by Bayesians and some of the other comments is that -- especially if the signal is clear -- you can get to the causation question without relying on, or getting caught up in, the statistical math reflected in the CI or p-values. More to the point, one should go straight to the causation question without getting hung up on the CI or p-values.

At this point, I've probably stepped in it somewhere.

That said, as an industry, I think medicine might benefit from looking back at the historical experience and disagreements between Ronald Fisher and William Gosset (aka Student).

As I understand it, Gosset wasn't a fan of the large RCT because it wasn't economically feasible in his role at the brewer Guinness. There is room for the RCT, but a fetish-like focus on RCTs, CIs, and p-values tends to distract from getting effective answers to the questions that matter for getting things done. A properly designed RCT can be a powerful piece of evidence, but it isn't conclusive in and of itself and can actually end up more biased than other approaches in some circumstances.

Curious as to your thoughts.

When I read a medical paper, I give little weight to the statistical analysis. I look at the raw figures and percentage differences and use common sense to determine whether the differences have any practical significance. Almost anyone with an average education can do the same. Other factors, such as the quality of the data, may greatly outweigh or even negate the value of the statistical analysis. There is no better illustration of that than the recent "Covid" episode, where the label was attached to a common clinical syndrome for which there was no definitive test. This led to the enumeration of "cases" and absurd diagnostic categories like asymptomatic cases, "Covid" carriers and "spreaders," "long Covid," etc.

This is true about 40% of the time.

1. Does a p-value automatically translate into a confidence interval? For example, does a p-value of 0.01 immediately define a 99% confidence interval?

2. Would it be correct to say that a confidence interval tells us that "the true effect/association lies within the confidence interval with a likelihood that is not definable, but is reasonably suspected to be sufficient to warrant subjective consideration that the true efficacy/association exists at a level contained within the interval"?

3. What is the difference between likelihood and probability?

Very interesting stuff. I originally read about the problems with p-values, and one of the issues raised concerned the ability to test for a viral infection, which made me somewhat skeptical of the Covid narrative. I found the article before Covid but remembered it when I started hearing people talk about testing. There are fundamental problems with statistical analysis that predate many people's ideas from 2020 onward.

"with this single poll, all we can say is the true result is likely somewhere between 37% and 43% but we will be wrong with that statement 5% of the time."

This is false.

Frequentist statistics tells us NOTHING AT ALL about the true result. Frequentist statistics is only able to make statements ABOUT THE EXPERIMENT assuming, as a hypothesis, some actual true result.

But this is not what we care about. We care about what the "true result" is.

To make a statement about the "true result", we MUST introduce a prior and then do Bayesian updating to a posterior from the results of the experiment.
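For the poll example, a minimal Bayesian sketch makes this concrete. It assumes roughly 400 "yes" answers out of 1,000 respondents (consistent with 40% plus or minus 3 points) and a flat Beta(1, 1) prior; those particulars are my own illustrative choices, not anything from the article.

```python
from scipy.stats import beta

# Assumed data: ~400 of 1,000 respondents favour the candidate,
# consistent with the 40% +/- 3 point poll being discussed.
yes, n = 400, 1000

# Flat Beta(1, 1) prior -> Beta(1 + yes, 1 + n - yes) posterior.
posterior = beta(1 + yes, 1 + n - yes)

# Now we can make a direct statement about the "true result":
p_in_interval = posterior.cdf(0.43) - posterior.cdf(0.37)
print(f"P(true support is between 37% and 43% | data) = {p_in_interval:.3f}")
# Roughly 0.95 with a flat prior -- numerically close to the frequentist
# interval here, but now it is a legitimate probability statement about
# the true result rather than about the experiment.

# Equal-tailed 95% credible interval:
print("95% credible interval:", posterior.ppf([0.025, 0.975]).round(3))
```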

That medicine still has not adopted Bayesian analysis in the 21st century is perhaps the greatest medical scandal.

https://www.ahajournals.org/doi/full/10.1161/CIRCOUTCOMES.117.003563

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10117244/

https://www.nature.com/articles/s41573-023-00638-0

Great article. Very informative, and effectively verbalizes concepts that are not intuitively obvious for an end-user of evidence like me.

I am dubious about the difference between two options:

1. There is an undetected source of bias or confounding

2. The result really was due to chance

I think that in a randomized trial these are the same, and any "confounding" or "bias" is also due to chance. I put those in quotes because I think it is not correct to talk about confounding in a randomized setting: from a causal point of view, an imbalanced covariate is not a confounder, because there is no causal link between that covariate and the randomization. It is just one more thing "due to chance."
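A quick simulation of that point (Python, with made-up numbers: 200 patients, a baseline covariate present in 30% of them, 1:1 randomization) shows that "imbalance" appears at the rate chance predicts, which is exactly what the trial's error term is already accounting for:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 10_000
imbalanced = 0

for _ in range(trials):
    covariate = rng.binomial(1, 0.3, size=n)           # e.g. 30% smokers
    arm = rng.permutation(np.repeat([0, 1], n // 2))   # 1:1 randomization
    # Crude "imbalance" flag: smoker rates differ by more than 10 points.
    diff = abs(covariate[arm == 1].mean() - covariate[arm == 0].mean())
    imbalanced += diff > 0.10

print(f"share of trials with a >10-point imbalance: {imbalanced / trials:.3f}")
# A nontrivial share of runs with these made-up numbers. Randomization does
# not guarantee balance in any single trial; it guarantees that imbalance is
# itself a chance phenomenon.
```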

1. For many decades, I have kept coming back to Deming (1975) on the difference between enumerative studies and analytical studies. https://deming.org/wp-content/uploads/2020/06/On-Probability-As-a-Basis-For-Action-1975.pdf

2. F. Harrell's comment makes some very important points.

3. Contrary to the author's suggestion, 95% CIs have NOT stood the test of time. In my judgment (which could be wrong), the test of time has shown the opposite!

99% CIs are what medicine should be using! Medical research has revealed time and time again that it is hiding ignorance in the space between 2 and 3 sigmas.

If we went back 50 years and re-examined all of the literature using 99% CIs, we would see very clearly the missteps and wasted dead ends of misunderstanding.

Studies that purport to show something meaningful with a 95% CI but fail with a 99% CI are an opportunity to investigate heterogeneity, i.e., what really is the particular class of patients who might benefit from the proposed medical intervention.
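A small illustrative calculation (Python, with trial counts I made up for the purpose) of that 2-to-3-sigma zone, where a relative risk clears the 95% bar but not the 99% bar:

```python
import numpy as np
from scipy import stats

# Made-up trial counts: 75/1000 events on treatment vs 105/1000 on control.
events = np.array([75, 105])
n = np.array([1000, 1000])

rr = (events[0] / n[0]) / (events[1] / n[1])
se_log_rr = np.sqrt((1 / events - 1 / n).sum())   # SE of log relative risk

for conf in (0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    lo, hi = np.exp(np.log(rr) + np.array([-z, z]) * se_log_rr)
    print(f"RR {rr:.2f}, {conf:.0%} CI: {lo:.2f} to {hi:.2f}")

# With these made-up numbers the 95% CI (about 0.54 to 0.95) excludes 1.00,
# while the 99% CI (about 0.49 to 1.04) crosses it -- exactly the space
# between 2 and 3 sigmas where ignorance can hide.
```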

Thank you for the nice discussion of an always tricky topic.

I want to comment on a side issue you discuss: that of using the word "trend". I want to defend its use when summarizing the results of a trial.

Imagine the sentence, "Patients who received drug X had no reduction in all-cause mortality, but CV mortality was reduced by 12% and there were trends toward reductions in MI, stroke, and CABG."

Statisticians (and many clinical epidemiologists) will correctly point out that no test for trend was done and will then conclude that the word "trend" is ambiguous and unhelpful. While it is slightly vague, it is actually quite helpful in quickly communicating an interpretation of the trial.

The author of such a sentence is saying two things with the word "trend": the point estimates for reduction in MI, stroke, and CABG were clinically meaningful, and each p-value was more than 0.05 but not so big as to make the statement laughable. No one would laugh if a p-value were 0.07 and everyone would laugh if a p-value were 0.48. We can be pretty sure that most of us, who dislike being laughed at, would not risk writing such a sentence for a p-value bigger than about 0.15 or maybe 0.20.

The word "trend" is a shorthand here. If you want to know more, you need to go actually look at the trial results, but this is the interpretation of those results by the person writing the sentence. It is quick, clean, and easy to understand.

In contrast, let's consider the statistically and epidemiologically correct version of the sentence:

"Patients who received drug X had no statistically-significant reduction in all-cause mortality (HR 0.98; 95% CI 0.94 to 1.03) although the CI includes values that could be clinically meaningful; CV mortality was statistically significantly reduced (HR 0.88; 95% CI 0.75 to 0.98) but at the upper extreme of the HR, this would likely not be clinically meaningful; there was no statistically significant reduction in MI (HR 0.86; 95% CI 0.71 to 1.03), stroke (HR 0.90; 95% CI 0.70 to 1.08), or CABG (HR 0.85; 95% CI 0.70 to 1.12), but the point estimates for each of these would be clinically meaningful if real, while the upper ends of the confidence intervals include the possibility of harm."

There are absolutely times when I want to write that longer, more accurate sentence. But the shorthand of "trend" has value too.
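For what it's worth, the informal rule can even be written down. Here is a toy sketch (Python, with invented outcome names, hazard ratios, p-values, and the rough 0.05 / 0.15 cut-offs described above; none of it is a standard):

```python
def summarize(outcome: str, hr: float, p: float,
              meaningful_hr: float = 0.90) -> str:
    """Toy rendering of the informal 'trend' shorthand described above."""
    if p < 0.05:
        return f"{outcome}: reduced (HR {hr}, p={p})"
    if p < 0.15 and hr <= meaningful_hr:          # rough, debatable cut-offs
        return f"{outcome}: trend toward reduction (HR {hr}, p={p})"
    return f"{outcome}: no clear signal (HR {hr}, p={p})"


# Invented numbers purely for illustration:
for outcome, hr, p in [("outcome A", 0.88, 0.03),
                       ("outcome B", 0.86, 0.09),
                       ("outcome C", 0.97, 0.48)]:
    print(summarize(outcome, hr, p))
```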

Very nice to have this refresher review---much appreciated.

A useful post here -- thank you. The noun "chance" was used many times in this overview and this innocent bit of verbiage signals a communication problem that's rarely mentioned in any textbook.

My major beef for a long time (I'm a retired academic surgeon) has been that folks never explicitly articulate "the chance of WHAT." This issue never failed to arise whenever I asked our residents and staff, during journal clubs, what some given P-value signified in a research paper. My best understanding is that what's "actually measured" with a P-value (in a project where conventional null hypothesis testing is used) is the probability that the particular observed data in our sample [whether counts or measured continuous variables] arose *merely because of imperfect sampling* of the two compared populations, given -- and this is important -- four ordinarily tacit but very crucial contingencies: IF there was no fraud, IF there were uniformly accurate measurements, IF the two sampled populations were indeed identical, and IF the experiment was perfectly executed in all possible respects.

A toy example easily illustrates this. Presume that a large jar contains a huge number of beads (let's say 4,416) that are either red or black. There are exactly equal numbers of red and black beads in the jar, but WE DO NOT KNOW THIS. The beads are perfectly spherical and all have the same diameter. The beads are thoroughly mixed before sampling and nobody cheats in any way at any time. While wearing a blindfold, I use a special scoop that removes 100 beads; we then tally the number of red beads and black beads in the scoop. Our task is to estimate the jar's composition from what is found in that single sample.

Of course, we would not necessarily find 50 black and 50 red beads in our single sample of 100 beads. Use of the noun "chance" in this context reflects the fact that it would not be impossible to find (for example) 47 black and 53 red beads in a given single sample of 100 beads EVEN IF the jar truly contained equal numbers of blacks and reds. A computed P-value would give us the probability of drawing a sample at least as lopsided as the one we presently have in hand IF the jar actually contained equal proportions of red and black beads, and IF all of the other contingencies listed above were satisfied. If the P-value were sufficiently small, however, we could draw the inference that the jar did NOT contain equal numbers of red and black beads.
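To put numbers on the toy example, here is a minimal sketch using scipy's exact binomial test for a scoop of 47 black and 53 red beads:

```python
from scipy.stats import binomtest

# One blindfolded scoop of 100 beads: say 47 black and 53 red.
result = binomtest(k=47, n=100, p=0.5, alternative="two-sided")
print(f"P-value = {result.pvalue:.2f}")   # about 0.62

# A draw at least this lopsided is quite probable EVEN IF the jar really
# holds equal numbers of red and black beads -- and the number only means
# what it appears to mean if the other tacit contingencies (no fraud,
# accurate counting, thorough mixing) also hold.
```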

Thanks for a simple refresher course in stats. "When trial results are presented, we need not worry too much about the complex mathematical computations underlying the statistical presentations." Um, maybe, but if researchers are making such silly mistakes as using p-values on descriptive statistics, maybe we DO need to take a closer look at the statistical tests being used, since there are different types of tests for different types of data, and we should make sure they match. (We need to know whether the scale on which the variable(s) are measured is ordinal, categorical, or interval, and also whether they are normally distributed -- see link #1 below.) Also, some researchers dichotomize continuous variables, which creates several problems (link #2).

Personally, I think all studies should report the absolute risk reduction in addition to the relative risk reduction, since the latter is meaningless without knowing the baseline absolute risk and can be very misleading. For example, a 50% RRR might mean an ARR of 50% (i.e., risk cut from 100% to 50%, very unlikely), or it might mean an ARR of 1% (i.e., risk cut from 2% to 1%). (A quick worked example follows the links below.)

Relative risk reduction also seems to be misunderstood even by some doctors and Ph.D. scientists. I have heard a Ph.D. researcher who is an expert in infectious disease say that a vaccine with a 96% RRR means that it will be effective in preventing disease in 96% of the people who get the vaccine. Um, really? No, it means that in the clinical trial, the control event rate minus the experimental event rate, divided by the control event rate, equaled 96%. The lower the event rate in the control group, the larger the difference between relative and absolute risk reduction, which is why it's really important to know the event rate in the control group. I'm pro-vaccine, but it bothers me when I hear such statements from people who should know better, because they just give ammunition to people who are against vaccines on the grounds that the messaging is misleading. We should also choose outcomes that matter, such as hospitalization and death, rather than, say, mere infection.

https://stats.oarc.ucla.edu/spss/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-spss/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1458573/
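Here is the quick worked example mentioned above (Python, with invented event rates), showing how the same relative risk reduction can correspond to very different absolute benefits:

```python
def risk_summary(cer: float, eer: float) -> str:
    """CER = control event rate, EER = experimental event rate."""
    arr = cer - eer        # absolute risk reduction
    rrr = arr / cer        # relative risk reduction
    nnt = 1 / arr          # number needed to treat
    return (f"CER {cer:.1%}, EER {eer:.1%} -> "
            f"RRR {rrr:.0%}, ARR {arr:.1%}, NNT {nnt:.0f}")

# The same 50% relative risk reduction, very different absolute benefit:
print(risk_summary(0.02, 0.01))   # RRR 50%, ARR 1.0%, NNT 100
print(risk_summary(1.00, 0.50))   # RRR 50%, ARR 50.0%, NNT 2
```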

Thank you! Regression analysis aged me. I wish I had come across this summary last year; it's a solid overview of concepts that appeals to all students wherever they are on the learning continuum. Statistics is a blood sport, not for the faint of heart nor for online orphans.

There is much, much good in this article. The authors took great pains at the outset to interpret a confidence interval exactly correctly. Then they made a mistake:

"So, with this single poll, all we can say is the true result is likely somewhere between 37%

and 43% but we will be wrong with that statement 5% of the time."

No. Both parts of this sentence are incorrect. In frequentist statistics the true value is either inside or outside the interval; there is no probability attached to this. The probability statement does not apply to 0.37 and 0.43 but to the process that generated this interval.
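A small simulation makes the "process" interpretation concrete. It assumes repeated 1,000-person polls of a population whose true support is 40% (my own illustrative setup, not anything from the article):

```python
import numpy as np

rng = np.random.default_rng(2)
true_p, n, polls = 0.40, 1000, 10_000
covered = 0

for _ in range(polls):
    p_hat = rng.binomial(n, true_p) / n
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half_width) <= true_p <= (p_hat + half_width)

print(f"share of intervals containing the true 40%: {covered / polls:.3f}")
# Close to 0.95: the "95%" describes this long-run process. Any single
# interval, such as 37% to 43%, either contains the true value or it does
# not; there is no probability left over to attach to that one interval.
```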

The extreme difficulty in interpreting confidence intervals should drive more people to Bayes, as described in my Bayesian journey at https://fharrell.com/post/journey.

Later the authors say

"Inferential statistics actually do NOT help us test a research hypothesis about whether an intervention worked or not. They assume the observed difference was solely due to chance and simply give us an estimate of the probability of such an occurrence over many potential repetitions of the study."

This is incorrect, as the statement applies only to classical frequentist inferential statistics. Any article on statistics that doesn't acknowledge the existence of Bayes is problematic.

Now take a look at

"No statistics can tell us if the medication worked or if the differences seen were clinically important. These decisions are clinical judgments--not statistical judgements. The ONLY reason we do inferential statistics is to singularly deal with the issue of chance. This concept is key to understanding inferential statistics."

That is false, as it again applies only to classical frequentist statistics. With Bayesian posterior probabilities you do not need to deal with "chance" in the sense above, and you obtain direct evidence measures such as the probability that the treatment has any effectiveness and the probability of clinically meaningful effectiveness. And Bayesian uncertainty intervals are so much easier to interpret than confidence intervals.
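As a minimal sketch of what such direct evidence measures look like, here is a simple conjugate normal-normal update for a log hazard ratio (Python; the prior and the trial summaries are invented for illustration and do not reflect any particular published analysis):

```python
import numpy as np
from scipy.stats import norm

# Invented summaries: a normal prior on the log hazard ratio centred on
# "no effect", updated with an estimated log HR and its standard error.
prior_mean, prior_sd = 0.0, 0.35
est_log_hr, se = np.log(0.85), 0.08

# Conjugate normal-normal update (precision-weighted average).
post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + est_log_hr / se**2)
posterior = norm(post_mean, np.sqrt(post_var))

print("P(any benefit, HR < 1.0)        =", round(posterior.cdf(np.log(1.00)), 3))
print("P(meaningful benefit, HR < 0.9) =", round(posterior.cdf(np.log(0.90)), 3))
print("95% credible interval for HR    :",
      np.exp(posterior.ppf([0.025, 0.975])).round(2))
```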

An article about statistics should be exactly correct so as not to mislead readers, and researchers should stop pretending that the p-value/confidence-limit form of inference is the only form that exists. Otherwise, new confusions will arise.
