46 Comments
Georgia Barry

Please do more stuff like this. Basics are so important.

Lorenzo Ferro

On the topic starting around minute 40, about statistical significance and subgroup outcomes:

I agree that you can almost always find correlations if you look deep enough. But would these be significant? Most of them, no. The subgroups for which the correlations are significant show that the main trial hypothesis was (partially) false, because an effect of the intervention can only be seen in those subgroups.

But in those subgroups the intervention does show a significant difference from the control group, and it must therefore be taken as a positive result, even if it involves something we can't believe, like astrological signs.

This doesn't show that subgroup analysis is "bad". It shows that it is as "bad" as statistical analysis of medical trials in general: randomization is always imperfect; individuals are not all equal; you can't understand how the world works with experiments alone.

If you think about it, it makes little epistemological sense to claim that an intervention works in x% of cases. Either it achieves a certain measurable result in an individual, or it doesn't. RCTs help find promising interventions for groups of people, but they are just the first step in finding what effect the intervention actually has on each individual.

Actually doing the work of finding larger differences in subgroups should be an integral part of large trials, both to guide further research and to inform medical practice, especially if the data are not open and other researchers cannot run these analyses themselves.

But of course the ideal is that all available (anonymized) data be made public, because the publishing research team can't analyze all the possible subgroups and must choose.

mani malagón

Kahneman & Tversky, and Ioannidis ... That being said, we should also heed the 19th-century warning: "Lies, damn lies, and statistics!"

The only thing I learned at MIT was that "hard" science requires quantification. But how do you carry that concept into bio-individual effects? For example, folks on low-carb diets with lots of lean body mass have higher cholesterol levels than the 90% metabolically impaired American "norm" with lower cholesterol levels.

The statistics can be totally valid in describing measured cholesterol, but there's an inherent bias that makes a lie of all the "data".

Steve Cheung

Thanks for providing the transcript. I find it quicker to read it than to listen to it, even if sped up to 1.25x or 1.5x.

James R

Especially since they probably didn’t power the trial to detect the safety signal.

James R

The most egregious use of P-values is when the clinical trial finds adverse events, but they’re not statistically significant. Like Vioxx caused 4 times as many heart attacks, but it wasn’t statistically significant, so it’s a good drug. No. I don’t want to take a pill that causes heart attacks just because the P-value was greater than 0.05.
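
To see how a fourfold excess can still fail to reach "significance," here is a minimal sketch with hypothetical event counts (not the actual Vioxx numbers), conditioning on the total number of heart attacks and asking how surprising the split between two equal-sized arms would be if the true risks were equal:

```python
from scipy import stats

# Hypothetical counts, not the real Vioxx data: 8 heart attacks on drug,
# 2 on placebo, equal-sized arms. Conditional on 10 events in total,
# equal risk would make the split behave like 10 fair coin tosses.
result = stats.binomtest(8, n=10, p=0.5, alternative='two-sided')
print(f"Fourfold excess (8 vs 2): two-sided p ≈ {result.pvalue:.3f}")
# p ≈ 0.11 > 0.05: a 4x difference in a rare harm can easily be
# "not statistically significant" when the trial is underpowered for it.
```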

Lorenzo Ferro

In another comment I explained that the problem lies in the need for systematic subgroup analysis, which would indirectly help deal with the safety issue you're raising.

If you do subgroup analysis, it's likely that a safety signal that isn't statistically significant in the overall sample becomes significant in a specified subgroup.

Seneca Plutarchus

Isn't post hoc analysis just fishing for statistical significance? If you look for a .05 P value in 5 additional comparisons, your chance of finding one purely by chance is over 20%.
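
The arithmetic behind that figure (a minimal sketch, assuming the five extra comparisons are independent and each is tested at alpha = 0.05 with every null true):

```python
# Chance of at least one spurious p < 0.05 among k independent comparisons
# when every null hypothesis is actually true.
alpha = 0.05
for k in (1, 5, 20):
    print(f"{k:2d} comparisons: {1 - (1 - alpha) ** k:.1%} chance of a false hit")
# 5 comparisons -> ~22.6% ("over 20%"); 20 comparisons -> ~64.2%.
```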

Lorenzo Ferro

But you're giving up the main hypothesis and settling for a weaker (and more precise) one, so I don't see any problem. Plus the new subgroups still need to satisfy obvious standards for sample selection.

If a proper subgroup analysis tells you that the Geminis (?) have 4x more adverse events and it is significant, you have to understand the result: you can't just shrug it off, saying that it is somehow "wrong" because you can find dozens of similar significant correlations.

I think we should analyze these correlations if they are "medically relevant" and not just weird curiosities (which clashes with the almost criminal lack of openness of experimental data).

Of course the explanation may be as easy as selection bias or finding that people born in a certain period have different features; or it can relate to other confounders and remain baffling forever.

But for "medically relevant" cases we could do more experiments and possibly show that the initial trial was indeed an unlucky tail of the distribution. Or not.

Seneca Plutarchus

No, you would have to replicate the subgroup finding, because as you analyze more and more subgroups, the chance that you randomly find something significant approaches 100%. And by randomly I mean like a fair coin flipping heads 8 times in a row by chance, not an actual finding.

Lorenzo Ferro

I don't understand your argument.

There's no point after which you can be "absolutely sure" and stop needing to replicate your findings. It's the problem of scientific induction and reasonably it can only be solved in a pragmatic way.

You can claim that there's also a theoretical problem of the absence of a plausible biological mechanism for what happens.

But lots (most?) of scientific results have come out of nowhere, almost by surprise, and an implausible correlation can hide something very plausible.

Sure: as long as you have lots of data about your sample, you can find lots of significant outcome/subgroup correlations. But most if not all of these subgroups will be biased. Having a 100% chance of finding unbiased subgroups with significant outcomes doesn't sound right. Care to elaborate?

Seneca Plutarchus

Do you understand the null hypothesis and P values? As you analyze more and more subgroups, you are all but guaranteed to find one where you have hit the 5% chance of random chance having produced the association. The most likely explanation for your example of post hoc Geminis having 4x the chance of some finding is, in fact, random chance.

If you decide to accept a 5% chance of any result happening by random chance and then, for example, do a post hoc analysis of 20 subgroups, you would expect to find 1 or more significant results about 64% of the time, and it would be due to random chance, not an actual finding.
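
A small simulation illustrates the point (a minimal sketch, assuming 20 post hoc subgroups, a continuous outcome with no true treatment effect anywhere, and an ordinary t-test in each subgroup; the subgroup size is made up purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_trials = 2000      # simulated trials, all with zero true effect
n_subgroups = 20     # post hoc subgroups examined per trial
n_per_arm = 50       # patients per arm in each subgroup (illustrative)

hits = 0
for _ in range(n_trials):
    for _ in range(n_subgroups):
        treated = rng.normal(0.0, 1.0, n_per_arm)   # no real effect anywhere
        control = rng.normal(0.0, 1.0, n_per_arm)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            hits += 1
            break

print(f"Trials with at least one 'significant' subgroup: {hits / n_trials:.0%}")
# Expect roughly 1 - 0.95**20 ≈ 64%, every bit of it noise.
```

The "significant" subgroups such a search turns up are exactly the kind of finding that replication is meant to filter out.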

Ernest N. Curtis

I also loved this conversation. (Full disclosure: I am also a UC Berkeley alum.) I think one statement by Professor Recht sums up the entire discussion:

"When did we decide we are too dumb to just look at numbers in a table?"

I like it because it validates the advice I gave in an appendix to a book I wrote about a dozen years ago. The appendix was titled "How to Read a Medical Journal Report". My advice was to disregard p values, regression equations, confidence intervals, and other statistical figures; just look at the percentages for the relevant outcomes and decide for yourself whether the differences between the treatment and control groups are of any practical significance.

Seneca Plutarchus

Well, except that the paper cited in this post seems to show that you’d need much more dramatic raw results to show that the study was actually powered correctly. So the typical P values found in trials are almost certainly correlated with raw results that may look impressive but don’t actually show that the trial was successful.

GJTL

Totally agree. I look at the raw data (sample size, outcomes of treatment vs. controls, and adverse outcomes) and decide whether the treatment is convincing or worth the risk. For shared decision making, many patients would appreciate (and understand) this data in a simple table, rather than being told they would reduce their risk (absolute or relative) by x%.

Ernest N. Curtis

Thanks. Apparently my comment irritated at least one biostatistician. In many years of practice I rarely encountered any patients who couldn't comprehend that kind of common sense explanation.

JDK

"Eye-balling"?

Humans are simultaneously both horrible and good at this. But our errors and excellence are not likely symmetric. We are mostly crap at it even with a lot of training.

Eye-balling cannot work where errors can be catastrophic and benefits small. It is only an effective strategy if the loss is very small and the benefits big.

Less eye-balling and more math is required in medicine, because human beings are being treated.

Ernest N. Curtis

I am not suggesting "eye-balling"---a term which connotes carelessness. I think anyone of average intelligence can look at the numbers and come to a reasonable conclusion about the practical significance of any given procedure or medication. Statistical significance only comes into play when the benefits are quite small.

JDK

"I think anyone of average intelligence can look at the numbers and come to a reasonable conclusion about the practical significance of any given procedure or medication."

"Eyeballing" = "looking at the numbers and thinking you can come to a reasonable conclusion."

Time and time again, we see that this is not true.

Philip Miller

If this took 40 minutes to explain, you did not accomplish the task. P-values are so difficult to understand. In fact, there is a movement to use other measures of statistical validity. So often these trials today use so much statistical chicanery to hide or confuse the real results.

Relative vs. absolute values is the real issue.

RCTs suffer from dose-response variables. RCTs are not applicable to surgical procedures.

Dennis Robert

Enjoyed the conversation, especially the example citations of the DANISH trial and the zodiac sign–aspirin thing.

I think the most puzzling and frustrating element in a clinical research protocol is sample size estimation. Everyone wants it, but very few understand it, and those who do understand it approach it with a lot of skepticism. It is also frustrating to see papers getting published with very small sample sizes, while our own paper submissions get rejected by editors or peer reviewers saying we don't have a large "enough" sample size. Science is also unjust and unfair, just like politics, in many cases. "All animals are equal, but some animals are more equal" works in science too.

By the way, I don't think I quite agree with Ben on the explanation of the 95% CI. Perhaps I didn't fully get it, but I am sure there are much more intuitive ways to explain it.

Also, with regard to power, sample size, and statistical significance testing, I really like the Neyman-Pearson approach for its intuitiveness.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347431/
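
For readers who want the intuition behind that Neyman-Pearson framing, here is a minimal sketch of the textbook sample-size calculation for comparing two proportions; the 20% vs. 15% event rates, alpha, and power are made-up illustrative inputs, not taken from any trial:

```python
from statistics import NormalDist

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to detect p_control vs p_treatment
    with a two-sided test at level alpha and the given power
    (simple normal-approximation formula for two proportions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2

# Detecting a drop in event rate from 20% to 15%:
print(round(n_per_arm(0.20, 0.15)))   # roughly 900 patients per arm
```

Because the required size scales with the inverse square of the effect, an effect only slightly smaller than hoped for turns into a much larger, and often unaffordable, trial.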

JDK

The "line of surprise" was empirically shown by Walter Shewhart to be greater than 3 sigmas. And this is true even if the distribution is not Gaussian. See Wheeler.

And theoretically, Chebyshev's inequality is valid for any unimodal distribution, i.e., within two standard deviations away from the mean contains 75% of the values, and within three standard deviations away from the mean contains 88.9% of the values.

https://www.nature.com/articles/s41562-017-0189-z

If we went back and re-analyzed all research over the last 60 years (observational and RCT) using a 3-sigma line of surprise, we'd be better off.

Sander Greenland

Should be "for any unimodal distribution with a standard deviation, within two SDs of the mean contains AT LEAST 75% of the values, and within three standard deviations of the mean contains AT LEAST 88.9% of the values." But even then, most distributions I see in medical statistics are far closer to Gaussian than to the extreme distributions allowed by Chebyshev. A Gaussian has over 95% of the values within two SDs, and over 99.7% within 3 SDs, and data at 3 SDs are more surprising than all heads in 8 fair tosses.

Also: far better than using a "line of surprise" would be to present the degree of surprise at seeing the data for various effect sizes of practical interest, including no effect, a minimal important effect, a moderately strong effect, etc.
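
Both sets of numbers are easy to check (a minimal sketch using only the Python standard library; "within k SDs" is two-sided):

```python
import math

def gaussian_within(k):
    """Probability a Gaussian observation lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))

def chebyshev_within(k):
    """Chebyshev lower bound for any distribution with a finite SD."""
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(f"{k} SDs: Chebyshev >= {chebyshev_within(k):.1%}, Gaussian {gaussian_within(k):.2%}")

# Surprisal of a two-sided 3-SD Gaussian result, measured in fair-coin tosses:
p = 1 - gaussian_within(3)                 # ~0.0027
print(f"3-SD two-sided p ≈ {p:.4f}, ≈ {math.log2(1 / p):.1f} heads in a row")
```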

JDK

Well we have different ideas. Physics uses 6 sd! At 2 sigma we are being fooled by randomness.

I'm not sure "degree" of surprise makes sense.

Sander Greenland

Please try reading the citations I gave earlier before responding. Degree of surprise (surprisal) has been in use in communication engineering since the 1940s and has plenty of literature on it.

You won't be fooled by randomness if you don't take 2 sigma as much evidence of conflict, because it represents only about 4 bits of information against the hypothesis or model; so it's like judging coins using only 4 tosses. Physics uses a one-sided 5 SD per experiment (p = 1 in 3.4 million, like getting almost 22 heads in a row) - check the Higgs discovery papers for that. 6 SD is from the quality-control literature (one-sided p=1 in a billion, almost 30 heads in a row), but anyone dealing with defective returns these days knows it's not enforced on assembled products (even though QC for individual components like chips may use it to limit returns on completed products).
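
The coin-toss equivalents quoted here can be reproduced directly (a minimal sketch; "sigma" means a one-sided Gaussian tail, per the convention described above):

```python
import math

def one_sided_p(z):
    """One-sided Gaussian tail probability beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (5, 6):
    p = one_sided_p(z)
    print(f"{z} sigma: p = 1 in {1 / p:,.0f}, about {math.log2(1 / p):.0f} heads in a row")
# 5 sigma -> ~1 in 3.5 million (~22 heads); 6 sigma -> ~1 in a billion (~30 heads).
```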

JDK

You didn't cite any papers in this thread. Maybe you were replying to someone else?

But there are at least 3 generations of poorly trained physicians (statistically illiterate; see, e.g., Odette Wegwarth 2013).

Yes, 2 sigma is NOT much evidence, but we are stuck in a paradigm where medicine generally thinks it is enough. And it verges on scientism!

OK, 5 sigma for physics; I misstated. But I am not advocating for that in medicine; 3 sigma would be sufficient.

But I would note that Shewhart, and then Deming, were pretty explicit that their approach was not probabilistic. It was empirically based on the best place to minimize (but not eliminate) Type I and Type II errors.

I will keep an open mind about "degree of surprise" (Shackle, I presume) and contemplate further. Is starting with Derbyshire (1977) a good place?

Sander Greenland

I don't know Derbyshire. Some history of what I was describing is cited in a few of the articles below. Shannon (1948) is the earliest cite I have that settled on and proved theorems about measuring the information in seeing an event of probability p as s = -log_b(p), where the base b of the logs determines the scale of information units. For use with P-values, I cited the following in my first post on this page this morning, and I ask you to read that post as well as:

Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf, https://www.jstor.org/stable/44851769

Rafi, Z., Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9

Greenland, S., Mansournia, M., and Joffe, M. (2022). To curb research misreporting, replace significance and confidence by compatibility. Preventive Medicine, 164, https://www.sciencedirect.com/science/article/pii/S0091743522001761

Those are very basic, however, and you might prefer these treatments (next post, as this is getting cut off):

JDK

I thought surprise theory was G. L. S. Shackle, beginning in the 1930s.

https://www.sciencedirect.com/science/article/pii/S0040162516300671

Sander Greenland

More sophisticated treatments:

Greenland, S. (2019). Some misleading criticisms of P-values and their resolution with S-values. The American Statistician, 73, supplement 1, 106-114, open access at www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625

Greenland, S., Rafi, Z. (2020). Technical issues in the interpretation of S-values and their relation to other information measures. Supplement to Rafi & Greenland at https://arxiv.org/abs/2008.

Greenland, S. (2023). Divergence vs. decision P-values: A distinction worth making in theory and keeping in practice. Scandinavian Journal of Statistics, 50(1), 1-35, free preprint at https://arxiv.org/ftp/arxiv/papers/2301/2301.02478.pdf. Discussion and Rejoinder appear in issue 3 of vol. 50.

Amrhein, V., and Greenland, S. (2022). Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. Journal of Information Technology, 37(3), 316-320, https://journals.sagepub.com/doi/full/10.1177/02683962221105904

Here is some motivating background for these approaches:

Greenland, S. (2017). The need for cognitive science in methodology. American Journal of Epidemiology, 186, 639-645, https://academic.oup.com/aje/article/186/6/639/3886035.

Sander Greenland

Recht gives a lot of sensible comments, but sadly his coin-toss interpretation of a P-value a few minutes in is quite incorrect. First, he leaves out the important "or more extreme" modifier: the P-value refers to the probability of what was seen or something more extreme. Second, the version of coin tossing he gives only works when the number of outcome events is a very small proportion of each treatment arm; otherwise it can be far too large, and it doesn't work for continuous or ordinal outcomes.

Here are some correct interpretations that work for any P-value and also for interval estimates:

A P-value can be seen as an index of compatibility between the data and a hypothesis or model, ranging from 0=completely incompatible to 1=perfectly compatible. A 95% compatibility ("confidence") interval is just the collection of all effect sizes whose P-values are greater than 0.05.

How can we correctly interpret P-values between 0 and 1 in terms of coin tosses?

An interpretation that applies to any situation is to compute the Shannon surprisal or S-value of the P-value. The S-value s is the number of fair-coin tosses one would have to make in order for the P-value p to equal the probability of getting all heads in the s tosses. That number s is just the base-2 log of 1/p; it is one measure of the information the P-value conveys against the hypothesis or model it tests. Then we see that p=0.05 corresponds to s=4.3, and so, under the tested hypothesis or model, the data are just a little more improbable than getting 4 heads in a row from 4 fair coin tosses; also, every effect size inside the 95% CI has p>0.05, and thus is highly compatible with the data in the sense of corresponding to 4 or fewer heads in a row from fair coin tossing.
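
A minimal sketch of that translation for a few familiar P-values (the only formula involved is s = log2(1/p)):

```python
import math

def s_value(p):
    """Shannon surprisal of a P-value in bits: the length of a run of
    all-heads from a fair coin that would be about as improbable."""
    return math.log2(1 / p)

for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<5} -> s ≈ {s_value(p):.1f} heads in a row")
# p = 0.05 -> s ≈ 4.3; p = 0.005 -> s ≈ 7.6; p = 0.001 -> s ≈ 10.0
```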

The philosophy here is to make things as simple and general as possible, but not to the point that you make them wrong. If compatibility and surprisal are too complicated for a user, there's not much hope they will interpret statistics correctly, because that's about as simple as you can make general and correct interpretations of P-values and CIs. The traditional repeated-sampling ("significance" and "confidence") interpretations are even harder to get right; when you try to make them simpler (as Recht and typical statistics primers do) you get them wrong or worse, as seen in the massive errors in study reports (such as claiming the study supported the null just because the null was inside the 95% CI).

For more discussion of P-value and CI misinterpretations and correct interpretations see these open-access articles:

Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf, https://www.jstor.org/stable/44851769

Rafi, Z., and Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9

Greenland, S., Mansournia, M., and Joffe, M. (2022). To curb research misreporting, replace significance and confidence by compatibility. Preventive Medicine, 164, https://www.sciencedirect.com/science/article/pii/S0091743522001761

JDK

It seems like your guest suggested that 40 heads in a hundred flips leads one to reject the hypothesis that the coin is fair? Really?

If you had said 34 heads should lead one to investigate whether the coin is fair, I'd be with you. We are hiding a lot of ignorance between 2 and 3 sigmas.
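
For reference, the two examples work out roughly as follows (a minimal sketch using scipy's exact two-sided binomial test):

```python
from scipy import stats

for heads in (40, 34):
    p = stats.binomtest(heads, n=100, p=0.5, alternative='two-sided').pvalue
    print(f"{heads} heads in 100 flips: two-sided p ≈ {p:.4f}")
# 40/100 -> p ≈ 0.057 (about 2 sigma): suggestive, not decisive.
# 34/100 -> p ≈ 0.002 (about 3 sigma): much stronger evidence the coin isn't fair.
```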

It is very hard to unsee Deming performing the red bead experiment.

[Of course, von Neumann's trick for getting a fair result from a coin that might not be fair provides a different practical solution.]

Adam Wilson

This seems like a really long-winded way of explaining the p-value.

My understanding is that it is a measure of the confidence that the result is not down to pure chance. A p value of 0.05 suggests that there is 5% likelihood that the result was down to chance, rather than the intervention. 0.05 has been set as the threshold of acceptable risk that the result is trustworthy. It does not mean the result is 100% accurate.

JDK

Sigh. Your proposed definition is not what a p-value actually is.

Adam Wilson

" The P stands for probability and measures how likely it is that any observed difference between groups is due to chance. Being a probability, P can take any value between 0 and 1. Values close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests no difference between the groups other than due to chance."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111019/

JDK

The p-value is the probability that IF you repeated the experiment ...

Adam Wilson

ok, please share your understanding.

Sander Greenland

Please see what I posted above about the mistakes in Recht's definition, especially https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf

Biomeg

Can you please fix the typo in the title? It will help the credibility of this article. Thanks!

John Mandrola

Done. Sorry.

J Lee MD PhD

I don’t want to pay 20 bucks to NEJM Evidence just to read the article. How can I get around that?

Sobshrink

Your intervals are confidential? How will we know what they are?! 😂 Sorry. Couldn't help myself! Yes, I know what you meant but I think you've been studying HIPAA too much! :)
