46 Comments

Please do more stuff like this. Basics are so important.

On the topic starting around minute 40, about statistical validity and subgroup outcomes:

I agree that you can almost always find correlations if you look deep enough. But would these be significant? Most of them, no. The subgroups for which the correlations are significant show that the main trial hypothesis was (partially) false, because an effect of the intervention can only be seen in these subgroups.

But in these subgroups the intervention does show a significant difference from the control group, and it must therefore be taken as a positive result, even if the subgroup is something implausible like astrological signs.

This doesn't show that subgroup analysis is "bad". It shows that it is as "bad" as statistical analysis of medical trials is: the randomization is always imperfect; individuals are not all equal; you can't understand how the world works with experiments only.

If you think about it, epistemologically it doesn't make much sense to claim that an intervention works in x% of cases. Either it achieves a certain measurable result in an individual, or it doesn't. RCTs help find promising interventions for groups of people, but they are just the first step in finding what effect the intervention actually has on each individual.

Actually doing this work of finding larger differences in subgroups should be an integral part of large trials, both to guide further research and to inform medical practice, especially if the data are not open and other researchers cannot run these analyses themselves.

But of course the ideal is that all the available (anonymized) data be public, because the publishing research team can't analyze all the possible subgroups and must choose.
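To make the zodiac-sign point concrete, here is a minimal simulation sketch (my own illustration with made-up parameters, not from the podcast): when a truly null intervention is tested separately across a dozen arbitrary subgroups, roughly half of such trials will show at least one "significant" subgroup by chance alone.

```python
# A minimal simulation sketch (illustrative parameters, not from the podcast):
# a null intervention, split into 12 arbitrary subgroups (think zodiac signs),
# tested once per subgroup. Count how often at least one subgroup looks
# "significant" at p < 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_arm, n_subgroups = 1000, 1200, 12
trials_with_false_positive = 0

for _ in range(n_trials):
    treated = rng.normal(size=n_per_arm)    # no true effect anywhere
    control = rng.normal(size=n_per_arm)
    sign_t = rng.integers(0, n_subgroups, size=n_per_arm)   # subgroup labels
    sign_c = rng.integers(0, n_subgroups, size=n_per_arm)
    p_values = [stats.ttest_ind(treated[sign_t == g], control[sign_c == g]).pvalue
                for g in range(n_subgroups)]
    if min(p_values) < 0.05:
        trials_with_false_positive += 1

print(f"{trials_with_false_positive / n_trials:.0%} of null trials show a "
      f"'significant' subgroup")   # roughly 1 - 0.95**12, i.e. about 46%
```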

Kahneman & Tversky, & Ioannidis ... That being said, we should also heed the 19th century's warning: "Lies, damned lies, and statistics!"

The only thing I learned at MIT was that "hard" science requires quantification. But how do you apply that concept to bioindividual effects? For example, folks on low-carb diets with lots of lean body mass have higher cholesterol levels than the 90% of metabolically impaired Americans who make up the "norm" with lower cholesterol levels.

The statistics can be totally valid in describing measured cholesterol, but there's an inherent bias that makes a lie of all the "data."

Thanks for providing the transcript. I find it quicker to read it than to listen to it, even if sped up to 1.25x or 1.5x.

Especially since they probably didn’t power the trial to detect the safety signal.

The most egregious misuse of P-values is when a clinical trial finds adverse events but dismisses them because they're not statistically significant. Vioxx, for example, caused 4 times as many heart attacks, but the difference wasn't statistically significant, so it was deemed a good drug. No. I don't want to take a pill that causes heart attacks just because the P-value was greater than 0.05.
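To illustrate the arithmetic with made-up counts (not the actual Vioxx trial data): when events are rare, a 4-fold excess can easily fail to reach p < 0.05.

```python
# Illustrative counts only (not the actual Vioxx trial data): a 4-fold excess
# of a rare adverse event in equal-sized arms, tested with Fisher's exact test.
from scipy import stats

n_per_arm = 1000
events_drug, events_placebo = 8, 2      # hypothetical heart-attack counts

table = [[events_drug, n_per_arm - events_drug],
         [events_placebo, n_per_arm - events_placebo]]
_, p_value = stats.fisher_exact(table)

print(f"risk ratio ≈ {events_drug / events_placebo:.0f}x, p = {p_value:.2f}")
# -> risk ratio ≈ 4x, p ≈ 0.11: "not statistically significant"
```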

I also loved this conversation. (Full disclosure: I am also a UC Berkeley alum.) I think one statement by Professor Recht sums up the entire discussion:

"When did we decide we are too dumb to just look at numbers in a table?"

I like it because it validates the advice I gave in an appendix to a book I wrote about a dozen years ago. The appendix was titled "How to Read a Medical Journal Report". My advice was to disregard p values, regression equations, confidence intervals, and other statistical figures; just look at the percentages for the relevant outcomes and decide for yourself whether the differences between the treatment and control groups are of any practical significance.

Jan 18·edited Jan 19

If this took 40 minutes to explain, you did not accomplish the task. P-values are so difficult to understand. In fact, there is a movement to use other measures of statistical validity. So often, trials today use so much statistical chicanery to hide or confuse the real results.

Relative vs Absolute values is the real issue.
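A tiny worked example with hypothetical rates shows why the distinction matters: a headline 50% relative risk reduction can correspond to a 1-percentage-point absolute difference.

```python
# Hypothetical event rates, chosen only to illustrate the distinction.
control_risk = 0.02     # 2% of controls have the event (assumed)
treated_risk = 0.01     # 1% of treated have the event (assumed)

relative_risk_reduction = 1 - treated_risk / control_risk   # headline number
absolute_risk_reduction = control_risk - treated_risk       # what patients feel
number_needed_to_treat = 1 / absolute_risk_reduction

print(f"RRR = {relative_risk_reduction:.0%}")    # 50%
print(f"ARR = {absolute_risk_reduction:.1%}")    # 1.0%
print(f"NNT = {number_needed_to_treat:.0f}")     # 100 treated to prevent one event
```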

RCTs suffer from dose-response variables, and RCTs are not applicable to surgical procedures.

Jan 18·edited Jan 18

Enjoyed the conversation. Especially the example citations of DANISH trial and the zodiac sign-aspirin thing.

I think the most puzzling and frustrating element of a clinical research protocol is sample size estimation. Everyone wants it, but very few understand it, and those who do understand it approach it with a lot of skepticism. It is also frustrating to see papers get published with very small sample sizes, yet when it comes to our own submissions, the editor or the peer reviewers reject them, saying we don't have a large "enough" sample size. Science is unjust and unfair just like politics in many cases. "All animals are equal, but some animals are more equal than others" works in science too.

By the way, I don't think I quite agree with Ben on the explanation of the 95% CI. Perhaps I didn't fully get it, but I am sure there are much more intuitive ways to explain it.

Also, with regard to power, sample size, and statistical significance testing, I really like the Neyman-Pearson approach for its intuitiveness.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347431/
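For what it's worth, here is a rough sketch of Neyman-Pearson style sample-size estimation for comparing two proportions, using the standard normal-approximation formula: fix alpha and power up front, then solve for n per arm. The 20% vs 15% event rates are assumptions for illustration only.

```python
# A rough sketch of sample-size estimation for a two-sided two-proportion
# z-test: choose alpha and power, then solve for n per arm.
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion z-test."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p1 - p2) ** 2

print(round(n_per_arm(0.20, 0.15)))   # about 905 patients per arm
```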

The "line of surprise" was empirically shown by Walter Shewhart to be greater than 3 sigmas. And this is true even if the distribution is not Gaussian. See Wheeler.

And theoretically, Chebyshev's inequality is valid for any distribution with finite variance, i.e., the interval within two standard deviations of the mean contains at least 75% of the values, and within three standard deviations at least 88.9% of the values.
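For comparison, a quick sketch contrasting the Chebyshev guarantee with the Gaussian coverage at 2 and 3 sigma:

```python
# Minimum coverage guaranteed by Chebyshev vs. Gaussian coverage, at 2 and 3 sigma.
from scipy.stats import norm

for k in (2, 3):
    chebyshev = 1 - 1 / k**2                # any finite-variance distribution
    gaussian = norm.cdf(k) - norm.cdf(-k)   # only if the data are normal
    print(f"within {k} sigma: Chebyshev >= {chebyshev:.1%}, Gaussian = {gaussian:.1%}")
# within 2 sigma: Chebyshev >= 75.0%, Gaussian = 95.4%
# within 3 sigma: Chebyshev >= 88.9%, Gaussian = 99.7%
```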

https://www.nature.com/articles/s41562-017-0189-z

If we went back and re-analyzed all research over the last 60 years (observational and RCT) using a 3-sigma line of surprise, we'd be better off.

Recht gives a lot of sensible comments, but sadly his coin-toss interpretation of a P-value a few minutes in is quite incorrect. First, he leaves out the important "or more extreme" modifier: The P-value refers to a probability of what was seen or more extreme. Second, the version of coin tossing he gives only works when the number of outcome events is a very small proportion of each treatment arm; otherwise it can be far too large, and it doesn't work for continuous or ordinal outcomes.

Here are some correct interpretations that work for any P-value and also for interval estimates:

A P-value can be seen as an index of compatibility between the data and a hypothesis or model, ranging from 0=completely incompatible to 1=perfectly compatible. A 95% compatibility ("confidence") interval is just the collection of all effect sizes whose P-values are greater than 0.05.
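As a sketch of that compatibility reading (toy data and a one-sample t-test, not taken from any of the papers cited below): scanning candidate effect sizes and keeping those with p > 0.05 recovers the usual 95% interval.

```python
# Toy data, one-sample t-test: the set of candidate means with p > 0.05
# is (to grid resolution) the textbook 95% interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=50)    # assumed toy outcomes

candidates = np.linspace(-1.0, 1.0, 2001)
compatible = [mu for mu in candidates
              if stats.ttest_1samp(data, popmean=mu).pvalue > 0.05]
print(f"compatibility set: [{min(compatible):.3f}, {max(compatible):.3f}]")

lo, hi = stats.t.interval(0.95, df=len(data) - 1,
                          loc=data.mean(), scale=stats.sem(data))
print(f"textbook 95% CI:   [{lo:.3f}, {hi:.3f}]")   # essentially identical
```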

How can we correctly interpret P-values between 0 and 1 in terms of coin tosses?

An interpretation that applies to any situation is to compute the Shannon surprisal or S-value of the P-value. The S-value s is the number of fair-coin tosses one would have to make in order for the P-value p to equal the probability of getting all heads in the s tosses. That number s is just the base-2 log of 1/p; it is one measure of the information the P-value conveys against the hypothesis or model it tests. Then we see that p=0.05 corresponds to s=4.3, so what was observed is, under the tested hypothesis or model, just a little more improbable than getting 4 heads in a row from 4 fair coin tosses; also, every effect size inside the 95% CI has p>0.05, and thus is highly compatible with the data in the sense of corresponding to 4 or fewer heads in a row from fair coin tossing.
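A few lines of code make the conversion explicit:

```python
# s = log2(1/p): the number of consecutive heads from fair coin tosses
# that would be about as improbable as the observed P-value.
import math

def s_value(p):
    """Bits of information against the tested hypothesis or model."""
    return math.log2(1 / p)

for p in (0.05, 0.01, 0.005):
    print(f"p = {p}: s = {s_value(p):.1f} heads in a row")
# p = 0.05 -> s ≈ 4.3; p = 0.01 -> s ≈ 6.6; p = 0.005 -> s ≈ 7.6
```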

The philosophy here is to make things as simple and general as possible, but not to the point that you make them wrong. If compatibility and surprisal are too complicated for a user, there's not much hope they will be interpreting statistics correctly, because that's about as simple as you can make general and correct interpretations of P-values and CIs. The traditional repeated-sampling ("significance" and "confidence") interpretations are even harder to get right; when you try to make them simpler (as Recht and typical statistics primers do) you get them wrong or worse, as seen in the massive errors in study reports (such as claiming the study supported the null just because the null was inside the 95% CI).

For more discussion of P-value and CI misinterpretations and correct interpretations see these open-access articles:

Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at

https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf, https://www.jstor.org/stable/44851769

Rafi, Z., and Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9

Greenland, S., Mansournia, M., and Joffe, M. (2022). To curb research misreporting, replace significance and confidence by compatibility. Preventive Medicine, 164, https://www.sciencedirect.com/science/article/pii/S0091743522001761

It seems like your guest suggested that 40 heads in a hundred flips leads one to reject the hypothesis that the coin is fair? Really?

If you had said 34 heads should lead one to investigate whether the coin is fair, I'd be with you. We are hiding a lot of ignorance between 2 and 3 sigmas.
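For the record, the arithmetic (my own check, not the guest's): 40 heads out of 100 is only about 2 sigma from fair, while 34 heads is past 3 sigma.

```python
# Two-sided exact binomial p-values for 40 and 34 heads out of 100 fair flips.
from scipy.stats import binomtest

for heads in (40, 34):
    p = binomtest(heads, n=100, p=0.5).pvalue
    sigmas = abs(heads - 50) / 5     # sd of the head count is sqrt(100 * 0.25) = 5
    print(f"{heads} heads: {sigmas:.1f} sigma, two-sided p = {p:.4f}")
# 40 heads: 2.0 sigma, p ≈ 0.057 (barely past the usual cutoff)
# 34 heads: 3.2 sigma, p ≈ 0.002
```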

It is very hard to unsee Deming performing the red bead experiment.

[Of course, von Neumann's practical solution for flipping a coin that might not be fair provides a different practical way to be fair.]

This seems like a really long winded way of explaining the p value.

My understanding is that it is a measure of the confidence that the result is not down to pure chance. A p value of 0.05 suggests there is a 5% likelihood that the result was down to chance rather than the intervention. 0.05 has been set as the threshold of acceptable risk that the result is untrustworthy. It does not mean the result is 100% accurate.

Can you please fix the typo in the title? It will help the credibility of this important article. Thanks!

I don’t want to pay 20 bucks to NEJM Evidence just to read the article. How can I get around that?

Your intervals are confidential? How will we know what they are?! 😂 Sorry. Couldn't help myself! Yes, I know what you meant but I think you've been studying HIPAA too much! :)
