Thank you for taking the time to write this. Dana
Thank you so much for this lucid summary!
Good luck getting authors to admit their trials are inconclusive when they can get away with reporting a definite (if statistically incorrect) result.
A larger sample size can be the answer, but not always.
I 100% agree that increasing the sample size increases confidence (by narrowing the confidence interval), which can change an inconclusive finding into a conclusive one, even under the old-school standard of p<0.05. But this can be true to a fault: it's the problem of an over-powered study. With a huge sample size we can be very confident and will most likely find statistical significance, but we still need to interpret the importance and utility of that result. A statistically significant result can have very limited practical implications when the effect size is still near zero. That's a different way to game the system: if you have enough money to buy enough trials, you can probably publish positive results even when those results really aren't going to matter. Granted, given the expense of medical trials, that's a less likely scenario there. I'm a social scientist working in education, where researchers often use very large secondary data sets. Either way, as everyone else is pointing out, the way conclusions are written in journal articles needs to be crystal clear, and that should include not just statistical significance but practical significance.
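As a toy illustration of the over-powered problem (a simulation of my own, not drawn from any real study), here is a trivially small true difference that becomes "statistically significant" once the sample is large enough, while the effect size stays negligible:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical two-group comparison with a tiny true difference of 0.02 standard deviations.
true_effect = 0.02
for n in (500, 1_000_000):  # modest vs. enormous sample size per group
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # Cohen's d
    print(f"n={n:>9,}  p={p:.3g}  Cohen's d={d:.3f}")

# With n = 500 per group the difference is usually undetectable; with n = 1,000,000 per
# group the p-value is vanishingly small, yet the effect size is still about 0.02,
# which is practically negligible.
```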
My 2 cents, thanks.
One of the reasons I'm frequently cautious about the results of trials, even well-conceived and executed ones, relates to the coin-tossing analogy mentioned by Dr. Mandrola.
In fact, getting seven heads out of ten isn't rare. It happens quite often: at least seven heads comes up roughly one time in six.
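The arithmetic bears that out. A quick check (my own sketch, nothing from the original post) of the chance of at least seven heads in ten tosses of a fair coin:

```python
from math import comb

# P(at least 7 heads in 10 tosses of a fair coin)
p = sum(comb(10, k) for k in range(7, 11)) / 2 ** 10
print(f"P(>= 7 heads in 10 tosses) = {p:.3f}")  # about 0.17, roughly one time in six
```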
Even a very large number of tosses of the same coin, an apparently unbiased one, will hold hidden surprises.
For example, in 2009 two students at the University of California, Berkeley carried out a fascinating heads/tails experiment: each flipped a coin 20,000 times (roughly an hour a day for three months), for 40,000 tosses in all.
The full result was almost even: 20,217 heads and 19,783 tails.
But what's interesting to me is that individual 'batches' of tosses produced astonishing-looking results. Stretches with 8 or 9 heads out of ten were common, and some batches contained runs of 8, 9, 10, or even 11 identical results in a row.
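To see how normal those streaks are, here is a quick simulation of my own (a sketch, not anything from the Berkeley write-up): toss a fair coin 20,000 times, record the longest run of identical faces, and repeat.

```python
import numpy as np

rng = np.random.default_rng(2009)

def longest_run(tosses: np.ndarray) -> int:
    """Length of the longest run of identical faces in a 0/1 sequence."""
    best = run = 1
    for prev, cur in zip(tosses[:-1], tosses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# Simulate 100 "students", each tossing a fair coin 20,000 times.
runs = [longest_run(rng.integers(0, 2, 20_000)) for _ in range(100)]
print(f"longest runs ranged from {min(runs)} to {max(runs)}, median {int(np.median(runs))}")
# The longest run in 20,000 fair tosses is typically around 14 or 15, so streaks
# of 8 to 11 identical results in a row are not just possible but expected.
```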
But, as seen in the final result, things evened out in the end.
For me, the lesson here is that any trial, even a very careful one, can by sheer happenstance give positive or negative results. So confidence intervals really don't impress me much. What I'd prefer is several well-powered trials in several different locales. And if the results are good enough to bring about a change in medical protocol, I'd like the new protocol to be validated by extensive real-world observation, and by comparing outcomes with those of patients who didn't receive, or refused, the protocol.
Unfortunately, as both Dr. Mandrola and Dr. Prasad have frequently written, protocol revisions often take years to emerge. And often, after many years of real-life observation and additional research, a 'gold standard' protocol is revealed to be useless or even harmful.
So it goes.
PS - If you're interested in the Berkeley heads/tails experiment, the details are here, together with a downloadable spreadsheet of the complete coin-toss data set.
https://www.stat.berkeley.edu/~aldous/Real-World/coin_tosses.html
NNT = 1,000
Thank you for writing this post in a way that any layperson can understand. Much appreciated!
Concise and informative essay, as usual.
Now, please consider an evaluation of the ISCHEMIA trial, which has resulted in a pivot away from revascularization, to the detriment of long-term outcomes. Years one through four show an advantage for underwriting loss ratios, with providers ignoring an increase in suboptimal outcomes when the patient survives beyond the fourth year.
At least, that's how I read it. I'd very much like to be proven wrong.
The foundational assumption that medical research is all about "saving lives" seems to me to be what's driving our critical-thinking bus down a blind alley. Earnest criticism of the flaws in studies, and of the conclusions drawn from them, will only get us so far. Given that we long ago abandoned whole-person health care for the "modern/better" find-a-symptom, find-a-pharmaceutical business model, I'd say we need to refocus on a new foundational assumption: medicine and medical research are all about generating ROI.

From Rockefeller looking for new uses for his petrochemicals to tech bros looking for places to park the Fed's decade-plus of free money, few people in power care to look beyond the misleading titles and conclusions of marquee-level research efforts if doing so might derail the generation of profits in quarterly cycles. In fact, they like building AI models with creative statistical assumptions just fine, thank you. "Move fast and break things" suggests innovative omelets are on the investment menu, and yet far too often we're simply being sold a "new & improved" microwaved McMuffin, simply because the margins are better. (Just don't ask where they're sourcing the eggs, or they're likely to tell you a fairy tale about a new way to free-range caged chickens without the expense of actual outside access. But I digress.)

It's all about building market share and repeat customers for consumable products. Ask Bed Bath & Beyond what it's like for any company that doesn't have guaranteed contracts from national governments. I'm sure their C-suite would rather be in pharmaceuticals.
With due respect to all of the Big Cigars in Boston who are worthy of that designation, the NEJM has been repeatedly polluted with articles by authors seemingly mesmerized by what NHST (null hypothesis significance testing) can lead to when improperly deployed. Many egregious examples can easily be dredged up. There has been a huge misunderstanding, for many years now, of what Sir Ronald Fisher hatched so long ago. I can only highly recommend Deborah Mayo's *brilliant* 2018 text, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*. Bring a sharpened number-two pencil and prepare to read slowly and seriously as you jot notes. Forgo the after-dinner brandy. Put down the bong.
The NEJM forcing authors to conclude that a treatment doesn't work when p > 0.05 is an out-and-out statistical error. BUT the study also used the lowest-power approach available for analyzing multiple endpoints (time to first endpoint). Had the analysis used the timing and severity of all component events, and respected the word "recurrent" in recurrent hospitalization, the result might have been different. An ordinal longitudinal analysis would have respected all of the raw data.
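Here is a minimal sketch of the ordinal idea (my own toy example, not the full longitudinal, recurrent-events model described above, and the data, scale, and effect are entirely made up): code each patient's worst outcome over follow-up on an ordered scale and fit a proportional-odds model, so that deaths, recurrent admissions, and single admissions stay distinct instead of all collapsing into "first event".

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 600

# Hypothetical trial, worst outcome per patient:
# 0 = no event, 1 = one hospitalization, 2 = recurrent hospitalization, 3 = death
treat = rng.integers(0, 2, n)
latent = rng.logistic(size=n) - 0.4 * treat      # made-up treatment benefit
outcome = np.digitize(latent, [0.8, 1.8, 2.8])   # arbitrary cut points

df = pd.DataFrame({
    "treat": treat,
    "outcome": pd.Categorical(outcome, categories=[0, 1, 2, 3], ordered=True),
})

# Proportional-odds (ordinal logistic) model: uses the ordering of the outcomes
# rather than collapsing everything to time-to-first-event.
model = OrderedModel(df["outcome"], df[["treat"]], distr="logit")
res = model.fit(method="bfgs", disp=False)
print(res.summary())
```

A real reanalysis would use the longitudinal version (repeated ordinal assessments over follow-up intervals with within-patient correlation), but even this cross-sectional toy shows how an ordinal outcome respects more of the raw data than a single time-to-first-event comparison.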
So "old school" means drawing statistically inaccurate and misleading conclusions.
We see that a lot.
For example, a statistical analysis of all-cause deaths in the Pfizer covid vaccine trial gives a p-value of about 0.5, which is often wrongly reported as meaning there was no evidence that the vaccine was responsible for more deaths. The correct interpretation is that the data cannot rule out the possibility that the vaccine was responsible for more deaths; further investigation is needed.
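For what it's worth, here is how a p-value in that neighborhood arises. This is a sketch with hypothetical counts (15 versus 14 deaths in similarly sized arms, used purely for illustration, not as an authoritative tally): conditional on the total number of deaths, under the null hypothesis each death is equally likely to fall in either arm.

```python
from math import comb

d_vax, d_placebo = 15, 14   # hypothetical death counts, for illustration only
total = d_vax + d_placebo

def binom_upper_tail(n: int, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

p_one_sided = binom_upper_tail(total, d_vax)
print(f"one-sided p = {p_one_sided:.3f}")
# With counts this close, the p-value is about 0.5: the data say almost nothing
# either way, which is very different from evidence of "no difference".
```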
Since Pfizer made the vaccine available to the placebo arm, it eliminated blinding and any possibility of future comparison. One wonders why.
Are the journal editors really so inept?
And I don't trust people like you advertising on other people's blogs. Nor do I trust McCullough or anyone pushing drugs without clinical trials.
But you are ok with the new booster, without clinical trials?
No