The Uncertainty of Clinical Trial Results
For the Study of the Week, I will show examples of results that I hope make you more cautious about drawing conclusions.
Clinical trials are supposed to give doctors answers to major questions. You test treatment A against treatment B and count outcomes. You then learn which treatment delivers better results.
Grin. It rarely works that cleanly. Seldom is treatment A so dominant over B that everyone agrees. Such scenarios are uncommon because trials are mostly applied to medical decisions on which doctors cannot agree, a state called equipoise. (The debate about when equipoise exists is complicated and a matter for another post.)
Frank Harrell is an eminent professor of statistics at Vanderbilt. Thankfully he stays active on Twitter, so many of us can learn from him.
Harrell recently criticized Vinay Prasad for publishing a paper that defended the authors of the Cochrane review of physical interventions to interrupt or reduce the spread of respiratory viruses.
Let me do a brief summary and then make my point about drawing conclusions from randomized trials.
The Cochrane review of studies of medical/surgical masks for preventing community spread of respiratory viruses found a relative risk of 1.01. We call that the point estimate; a relative risk of 1.0 indicates no effect.
Point estimates come with 95% confidence intervals, which convey the precision of the experiment. In this case the 95% confidence interval ranged from 0.72 to 1.42. The data, in other words, are consistent with anything from a 28% reduction in viral spread to a 42% increase.
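To make the arithmetic concrete, here is a small Python sketch that converts the point estimate and the interval bounds into percent changes and backs out the implied standard error. The normal approximation on the log scale is my assumption for illustration, not something the review itself reports.

```python
import math

# Numbers quoted from the Cochrane review discussed above
rr_point, rr_low, rr_high = 1.01, 0.72, 1.42

# Convert relative risks to percent change in viral spread
def pct_change(rr):
    return (rr - 1.0) * 100

print(f"Point estimate: {pct_change(rr_point):+.0f}%")                        # about +1%
print(f"95% CI: {pct_change(rr_low):+.0f}% to {pct_change(rr_high):+.0f}%")   # about -28% to +42%

# Assumption: the CI was built on the log scale, so the implied
# standard error is (log(upper) - log(lower)) / (2 * 1.96)
se_log_rr = (math.log(rr_high) - math.log(rr_low)) / (2 * 1.96)
print(f"Implied SE of log(RR): {se_log_rr:.2f}")  # roughly 0.17, a wide interval
```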
The authors of the Cochrane review wrote:
“Wearing masks in the community probably makes little or no difference to the outcome of laboratory-confirmed influenza/SARS-CoV-2 compared to not wearing masks.”
Professor Harrell’s point was that such wide confidence intervals indicate too much uncertainty; the data are simply not informative. The Cochrane editor-in-chief agreed, and she felt a more appropriate conclusion would have been to say the results were inconclusive.
Vinay then led a study showing that 20 previous Cochrane reviews that had similar point estimates and confidence intervals concluded no effect. So there seemed to be selective enforcement of uncertainty in experiments.
This brings me to the larger point of how clinicians should interpret the uncertainty in clinical trials.
Let me show you two examples of results and conclusions from the same month in the same journal.
DANCAVAS compared an invitation to have a slew of cardiac screening tests vs no invitation. The outcome was death. (A strong endpoint.) The final results: HR 0.95 with 95% CI 0.90 to 1.00. Translation: (roughly) the screening program reduced the chance of dying by 5%, and, if the experiment was repeated many times and a 95% confidence interval constructed each time, 95% of such confidence intervals would contain the true unknown value. (I put that in italics because the original wording was wrong. Thanks to Professor Harrell for the correction.)
Conclusion in the NEJM: (emphasis mine)
After more than 5 years, the invitation to undergo comprehensive cardiovascular screening did not significantly reduce the incidence of death…
My comments: Really? The vast majority of the CI lies below 1.0; nearly all of the plausible effect sizes correspond to a reduction in death. Death is an important endpoint. Yet, because the CI touched 1.0 (no effect), the editors of the journal required the authors to conclude the intervention was unsuccessful. That seems extreme to me.
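One way to quantify that impression: under a normal approximation on the log hazard-ratio scale and a flat prior (my assumptions, not anything the trial reported), you can ask what fraction of the plausible effect sizes lies below 1.0. A rough sketch:

```python
import math
from scipy.stats import norm

# DANCAVAS numbers quoted above: HR 0.95, 95% CI 0.90 to 1.00
hr, low, high = 0.95, 0.90, 1.00

# Assumption: the CI was built on the log scale, so back out the standard error
se = (math.log(high) - math.log(low)) / (2 * 1.96)

# Approximate probability that the true HR is below 1.0 (screening reduces death),
# treating the estimate as normal on the log scale with a flat prior
p_benefit = norm.cdf((0 - math.log(hr)) / se)
print(f"Approximate probability of benefit: {p_benefit:.0%}")  # roughly 97%
```

By this crude reckoning, roughly 97% of the plausible effect sizes favor the screening program, which sits oddly beside a flat conclusion that it did not work.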
PROTECTED-TAVR tested a device placed in the carotid arteries to capture debris during replacement of a diseased aortic valve. The idea is that if you stop debris from going north to the brain, then you reduce stroke.
Researchers randomized patients to receive the device or not. The outcome was stroke. The authors presented the results as absolute differences: stroke rates of 2.3% vs 2.9%, device vs no device, respectively. The point estimate of the absolute difference was -0.6 percentage points, favoring the device. The 95% confidence interval ranged from a 1.7-percentage-point lower risk of stroke to a 0.5-percentage-point higher risk.
Unlike in DANCAVAS, the upper bound of the confidence interval included the possibility that strokes were more frequent in the device arm.
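The same back-of-the-envelope exercise works on the absolute scale, again assuming the reported interval is symmetric and roughly normal (my assumption, not the trial's analysis):

```python
from scipy.stats import norm

# PROTECTED-TAVR numbers quoted above, in percentage points:
# difference -0.6 (device minus no device), 95% CI -1.7 to +0.5
diff, low, high = -0.6, -1.7, 0.5

# Back out the implied standard error from the interval width
se = (high - low) / (2 * 1.96)

# Approximate probability that the true difference favors the device
p_benefit = norm.cdf((0 - diff) / se)
print(f"Approximate probability that the device reduces stroke: {p_benefit:.0%}")
# roughly 86% -- most of the interval favors benefit, but harm is not excluded
```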
Conclusion in the NEJM: (my emphasis on the last phrase)
Among patients with aortic stenosis undergoing transfemoral TAVR, the use of CEP did not have a significant effect on the incidence of periprocedural stroke, but on the basis of the 95% confidence interval around this outcome, the results may not rule out a benefit of CEP during TAVR.
We discussed this issue on the Sensible Medicine podcast this week.
The first problem here is the uncertainty inherent in trial results. When I was young, I erred in focusing too much on the point estimate and on whether the confidence interval reached statistical significance (an upper bound below 1.0 when looking for a reduction in an outcome).
But it is clear from the Cochrane review, Vinay’s re-analysis of other Cochrane papers, and these two examples from NEJM, that there is a lot more uncertainty in results and flex in making conclusions than I had appreciated.
Regulators have it easy because they can simply say something doesn’t meet a statistical threshold.
But doctors have it harder, for many reasons:
A) Because we have to assess the width of the confidence intervals: too wide and we cannot draw conclusions due to imprecision. But no one tells us how wide is too wide, and whatever that threshold is, it has to be applied consistently. (The sketch after this list shows how much that width depends on trial size alone.)
B) Because we have to look at the confidence intervals and determine the probability that an intervention has a benefit. The screening program in DANCAVAS may not have met its statistical threshold, but it is clearly more likely than not that it reduced death.
C) In the stroke prevention trial, the larger part of the confidence interval trends to benefit, but there is a chance of harm. What are we to think? Why were those investigators allowed to conclude possible benefit whereas DANCAVAS authors were not? I am asking.
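On point A, here is a small sketch of why there can be no fixed answer to "how wide is too wide": with the same true effect (here, none) and the same event rate, the width of a 95% confidence interval for a risk ratio is driven almost entirely by trial size. The event rates and sample sizes below are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical trial: true event rate 5% in both arms (true RR = 1.0)
p_control = p_treated = 0.05

for n_per_arm in (200, 1000, 5000, 20000):
    # Expected event counts under the assumed rates
    e1 = p_treated * n_per_arm
    e2 = p_control * n_per_arm
    # Standard error of log(RR) from the usual large-sample formula
    se = math.sqrt(1/e1 - 1/n_per_arm + 1/e2 - 1/n_per_arm)
    low = math.exp(-1.96 * se)
    high = math.exp(1.96 * se)
    print(f"n = {n_per_arm:>6} per arm: 95% CI for RR roughly {low:.2f} to {high:.2f}")
```

Under these assumptions, a 200-patient-per-arm trial of a truly useless therapy yields a CI of roughly 0.43 to 2.35, while a 20,000-patient-per-arm trial of the same useless therapy yields roughly 0.92 to 1.09. Both are "null," but only one is informative.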
If you are more confused and now have lower confidence in medical evidence, I have succeeded.
It’s why I am cautious and medically conservative. Please feel free to educate us in the comments.
What is an "ideal" metric anyway? (Is "metric" even the right word?) My initial belief is that it would need to account for all of the following in its design:
1) No patient is average and a variety of effect sizes could happen on an individual level.
2) No patient is average with respect to adverse effects either.
3) Individual patients could hence have realized favorable or unfavorable risk-benefit tradeoffs within a trial regardless of the overall results
4) There is uncertainty around effect sizes
5) There is uncertainty around adverse effects
6) Arbitrary cutoffs are arbitrary
7) An assessment of risk-benefit tradeoffs on an individual (not overall) level is what patients and clinicians either want or will make on their own, by definition.
8) Metrics that report overall or average results in essence obscure or compress the information that exists at the patient level.
9) Metric should be amenable to both interventional and observational studies of different types of outcomes.
10) Metric should be amenable to meta-analysis, to enriching interpretation of past and future results, and to having interpretation enriched by past and future results.
11) An “ideal” metric assumes the existence of excellent data transparency and adverse event reporting.
12) Relative and standardized measures are not intuitive to read. (See the sketch after this list for how the same relative risk maps to very different absolute benefits.)
13) Clinically significant is not the same as statistically significant.
14) Even in the presence of complete knowledge, risk-benefit tradeoffs are subjective (e.g. palliative chemo that will give you 2 more months but lots of side effects).
15) Potential adverse effects are of three types: those recorded during the study, the potential for a negative effect size, the potential for long-term effects not captured.
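On point 12, here is a brief illustration of why relative measures alone are hard to read: the same relative risk implies very different absolute risk reductions, and numbers needed to treat, depending on baseline risk. The relative risk and baseline risks below are hypothetical, chosen only for illustration.

```python
# Same relative risk, different baseline risks (hypothetical numbers)
rr = 0.80  # a 20% relative reduction

for baseline_risk in (0.40, 0.05, 0.005):
    treated_risk = baseline_risk * rr
    arr = baseline_risk - treated_risk   # absolute risk reduction
    nnt = 1 / arr                        # number needed to treat
    print(f"baseline {baseline_risk:.1%}: ARR = {arr:.2%}, NNT ~ {nnt:.0f}")
```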
- Various metric ideas going on in this head, but none of them coherent enough to write.
- One thought: since we want to get away from arbitrary thresholds, maybe we can make a metric that dynamically bakes in its own decision about a confidence level and then standardizes itself against something. Or maybe present a continuum of metric values as a function of readers' personal, subjective risk-benefit values. (On a side note, the word risk implies the possibility of harm, whereas the word benefit implies certainty of benefit. I always found this inconsistent. It perhaps ought to be a "benefit-harm" tradeoff.)
- Another thought: in cosmetic trials, dermatologists subjectively rate things, and researchers look at inter- and intra-rater reliability. At the end of the day, the subjective interpretations of humans are what determine the value of a study. Should we be routinely polling people?
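For what it's worth, the inter-rater agreement mentioned above is often summarized with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch with invented ratings:

```python
from sklearn.metrics import cohen_kappa_score

# Invented ratings: two dermatologists scoring the same 10 photos on a 0-3 scale
rater_a = [0, 1, 2, 3, 2, 1, 0, 2, 3, 1]
rater_b = [0, 1, 2, 2, 2, 1, 1, 2, 3, 0]

# Cohen's kappa: agreement corrected for chance (1 = perfect, 0 = chance-level)
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```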
Thanks for lasting through this rant.
Why aren't they doing better power analysis? It seems like a lot of these problems could be avoided with better pre-study design.
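For a sense of scale, here is a back-of-the-envelope sample-size calculation for detecting the stroke difference reported above in PROTECTED-TAVR (2.3% vs 2.9%), using the standard two-proportion formula. The alpha and power values are conventional choices, not anything taken from the trial's actual design.

```python
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion comparison.
    Ignores continuity corrections; good enough for a rough sense of scale."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Stroke rates quoted above for PROTECTED-TAVR
print(f"~{n_per_arm(0.023, 0.029):,.0f} patients per arm")
# roughly 11,000 per arm to have 80% power for a difference that small
```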