The Frustration of Not Being Able to Sort Signal From Noise
The Study of the Week describes why I often struggle to find certainty in clinical trials.
A treatment to reduce stroke is tested in a clinical trial. In the treatment group, 2.3% of patients had a stroke vs 2.9% in the control arm. The question everyone wants answered is whether this -0.6% difference is signal or noise.
For this, we look to the 95% confidence intervals. In the PROTECTED TAVR trial, the confidence interval went from -1.7% (a lower stroke rate with treatment) to 0.5% (a higher stroke rate).
I will avoid the controversy over defining confidence intervals, but suffice it to say, an interval this wide allows for the treatment to be either better or worse than the control arm.
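To make that concrete, here is a minimal sketch of how a 95% confidence interval for an absolute risk difference gets computed (a simple normal-approximation, or Wald, interval). The patient counts below are round numbers I chose to roughly reproduce the percentages above; they are not the actual PROTECTED TAVR data.

```python
from statistics import NormalDist

def risk_difference_ci(events_t, n_t, events_c, n_c, conf=0.95):
    """Wald (normal-approximation) CI for the difference in event rates."""
    p_t, p_c = events_t / n_t, events_c / n_c
    diff = p_t - p_c
    se = (p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c) ** 0.5
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return diff, diff - z * se, diff + z * se

# Illustrative counts only: roughly 2.3% vs 2.9% with ~1,500 patients per arm
diff, lo, hi = risk_difference_ci(events_t=35, n_t=1500, events_c=44, n_c=1500)
print(f"risk difference {diff:+.1%}, 95% CI {lo:+.1%} to {hi:+.1%}")
# Prints an interval of roughly -1.7% to +0.5% -- benefit and harm both plausible.
```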
NEJM allowed the authors to hedge in the conclusion. (Emphasis mine.)
Among patients with aortic stenosis undergoing TAVR, the use of the *device* did not have a significant effect on the incidence of periprocedural stroke, but on the basis of the 95% confidence interval around this outcome, the results may not rule out a benefit of the *device* during TAVR.
Wide confidence intervals are a problem because trials are supposed to give answers. Doctors use trials to guide recommendations. Trials are the foundation of knowledge in medicine. But. A trial that has this much uncertainty doesn’t help.
It seems that most weeks I am writing to you about such a trial.
And even when I don’t see that much uncertainty in a result, Vanderbilt statistics professor Frank Harrell points out that, indeed, John, there was uncertainty. I chose cerebral protection as an example, but I could have picked many different examples.
Why does this happen? How often does it happen?
A group of investigators, mostly from Germany, has published a sobering analysis of ten years of trials from JAMA, NEJM, and the Lancet, and it shows why we struggle to sort signal from noise in standard cardiology trials.
JAMA Network Open published the review of 344 cardiology trials. The authors compared the pre-trial estimates of event rates and effect sizes with the event rates and effect sizes actually observed.
Pause there. Don’t glaze over that sentence. Estimated event rates and effect sizes are how investigators decide how many patients to enroll in a trial.
If your expected event rate is low or the effect size small, you need lots of patients. (The PCSK9i trials had 18K patients.) If your event rate is high or the effect size large, you need fewer patients. (Cardiogenic shock kills more than half its patients, and the DanGer-Shock trial of the Impella device had 360 patients.)
Here's the problem: if you are too optimistic, either about how many events you will see (event rate) or about how good the therapy is (effect size), then you don't enroll enough patients. That leads to wide confidence intervals.
We then say the trial was underpowered. Again, that sentence doesn’t sound awful. Let me rephrase: if you have wide confidence intervals, which can be consistent with benefit or harm, you have experimented on humans for nothing.
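To see how sensitive enrollment is to those pre-trial guesses, here is a minimal sketch of the standard two-proportion sample-size calculation (normal approximation, 80% power, two-sided alpha of 0.05). The control event rates and relative risks below are hypothetical planning numbers, not figures from any particular trial.

```python
from statistics import NormalDist

def n_per_arm(p_control, rel_risk, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to compare two proportions
    (normal approximation, two-sided alpha)."""
    p_treat = p_control * rel_risk
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_alpha + z_power) ** 2 * variance / (p_control - p_treat) ** 2

# Optimistic planning: 8% control event rate, 28% relative reduction (RR 0.72)
print(round(n_per_arm(0.08, 0.72)))   # roughly 2,000 patients per arm
# Less optimistic: 4% control event rate, 9% relative reduction (RR 0.91)
print(round(n_per_arm(0.04, 0.91)))   # roughly 44,500 patients per arm
```

Halving the assumed event rate and shrinking the assumed effect from 28% to 9% pushes the required enrollment from a couple of thousand per arm to tens of thousands.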
The German-led study found that cardiac investigators are quite poor in their pre-trial estimates.
The observed event rates were substantially lower than expected rates. More than half the trials overestimated event rates.
It was the same with effect size estimates. The median effect size observed in the actual trials was only 9% (RR 0.91). But the average estimated effect size was 28% (RR 0.72).
Device trials (vs drug trials) showed less agreement between estimates and observed values.
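To feel the consequence of that 0.72-vs-0.91 gap, consider a hypothetical trial sized for a 28% relative reduction when the true effect is only 9%. A rough sketch, using the same normal approximation as above and a made-up 8% control event rate:

```python
from statistics import NormalDist

def achieved_power(n_per_arm, p_control, rel_risk_true, alpha=0.05):
    """Approximate power of a two-proportion trial at the true effect size."""
    p_treat = p_control * rel_risk_true
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z = abs(p_control - p_treat) * (n_per_arm / variance) ** 0.5
    return NormalDist().cdf(z - z_alpha)

# Sized for RR 0.72 (about 2,000 per arm at an 8% control rate),
# but the true effect is only RR 0.91:
print(f"{achieved_power(2000, p_control=0.08, rel_risk_true=0.91):.0%}")
# Prints roughly 14% -- nowhere near the 80% power the trial was designed for.
```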
Comments
This is bad. An experiment on humans should have enough power to provide an answer.
Having been involved with an underpowered trial, I understand the issues. The problem is that trials don't do themselves. Trials cost money; they take huge effort. Extra patients mean extra costs and effort.
But it’s not just about money and time. Since a trial is an experiment on humans, you want to enroll the minimum number of patients. Too many patients is also a problem.
Yet the current paper describes a widespread pattern of trials that lack the power to deliver answers. It's an important paper because it exposes a common and serious problem.
Since I am a user of evidence (a doctor), I don’t have the answers. My knee-jerk solution would be to advise more pessimism before the trial. Cardiologists are not so good at pessimism.
Another solution would be to figure out ways to glean more data from the experiment. I suspect Professor Harrell has some ideas.
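One such idea, offered purely as my own illustration and not as anything from the paper or from Professor Harrell: instead of a binary significant/not-significant verdict, report the probability that the treatment is beneficial. Under a normal approximation and a flat prior, that probability can be backed out of the published estimate and confidence interval.

```python
from statistics import NormalDist

# Back out the standard error from the published 95% CI, then compute the
# (flat-prior, normal-approximation) probability that the true effect favors
# treatment. Purely illustrative -- not from the paper or the trial report.
estimate = -0.006            # -0.6% risk difference
ci_low, ci_high = -0.017, 0.005
se = (ci_high - ci_low) / (2 * 1.96)
prob_benefit = NormalDist(mu=estimate, sigma=se).cdf(0.0)
print(f"Probability the true risk difference favors treatment: {prob_benefit:.0%}")
# Prints roughly 86% -- informative, but well short of certainty.
```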
The take-home message for readers of clinical trials is to understand that uncertainty around trial results stems largely from overly optimistic pre-trial estimates of how many events there will be or how effective the treatment will be.
Users of evidence might be part of the solution by publicly calling out the problem of inaccurate pre-trial estimates.
Instead of being sad about a trial not finding a positive result, we could be sad about wide confidence intervals. JMM
My primary question here, and I am a nurse with limited knowledge, is why strong anecdotal evidence cannot be used to make emergency medical decisions. This data compilation on IVM used in care homes in France came out in March of 2020. Residents being treated for scabies with IVM had an astonishingly low rate of Covid infections. If an EUA can be given for an experimental injection, why not for a long-tested and extraordinarily safe drug like IVM? This seems like sensible medicine to me.
https://www.clinmedjournals.org/articles/jide/journal-of-infectious-diseases-and-epidemiology-jide-7-202.php?jid=jide
Here's my take on the problem:
A post on my substack noted that "The Introduction to a good statistics text will tell you that 'what we do in statistics is to put a number on our intuition.' .... The idea is that you start from the science, from the question to be answered and what the outcome will look like. You propose or apply a mathematical model to the results of your experiment. In other words, the medical or scientific question comes first.... A major defect in the medical literature is that often the opposite is what’s going on — many papers are trying to come up with an intuition to fit a number, trying to derive the science from the statistics. ....The implication, in these cases, is that your experiment did not have independent justification and the significance was revealed by the statistics. The corollary is that the type of experiment becomes more important than its quality."
The description of the case here: “A treatment to reduce stroke is tested in a clinical trial. In the treatment group, 2.3% of patients had a stroke vs 2.9% in the control arm. The question everyone wants answered …” should be, first, the researcher's assessment of how meaningful the procedure is relative to the data. Frank Harrell's comment pointing to Bayes may be helpful, but it is the (philosophical) idea contained in Bayes that is key: statistics is taken as the belief in the data. Science is expected to be an intellectual activity. We trust that the researcher has enough training to interpret the experiment. Otherwise, who would have hired him? The most distressing thing about the advent of AI is that we ourselves have become like AI.
“For this, we look to the 95% confidence intervals.…” This is wrong. The key phrase in the post is "A treatment..." emphasis on "A." We might look first to our understanding, that is, our belief (a priori in Bayes terms).