When Studies Don’t Answer Their Question
The Study of the Week delves into the frustrating matter of inconclusive trials
As always, we at Sensible Medicine appreciate the support from our readers. We are surprised and grateful. JMM
Let’s start by leaving out the disease and treatment.
This randomized controlled trial was simple and elegant: one group received an active drug (an inexpensive generic); the other received placebo.
The primary endpoint required no adjudication: alive or dead at 60 days.
The Results
At 60 days, 17.3% of those in the active arm died vs 21.3% in the placebo arm. That is an absolute risk reduction of 4 percentage points. The relative risk reduction in death was roughly 19%.
There were no important safety signals.
This sounds pretty amazing: a generic drug reduced the risk of dying by roughly 19% in relative terms and a whopping 4 percentage points in absolute terms. These are impressive effect sizes.
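For readers who like to check the arithmetic, here is a minimal sketch in Python. It uses nothing but the two death rates quoted above.

```python
# Effect sizes from the reported 60-day death rates; nothing else is used.
p_active = 0.173    # death rate in the antibiotic arm
p_placebo = 0.213   # death rate in the placebo arm

arr = p_placebo - p_active    # absolute risk reduction
rrr = arr / p_placebo         # relative risk reduction (risk-ratio scale)
nnt = 1 / arr                 # number needed to treat to prevent one death

print(f"Absolute risk reduction: {arr:.1%}")    # ~4.0 percentage points
print(f"Relative risk reduction: {rrr:.1%}")    # ~18.8%
print(f"Number needed to treat:  {nnt:.0f}")    # ~25
```

Note that this ~19% figure is the relative reduction on the risk-ratio scale; the hazard ratio you will see below (0.77) implies about a 23% reduction, because it comes from the time-to-event analysis rather than the raw proportions.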
Yet something is missing. I haven’t told you all that you need to know.
No, it’s not what the disease or treatment was. I will get to that.
When you look at a study’s results, you need some measure of how confident you can be that the result is signal rather than noise.
Were the results a chance finding? So far, all that I have told you is the effect size. Now you need to know whether the findings are statistically robust.
Confidence Intervals
That absolute risk reduction in death of 4% had a 95% confidence interval that went from a 14% lower risk to a 4.7% higher risk.
We use a hazard ratio to express the relative reduction. The HR in this trial was 0.77 (about a 23% reduction in the hazard of death), and the 95% confidence interval went from 0.45 (a 55% reduction) to 1.31 (a 31% increase).
The p-value, which quantifies how surprising the observed result would be if there were truly no difference between the two treatment arms, was high at 0.33.
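If you want to see where numbers like these come from, here is a rough sketch in Python. I am assuming about 140 patients per arm (half of the 280 you will read about below); the published interval and p-value come from the trial’s time-to-event analysis, so a simple normal approximation only lands in the same neighborhood.

```python
# Rough reconstruction of the confidence interval and p-value for the risk
# difference using a normal approximation. The per-arm size (~140) is an
# assumption based on the planned enrollment of 280; the published numbers
# come from the trial's time-to-event analysis, so this is only approximate.
from math import sqrt
from scipy.stats import norm

n = 140                                 # assumed patients per arm
p_active, p_placebo = 0.173, 0.213      # observed 60-day death rates

diff = p_active - p_placebo             # negative favors the antibiotic
se = sqrt(p_active * (1 - p_active) / n + p_placebo * (1 - p_placebo) / n)

lo, hi = diff - 1.96 * se, diff + 1.96 * se
p_value = 2 * norm.sf(abs(diff / se))

print(f"Risk difference {diff:+.1%}, 95% CI {lo:+.1%} to {hi:+.1%}")
print(f"Two-sided p-value: {p_value:.2f}")
# Roughly -4.0% (95% CI about -13% to +5%), p ~ 0.4 -- in the same ballpark as
# the published -14% to +4.7% and p = 0.33, and just as inconclusive.
```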
JAMA published the AntibioCor trial. The disease in question was severe alcoholic hepatitis. The treatment in the active arm was the super-common antibiotic amoxicillin-clavulanate.
The idea is that patients who present with severe liver injury and inflammation are usually treated with oral corticosteroids. Both liver injury and steroid therapy increase the susceptibility to bacterial infection—which is very bad.
That is why the authors studied the use of long-term antibiotics in a preventive role. Of course, if there were signs of infection, antibiotics would be used. The question in this trial was about prevention.
It is an important question because a) alcoholic hepatitis is common, b) it is severe (note the high mortality rates) and c) amoxicillin-clavulanate is inexpensive.
The reason I highlight the AntibioCor Trial is that it is a shining example of an inconclusive trial.
The authors and editorialists conclude that the antibiotic did not improve survival, and they do not support preventive antibiotics in this disease.
The current clinical trial by Louvet et al suggests no role for prophylactic prescription of antibiotics when treating all patients who have alcohol-related hepatitis with corticosteroids.
While this may be a technically correct conclusion, I don’t think it is accurate.
The more accurate conclusion is that the wide confidence intervals do not preclude a substantial benefit or harm.
For instance, the lower bound (best case scenario) of the primary outcome allows for a massive 14% reduction in death. DEATH. Yet the upper bound (worst case scenario) allows for a 4.7% higher rate of death.
These data cannot give an answer.
Why did this happen?
The simple answer is that there were too few patients enrolled and too few events. It’s like determining whether a coin is fair with only 10 flips.
Had the investigators enrolled more patients, there would have been more events, and the confidence intervals would have been tighter.
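Here is a sketch of that point: hold the observed death rates fixed and watch how the confidence interval for the risk difference tightens as the (hypothetical) per-arm enrollment grows. The numbers are illustrative, not a reanalysis of the trial.

```python
# Hold the observed death rates fixed and vary the (hypothetical) per-arm
# enrollment to see how the 95% confidence interval for the risk difference
# tightens. Illustrative only; the planned enrollment of 280 works out to
# roughly 140 per arm.
from math import sqrt

p_active, p_placebo = 0.173, 0.213
diff = p_active - p_placebo

for n in (140, 500, 2000, 8000):    # hypothetical patients per arm
    se = sqrt(p_active * (1 - p_active) / n + p_placebo * (1 - p_placebo) / n)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"n per arm = {n:>4}: 95% CI {lo:+.1%} to {hi:+.1%}")

# At ~140 per arm the interval runs from about -13% to +5%: compatible with a
# big benefit or a real harm. At a few thousand per arm, the same observed
# rates would exclude "no effect" entirely.
```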
You might now ask: why were there too few patients? How do trialists decide on the number of patients?
Well, this gets a bit complicated. And it is not an exact science.
There are two main considerations, one ethical and one pragmatic, and each involves opposing forces. Ethically speaking, RCTs are experiments on humans. So, you want to enroll enough patients to answer the question while exposing the minimum number of people to the experiment. Pragmatically, trials require effort and money. So, again, you want to enroll the Goldilocks number of patients.
Most studies include a few sentences on how the authors determined the sample size. It’s usually in the Statistics paragraph. In JAMA, Sample Size Calculation gets its own section. Understanding this calculation usually requires some content expertise.
In this case, the authors estimated that 27% of placebo-treated patients would die. They then powered their study to detect a 14% reduction in death from the addition of amoxicillin-clavulanate. Given these estimates, they arrived at a sample size of 280 patients.
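As a rough check, you can reproduce a target of about 280 if you read the 14% reduction as absolute (27% down to 13% mortality) and assume the conventional two-sided alpha of 0.05 with 80% power. Those assumptions are mine, not spelled out above; a sketch using statsmodels:

```python
# A rough check of the published sample-size target. My assumptions, not
# stated in the post: the 14% reduction is absolute (27% -> 13% mortality),
# two-sided alpha = 0.05, and 80% power. The trial's own calculation may have
# used different inputs.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_placebo_assumed = 0.27    # anticipated placebo-arm death rate
p_active_hoped = 0.13       # hoped-for death rate on amoxicillin-clavulanate

effect = proportion_effectsize(p_active_hoped, p_placebo_assumed)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=abs(effect), alpha=0.05, power=0.80, alternative="two-sided"
)

print(f"Roughly {n_per_arm:.0f} per arm, {2 * n_per_arm:.0f} total")
# About 125 per arm (~250 total); padding for dropouts lands near the 280
# patients the investigators targeted.
```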
The problem was that only 21% of patients in the placebo arm died. The investigators had overestimated the event rate, and so they underestimated the number of patients they needed.
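You can also turn the calculation around and ask how much power the trial had to detect a difference as small as the one actually observed. Again assuming roughly 140 patients per arm:

```python
# How much power did the trial actually have to detect a difference as small
# as the one observed (17.3% vs 21.3%)? The per-arm size (~140) is again an
# assumption based on the planned enrollment of 280.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

observed_effect = proportion_effectsize(0.173, 0.213)  # Cohen's h

power = NormalIndPower().power(
    effect_size=abs(observed_effect), nobs1=140, alpha=0.05, alternative="two-sided"
)

print(f"Power to detect the observed difference: {power:.0%}")
# Only about 14%: with event rates this low and arms this small, even a real
# effect of this size would usually be missed.
```

On those assumptions, the trial had only a fraction of the conventional 80% power for an effect of the size actually seen, which is exactly why the confidence intervals came out so wide.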
I, and likely you, don’t have the content expertise to criticize the choice of sample size. Though it does seem strange to think that an everyday oral antibiotic used in a preventive way would reduce death by that much.
The message of this Study of the Week is that the point estimates from this trial suggest a major reduction in death. But because the sample size was underestimated and the confidence intervals are wide, we don’t really know whether the intervention works.
Results like this have changed my mind about trials.
I used to think we need to do more trials. Randomize, randomize and randomize some more.
But now I have come to learn that trialists need to be super-careful about design. Because perhaps the worst outcome is doing an experiment that cannot answer a question—like this one.
(Let us know what you think in the comments)
It has been my experience that most physicians today only read the title and abstract. Few read the conclusions, and nobody (but us old guys) reads the entire paper, including the references. It’s embarrassing listening to physicians quote papers that do not substantiate their points. Yet I have no problem pointing out their inadequacies and getting a chuckle.
Nice article John. You say that the journal's interpretation is technically correct. I do not believe that to be the case. IMHO it is grossly inaccurate due to a massive misunderstanding of p-values and hypothesis tests. Sir Ronald Fisher himself said that a large p-value means "get more data". Any journal wanting to say that there is evidence that a treatment doesn't work should be forced to base that on a Bayesian probability that the treatment effect is clinically negligible. For this study you'd find such a probability to be near 0.5 so we haven't a clue about the ineffectiveness of the therapy.