Study of the Week – The Matter of Confidence
The POST-PCI trial was reported as “negative.” The better interpretation was that it was inconclusive. These are not the same.
Imagine an experiment to determine if a coin is fair.
If you flipped it 10 times and got 7 heads and 3 tails, you would not be sure whether it was fair or biased.
If you flipped it 100 times and got 70 heads and 30 tails, you’d be worried that it was biased towards heads.
If you flipped it 1000 times and got 700 heads and 300 tails, you would be highly confident that this coin was biased to turn up heads.
Medical experiments are similar. Scientists who compare one intervention to another need enough outcomes to sort signal from noise.
Before the experiment starts, scientists make educated guesses about the rate of outcomes and the expected difference between the two interventions. They try to avoid the situation in the first coin-flip experiment, where you can't tell whether 7 heads and 3 tails is signal or noise.
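If you like to see the numbers, here is a small sketch in Python (my own illustration, not anything from the trial) that puts an approximate 95% confidence interval around the proportion of heads in each of the three coin-flip scenarios, using a simple normal approximation:

```python
import math

# Approximate 95% confidence interval for a proportion (normal/Wald approximation).
# Purely illustrative: the three scenarios are the hypothetical coin flips above.
def wald_ci(heads, flips, z=1.96):
    p = heads / flips
    se = math.sqrt(p * (1 - p) / flips)   # standard error shrinks as flips grow
    return p, p - z * se, p + z * se

for heads, flips in [(7, 10), (70, 100), (700, 1000)]:
    p, lo, hi = wald_ci(heads, flips)
    print(f"{heads}/{flips} heads: estimate {p:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

With 10 flips the interval runs from roughly 0.42 to 0.98, which still includes 0.50, so a fair coin cannot be ruled out. With 1,000 flips it shrinks to roughly 0.67 to 0.73, and a fair coin is clearly excluded. Same estimate (70% heads), very different confidence.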
One way to avoid an inconclusive result is to recruit oodles of participants. The two problems with that are a) trials are expensive and funds are limited, and b) trials are actually human experiments, so you want to expose the minimum number of people to experimentation.
The POST-PCI trial set out to determine if the strategy of doing surveillance stress tests after a person gets a cardiac stent is beneficial.
The backstory: In the old days, cardiologists did stress tests after a patient had a coronary blockage fixed. The idea was that the fixes (balloon angioplasty, stents, or even bypass surgery) were not 100% reliable, and it was good to know if there were still residual obstructions. It didn't hurt that cardiologists were generously paid for these stress tests.
Over time, we learned that this lucrative strategy was mostly unnecessary, for two reasons: stents and bypass became very reliable, and numerous studies showed that medical therapy was highly effective in preventing future events, even when arteries had severe blockages.
Now, however, technology has advanced so much that cardiologists can put stents in super-high-risk situations. Situations that, in previous years, would have required bypass surgery.
One example is placing a stent in the left main coronary artery. This is high-risk because if that stent were to clot off or become re-stenosed (re-blocked), a huge amount of heart muscle would be in jeopardy.
The POST-PCI trialists randomized about 1,700 patients (roughly 850 in each group) who had undergone a high-risk coronary intervention to either routine stress testing afterward or standard care (stress tests only when indicated by symptoms).
They measured a really important endpoint—a composite of death, heart attack or a hospital admission for unstable angina.
The results: After 2 years, a bad outcome had occurred in 5.5% of patients in the stress-testing arm and 6.0% of those in the standard-care arm. That works out to a relative risk about 10% lower, which is usually expressed as a hazard ratio; here, the HR was 0.90.
The authors also reported a p-value, which is a measure of how surprising the data are under the assumption that there was no difference between the treatment arms. We call a result statistically significant if the p-value is below 0.05 (very surprising). In POST-PCI, the p-value was 0.62, which is not at all surprising under the assumption of no difference between the two strategies.
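To connect that to the coins: a p-value answers the question, "If the coin were actually fair, how surprising would this many heads be?" A quick sketch (again, my own illustration):

```python
from math import comb

# Two-sided binomial p-value: how surprising is this many heads if the coin is fair?
# Works for heads above half the flips; purely an illustration of the coin analogy.
def coin_p_value(heads, flips):
    upper_tail = sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips
    return min(1.0, 2 * upper_tail)   # double the tail, since a fair coin is symmetric

print(f"7 of 10 heads:   p = {coin_p_value(7, 10):.2f}")      # about 0.34: not surprising
print(f"70 of 100 heads: p = {coin_p_value(70, 100):.5f}")    # about 0.00008: very surprising
```

Seven heads in 10 flips is the sort of thing a fair coin does all the time (p of about 0.34). Seventy heads in 100 flips is not (p of about 0.00008). POST-PCI's p of 0.62 sits firmly in "a fair coin does this all the time" territory.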
I realize that the paragraph on p-values may read like gibberish. Try this:
What you want to know about this (or any) experiment is if the result was like the coin flip experiment of 7 heads/3 tails or the one that had 700H/300T. IOW, you want to know how confident to be in that 10% lower rate of events in the stress-testing arm.
Boom—that is why we have confidence intervals!
In POST-PCI, the 95% confidence intervals surrounding that hazard ratio of 0.90 were 0.61 to 1.35. What does that mean?
Sadly, this is not a good result. It means, roughly, that the intervention of doing surveillance stress tests after high-risk coronary intervention could have lowered the chance of a bad outcome by as much as 39% or increased it by as much as 35%.
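For anyone who wants to check that arithmetic, here is the conversion (a trivial sketch, using only the numbers reported above): a hazard ratio below 1 means lower risk, above 1 means higher risk, and the distance from 1 is the percent change.

```python
# Translate a hazard ratio (and its 95% CI limits) into a percent change in risk.
# Numbers are those reported for POST-PCI: HR 0.90, 95% CI 0.61 to 1.35.
def pct_change(hr):
    return (hr - 1) * 100   # negative = lower risk, positive = higher risk

for label, hr in [("point estimate", 0.90), ("lower bound", 0.61), ("upper bound", 1.35)]:
    print(f"{label}: HR {hr:.2f} -> {pct_change(hr):+.0f}% change in risk")
```

That prints -10%, -39%, and +35%, which is where the "as much as 39% lower or 35% higher" wording comes from.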
The editors of the New England Journal of Medicine are old-school. If an experiment does not meet that p-value threshold of 0.05, they make the authors conclude that there was no difference. Hence, the conclusion of this experiment reads:
Among high-risk patients who had undergone PCI, a follow-up strategy of routine functional testing, as compared with standard care alone, did not improve clinical outcomes at 2 years.
I don’t think that is right.
The message is that when the confidence intervals include a substantial benefit and substantial harm, the result is inconclusive. POST-PCI did not include enough patients to answer the question. We simply don’t know.
It is nobody’s fault. The authors assumed that these high-risk patients would have more bad outcomes than they actually did. If there had been more bad outcomes (more coin flips), we would have had more confidence: the 95% confidence intervals would have been tighter.
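Here is a rough sketch of why event counts, not just patient counts, drive the width of the interval. It uses a standard approximation: the standard error of the log hazard ratio is about the square root of (1/events in one arm + 1/events in the other). The event counts below are my assumptions for illustration (about 5.5% and 6.0% of roughly 850 patients per arm), not the trial's exact figures.

```python
import math

# Rough illustration: the 95% CI around a hazard ratio tightens as events accumulate.
# SE(log HR) is approximately sqrt(1/events_A + 1/events_B), a standard approximation.
# Event counts are assumptions for illustration, not the trial's exact figures.
def hr_ci(hr, events_a, events_b, z=1.96):
    se = math.sqrt(1 / events_a + 1 / events_b)
    return math.exp(math.log(hr) - z * se), math.exp(math.log(hr) + z * se)

hr = 0.90
for events_a, events_b in [(47, 51), (4 * 47, 4 * 51)]:   # observed-ish vs. four times as many
    lo, hi = hr_ci(hr, events_a, events_b)
    print(f"{events_a} + {events_b} events: 95% CI roughly {lo:.2f} to {hi:.2f}")
```

With roughly 47 and 51 events, the approximation gives an interval of about 0.61 to 1.34, essentially what the trial reported. With four times as many events and the same 10% relative reduction, the interval would be roughly 0.74 to 1.10: still crossing 1.0 in this hypothetical, but far more informative.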
The message for readers of medical studies: always look at the confidence intervals.
They give you a clue to how confident to be in a trial's results.
JMM
(BTW, a good contrast is last week's Study of the Week, DANCAVAS. That trial found a 5% reduction in death with screening. The confidence intervals ranged from 0.90 to 1.00 (or a 10% reduction to a 0% reduction). This means there was a very good chance that the screening program in that trial had a positive effect. It's one of the reasons why my take differed from Dr. Foy's.)
A larger sample size can be the answer, but not always.
I 100% agree that increasing sample size increases confidence (by reducing the width of the confidence interval), which can change an inconclusive finding into a conclusive one, even when using the old-school standard of p<0.05. But this can be true to a fault. That is the issue of a study being over-powered: with a huge sample size we can be very confident and will most likely find statistical significance, but we still need to interpret the importance and utility of that result. A statistically significant result can still have very limited practical implications. This occurs when the effect size is near zero even though statistical significance is found.

This represents a different way to game the system. If you have enough money to buy enough trials, you will likely be able to publish positive results even when those results really aren't going to matter. Granted, given the expense of medical trials, this is a less likely scenario; I'm a social scientist working in education, where researchers often use very large secondary data sets.

Either way, as everyone else is pointing out, the way conclusions are written in journal articles needs to be super clear. That should include not just statistical significance but practical significance.
My 2 cents, thanks.
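To put the commenter's over-powering point in numbers, here is a quick hypothetical sketch (made-up proportions, nothing to do with POST-PCI): a difference of 30.0% versus 30.5% that nobody would act on clinically is nowhere near significant with 1,000 patients per arm, but sails under p<0.05 with a million per arm.

```python
import math

# Hypothetical illustration of an over-powered comparison: a tiny, clinically
# unimportant difference becomes "statistically significant" with a huge sample.
def two_proportion_p(p1, p2, n1, n2):
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))   # two-sided, normal approx.

for n in [1_000, 1_000_000]:
    print(f"n per arm = {n:>9,}: p = {two_proportion_p(0.300, 0.305, n, n):.3f}")
```

The effect is the same half a percentage point in both rows; only the confidence changes.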
Thank you for writing this post in a way that any layperson can understand. Much appreciated!