A Note On Criticism Of Yesterday's Study of the Week
Experts in trial analysis and statistics objected to my review of the ARCADIA trial. Their objections make for a great teaching point.
Yesterday, someone made an error on the Internet. And I think it was me.
Here is a link to my take on the ARCADIA trial. In brief, the trial compared apixaban vs aspirin in patients who had had a stroke of unknown source and evidence of atrial cardiopathy.
The trial was stopped early for futility. Recurrent stroke (the primary endpoint) occurred in 40 patients in each group. The hazard ratio was 1.00.
I interpreted this the same way the authors did: as a negative study. Apixaban did not reduce the rate of recurrent stroke over aspirin in these patients.
My mistake, pointed out most clearly by Dr. Frank Harrell, a statistical expert from Vanderbilt, was that I wrongly equated the absence of evidence with evidence of absence. His full criticism is in the comments of yesterday’s post.
My understanding of Professor Harrell’s point is that the trial was underpowered, and we cannot be sure that there was no difference in stroke rates. In other words: we cannot say that ARCADIA provided evidence of absence of benefit from apixaban.
The first clue, one that I noticed but dismissed, was the confidence interval around the hazard ratio of 1.00: 0.64 to 1.55. I hesitate to write the next sentence because the probability that I get the wording exactly correct is very low. But suffice it to say that the confidence interval allowed for anywhere from a 36% reduction in stroke with apixaban to a 55% increase. Translation: these are wide and imprecise estimates.
The next question is why the estimates were so wide. The answer is that there were not enough primary outcome events, because too few patients were enrolled.
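To see how much the small event count matters, here is a rough back-of-the-envelope sketch in Python. It is not the trial’s actual Cox analysis; it simply uses the standard normal approximation for the log hazard ratio, whose standard error depends only on the event counts. The variable names are mine, for illustration.

```python
import math

# Rough reconstruction of ARCADIA's confidence interval from the
# event counts alone, using the usual normal approximation for the
# log hazard ratio: SE ~ sqrt(1/d1 + 1/d2).
events_apixaban = 40
events_aspirin = 40

hr = 1.00  # reported hazard ratio
se_log_hr = math.sqrt(1 / events_apixaban + 1 / events_aspirin)  # ~0.224

z = 1.96  # two-sided 95% confidence
lower = math.exp(math.log(hr) - z * se_log_hr)
upper = math.exp(math.log(hr) + z * se_log_hr)

print(f"95% CI: {lower:.2f} to {upper:.2f}")  # ~0.65 to 1.55
```

That this crude sketch lands essentially on the published interval of 0.64 to 1.55 is the point: with only 80 total events, the interval cannot be narrow, no matter how sophisticated the analysis.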
Trialists have to estimate, before a trial, how many patients to enroll. The main variables that go into this estimate are the expected event rates in the respective arms of the study and the minimal clinically important difference (MCID) you would like to detect.
The authors estimated a 7% annual stroke rate in patients with atrial cardiopathy who had just had a stroke. They used a previous trial of apixaban vs aspirin (AVERROES), which found a 55% reduction in stroke with apixaban, to declare a 40% reduction as their minimal clinically important difference. More translation: they “powered” ARCADIA to find a very large effect size.
You can use a rough sample size calculator to estimate how many patients this would require. They got to about 1100 patients.
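As a hedged illustration (the actual ARCADIA power calculation surely had more moving parts), the standard Schoenfeld approximation shows how the MCID drives the number of events required. The function name and defaults below are mine, for illustration:

```python
import math

def required_events(hr, z_alpha=1.96, z_beta=0.84):
    """Schoenfeld approximation: total primary events needed to
    detect a hazard ratio `hr` with 1:1 randomization. The default
    z-values correspond to two-sided alpha = 0.05 and 80% power."""
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hr) ** 2

# ARCADIA's MCID: a 40% reduction in stroke, i.e., HR = 0.60.
print(round(required_events(0.60)))  # ~120 events
```

Turning events into patients depends on the assumed event rate and follow-up; the year-and-a-half of follow-up below is my assumption for illustration, not a trial parameter. Under those numbers, about 1100 patients with a 7% annual stroke rate would be expected to yield on the order of 120 events (1100 × 0.07 × 1.5 ≈ 115), which is in the ballpark of the enrollment the authors arrived at.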
But. But. There is another factor in these calculations that does not appear in any calculator. And that is pragmatism. Before I started reviewing trials, I did not know about pragmatism in the design of trials.
What I mean by pragmatism is that trials cost money and effort. So, if you try to detect smaller differences between the two groups, you need many more patients. Precision, like money, does not grow on trees.
Dr. Sanjay Kaul calls this the sample-size samba: a dance (or tension) between enrolling enough patients to detect differences and staying within pragmatic budget constraints.
Many cardiology trials are funded by industry. Industry has plenty of money. These tend to be mega trials that can detect small differences between groups. (Look at the FOURIER trial of evolocumab vs placebo…N ≈ 27,000)
I played around with the online sample size calculator and entered a 25% reduction in stroke with apixaban in ARCADIA. This would have required roughly triple the number of patients. Well, that would have required a lot more funding. ARCADIA was not funded primarily by industry.
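Running the same illustrative formula with a hazard ratio of 0.75 instead of 0.60 shows where the tripling comes from:

```python
import math

# Same Schoenfeld approximation as above, comparing the events
# needed for a 25% reduction (HR 0.75) vs a 40% reduction (HR 0.60).
z = 1.96 + 0.84  # two-sided alpha = 0.05, 80% power

events_hr_060 = 4 * z**2 / math.log(0.60) ** 2  # ~120 events
events_hr_075 = 4 * z**2 / math.log(0.75) ** 2  # ~379 events

print(f"{events_hr_075 / events_hr_060:.1f}x more events")  # ~3.2x
```

At a fixed event rate and follow-up, roughly 3.2 times the events means roughly 3.2 times the patients, which matches what the calculator showed.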
Professor Harrell noted that the problem with ARCADIA, and really with many trials in cardiology, is that if you declare a minimal clinically important difference of a 40% reduction, you effectively ignore a 39% reduction in stroke.
Testing Atrial Cardiopathy: I want to offer one rebuttal regarding the specifics of ARCADIA. This trial was not only a comparison of two drugs in reducing a binary outcome of stroke. ARCADIA was also about testing the concept of atrial cardiopathy as a screening tool for high stroke risk.
The authors estimated that their novel idea of atrial structural disease would significantly influence stroke rates. It did not. That may have been because the criteria for atrial disease (ECG, biomarker, echo) were not stringent enough. Or it might be that atrial cardiopathy is just not that strong a causative factor in recurrent stroke.
There is more to learn about AF and stroke.
I write this post to alert readers to an important teaching point about uncertainty in trial estimates. I noted the confidence intervals, but perhaps I was distracted by the hazard ratio of 1.00 and the exact same number of events in each group.
Being wrong is a great teaching tool. I am grateful for the interaction that Sensible Medicine brings out from real experts. I learned a lot.
One fear I have, though, is that if we apply Professor Harrell’s strict criteria, we will find that many trials have high degrees of uncertainty. Confidence intervals in our trials are often wide.
That’s ok. Because rare are the things that we should be certain about in Medicine. It’s one of the reasons that I do not love those colored boxes in guideline statements.
In his comment, Dr. Harrell offers some solutions for resolving the tension between not having a trial and having an imperfect trial. Please do read his entire comment.
But wait. We are ALWAYS referring to absence of evidence. That should go without saying.
To require evidence of absence would be to require proof of a negative. That’s a logical impossibility. That’s not how trials work…ever.
So a trial starts with a hypothesis and pre-specified statistical parameters upon which one can reject the null and accept the hypothesis. In this case, the statistical analysis showed we could not reject the null and accept the hypothesis; so instead, we must reject the hypothesis and accept the null, i.e., accept that “X” does NOT work (in this case, as the trial was done in this instance). That is of course different from stating one has PROOF that X doesn’t work. This is how it is, and how it has always been.
Hence, there is always the possibility that a larger trial with more endpoint events MIGHT show that X in fact does work (and we can reject the null that it doesn’t). Or that an identical study using different power assumptions might have yielded different statistical results. The onus is on the proponents, of whatever, to do that work and show that proof.
You can always say, under different trial and statistical circumstances, “X might have worked,” but that still absolves precisely nobody from having to go out and prove it.
That said, the interesting thing in this trial to me is still the metrics for determining atrial cardiopathy as a predictor of future overt AF. We already know a DOAC is better than aspirin in clinical AF. That’s news to nobody. What I take from this is that the way this study tried to determine atrial cardiopathy has not been shown to be effective (i.e., absence of evidence). I would not yet, on the basis of this study, conclude that the entire concept of atrial cardiopathy is without merit (i.e., evidence of absence).