Can Data Just Speak for Itself?
Gosh did I find a surprise looking at the original carvedilol trials in heart failure.
Tomorrow over at CardiologyTrials we will describe the CAPRICORN trial of carvedilol vs placebo in patients post myocardial infarction. Carvedilol is now widely accepted for this indication, but CAPRICORN featured a super-interesting twist regarding how we interpret data given pre-experiment choices. Do tune in. (Also, the bad typo in the original e-mail is fixed.)
This discovery led me to an even more interesting twist regarding the first carvedilol trial in heart failure. Here is a screenshot of the trial from NEJM in 1996.
I want to state clearly now that I do not know the right resolution to the main question of this post. I am not sure there is a right answer.
The main conclusion of this seminal paper was that carvedilol reduces death and the risk of hospitalization for cardiac reasons.
The trial enrolled about 1,100 patients with 2:1 randomization; about 700 in the carvedilol arm and 400 in the placebo arm. “There were 31 deaths (7.8 percent) in the placebo group and 22 deaths (3.2 percent) in the carvedilol group.” This represented a 65% decrease in death, and the 95% confidence interval ranged from a 39% to an 80% decrease in death. The p-value was less than 0.001.
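As a rough check of those numbers, here is a sketch of the arithmetic. The arm sizes (roughly 696 and 398) are my assumptions from the "about 700 and 400" figures, and the published 65% reduction came from a time-to-event analysis rather than this crude risk ratio, so the results differ a bit.

```python
from math import exp, log, sqrt

# Assumed arm sizes (~696 carvedilol, ~398 placebo) and the reported death counts.
deaths_carv, n_carv = 22, 696
deaths_plac, n_plac = 31, 398

risk_carv = deaths_carv / n_carv   # ~3.2%
risk_plac = deaths_plac / n_plac   # ~7.8%
rr = risk_carv / risk_plac         # crude risk ratio

# Standard error of log(RR) and a 95% confidence interval on the ratio scale
se_log_rr = sqrt(1/deaths_carv - 1/n_carv + 1/deaths_plac - 1/n_plac)
lo, hi = exp(log(rr) - 1.96*se_log_rr), exp(log(rr) + 1.96*se_log_rr)

print(f"Crude risk ratio {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
print(f"Relative risk reduction about {(1 - rr)*100:.0f}%")
```

The crude calculation lands in the same ballpark as the published figure, which is reassuring about the arithmetic but says nothing about the interpretive problem that follows.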
It sounds like a no-brainer, doesn’t it? Yet the first time it went to an FDA advisory committee meeting it was voted down—definitively.
How could that be?
Well, it turns out that there were 4 US carvedilol trials in HF. Three of the four, including this one, had a functional endpoint of exercise tolerance. None of these three trials met statistical significance for this endpoint. Carvedilol did not help patients with heart failure feel better.
But there is even more. In the above seminal trial, total mortality was not selected as a primary or secondary endpoint. Experts in HF at the time, therefore, must not have expected it.
The difference in total mortality was a surprise finding.
The first FDA advisory committee had many statisticians on it. Their thinking, summed up in an amazingly clear piece by Dr. Lemuel Moye, goes something like this:
Scientists must declare a primary endpoint before the experiment. This, Moye writes, is the axis around which a trial revolves.
That is not what happened in the case of the US carvedilol trials. The investigators wondered, and to be honest so did I, why you would not just let the data speak for itself. Such an important death signal cannot be ignored.
I wrote about this issue to regulatory expert Dr. Sanjay Kaul, who sent me two examples wherein an after-the-fact surprise mortality signal was NOT confirmed.
One was vesnarinone and the other amlodipine.
You can see from this half slide that both drugs in their first trial found huge reductions in death. In both cases however, total mortality was not the primary endpoint.
Here is the slide with the larger studies. You can see that neither signal panned out in subsequent larger studies.
In the carvedilol case, the investigators asked us to ignore the negative primary endpoint of the trials (exercise tolerance) and focus on the mortality finding, because it was a) important, b) a big decrease, and c) statistically persuasive.
Dr. Moye disagreed. He likened ignoring the negative primary endpoint to hiding the “crazy aunt in the attic.” He explained that an experiment has a glassful of false-positive chances. If you use part of it up on the primary endpoint, there is only so much left for subsequent endpoints.
In the carvedilol example, the p-value for the primary endpoint was over 0.05, so all of the false-positive chances were used up. Thus, the downstream mortality finding was highly likely to be a false positive, akin to the vesnarinone and amlodipine examples I just showed you.
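To see Moye's point in numbers, here is a quick simulation sketch. It assumes two independent endpoints and a drug with no true effect on either; if we are willing to claim victory on either endpoint at the usual 0.05 threshold, the chance of a false positive roughly doubles. (Real endpoints like exercise tolerance and death are correlated, which softens but does not remove the inflation.)

```python
import random

random.seed(1)
trials = 200_000
false_positives = 0

for _ in range(trials):
    # z-statistics for a primary and a secondary endpoint when the
    # treatment truly has no effect on either (independent endpoints assumed).
    z_primary = random.gauss(0.0, 1.0)
    z_secondary = random.gauss(0.0, 1.0)
    # Claim victory if either endpoint clears the usual 1.96 threshold,
    # i.e., fall back on the secondary endpoint when the primary misses.
    if abs(z_primary) > 1.96 or abs(z_secondary) > 1.96:
        false_positives += 1

print(f"Familywise false-positive rate: {false_positives / trials:.3f}")
# Prints roughly 0.10 -- nearly double the nominal 0.05.
```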
Moye worried greatly about foisting false findings into the public realm.
In the end, Moye's worries were not realized. A subsequent carvedilol study proved the drug reduced mortality in heart failure. Two other beta-blockers, metoprolol succinate and bisoprolol, were also shown effective in reducing death in stable outpatients with heart failure.
The Take-Home Lessons
I’ve been showing the beta-blocker heart failure trials as examples of clear signals of benefit. And they ended up being so.
But I did not know that the first trials were actually negative for their primary endpoint. The investigators noted the surprising signal of lower mortality. They let the data speak for itself.
Carvedilol in stable outpatients with heart failure turned out not to be vesnarinone or amlodipine, but this was not known at the time of the second FDA meeting in 1997 when a more clinician-heavy committee approved carvedilol.
When we show CAPRICORN tomorrow on CardiologyTrials, a similar story will come up. That is, what should we think of surprising results downstream from a nonsignificant primary endpoint?
I don’t have the answer. Some of it involves philosophy. Namely, what is worse, letting an uncertain potentially harmful therapy out in society, or holding back a potentially beneficial one?
For me, I lean on the first do no harm principle. Proponents of a new therapy ought to show us their intervention works in a proper way—via the primary endpoint.
The second question I pose pertains to the use of statistical principles.
A reader of that first carvedilol trial in NEJM in 1996 would have little idea that the authors made quite a contentious statistical leap. They were proven correct, but this was not known at the time.
I agree with your answer to the question:
"Namely, what is worse, letting an uncertain potentially harmful therapy out in society, or holding back a potentially beneficial one?"
Stats 101 is built on the notion of rejecting the NULL hypothesis, i.e., we assume at the start of any research project that there are NO differences/benefits. It's the scary high wall of PROOF that science demands we get over. No evidence, or merely "potential" or "suggestive" benefits, should be treated as NO proof. With the replicability problem in focus now, as well as the tendency not to publish research with null findings, we need to be all the more strict about the fundamentals of science/research. Otherwise we will do HARM and/or waste precious resources that could be better directed elsewhere.
Thanks for your excellent cautionary writing John. Keep those standards and guardrails strong!
Dr. Mandrola: Thanks for the great examples and wisely uncertain conclusions...
Here is my view of your title question, the examples, and certain earlier comments. Other comments stated some of these points below succinctly in their own frameworks. My take will not be succinct, and I hope that anyone who starts reading it will read it carefully to the end before replying.
First and foremost: Data don't speak - If you hear the data speaking, seek psychiatric care. As was recognized by the 1940s in the then-emergent field of information theory, the data only convey information based on assumptions about how they were generated. In statistics, those assumptions are encoded (really, buried very abstractly) in the methods used to generate statistical estimates and P-values. Those assumptions include that all the trials in play are perfectly conducted and analyzed. Those assumptions might not be in question in the present examples. Nonetheless, more controversial yet equally abstract assumptions and arbitrary conventions are used to derive qualitative decisions or "conclusions" from those statistics; those include assumptions about the costs of possibly mistaken actions in light of the estimates and P-values, such as to declare "significance" or "nonsignificance" based on a P-value or interval estimate, or to report apparent conflict between earlier and later studies as "replication failure".
To illustrate: The first thing I noticed about the VEST and PRAISE pairs is that they look cherry picked to raise alarm about overdrawing decisions from single trials where no previous trials existed to check against. Following Fisher, the P-values for the secondary endpoints in the first studies (VEST-1 and PRAISE-1) only indicated that a "confirmatory study" or "replication attempt" was worth pursuing. Hopefully, this decision to pursue was not just because of their small P-values; presumably, the ranges of their interval estimates looked important enough to warrant the expense of further study of the effects, and included effect sizes that were plausible in light of what was already known.
Of course, comparing results from different studies needs to be direct and analytical, not based on pointing out that one was "significant" and the other "not significant." There was certainly more than that in the present examples: By any standard I know of, the second trials were in outright conflict with the first trials: The differences of VEST-2 from VEST-1 and of PRAISE-2 from PRAISE-1 were huge in practical terms, and both differences had p < 0.001. The changes in the estimates from the first to second trials look so large statistically and practically as to suggest something other than or in addition to chance caused the difference.
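To make "direct and analytical" concrete, here is a minimal sketch of a trial-vs-trial comparison on the log hazard-ratio scale. The hazard ratios and intervals below are hypothetical placeholders, not the published VEST or PRAISE numbers.

```python
from math import log, sqrt
from statistics import NormalDist

# Hypothetical placeholder results (NOT the published VEST or PRAISE estimates):
# trial 1: HR 0.40 (95% CI 0.25 to 0.65); trial 2: HR 1.20 (95% CI 0.95 to 1.50).
def log_hr_and_se(hr, lo, hi):
    # Recover the standard error of log(HR) from the width of the 95% interval.
    return log(hr), (log(hi) - log(lo)) / (2 * 1.96)

b1, se1 = log_hr_and_se(0.40, 0.25, 0.65)
b2, se2 = log_hr_and_se(1.20, 0.95, 1.50)

# Two-sided z-test for the difference between the two log hazard ratios.
z = (b1 - b2) / sqrt(se1**2 + se2**2)
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"Trial-vs-trial difference: z = {z:.2f}, p = {p:.1e}")
```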
Using the same methodology that raised suspicion against the secondary results in VEST-1 and PRAISE-1, the list of possible causes includes selection bias in the form of selecting these trials precisely because the replication attempts were in clear conflict. This meta-selection bias is no different in its capacity to mislead than is selecting effect estimates out of one trial based on their looking important and "statistically significant". It could also reflect real differences in effects between the first and second trials, perhaps due to differences in patient selection for the trials, differences in standard of care used for the untreated (controls), and differences in follow-up - one would have to scrutinize the reports carefully to judge such post hoc hypotheses, but to ignore these possibilities would be scientifically irresponsible.
So then, what's the difference between these examples of secondary endpoints and examples of primary endpoints? One would be the difference in pre-trial baseline (or prior) expectations: The primary endpoint presumably had the benefit of enough background information (e.g., earlier-phase trials) to motivate conducting the first trial with that endpoint named from the start as primary and used in the trial planning (e.g., in study-size calculations). This kind of use is seen when the active-treatment arm is twice the size of the placebo arm: This imbalance suggests that a halving of an adverse primary outcome was expected. In contrast, describing results for secondary endpoints as "unexpected" suggests there was little or no such background information. And labeling secondary results as likely "false positives" suggests an even stronger baseline expectation, namely that the treatment would have no important effect on the secondary outcomes.
Conventional frequentist statistics can account for these baseline expectations using controversial, counterintuitive, and often torturous multiple-comparisons adjustments, e.g., as reviewed here:
https://onlinelibrary.wiley.com/doi/full/10.1111/ppe.12711
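For a feel of what such an adjustment does in practice, here is a minimal Holm-Bonferroni sketch. The three p-values are hypothetical placeholders for one primary and two secondary endpoints, not numbers taken from any of the trials above.

```python
# Holm-Bonferroni step-down adjustment for a small family of endpoints.
# The p-values are hypothetical placeholders, not taken from any trial above.
p_values = {"primary endpoint": 0.20,
            "secondary endpoint A": 0.004,
            "secondary endpoint B": 0.0008}

def holm_adjust(pvals):
    """Return Holm step-down adjusted p-values (monotone, capped at 1)."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(items)
    adjusted, running_max = {}, 0.0
    for rank, (name, p) in enumerate(items):
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[name] = running_max
    return adjusted

for name, adj_p in holm_adjust(p_values).items():
    print(f"{name}: adjusted p = {adj_p:.4f}")
```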
As Frank Harrell could, I think, verify, Bayesian methodology can offer a more straightforward alternative by noting that the baseline expectations for the primary endpoint were concentrated toward beneficial effects, whereas the baseline expectations for the secondary endpoints were concentrated toward no effect. By translating these expectations into prior probability distributions and using them in a Bayesian analysis, we can get posterior distributions for all effects, primary or secondary.
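A minimal sketch of that idea, using a normal approximation on the log hazard-ratio scale with made-up prior settings and a made-up trial estimate (nothing here is from an actual analysis by Harrell or the trialists):

```python
from math import exp, sqrt

def posterior(prior_mean, prior_sd, est, se):
    """Conjugate normal-normal update on the log hazard-ratio scale."""
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    post_mean = (w_prior * prior_mean + w_data * est) / (w_prior + w_data)
    post_sd = sqrt(1 / (w_prior + w_data))
    return post_mean, post_sd

# Hypothetical observed effect: log HR of -0.9 (HR ~0.41) with standard error 0.27.
est, se = -0.9, 0.27

priors = {
    # Pre-specified primary endpoint: mildly optimistic, centered on modest benefit.
    "primary-style prior": (-0.2, 0.5),
    # Surprise secondary endpoint: skeptical, centered on no effect and fairly tight.
    "skeptical secondary prior": (0.0, 0.2),
}
for label, (pm, psd) in priors.items():
    m, s = posterior(pm, psd, est, se)
    print(f"{label}: posterior HR {exp(m):.2f} "
          f"(95% interval {exp(m - 1.96*s):.2f} to {exp(m + 1.96*s):.2f})")
```

The same estimate that looks dramatic under the optimistic "primary-style" prior is pulled much closer to no effect under the skeptical prior, which is one way of formalizing the committee's doubts about a surprise mortality finding.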
Unfortunately, in statistics as in life there is no free lunch: With Bayesian methods, the problem of multiple comparisons is replaced by the challenge of creating a credible multivariate prior for the effects. Reasonable possibilities for a credible prior are limitless, as are disagreements about what "credible" means. One attempt to address that problem is to build the prior from what has been reported in other trials. That is what one does in an intuitive fashion when one dismisses an initial result by saying that initial results usually don't replicate. A much more empirically grounded (and hopefully much less prejudicial) way to operationalize this intuition has been developed by van Zwet and colleagues, as reviewed here:
https://evidence.nejm.org/doi/full/10.1056/EVIDoa2300003
In pointing to this and any alternative to familiar analyses, it is important to compare its results to those from conventional approaches. This advice leads to method-sensitivity analysis, also known as multiverse analysis, which applies and compares results from multiple methodologies,
https://en.wikipedia.org/wiki/Multiverse_analysis
- cautions include that all analyses must be reported (at least as a supplement) and summarized, and that possible reasons for any apparent conflict should be investigated and discussed. With that done, such comparisons should help us understand why the data never "speak for themselves"; rather, it is we who speak of the data, filtered through whatever methodologies and prejudices we used to analyze them - including methodological misconceptions and prejudices.
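A toy illustration of that spirit: push one hypothetical estimate through a few different analysis choices and lay the answers side by side (every number below is a placeholder, not a published trial result).

```python
from math import exp

# One hypothetical estimate (log HR -0.9, SE 0.27) summarized under
# three different analysis choices; every number here is a placeholder.
est, se = -0.9, 0.27

def posterior_hr(prior_mean, prior_sd):
    """Posterior median HR from a conjugate normal-normal update."""
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    return exp((w_prior * prior_mean + w_data * est) / (w_prior + w_data))

analyses = {
    "unadjusted frequentist estimate": exp(est),
    "Bayes, near-flat prior (sd 2.0)": posterior_hr(0.0, 2.0),
    "Bayes, skeptical prior (sd 0.2)": posterior_hr(0.0, 0.2),
}
for name, hr in analyses.items():
    print(f"{name:35s} HR ~ {hr:.2f}")
```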