Can Data Just Speak for Itself?

John Mandrola

Feb 5, 2024

Gosh did I find a surprise looking at the original carvedilol trials in heart failure.

20 Comments

Sander Greenland

Feb 6, 2024

Dr. Mandrola: Thanks for the great examples and wisely uncertain conclusions...

Here is my view of your title question, the examples, and certain earlier comments. Other comments stated some of these points below succinctly in their own frameworks. My take will not be succinct, and I hope that anyone who starts reading it will read it carefully to the end before replying.

First and foremost: Data don't speak - If you hear the data speaking, seek psychiatric care. As was recognized by the 1940s in the then-emergent field of information theory, the data only convey information based on assumptions about how they were generated. In statistics, those assumptions are encoded (really, buried very abstractly) in the methods used to generate statistical estimates and P-values. Those assumptions include that all the trials in play are perfectly conducted and analyzed. Those assumptions might not be in question in the present examples. Nonetheless, more controversial yet equally abstract assumptions and arbitrary conventions are used to derive qualitative decisions or "conclusions" from those statistics; those include assumptions about the costs of possibly mistaken actions in light of the estimates and P-values, such as to declare "significance" or "nonsignificance" based on a P-value or interval estimate, or to report apparent conflict between earlier and later studies as "replication failure".

To illustrate: The first thing I noticed about the VEST and PRAISE pairs is that they look cherry picked to raise alarm about overdrawing decisions from single trials where no previous trials existed to check against. Following Fisher, the P-values for the secondary endpoints in the first studies (VEST-1 and PRAISE-1) only indicated that a "confirmatory study" or "replication attempt" was worth pursuing. Hopefully, this decision to pursue was not just because of their small P-values; presumably, the ranges of their interval estimates looked important enough to warrant the expense of further study of the effects, and included effect sizes that were plausible in light of what was already known.

Of course, comparing results from different studies needs to be direct and analytical, not based on pointing out that one was "significant" and the other "not significant." There was certainly more than that in the present examples: By any standard I know of, the second trials were in outright conflict with the first trials: The differences of VEST-2 from VEST-1 and of PRAISE-2 from PRAISE-1 were huge in practical terms, and both differences had p < 0.001. The changes in the estimates from the first to second trials look so large statistically and practically as to suggest something other than or in addition to chance caused the difference.
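To make "direct and analytical" concrete, here is a minimal sketch of one common way to compare two trial estimates head to head: take the difference of the log ratio estimates and divide by its standard error, with standard errors backed out of the reported 95% intervals. It assumes approximately normal, independent log-ratio estimates, and the numbers are hypothetical placeholders, not the VEST or PRAISE results; the function names are made up for illustration.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval


def se_from_ci(lower, upper):
    """Approximate SE of a log ratio from its reported 95% limits."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


def compare_ratio_estimates(est1, lo1, hi1, est2, lo2, hi2):
    """Test the difference between two independent ratio estimates.

    Works on the log scale, where the estimates are closer to normal.
    Returns the ratio of ratios, the z statistic, and a two-sided P-value.
    """
    se1 = se_from_ci(lo1, hi1)
    se2 = se_from_ci(lo2, hi2)
    diff = math.log(est2) - math.log(est1)
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    z = diff / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return {"ratio_of_ratios": math.exp(diff), "z": z, "p": p}


# Hypothetical first and second trial results (NOT real trial data):
# first trial ratio 0.40 (0.20, 0.80); second trial ratio 1.20 (0.95, 1.52)
result = compare_ratio_estimates(0.40, 0.20, 0.80, 1.20, 0.95, 1.52)
print(result)
```

The point of the sketch is that the comparison is made on the difference itself, with its own interval and P-value, rather than by noting that one trial crossed a significance threshold and the other did not.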

Using the same methodology that raised suspicion against the secondary results in VEST-1 and PRAISE-1, the list of possible causes includes selection bias in the form of selecting these trials precisely because the replication attempts were in clear conflict. This meta-selection bias is no different in its capacity to mislead than is selecting effect estimates out of one trial based on their looking important and "statistically significant". It could also reflect real differences in effects between the first and second trials, perhaps due to differences in patient selection for the trials, differences in standard of care used for the untreated (controls), and differences in follow-up - one would have to scrutinize the reports carefully to judge such post hoc hypotheses, but to ignore these possibilities would be scientifically irresponsible.

So then, what's the difference between these examples of secondary endpoints and examples of primary endpoints? One would be the difference in pre-trial baseline (or prior) expectations: The primary endpoint presumably had the benefit of enough background information (e.g., earlier-phase trials) to motivate conducting the first trial with that endpoint named from the start as primary and used in the trial planning (e.g., in study-size calculations). This kind of use is seen when the active-treatment arm is twice the size of the placebo arm: This imbalance suggests that a halving of an adverse primary outcome was expected. In contrast, describing results for secondary endpoints as "unexpected" suggests there was little or no such background information. And labeling secondary results as likely "false positives" suggests an even stronger baseline expectation, namely that the treatment would have no important effect on the secondary outcomes.

Conventional frequentist statistics can account for these baseline expectations using controversial, counterintuitive, and often torturous multiple-comparisons adjustments, e.g., as reviewed here:

https://onlinelibrary.wiley.com/doi/full/10.1111/ppe.12711

As Frank Harrell could, I think, verify, Bayesian methodology can offer a more straightforward alternative by noting that the baseline expectations for the primary endpoint were concentrated toward beneficial effects, whereas the baseline expectations for the secondary endpoints were concentrated toward no effect. By translating these expectations into prior probability distributions and using them in a Bayesian analysis, we can get posterior distributions for all effects, primary or secondary.
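As a rough illustration of how such priors change the reading of the same data, the sketch below applies a normal-normal conjugate update on the log hazard-ratio scale. Everything in it is an assumption made for illustration: the observed estimate, its standard error, and both priors are hypothetical, and a real analysis would need a defensible multivariate prior rather than two separate univariate ones.

```python
import math


def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior = 1.0 / prior_sd ** 2
    w_data = 1.0 / se ** 2
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return post_mean, math.sqrt(post_var)


def prob_benefit(mean, sd):
    """Posterior probability that the log hazard ratio is below zero."""
    return 0.5 * math.erfc(mean / (sd * math.sqrt(2)))


# Hypothetical trial result: hazard ratio 0.50 with SE 0.30 on the log scale
estimate, se = math.log(0.50), 0.30

# Hypothetical priors (assumptions, not taken from any trial):
priors = {
    "primary (expects benefit)": (math.log(0.80), 0.35),
    "secondary (expects no effect)": (0.0, 0.15),
}

posteriors = {}
for label, (m, s) in priors.items():
    post_mean, post_sd = normal_posterior(m, s, estimate, se)
    posteriors[label] = (post_mean, post_sd)
    print(label,
          "posterior HR:", round(math.exp(post_mean), 2),
          "P(benefit):", round(prob_benefit(post_mean, post_sd), 3))
```

With identical data, the prior concentrated near no effect pulls the posterior much closer to a hazard ratio of 1 than the prior that anticipated benefit, which is the formal counterpart of treating an unexpected secondary result more skeptically than an expected primary one.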

Unfortunately, in statistics as in life there is no free lunch: With Bayesian methods, the problem of multiple comparisons is replaced by the challenge of creating a credible multivariate prior for the effects. Reasonable possibilities for a credible prior are limitless, as are disagreements about what "credible" means. One attempt to address that problem is to build the prior from what has been reported in other trials. That is what one does in an intuitive fashion when one dismisses an initial result by saying that initial results usually don't replicate. A much more empirically grounded (and hopefully much less prejudicial) way to operationalize this intuition has been developed by van Zwet and colleagues, as reviewed here:

https://evidence.nejm.org/doi/full/10.1056/EVIDoa2300003

In pointing to this and any alternative to familiar analyses, it is important to compare its results to those from conventional approaches. This advice leads to method-sensitivity analysis, also known as multiverse analysis, which applies and compares results from multiple methodologies:

https://en.wikipedia.org/wiki/Multiverse_analysis

Cautions include that all analyses must be reported (at least as a supplement) and summarized, and that possible reasons for any apparent conflict should be investigated and discussed. With that done, such comparisons should help us understand why the data never "speak for themselves"; rather, it is we who speak of the data filtered through whatever methodologies and prejudices we used to analyze the data - including methodological misconceptions and prejudices.

Ernest N. Curtis

Feb 6, 2024

Your thoughtful comments are, as always, greatly appreciated.

Sensible Medicine
