"Namely, what is worse, letting an uncertain potentially harmful therapy out in society, or holding back a potentially beneficial one?"

Stats 101 is built on the notion of rejecting the NULL hypothesis, i.e. we assume at the start of any research project that there are NO differences/benefits. It's the scary high wall of PROOF that science demands we get over. No evidence or "potential" or "suggestive" benefits should be treated as NO proof. With the replicability problem in focus now, as well as the tendency to not publish research with null findings, we need to be all the more strict about the fundamentals of science/research. Otherwise we will do HARM and/or waste precious resources that could be better directed elsewhere.

Thanks for your excellent cautionary writing John. Keep those standards and guardrails strong!

Dr. Mandrola: Thanks for the great examples and wisely uncertain conclusions...

Here is my view of your title question, the examples, and certain earlier comments. Other comments stated some of these points below succinctly in their own frameworks. My take will not be succinct, and I hope that anyone who starts reading it will read it carefully to the end before replying.

First and foremost: Data don't speak - If you hear the data speaking, seek psychiatric care. As was recognized by the 1940s in the then-emergent field of information theory, the data only convey information based on assumptions about how they were generated. In statistics, those assumptions are encoded (really, buried very abstractly) in the methods used to generate statistical estimates and P-values. Those assumptions include that all the trials in play are perfectly conducted and analyzed. Those assumptions might not be in question in the present examples. Nonetheless, more controversial yet equally abstract assumptions and arbitrary conventions are used to derive qualitative decisions or "conclusions" derived from those statistics; those include assumptions about the costs of possibly mistaken actions in light of the estimates and P-values, such as to declare "significance" or "nonsignificance" based on a P-value or interval estimate, or to report apparent conflict between earlier and later studies as "replication failure".

To illustrate: The first thing I noticed about the VEST and PRAISE pairs is that they look cherry picked to raise alarm about overdrawing decisions from single trials where no previous trials existed to check against. Following Fisher, the P-values for the secondary endpoints in the first studies (VEST-1 and PRAISE-1) only indicated that a "confirmatory study" or "replication attempt" was worth pursuing. Hopefully, this decision to pursue was not just because of their small P-values; presumably, the ranges of their interval estimates looked important enough to warrant the expense of further study of the effects, and included effect sizes that were plausible in light of what was already known.

Of course, comparing results from different studies needs to be direct and analytical, not based on pointing out that one was "significant" and the other "not significant." There was certainly more than that in the present examples: By any standard I know of, the second trials were in outright conflict with the first trials: The differences of VEST-2 from VEST-1 and of PRAISE-2 from PRAISE-1 were huge in practical terms, and both differences had p < 0.001. The changes in the estimates from the first to second trials look so large statistically and practically as to suggest something other than or in addition to chance caused the difference.

Using the same methodology that raised suspicion against the secondary results in VEST-1 and PRAISE-1, the list of possible causes includes selection bias in the form of selecting these trials precisely because of the replication attempts were in clear conflict. This meta-selection bias is no different in its capacity to mislead than is selecting effect estimates out of one trial based their looking important and "statistically significant". It could also reflect real differences in effects between the first and second trials, perhaps due to differences in patient selection for the trials, differences in standard of care used for the untreated (controls), and differences in follow-up - one would have to scrutinize the reports carefully to judge such post hoc hypotheses, but to ignore these possibilities would be scientifically irresponsible.

So then, what's the difference between these examples of secondary endpoints and examples of primary endpoints? One would be the difference in pre-trial baseline (or prior) expectations: The primary endpoint presumably had the benefit of enough background information (e.g., earlier-phase trials) to motivate conducting the first trial with that endpoint named from the start as primary and used in the trial planning (e.g., in study-size calculations). This kind of use is seen when the active-treatment arm is twice the size of the placebo arm: This imbalance suggests that a halving of an adverse primary outcome was expected. In contrast, describing results for secondary endpoints as "unexpected" suggests there was little or no such background information. And labeling secondary results as likely "false positives" suggests an even stronger baseline expectation, namely that the treatment would have no important effect on the secondary outcomes.

Conventional frequentist statistics can account for these baseline expectations using controversial, counterintuitive, and often torturous multiple-comparisons adjustments, e.g., as reviewed here:

As Frank Harrell could I think verify, Bayesian methodology can offer a more straightforward alternative by noting that the baseline expectations for the primary endpoint were concentrated toward beneficial effects, whereas the baseline expectations for the secondary endpoints were concentrated toward no effect. By translating these expectations into prior probability distributions and using them in a Bayesian analysis, we can get posterior distributions for all effects, primary or secondary.

Unfortunately, in statistics as in life there is no free lunch: With Bayesian methods, the problem of multiple comparisons is replaced by the challenge of creating a credible multivariate prior for the effects. Reasonable possibilities for a credible prior are limitless, as are disagreements about what "credible" means. One attempt to address that problem is to build the prior from what has been reported in other trials. That is what one does in an intuitive fashion when one dismisses an initial result by saying that initial results usually don't replicate. A much more empirically grounded (and hopefully much less prejudicial) way to operationalize this intuition has been developed by van Zwet and colleagues, as reviewed here:

In pointing to this and any alternative to familiar analyses, it is important to compare its results to those from conventional approaches. This advice leads to method-sensitivity analysis, also known as multiverse analysis, which applies and compares results from multiple methodologies,

- cautions include that all analyses must be reported (at least as a supplement) and summarized, and possible reasons for any apparent conflict should be investigated and discussed. With that done, such comparisons should help understand why the data never "speak for themselves"; rather, it is we who speak of the data filtered through whatever methodologies and prejudices we used to analyze the data - including methodological misconceptions and prejudices.

Assuming the correlation isn't spurious, mechanism of action remains an open question.

It's beginning to look like neurohormonal suppression of inflammatory response may be the answer. Given the associated risk of hypotension, the best we might hope for with Carvedilol is its wider tolerance.

For post-STEMI patients without hypertension, further research could suggest an alternate anti-inflammatory regimen.

As a physician with PAF, I've been on metoprolol before. I didn't continue it, because it made me feel like crap. I can't comment on how patients in failure feel on metoprolol compared to how they feel untreated nor can I comment on the newer med. I do know that quality of life is as important, if not more important, than quantity of life. I think far too many physicians ignore this side of the issue.

Very useful and excellent article. I think the take home lessons are all in the table that compares Vest 1 & 2 and Praise 1 &2. I think the lessons are obvious and commonsensical--no PHD required---and have little to do with specific end point selection.

1) Relatively small studies can give wildly inaccurate results.

2) Studies of relatively short duration can give similarly inaccurate results.

3) The more end points included, the more likely one will find some that are "statistically significant".

Vest 2 has a sample size more than 5 times that of Vest 1 and completely erases the "highly significant" mortality differences of Vest 1; in fact, by conventional criteria (p=0.02) placebo outperforms the drug. No duration of follow up was given for Vest 1 & 2, but I would guess #2 was longer.

Praise 2 had sample sizes about 3 times that of Praise 1 and the duration of follow up was about 2.5 times longer. It also showed a reversal in favor of placebo but the difference was not statistically significant. An ancillary lesson here is not to give too much credence to measures of statistical significance.

The third point about multiple end points is intuitively obvious.

I appreciate that you focus on “first do no harm.” I’m seeing less and less of that in (omg I sound like an old fart) young docs. I sometimes wonder the way they are (and not) taught if some of the harms done are due to ignorance or mixed messaging.

John, what a great question. I believe there’s no clear answer, but I’m inclined to let data speak for itself - so long as that data is accurate. Finding important data that was no considered as a primary or secondary endpoint means you have to verify its accuracy, though we should be doing that anyway (insert discussion of importance of reproducibility here).

There’s a flip side, too. Am I the only one who finds that most trials do NOT let data speak for themselves? Instead, we obsfuscate with p values, ARR, and confidence intervals. We lean on statistical significance, ignoring that clinical significance is what matters (but is subjective). when my patients ask me why they have to be on a statin/Plavix/empagliflozin/Repatha, why should they have a screening mammogram or colonoscopy - I share the raw data. I tell them to ignore these funny sounding stats terms and just look at raw numbers. This group had this happen, that group had that happen.

Notably, patients typically find these changes to be clinically insignificant and question why their doctors think these things are so important.

You ask if data can speak for itself. I believe that's a very good question. And I believe with AI and in the scientific and analytic communities, we are so enamored of numbers that no one really spends much time wondering about the nature of the choices of data biases results. In essence, no study lacks some sort of bias by virtue of what is omitted in the selection of the data to use.

I don’t really see the difficulty here to be honest. Is it really so expensive to run another trial with an updated primary endpoint? Or is it SO time consuming that so many patients would suffer in the interim?

The alternative of accepting results like this could see pharmaceutical companies scouring trials (past present and future?!) for every false positive (primary endpoint or not!) that they think they can slide by regulators. Data absolutely cannot speak for itself. This is literally the first lesson of statistics but it’s amazing how many really smart people are able to convince themselves otherwise!

If you recast this as drawing samples from bin that was about 4.8% "bad" (all we know from this study) the only thing this study tells us is that carvedilol is no worse than a placebo.

But "no worse than a placebo" is not the same as "better than a placebo" (discounting questions of cost of proposed alternative to placebo.)

And IF a placebo ("doing nothing") is the standard of care, then one is free to use alternative to placebo, since there is no evidence from this study that it worse than the standard of care.

If you have a theory, power the study to find a theoretical difference between the two arms with 99.7% CIs.

Does medicine want to be "scientific" or "sort of scientific"?

I think that there were many papers about this case in Statistics in Medicine. This case is all about the illogic of multiplicity adjustments when one wants to ask specific questions about specific endpoints. Other than in frequentist statistics, rules of evidence are much clearer than this discussion would have you believe. For example, if the police decide that the first suspect in a crime is not the best suspect and they arrest a second person, there is no logic in downweighting the fingerprint and motives of the 2nd suspect just because there was an earlier suspect. Evidence of guild for the 2nd suspect must come from evidence about THAT suspect and the detective’s prior beliefs. The logical Bayesian approach would be to formulate prior distributions of effects of the drug, separately for each endpoint, before data are available. Then apply those priors to the data and don’t look back. Even better: create a hierarchical ordinal endpoint for judging which group of patients fared better overall. Death would be at the top of the scale, and the ordinal analysis would penalize or reward for death even though the sample size may have been inadequate for judging mortality on its own.

I certainly don't have all the answers either, but it seems to me that this part of Dr. Moye's answer can be rightly disregarded as statistical flourish: "He explained that an experiment has a glassful of false positive chances. If you use part of it up in the primary endpoint, there is only so much left to use in subsequent endpoints." The rate of false positives is approximately (1-p) for each outcome independently. Having said that, the more end points you evaluate the higher will be the likelihood that one or more will show a false positive; however, this does not change the validity of any particular end point. If I evaluate 20 end points using a p-value of 0.05, and one shows statistical significance, in one once sense that is completely expected and consistent with a random occurrence. On the other hand, for that one end point, surely it warrants further study.

I agree with your answer to the question:

"Namely, what is worse, letting an uncertain potentially harmful therapy out in society, or holding back a potentially beneficial one?"

Stats 101 is built on the notion of rejecting the NULL hypothesis, i.e., we assume at the start of any research project that there are NO differences or benefits. It's the scary high wall of PROOF that science demands we get over. Weak evidence of "potential" or "suggestive" benefits should be treated as NO proof. With the replicability problem now in focus, as well as the tendency not to publish research with null findings, we need to be all the more strict about the fundamentals of science and research. Otherwise we will do HARM and/or waste precious resources that could be better directed elsewhere.

Thanks for your excellent cautionary writing, John. Keep those standards and guardrails strong!

Dr. Mandrola: Thanks for the great examples and wisely uncertain conclusions...

Here is my view of your title question, the examples, and certain earlier comments. Other comments stated some of these points below succinctly in their own frameworks. My take will not be succinct, and I hope that anyone who starts reading it will read it carefully to the end before replying.

First and foremost: Data don't speak - if you hear the data speaking, seek psychiatric care. As was recognized by the 1940s in the then-emergent field of information theory, data only convey information via assumptions about how they were generated. In statistics, those assumptions are encoded (really, buried very abstractly) in the methods used to generate statistical estimates and P-values. They include the assumption that all the trials in play are perfectly conducted and analyzed, which may not be in question in the present examples. Nonetheless, more controversial yet equally abstract assumptions and arbitrary conventions are used to derive qualitative decisions or "conclusions" from those statistics. These include assumptions about the costs of possibly mistaken actions in light of the estimates and P-values, such as declaring "significance" or "nonsignificance" based on a P-value or interval estimate, or reporting apparent conflict between earlier and later studies as "replication failure".

To illustrate: The first thing I noticed about the VEST and PRAISE pairs is that they look cherry-picked to raise alarm about overdrawing decisions from single trials where no previous trials existed to check against. Following Fisher, the P-values for the secondary endpoints in the first studies (VEST-1 and PRAISE-1) only indicated that a "confirmatory study" or "replication attempt" was worth pursuing. Hopefully, the decision to pursue was not just because of their small P-values; presumably, the ranges of their interval estimates looked important enough to warrant the expense of further study, and included effect sizes that were plausible in light of what was already known.

Of course, comparing results from different studies needs to be direct and analytical, not based on pointing out that one was "significant" and the other "not significant." There was certainly more than that in the present examples: by any standard I know of, the second trials were in outright conflict with the first. The differences of VEST-2 from VEST-1 and of PRAISE-2 from PRAISE-1 were huge in practical terms, and both differences had p < 0.001. The changes in the estimates from the first to the second trials look so large, statistically and practically, as to suggest that something other than (or in addition to) chance caused the difference.

Using the same methodology that raised suspicion against the secondary results in VEST-1 and PRAISE-1, the list of possible causes includes selection bias, in the form of selecting these trials precisely because the replication attempts were in clear conflict. This meta-selection bias is no different in its capacity to mislead than selecting effect estimates out of one trial based on their looking important and "statistically significant". The conflict could also reflect real differences in effects between the first and second trials, perhaps due to differences in patient selection, in the standard of care given to the untreated (controls), and in follow-up. One would have to scrutinize the reports carefully to judge such post hoc hypotheses, but to ignore these possibilities would be scientifically irresponsible.

So then, what's the difference between these examples of secondary endpoints and examples of primary endpoints? One would be the difference in pre-trial baseline (or prior) expectations: The primary endpoint presumably had the benefit of enough background information (e.g., earlier-phase trials) to motivate conducting the first trial with that endpoint named from the start as primary and used in the trial planning (e.g., in study-size calculations). This kind of use is seen when the active-treatment arm is twice the size of the placebo arm: This imbalance suggests that a halving of an adverse primary outcome was expected. In contrast, describing results for secondary endpoints as "unexpected" suggests there was little or no such background information. And labeling secondary results as likely "false positives" suggests an even stronger baseline expectation, namely that the treatment would have no important effect on the secondary outcomes.

Conventional frequentist statistics can account for these baseline expectations using controversial, counterintuitive, and often torturous multiple-comparisons adjustments, e.g., as reviewed here:

https://onlinelibrary.wiley.com/doi/full/10.1111/ppe.12711
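
To make the conventional adjustments just cited concrete, here is a minimal sketch of one standard procedure, Holm's step-down Bonferroni correction; the four p-values (one primary, three secondary endpoints) are invented for illustration:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k), cap at 1,
        # and enforce monotonicity of the adjusted values.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# One primary and three secondary endpoints (invented p-values):
print([round(p, 3) for p in holm_adjust([0.004, 0.03, 0.20, 0.01])])
# [0.016, 0.06, 0.2, 0.03] - two of the four remain below 0.05
```

Notice how each raw p-value is penalized for the number of comparisons still "in play", which is exactly the kind of counterintuitive bookkeeping the comment describes.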

As I think Frank Harrell could verify, Bayesian methodology can offer a more straightforward alternative by noting that the baseline expectations for the primary endpoint were concentrated toward beneficial effects, whereas the baseline expectations for the secondary endpoints were concentrated toward no effect. By translating these expectations into prior probability distributions and using them in a Bayesian analysis, we can get posterior distributions for all effects, primary or secondary.
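
A minimal sketch of that Bayesian alternative, assuming normal approximations on a log-hazard-ratio scale; all numbers here are hypothetical, chosen only to show how the two kinds of prior treat the same observed result:

```python
import math

def normal_posterior(prior_mean, prior_sd, est, se):
    """Conjugate normal-normal update; returns (posterior mean, posterior sd)."""
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    mean = (w_prior * prior_mean + w_data * est) / (w_prior + w_data)
    return mean, math.sqrt(1 / (w_prior + w_data))

# Hypothetical observed log hazard ratio, the same for both endpoints.
est, se = -0.50, 0.20

# Primary endpoint: baseline expectations concentrated toward benefit.
primary = normal_posterior(-0.30, 0.30, est, se)
# Secondary endpoint: skeptical prior concentrated toward no effect.
secondary = normal_posterior(0.00, 0.15, est, se)

print(round(primary[0], 2), round(secondary[0], 2))  # -0.44 -0.18
# The skeptical prior pulls the same observed -0.50 most of the way back
# toward the null, mirroring the "likely false positive" intuition.
```

The same data yield quite different posteriors depending on the baseline expectations, which is the whole point: the priors make those expectations explicit instead of burying them in multiplicity rules.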

Unfortunately, in statistics as in life there is no free lunch: With Bayesian methods, the problem of multiple comparisons is replaced by the challenge of creating a credible multivariate prior for the effects. Reasonable possibilities for a credible prior are limitless, as are disagreements about what "credible" means. One attempt to address that problem is to build the prior from what has been reported in other trials. That is what one does in an intuitive fashion when one dismisses an initial result by saying that initial results usually don't replicate. A much more empirically grounded (and hopefully much less prejudicial) way to operationalize this intuition has been developed by van Zwet and colleagues, as reviewed here:

https://evidence.nejm.org/doi/full/10.1056/EVIDoa2300003
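
One hedged way to operationalize "build the prior from other trials" is a simple empirical-Bayes moment estimate. This is only an illustration of the general strategy, not van Zwet and colleagues' actual procedure, and all numbers are invented:

```python
import statistics

# Hypothetical log-hazard-ratio estimates and standard errors from earlier trials.
past_estimates = [-0.10, 0.05, -0.30, 0.15, -0.05, -0.20]
past_ses = [0.10, 0.12, 0.11, 0.13, 0.10, 0.12]

# Method-of-moments prior: centered at the mean of past effects, with variance
# equal to their observed spread minus the average sampling noise (floored).
prior_mean = statistics.mean(past_estimates)
prior_var = max(statistics.variance(past_estimates)
                - statistics.mean(se**2 for se in past_ses), 1e-6)

# Shrink a striking new result toward that prior (normal-normal update).
est, se = -0.60, 0.25
w_prior, w_data = 1 / prior_var, 1 / se**2
shrunk = (w_prior * prior_mean + w_data * est) / (w_prior + w_data)
print(round(shrunk, 2))  # -0.17: the initial -0.60 is heavily shrunk
```

This is the quantitative version of the intuition that "initial results usually don't replicate": the prior built from past trials pulls an outlying initial estimate sharply toward the historical average.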

In pointing to this or any other alternative to familiar analyses, it is important to compare its results to those from conventional approaches. This advice leads to method-sensitivity analysis, also known as multiverse analysis, which applies and compares multiple methodologies:

https://en.wikipedia.org/wiki/Multiverse_analysis

Cautions include that all analyses must be reported (at least as a supplement) and summarized, and that possible reasons for any apparent conflict should be investigated and discussed. With that done, such comparisons should help us understand why the data never "speak for themselves"; rather, it is we who speak of the data, filtered through whatever methodologies and prejudices - including methodological misconceptions - we used to analyze them.

Assuming the correlation isn't spurious, the mechanism of action remains an open question.

It's beginning to look like neurohormonal suppression of the inflammatory response may be the answer. Given the associated risk of hypotension, the best we might hope for with carvedilol is its wider tolerance.

For post-STEMI patients without hypertension, further research could suggest an alternate anti-inflammatory regimen.

As a physician with PAF, I've been on metoprolol before. I didn't continue it, because it made me feel like crap. I can't comment on how patients in failure feel on metoprolol compared to how they feel untreated nor can I comment on the newer med. I do know that quality of life is as important, if not more important, than quantity of life. I think far too many physicians ignore this side of the issue.

Very useful and excellent article. I think the take-home lessons are all in the table that compares VEST 1 & 2 and PRAISE 1 & 2. I think the lessons are obvious and commonsensical - no PhD required - and have little to do with specific end point selection.

1) Relatively small studies can give wildly inaccurate results.

2) Studies of relatively short duration can give similarly inaccurate results.

3) The more end points included, the more likely one will find some that are "statistically significant".

VEST 2 had a sample size more than 5 times that of VEST 1 and completely erased the "highly significant" mortality difference of VEST 1; in fact, by conventional criteria (p=0.02), placebo outperformed the drug. No follow-up duration was given for VEST 1 & 2, but I would guess that VEST 2's was longer.

PRAISE 2 had a sample size about 3 times that of PRAISE 1, and its follow-up duration was about 2.5 times longer. It also showed a reversal in favor of placebo, but the difference was not statistically significant. An ancillary lesson here is not to give too much credence to measures of statistical significance.

The third point about multiple end points is intuitively obvious.

I appreciate that you focus on “first do no harm.” I’m seeing less and less of that in (omg I sound like an old fart) young docs. I sometimes wonder, given the way they are (and are not) taught, whether some of the harms done are due to ignorance or mixed messaging.

John, what a great question. I believe there’s no clear answer, but I’m inclined to let data speak for itself - so long as that data is accurate. Finding important data that was not considered as a primary or secondary endpoint means you have to verify its accuracy, though we should be doing that anyway (insert discussion of the importance of reproducibility here).

There’s a flip side, too. Am I the only one who finds that most trials do NOT let data speak for themselves? Instead, we obfuscate with p-values, ARR, and confidence intervals. We lean on statistical significance, ignoring that clinical significance is what matters (but is subjective). When my patients ask me why they have to be on a statin/Plavix/empagliflozin/Repatha, or why they should have a screening mammogram or colonoscopy, I share the raw data. I tell them to ignore the funny-sounding stats terms and just look at raw numbers. This group had this happen; that group had that happen.

Notably, patients typically find these changes to be clinically insignificant and question why their doctors think these things are so important.

You ask if data can speak for itself. I believe that's a very good question. And I believe that, with AI and in the scientific and analytic communities, we are so enamored of numbers that no one really spends much time wondering how the choice of data biases results. In essence, no study lacks some sort of bias, by virtue of what is omitted in the selection of the data to use.

I don’t really see the difficulty here to be honest. Is it really so expensive to run another trial with an updated primary endpoint? Or is it SO time consuming that so many patients would suffer in the interim?

The alternative of accepting results like this could see pharmaceutical companies scouring trials (past, present, and future?!) for every false positive (primary endpoint or not!) that they think they can slide by regulators. Data absolutely cannot speak for itself. This is literally the first lesson of statistics, but it’s amazing how many really smart people are able to convince themselves otherwise!

Let's get philosophical, Dr. Mandrola.

Here is a study:

standard of care plus placebo "b" (blue and smaller pill)

vs.

standard of care plus placebo "R" (red and bigger pill)

Placebo "b" 7.8% deaths (out of 400)

Placebo "R" 3.1% deaths (out of 700)

Is it a "no brainer" to start using Placebo "R"? Why? I suppose that depends on what we mean by "no brainer".

Placebo "b" costs $0.01, Placebo "R" costs $20.00. Now what?

placebo: 7.8% deaths (out of 400); 3-sigma band around the pooled 4.8% rate: 1.6% to 8.0%

carvedilol: 3.1% deaths (out of 700); 3-sigma band around the pooled 4.8% rate: 2.4% to 7.2%

If you recast this as drawing samples from a bin that was about 4.8% "bad" (all we know from this study), the only thing this study tells us is that carvedilol is no worse than a placebo.

But "no worse than a placebo" is not the same as "better than a placebo" (discounting questions of cost of proposed alternative to placebo.)

And IF a placebo ("doing nothing") is the standard of care, then one is free to use the alternative to placebo, since there is no evidence from this study that it is worse than the standard of care.

If you have a theory, power the study to find a theoretical difference between the two arms with 99.7% CIs.
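
The "drawing samples from a 4.8% bin" recast, with its 99.7% (3-sigma) limits, can be checked in a few lines using the percentages quoted above:

```python
import math

# Percentages quoted in the comment above.
rate_placebo, n_placebo = 0.078, 400   # placebo: 7.8% deaths
rate_drug, n_drug = 0.031, 700         # carvedilol: 3.1% deaths

# Pool both arms into one "bin" of bad outcomes (~4.8%).
pooled = (rate_placebo * n_placebo + rate_drug * n_drug) / (n_placebo + n_drug)

def band(n, p=pooled, z=3.0):
    """99.7% (3-sigma) band for an observed proportion of size n drawn from the bin."""
    sd = math.sqrt(p * (1 - p) / n)
    return p - z * sd, p + z * sd

print(round(pooled, 3))                              # 0.048
print(tuple(round(x, 3) for x in band(n_placebo)))   # (0.016, 0.08)
print(tuple(round(x, 3) for x in band(n_drug)))      # (0.024, 0.072)
# Both observed rates fall inside their own 3-sigma bands, which is why
# the comment concludes only "no worse than a placebo", not "better".
```

The placebo arm's 7.8% sits just inside its upper limit of 8.0%, and carvedilol's 3.1% sits inside its lower limit of 2.4%, so neither arm is surprising if both were drawn from the same 4.8% bin.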

Does medicine want to be "scientific" or "sort of scientific"?

I think there were many papers about this case in Statistics in Medicine. This case is all about the illogic of multiplicity adjustments when one wants to ask specific questions about specific endpoints. Outside of frequentist statistics, rules of evidence are much clearer than this discussion would have you believe. For example, if the police decide that the first suspect in a crime is not the best suspect and they arrest a second person, there is no logic in downweighting the fingerprints and motives of the 2nd suspect just because there was an earlier suspect. Evidence of guilt for the 2nd suspect must come from evidence about THAT suspect and the detective’s prior beliefs. The logical Bayesian approach would be to formulate prior distributions for the effects of the drug, separately for each endpoint, before data are available. Then apply those priors to the data and don’t look back. Even better: create a hierarchical ordinal endpoint for judging which group of patients fared better overall. Death would be at the top of the scale, and the ordinal analysis would penalize or reward for death even though the sample size may have been inadequate for judging mortality on its own.
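
A minimal sketch of the hierarchical ordinal endpoint suggested above, with death at the top of the scale and the arms compared by a win probability (the chance that a random treated patient fared better than a random control); the patient counts are invented:

```python
# Ordinal scale: 0 = no event < 1 = HF hospitalization < 2 = worsening HF < 3 = death
control = [0] * 50 + [1] * 20 + [2] * 15 + [3] * 15
treated = [0] * 60 + [1] * 20 + [2] * 12 + [3] * 8

def win_probability(trt, ctl):
    """P(random treated patient fared better) plus half the tie probability."""
    wins = ties = 0
    for t in trt:
        for c in ctl:
            if t < c:        # lower score = better outcome
                wins += 1
            elif t == c:
                ties += 1
    return (wins + 0.5 * ties) / (len(trt) * len(ctl))

print(round(win_probability(treated, control), 3))  # 0.563 (> 0.5 favors treatment)
```

Because death occupies the top of the scale, every death counts against an arm in every pairwise comparison, so mortality contributes evidence even when the trial is too small to judge mortality on its own.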

I certainly don't have all the answers either, but it seems to me that this part of Dr. Moye's answer can be rightly disregarded as statistical flourish: "He explained that an experiment has a glassful of false positive chances. If you use part of it up in the primary endpoint, there is only so much left to use in subsequent endpoints." The chance of a false positive is set by the alpha level for each outcome independently. Having said that, the more end points you evaluate, the higher the likelihood that one or more will show a false positive; however, this does not change the validity of any particular end point. If I evaluate 20 end points using a p-value threshold of 0.05 and one shows statistical significance, in one sense that is completely expected and consistent with a random occurrence. On the other hand, for that one end point, surely it warrants further study.
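
The arithmetic behind that last point is easy to make explicit:

```python
# With 20 independent endpoints each tested at alpha = 0.05, one
# "significant" result is almost the expected outcome under pure chance.
alpha, m = 0.05, 20
p_any_false_positive = 1 - (1 - alpha) ** m   # chance of at least one
expected_false_positives = alpha * m          # expected count
print(round(p_any_false_positive, 2), expected_false_positives)  # 0.64 1.0
```

So about a 64% chance of at least one spurious "significant" endpoint, and exactly one expected on average - which is why a single hit among 20 is unsurprising, yet still a legitimate candidate for further study on its own merits.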