46 Comments

Dr. Mandrola writes in this post:

"The Results

At 60 days, 17.3% of those in the active arm died vs 21.3% in the placebo arm. That is an absolute risk reduction of 4%. The relative risk reduction in death equaled 33%."

I believe there is an error here. Based on the facts as stated, the relative risk reduction in death equaled 19%, not 33%.
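
For the record, here is that arithmetic as a minimal sketch in Python; the only inputs are the 17.3% and 21.3% figures quoted above:

```python
# Figures quoted from the post.
p_active = 0.173   # 60-day mortality, active arm
p_placebo = 0.213  # 60-day mortality, placebo arm

arr = p_placebo - p_active   # absolute risk reduction
rrr = arr / p_placebo        # relative risk reduction

print(f"ARR = {arr:.1%}")    # ARR = 4.0%
print(f"RRR = {rrr:.1%}")    # RRR = 18.8%, i.e. about 19%, not 33%
```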


I think it is more important to know the makeup of the two groups. Oftentimes, these participants are cherry-picked or negative results are weeded out over time. Also, since this study might have shown good results, there should be three or four more done.

We have to face the fact that drug companies are more interested in selling drugs, especially high-priced newer ones, than they are in curing any diseases or helping people live without drugs. That means most of these studies are going to be massaged and manipulated to extract the best outcomes with the happiest smiley faces on them.

It still does not matter because we no longer have any agencies that are honest in their assessments of new drugs. The FDA ain't it no more....not by a long, long shot. They have given up any mission that includes protecting the public from harmful drugs...witness the recent mRNA substance fiasco.


Thanks for boiling this down so well. I wonder whether research ethics boards have biostatisticians as members. Can they truly evaluate scientific validity without them?


It seems fair to say this study failed to show that the antibiotic combo was at least 14% better than placebo. The authors truncated their conclusion statement, but this seems to be what they are saying. I’m not sure what the problem is with that.

What this study does not answer is whether the combo may have been 13% better. Or 12 % better. Or some other smaller but still clinically relevant percentage better (however that might be defined or determined).

This is a common issue these days, where estimates of control arm event rates are excessively high, and estimates of effect size are way too large, resulting in underpowered studies. Does that mean potentially useful (but less useful than hoped) therapies are being abandoned? Quite possibly. But I’m not sure what the alternative is. As you say, much larger studies may be dubious ethically and prohibitive financially. Perhaps IRBs need to be more active in the design of the studies they review and approve, to somehow increase the likelihood that a real and relevant active-arm treatment effect will be found by the study as designed and proposed.


"It’s like determining whether a coin is fair with only 10 flips."

No need to figure out if it is a fair coin.

Von Neumann gave a simple solution: flip the coin twice. If it comes up heads followed by tails, then call the outcome HEAD. If it comes up tails followed by heads, then call the outcome TAIL. Otherwise (i.e., two heads or two tails occurred), repeat the process.
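
A minimal sketch of that procedure in Python; the 70% heads bias below is an arbitrary made-up example, just to show the output is unbiased regardless of the coin:

```python
import random

def biased_flip(p_heads=0.7):
    # A made-up biased coin: heads with probability p_heads.
    return "H" if random.random() < p_heads else "T"

def von_neumann_flip(flip=biased_flip):
    # Flip twice: HT -> HEAD, TH -> TAIL; on HH or TT, discard and repeat.
    while True:
        first, second = flip(), flip()
        if (first, second) == ("H", "T"):
            return "HEAD"
        if (first, second) == ("T", "H"):
            return "TAIL"

results = [von_neumann_flip() for _ in range(10_000)]
print(results.count("HEAD") / len(results))  # close to 0.5 despite the 70/30 coin
```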


Maybe they should be flipping bit coins.


Dr. John, how many flips do you need to determine if a coin is fair?

Should you presume a coin to be fair unless shown not to be fair? Or the other way around?

How do you operationally define what you mean by fair?


This sort of frequentist analysis is fundamentally flawed from the get-go. Any study can only determine P(D|H), the probability of getting data D given hypothesis H. What you really want to know is P(H|D), the probability of the hypothesis given the data. But you can get from P(D|H) to P(H|D) only if you also have a prior P(H) on your set of hypotheses. Then you can use Bayes' Theorem, P(H|D) = P(D|H)P(H)/P(D) (where P(D) is a normalization constant that does not need to be known separately), to get P(H|D). Any mental translation of a confidence interval on DATA to a confidence interval on the HYPOTHESIS is completely illegal in frequentist statistics, even though every human being without exception mentally does this. If you do it (and you do), you have inadvertently introduced some prior, and you don't know what it is. Far better to use Bayesian analysis with specified priors. But alas, medicine has been completely conquered by the frequentists.
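
As a toy illustration of the mechanics (nothing here comes from the trial; the two hypotheses, the prior, and the made-up data are purely for demonstration):

```python
from math import comb

# Two made-up rival hypotheses about the true 60-day mortality on treatment,
# with an explicit prior P(H) over them.
hypotheses = {"treatment does nothing": 0.21, "treatment works": 0.14}
prior      = {"treatment does nothing": 0.5,  "treatment works": 0.5}

# Made-up data D: 26 deaths among 150 treated patients.
deaths, n = 26, 150

def likelihood(rate):
    # P(D | H): binomial probability of the observed deaths given a true rate.
    return comb(n, deaths) * rate**deaths * (1 - rate)**(n - deaths)

unnormalized = {h: likelihood(rate) * prior[h] for h, rate in hypotheses.items()}
p_data = sum(unnormalized.values())                                     # P(D), the normalizer
posterior = {h: round(v / p_data, 3) for h, v in unnormalized.items()}  # P(H | D)
print(posterior)
```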


I don’t see this as implicating RCTs, as your final paragraph implies. The case you are making is for RCT designers to err on the side of caution in case their pre-study assumptions are wrong.


Thank you for looking at this. I had seen it on Twitter and had forwarded it to Sensible Medicine.

There were people on Twitter who imagined that it just needed more power.

No. It just doesn't work.

"Statistical" thinking and understanding variation does not come naturally to most human beings. Physicians are not immune from this natural deficit.

I also think there is a linguistic problem.

"At 60 days, 17.3% of those in the active arm died vs 21.3% in the placebo arm. That is an absolute risk reduction of 4%. The relative risk reduction in death equaled 33%."

These last two sentences should never be written. You should NEVER write about ARR or RRR or a hazard ratio until AFTER you decide (with risk of error) whether there is anything meaningful to consider.

To write about absolute or relative risk in advance of deciding that there is something meaningful to consider creates an implicit bias.

ARR, RRR and HR must be operationalized to have meaning if and only if there is a meaningful reason (with risk of error) to consider that an assignable cause is at work.

The default to 95% confidence intervals is also really leading us astray. We should be using 3 standard deviations: 99.7%

I hate to be harsh, but medicine hides its lack of understanding of really fine-tuned biological causal mechanisms in the space between 2 and 3 standard deviations. Failure to know that one doesn't know, plus hopefulness, plus hubris, plus lack of real statistical thinking, leads to clinical, social and cultural iatrogenesis. Parrhesia.

Here is a thought: do or simulate Deming's Red Bead experiment. Then do it again with UCL and LCL calculated with +/- 2 stds.
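
A rough sketch of such a simulation, assuming a stable process with 20% red beads and scoops of 50, comparing how often 2-sigma and 3-sigma control limits falsely signal an assignable cause:

```python
import random

P_RED, SAMPLE, WORKERS, ROUNDS = 0.20, 50, 6, 10_000

def one_round():
    # Six 'workers' each scoop 50 beads from the same stable 20%-red process.
    return [sum(random.random() < P_RED for _ in range(SAMPLE)) for _ in range(WORKERS)]

def false_signal_rate(k_sigma):
    # Binomial control limits: mean +/- k * sigma, sigma = sqrt(n * p * (1 - p)).
    mean = SAMPLE * P_RED
    sigma = (SAMPLE * P_RED * (1 - P_RED)) ** 0.5
    lcl, ucl = mean - k_sigma * sigma, mean + k_sigma * sigma
    flagged = sum(
        not (lcl <= count <= ucl)
        for _ in range(ROUNDS)
        for count in one_round()
    )
    return flagged / (ROUNDS * WORKERS)

# Every signal here is a false alarm, since only common-cause variation is present.
print("flagged at +/- 2 sigma:", false_signal_rate(2))  # roughly 5% of scoops
print("flagged at +/- 3 sigma:", false_signal_rate(3))  # well under 1%
```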

May 15, 2023 · Liked by John Mandrola

I respectfully disagree with these recommendations. For one thing, "deciding that there is something meaningful" is a very difficult task and implies that point estimates have a lot of meaning.


Point estimates have NO independent meaning outside of sampling theory, the understanding of systemic variation, and underlying substantive knowledge of the process producing the data.

So I think we are actually on the same page regarding point estimates. Without CIs they have no meaning. Which is why claiming "there is an absolute risk reduction of 4%" or that "there is a relative risk reduction in death of 33%" makes no sense whatsoever.

With what exact recommendations are you disagreeing?

How do we reduce (but not eliminate, because that is not possible) both type I and type II errors? Using 95% CIs does not do this. Shewhart showed this empirically; it is better to use +/- 3 stds. But even this is problematic, because nature gives us long fat tails more often than we realize.

See also Deming, On Probability as a Basis for Judgment (1975).


Let me just comment on one remaining piece of it. The use of 3 SDs can correspond to a criterion so stringent that good opportunities will be missed. But I'd rather jump to a Bayesian approach where you might require a high posterior probability of any benefit and a moderately high (>= 0.8 for example) probability of non-trivial benefit.
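
One way such a rule might be operationalized, as a sketch only; the posterior draws for the hazard ratio and the 0.95 / 0.925 cutoffs below are invented for illustration:

```python
import random
from math import exp, log

# Invented posterior draws for the hazard ratio: lognormal around HR = 0.85.
draws = [exp(random.gauss(log(0.85), 0.20)) for _ in range(100_000)]

p_any_benefit        = sum(hr < 1.0 for hr in draws) / len(draws)
p_nontrivial_benefit = sum(hr < 0.925 for hr in draws) / len(draws)  # example cutoff

# Example decision rule: high probability of any benefit AND
# moderately high probability of non-trivial benefit.
go = p_any_benefit >= 0.95 and p_nontrivial_benefit >= 0.80
print(round(p_any_benefit, 2), round(p_nontrivial_benefit, 2), go)
```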


I don't think good opportunities would actually be missed.

And if one practices "conservative medicine" (or minimalism as I would put it), as Dr. John advocates, the theoretical missed opportunities should not be bothersome. A more stringent standard would make medicine more scientific.

I am sympathetic to the Bayesian approach you are proposing, but it is problematic in that it can't be operationalized (or is difficult to operationalize in advance), and so we risk reversion to eyeballing and "trust me, I have a white coat and a stethoscope."


Conservative from whose standpoint? Conservative in terms of not approving useful therapies can be harmful. And there is not much of a problem in operationalizing the Bayesian approach in advance. See https://hbiostat.org/bayes/bet


Trying my first comment on substack. Yikes.

Though the paper describes the power calculation in terms of absolute risk reduction, I think it's more informative to note that they were powering the study for a 50% relative risk reduction. 50% RRR is fairly standard in trials, no?

The fixation on that magic 2-fold risk reduction is a scientific cultural phenomenon, and it means we are stuck looking for silver bullets.


A 50% RRR is not standard. It is wildly optimistic. It's more typical to choose a RRR of 20%-25% but note that we are not talking about relative risk but instead are referring to relative hazard, which is a relative instantaneous risk measure (as opposed to cumulative risk). If you choose a relative effect of 50% that is not to be missed you are saying that a 25% reduction is not clinically relevant, which is seldom the case.
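
To make the hazard-versus-cumulative-risk distinction concrete, a small sketch assuming proportional hazards and an illustrative 28% control-arm cumulative mortality (not the trial's figure):

```python
def implied_rrr(control_risk, hazard_ratio):
    # Under proportional hazards, treated survival = control survival ** HR,
    # so treated cumulative risk = 1 - (1 - control_risk) ** HR.
    treated_risk = 1 - (1 - control_risk) ** hazard_ratio
    return (control_risk - treated_risk) / control_risk

for hr in (0.75, 0.50):
    print(f"HR {hr}: implied reduction in cumulative mortality ~ {implied_rrr(0.28, hr):.0%}")
# HR 0.75 -> roughly a 22% reduction in cumulative risk; HR 0.50 -> roughly 46%
```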


I understand your point here, but this is also where statistics drives people a bit batty.

1) They study both endpoints of survival and absolute mortality.

2) The power calculation is based on a 25% improvement of survival (using a log-rank test as the basis of the power calculation)

3) When translated into absolute mortality, this corresponded to an expected 50% RRR.

Now, if you think 50% RRR in absolute mortality is wildly optimistic, then you as the consulting statistician have to say to the study authors: "Though we decided on an RRR of 25% in terms of survival, when I reinterpret this calculation in terms of cumulative mortality, this would correspond to a 50% RRR. Is that realistic?"


Important points to discuss pre-study. "25% improvement of survival (using a log-rank test)" involves two clashing concepts, but you are right: we need to be clear. I would say that in general a hazard ratio of 0.75 (a 25% reduction in relative hazard) is pretty large. A 50% reduction in RR is large too. Which one to emphasize is not clear.


Because of our conversation, I re-read the "Sample Size Calculation" section and worry we're both a bit wrong. Here's my current read:

1. A previous trial found “patients with a MELD score [Model for Endstage Liver Disease] greater than or equal to 21 had lower survival by 18.8% compared with patients with a MELD score of less than 21.” (direct quote from the paper)

2. The investigators “hypothesized” that antibiotics would reduce *overall mortality* by 75% of the effect size of this previous trial. That is, they expected 75% of 18.8%, which is 14%. Given that the rate of death in the control group was 28%, this meant they were hypothesizing a 50% RRR for antibiotics (27%/14% = 1.9), but they don't explicitly discuss RRR.

3. They then generated their sample size calculation using a log-rank test powered to detect this hypothesized difference in mortality.

I know this is not what you would recommend they do! But I'm just quoting the paper directly here. Very bizarre. Did they not consider the implied RRR? Did they only have a budget for about 300 patients and come up with some hand-wavy reasoning in their power calculation to justify the small trial size? It is impossible to know, unfortunately. But I agree with you that this conversation should have been had before the study.
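
Restating the arithmetic from point 2 as a sketch, using only the figures quoted from the paper above:

```python
# Figures quoted from the paper, per point 2 above.
prior_trial_difference = 0.188  # 18.8% survival difference by MELD score
assumed_fraction       = 0.75   # authors hypothesized 75% of that effect
control_mortality      = 0.28   # expected death rate in the control group

expected_arr = assumed_fraction * prior_trial_difference  # about 14 percentage points
implied_rrr  = expected_arr / control_mortality           # about 50%
print(f"expected ARR ~ {expected_arr:.1%}, implied RRR ~ {implied_rrr:.0%}")
```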


Oh my. NEVER use an observed effect size in another trial in a power calculation. A recipe for disaster. Power calculations should always use the clinical effect you would not like to miss. This is not a data-oriented specification.

May 15, 2023 · Liked by Adam Cifu, MD

Nice article, John. You say that the journal's interpretation is technically correct. I do not believe that to be the case. IMHO it is grossly inaccurate due to a massive misunderstanding of p-values and hypothesis tests. Sir Ronald Fisher himself said that a large p-value means "get more data". Any journal wanting to say that there is evidence that a treatment doesn't work should be forced to base that on a Bayesian probability that the treatment effect is clinically negligible. For this study you'd find such a probability to be near 0.5, so we haven't a clue about the ineffectiveness of the therapy.


Ugh. Here is a difficult theoretical disagreement of approach.

"Any journal wanting to say that there is evidence that a treatment doesn't work should be forced to base that on a Bayesian probability that the treatment effect is clinically negligible."

I understand what you are saying, but I think our assumption must be that treatments do NOT work UNLESS they are reasonably shown to work.


I could not disagree more with that. If that were the case you would conclude from a study with n=2 that the treatment doesn't work. Also, if you took that approach there would be less incentive to force studies to have some likelihood of being informative before they are approved for launch. We need to get the idea across that futile studies are actually futile.


A study with n = 2 is not a study. It is a conversation in which anecdotes are shared.

A futility study is for showing that "the standard of care" is ineffective, not for showing that a hypothesized better treatment doesn't work.

We have different viewpoints. Your approach, if I understand it correctly, leads to kitchen-sink medicine: let's try this until someone shows it doesn't work. How can that be progress? Maybe you don't mean that - maybe you assume that bioplausibility is a prerequisite? But even that is problematic, because medicine really knows a lot less than it imagines.

🤷‍♂️ It is possible I am mistaken. If things were easy there would be only good studies.


So right! My main philosophy is that we need to get away from methods such as p-values and null hypothesis testing that can only show evidence against an assertion and move towards methods that provide evidence in favor of any possible assertion. One such assertion is that the treatment matters little. Bayes provides direct evidence for that, e.g. Pr(0.925 < hazard ratio < 1/0.925). Anyone wanting to draw a conclusion such as a drug doesn't work needs supporting evidence, not just failure to reject the null which can happen for a variety of reasons.
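
A sketch of how such a probability could be computed, assuming (purely for illustration) a normal posterior on the log hazard ratio; the posterior mean and SD below are invented, not derived from the trial:

```python
from math import erf, log, sqrt

def normal_cdf(x, mu, sd):
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

# Invented posterior for log(HR): mean and SD are illustrative only.
post_mean, post_sd = log(0.82), 0.25

lo, hi = log(0.925), log(1 / 0.925)
p_negligible  = normal_cdf(hi, post_mean, post_sd) - normal_cdf(lo, post_mean, post_sd)
p_any_benefit = normal_cdf(log(1.0), post_mean, post_sd)  # Pr(HR < 1)

print(f"Pr(0.925 < HR < 1/0.925) = {p_negligible:.2f}")
print(f"Pr(HR < 1)               = {p_any_benefit:.2f}")
```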


Maybe they designed the trial to fail, and that way be able to (wrongly) conclude that the cheap and safe treatment doesn't work. I would not be surprised if a very expensive alternative was coming soon.


It has been my experience that most physicians today only read the title and abstract. Few read conclusions and nobody (but us old guys) read the entire paper including references. It’s embarrassing listening to physicians quote papers that do not substantiate their points. Yet, I have no problem pointing out their inadequacies and getting a chuckle.


And I’m always surprised that few apply critical thinking to which pharma company funded the study!


That would require reading the paper ….


So it seems you shouldn't try to estimate the sample size until you know the likely outcome in the placebo arm?

May 15, 2023 · Liked by John Mandrola

Sample size and the need for a fixed budget are at the root of many problems with the final usefulness of studies. A more scientific approach that will lead to better decisions and more honest conclusions would be to do a fully sequential Bayesian study, stopping when there is sufficient evidence for any of the following: efficacy, harm, similarity, futility, or a safety problem unrelated to the efficacy endpoint.
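
A toy sketch of what that kind of sequential monitoring could look like: a beta-binomial model with invented event rates and stopping thresholds, and only benefit-related stopping rules (a real design would also monitor harm, similarity, futility and safety as described above):

```python
import random

def sequential_trial(p_control=0.28, p_treated=0.20, max_n=600, look_every=50,
                     eff_cut=0.95, harm_cut=0.05, draws=4000):
    # Flat Beta(1, 1) priors on each arm's event rate, updated as patients accrue.
    events = {"control": 0, "treated": 0}
    n = {"control": 0, "treated": 0}
    truth = {"control": p_control, "treated": p_treated}
    p_benefit = 0.5
    while sum(n.values()) < max_n:
        for arm in ("control", "treated"):
            for _ in range(look_every // 2):
                n[arm] += 1
                events[arm] += random.random() < truth[arm]
        # Posterior probability that the treated arm has the lower event rate.
        p_benefit = sum(
            random.betavariate(1 + events["treated"], 1 + n["treated"] - events["treated"])
            < random.betavariate(1 + events["control"], 1 + n["control"] - events["control"])
            for _ in range(draws)
        ) / draws
        # Crude stopping rules; a real design would also use predictive
        # probabilities for futility and monitor safety endpoints separately.
        if p_benefit >= eff_cut:
            return "stop for efficacy", sum(n.values()), p_benefit
        if p_benefit <= harm_cut:
            return "stop for evidence against benefit", sum(n.values()), p_benefit
    return "reached maximum sample size", sum(n.values()), p_benefit

print(sequential_trial())
```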


Exactly, as that estimate makes a difference in what will be an adequate sample size. As this article says, "The variability in the outcome variable is the population variance of a given outcome that is estimated by the standard deviation. Investigators can use an estimate obtained from a pilot study or the reported variation in previous studies." It's impossible to know from the information presented here whether they did that, but one would hope so. With small sample sizes, you might miss a "true" effect of an intervention. With really large samples, you might find a statistically significant effect that is clinically meaningless, so then looking at ARRs is helpful. It's kind of subjective as to what is "meaningful," which might differ depending on the outcome measured, especially mortality.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017493/
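
For reference, the kind of sample-size arithmetic the linked article describes looks roughly like this; the event rates plugged in below are arbitrary examples, not taken from the trial:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    # Classic two-proportion formula:
    # n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta  = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Illustrative only: the same assumed control event rate with two different assumed effects.
print(n_per_group(0.28, 0.14))  # large assumed effect -> roughly 130 per group
print(n_per_group(0.28, 0.21))  # more modest effect -> several times larger
```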


Great article! It shows just how manipulative the big pharma have been in duping the public re the vaxxes!


No, actually the article was about small sample size, and the Covid vaccine trials had very large sample sizes, but perhaps not large enough. Pfizer overestimated the number of people in the placebo group who would get Covid. This is not surprising given that the studies took place when society was still taking precautions, and the sample was volunteer (not random) and thus likely biased towards the cautious and risk-avoidant. (Remember, the study was blinded, so everybody likely took precautions.) Because the placebo-group event rate understated real-world risk, the results would likely have been even better if nobody had taken any precautions during the study. (Ethics precluded this.) The clinical trials were limited by their short duration, and the only outcome measured was infection.

The more important data has been collected during subsequent waves of infection, in which the vaccines, especially when boosted, showed good efficacy against the most important outcome, death, especially among those in high-risk groups. The second link below is one such study, and there are many others, but I chose this one because it was not funded by Pharma.

One criticism often leveled at the Covid vaccine clinical trials was that only RRR was reported, and not ARR. (A topic which Dr. Mandrola addressed last week.) I always like to see both, and usually go more by the ARR combined with safety data. But as the article in the first link below explains, RRR has an important function when making public health decisions in a pandemic. Also, overall ARR is not meaningful when outcomes vary greatly by subgroup, as is the case with Covid; one then needs to look at the ARR of each subgroup. So even though I hear you about Big Pharma, there are plenty of other sources of data to demonstrate that the risk-benefit ratio is good for the vaccines, especially in high-risk populations.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9115787/

https://www.sciencedirect.com/science/article/pii/S1201971222005938


Damned lies and statistics then!


Some criticise vaccine studies (such as the one you quote here) because they do not measure the reduction in "all cause mortality" but rather concentrate only on mortality due to the disease. Is this a valid concern?


Vaccine trials aren't the only RCTs that fail to measure all-cause mortality. Whether that makes sense probably depends on a number of things, including the length of the trial (i.e., a few months vs. a few years); see the first link below for more on that. As for the Covid vaccines, subsequent research found that non-Covid deaths were lower in those who were vaccinated (second link). But that was most likely due to selection bias, which can happen when participants are not randomly selected from the general population (i.e., healthy user bias). After correcting for this as much as they could, at least they did not find any excess deaths among the vaccinated.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7024667/

https://www.sciencedirect.com/science/article/pii/S0264410X22015614


So it seems to me that if we determined the standardised mortality in a population and decided on a clinically significant mortality reduction (say 5%), we could determine the size of a random sample of this population that should be given the vax. In this way we could show that the anti-vax narrative is false. I worry that if we fail to do these studies, people will become reluctant, and the only way you will get them to comply is to mandate.


I don't think there is the political will to fund such a study, nor would pharma want to spend the money. I also don't think devout anti-vaxers are convinced by data or logic. I don't think the Supreme Court would ever uphold mandates for all, especially now that the data shows the vaccine does not prevent transmission. At this point, the only studies likely to be funded are retrospective, not RCT's, and/or antibody titers in animals for newer versions, like what's currently done for the flu vaccines when modified.


Thanks for taking the time to answer my question. You have been very helpful.


Thank you for an easy to follow explanation.


You're welcome, and I'm happy to hear it was easily understood. Sometimes I'm not sure! :)


I’m a master's-level addiction therapist who was taught to read EVERY bit of EVERY article by a great internist - we would critically examine journals together for decades.
