Beautiful. (Though we could just do them right *once* the first time, and, heck, add a replication-esque element. I mean, it is *DEATH*, and potentially *EVERYONE*, we are talking about with some of these interventions. All treatments had best be trials, periodically reviewed, forever.)
If the alternative is just "guess forever" or "side in favor of industry," that is no answer at all, for all the known context reasons... Still, another great entry. =) _JC
Thank you, John Mandrola and all the others, for all the thoughtful pieces, so useful for a clinician like me who has to apply this on the wards in real practice.
Apart from the more statistical discussion, I have two thoughts / questions to add:
- How do you square the hypothesis that aspirin could be less safe with the very recent ARTESIA trial, where apixaban was associated with more bleeding (than aspirin)?
- Do you think clopidogrel would be safer than aspirin in stroke, as it seems to be in coronary disease, as the more recent evidence shows?
Bernardo
As always, your candor and humility are much respected. 👍🏽
I would agree with the other commenters who pointed out that negative evidence is really the essence of science. It may be that the reason left atrial cardiomyopathy didn't pan out is that those abnormalities don't lead to clot formation within the atrium, or that there are never clots that embolize from there, i.e., that there is another pathophysiological cause that may have little or nothing to do with blood clotting. I admit that there may be evidence supporting the embolization model that I am not aware of, and I would welcome correction on this point, as I have not followed the literature closely for quite some time. But during my years of practice I always thought of the embolization theory as possible but lacking real proof.
On the question of pragmatism, I would say that we probably don't need to do any more studies comparing different anticoagulants and/or platelet inhibitors. It is useful to know that low event levels are roughly equivalent without determining precise statistical significance numbers that may not have any practical significance. Most useful here would be to keep up long-term follow-up on recurrence of strokes and, should the incidence remain low, consider that none of the conventional therapy is worthwhile. First do no harm is even more important in this age of abundant pharmaceuticals and technological gadgets.
An explanation that invites me to explore further. Thank you.
I’m aware that I’m the irritating uninvited layperson, and I did end my subscription so I can’t keep commenting. I only follow you guys so I can learn for myself. Speaking of which, thank you for teaching me how to interpret confidence intervals. Correct or incorrect, these are my thoughts:
How the heck are seven brain bleeds a non-factor? If offered two comparable products, I don’t know about you, but I’m probably not going to choose the brain bleed chaser. If the study is underpowered, does that mean no findings are significant? I would think 7 vs. 0 is all the more significant with fewer people. And I’m sure I’m not supposed to call them people when referencing human beings participating in a study, but that’s what they are.
That said, good ol’ aspirin feels safer than a new medication.
The other thought I have is that this is exactly how pharmaceutical companies “own” the science. I’m much more inclined to trust an unbiased, honest doctor. I know Pharma can get sued (for medications), but I think they do the math and allow for margins, so they’re going to be biased. I also think doctors are supposed to ignore that when prescribing.
So idk, brain bleed or maimed couch potato calling the number on the screen so I can enrich a law firm(?)
This is where I go back to non-pharmaceutical alternatives and letting nature take its course. But I don’t know, because I’m not in the hot seat yet. I’m going to be quiet and just be an observer now so when the day comes, I can feel confident in my decisions.
The brain bleeds were one of two ‘safety outcomes’ in the study. The other safety outcome was death, i.e. all-cause mortality. Here's the result: there were 12 deaths in the Apixaban group and 8 among patients receiving aspirin.
This was a small but interesting study. But one shouldn’t use it to draw conclusions about benefits and risks.
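To make concrete why counts this small can’t settle anything, here is a rough back-of-the-envelope sketch in Python (not from the trial report) of the uncertainty around a 12 vs. 8 death comparison. It assumes roughly equal-sized arms, so the ratio of counts approximates a risk ratio, and uses the standard large-sample interval for a log ratio.

```python
import math

# Illustrative only: how wide is the uncertainty around 12 vs. 8 deaths?
# Assuming roughly equal arm sizes, the count ratio approximates the risk
# ratio, and 1/a + 1/b approximates the variance of its log.
deaths_apixaban, deaths_aspirin = 12, 8

log_rr = math.log(deaths_apixaban / deaths_aspirin)
se = math.sqrt(1 / deaths_apixaban + 1 / deaths_aspirin)
lo, hi = math.exp(log_rr - 1.96 * se), math.exp(log_rr + 1.96 * se)

print(f"Risk ratio ~ {math.exp(log_rr):.2f}, 95% CI roughly {lo:.2f} to {hi:.2f}")
# ~1.5, with an interval from about 0.6 to 3.7 -- compatible with anything
# from a substantial reduction to a large increase in deaths.
```

An interval that wide is why "50% more deaths" in a small trial is not, by itself, a finding.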
And also, I know every Doctor and scientist would tell me not to draw conclusions from this study but…
A thousand or so with no clear benefit and 50% more death IS actually good enough for me. To the point where I’m going to create a Post-it note for myself just in case I lose my mental faculties: Do not take apixaban no matter what.
One time I was driving a few cars behind a semi carrying logs when a log fell off and tumbled off the side of the highway. I was the only car to immediately change lanes and pass. All I could wonder is, how many logs need to fall off the truck before other drivers will take action to protect themselves(?). Why isn’t the answer ‘one’ for everyone? Are they just going to wait until a log comes through their windshield??? I don’t need more logs or more people.
Ok, I can’t resist.
In the situation where the logical mind has not been ‘retrained to interpret science’, it’s very mysterious how 50% more death in the test group is not the real reason they stopped the study, and also how the response to these findings is that another study with more people is needed. I suppose there are no clear guidelines/limits around how many deaths are permissible before you have to shut down a study. I just think, if I were the person responsible for running the study, that’s the point where I would become very uncomfortable continuing, and I certainly wouldn’t want to go to my grave having carried this out at a grander level.
That’s ok. You don’t have to explain this to me.
John, thank you very much for writing the excellent follow-up piece. There are several aspects worthy of further exploration related to new sequential designs and choice of endpoints. On the latter, if the number of patients really needed to detect real clinical effects is too large, either the study needs to be abandoned or an endpoint with reasonable information content needs to replace the low-information binary endpoint. If a higher-resolution endpoint that would affect practice cannot be found, then again the study might best be abandoned. I want to concentrate, though, for the moment, on the choice of the effect size with the existing low-information endpoint. One of the most common mistakes made in clinical trial design is assuming that the effect size not to miss comes from previous studies. This is simply not the case. The effect not to miss, i.e., the one going into a power calculation (Bayesian power or traditional power), comes SOLELY from clinical expertise and patients’ opinions. Previous studies are not informative about what effect matters to patients. While choosing the effect size, both the relative scale (HR, OR) and the absolute scale (increase in mean number of years alive and well; absolute risk reduction using cumulative incidence at the planned end of follow-up) should be examined, as in https://hbiostat.org/bbr/ancova#fig-ancova-ordiff (this example also accounts for a differential treatment effect, i.e., an interaction between treatment and one of the baseline covariates).
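To illustrate the relative-versus-absolute point in the comment above, here is a minimal Python sketch with made-up numbers (an assumed 5% control-group cumulative incidence at the planned end of follow-up): under proportional hazards, treated-group survival is control-group survival raised to the power of the hazard ratio, so the same HR can imply very different absolute risk reductions.

```python
# Minimal sketch, illustrative numbers only (5% assumed control risk).
def absolute_risk_reduction(control_risk: float, hazard_ratio: float) -> float:
    """Cumulative-incidence difference implied by a hazard ratio,
    assuming proportional hazards."""
    surv_control = 1.0 - control_risk
    surv_treated = surv_control ** hazard_ratio
    return (1.0 - surv_control) - (1.0 - surv_treated)

for hr in (0.8, 0.7, 0.5):
    arr = absolute_risk_reduction(control_risk=0.05, hazard_ratio=hr)
    print(f"HR {hr}: absolute risk reduction ~ {arr:.3%}")
# The same relative effect can be a trivial or an important absolute effect
# depending on the baseline risk -- which is why both scales need review
# when choosing the effect not to miss.
```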
I just added a flowchart of such a study design at https://hbiostat.org/bayes/bet/design
Great discussion John and Frank. It probably won't surprise you both that this is an area of interest to me as an educator of evidence-based health care.
The current/leading paradigm in medicine/healthcare research (and education thereof) is that point estimates from single studies = evidence of effect ("P<0.05") or evidence of no effect ("P>0.05"). "Point estimology" taught by "point estimologists"!
I try to follow folk in this space who want to move the field away from this way of thinking for exactly the reasons this case highlights. But it comes at a price. Academia still rewards the existing (poor) way of reporting/interpreting findings. To go against this means being willing to take a hit to your academic record/progress etc.
In my last Cochrane review I was proud to apply some of this thinking to the interpretation of the results: https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD011497.pub2/full:
"Meta‐analysis of five of these trials observed improvement in reported symptoms following a physical activity intervention (SMD –0.93, 95% CI –1.44 to –0.42). A minimal clinically important difference (MCID) of 0.5 SD for bowel symptoms in people with IBS has been suggested (Spiegel 2009). The observed range of estimates are consistent with both clinically important and non‐important improvements in IBS symptoms following a physical activity intervention. When considered on an original item scale, the observed difference approximates to an absolute mean reduction of 69 points (95% CI 31 to 106) for total symptoms of the IBS‐SSS (based on the mean control group baseline SD of 74 from Fani 2019; Hajizadeh Maleki 2018; Jia 2016). The MCID for the total symptoms score of the IBS‐SSS has been defined as a difference of 50 points or more (Francis 1997). The observed range in estimates are consistent with both clinically important and non‐important improvements in symptoms as measured by the IBS‐SSS following a physical activity intervention.
We considered the certainty of the evidence as very low, providing very little confidence in the observed estimates. In a preplanned sensitivity analysis including only the studies deemed at low risk of selection bias, we found that the effect estimate reduced to a moderate size (SMD –0.62, 95% CI –1.44 to 0.21) still in favour of physical activity but estimates were now consistent with both improvement and worsening of symptoms. Larger studies at low risk of bias are needed to increase confidence in the observed estimates."
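As an aside for readers, the back-conversion used in the quoted passage can be reproduced approximately with a short sketch: multiplying the SMD and its confidence limits by the quoted control-group SD of 74 gives the IBS-SSS point differences, which can then be set against the 50-point MCID. This only illustrates the arithmetic; it is not the review's actual analysis code.

```python
# Figures taken from the quoted passage; rounding differs slightly.
smd_point, smd_low, smd_high = -0.93, -1.44, -0.42
control_sd = 74.0      # mean control-group baseline SD quoted above
mcid_points = 50.0     # IBS-SSS MCID (Francis 1997)

points = [abs(x) * control_sd for x in (smd_point, smd_low, smd_high)]
print(f"Point estimate ~ {points[0]:.0f} points; "
      f"95% CI roughly {points[2]:.0f} to {points[1]:.0f} points")
# The interval runs from below to above the 50-point MCID, i.e. it is
# compatible with both clinically important and unimportant improvements.
```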
I could so easily have opted for the (positive) take of the point estimologist - and the plaudits from the field (physical activity for health/chronic disease) that would have come with this.
Another example of such published only yesterday - https://www.bmj.com/content/384/bmj-2023-075847
As with many other important issues on "evidence-based", this is a slow tanker to turn. Discussions like this are another tug boat in that collective effort.
One minor reaction: Standard deviations should not go into MCID consideration. The SD measures how patients disagree with each other on their response. MCID needs to consider only one patient at a time.
I think FH has written what I thought. If significance requires a much larger N, the study is not worth repeating, unless the morbidity/mortality is so great that a slight difference is important.
I’m only a mediocre physicist, so I may completely misunderstand.
Very educational, and kudos to Dr. Mandrola for his lack of pretension! This just cements my confidence in his assessments. Not everyone is wise enough to be humble.
So good to see this kind of discussion here. During COVID times and the search for treatment with repurposed drugs, “absence of evidence was treated as equal to evidence of absence.” Thanks for bringing truth to the scientific field. Let’s wait for the many specialists to come clean and debate a lot of the mistakes made during COVID, which may have prevented millions from having a chance at treatment. One must be reminded that the mortality rate on ventilators was above anything close to acceptable.
But wait. We are ALWAYS referring to absence of evidence. That should go without saying.
To require evidence of absence would be to require proof of a negative. That’s a logical impossibility. That’s not how trials work…ever.
So a trial starts with a hypothesis and pre-specified statistical parameters upon which one can reject the null and accept the hypothesis. In this case, the statistical analysis showed we could not reject the null and accept the hypothesis; so instead, we must reject the hypothesis and accept the null, i.e., accept that “X” does NOT work (in this case; in how the trial was done in this instance). That is of course different from stating one has PROOF that X doesn’t work. This is how it is, and how it has always been.
Hence, there is always the possibility that a larger trial with more endpoints MIGHT show that X in fact does work (and we can reject the null, that it doesn’t). Or that an identical study using different power assumptions may have yielded different statistical results. The onus is on the proponents, of whatever, to do that work, and show that proof.
You can always say, under different trial and statistical circumstances, “X might have worked,” but that still absolves precisely nobody from having to go out and prove it.
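On the point above about “different power assumptions”: a rough sketch (with illustrative event rates, not this trial’s actual design parameters) of the standard two-proportion sample-size formula shows how steeply the required N grows as the effect not to miss shrinks.

```python
from math import ceil
from statistics import NormalDist

# Illustrative assumptions: 5% control event rate, 5% two-sided alpha.
def n_per_group(p_control: float, p_treated: float, power: float = 0.8,
                alpha: float = 0.05) -> int:
    """Approximate n per group for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_a + z_b) ** 2 * var / (p_control - p_treated) ** 2)

for p_t in (0.03, 0.035, 0.04):
    print(f"control 5% vs treated {p_t:.1%}: n per group ~ {n_per_group(0.05, p_t):,}")
# Halving the assumed absolute difference roughly quadruples the required n,
# so an "identical study with different power assumptions" can be a very
# different, and sometimes infeasible, study.
```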
That said, the interesting thing with this trial to me is still the metrics for determining atrial cardiopathy as a predictor of future overt AF. We already know a DOAC is better than ASA in clinical AF. That’s news to nobody. What I take from this is that the way this study tried to determine atrial cardiopathy has not been shown to be effective (i.e., absence of evidence). I would not, on the basis of this study, yet conclude that the entire concept of atrial cardiopathy is without merit (i.e., evidence of absence).
You are unwittingly adopting a purely traditional frequentist world view. With Bayes you don’t need hypotheses at all and can rely on estimation and generating evidence IN FAVOR of infinitely many assertions (all possible levels of efficacy). Everything changes when you think this way. With Bayes you CAN bring evidence for a “negative”, e.g., Pr(HR between 0.9 and 1/0.9).
I’m a clinician and mere end user insofar as statistics apply to clinical trials. So I am clearly out of my depth here.
But with regards to the study in question, what should be considered the pre-test likelihood or prior probability of the atrial cardiopathy metrics used here as being predictive of future clinical AF (to the extent of producing a differential clinical effect between the use of apixaban vs ASA)? It seems to me that there is a dearth of clinical data to show that atrial cardiopathy (however measured) is definitely a causal predictor, and if that is the case, I’m unsure how one would arrive at a reasonable “prior”.
A prior might encapsulate current pre-study knowledge in a very simple way, such as “we know the drug isn’t a cure, so we’ll put a high probability on the true unknown hazard ratio being between 1/2 and 2.” Then to conclude similarity of effects of the two treatments would require a sharp posterior probability that the HR is between, say, 0.9 and 1/0.9, e.g., a probability >= 0.9. The study being discussed probably has a fairly low probability that the treatments are similar on outcomes.
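A minimal sketch of that calculation, under simplifying assumptions: a normal prior on log(HR) calibrated so that about 95% of prior probability lies between HR = 1/2 and HR = 2, combined with a normal approximation to the likelihood. The observed log HR and its standard error below are hypothetical, made-up numbers, not taken from the trial.

```python
from math import log, sqrt
from statistics import NormalDist

# Prior: Pr(1/2 < HR < 2) ~ 0.95 on the log hazard ratio scale.
prior_mean, prior_sd = 0.0, log(2) / 1.96
# Hypothetical trial result (illustration only).
obs_loghr, obs_se = log(1.1), 0.30

# Conjugate normal-normal update on log(HR).
post_var = 1 / (1 / prior_sd**2 + 1 / obs_se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + obs_loghr / obs_se**2)
post = NormalDist(post_mean, sqrt(post_var))

p_similar = post.cdf(log(1 / 0.9)) - post.cdf(log(0.9))
print(f"Posterior Pr(0.9 < HR < 1/0.9) ~ {p_similar:.2f}")
# With a wide interval around a near-null estimate, this probability stays
# well below 0.9 -- absence of evidence rather than evidence of similarity.
```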
Thank you for speaking up.
Nice. A comment on one part: The sample size needed to incorporate interactions is so huge that we can rarely do what you suggested. In that case more accurate estimates for individual patients may be had by pretending the interactions are absent, and just using a covariate-specific absolute risk reduction estimate.