We need the actual numbers of patients in each group to do a proper Bayesian analysis. It cannot be backed out with just the info you have given.
"So, is it not reasonable to say it is likely that the one CI we have contains the true result given that 95% of them do BUT as we said we will be wrong with that statement with 5% of the CIs we see?"
No, this is an illegal inference by frequentist rules. Frequentism gives no metric for what is "reasonable". Frequentism only and ever gives P(E|H): the probability that the evidence E would be obtained in the experiment if the hypothesis H is true.
But no one cares about this. What we care about is P(H|E), the probability that the hypothesis H is true given the evidence E that has been obtained. But, according to frequentism, P(H|E) does not exist, as a matter of principle.
This is an insurmountable barrier for frequentist statistics. But since the barrier must nevertheless be surmounted, all sorts of dodges are made (all illegal by frequentism's own rules), and this is why the subject is so damn hard to understand. Because, at a fundamental level, it does not make sense.
This would be of mere academic curiosity if lives were not at stake.
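To make the P(E|H) versus P(H|E) distinction concrete, here is a minimal numeric sketch (all numbers are made up for illustration):

# Hypothetical illustration: P(E|H) is not P(H|E).
# Suppose a "positive" trial result E occurs with probability 0.80 if the
# drug truly works (H) and with probability 0.05 if it does not (not H).
p_E_given_H = 0.80
p_E_given_notH = 0.05

for prior_H in (0.5, 0.1, 0.01):            # assumed prior beliefs that H is true
    p_E = p_E_given_H * prior_H + p_E_given_notH * (1 - prior_H)
    p_H_given_E = p_E_given_H * prior_H / p_E    # Bayes' rule
    print(f"prior P(H)={prior_H:.2f} -> posterior P(H|E)={p_H_given_E:.3f}")

# Same evidence E, same P(E|H), yet P(H|E) ranges from roughly 0.14 to 0.94
# depending on the prior -- which is exactly why a prior is needed before one
# can say anything about the "true result".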
1) I imagine you are "correct" if you strictly follow the "rules", etc. However, the entire purpose of our post was to try to give people who are exposed to results and statistics presented in a frequentist way on a regular basis a way to more "correctly" interpret them. If we get rid of the words reasonable and likely, do you at least agree that "95% of the confidence intervals will include the true result"?
2) The example I gave was simply theoretical. Would you be willing to use the following real-life example and explain what you would say about the results in a way that clinicians might be able to use? The numbers come from the EMPA-REG trial https://pubmed.ncbi.nlm.nih.gov/26378978/. The abstract states "The primary outcome occurred in 490 of 4687 patients (10.5%) in the pooled empagliflozin group and in 282 of 2333 patients (12.1%) in the placebo group (hazard ratio in the empagliflozin group, 0.86; 95.02% confidence interval, 0.74 to 0.99; P=0.04)". Really looking forward to hearing what you have to say. Thanks.
Here is a plot of the posterior distribution for the rate of the primary outcome (which I understand is BAD) in the drug group (blue) and the placebo group (orange), assuming a flat prior (choice of prior makes very little difference because there is a lot of data):
Now you could do some more fancy math and compute confidence intervals ("credible intervals" in Bayesian lingo) or whatever, but I think it's enough to just look at the picture. I sure would want the drug, wouldn't you?
But if a number is needed, I think the most relevant one is the probability that the true value of the primary outcome rate is lower with the drug than with the placebo. That probability is 0.978. Very convincing, IMO.
Of course the drug may have other possible bad consequences (including cost) that would complicate the decision, but that's not part of the trial as I understand it.
The formula for each curve is simple:
P(x) = C x^np (1-x)^(n-np)
where n is the total number in the group (blue 4687, orange 2333) and np is the number with the primary outcome (blue 490, orange 282). The constant C is chosen so that the total probability is one; C =(n+1)! / (np! (n-np)!) where the exclamation point denotes the factorial function.
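For anyone who wants to reproduce the 0.978 figure, here is one way to do it (a sketch, not the exact calculation used above: it samples from the same flat-prior Beta posteriors described by the formula, rather than integrating the densities; counts are taken from the quoted abstract):

import numpy as np

rng = np.random.default_rng(0)

# EMPA-REG primary outcome counts from the abstract quoted above
n_drug, events_drug = 4687, 490        # pooled empagliflozin group
n_plac, events_plac = 2333, 282        # placebo group

# Flat prior Beta(1, 1) -> posterior Beta(events + 1, non-events + 1),
# i.e. the density C * x^np * (1 - x)^(n - np) given above.
draws = 1_000_000
p_drug = rng.beta(events_drug + 1, n_drug - events_drug + 1, draws)
p_plac = rng.beta(events_plac + 1, n_plac - events_plac + 1, draws)

# Posterior probability that the true event rate is lower on the drug
print(np.mean(p_drug < p_plac))        # approximately 0.978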
Note that, if it's not already obvious, I am NOT a medical person: my expertise is in a different hard science. I can barely get through medical jargon at all.
Just so I understand: are you saying that the probability that the drug has an effect is 97.8%? If so that is great, but I think I already pretty much know that by using a frequentist approach, because we have ruled out chance (the p value is <0.05). However, what I really need to know is whether the benefit is large enough to take the drug every day for the next three years. Here is what I would do by looking at the confidence interval. I believe the relative benefit is somewhere between a 26% relative benefit (0.74) and a 1% relative benefit (0.99), and the observed relative benefit was 14% (0.86). So the absolute benefit seen in this trial was 12.1% minus 10.5% = 1.6% - so a 1.6% benefit, and therefore 98.4% get no benefit - or approximately 60 people need to take this drug for three years for 1 to benefit. However, because we don't know the true effect, all I can say is that the effect is likely - sorry, I know Bayesians don't really like that word - somewhere between as large as a 26% relative benefit and as small as a 1% relative benefit. So the absolute benefit might be as large as ~3% or close to no benefit at all. Then I have to add in that the cost of the medication is about CA $1000 a year and 5-10% of people will get a genital infection because of the drug. Then I have to somehow explain this to a patient using percentages to help them make a shared decision.
So my main question now is: what additional clinically useful information could I get that I could use in the decision-making process by using a Bayesian approach instead of a frequentist approach? And is it something I could easily do by looking at the results presented in the paper?
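For reference, the arithmetic in the interpretation above can be written out explicitly (a sketch using only the numbers from the quoted abstract; applying the hazard-ratio CI directly to the placebo risk, as if it were a risk ratio, is an approximation):

# Risks taken from the EMPA-REG abstract quoted above
risk_placebo = 282 / 2333      # ~0.121
risk_drug    = 490 / 4687      # ~0.105

arr = risk_placebo - risk_drug             # absolute risk reduction, ~0.016
rrr = arr / risk_placebo                   # relative risk reduction, ~0.14
nnt = 1 / arr                              # ~61 treated for three years for 1 to benefit
print(f"ARR={arr:.3f}  RRR={rrr:.2f}  NNT={nnt:.0f}")

# Rough best/worst cases implied by the 95% CI for the hazard ratio (0.74-0.99),
# applied to the placebo risk as an approximation:
for hr in (0.74, 0.86, 0.99):
    arr_hr = risk_placebo * (1 - hr)
    print(f"HR={hr:.2f}: ARR ~{arr_hr:.3f}, NNT ~{1/arr_hr:.0f}")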
Watch out. "98.4% get no benefit" has nothing to do with the probabilities being considered here. To interpret things that way you'd need a 6-period randomized crossover study, which allows one to estimate benefit on a patient-by-patient basis. You can't get the fraction benefitting from a probability about a group effect.
"what I really need to know is the benefit large enough to take the drug every day for the next three years."
That requires first quantifying the downside in some way that allows it to be meaningfully compared to the upside, e.g., by assigning a dollar value to every potential outcome, good or bad. I don't see how your "relative benefit" and "absolute benefit" numbers are meaningful without that sort of quantification first.
As for the more basic point, is frequentist p<0.05 a good criterion? In high data situations, as we have here, yes, it will mimic the (fundamentally more sound) Bayesian posterior probability of there being an effect well enough not to matter. In situations with less data, I would not trust this to be the case.
As in "the evidence is compatible with the hypothesis".
E can be compatible with multiple hypotheses simultaneously, including hypotheses that would explain some or all of the result by bias/confounding, reverse causation, fraud, or randomness.
Hi Dr. Harrell. I know too little about Bayes (and stats in general) to ask this effectively.
I use Bayes implicitly when ordering any diagnostic test, as I must have a pre-test likelihood of disease in order for any test result to inform my post-test likelihood, and hopefully affect my downstream management decisions. But my pre-test seems entirely subjective, informed by formal teaching (“textbooks”) as well as clinical experience. Another clinician may have a different prior for the exact same patient. How does one deal with such differences in prior probabilities?
And for trials of therapeutics (esp “new” agents or first in class therapies), how does one even arrive at an informed prior probability? Thanks.
There are many good answers to that question, which I've dealt with at https://hbiostat.org/bayes/bet . Briefly, we always know something, and classical statistics does not even make use of such minimal knowledge, e.g., that a treatment is incremental rather than curative. An incremental therapy may entail using a prior for an odds ratio, for example, such that the probability that the odds ratio is >4 or <1/4 is 0.05. In some cases we have actual trustworthy data on which to base a prior. In a majority of cases a reasonable sample size makes the prior much less relevant. Having a prior is the price of being able to make probability statements about the unknown of true interest. Just as with medical diagnosis.
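A concrete version of that kind of skeptical prior (a sketch assuming a normal prior on the log odds ratio, centred at no effect):

from math import log
from scipy.stats import norm

# Skeptical prior: P(OR > 4 or OR < 1/4) = 0.05, i.e. P(|log OR| > log 4) = 0.05.
# With a normal prior centred at log(OR) = 0, that pins down the prior SD:
sd = log(4) / norm.ppf(0.975)          # ~0.707
print(f"prior SD on the log-OR scale: {sd:.3f}")

# Sanity check: tail probability outside (1/4, 4) under this prior
tail = 2 * (1 - norm.cdf(log(4), loc=0, scale=sd))
print(f"P(OR > 4 or OR < 1/4) = {tail:.3f}")   # ~0.05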
Thanks for the article and the discussion around it.
I want to focus on your conclusion #4 because I think it's where things tend to go astray, in part because it extends what looks like an error in conclusion #2 (moving from "an effect was captured" to inferring it was from the intervention - I don't think subjective confidence is enough, but getting beyond that requires recognition, discussion, etc., effectively making explicit Bayesian-style priors).
The use of CIs, p-values, etc. all depends on certain assumptions about how the underlying reality actually looks. If, for example, the underlying spread isn't a normal distribution, I don't think any of those assertions are justified, much less hold. But even on their own terms, neither the CI nor the p-value is making a claim about the underlying reality, but rather about what the results of repeated testing would/should look like.
As an example of a non-normal distribution that I hold near and dear: first-year salaries for lawyers are generally distributed bimodally.
https://www.nalp.org/salarydistrib
For kicks, I think it's possible to think of pharmacogenetic responses to drugs as potentially working that way to some extent in at least some circumstances (e.g., to keep it simple, responsiveness tracks just a particular variant with a discrete spread in the population).
Making any population-wide inferences from any given RCT's results becomes particularly problematic if the underlying assumptions about the model's ties to "true values" don't hold. And those assumptions should be disclosed and subject to real discussion as to why they're believed to hold (or not). Which ends up coming back to Bayesian approaches.
In other words, #4 seems to be asking the statistical tools to do something that they can't do and don't purport to do on their own. If it holds in the real world, it does so because of a number of assumptions that should in some sense be empirically tested and/or justified before we just assume they apply.
But I think the point made by Bayesians and some of the other comments is that -- especially if the signal is clear -- you can get to the causation question without relying on or getting caught up in the statistical math reflected in the CI or p-values. More to the point, that one should go straight to the causation question without getting hung up on the CI or p-values.
At this point, I've probably stepped in it somewhere.
That said, as an industry, I think medicine might benefit from looking back at the historical experience and disagreements between Ronald Fisher and William Gosset (aka Student).
As I understand it, Gosset wasn't a fan of the large RCT because it wasn't economically feasible in his role at the brewer Guinness. There is room for it, but a fetish-like focus on the RCT, CIs, and p-values tends to distract from getting effective answers to the questions that matter for getting things done. A properly designed RCT can be a powerful piece of evidence but isn't conclusive in and of itself, and can actually end up more biased than other approaches in some circumstances.
Curious as to your thoughts.
When I read a medical paper, I give little weight to the statistical analysis. I look at the raw figures and percentage differences and use common sense to determine whether the differences have any practical significance. Almost anyone with an average education can do the same. Other factors, such as the quality of the data, may greatly outweigh or even negate the value of the statistical analysis. No better illustration of that than the recent "Covid" episode, where the label was attached to a common clinical syndrome for which there was no definitive test. This led to the enumeration of "cases" and absurd diagnostic categories like asymptomatic cases, "Covid" carriers and "spreaders", "long Covid", etc.
This is true about 40% of the time.
Hello!
Please take a look at the talk that Dr. McCullough had with his peer Dr. Karina Acevedo Whitehouse about her new book (it can be found on his Substack)! https://www.amazon.com/dp/9694392675 https://petermcculloughmd.substack.com/p/adverse-effects-a-critical-review
1. Does a p-value automatically translate into a confidence interval? For example, does a p-value of 0.01 immediately define a 99% confidence interval?
2. Would it be correct to say that a confidence interval tells us that "the true effect/association lies within the confidence interval with a likelihood that is not definable, but is reasonably suspected to be typically sufficient enough to warrant subjective consideration that the true efficacy/association exists at a level contained within the interval"?
3. What is the difference between likelihood and probability?
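On question 1, there is a duality worth knowing about (a sketch of my own, assuming a simple normal-approximation z test for an effect estimate): a two-sided p-value of exactly 0.01 against the null means the null value sits right at the edge of the 99% confidence interval, so the p-value and the CI are two views of the same calculation.

from scipy.stats import norm

estimate, se = 2.576, 1.0            # hypothetical effect estimate and standard error
z = estimate / se
p = 2 * (1 - norm.cdf(abs(z)))       # two-sided p-value against a null of 0 -> ~0.01

for level in (0.95, 0.99):
    half_width = norm.ppf(1 - (1 - level) / 2) * se
    lo, hi = estimate - half_width, estimate + half_width
    print(f"{level:.0%} CI: ({lo:.3f}, {hi:.3f})  p = {p:.3f}")

# The 99% CI lower bound lands essentially on 0: p = 0.01 and a 99% CI that
# just excludes the null are the same statement about the same test.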
Hi David - great questions - I just responded to Frank Harrell - I think my response to him may answer some of your questions. Check it out. Thanks.
Very interesting stuff. I originally read about the problems with p-values, and one of them concerned the ability to test for a viral infection, which made me somewhat skeptical of the CoVid narrative. I found the article before CoVid but remembered it when I started hearing people talk about testing. There are fundamental problems with statistical analysis which pre-date many people's ideas from 2020 onward.
"with this single poll, all we can say is the true result is likely somewhere between 37% and 43% but we will be wrong with that statement 5% of the time."
This is false.
Frequentist statistics tells us NOTHING AT ALL about the true result. Frequentist statistics is only able to make statements ABOUT THE EXPERIMENT assuming, as a hypothesis, some actual true result.
But this is not what we care about. We care about what the "true result" is.
To make a statement about the "true result", we MUST introduce a prior and then do Bayesian updating to a posterior from the results of the experiment.
That medicine still has not adopted Bayesian analysis in the 21st century is perhaps the greatest medical scandal.
https://www.ahajournals.org/doi/full/10.1161/CIRCOUTCOMES.117.003563
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10117244/
https://www.nature.com/articles/s41573-023-00638-0
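To make that concrete, here is a minimal sketch of the Bayesian version of the poll example (the 40% support and the n = 1,000 sample size are assumptions for illustration; the original post does not give the poll's sample size):

from scipy.stats import beta

n, yes = 1000, 400                      # hypothetical poll: 40% support
posterior = beta(yes + 1, n - yes + 1)  # flat Beta(1, 1) prior updated by the data

# A direct statement about the "true result": 95% posterior probability
# that true support lies inside this credible interval.
print(posterior.interval(0.95))         # roughly (0.37, 0.43)

# Or answer a question the frequentist CI cannot: P(true support > 42%)?
print(1 - posterior.cdf(0.42))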
Hi Mark - thanks for your comment. Just responded to Frank Harrell who also had an issue with our statement that "with this single poll, all we can say is the true result is likely somewhere between 37% and 43% but we will be wrong with that statement 5% of the time.". Have a read, and would be more than happy to hear back what you think.
Great article. Very informative, and effectively verbalizes concepts that are not intuitively obvious for an end-user of evidence like me.
I am dubious about the difference between two options:
1. There is an undetected source of bias or confounding
2. The result really was due to chance
I think that in a randomized trial these are the same, and any "confounding" or "bias" is also due to chance. I put them in quotes because I think it is not correct to talk about confounding in randomized settings: from a causal point of view, an imbalanced covariate is not a confounder because there is no causal link between this covariate and the randomization. It is just one more thing "due to chance".
1. For many decades, I have kept coming back to Deming (1975) on the difference between enumerative studies and analytical studies. https://deming.org/wp-content/uploads/2020/06/On-Probability-As-a-Basis-For-Action-1975.pdf
2. F. Harrell's comment makes some very important points.
3. Contrary to the author's suggestion: 95% CIs have NOT stood the test of time. In my judgment (which could be wrong), the test of time has shown the opposite!
99% CIs are what medicine should be using! Medical research has revealed time and time again that it is hiding ignorance in the space between 2 and 3 sigmas.
If we went back 50 years and re-examined all of the literature using 99% CIs, we would see very well the missteps and wasted dead ends of misunderstanding.
Studies that purport to show something meaningful with a 95% CI but fail with a 99% CI are an opportunity to investigate heterogeneity, i.e., what really is the particular class of patients who might benefit from the proposed medical intervention.
Thanks for the comment
1) agree that Frank had some important points - working on a response
2) when I said "stood the test of time" I didn't mean that 95% is the CI we should use because it has proven to be the best but rather that 95% is still by far the most common CI used and/or reported
3) I wouldn't agree with the statement that "99% CIs are what medicine should be using!". Replacing one arbitrary threshold with another arbitrary threshold still leads to the dichotomous problem we discussed in the post. One could then ask: why not use a 99.9% CI? The purpose of the post was not to debate 90 vs 95 vs 99 etc., because no matter what you choose it is still arbitrary. And, if you go to 99%, there are all sorts of logistical implications. A great discussion on all this can be found here: https://eurradiolexp.springeropen.com/articles/10.1186/s41747-020-0145-y
Yes "arbitrary" in the sense of a mathematical construct - you could pick 50% CIs. But would you?
The aim is to not be fooled by randomness. I am with Benjamin et al. and Ioannidis.
95% CI is just enough to be "scientistic" but not enough to be "scientific", especially when coupled with poorly internalized "statistical thinking".
Your article highlights (with the caveats from Harrell) the need for medicine to undergo remediation in "statistical thinking", but having cross-examined physicians for 3+ decades, I am not sure there is an easy way to remediate hubris.
On the whole physicians are not very good at statistical thinking BUT they think they are. This is a problem. Same problem with diagnostic skills: the harder the case the more confident they are in their certitude of correctness! I am convinced that these two things are related.
I look forward to your response/follow-up regarding the Bayesian approach.
Thank you for the nice discussion of an always tricky topic.
I want to comment on a side issue you discuss: that of using the word "trend". I want to defend its use when summarizing the results of a trial.
Imagine the sentence, "Patients who received drug X had no reduction in all-cause mortality, but CV mortality was reduced by 12% and there were trends toward reductions in MI, stroke, and CABG."
Statisticians (and many clinical epidemiologists) will correctly point out that no test for trend was done and will then conclude that the word trend is ambiguous and unhelpful. While it is slightly vague, it is actually quite helpful in quickly communicating an interpretation of this trial.
The author of such a sentence is saying two things with the word "trend": the point estimates for reduction in MI, stroke, and CABG were clinically meaningful, and each p-value was more than 0.05 but not so big as to make the statement laughable. No one would laugh if a p-value were 0.07 and everyone would laugh if a p-value were 0.48. We can be pretty sure that most of us, who dislike being laughed at, would not risk writing such a sentence for a p-value bigger than about 0.15 or maybe 0.20.
The word "trend" is a shorthand here. If you want to know more, you need to go actually look at the trial results, but this is the interpretation of those results by the person writing the sentence. It is quick, clean, and easy to understand.
In contrast, let's consider the statistically and epidemiologically correct version of the sentence:
"Patients who received drug X had no statistically-significant reduction in all-cause mortality (HR 0.98; 95% CI 0.94 to 1.03) although the CI includes values that could be clinically meaningful; CV mortality was statistically significantly reduced (HR 0.88; 95% CI 0.75 to 0.98) but at the upper extreme of the HR, this would likely not be clinically meaningful; there was no statistically significant reduction in MI (HR 0.86; 95% CI 0.71 to 1.03), stroke (HR 0.90; 95% CI 0.70 to 1.08), or CABG (HR 0.85; 95% CI 0.70 to 1.12), but the point estimates for each of these would be clinically meaningful if real, while the upper ends of the confidence intervals include the possibility of harm."
There are absolutely times when I want to write that longer, more accurate sentence. But the shorthand of "trend" has value too.
I totally appreciate what you are trying to say, but I still think that if one uses words like "trend" there is way too much subjectivity in such a term, and as much as you think it might inform, it can equally misinform - and there is no way to know what it means to an individual person reading the word. Kind of why I think we should also stop using the terms high and low risk when it comes to describing a person's chance of a heart attack. We have to use ballpark numbers. Check out this paper if you want to see all the diverse phrases used to describe statistical results: PLoS Biol 2022: e3001562. https://doi.org/10.1371/journal.pbio.3001562
Here is the list of terms that have been used - some really make me laugh and trend is in there a lot - which ones do you like?
Marginally significant
All but significant
A nonsignificant trend
Failed to reach statistical significance
A strong trend
Nearly significant
A clear trend
An increasing trend
Only marginally significant
A significant trend
Potentially significant
Significant tendency
A positive trend
A decreasing trend
Marginal significance
A slight trend
Almost significant
A statistical trend
Approaching significance
Nominally significant
Quite significant
Near significant
An overall trend
Likely to be significant
Difference was apparent
Uncertain significance
Did not quite reach statistical significance
A weak trend
Marginally statistically significant
Tended to be significant
Possible significance
Not quite significant
A favorable trend
Just failed to reach statistical significance
A negative trend
Almost reached statistical significance
A possible trend
Fell short of significance
Not as significant
A small trend
A numerical trend
Slightly significant
Reached borderline significance
Near significance
Weakly significant
Moderately significant
An apparent trend
Barely significant
Practically significant
A definite trend
An interesting trend
Almost statistically significant
Marginally nonsignificant
Possibly significant
Significantly significant
"Potentially significant". Sheesh!
I have probably encountered every one of these phrases in pieces I have edited, and you are, of course, correct that this is problematic.
That said, I think there's a difference between implying that the data are struggling to reach statistical significance (e.g., "almost reached statistical significance"), and having a standard shorthand phrase that says, basically, this result wasn't statistically significant, but it's quite possible there was an important effect anyway.
I've resisted pressure to beat the word "trend" out of papers I've edited in my various roles and, at least for now, plan to continue to do so!
I didn't realize that when you used the word trend you were trying to get across that it is "quite possible there was an important effect anyway" - which is one of the problems of using a single word like trend to get across a thought. How about using one of these 2 phrases (instead of trend) for a non-statistically-significant difference, depending on the confidence interval?
Report the confidence interval (show the actual interval) and then say one of the following, assuming a minimal clinically important difference of a 20% reduction and a point estimate of 0.8:
1) CI = 0.75 - 1.05
Includes the possibility of an important effect but doesn’t rule out the chance of harm
2) CI = 0.90 – 1.05
Doesn’t include the possibility of an important effect and also doesn’t rule out the chance of harm
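A trivial sketch of that rule, if it helps (the 0.8 threshold for a 20% reduction and the two example intervals are the ones above; everything else is illustrative):

def describe(ci_low, ci_high, mcid=0.8, null=1.0):
    # For a ratio measure where lower is better: an "important effect" means
    # the interval reaches below the minimal clinically important difference.
    includes_important = ci_low < mcid
    excludes_harm = ci_high < null
    if includes_important and not excludes_harm:
        return "includes the possibility of an important effect but doesn't rule out the chance of harm"
    if not includes_important and not excludes_harm:
        return "doesn't include the possibility of an important effect and also doesn't rule out the chance of harm"
    if includes_important and excludes_harm:
        return "includes the possibility of an important effect and rules out harm"
    return "doesn't include the possibility of an important effect but rules out harm"

print(describe(0.75, 1.05))   # phrase 1 above
print(describe(0.90, 1.05))   # phrase 2 above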
"non significant trend" and "significant tendency" take the prize for me!
Very nice to have this refresher review---much appreciated.
A useful post here -- thank you. The noun "chance" was used many times in this overview and this innocent bit of verbiage signals a communication problem that's rarely mentioned in any textbook.
My major beef for a long time (I'm a retired academic surgeon) has been that folks never explicitly articulate "the chance of WHAT". This issue never failed to arise whenever I asked (during journal clubs) our residents/staff what some given P-value signified in a research paper. My best understanding is that what's "actually measured" with a P-value (in a project where conventional null hypothesis testing is used) will be the probability that the particular observed data in our sample [whether counts or measured continuous variables] arose *merely because of imperfect sampling* of two compared populations given, and this is important, four ordinarily tacit but very crucial contingencies: IF there was no fraud, IF there were uniformly accurate measurements, IF the two sampled populations were indeed identical, and IF the experiment was perfectly executed in all possible respects.
A toy example can easily illustrate this. Presume that a large jar contains a huge number of beads (let's say 4,416) that are either red or black. There are exactly equal numbers of red and black beads in the jar, but WE DO NOT KNOW THIS. The beads are perfectly spherical and all have the same diameter. There is presumed thorough mixing of the beads before sampling, and nobody cheats in any way at any time. While wearing a blindfold I use a special scoop that removes 100 beads -- we then tally the number of red beads and black beads removed in the scoop. Our task is to estimate the jar's composition from what is found in that single sample.
Of course, we would not necessarily find 50 black and 50 red beads in our single sample of 100 beads. Use of the noun "chance" in this context reflects the fact that it would not be impossible to find (for example) 47 black and 53 red beads in some given single sample of 100 beads EVEN IF the jar truly contained equal numbers of blacks and reds. A computed P-value would provide us the probability of drawing a sample at least as lopsided as the one we presently have in hand IF the jar actually contained equal proportions of red and black beads, and IF all other contingencies listed above were satisfied. If the P-value were sufficiently small, however, we could draw the inference that the population DID NOT contain equal numbers of red and black beads.
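For what it's worth, the p-value for that toy example can be computed directly (a sketch assuming the 53 red / 47 black split in the scoop of 100, from a jar that truly is 50/50):

from scipy.stats import binom

n, reds = 100, 53   # 53 red and 47 black beads in the scoop

# Two-sided p-value: probability of a split at least this lopsided, in either
# direction, IF the jar really is half red and half black (and all the other
# tacit contingencies above hold).
p_two_sided = binom.cdf(min(reds, n - reds), n, 0.5) + binom.sf(max(reds, n - reds) - 1, n, 0.5)
print(p_two_sided)   # ~0.62: a 53/47 split is entirely unremarkable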
Thanks for a simple refresher course in stats. "When trial results are presented, we need not worry too much about the complex mathematical computations underlying the statistical presentations." Um, maybe, but if researchers are making such silly mistakes as using p values on descriptive statistics, maybe we DO need to take a closer look at the statistical tests being used, since there are different types of tests for different types of data, and we should make sure they match. (We need to know whether the scale on which the variable(s) are measured is ordinal, categorical, or interval, and also whether they are normally distributed - see link #1 below.) Also, some researchers dichotomize continuous variables, which creates several problems (link #2).
Personally, I think all studies should report the absolute risk reduction in addition to the relative risk reduction, since the latter is meaningless without knowing the baseline absolute risk and can be very misleading. For example, a 50% RRR might mean an ARR of 50% (i.e., risk cut from 100% to 50%, very unlikely), or it might mean an ARR of 1% (i.e., risk cut from 2% to 1%).
Also, relative risk reduction seems to be misunderstood even by some doctors and Ph.D. scientists. I have heard a Ph.D. researcher who is an expert in infectious disease say that a vaccine with a 96% RRR means that it will be effective in preventing disease in 96% of the people who get the vaccine. Um, really? No, it means that in the clinical trial, the Control Event Rate minus the Experimental Event Rate, divided by the Control Event Rate, equals 96%. The lower the event rate in the control group, the larger the difference between relative and absolute risk reduction, which is why it's really important to know the event rate in the control group. I'm pro-vaccine, but it bothers me when I hear such statements from people who should know better, because they just give ammunition to people who are against vaccines for being misleading. We should also choose outcomes that matter, such as hospitalization and death, rather than, say, mere infection.
https://stats.oarc.ucla.edu/spss/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-spss/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1458573/
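To put numbers on the RRR vs ARR point (a sketch with hypothetical event rates roughly in the range seen in large vaccine trials, not data from any actual trial):

# Hypothetical vaccine trial event rates (illustrative only)
control_event_rate      = 0.009     # 0.9% of the placebo group got the disease
experimental_event_rate = 0.00036   # 0.036% of the vaccinated group did

rrr = (control_event_rate - experimental_event_rate) / control_event_rate  # = 0.96
arr = control_event_rate - experimental_event_rate                         # = 0.00864
nnt = 1 / arr                                                               # ~116

print(f"RRR = {rrr:.0%}, ARR = {arr:.2%}, NNT = {nnt:.0f}")
# A "96% effective" vaccine in this sense prevents the outcome in under 1% of
# recipients over the trial period -- not in 96% of them.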
Thank you! Regression analysis aged me. Wish I had come across this summary last year - it’s a solid overview of concepts that appeals to all students wherever they are on the learning continuum. Statistics is a blood sport, not for the faint of heart nor for online orphans.
There is much, much good in this article. The authors started out with great pains to interpret a confidence interval exactly correctly. Then they made a mistake:
"So, with this single poll, all we can say is the true result is likely somewhere between 37%
and 43% but we will be wrong with that statement 5% of the time."
No. Both parts of this sentence are incorrect. In frequentist statistics the true value is either in or outside the interval; there is no probability attached to this. The probability statement does not apply to 0.37 and 0.43 but to the process that generated this interval.
The extreme difficulty in interpreting confidence intervals should drive more people to Bayes, as described in my Bayesian journey at https://fharrell.com/post/journey.
Later the authors say
"Inferential statistics actually do NOT help us test a research hypothesis about whether an intervention worked or not. They assume the observed difference was solely due to chance and simply give us an estimate of the probability of such an occurrence over many potential repetitions of the study."
This is incorrect, as the statement applies only to classical frequentist inferential statistics. Any article on statistics that doesn't acknowledge the existence of Bayes is problematic.
Now take a look at
"No statistics can tell us if the medication worked or if the differences seen were clinically important. These decisions are clinical judgments--not statistical judgements. The ONLY reason we do inferential statistics is to singularly deal with the issue of chance. This concept is key to understanding inferential statistics."
That is false, as again it applies only to classical frequentist statistics. With Bayesian posterior probabilities you do not need to deal with "chance" in the sense above, and you obtain direct evidence measures such as the probability the treatment has any effectiveness and the probability of clinically meaningful effectiveness. And Bayesian uncertainty intervals are so much easier to interpret than confidence intervals.
An article about statistics should be exactly correct so as not to mislead readers, and researchers should stop pretending that the p-value/confidence limit form of inference is the only form that exists. Otherwise, new confusions will arise.
Frank - thanks for your response – very much appreciated. If I was asked before I posted this article, what is the probability someone will bring up the issue of a Bayesian approach to statistics, I would have guessed that probability to be 110% 🤣
You are correct we were just commenting from a frequentist’s perspective – we should have acknowledged that and we certainly didn’t intend to suggest a Bayes approach did not exist. Maybe we could collaborate in the future on a post to help people simply understand and contextualize a Bayesian approach to statistics – I think that could be very valuable.
It appears the main issue you had was with our phrase “So, with this single poll, all we can say is the true result is likely somewhere between 37% and 43% but we will be wrong with that statement 5% of the time”. We did not intend the word “likely” to suggest a specific probability – although I understand why you might think that. What we wanted to get across was simply a way a reader could start to think about what information a single point estimate and a single confidence interval might provide. So, we were trying to make it “as simple as possible but not simpler”. There is always a fine balance when one does that.
The issue here is a common one. We are using probability in our article in a frequentist sense, which is still the most common approach used in clinical research. We actually agree with the key elements of your comments. Clinicians normally have to make an estimate of the probability of a patient's disease before deciding on diagnosis and treatment. Frequentist probabilities do not apply here, as there is no such thing as a single-case frequentist probability. Our article is actually about trying to explain this with regard to p values and CIs. The only way to deal with this issue would be to explain in detail the difference between frequentist and Bayesian probabilities, which is beyond the scope of this article.
Nonetheless, if 95% of the generated CIs will contain the true result, then 5% will not contain the true result. So, is it not reasonable to say it is likely that the one CI we have contains the true result given that 95% of them do BUT as we said we will be wrong with that statement with 5% of the CIs we see?
In essence, saying that 95% CIs will contain the true value 95% of the time is another way of saying you can have reasonable confidence in this range. Frequentist statisticians use the term confidence rather than degree of belief because single-case probabilities do not have meaning to them.
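To make the “process” interpretation concrete, here is a small simulation sketch; the true proportion of 40% and the sample size are invented purely for illustration:

```python
# Sketch: coverage of 95% confidence intervals for a proportion (normal approximation).
# The 95% describes the interval-generating process, not any single interval.
import numpy as np

rng = np.random.default_rng(0)
true_p, n, n_trials = 0.40, 1000, 10_000

covered = 0
for _ in range(n_trials):
    p_hat = rng.binomial(n, true_p) / n
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - half_width <= true_p <= p_hat + half_width:
        covered += 1

print(covered / n_trials)   # roughly 0.95: about 95% of the intervals capture the true 40%
```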
We do state quite clearly elsewhere in our post that “Inferential statistics don’t give us a probability” and that “Knowing this nuance is key to understanding statistics”.
A REQUEST
To help our readers understand the difference between a frequentist approach and a Bayesian approach, I have a question about what you think might be the best way to simply interpret the result of a single trial.
THE SCENARIO
1) Let’s say there was only a single placebo-controlled trial of a new medication to see if it would reduce the risk of a heart attack. A single trial is not infrequently all we have in medicine when it comes to answering a clinical question.
2) Let’s assume the trial was well done and the findings were reported as a point estimate of 0.8 with a 95% confidence interval of 0.7-0.9 with the absolute risks of a heart attack in the placebo group being 10% and the risk in the new medication group being 8%.
How would you interpret this scenario? Thanks again for your response and interest.
Thanks very much for the nice reply James. I understand. Yes, we should do something in the future about Bayes. I think these go a long way towards covering both approaches: https://discourse.datamethods.org/t/language-for-communicating-frequentist-results-about-treatment-effects , https://www.fharrell.com/post/bayes-freq-stmts/ , and https://discourse.datamethods.org/t/bayesian-vs-frequentist-statements-about-treatment-efficacy . The first link has my proposed interpretations of trials like the one you just described. In terms of confidence intervals, Sander Greenland has published much related work and shows the advantages of using the term 'compatibility interval'. For now the important points are (1) the frequentist approach tries to be simple, but it does so by not giving you what you want, and (2) the probability attached to compatibility intervals is the long-term probability that the PROCESS used to generate the interval covers the true unknown treatment effect. The probability is not attached to a single realization of that process.
We need the actual numbers of patients in each group to do a proper Bayesian analysis. It cannot be backed out with just the info you have given.
"So, is it not reasonable to say it is likely that the one CI we have contains the true result given that 95% of them do BUT as we said we will be wrong with that statement with 5% of the CIs we see?"
No, this is an illegal inference by frequentist rules. Frequentism gives no metric for what is "reasonable". Frequentism only ever gives P(E|H): the probability that the evidence E would be obtained in the experiment if the hypothesis H is true.
But no one cares about this. What we care about is P(H|E), the probability that the hypothesis H is true given the evidence E that has been obtained. But, according to frequentism, P(H|E) does not exist, as a matter of principle.
This is an insurmountable barrier for frequentist statistics. But since the barrier must nevertheless be surmounted, all sorts of dodges are made (all illegal by frequentism's own rules), and this is why the subject is so damn hard to understand. Because, at a fundamental level, it does not make sense.
This would be of mere academic curiosity if lives were not at stake.
Interesting stuff Mark.
1) I imagine you are "correct" if you strictly follow the "rules", etc. However, the entire purpose of our post was to give people who are regularly exposed to results and statistics presented in a frequentist way a way to interpret them more "correctly". If we get rid of the words reasonable and likely, do you at least agree that "95% of the confidence intervals will include the true result"?
2) The example I gave was simply theoretical. Would you be willing to use the following real-life example and explain what you would say about the results in a way that clinicians might be able to use? The numbers come from the EMPA-REG trial https://pubmed.ncbi.nlm.nih.gov/26378978/. The abstract states "The primary outcome occurred in 490 of 4687 patients (10.5%) in the pooled empagliflozin group and in 282 of 2333 patients (12.1%) in the placebo group (hazard ratio in the empagliflozin group, 0.86; 95.02% confidence interval, 0.74 to 0.99; P=0.04)". I really look forward to hearing what you have to say. Thanks.
Here is a plot of the posterior distribution for the rate of the primary outcome (which I understand is BAD) in the drug group (blue) and the placebo group (orange), assuming a flat prior (choice of prior makes very little difference because there is a lot of data):
https://i.postimg.cc/LXhDnkQm/trial.jpg
Now you could do some more fancy math and compute confidence intervals ("credible intervals" in Bayesian lingo) or whatever, but I think it's enough to just look at the picture. I sure would want the drug, wouldn't you?
But if a number is needed, I think the most relevant one is the probability that the true value of the primary outcome rate is lower with the drug than with the placebo. That probability is 0.978. Very convincing, IMO.
Of course the drug may have other possible bad consequences (including cost) that would complicate the decision, but that's not part of the trial as I understand it.
The formula for each curve is simple:
P(x) = C x^np (1-x)^(n-np)
where n is the total number in the group (blue 4687, orange 2333) and np is the number with the primary outcome (blue 490, orange 282). The constant C is chosen so that the total probability is one; C = (n+1)! / (np! (n-np)!), where the exclamation point denotes the factorial function.
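For anyone who wants to check the numbers, here is a minimal sketch that draws from those flat-prior Beta posteriors (the counts are the ones quoted from the abstract above); it should land very close to the 0.978 figure:

```python
# Sketch: flat-prior Beta posteriors for the primary-outcome rate in each arm,
# and the posterior probability that the rate is lower on the drug.
import numpy as np

rng = np.random.default_rng(0)

drug_events, drug_n = 490, 4687        # pooled empagliflozin group
placebo_events, placebo_n = 282, 2333  # placebo group

# Flat prior Beta(1, 1) + binomial likelihood -> Beta(events + 1, non-events + 1)
drug_draws = rng.beta(drug_events + 1, drug_n - drug_events + 1, size=1_000_000)
placebo_draws = rng.beta(placebo_events + 1, placebo_n - placebo_events + 1, size=1_000_000)

print(np.mean(drug_draws < placebo_draws))   # ~0.978
```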
Note that, if it's not already obvious, I am NOT a medical person: my expertise is in a different hard science. I can barely get through medical jargon at all.
Thanks so much for doing this.
Just so I understand: are you saying that the probability that the drug has an effect is 97.8%? If so, that is great - but I think I already pretty much know that by using a frequentist's approach, because we have ruled out chance - the p value is <0.05. However, what I really need to know is whether the benefit is large enough to take the drug every day for the next three years.

Here is what I would do by looking at the confidence interval. I believe the relative benefit is somewhere between a 26% relative benefit (0.74) and a 1% relative benefit (0.99), and the observed relative benefit was 14% (0.86). The absolute benefit seen in this trial was 12.1% minus 10.5% = 1.6% - so a 1.6% benefit, and therefore 98.4% get no benefit - or approximately 60 people need to take this drug for three years for 1 to benefit. However, because we don't know the true effect, all I can say is that the effect is likely - sorry, I know Bayesians don't really like that word - somewhere between as large as a 26% relative benefit and as small as a 1% relative benefit. So the absolute benefit might be as large as ~3% or close to no benefit at all.

Then I have to add in that the cost of the medication is about CA $1000 a year and that 5-10% of people will get a genital infection because of the drug. Then I have to somehow explain this to a patient using percentages to help them make a shared decision.
So my main question is now, what additional clinically useful information could I get that I could use in the decision making process by using a Bayesian approach instead of a frequentist's approach? And then is it something I could easily do by looking at the results presented in the paper?
Hope my approach and questions make sense.
Thanks again.
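For what it's worth, here is the back-of-the-envelope arithmetic above as a short sketch. It treats the hazard ratio as if it were a relative risk applied to the placebo group's 12.1% event rate, which is a simplification rather than what the trial's Cox model actually does:

```python
# Sketch: rough absolute benefits implied by the EMPA-REG relative estimates,
# treating the hazard ratio as an approximate relative risk (a simplification).
placebo_risk = 0.121                  # 282/2333 from the abstract
for hr in (0.74, 0.86, 0.99):         # lower CI bound, point estimate, upper CI bound
    drug_risk = placebo_risk * hr
    arr = placebo_risk - drug_risk    # absolute risk reduction
    nnt = 1 / arr                     # number needed to treat
    print(f"HR {hr}: ARR ~{arr:.1%}, NNT ~{nnt:.0f}")
```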
Watch out. "98.4% get no benefit" has nothing to do with the probabilities being considered here. To interpret things that way you'd need a 6-period randomized crossover study, which allows one to estimate benefit on a patient-by-patient basis. You can't get the fraction benefitting from a probability statement about a group effect.
"what I really need to know is the benefit large enough to take the drug every day for the next three years."
That requires first quantifying the downside in some way that allows it to be meaningfully compared to the upside, e.g., by assigning a dollar value to every potential outcome, good or bad. I don't see how your "relative benefit" and "absolute benefit" numbers are meaningful without that sort of quantification first.
As for the more basic point, is frequentist p<0.05 a good criterion? In high-data situations, as we have here, yes, it will mimic the (fundamentally more sound) Bayesian posterior probability of there being an effect well enough not to matter. In situations with less data, I would not trust this to be the case.
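As a rough illustration of why the two agree here (a sketch using a crude two-proportion normal approximation and a flat prior, not the trial's actual Cox model): the approximate flat-prior posterior probability of benefit is just one minus the one-sided p-value, so a small p-value and a high posterior probability end up telling the same story when data are plentiful.

```python
# Sketch: with lots of data and a flat prior, the (normal-approximation) posterior
# probability of benefit equals 1 minus the one-sided p-value.  Counts from EMPA-REG;
# this is a crude two-proportion comparison, not the trial's Cox model.
import numpy as np
from scipy.stats import norm

drug_events, drug_n = 490, 4687
placebo_events, placebo_n = 282, 2333

p_drug, p_placebo = drug_events / drug_n, placebo_events / placebo_n
se = np.sqrt(p_drug * (1 - p_drug) / drug_n + p_placebo * (1 - p_placebo) / placebo_n)
z = (p_placebo - p_drug) / se        # standardized difference in event rates

print(2 * norm.sf(z))                # two-sided p-value, ~0.04 as in the abstract
print(norm.cdf(z))                   # ~0.978, close to the Beta-posterior figure above
```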
Following this thread with interest. As Mark described:
P(E|H) = "the probability that the evidence E would be obtained in the experiment if the hypothesis H is true."
P(H|E) = "the probability that the hypothesis H is true given the evidence E that has been obtained."
The latter term is the one frequentism says is incalculable. The former brings to mind the term "compatibility interval":
https://www.bmj.com/content/366/bmj.l5381
As in "the evidence is compatible with the hypothesis".
E can be compatible with multiple hypotheses simultaneously, including hypotheses that would explain some or all of the result by bias/confounding, reverse causation, fraud, or randomness.
Thus is my pleb understanding.
Hi Dr. Harrell. I know too little about Bayes (and stats in general) to ask this effectively.
I use Bayes implicitly when ordering any diagnostic test, as I must have a pre-test likelihood of disease in order for any test result to inform my post-test likelihood, and hopefully affect my downstream management decisions. But my pre-test seems entirely subjective, informed by formal teaching (“textbooks”) as well as clinical experience. Another clinician may have a different prior for the exact same patient. How does one deal with such differences in prior probabilities?
And for trials of therapeutics (esp “new” agents or first in class therapies), how does one even arrive at an informed prior probability? Thanks.
There are many good answers to that question, which I've dealt with at https://hbiostat.org/bayes/bet . Briefly, we always know something, and classical statistics does not make use of even such minimal knowledge, e.g., that a treatment is incremental rather than curative. An incremental therapy may entail using a prior for an odds ratio, for example, such that the probability that the odds ratio is >4 or <1/4 is 0.05. In some cases we have actual trustworthy data on which to base a prior. In a majority of cases a reasonable sample size makes the prior much less relevant. Having a prior is the price of being able to make probability statements about the unknown quantity of true interest. Just as with medical diagnosis.
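One way to make that "incremental therapy" prior concrete (a sketch, assuming a normal prior on the log odds ratio centered at no effect, which is one common choice rather than the only one):

```python
# Sketch: a skeptical normal prior on the log odds ratio, centered at 0 (no effect),
# scaled so that P(OR > 4 or OR < 1/4) = 0.05.
import numpy as np
from scipy.stats import norm

prior_sd = np.log(4) / norm.ppf(0.975)                    # ~0.707 on the log-odds scale

# Check the tail probability this prior assigns to "large" effects:
p_extreme = 2 * norm.sf(np.log(4), loc=0, scale=prior_sd)
print(prior_sd, p_extreme)                                # ~0.707, ~0.05
```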