Gosh, was I lucky to speak with Professor Erik Van Zwet from Leiden University in the Netherlands. He is the first author on a recent NEJM Evidence paper looking at more than 23,000 trials in the Cochrane Database. (I linked to a URL that should get past the paywall.)

I’m sure this is in the paper and I’m just not seeing it, but how did you choose what value the “true effect” takes in each case? Is the “true effect” the meta-analytic estimate produced by Cochrane for a given effect of interest (e.g., the mean difference in systolic blood pressure between an ACE inhibitor and placebo)? Is it the smallest difference each study aimed to detect, as specified in the study’s power calculation? Or is it the difference in means that was observed (regardless of the associated p-value)? Thanks!

Final p-value focused query. Question is at the very end.

In the discussion, there seemed to be some caution but also some vacillation in the treatment of p-values. Sometimes they're referenced almost as if they tie into real world meaning, while in other places the discussion seemed well-hedged on that front.

Part of the challenge in coming originally from another field is that I just don't remember either polling or correlational studies being used to make the kind of statistical causal assertions that appear common in medicine. I'm struggling to understand how or why such strong claims are made in medicine.

Correct me if I'm wrong, but my understanding has always been that p-values only "work" for a normally distributed population. But in most cases, we don't actually know what that underlying distribution is. In polling -- as in much of statistics -- the trick to get around that is the Central Limit Theorem.

As I understand the theorem, if we shift focus from the population itself to an abstracted numerical property of the population, like the mean, then the distribution of that property across repeated samples will generally approach a normal distribution around the true mean of the underlying population. But even then, that only holds under certain conditions (such as independence and identical distribution of the random variables).

It seems that most of the time medicine just bootstraps the necessary assumptions in without either establishing that they hold or explicitly acknowledging them. That seems problematic.

To put some meat on the skeleton, imagine a universe of 20,000 Tribbles that can potentially receive a drug to treat measles. They're all that exist. Assume that, as a real-world matter, 50% will respond to the drug and survive, while 50% will not and will die. That underlying distribution isn't normal.

And you could think of alternative worlds where 50% have immune systems that can fight off measles unaided and 50% can't. In one world, that 50% is the same as the drug responders. In another, the two groups are independent, leading to a different distribution between those who take the drug and those who don't.

Anyhow, in this 50% responder / 50% non-responder world, think of a pen holding all 20,000, and we randomly draw from what's still in the pool.

If we're sampling 20 at a time, then in one sample of our 50%-responder world, 9 might die and 11 survive. Another might go 10/10, another 8/12, then 12/8, and so on. As I understand it, by the time we get to 1,000 draws, the results across the individual draws should map out to approximately a normal distribution clustered around 50% surviving and 50% dying.
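This thought experiment is easy to check with a quick simulation. The sketch below uses the numbers from the example above (20,000 individuals, samples of 20, 1,000 draws); it is purely illustrative and comes from none of the papers under discussion:

```python
import random
import statistics

random.seed(42)

# Hypothetical Tribble universe: 20,000 individuals, exactly half of
# whom respond to the drug and survive (1) while the rest die (0).
population = [1] * 10_000 + [0] * 10_000

# Draw 1,000 samples of 20 (without replacement within each draw) and
# record the surviving fraction in each sample.
sample_means = [
    statistics.mean(random.sample(population, 20)) for _ in range(1_000)
]

# The individual outcomes are binary (a decidedly non-normal, two-point
# distribution), yet the sample means cluster roughly normally around
# 0.5, with spread close to sqrt(0.25 / 20), i.e. about 0.11.
print(statistics.mean(sample_means))
print(statistics.stdev(sample_means))
```

Increasing the per-draw sample size tightens the spread around 0.5, but nothing about the simulation recovers *which* individual Tribbles respond.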

If the central limit theorem is correctly triggered, the p-value and z-statistic are statistically appropriate and do have a real-world tie -- but only as it relates to the mean value.

To get there, however, the ability to inferentially link to the actual underlying population distribution has been broken. (The underlying bimodal distribution, in which no one really has a 50% chance, has been transformed into an abstracted normal distribution of means with a peak at 50%.)

While there's potential value in knowing the mean values above, there's even more value in knowing which 50% will respond and what, if anything, ties them together versus those who don't.

I can imagine a world where one drug successfully treats 50% of people with a condition and another drug successfully treats the other 50%. Neither is comparatively "better" than the other, but together they end the disease altogether, as long as they're targeted right. But if the kind of analysis necessary to find that out is avoided in order to ensure the "proper" application of normal-distribution measures, much could be lost.

All of which is a long lead-in to a pretty short question:

Shouldn't any reliance on p-values or other normal-distribution-focused tools in medical papers be either established as empirically appropriate, justified as a rational assumption, or heavily circumscribed with caveats?

In your paper on the statistical properties of RCTs, the histogram looks like there's a left skew in the z-statistics, but then that observed data is used to help create a normal distribution that shaves off the left lean, and that normalized curve is used in the latest paper.

I'm curious as to what work that does.

Is it limited to making it possible to use normal distribution equations or does it do more?

Also, is the actual histogram sufficiently far-removed from normal distribution to render something like the Central Limit Theorem inapplicable?

I apologize, but all of this ties into a set of questions I have about p-values generally from the discussion, which seems to differ in important ways from how they are referenced and used in the papers.

Along those lines, Table 2 looks like a descriptive use of statistics, but it looks like the paper used the Monte Carlo-generated 10^6 (!!) set to build a predictive model for Figures 1 & 2 as well as Table 2.

I hate to betray my ignorance, but given the pretty sizable base set of over 23,000 studies in Cochrane, what is gained by using the larger set? Or, conversely, what would be lost if the analysis were limited to the actual data?

(Most of my statistics exposure was via political science, where -- especially in polling -- there aren't very many data points (e.g., polls run), so running multiple scenarios makes a lot of sense to try to trigger, or estimate the impact of, the law of very large numbers.)

Also, I'd love to know if I'm totally misunderstanding any of this.

Goedemorgen! And thanks for the informative talk.

I've got to break up my questions into a few different parts.

First, without getting into the weeds about if and when the use of p-values and z-statistics is appropriate -- or even whether researchers used them appropriately -- would it be safe (or safer) to say that your latest paper, and the previous one with the z-statistic histogram for Cochrane, either assume that the z-statistics are properly used, or simply observe, as a descriptive matter, that they ARE used?

Second, it seems as if both papers -- particularly the most recent -- are concerned with methodological prediction -- i.e. what repeated testing is or would be likely to show -- as opposed to, for lack of a better word, real-world prediction. Is that correct?

Thank you for the thorough and honest discussion of the limitations of statistical significance. It reminded me of why I decided to change my major after taking Statistics 101 in college.

Thank you! The link works.

Many thanks for this!

I have an investment background, where "risk" expects a "return" -- and active managers are often assessed via an "Information Ratio": excess return divided by the standard deviation of excess return.
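For readers who haven't seen it, the Information Ratio is just the mean excess return over a benchmark divided by the volatility of that excess return (the tracking error). A minimal sketch with invented monthly returns:

```python
import statistics

def information_ratio(portfolio, benchmark):
    """Mean excess return divided by the standard deviation of the
    excess returns (the tracking error). Illustrative only."""
    excess = [p - b for p, b in zip(portfolio, benchmark)]
    return statistics.mean(excess) / statistics.stdev(excess)

# Hypothetical monthly returns for an active manager and a benchmark.
manager   = [0.021, 0.013, -0.004, 0.018, 0.009, 0.015]
benchmark = [0.015, 0.010, -0.002, 0.012, 0.007, 0.011]

print(round(information_ratio(manager, benchmark), 2))
```

This estimate-over-its-own-variability structure is loosely analogous to the signal-to-noise ratio (effect estimate over standard error) discussed in the interview.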

The signal-to-noise ratio could be an important metric to include in assessing investment manager quality.

Investment products carry the "health warning" that past performance is NOT a predictor of future performance.

I think there are a number of "cross-overs" between statistical analysis of investment and medical analysis.

Risk-adjusted, benchmark-relative performance measurement and attribution is one. Skewness and kurtosis of probability curves is another; then there's value at risk and option volatility surfaces, but just as important is the stability of risk/return measures over different time periods.

Returns in the investment context could map to beneficial outcomes in the medical context, and VaR -- value at risk -- could be redefined as "lives at risk," with granularity down to any chosen demographic.

I found myself wondering what the distribution curves would look like if broken down by the different phases of clinical trials/studies, and what the tendencies were over the (decades?) spanned by the 25,000-odd RCTs.

Maybe look at 4 year "time packets".

Maybe it will reveal an increasing drop in the quality of RCTs and highlight systemic issues with their construction with even more clarity than the under-powering of 80% of RCTs and the one-in-five to one-in-ten shot of results being down to luck!

Anyway, great interview, and thanks again for conducting it!

Lastly, measures such as the number needed to vaccinate/treat to produce one targeted clinical outcome would certainly help me understand risks -- as would metrics like life-years gained or lost and quality-adjusted life-years lost, with regular comparisons of clinical-trial-estimated outcomes against actual outcomes over some sensible period.
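Number needed to treat is mechanically simple: the reciprocal of the absolute risk reduction between arms. A toy sketch with invented event rates:

```python
def number_needed_to_treat(control_event_rate, treated_event_rate):
    """NNT = 1 / absolute risk reduction. Purely illustrative."""
    arr = control_event_rate - treated_event_rate
    if arr <= 0:
        raise ValueError("treatment shows no absolute risk reduction")
    return 1.0 / arr

# Hypothetical trial: 10% of controls vs 6% of treated patients
# experience the outcome, so ARR = 0.04 and NNT = 25.
print(round(number_needed_to_treat(0.10, 0.06)))  # 25
```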

Good stuff!

Thanks for commenting! Unfortunately, the CDSR doesn't have information about the phases of trials. However, I'm working with a group of oncologists who have collected a large set of phase III trials in oncology; see our preprint here:

https://medrxiv.org/cgi/content/short/2024.04.01.24305158v1

It's clear that these phase III trials have much higher signal-to-noise ratio (or power, if you prefer) than the average trial, and that has implications for the interpretation of the results.

I will check it out and try to keep up with the great work you do!


"53% of trials claiming superiority ... had a ≥ 90% probability of providing clinically meaningful benefits."

Sounds like a health and money maker for around half of the trials!

Reconciling trials with outcomes is not a simple task, but it could prove the best way to measure sample/trial error.

Neither would tracking side effects and their severity be simple (taking a leaf out of the C19 mRNA debate).

...

Really good podcast. Erik's "you got to have a rule" is not consistent with optimal decision making (e.g., maximizing expected utility) when the rule involves a fixed p-value cutoff such as 0.05. There are excellent papers where the Bayes optimal decision, with utility = disease burden, is then reverse engineered to find the optimal value of alpha. For example in a rare fatal disease the optimum alpha may be 0.25. In a common chronic cardiovascular disease it may be 0.01. See https://www.sciencedirect.com/science/article/abs/pii/S0304407618302380?via%3Dihub
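To make the "reverse-engineered alpha" idea concrete, here is a toy version of the calculation -- not the cited paper's method, and the prior probability, signal-to-noise ratio, and loss values are all invented: choose the one-sided significance cutoff that minimizes expected loss when false positives and false negatives carry different costs.

```python
from statistics import NormalDist

N = NormalDist()

def expected_loss(alpha, snr, p_eff, loss_fp, loss_fn):
    """Expected loss of a one-sided z-test with significance level alpha.
    snr is the signal-to-noise ratio (true effect / standard error) when
    the treatment actually works; p_eff is the prior probability that it
    works. Toy model only, not the cited paper's formulation."""
    z_crit = N.inv_cdf(1 - alpha)
    power = 1 - N.cdf(z_crit - snr)
    return (1 - p_eff) * alpha * loss_fp + p_eff * (1 - power) * loss_fn

def best_alpha(snr, p_eff, loss_fp, loss_fn):
    """Grid-search the alpha that minimizes expected loss."""
    grid = [a / 1000 for a in range(1, 500)]
    return min(grid, key=lambda a: expected_loss(a, snr, p_eff, loss_fp, loss_fn))

# Rare fatal disease: missing an effective drug is very costly, so the
# optimal alpha lands far above 0.05.
print(best_alpha(snr=1.5, p_eff=0.3, loss_fp=1, loss_fn=5))

# Common chronic disease: approving an ineffective drug is very costly,
# so the optimal alpha drops well below 0.05.
print(best_alpha(snr=1.5, p_eff=0.3, loss_fp=10, loss_fn=1))
```

The point of the sketch is only that the "right" cutoff falls out of the losses and the prior; nothing privileges 0.05.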

Also, for any discussion of p-values I hope to hear at least brief mention that another paradigm exists---one that involves direct evidence and feeds directly into optimum decision making. The classical paradigm has to put too much emphasis on replication of results. Replication is a great thing whether you're frequentist or Bayesian, but the Bayesian paradigm works better when faced with a decision about "this trial". It also better supports evidence for the worthiness of doing a second, confirmatory trial.

I find discussions about effects in standard deviation or standard error units become very non-clinical. In the growth curve example I would like to know the quantile of a child's height given age, not how the height compares to some fixed quantile of normal kids. Even better would be to know the probability that the child's height in 5 years will be more than 2 inches shorter than an "ideal" height. In assessing clinical effectiveness I want to know, for example, the probability the treatment lowers systolic blood pressure by more than 5mmHg. The 2 and the 5 are what I would label as "very clinical".
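As a sketch of how such a "very clinical" probability is computed -- assuming, purely for illustration, that the analysis yields an approximately normal posterior for the treatment effect:

```python
import math

def prob_effect_exceeds(threshold, post_mean, post_sd):
    """P(effect > threshold) under a normal posterior for the effect.
    Uses the standard normal survival function via erfc."""
    z = (threshold - post_mean) / post_sd
    return 0.5 * math.erfc(z / math.sqrt(2))

# Invented posterior: SBP reduction ~ N(mean=7, sd=2) mmHg.
# P(reduction > 5 mmHg) = P(Z > -1), about 0.841.
print(round(prob_effect_exceeds(5, post_mean=7, post_sd=2), 3))
```

The answer is a direct probability statement about the clinically chosen threshold (the "5"), rather than a tail area under a null hypothesis.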

Regarding back-engineering confidence intervals to get standard errors there is an assumption that the sampling distribution of the statistical estimator is symmetric on some scale. This is a very nitpicky point but in reality many symmetric confidence intervals that are used in biomedical research do not have exceptionally accurate confidence coverage for both limits, and an asymmetric interval should have been used. CIs for log odds ratios is one of the settings where our standard confidence interval formula is not always very good.
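The standard construction alluded to here -- a symmetric Wald interval on the log scale, exponentiated back to an asymmetric interval on the odds-ratio scale -- looks like this (2x2 counts invented):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """95% Wald CI for the odds ratio from a 2x2 table:
        a = treated events,  b = treated non-events,
        c = control events,  d = control non-events.
    Symmetric on the log scale; asymmetric after exponentiating
    back to the odds-ratio scale."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    point = math.exp(log_or)
    return point, math.exp(log_or - z * se), math.exp(log_or + z * se)

# Invented counts: 15/100 events on treatment vs 30/100 on control.
or_, lo, hi = odds_ratio_ci(15, 85, 30, 70)
# The upper arm of the interval sits farther from the point estimate
# than the lower arm once mapped back from the log scale.
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

As the comment notes, even this standard log-scale formula can have inaccurate coverage, particularly with small cell counts.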

The discussion about insufficient power of many published RCTs is really good. Many of the trials should never have begun. For most of those that resulted in equivocal results or actual evidence of harm, a Bayesian sequential design would have stopped very early for inefficacy. How do we continue to justify completing over-optimistically-designed trials? More at https://hbiostat.org/bayes/bet/design which also points out that sequential designs are more logical because the effect size used in power calculation is just too arbitrary.

The idea that small trials are OK if they are all published and end up in a meta-analysis doesn't address the ethics of patients agreeing to undergo experimental therapies when there is little hope that they will contribute to generalizable knowledge from "this trial".

Erik discussed a trial "getting a haircut" to shrink the efficacy estimate when for example p=0.04. This makes a lot of sense. I don't know how to do that in a rigorous way without using a Bayesian analysis. The "haircut" is automatically handled by the prior distribution for the treatment effect, so we should spend more of our time in specifying prior distributions (most of which should be skeptical about large treatment benefits or harms).
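The "haircut" can be illustrated with the simplest normal-normal model: with a normal prior on the true effect, the posterior mean pulls the observed estimate toward the prior mean, more aggressively for noisier estimates. A toy sketch (prior parameters invented, not the empirical prior from the papers):

```python
def shrink(estimate, se, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean of a normal-normal model: the observed estimate
    is pulled toward the prior mean, more strongly when the standard
    error is large relative to the prior sd. Toy illustration only."""
    w = prior_sd**2 / (prior_sd**2 + se**2)  # weight on the data
    return w * estimate + (1 - w) * prior_mean

# A noisy trial estimate (effect 0.8, se 0.5) gets a bigger haircut
# than a precise one (effect 0.8, se 0.1).
print(round(shrink(0.8, 0.5), 3))  # 0.64
print(round(shrink(0.8, 0.1), 3))  # 0.792
```

A skeptical prior (small `prior_sd`) produces a larger haircut, which is exactly the mechanism described here.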

I really enjoyed the discussion, and learning more about Erik's views on clinical trial design and interpretation.

Thanks for commenting! I agree completely that deciding whether a treatment "works" based on a single p-value being larger or smaller than 0.05 is a disaster. Of course, at some point a decision has to be made about a treatment, but that should come much later, after many other factors have been considered (costs, utilities, side effects, etc.).

I didn't mention Bayes during the interview because we tried hard to make our paper "ecumenical". We argue neither for nor against the p-value; we just try to provide additional meaning in the context of clinical trials. From a Bayesian point of view, we're using an empirical prior (which handles the "haircut"), while from a frequentist point of view we're just estimating certain conditional expectations.

Thanks for the comments back Erik! Well said. Best regards.