A Major Win for Open Science
In five short chapters, this Study of the Week shows you a shining example of how open science can improve knowledge. It's a beautiful story of humility and generosity.
Chapter 1:
In August, Sensible Medicine described a unique study from the group of Brian Nosek at the University of Virginia. They showed that choices made in how data was analyzed could greatly affect the results.
It was a provocative study because Nosek’s team had brought together 29 teams of professional statisticians to analyze one dataset to answer one question. The teams chose 29 different ways to analyze the data. Two-thirds of the teams found a statistically significant result; one-third found no significant difference.
What struck me as a reader of medical studies is that no study uses 29 different analytic methods. They use one.
So, now, when reading a study with a result that is pretty close, I think: what if they had analyzed the data in different ways? Would the result hold up?
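To make the idea concrete, here is a toy "many analysts" exercise in Python. Everything in it is invented: one simulated dataset, four defensible analytic specifications, and a spread of p-values.

```python
# One simulated dataset, four reasonable analytic choices, four p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)               # exposure
y = 0.12 * x + rng.normal(size=n)    # weak true effect
keep = np.abs(x) < 2                 # an "exclude outliers" choice

specs = {
    "raw correlation": lambda: stats.pearsonr(x, y)[1],
    "rank correlation": lambda: stats.spearmanr(x, y)[1],
    "median-split t-test": lambda: stats.ttest_ind(
        y[x > np.median(x)], y[x <= np.median(x)])[1],
    "trimmed correlation": lambda: stats.pearsonr(x[keep], y[keep])[1],
}
# The p-values differ across specifications and can land on either
# side of 0.05, even though the underlying data never change.
for name, run in specs.items():
    print(f"{name:20s} p = {run():.3f}")
```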
Nosek’s study never got much traction, perhaps because it answered a social science question using soccer statistics rather than a hard science or medical question.
Chapter 2:
This week, I will show you a shining example of how open science and different analytic methods clarified a potentially major finding in cardiac surgery.
Here is the back story. There is a debate in cardiac surgery about which conduits are best for the bypass grafts. All surgeons agree that the left internal mammary artery (now called the left internal thoracic artery, or LITA) is the absolute best conduit.
But nearly all bypass surgeries require more than one conduit. Surgeons can use a radial artery (from the arm), saphenous vein (from the leg) or the right internal thoracic artery (RITA) for the additional conduits.
It turns out that there have been many studies comparing these conduits. This summer, a group led by surgeon Mario Gaudino, from Weill Cornell Medicine in New York, did what’s called a meta-analysis, wherein they combined these studies hoping to learn which conduit was best.
The Gaudino study wasn’t a typical meta-analysis, because the included trials compared many different combinations of conduits. So what they did was take the more than 10,000 patients in the trials and use a statistical method called propensity matching to make matched triplets.
They found similar patients in each of three groups—radial artery (RA), saphenous vein graft (SVG), and RITA. Their method for finding these matches—with propensity matching—was complicated. I don’t pretend to understand it. But it turns out that it may have been quite important.
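For readers curious about what propensity matching looks like in practice, here is a minimal two-group sketch in Python. To be clear, this is not Gaudino’s method (his team matched triplets across three groups with a more elaborate procedure), and the covariates and data below are invented for illustration.

```python
# A minimal sketch of 1:1 propensity matching on hypothetical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
# Invented baseline covariates: age and ejection fraction
X = np.column_stack([rng.normal(65, 10, n), rng.normal(55, 8, n)])
# Treatment assignment depends on age (confounding by indication)
treated = rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] - 65) / 10))

# Step 1: model the probability of treatment given covariates
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: greedily pair each treated patient with the untreated
# patient nearest on the propensity score (without replacement)
controls = list(np.where(~treated)[0])
pairs = []
for i in np.where(treated)[0]:
    if not controls:
        break
    j = min(controls, key=lambda k: abs(ps[i] - ps[k]))
    pairs.append((i, j))
    controls.remove(j)

print(f"Matched {len(pairs)} treated/control pairs")
```

The matched pairs now resemble each other on the measured covariates, so differences in outcomes are less likely to reflect who was sicker to begin with. Unmeasured differences, though, remain a threat.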
Their major finding, shown in the paper’s central figure, was that the group that received the radial artery had a nearly 40% lower death rate than the other groups.
Wow. That was a major finding.
Before this study, surgeons had a huge debate about which conduit was best. Previous studies really did not show a clear winner.
Now they had this amazingly strong finding, published in a big journal, from a respected group of authors.
Gaudino and his co-authors wrote candidly about the limitations of their study, mainly that it was not a randomized comparison. Still, a 40% lower death rate could not be ignored.
Chapter 3:
This part of the story begins with appropriate questions regarding the massive effect size. It just didn’t seem plausible.
Here comes the beautiful part.
The editors asked Prof Gaudino if he would share the dataset and code.
He said yes.
That generosity allowed another statistics group, led by Prof Nick Freemantle of University College London, to re-analyze the same dataset.
The first thing Freemantle’s team did was to analyze the data in the same way that Gaudino’s group did. Here they found that the results were identical. This exercise showed there were no errors in the analysis.
But then, as in the original Nosek study of 29 teams, Freemantle’s team made different analytic choices than Gaudino’s team had. They felt there were more traditional ways to do propensity matching. Again, I can’t say which is better.
Freemantle’s analytic method yielded no significant differences in outcomes with the different conduits. No differences. None.
Again, it’s the same dataset. Just different choices in analyzing it.
Now you may be wondering…I don’t know a damn thing about statistics or bypass surgery, so who do we believe?
I will get to that in the final chapter.
Chapter 4:
The lack of difference in outcomes better fits the prevailing thinking amongst surgeons. In fact, another academic heart surgeon, David Taggart, along with colleagues, published an editorial on the original Gaudino paper and argued that a 40% mortality reduction from a choice of conduit was implausibly large.
Taggart did not mention the Freemantle group’s re-analysis. Perhaps he wrote the editorial before Freemantle’s team had finished.
Taggart’s team had a super-important way to explain the implausibly large mortality reduction. It’s something all readers should plug into their memory banks.
They noticed that the reduction in major adverse cardiac events (abbreviated MACE) was smaller than the overall mortality reduction. Think about that. If a cardiac surgical technique lowers the death rate, it must do so by reducing cardiac outcomes.
In any study, when the reduction in overall mortality is greater than the reduction in the specific outcomes of the treatment, this suggests bias—specifically, healthier patients got one type of intervention.
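A toy calculation makes the logic concrete. All numbers below are invented; the point is only the arithmetic.

```python
# Invented numbers, per 1,000 patients, to show why an overall mortality
# reduction larger than the cardiac-event reduction signals bias.
cardiac_deaths_control = 60
noncardiac_deaths_control = 40
total_control = cardiac_deaths_control + noncardiac_deaths_control  # 100

# Suppose the conduit truly cuts cardiac deaths by 20% and, being a
# cardiac intervention, cannot affect non-cardiac deaths at all.
cardiac_deaths_treated = cardiac_deaths_control * 0.80              # 48
noncardiac_deaths_treated = noncardiac_deaths_control               # 40
total_treated = cardiac_deaths_treated + noncardiac_deaths_treated  # 88

overall_reduction = 1 - total_treated / total_control
print(f"Overall mortality reduction: {overall_reduction:.0%}")  # 12%

# To reach a 40% overall reduction, non-cardiac deaths would have to fall
# too, which a bypass conduit cannot plausibly cause. Healthier patients
# in one arm is the more likely explanation.
```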
Chapter 5:
This is one of the year’s most important stories.
Had Prof Gaudino not shared his dataset, and had it not been re-analyzed with different choices, surgeons might have moved toward greater radial artery use.
Based on the re-analysis, and the Taggart-led editorial, there appears to be great uncertainty regarding the choice of conduit.
So the answer to this important question should be obvious:
You do a randomized controlled trial wherein patients who require bypass are randomized to three groups—one with each conduit.
Randomization, not surgeons, chooses the conduit. This balances the known (and unknown) patient characteristics, and in some years, there will be an answer.
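For illustration only, here is a minimal sketch of permuted-block randomization to three conduit arms. The arm labels, block size, and code are hypothetical, not a description of any actual trial protocol.

```python
# A toy permuted-block randomization to three conduit arms.
import random

ARMS = ["RA", "SVG", "RITA"]
BLOCK_SIZE = 6  # two slots per arm keeps the arms balanced in each block

def allocation_sequence(n_patients, seed=42):
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_patients:
        block = ARMS * (BLOCK_SIZE // len(ARMS))
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_patients]

# Each new patient simply gets the next assignment in the sequence;
# neither the patient nor the surgeon chooses the conduit.
print(allocation_sequence(12))
```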
Epilogue:
Imagine a scientific world where authors of major medical studies were as generous as Professor Gaudino and his team.
Many, if not most, of the practice-changing studies in medicine are never re-analyzed. Or, if they are, it’s very late, after therapeutic fashions have been established.
I hope I live long enough to see a new approach to judging medical evidence. One in which any important findings are verified independently and transparently.
Regarding Brian Nosek's study: an MD is an expert in one limited field. If a clinician does medical research, that clinician needs to rely on experts in data science in order to do credible science. It's a rare clinician who is also a great statistician, since statistics and data analysis seem to be the worst-taught subjects in medical school. Nosek's study confirms what many of us have long suspected: the statistical expert himself may not be enough. His proposal to be more transparent by publishing all the possible conflicting statistical approaches (with their associated diverging outcomes) suggests that data science experts may need a referee as well. Otherwise, an already skeptical lay public will become frankly cynical about science in general. Data science, by seeming to support any conclusion on the same dataset, will cause science to lose credibility.
My impression is that the field of statistics is just a child, whereas probability is the parent. Talk to any mathematician specializing in the theory of probability, and you will likely find that many statisticians don't really understand probability on a rigorous, deep, mathematical and philosophical level. It's more than just knowing to use non-parametric methods when the underlying data distribution is non-normal.
Nosek's study did adjust for level of expertise in statistics and found that it made little difference to the overall conclusion. I wonder if it would be possible to adjust for level of expertise in probability. For instance, compare the data experts who came to statistics after learning probability with those who learned only enough probability to do statistics.
If such a study could be done, let's just hope that the probabilists don't disagree among themselves!
After all, probability experts are also human, and may show the same variable opinions on the same dataset as do the statisticians. In that case, we simple-minded clinicians will have to clear two credibility hurdles: with our limited understanding of statistics, can we trust the statistical experts? And how do we choose among the probability experts when we understand even less about probability?
Something similar has happened with a study that showed a correlation between vitamin D levels and all-cause mortality based on a Mendelian randomization analysis, a method that rests on so many assumptions that when researchers say it is almost as good as an RCT, you wonder. The original article was Lancet Diabetes Endocrinol 2021; 9: 837–46. This week saw two letters arguing that the statistics and assumptions used were flawed, and the authors have accepted this and finally said that their study does not support a causal relationship between vitamin D and outcomes. This is basically a retraction of the results, which were initially touted in the media...not sure if the journal has subsequently published an editorial on this whole mess.
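For readers unfamiliar with Mendelian randomization, here is a toy sketch of the core idea using simulated data. This is not the Lancet paper's analysis; the variables and effect sizes are invented, and real MR rests on strong assumptions (for example, that the genetic variant affects the outcome only through the exposure).

```python
# Toy Mendelian randomization: use a genetic variant as an instrument
# for the exposure. Simulated data; the true causal effect is zero.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
g = rng.binomial(2, 0.3, n)                  # genotype: 0/1/2 risk alleles
u = rng.normal(size=n)                       # unmeasured confounder
exposure = 0.5 * g + u + rng.normal(size=n)  # e.g., vitamin D level
outcome = u + rng.normal(size=n)             # no effect of exposure

# Naive regression of outcome on exposure is confounded by u...
naive = np.polyfit(exposure, outcome, 1)[0]
# ...but the Wald ratio (gene-outcome slope / gene-exposure slope) is not,
# because the genotype is independent of the confounder.
wald = np.polyfit(g, outcome, 1)[0] / np.polyfit(g, exposure, 1)[0]

print(f"naive estimate: {naive:.2f} (biased away from zero)")
print(f"MR Wald-ratio estimate: {wald:.2f} (near the true value, zero)")
```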
I don't think they shared the data, but based on what was published, the authors of the rebuttal letters posted their comments. Also, this is the UK Biobank data, so I assume the data is public...not sure.