Well, the question matters too. And the hidden assumptions. Are dark-skinned players more likely to actually commit fouls because of playing style? Are there particular players who are much more likely to commit fouls and who happen to be dark-skinned? The implication is that the fouls are being awarded in a prejudiced manner. But perhaps not. I could run a hypothetical study on violent crime arrest rates by sex. I'd find that men are much more likely to be arrested for violent crimes. Do I then argue that we need to arrest more women for violent crimes to achieve equity? Or that we should release men who are arrested for violent crimes to achieve equity? Or do I recognize that men are arrested more often for violent crimes because they are more likely to commit violent crimes, for a variety of reasons (physical strength, opportunity, societal expectation, testosterone, etc.)?
Thank you, this is great. I love it when different points of view, or interpretations of data or facts, are analyzed on Substack.
Regarding statistical thresholds... https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0032734 Joe Mudge, et al. "Setting an Optimal alpha That Minimizes Errors in Null Hypothesis Significance Tests."
Balancing Type I and Type II errors in the quest for optimal significance testing seems important, subject to the caveat that both error types matter for the specific investigation at hand.
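To make that concrete, here is a toy sketch in the spirit of the Mudge et al. paper linked above (not their exact method; the effect size, sample size, and equal weighting of the two error types are assumptions for illustration): for a one-sided z-test, scan alpha and pick the value that minimizes the combined Type I plus Type II error rate.

```python
# Toy illustration: choose alpha to minimize total (Type I + Type II) error
# for a one-sided z-test at an assumed "critical" standardized effect size.
import numpy as np
from scipy.stats import norm

delta = 0.3                                   # assumed effect size worth detecting
n = 50                                        # planned sample size
alphas = np.linspace(0.001, 0.30, 1000)

z_crit = norm.ppf(1 - alphas)                 # one-sided critical values
beta = norm.cdf(z_crit - delta * np.sqrt(n))  # Type II error at each alpha
total = alphas + beta                         # equal weight on the two error types

i = int(np.argmin(total))
print(f"alpha minimizing total error: {alphas[i]:.3f} (beta there: {beta[i]:.3f})")
```

With these assumed numbers the optimal alpha lands well above the conventional 0.05, which is exactly the kind of trade-off the paper is getting at.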
Aren't you supposed to decide on which method of analysis you'll use BEFORE starting a study and REGISTER it beforehand? Precisely to avoid this?
Very good exercise! The studies show that statistics, and how they are interpreted, matter.
However, IMO, the starting point is being avoided.
Were all teams from the UK?
Most of the judges I see on TV are not White.
What is the race of the referees? And spectators?
Is it Black or White, Hispanic, etc., referees who bias their refereeing?
Do Asian, Mediterranean, African, and Central and South American teams show the same findings?
What is the percentage of White and other players and referees on these teams? Any connections?
In hockey there is an appointed tough guy (the enforcer) whose job is to rough up opposing players, and who might be penalized above average.
Does soccer have the same situation: one or two players on each team who play rough and are known to be penalized?
Are these players White or Black, etc., and what is the racial makeup of the spectators and refs for the team and their opponents?
Now analyze the 29 teams.
These last 2.5 years have highlighted how important it is to find people who are smarter than you, more experienced than you, and whose judgment you implicitly trust, to help sort through the sand that is modern medicine. I am grateful to have discovered the Sensible Medicine group.
In God we Trust. Everyone else bring data. And, expect that data to get the ‘wire brush treatment’. Check and cross-check. Are the instruments calibrated? When? What do the people closest to the situation think about it? What have they seen? Are there confounders in the situation? Does this line up with our theoretical understanding or contradict it? With prior experience or contradict it? Can we replicate the results? When errors are found are they admitted and corrected? Or covered up?
As I recall from long ago, we try to estimate some effect in a population from which we extract samples. We often assume that the population has some underlying distribution - normal mostly, but there are others as well. In the process of analysis the various techniques rely on assumptions about that population. Most of us simply take for granted that the assumptions are correct. In the red card case perhaps not enough samples?
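As a quick illustration of how much those distributional assumptions can matter (a simulation sketch, not anything from the red card study itself): run a nominal 5% one-sample t-test on small samples drawn from a skewed population with the null actually true, and watch the realized Type I error rate drift away from 5%.

```python
# Nominal 5% t-test, true null, skewed (exponential) population, small n:
# the normality assumption the test leans on is strained.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 10, 20000
rejections = 0
for _ in range(reps):
    sample = rng.exponential(scale=1.0, size=n)   # population mean is exactly 1.0
    _, p = stats.ttest_1samp(sample, popmean=1.0)
    rejections += p < 0.05

print("empirical Type I error rate:", rejections / reps)  # drifts away from 0.05
```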
To those who want the truth, things like the integrity of the researcher, motives and/or bias (recognized or not), outside influences, preconceived notions, and a well-designed, transparent process are just a few of the things that need to be considered. To assume the results of a study are 100% correct and then present the findings to large populations causes untold harm. A list could be made of today's commonly held "truths."
While it's true (and comforting) that the effect sizes are all in the same ballpark of 1.2, the practical difference between their being statistically significant or not is huge: in one case, the referees in question may well be suspended (or worse) for racial bias; in the other, no action will be taken. When Vinay reviews a paper online and the authors say something like "we observed a 2.5% decrease in mortality, though the result was not statistically significant," he always says: "you do not get to say that! your results are not significant - meaning they may be due to random noise!" Unfortunately, significance is far more fungible than the effect size (it is a second-order quantity, while the effect size is first-order) - different, often very subtle, assumptions about the error term in the model will lead to different standard error estimates, and thus different significance levels. These assumptions are very hard for a non-statistician to track and evaluate. Thus I agree with your original interpretation of the paper - the results are very troubling. This is another, and subtler, form of p-hacking: in the "classic" form, the significance measure (e.g., a t-test) stays the same and we examine different effects until we find one that is significant. The alternative is to keep the same effect (as in the cited paper) and keep adjusting modeling assumptions until you reach significance. Far harder to catch.
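A minimal, hypothetical sketch of that last point (simulated data, not the models from the cited paper): the fitted coefficient is identical across the two summaries below, but the standard error, and therefore the p-value, depends on whether the errors are treated as independent or as clustered within referees.

```python
# Same coefficient, different p-values: independent vs. cluster-robust
# standard errors on simulated referee-level data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_refs, games_per_ref = 50, 40
ref = np.repeat(np.arange(n_refs), games_per_ref)      # hypothetical referee IDs
x = rng.binomial(1, 0.5, size=n_refs)[ref]             # a referee-level predictor
shared = rng.normal(0.0, 1.0, size=n_refs)[ref]        # noise shared within a referee
y = 0.2 * x + shared + rng.normal(0.0, 1.0, size=ref.size)

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()                              # assumes independent errors
clustered = naive.get_robustcov_results(cov_type="cluster", groups=ref)

print("coefficient (same under both):", round(naive.params[1], 3))
print("p-value, independent errors:  ", round(naive.pvalues[1], 4))
print("p-value, clustered by referee:", round(float(clustered.pvalues[1]), 4))
```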
As someone with a graduate degree in one of the applied sciences (exercise science), I can tell you it is hopeless for the average American to be able to interpret scientific studies. Heck, most people with undergrads in an applied science can't do it, never mind people with zero training in it. This is essentially why the CDC and the Biden administration were able to dupe most Americans into the idea that masks were effective and everyone needed to be vaccinated. The truth is murkier and requires the ability to analyze the studies rather than just take the CDC's word for it. The answer to almost every question in the applied sciences is almost always "it depends" and rarely a sweeping "yes" or "no".
Not sure what to make of this. What is the 21% figure supposed to show or represent? Does it account for player and/or national traits (i.e., style of play, tactics, etc.)?
This article underscores that we should make teaching statistics a primary goal of our educational system. If our schools spent more time teaching statistics (and to a broader swath of the school population) we would have better educated patients - and a better health system. The mathematical options in high school seem to focus on geometry, algebra, and calculus - all extremely important for many people who are going to go on to become scientists or engineers. But everyone needs to understand statistics (and the statistics of prior probabilities (Bayes) in particular).
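As a small worked example of why prior probabilities matter so much for patients (the numbers below are illustrative, not from any particular test): Bayes' rule turns a seemingly accurate test into a surprisingly modest positive predictive value when the condition is rare.

```python
# Positive predictive value via Bayes' rule, with illustrative numbers.
prevalence = 0.01      # prior probability of disease
sensitivity = 0.90     # P(test positive | disease)
specificity = 0.95     # P(test negative | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {ppv:.2f}")   # ~0.15 despite a "90% sensitive" test
```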
Research and journal reviews were a significant part of our residency program. The large variation in odds ratios, depending on the analytic approach, that John Mandrola presented here supports the critical view and the question we developed for reading peer-reviewed articles:
"It may be statistically significant, but is it clinically significant and relevant?"
Excellent breakdown of the challenges associated with relying on statistical analysis without considering alternative interpretations.
Very interesting. Thanks for this.
In practice, doctors don't have enough time to examine the data. It is an effort that is only necessary when the conclusions are contradictory or questionable.