Diagnostic Reasoning and the Accuracy of Findings in the History and Physical Examination
A Quantitative Analysis
This is the second article I am putting up on Sensible Medicine that was originally written for a medical journal but couldn’t find a home.[1] The first was our piece on the termination of the doctor/patient relationship.
The idea behind this article came from something Scott Stern and I noticed while working on our textbook, Symptom to Diagnosis: An Evidence-Based Guide: positive findings in the history and physical exam are more likely to be important than negative ones. A positive finding might rule in a diagnosis, while the absence of a finding almost never rules out a diagnosis. This observation had become an important part of our teaching. I featured it on my “diagnostic reasoning yellow card,” and it shows up in the textbook as “fingerprint” findings.
All that said, we never did the work to aggregate all the numbers until we teamed up with Kevin Kennedy, at the time a medical student at The University of Chicago and now a psychiatry resident at UCLA. Kevin did an amazing job pulling together heaps of data, organizing them, and finding ways to present them. He also, absolutely, refined our thinking on the issue.
We hope this article will be interesting to clinicians, educators, and anyone interested in the diagnostic process.
Introduction
Diagnostic reasoning is an essential skill that medical trainees must develop. It is often taught as Bayesian hypothetico-deductive reasoning. A trainee is encouraged to develop a hypothesis regarding a diagnosis. She then estimates the likelihood of this diagnosis, the pre-test probability. This probability is then revised based on the presence or absence of findings on the history and physical examination. The impact of each finding depends on its test characteristics. Although experienced clinicians recognize that the diagnostic process is far more complex, this heuristic model is an accepted way to teach diagnostic reasoning.
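For readers who want to see the arithmetic behind this model, here is a minimal sketch in Python. It works on the odds scale, where Bayes’ theorem reduces to posttest odds = pretest odds × likelihood ratio. The probabilities and likelihood ratios below are illustrative numbers of my choosing, not data from this study.

```python
def update_probability(pretest_prob: float, lr: float) -> float:
    """Revise a pretest probability with a likelihood ratio (odds-form Bayes)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# Illustrative numbers only: a finding with LR+ = 5 applied at a 20% pretest probability
print(update_probability(0.20, 5.0))   # ~0.56 -- a moderate shift upward
# The same disease when a finding with LR- = 0.2 is absent
print(update_probability(0.20, 0.2))   # ~0.05 -- a moderate shift downward
```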
The goal of this research is to better understand the accuracy and utility of clinical findings in order to appropriately train clinicians in how to use them. In this study, we examined the value of common clinical exam findings and considered how our results might inform diagnostic reasoning and its teaching.
Methods
We obtained data from The Rational Clinical Examination: Evidence-Based Clinical Diagnosis website. This site contains updated versions of the systematic reviews originally published as “The Rational Clinical Examination” series in JAMA. All data came from the “Make the Diagnosis” section of these reviews, where the most common and useful clinical exam findings are listed.
We included all likelihood ratios (LRs)[2] for individual symptoms and signs. Because we were interested in the test characteristics of discrete findings on the physical exam, we excluded data regarding test characteristics for laboratory tests or imaging procedures and for multivariate findings, including symptom scales, combinations of findings, and formal clinical decision rules. We excluded the rare exam finding whose test characteristics were not reported as a likelihood ratio (reported only as sensitivity or posterior odds of disease, for example) as well as likelihood ratios given as ranges (indicating data that were too heterogeneous for the authors to combine). The only complete data set we excluded was the analysis of a “family history of cancer,” as this study did not evaluate a diagnostic test but rather the accuracy of a historical finding. Data are presented as means and medians.
Results
We identified positive likelihood ratios for 250 clinical findings and negative likelihood ratios for 241 findings. These findings were associated with 67 distinct diagnoses. For our analysis we considered a diagnostic test to be useful if it had at least a moderate effect on pre-test probabilities. Authors have often equated this with a positive LR ≥5 or a negative LR ≤0.2. On average, individual clinical findings were not powerful diagnostic tests. The median LR+ was 3.4 and the median LR- was 0.59.
Positive findings were more informative than negative ones. Twenty-eight percent of positive findings had likelihood ratios greater than or equal to 5, while only 12% of negative findings had likelihood ratios less than or equal to 0.2. The mean positive and negative likelihood ratios were 6.4 and 0.57, respectively. Figure 1 shows the likelihood ratios of positive and negative findings on a logarithmic scale. Figure 2 shows the findings categorized by likelihood ratio: 1 to <2, ≥2 to <5, ≥5 to <10, and ≥10 for positive likelihood ratios and 1 to >0.5, ≤0.5 to >0.2, ≤0.2 to >0.1, and ≤0.1 for negative likelihood ratios. Considering the most impactful diagnostic findings, there were 31 (12.4%) positive findings with likelihood ratios greater than or equal to 10, while only 13 (5.4%) negative findings had likelihood ratios less than or equal to 0.1. These most diagnostic findings are listed in Table 1.
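For concreteness, the binning behind Figure 2 can be sketched as below. The logic simply mirrors the category boundaries listed above; the sample values at the end are placeholders, not findings from our data set.

```python
def bin_positive_lr(lr: float) -> str:
    """Place a positive likelihood ratio into the Figure 2 categories."""
    if lr >= 10: return ">=10"
    if lr >= 5:  return ">=5 to <10"
    if lr >= 2:  return ">=2 to <5"
    return "1 to <2"

def bin_negative_lr(lr: float) -> str:
    """Place a negative likelihood ratio into the Figure 2 categories."""
    if lr <= 0.1: return "<=0.1"
    if lr <= 0.2: return "<=0.2 to >0.1"
    if lr <= 0.5: return "<=0.5 to >0.2"
    return "1 to >0.5"

# Placeholder values for illustration only
for lr in (1.3, 3.4, 6.4, 12.0):
    print(lr, "->", bin_positive_lr(lr))
```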
Discussion
In this study we found that only 21% of findings were associated with likelihood ratios that cause at least moderate shifts in the probability of a diagnosis (≥5 or ≤0.2). When considering the findings with likelihood ratios ≥5 or ≤0.2, positive findings tend to be more powerful than negative ones. While 88% of negative findings were not diagnostically helpful (likelihood ratios between 1 and 0.2, exclusive), 28% of positive findings were associated with clinically important likelihood ratios of ≥5. In this data set, there were more than twice as many truly diagnostic positive clinical findings (LR ≥10) as there were negative ones (LR ≤0.1).
We commonly teach diagnostic reasoning as a Bayesian hypothetico-deductive process in which each finding revises the probability of a diagnosis. Although this method of teaching is effective, our findings do suggest a weakness in this strategy: most signs and symptoms are not diagnostically powerful—they are associated with likelihood ratios close to one—and thus do little to alter disease probability. Few signs or symptoms are pathognomonic of a diagnosis and even fewer can definitively exclude a diagnosis.
Medical trainees often begin their education in diagnostic reasoning by learning “textbook presentations,” the key signs and symptoms of a disease. Appendicitis, for example, is abdominal pain that migrates to the right lower quadrant (RLQ), fever, and tenderness at McBurney’s point. We learn from our data that the findings that make up textbook presentations, at least when considered independently, rarely identify a disease state. Teaching students textbook presentations may cause them to overestimate the specificity of clinical exam findings: students may inappropriately rule in a diagnosis on the basis of a positive exam finding. Alternatively, students may overestimate the sensitivity of a finding, excluding a disease because a textbook exam finding is not present. In the case of childhood appendicitis, for instance, no clinical exam finding has a likelihood ratio greater than 3.4 or less than 0.28. Right lower quadrant tenderness, probably the most widely taught finding for appendicitis, has likelihood ratios of only 1.3 and 0.45.
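To see how little those appendicitis numbers move the needle, plug them into the update function sketched earlier. The 30% pretest probability below is a number chosen purely for illustration, not one drawn from the source data.

```python
def update_probability(pretest_prob: float, lr: float) -> float:
    """Posttest probability via odds-form Bayes (same sketch as above)."""
    odds = pretest_prob / (1 - pretest_prob) * lr
    return odds / (1 + odds)

# RLQ tenderness in childhood appendicitis: LR+ = 1.3, LR- = 0.45
pretest = 0.30  # illustrative pretest probability, not from the source data
print(f"Tenderness present: {update_probability(pretest, 1.30):.0%}")  # ~36%
print(f"Tenderness absent:  {update_probability(pretest, 0.45):.0%}")  # ~16%
```

Neither result comes close to ruling the diagnosis in or out.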
Learning textbook presentations of diseases is critical for trainees to build a repertoire of illness scripts. Our data suggest that this method of instruction should be accompanied by the caution that disease presentations are variable and that exam findings are weaker predictors than many students expect. Except for a minority of signs and symptoms—which should be highlighted to students—the presence or absence of any specific finding does relatively little to change the likelihood of a diagnosis. A negative finding, in particular, should rarely lead to the exclusion of a diagnosis from a differential diagnosis.
Acknowledging the limitations of clinical exam findings—and clearly distinguishing between diagnostic and non-diagnostic findings—may help improve clinical reasoning training in three other ways. First, this instruction may help students become more aware of their cognitive biases. Heuristics can lead to diagnostic error, in part, because clinicians overestimate the value of exam findings. This is especially true for negative findings, which we found are seldom helpful in excluding a diagnosis. Second, there is some evidence that overconfidence in trainees’ diagnostic accuracy (particularly among residents) may be a source of diagnostic error. It is possible that overconfidence may be caused by an inflated estimate of the value of exam findings, giving trainees a false sense of certainty. Third, one of the few studies to investigate the cause of diagnostic errors among medical students found that the most common error was overvaluing non-discriminatory clinical exam findings. When most exam findings are taught to students without regard for their diagnostic value, students may end up making diagnoses based on weak findings rather than robust ones.
A final point about teaching diagnostic reasoning that can be inferred from our results is the critical importance of teaching students about the pretest probability of common diseases and presentations. Given the evidence that most exam findings have little independent effect on the probability of a disease, it is essential for physicians to know the pretest probability of diseases, as it determines whether an accurate diagnosis can be made with weak exam findings.
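A quick illustration of why the starting point dominates: the same modest finding (LR+ = 2, a value chosen for the example) yields very different posttest probabilities depending on the pretest probability.

```python
def update_probability(pretest_prob: float, lr: float) -> float:
    odds = pretest_prob / (1 - pretest_prob) * lr
    return odds / (1 + odds)

# The same modest finding (LR+ = 2) applied at different pretest probabilities
for pretest in (0.01, 0.10, 0.50, 0.90):  # illustrative values
    print(f"pretest {pretest:.0%} -> posttest {update_probability(pretest, 2.0):.0%}")
# 1% -> 2%, 10% -> 18%, 50% -> 67%, 90% -> 95%
```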
There are limitations to this study. Most importantly, our data are by no means comprehensive, being limited to those conditions addressed in The Rational Clinical Examination series and the specific findings considered by the authors. We chose this approach because the test characteristics reported in The Rational Clinical Examination series have been rigorously assessed. This assures a level of quality in the test characteristic data that is probably lacking in the literature as a whole. Drawing upon the “Make the Diagnosis” section also ensured that our data reflected the most common and valuable clinical exam findings. For any diagnosis, there are dozens of potential tests that can be conducted, but only a handful are diagnostically meaningful and widely used. If anything, this strategy led us to overestimate the value of clinical findings. Comparison of our findings to data contained in a widely used textbook of medicine supported our contention that positive findings are more useful than negative ones. Using the same exclusion criteria, we identified 24 additional clinical findings, not included in this report, with LRs ≥10 and only 10 with LRs ≤0.1.
Two other points of importance: we considered likelihood ratios, ignoring issues of pretest probability, and we converted likelihood ratios, a continuous, logarithmic variable, into categories of various levels of clinical utility. These decisions do not change the results of our study but are important in considering the applicability of our findings.
Despite limiting our source to The Rational Clinical Examination series, our findings are affected by variable data quality. Some test characteristics were defined in multiple studies while others were defined by a single study. Our decision to exclude combinations of clinical findings and clinical decision rules was based on a desire to keep our findings simple and easily applicable despite recognizing that most clinical assessments are made based on a full clinical picture rather than on a single finding.
We considered whether our finding that positive findings are generally more powerful than negative ones might be due to a systematic bias in the literature. Although clinical findings are always defined for a disease that is considered at least somewhat likely, it is unlikely that publication, evaluation, or spectrum biases would produce the divergence in our results.
This study reveals the complexity of interpreting the history and physical exam. Most clinical findings, taken as discrete entities, are not diagnostically powerful. Positive exam findings are generally more valuable than negative ones. When educating clinical trainees in diagnostic reasoning, our data highlight the importance of teaching pretest probabilities, cautioning trainees that discrete clinical findings are seldom diagnostically powerful, and emphasizing the subset of findings that definitively rule in or rule out a diagnosis.
[1] Isn’t that a nice way to say that it was rejected by the journals that I thought deserved it?
[2] For anyone who has not been beaten over the head about sensitivity, specificity and likelihood ratios in medical school, these are links to some useful definitions.
This is some excellent, interesting, and valuable work. I am stunned you couldn't find a home for it in the literature. Doubly stunned when I see the trivial nature of 80% of the material I see in the refereed literature.
In many ways, the teaching of probabilistic diagnosis in medical schools is treated as an opportunity to teach Bayes' rule, sensitivity, specificity, and likelihood ratios. For many diseases the pre-test probability of disease is more important than the test result, yet estimation of pre-test probability is de-emphasized. Decision making would be improved by teaching the logistic regression model instead, where multiple continuous test results can be easily handled and the pre-test variables get all the emphasis they deserve. We need to move past oversimplified diagnostic impact summaries such as sensitivity, specificity, and LRs and think multivariably. And keep in mind that LRs, while more useful than sensitivity and specificity, are still oversimplifications that turn continuous test outputs into binary, all-or-nothing variables.
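A minimal sketch of the kind of model this comment describes, with entirely hypothetical predictors (age, a symptom indicator, and a continuous biomarker) and simulated data. It assumes scikit-learn and is meant to show the shape of the approach, not a validated diagnostic model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: age in years, a symptom (0/1), and a continuous
# biomarker that is kept continuous rather than dichotomized.
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 80, n)
symptom = rng.integers(0, 2, n)
biomarker = rng.normal(1.0, 0.3, n)

# Simulated disease status, driven largely by the "pretest" variables
logit = -6 + 0.06 * age + 0.8 * symptom + 1.5 * biomarker
disease = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, symptom, biomarker])
model = LogisticRegression().fit(X, disease)

# One probability for a new patient (65 years old, symptomatic,
# biomarker 1.4) -- no detour through sensitivity, specificity, or LRs.
print(model.predict_proba([[65, 1, 1.4]])[0, 1])
```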