Prediction explainer: what should clinicians consider when assessing a new clinical prediction model?
Guest post from two experts in prediction.
Sensible Medicine is excited to publish this brief explainer on prediction and prediction scores. I connected with the two authors on Twitter. This sort of educational content is exactly what we aim for here at Sensible Medicine.
Here are the Authors’ Bios:
Adrian Barnett is a professor of statistics who has worked for over 29 years in health and medical research. His current research concerns improving statistical practice to reduce research waste.
Nicole White is a statistician and Senior Research Fellow at Queensland University of Technology. She collaborates with academics, clinicians and decision-makers to improve health services delivery. Her independent research aims to improve research quality by bridging statistical and clinical expertise.
Clinical prediction models aim to predict a patient’s health risk and hence support clinical decision-making. Using information about a patient’s current health, predicted risks can be created for diagnosis and prognosis. The Framingham Risk Score is an example of an established prediction model, which estimates 10-year cardiovascular risk using a person’s age, smoking status, systolic blood pressure, and total and HDL cholesterol [1]. Early warning scores are another example, which are used in hospitals to predict clinical deterioration based on a patient’s vital signs [2].
The recent rise of big data has propelled the pursuit of personalized medicine. The promise is that well-constructed and validated prediction models will lead to diseases being diagnosed earlier and to treatment reaching the right patients, in the right place, at the right time. Efforts to realise this potential have led to many new prediction models, with an average of 4,200 publications related to clinical prediction modelling appearing weekly (MEDLINE search).
New “breakthrough” models can receive wide media attention. A recent model that used MRI to predict a person’s thoughts was featured by CNN, Fox News, The Atlantic and Vox [3]. The Vox headline was “Mind-reading technology has arrived”, although the study had just seven participants and the error rate in the three test participants was 0.92 to 0.94, where zero is a perfect score and one is the worst possible score.
Many peer-reviewed clinical prediction models fall short of the standards needed to consider their use in clinical practice. In some cases, models that appear to perform well may pose risks to patients if used to make clinical decisions. We highlight a few important considerations when reading about the latest prediction model in your clinical area.
A model’s accuracy is ideally assessed by comparing what the model predicted would happen with what actually happened, for example, predicted deaths following admission to hospital compared with actual deaths. This assessment should be done using “external validation”, where an existing model is applied to new patients whose data were not used to develop the model. External validation is the ideal validation to look for because it shows whether the model is transportable to new situations, and you might be considering transporting the model to your own patients. Unfortunately, external validations are rare. For example, a review of 71 published models found only three were external validations [4]. Instead, most published models use “internal validation”, where the same patients used to develop the model are used to validate its predictions. Internal validation is tempting because no new data are required. However, it is likely to overestimate the model’s accuracy compared with external validation.
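The gap between internal and external validation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the cohorts, the 30 noise predictors, and the one-predictor "model" are all hypothetical and invented for illustration, not taken from any of the reviewed studies. Because the predictors are unrelated to the outcome by construction, any accuracy above 0.5 in the development patients is pure optimism.

```python
import random


def make_cohort(n_patients, n_predictors, rng):
    """Simulate patients whose predictors are pure noise, unrelated to outcome."""
    predictors = [[rng.gauss(0, 1) for _ in range(n_predictors)]
                  for _ in range(n_patients)]
    outcomes = [rng.random() < 0.5 for _ in range(n_patients)]
    return predictors, outcomes


def accuracy(rule, predictors, outcomes):
    """Share of patients classified correctly by a one-predictor rule."""
    index, flip = rule
    correct = 0
    for row, outcome in zip(predictors, outcomes):
        predicted = (row[index] > 0) != flip  # flip reverses the rule's direction
        correct += predicted == outcome
    return correct / len(outcomes)


def develop_model(predictors, outcomes):
    """Pick whichever predictor and direction looks best in the development data."""
    n_predictors = len(predictors[0])
    candidates = [(i, flip) for i in range(n_predictors) for flip in (False, True)]
    return max(candidates, key=lambda rule: accuracy(rule, predictors, outcomes))


rng = random.Random(2023)                   # fixed seed so the run is repeatable
dev_x, dev_y = make_cohort(100, 30, rng)    # hypothetical development cohort
new_x, new_y = make_cohort(5000, 30, rng)   # hypothetical new patients

model = develop_model(dev_x, dev_y)
apparent_accuracy = accuracy(model, dev_x, dev_y)     # same patients as development
new_patient_accuracy = accuracy(model, new_x, new_y)  # patients the model never saw

print(f"Accuracy in development patients: {apparent_accuracy:.2f}")
print(f"Accuracy in new patients:         {new_patient_accuracy:.2f}")
```

In simulations like this the accuracy in the development patients tends to sit noticeably above 0.5, while the accuracy in new patients stays close to a coin toss, which is the overestimation that validating on the development patients can hide.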
A prediction model’s accuracy will likely fall when applied to new patient cohorts because of differences in the patients’ demographics and healthcare systems. Validation results should therefore be considered in your own context, including acknowledging important differences between your patient population and the patients used to validate the model [5]. For example, the Framingham Risk Score was developed in the USA and had poor accuracy in Aboriginal and Torres Strait Islanders [6].
Few prediction models are perfectly accurate, and almost all will produce false positives (healthy patients labelled as diseased) and false negatives (diseased patients labelled as healthy). The costs and consequences of false negatives and positives need to be carefully weighed up [7]. For example, a model that assesses patients presenting to an emergency department with suspected acute coronary syndrome (ACS) was designed to keep the proportion of patients with ACS wrongly labelled as “low risk” and sent home (false negatives) to just 1% [8]. False positives are of less consequence here because they involve admission to hospital and further testing, and whilst these have costs and risks, they are not considered as important as false negatives.
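To make the trade-off concrete, here is a minimal sketch of how the choice of risk threshold shifts errors between false positives and false negatives. The eight predicted risks, the outcomes, the two thresholds, and the relative costs (a false negative counted as ten times worse than a false positive) are all hypothetical values chosen for illustration, not figures from the ACS study.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


def count_errors(risks, outcomes, threshold):
    """Label patients at or above the threshold as high risk and tally the results."""
    tp = fp = tn = fn = 0
    for risk, diseased in zip(risks, outcomes):
        flagged = risk >= threshold
        if flagged and diseased:
            tp += 1
        elif flagged and not diseased:
            fp += 1
        elif not flagged and diseased:
            fn += 1
        else:
            tn += 1
    return ErrorCounts(tp, fp, tn, fn)


def total_cost(counts, cost_false_neg, cost_false_pos):
    """Weigh the two kinds of error by how much each one matters."""
    return counts.false_neg * cost_false_neg + counts.false_pos * cost_false_pos


# Hypothetical predicted risks and true disease status (1 = diseased).
risks = [0.02, 0.05, 0.10, 0.20, 0.35, 0.50, 0.70, 0.90]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

cautious = count_errors(risks, outcomes, threshold=0.15)  # flags many patients
strict = count_errors(risks, outcomes, threshold=0.60)    # flags few patients

# Assumed costs: a missed case is treated as ten times worse than a false alarm.
cautious_cost = total_cost(cautious, cost_false_neg=10, cost_false_pos=1)
strict_cost = total_cost(strict, cost_false_neg=10, cost_false_pos=1)

print(cautious, cautious_cost)
print(strict, strict_cost)
```

With these made-up numbers the lower threshold misses no diseased patients at the price of one false alarm, while the higher threshold avoids false alarms but sends two diseased patients home. Which is preferable depends entirely on the costs assigned to each error.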
Larger sample sizes will give a better picture of a model’s true accuracy. Unfortunately, many models are developed based on small sample sizes. A review found that most prediction models were developed in just over 100 patients [9]. It is hard to get a clear picture of the accuracy of a prediction model from such a small sample. Even for large sample sizes some diseases may be too difficult to predict, and this is often the case for rare diseases.
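A rough way to see why small samples blur the picture is to look at the uncertainty around an estimated accuracy measure. The sketch below computes a Wilson score interval for sensitivity at several sample sizes. The assumed sensitivity of 0.8 and the sample sizes are hypothetical, and the Wilson interval is our choice for illustration rather than a method used in the cited review.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a proportion such as sensitivity."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


ASSUMED_SENSITIVITY = 0.8            # hypothetical observed sensitivity
sample_sizes = [25, 100, 400, 1600]  # hypothetical numbers of diseased patients

intervals = {}
for n in sample_sizes:
    detected = round(ASSUMED_SENSITIVITY * n)  # diseased patients correctly flagged
    intervals[n] = wilson_interval(detected, n)
    lower, upper = intervals[n]
    print(f"{n:>5} diseased patients: 95% interval {lower:.2f} to {upper:.2f}")
```

The interval narrows as the number of patients grows, so the same observed sensitivity is far less informative when it comes from a few dozen patients than from many hundreds.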
We recommend that clinicians have a healthy scepticism about new prediction models. A new model may be genuinely helpful for your practice, but it needs to pass a number of strict quality checks before being used. Sadly, many models fail the minimum quality standard. A review of prediction models for COVID-19 found only 7 of the 606 published models were potentially useful for practice [10], and even the best prediction models might be no better than a clinician’s assessment [11].
When reading about a new prediction model, consider whether it would be genuinely helpful for your practice. Questions you should ask include: Can you easily get the data needed for the model? How does the model classify patients as high and low-risk? Will it be used as a strict diagnostic criterion or as a tool to encourage critical thinking [12]? What happens if your clinical assessment strongly disagrees with the model predictions? What happens if the model flags almost all your patients as high risk?
Given the number of prediction models that have been published, we also recommend thinking twice before building a new prediction model. Questions you should ask yourself include: Is there already a published model that you can use or adapt to suit your patients? What extra information could a new model provide beyond current clinical practice? If your proposed model has good accuracy, how will it complement current decision-making processes? Is it likely to be used in practice?
Twitter Handles
Nicole White @nicolem_white
Adrian Barnett @aidybarnett
References
[1] D’Agostino RB, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, et al. General Cardiovascular Risk Profile for Use in Primary Care. Circulation. 2008 Feb;117(6):743-53. Available from: https://doi.org/10.1161/circulationaha.107.699579.
[2] Alam N, Hobbelink EL, van Tienhoven AJ, van de Ven PM, Jansma EP, Nanayakkara PW. The impact of the use of the Early Warning Score (EWS) on patient outcomes: a systematic review. Resuscitation. 2014;85(5):587-94.
[3] Tang J, LeBel A, Jain S, Huth AG. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience. 2023 May;26(5):858-66. Available from: https://doi.org/10.1038/s41593-023-01304-9.
[4] Bouwmeester W, Zuithoff NPA, Mallett S, Geerlings MI, Vergouwe Y, Steyerberg EW, et al. Reporting and Methods in Clinical Prediction Research: A Systematic Review. PLOS Medicine. 2012 May;9(5):1-13.
[5] Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Medicine. 2023;21(1):70.
[6] Hua X, McDermott R, Lung T, Wenitong M, Tran-Duy A, Li M, et al. Validation and recalibration of the Framingham cardiovascular disease risk models in an Australian Indigenous cohort. European Journal of Preventive Cardiology. 2017 Jul;24(15):1660-9. Available from: https://doi.org/10.1177/2047487317722913.
[7] Parsons R, Blythe R, Cramb SM, McPhail SM. Integrating economic considerations into cutpoint selection may help align clinical decision support toward value-based healthcare. Journal of the American Medical Informatics Association. 2023 Mar;30(6):1103-13. Available from: https://doi.org/10.1093/jamia/ocad042.
[8] Parsonage WA, Milburn T, Ashover S, Skoien W, Greenslade JH, McCormack L, et al. Implementing change: evaluating the Accelerated Chest pain Risk Evaluation (ACRE) project. Medical Journal of Australia. 2017 Sep;207(5):201-5. Available from: https://doi.org/10.5694/mja16.01479.
[9] Bachmann LM, Puhan MA, ter Riet G, Bossuyt PM. Sample sizes of studies on diagnostic accuracy: literature survey. BMJ. 2006 Apr;332(7550):1127-9. Available from: https://doi.org/10.1136/bmj.38793.637789.2f.
[10] Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. BMJ. 2020;369.
[11] Soto-Mota A, Marfil-Garza BA, de Obeso SC, Rodriguez EJM, Vazquez DAC, Tadeo-Espinoza H, et al. Prospective Predictive Performance Comparison between Clinical Gestalt and Validated COVID-19 Mortality Scores. Journal of Investigative Medicine. 2022 Feb;70(2):415-20. Available from: https://doi.org/10.1136/jim-2021-002037.
[12] Blythe R, Naicker S, White N, Donovan R, Scott I, Mckelliget A, et al. Clinician preferences for clinical prediction model design in acute care settings: A case study of early warning score implementation. 2023 Aug. Available from: https://doi.org/10.31219/osf.io/87my4.
Validation on a holdout sample is often not what it seems. It requires very large samples to have sufficient precision, and lowers the sample size available for model development. It hides instability due to variable selection and makes it difficult to incorporate geographical and temporal trends into model building. In the majority of cases, strong internal validation through an honest resampling procedure is more accurate and more insightful. “Honest” here means that all supervised learning steps are repeated afresh within each resampling loop. 100 repeats of 10-fold cross-validation or several hundred bootstrap resamples are recommended. For elaboration on these issues see https://hbiostat.org/bbr/reg#internal-vs.-external-model-validation
Thanks for this. A very informative primer.
I’m amazed so many models get published without being vetted with a prospective validation cohort. If you’re only testing your model on the retrospective derivation cohort, you’re asking for self-fulfilling prophecies. Such models would be worse than useless, as they instil unwarranted confidence in a model’s prospective utility.