Doctors vs Chatbot – The First Round
Critics of the JAMA-IM Chatbot vs Docs study miss the forest for the trees. Here are my initial thoughts.
Before starting, I want to express thanks for the generous support Sensible Medicine has received. This newsletter will remain free of industry sponsorship. We hope that allows us to provide a more robust and open exchange of ideas. Thank you. JMM.
In the history of medicine there are before and after times. Before and after antibiotics. Before and after the Internet. A modest study published in JAMA-IM may indicate another before and after time.
How you feel about the study showing that ChatGPT outperformed doctors depends on whether you have used the large language model. Those who haven’t used ChatGPT will focus on the study’s limitations—which are substantial.
Sure, the study used responses of doctors writing on an online forum as a control. That’s a long way from your consultant at the World’s Greatest Hospital. But is it that far from the average doc or advanced practice clinician? Time and more data will speak to the control arm of this study.
I have used ChatGPT, and I am not surprised that doctor-judges favored the answers from the large language model. Shocking is the word that jumps to mind when you interact with this Chatbot.
John Ayers is a computational epidemiologist and associate professor at UC San Diego. He and his team designed a clever way to assess ChatGPT’s ability to answer healthcare queries.
Their surrogate marker for an active clinical encounter was the AskDoctors subreddit. The datapoints for this study were 195 public questions to which licensed doctors responded. The research team then entered each question into ChatGPT version 3.5.
Which response was better? To answer this question, the research team showed the responses to a separate panel of expert doctors, who graded them on three criteria: which was the better response overall, the quality of information provided, and empathy.
Evaluators preferred the ChatGPT response 79% of the time.
Chatbot responses were rated significantly higher in quality than those of the online doctors.
Judges gave a rating of good or very good quality of information nearly four times more often to the Chatbot responses.
The Chatbot was rated far more empathetic than the doctors, earning a rating of empathetic or very empathetic nearly 10 times more often.
There was no shortage of online critique of this study. The best came from Ben Recht, a machine-learning computer scientist at the University of California, Berkeley.
First, he noted that the “physician evaluators” are a subset of authors, which he felt was disqualifying since the comparison was a subjective evaluation. Second, Recht wrote that no one who goes to Reddit for medical advice expects a long empathetic response—that is not how things get upvoted. Third, he was especially put off by adding statistics to a subjective review. “As if adding some confidence intervals makes this farce into science.”
These are strongly worded but reasonable critiques. I normally stake out the skeptical/critical position on medical evidence. Here I am more positive, and see more forest than trees in this paper.
I laud the authors for their restraint. In the second sentence of the conclusion, they call for “further exploration of this technology in clinical settings.” They then call for randomized controlled trials. There were no causal inferences, no excessive spin. The authors also wrote an extensive, fair and candid limitations section.
I’d also defend the choice of the surrogate marker of doctors from Reddit. This is the first pitch in the first inning. A year ago, almost no one could have imagined Chatbot. Privacy concerns precluded use of actual patient data.
Lead author John Ayers also noted that doctors on Reddit are answering voluntarily and publicly, which creates a game of reputation. (Also, it's not as if the current medical literature isn't full of low-value surrogate markers: HbA1c, BNP, ejection fraction, and heart failure hospitalizations.)
The comparison here is not Chatbot vs in-person care from someone's private doctor but Chatbot vs online responses from a practitioner who faces a daily deluge of patient requests. I am a specialist and spend increasing amounts of my time on patient messages. A primary care clinician surely faces manyfold more than I do.
Look at the responses in the paper. Yes, they are wordy, but this is from an earlier version of Chatbot. Future iterations will improve in accuracy and succinctness.
Another line of criticism, easily rebutted, is the notion that a doctor who has a relationship with a patient would be better able to answer these questions than any Chatbot.
While that may be true in theory, in practice, in 2023, I would argue that scant few people have relationships with their primary care clinician. And patient-doctor relationships are in steep decline. Exhibit A: the rise in walk-in clinics.
I, too, wish the authors had written this as a subjective review rather than a scientific paper. Recht is right: there was no need for confidence intervals or p-values. But this does not diminish the significance of the observations. It doesn't matter whether the Chatbot responses were statistically significantly better. Even if the Chatbot merely came close, it would be worthy of further study.
I stand by my contention that this is an inflection point. Chatbots can pass medical exams. Chatbots do not fatigue; they do not get annoyed; they can scan wordy requests, and answer at all hours. And they draw from the vastness of knowledge on the Internet.
When we were writing chart notes in pen, sending letters via slow-mail, putting nickels into machines to copy journal articles, we could not have imagined the world of email, electronic health records, computers-in-our-pockets and instant communication with a global audience.
We had no idea that one day soon we would be able to look up facts in seconds and learn to do procedures from YouTube videos.
This is the frame with which I see large language models. The concern is not whether doctors will be replaced with AI, but how best to use this new tool to help people.