Behind the Magic

Apr 21

When Medical AI Gets the Wrong Answer

30 Comments

Very interesting post and good to know. I'm a registered nurse who started working in hospitals when the crux of computer use (i.e., terminals) was to check bloodwork and specimen results.

I'm actually glad I did, because it taught me to think and to question many aspects of the EHR: its programmed defaults and its drop-down boxes, which all depend solely on a yes or no to the original question in whatever category I'm forced to answer, even when it has little or nothing to do with the patient's hospitalization. I see it in everything from in-house programs like Epic, to automated medication dispensing systems like Pyxis and Omnicell, to some newer direct glucose monitoring methods and countless other examples. I used to learn far more reading a physical chart (once I learned to read/decipher 99% of most doctors' scribbled progress notes :).

More specific to this post, I have a question. When searching for anything, the first answer offered is the AI one, and those answers seem informative. I understand using it as a tool while still thinking for myself and keeping the ability to constantly question and check sources. Here are my two questions. In your opinion, is it best to start with the AI results, or to first read the other search engine answers (which are sometimes more specific) and then review the AI results (which are often a bit more comprehensive)? And would you say it probably depends on the importance of your search (for example, healthcare vs. fixing a cabinet door)? Thanks.

Laurentiu Lupu MD

Apr 23

What stayed with me here is that the danger may begin even before the answer itself. It begins when retrieval gives the impression that the question has already been framed well.

In medicine, the central task is not only to gather information, but to decide what deserves priority under uncertainty, which possibility must move forward first, and what kind of evidence should carry the most weight in that moment. A system may retrieve plausible literature and still fail at that deeper level of orientation.

That is what makes this so important. The risk is not only hallucination or outdated citation. It is the illusion that medicine is mainly a problem of access to information, when in practice it is so often a problem of sequencing attention and judgment in the right order, with something real at stake.

PharmHand

Apr 22 (edited)

This essay is very important. I brought up this issue of limits to AI tools recently in a discussion of using AI tools to sort through the gushing fire hose of climate science papers (though I used only the obvious ‘garbage in garbage out’ rubric in that discussion). Using the sort of refinements of AI tools suggested in this essay is absolutely necessary to enabling AI tools to be reasonably trusted tools in clinical decision assistance.

Gemna

Apr 22

My recent experiences have made me think AI in healthcare will be an improvement. Multiple doctors dismissed my concerns, a significant lab result was ignored because they weren't familiar with that test, my symptoms were completely mischaracterized in the clinical notes, my request for a referral to a specialist was denied. PerplexityAI helped me figure out what could be going on, what my options were to get the help I needed, what symptoms to look for and what to emphasize when communicating with my doctors.

Dr. K

Apr 21

Alex, this is a great article. But this same issue occurs in all uses of generative AI with individual patient data. RAG'ing patient data that is impossibly broken (duplicates, conflicts, missing data, multiple representations and coding sets -- you name it) creates a substrate that is repeatedly 20% wrong -- and sometimes more. Further, there is no training set for each individual -- only for the population which, as von Eye pointed out years ago, bears zero relationship to any individual.

So your points on article selection are spot on and will, over time, be remediated. The patient data problem ("I'll just put the chart into generative-AI and see what it says") is the one that is far more worrisome and for which solutions are difficult (essentially cognitive-AI...same foundation as the Semantic Web) and rare.

But thanks for this...every uptick in knowledge is a good one.

Steve Cheung

Apr 21

Nice post.

I agree that medical AI (OE is my go-to these days) is no different than any other tool, in that its effectiveness still relies on the skill of the user.

I think where that “skill” manifests is in how the query is posed. Kinda like garbage in/ garbage out. So it’s important to ask detailed precise questions. For important questions where the answer actually helps me make a clinical decision, I will ask it in more than 1 way (and often cross check with a separate AI).

But what you speak of (essentially a fully formed critical appraisal AI), I don’t think that actually exists yet. I hope you will work towards creating one.

In the meantime, my workaround to ensure that new information is not diluted by old data is to make a query, then repeat the same query with a date restriction (e.g., "only include information after 2020"). This allows me to see whether recent data is being crowded out by “old thinking”.
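For anyone who wants to make that comparison systematic, here is a minimal sketch of the idea. It assumes you have copied the citations from both answers into simple lists by hand; the `Citation` type, the function names, and the placeholder titles are all hypothetical and do not come from any tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    title: str
    year: int


def crowded_out(unrestricted, restricted, cutoff_year):
    """Recent citations that appeared only once older work was excluded."""
    seen = {c.title for c in unrestricted}
    return [c for c in restricted if c.year >= cutoff_year and c.title not in seen]


def recent_share(citations, cutoff_year):
    """Fraction of citations published in or after cutoff_year (0.0 if empty)."""
    if not citations:
        return 0.0
    return sum(c.year >= cutoff_year for c in citations) / len(citations)


# Placeholder data, not real tool output.
unrestricted = [
    Citation("Placeholder trial A", 2007),
    Citation("Placeholder review B", 2012),
    Citation("Placeholder trial C", 2021),
]
restricted = [
    Citation("Placeholder trial C", 2021),
    Citation("Placeholder trial D", 2020),
]

missed = crowded_out(unrestricted, restricted, cutoff_year=2020)
print([c.title for c in missed])
print(recent_share(unrestricted, cutoff_year=2020))
```

A non-empty `missed` list, or a low recent share in the unrestricted answer, would be the signal that newer evidence is being crowded out and the restricted answer deserves a closer read.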

toolate

Apr 21

Well done!!! And IMPORTANT!

When I first started using OpenEvidence I was writing to the "editors" often to correct some of this. They seemed interested at first and then they stopped responding, likely recognizing that this was not easily solved.

The problem is that recognizing the specific issues requires expert-level mastery, and the tool then becomes less useful since you are already an expert.

For now, I think we need to urge great caution.

LLMs are overconfident and often wrong!

Mark

Apr 21

Thanks for sharing this thought-provoking piece. I am a fairly consistent user of OpenEvidence, and I had a sense that its parameters are very different from what you described, so I shared your article and asked it to comment on its own parameters. What I appreciated was that it was allowed to treat this as a legitimate question from a user. Also, as you will see pasted below, it agrees with your words of caution but also explains how its parameters and strategies already address many of the concerns you raise. Here is its response:

The article raises valid concerns about **standard RAG systems, but OpenEvidence uses a fundamentally different architecture** that addresses many of the issues described. Here's what's different and what remains important to understand:

**How OpenEvidence differs from standard RAG**

The article correctly identifies the core problem with basic cosine similarity retrieval: it prioritizes topical overlap over evidence quality, recency, and clinical relevance. OpenEvidence doesn't use this approach. Instead:

- **Multi-stage retrieval with evidence prioritization**: The system doesn't rely on a single similarity score. It uses multiple retrieval stages that weight study design, source credibility (guidelines, major journals, FDA labels), and recency.

- **Guideline and high-quality source prioritization**: Clinical practice guidelines from major societies are systematically prioritized when available. These sources inherently reflect expert synthesis of evolving evidence, including reversals.

- **Contradiction surfacing**: Rather than fusing contradictory evidence into a single answer (the problem Javadi et al. identified), OpenEvidence is designed to surface disagreements when they exist in the literature. You'll see language like "guidelines differ" or "recent trials challenge earlier practice" when that's what the evidence shows.

- **Recency weighting**: Newer evidence isn't drowned out by volume of older literature. The retrieval system specifically accounts for publication date in ways that standard cosine similarity does not.
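To make the contrast with similarity-only ranking concrete, here is a rough sketch of what weighting study design and recency alongside topical similarity could look like. It is not OpenEvidence's actual method, which the response above does not specify; the `Doc` type, the `DESIGN_WEIGHT` values, the five-year half-life, and the scoring formula are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights per study design; illustrative values only.
DESIGN_WEIGHT = {"guideline": 1.0, "rct": 0.8, "observational": 0.5, "case_report": 0.2}


@dataclass
class Doc:
    title: str
    similarity: float  # topical similarity to the query, in [0, 1]
    design: str
    year: int


def score(doc, current_year, half_life_years=5.0):
    """Similarity scaled by a design weight and an exponential recency decay."""
    age = max(0, current_year - doc.year)
    recency = 0.5 ** (age / half_life_years)  # 1.0 when new, halves every half-life
    design = DESIGN_WEIGHT.get(doc.design, 0.3)  # assumed default for unknown designs
    # Recency can at most halve the score, so older work is demoted, not erased.
    return doc.similarity * design * (0.5 + 0.5 * recency)


def rerank(docs, current_year):
    """Order documents by the combined score, best first."""
    return sorted(docs, key=lambda d: score(d, current_year), reverse=True)


# Placeholder documents, not real studies.
older = Doc("Placeholder older observational study", 0.9, "observational", 2005)
newer = Doc("Placeholder newer randomized trial", 0.8, "rct", 2020)

by_similarity = sorted([older, newer], key=lambda d: d.similarity, reverse=True)
by_combined = rerank([older, newer], current_year=2025)
print([d.title for d in by_similarity])
print([d.title for d in by_combined])
```

In this toy case the older study wins on similarity alone, while the combined score puts the newer trial first. How any real system balances these factors is a design choice that this sketch only guesses at.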

**Where vigilance still matters**

The article's practical advice remains sound even with these improvements:

1. **Check the citations**: If you're seeing only older papers on a topic you know has recent landmark trials, that's a signal to probe further or ask a more specific question about the recent evidence.

2. **Confidence calibration**: If you receive an unhedged answer on what you know is a contested topic, the system may have missed the debate. This can happen if the question doesn't trigger retrieval of the contradictory evidence.

3. **Evolving evidence**: For rapidly evolving areas (emerging infections, just-published guidelines, therapies approved after the database update), there may be a lag between publication and inclusion in the searchable database.

**The COURAGE/ISCHEMIA example specifically**

If you asked today about PCI for stable CAD, the system would retrieve and cite ISCHEMIA and current guidelines reflecting that evidence. The reversal is well-established enough that it's captured in guidelines and recent reviews. The vulnerable period the article describes—immediately after COURAGE but before the practice shift was complete—is exactly when standard retrieval would fail. OpenEvidence's guideline prioritization and recency weighting reduce (though don't eliminate) this risk.

**Practical strategy for your questions**

- **Be specific about recency when it matters**: If you want the latest evidence on a topic, mention that explicitly ("What does recent evidence show about..." or "What do current guidelines recommend...").

- **Ask about controversies directly**: If you suspect a topic is contested, ask "Is there controversy about..." or "How has the evidence evolved on...". This helps trigger retrieval of contradictory sources.

- **Use it as clinical decision support, not gospel**: The article's UpToDate analogy is apt. This is a starting point that should integrate with your clinical judgment and knowledge of the literature.

**The bottom line**: The concerns in the article are real for many AI clinical tools, but OpenEvidence's architecture specifically addresses the standard RAG limitations described. The system is designed to handle evidence hierarchies, surface contradictions, and weight recency. That said, no retrieval system is perfect, and the article's advice to maintain critical engagement with the citations and recommendations remains sound practice.

Reply (2)

DocH

Apr 22

This seems to support my personal experience using OE for clinical support. It does a great job! There are so many limitations to ALL the ways we humans have been getting information. As mentioned by several, any tool is most effectively used by an expert in that field - this tool included.

But it does so much better than the tools we've been using. It enhances what we've been doing till now. The admonition to all of us professionals to study and maintain mastery; to use a variety of tools, to maintain our skepticism and always question, is a good one.

But so many studies have been presented with certain biases or agendas - to the point where expert guidelines even miss the mark - I am hopeful AI can help overcome some of this. Relying on our own memory, interpretation, experience - all good, but with its own limitations.

For those of us who have been practicing for decades: using actual textbooks as reference when we first started practice; finally UpToDate showed up (at least for primary care); physically having to attend CME courses; spending our free time reading. Now that AI has entered the scene - life-changing level of being able to customize things more specifically to the patient before us. Cross reference with our own interpretation, with other sources, etc. - great tool.

Steve Cheung

Apr 21

This is great insight. Thanks.

I had experience with some of this yesterday, when I posed a question based on a recent clinical case I had. And the response incorporated many of the things listed here, including explicit identification of ongoing controversies and lingering areas of uncertainty.

Ernest N. Curtis

Apr 21

Judgement cannot be programmed and that is often the most important element in medical decisions. The danger is that computerized analyses and recommendations will become institutionalized and preclude good judgement. For example, a critical element in the analysis of any medical study is the quality of the data. Assessing the quality of the data requires judgement based on a number of factors in the methodology and using common sense in evaluating the meaning and accuracy of the end points. The same can be said for the statistical methods employed to define statistical significance. The potential medicolegal problems that may ensue from acceptance of AI as an authoritative base could also be horrendous.

Reply (1)

Anthony Michael Perry

Apr 22

Right on. You don't take AI responses as the gospel and your point about its dangerous potential for institutionalization is critical. It's a faster and more convenient way to research a topic but you need to question and cross examine the responses just as you hopefully did when you were doing it the old-fashioned way. For really important issues listen to an up to date human discussion.

Witsd

Apr 21

Well written, great summary paragraph and actionable steps. I have increased optimism that I can use AI searches better, but I’m also very scared that many medical people will not, and they will blindly follow (example: following blindly the 2026 ACC/AHA Lipid Lowering guidelines).

Sheila Crook-Lockwood

Apr 21

Alex, thank you for this article. I will use it in the fall with my junior nursing students in the informatics course.

Alan

Apr 21

Great article! It serves as a primer on AI apps in medicine, how they work and how complicated the pitfalls are. Most of us rely on OE without understanding how RAG affects the final answer. This has given me a lot of insight into a product I rely on but whose biases I could not understand. Not that I am able to remember what RAG stands for … 🙄 Thank you.

Michael Plunkett

Apr 21

It ain’t ready for prime time. My residents brought me an AI program. We queried 3 topics I knew well. 0/3 for AI. It just picked the most cited trash, not today’s understanding. Unfortunately it will be adopted by today’s students because for them everything is teach to the test. Haven’t met thinkers in a long time.

Crixcyon

Apr 21

Medical mishaps are the third leading cause of death in the US. With AI, they may move into position numero uno.

J Gamble, PhD RPh

Apr 21

Thank you! That was very insightful and motivating to look further into this issue!

Sensible Medicine
