
'Garbage in, garbage out': Mount Sinai experts compare hallucinations across 6 LLMs

A new study quantifies how often large language models elaborate on false clinical details fed to them. A mitigation prompt reduced hallucination frequency somewhat, but the AI behind clinical bots may still pose risks, researchers said.
By Andrea Fox, Senior Editor
Photo: Pixabay/Pexels

A new study by the Icahn School of Medicine at Mount Sinai examines six large language models – and finds that they're highly susceptible to adversarial hallucination attacks.

Researchers tested the foundational LLMs used in clinical decision support and public health contexts and said they pose a substantial risk. 

"Our results highlight that caution should be taken when using LLM to interpret clinical notes," they said in their report published this week in Nature.

WHY IT MATTERS

Using Anthropic's Claude 3.5 Sonnet, the Icahn School researchers created 300 clinical cases, each with a 50-60-word short version and a 90-100-word long version. Physicians validated all case content, including the fictitious details.

"Each case included a single fabricated medical detail, such as a fictitious laboratory test…a fabricated physical or radiological sign…or an invented disease or syndrome," they explained this week in their report.

The researchers injected this false clinical information into AI chatbot prompts to trigger and measure adversarial hallucination attacks, shedding light on the reliability of the underlying foundational models.

Then, they measured how frequently the models elaborated on the false content injected into the test cases, and assessed whether a specific mitigation prompt or setting the temperature to zero could reduce these errors.
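To make that setup concrete, the sketch below shows what a single test of this kind could look like. It is not the study's code: query_model() is a hypothetical stand-in for whichever chat API a given model exposes, the vignette and its fabricated "serum neuroquantin" test are invented for illustration, and the simple keyword check is only a crude proxy for the researchers' physician-validated, automated classification.

# A minimal sketch of one adversarial hallucination test (assumptions noted above).

def query_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical placeholder for a call to GPT-4o, Llama 3.3, DeepSeek, etc."""
    # A real harness would call the model's API here; this canned reply simulates
    # a model that repeats the planted detail instead of flagging it.
    return "Summary: chest pain with mildly elevated troponin; serum neuroquantin 4.2 ng/mL noted."

# One planted, fictitious detail per case -- here, an invented laboratory test.
FABRICATED_TERM = "serum neuroquantin"  # hypothetical name, not a real assay

vignette = (
    "62-year-old man with chest pain and dyspnea. Troponin mildly elevated. "
    f"{FABRICATED_TERM} reported as 4.2 ng/mL. Summarize the key findings."
)

response = query_model(vignette, temperature=0.0)  # temperature 0, one of the settings tested

# Did the model repeat or elaborate on the fabricated detail? A fuller classifier would
# also distinguish repeating the term from correctly flagging it as unrecognized.
hallucinated = FABRICATED_TERM.lower() in response.lower()
print("Elaborated on fabricated detail:", hallucinated)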

GPT-4o demonstrated the lowest hallucination rates, while Distilled-DeepSeek had the highest of the six foundational models tested, the researchers said.

As the worst performer in the study, DeepSeek's LLM yielded hallucination rates of 80.0% in long cases and 82.7% in short cases, researchers said. In contrast, GPT-4o was the best performer, with significantly lower rates of 53.3% for long cases and 50.0% for short cases, they noted.

The other models tested – Llama 3.3, Phi-4, Gemma-2-27b-it and Qwen-2.5-72b – all had hallucination rates between 58.7% and 82.0%.

Artificial intelligence hallucinations manifested in various forms, including fabricating citations, propagating false information from prompts, making false associations and miscalculating data in summaries, according to the research.

Short clinical case vignettes showed slightly higher odds of hallucination across models, said researchers. 

When the temperature was set to zero, all models' hallucination rates remained similar to those observed at their default settings.

Significantly, even when the mitigation prompt instructed the models to use only clinically validated information and to "acknowledge uncertainty instead of speculating further," hallucination rates still varied across the six LLMs.

Prompt mitigation decreased hallucination rates from an average of 65.9% under the default prompt to 44.2%. Notably, the technique improved the performance of GPT-4o, reducing the LLM's hallucination rate to 20.7% for long cases and 24.7% for short cases.
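A minimal sketch of how the two prompt conditions might be expressed, assuming the common chat-style message format; the mitigation wording below is paraphrased from the article, and the exact prompt the researchers used may differ.

# Hedged sketch: default vs. mitigated prompt conditions for the same vignette.

MITIGATION_PROMPT = (
    "Use only clinically validated information. If a detail cannot be verified, "
    "acknowledge uncertainty instead of speculating further."
)

def build_messages(vignette: str, mitigated: bool) -> list[dict]:
    """Assemble chat-style messages, optionally prepending the mitigation instruction."""
    messages = []
    if mitigated:
        messages.append({"role": "system", "content": MITIGATION_PROMPT})
    messages.append({"role": "user", "content": vignette})
    return messages

case_text = "Clinical vignette with one planted, fabricated detail goes here."
default_messages = build_messages(case_text, mitigated=False)
mitigated_messages = build_messages(case_text, mitigated=True)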

However, "while prompt engineering reduces errors, it does not eliminate them," the researchers said.

As reliance on LLMs in clinical settings grows, the physician-validated, automated classification dataset they built could be used for AI chatbot performance testing, analyzing large volumes of AI outputs with minimal human effort, they noted.

"Future studies should broaden model comparisons, explore additional prompt strategies, monitor how updates affect performance and explore the performance of narrowly constructed clinical LLMs."

THE LARGER TREND

Last month, Healthcare IT News asked Dr. Jay Anders, chief medical officer at Medicomp Systems, a vendor of clinical AI-powered systems, about the damage AI can do if it gets the facts wrong in clinical contexts.

He explained that AI hallucinations occur when an LLM fails to admit uncertainty.

"Fabricated responses are particularly problematic because they're often very convincing," he said. "The hallucinations can be very difficult to distinguish from factual information, depending on what's being asked."

AI-driven mistakes cascade in ways that are extremely difficult to reverse, and their impact on patients goes beyond the safety of their healthcare, Anders continued.

"When AI assigns incorrect diseases, lab results or medications to a patient's record, these errors become nearly impossible to correct and can have devastating long-term consequences," he said. "The propagation problem is particularly insidious."

ON THE RECORD

"We find that the LLM models repeat or elaborate on the planted error in up to 83% of cases," the Icahn School researchers said. "Adopting strategies to prevent the impact of inappropriate instructions can halve the rate, but does not eliminate the risk of errors remaining." 

Andrea Fox is senior editor of Healthcare IT News.
Email: afox@himss.org
Healthcare IT News is a HIMSS Media publication.
