Research Article 2026-04-20 posted v1

Assessing LLM Hallucinations and the Reliability of Using LLMs for Automated Hallucination Detection

Jiaee Cheong, Harvard Medical School
Diana S. Chen, Harvard College
William Leung, Harvard College
Cody Chou, Harvard College
Cheryl M. Corcoran, Icahn School of Medicine at Mount Sinai
Sinead Kelly, Brigham and Women's Hospital
Carrie E. Bearden, University of California
Guillermo Cecchi, Icahn School of Medicine at Mount Sinai
Justin T. Baker, Brigham and Women's Hospital
John M. Kane, Donald and Barbara Zucker School of Medicine
Scott W. Woods, Yale University
Martha E. Shenton, Massachusetts General Hospital, Harvard Medical School
Barnaby Nelson, Orygen
John Torous, Harvard Medical School

Abstract

Large language models (LLMs) are increasingly deployed to assess, diagnose, and predict clinical symptoms and outcomes from textual data. However, prior work has shown that LLMs are susceptible to hallucinations. It remains unclear whether, how, and why such hallucinations arise when LLMs are applied to clinical data situated in diverse real-world contexts. In this work, we systematically quantify the prevalence and types of hallucinations across different LLMs and examine the utility of LLMs-as-judges (LLJs) for automated hallucination assessment on such data. Specifically, we analyze transcripts obtained from both community controls and populations at higher risk of developing schizophrenia. In the ground-truth hallucination analyses, we find that hallucination rates vary widely, from 0.3% to 76.3%, depending on transcript type, prompting strategy, and the LLM used for classification. Diagnostic hallucinations are the most prevalent subtype, accounting for up to 50.4% of observed hallucinations. In automated hallucination detection, LLJs show reasonable alignment with human raters, achieving up to 65.7% agreement on hallucination presence and 64.7% agreement on hallucination subcategorization. Finally, we find that hallucinations most commonly arise when contextual grounding is insufficient.
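
The two headline quantities in the abstract, a hallucination rate and the agreement between an LLJ and human raters, can be illustrated with a minimal sketch. It assumes binary per-output labels and simple percent agreement; the abstract does not state how either figure was computed, so the function names and the toy labels below are hypothetical and are not data from the study.

```python
# Minimal sketch under stated assumptions: binary labels per model output,
# where True means "contains a hallucination". Not the authors' pipeline.

def hallucination_rate(labels):
    """Percentage of outputs labelled as containing a hallucination."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return 100.0 * sum(labels) / len(labels)


def percent_agreement(human_labels, judge_labels):
    """Percentage of items on which the human rater and the LLJ agree."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("label lists must be non-empty")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return 100.0 * matches / len(human_labels)


# Toy labels for eight hypothetical model outputs (illustrative only).
human = [True, False, False, True, False, True, False, False]
judge = [True, False, True, True, False, False, False, False]

rate = hallucination_rate(human)             # 3 of 8 outputs flagged
agreement = percent_agreement(human, judge)  # 6 of 8 labels match

print(f"hallucination rate: {rate:.1f}%")
print(f"human-LLJ agreement: {agreement:.1f}%")
```

With these toy labels the sketch reports a 37.5% hallucination rate and 75.0% agreement.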

Citation Information

@article{jiaeecheong2026,
  title={Assessing LLM Hallucinations and the Reliability of Using LLMs for Automated Hallucination Detection},
  author={Jiaee Cheong and Diana S. Chen and William Leung and Cody Chou and Cheryl M. Corcoran and Sinead Kelly and Carrie E. Bearden and Guillermo Cecchi and Justin T. Baker and John M. Kane and Scott W. Woods and Martha E. Shenton and Barnaby Nelson and John Torous},
  journal={Research Square},
  year={2026},
  doi={https://doi.org/10.21203/rs.3.rs-9394250/v1}
}