Assessing LLM Hallucinations And The Reliability Of Using LLMs For Automated Hallucination Detection
Abstract
Large language models (LLMs) are increasingly deployed to assess, diagnose, and predict clinical symptoms and outcomes from textual data. However, prior work has shown that LLMs are susceptible to hallucinations. To date, it remains unclear whether, how, and why such hallucinations arise when LLMs are applied to clinical data situated in diverse real-world contexts. In this work, we systematically quantify the prevalence and types of hallucinations across different LLMs and examine the utility of LLMs-as-judges (LLJs) for automated hallucination assessment using real-world clinical data. Specifically, we analyze transcripts obtained from both community controls and populations at higher risk of developing schizophrenia. In ground-truth hallucination analyses, we find that hallucination rates vary widely, from 0.3% to 76.3%, depending on transcript type, prompting strategy, and the LLM used for classification. Diagnostic hallucinations are the most prevalent subtype, accounting for up to 50.4% of observed hallucinations. In automated hallucination detection, LLJs demonstrate reasonable alignment with human raters, achieving up to 65.7% agreement on hallucination presence and 64.7% agreement on hallucination subcategorization. Finally, we find that hallucinations most commonly arise when there is insufficient contextual grounding.
Citation Information
@article{jiaeecheong2026,
title={Assessing {LLM} Hallucinations And The Reliability Of Using {LLMs} For Automated Hallucination Detection},
author={Jiaee Cheong and Diana S. Chen and William Leung and Cody Chou and Cheryl M. Corcoran and Sinead Kelly and Carrie E. Bearden and Guillermo Cecchi and Justin T. Baker and John M. Kane and Scott W. Woods and Martha E. Shenton and Barnaby Nelson and John Torous},
journal={Research Square},
year={2026},
doi={10.21203/rs.3.rs-9394250/v1}
}