Research Article 2026-04-20 posted v1

Assessing LLM Hallucinations And The Reliability Of Using LLMs For Automated Hallucination Detection

Jiaee Cheong Harvard Medical School

Diana S. Chen Harvard College

William Leung Harvard College

Cody Chou Harvard College

Cheryl M. Corcoran Icahn School of Medicine at Mount Sinai

Sinead Kelly Brigham and Women's Hospital

Carrie E. Bearden University of California

Guillermo Cecchi Icahn School of Medicine at Mount Sinai

Justin T. Baker Brigham and Women's Hospital

John M. Kane Donald and Barbara Zucker School of Medicine

Scott W. Woods Yale University

Martha E. Shenton Massachusetts General Hospital, Harvard Medical School

Barnaby Nelson Orygen

John Torous Harvard Medical School


Abstract

Large language models (LLMs) are increasingly deployed to assess, diagnose, and predict clinical symptoms and outcomes from textual data. However, prior work has shown that LLMs are susceptible to hallucinations. To date, it remains unclear whether, how, and why such hallucinations arise when LLMs are applied to clinical data situated in diverse real-world contexts. In this work, we systematically quantify the prevalence and types of hallucinations across different LLMs and examine the utility of LLMs-as-judges (LLJs) for automated hallucination assessment using real-world clinical data from diverse contexts. Specifically, we analyzed transcripts obtained from both community controls and populations at higher risk of developing schizophrenia. Across ground-truth hallucination analyses, we find that hallucination rates vary widely, from 0.3% to 76.3%, depending on transcript type, prompting strategy, and the LLM used for classification. Diagnostic hallucinations are the most prevalent subtype, accounting for up to 50.4% of observed hallucinations. Across automated hallucination detection, LLJs demonstrate reasonable alignment with human raters, achieving up to 65.7% agreement on hallucination presence and 64.7% agreement on hallucination subcategorization. Finally, we identify that hallucinations most commonly arise when there is insufficient contextual grounding.
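The agreement figures above compare LLJ labels with human labels on the same items. The abstract does not state how agreement was computed, so the following is only a minimal sketch that assumes simple percent agreement (the share of items on which both raters assign the same label). The function name `percent_agreement` and the example labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def percent_agreement(human: Sequence[str], judge: Sequence[str]) -> float:
    """Share of items (in percent) on which two raters give the same label."""
    if len(human) != len(judge):
        raise ValueError("both raters must label the same items")
    if not human:
        raise ValueError("at least one labelled item is required")
    # Count the positions where the human label and the judge label match.
    matches = sum(h == j for h, j in zip(human, judge))
    return 100.0 * matches / len(human)


# Hypothetical labels for five LLM outputs (illustrative, not data from the paper).
human_labels = ["hallucination", "none", "none", "hallucination", "none"]
judge_labels = ["hallucination", "none", "hallucination", "hallucination", "none"]

agreement = percent_agreement(human_labels, judge_labels)
print(f"{agreement:.1f}% agreement")  # prints "80.0% agreement"
```

Percent agreement does not correct for matches that occur by chance, so a chance-corrected statistic such as Cohen's kappa may be preferable when one label dominates.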

Keywords

automated text analysis; clinical safety; large language models

Citation Information

@article{jiaeecheong2026,
  title={Assessing LLM Hallucinations And The Reliability Of Using LLMs For Automated Hallucination Detection},
  author={Jiaee Cheong and Diana S. Chen and William Leung and Cody Chou and Cheryl M. Corcoran and Sinead Kelly and Carrie E. Bearden and Guillermo Cecchi and Justin T. Baker and John M. Kane and Scott W. Woods and Martha E. Shenton and Barnaby Nelson and John Torous},
  journal={Research Square},
  year={2026},
  doi={10.21203/rs.3.rs-9394250/v1}
}

Jiaee Cheong et al. (2026). Assessing LLM Hallucinations And The Reliability Of Using LLMs For Automated Hallucination Detection. Research Square. https://doi.org/10.21203/rs.3.rs-9394250/v1

Jiaee Cheong, et al. "Assessing LLM Hallucinations And The Reliability Of Using LLMs For Automated Hallucination Detection." Research Square, 2026.

Jiaee Cheong, Diana S. Chen, William Leung, Cody Chou, Cheryl M. Corcoran, Sinead Kelly, Carrie E. Bearden, Guillermo Cecchi, Justin T. Baker, John M. Kane, Scott W. Woods, Martha E. Shenton, Barnaby Nelson, John Torous. Assessing LLM Hallucinations And The Reliability Of Using LLMs For Automated Hallucination Detection [Research Article]. Research Square, 2026.
