Article 2026-04-21 posted v1

Benchmarking eye health advice from generative artificial intelligence in terms of factual accuracy, safety, comprehensiveness and readability

Aleksander Stupnicki, University College London

Bernardo Mendes

Maxwell Reinstein

Ariel Ong

Andrew Malem, Cleveland Clinic Abu Dhabi

Pearse Keane, UCL Institute of Ophthalmology

Arun Thirunavukarasu

Abstract

Background Generative artificial intelligence (genAI) chatbots are increasingly used for health advice despite lacking regulatory approval, raising concerns about the quality and safety of their output. This study assesses eye health advice from leading genAI platforms, benchmarking its quality against patient information leaflets.

Methods We compared outputs from GPT-5 (OpenAI) and Gemini 3 (Google DeepMind) with clinical leaflets across nine eye conditions (41 questions, 123 texts total). The reference benchmark (547 items) was derived from patient materials produced by the Royal College of Ophthalmologists. Chatbot outputs were generated using verbatim leaflet subsection headings as prompts, with word-count restrictions to match the corresponding leaflet sections. All texts were evaluated using the Comprehensiveness, Accuracy, and Safety Evaluation Framework (CASEF). Two blinded ophthalmologists assessed genAI outputs for safety concerns. Readability was measured using the Flesch-Kincaid Grade Level.

Results Both genAI models showed higher factual alignment than clinical leaflets (GPT-5 = 37.4%, Gemini 3 = 36.2%, leaflets = 30.7%; both p < 0.001) and fewer omissions (GPT-5 = 6.7, Gemini 3 = 7.0, leaflets = 7.9; both p < 0.001). Safety scores were comparable across sources, but both models underreported treatment complications, exhibited guideline inconsistencies, and failed to include appropriate safety-netting. Moreover, genAI outputs required 2–3 more years of education to understand. High inter-rater reliability (ICC = 0.843; 95% CI: 0.800–0.880) validated the scoring methodology.

Conclusions GenAI eye health advice matches clinical leaflets in accuracy and comprehensiveness. However, subtle yet clinically consequential errors remain, limiting the application of general-purpose genAI chatbots as a safe, standalone information source for ophthalmology patients.
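For readers unfamiliar with the readability metric named in the Methods, the following is a minimal Python sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The abstract does not say which tool or syllable-counting method the authors used, so the function names and the vowel-group syllable heuristic here are assumptions for illustration only, and the scores it produces may differ from those reported in the study.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (a rough heuristic,
    not the method used in the study)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a trailing silent 'e' (e.g. "disease"), but not "-le" or "-ee".
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Return the Flesch-Kincaid Grade Level (approximate US school grade)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    plain = "The eye is red."
    dense = "Ophthalmological complications necessitate intervention."
    print(round(flesch_kincaid_grade(plain), 2))
    print(flesch_kincaid_grade(dense) > flesch_kincaid_grade(plain))
```

Longer sentences and more syllables per word both raise the grade level, which is the sense in which one text can require more years of education to understand than another.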

Citation Information

@article{aleksanderstupnicki2026,
  title={Benchmarking eye health advice from generative artificial intelligence in terms of factual accuracy, safety, comprehensiveness and readability},
  author={Aleksander Stupnicki and Bernardo Mendes and Maxwell Reinstein and Ariel Ong and Andrew Malem and Pearse Keane and Arun Thirunavukarasu},
  journal={Research Square},
  year={2026},
  doi={10.21203/rs.3.rs-9383173/v1}
}

Aleksander Stupnicki et al. (2026). Benchmarking eye health advice from generative artificial intelligence in terms of factual accuracy, safety, comprehensiveness and readability. Research Square. https://doi.org/10.21203/rs.3.rs-9383173/v1

Aleksander Stupnicki, et al. "Benchmarking eye health advice from generative artificial intelligence in terms of factual accuracy, safety, comprehensiveness and readability." Research Square, 2026.

[7] Aleksander Stupnicki, Bernardo Mendes, Maxwell Reinstein, Ariel Ong, Andrew Malem, Pearse Keane, Arun Thirunavukarasu. Benchmarking eye health advice from generative artificial intelligence in terms of factual accuracy, safety, comprehensiveness and readability [Article]. Research Square, 2026.
