Systematic Review 2026-04-23 posted v1

Small Language Models in Clinical Medicine: A Systematic Review of Performance, Safety, and Deployment Feasibility

Alon Gorenshtein, BRIDGE GenAI Lab, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA

Mahmud Omar, BRIDGE GenAI Lab, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA

Yiftach Barash, BRIDGE GenAI Lab, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA

Jonathan B. Kruskal, Department of Radiology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA

Muneeb Ahmed, Department of Radiology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA

Olga R. Brook, Department of Radiology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA

Ben Illigens, Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Girish N. Nadkarni, The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, New York, NY, USA

Eyal Klang, BRIDGE GenAI Lab, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA


Abstract

Large language models are increasingly used in clinical medicine, but their reliance on cloud servers conflicts with patient-privacy requirements and excludes resource-limited healthcare systems. Small language models (SLMs) of up to four billion parameters can run locally on a single commodity GPU, keeping data inside the institution while reaching performance comparable to much larger systems. Here we systematically review 14 studies that deploy SLMs for clinical prediction, information extraction, and medical question answering. Domain-adapted small models reached a median 91% of the best reported performance of larger baselines, and we found no significant correlation between parameter count and task accuracy. Only half of the studies evaluated hallucination rates, and none reported calibration or epistemic uncertainty. The computational case for on-premise clinical AI is therefore strong, but the safety engineering required for responsible deployment, particularly in agentic sub-agent pipelines, is largely absent.
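The claim that a model of up to four billion parameters fits on a single commodity GPU can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it estimates the memory needed for the weights alone and ignores the KV cache, activations, and runtime overhead. The precision formats listed (fp32, fp16, int8, int4) and the helper name `weight_memory_gib` are assumptions made for this example; the abstract does not state which precisions the reviewed studies used.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Upper bound on SLM size given in the abstract (four billion parameters).
N_PARAMS = 4e9

# Hypothetical storage precisions, chosen only for illustration.
PRECISIONS = [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]

footprints = {
    name: round(weight_memory_gib(N_PARAMS, bits), 2)
    for name, bits in PRECISIONS
}

for name, gib in footprints.items():
    print(f"{name}: ~{gib} GiB for weights only")
```

Under these assumptions, half-precision weights for a four-billion-parameter model need roughly 7.5 GiB, and 4-bit quantized weights under 2 GiB. Actual requirements depend on context length, batch size, and the inference runtime.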

Keywords

Small Language Models; Agentic AI; Large Language Models; Systematic Review; Clinical Natural Language Processing; Edge Deployment; PRISMA

Citation Information

@article{alongorenshtein2026,
  title={Small Language Models in Clinical Medicine: A Systematic Review of Performance, Safety, and Deployment Feasibility},
  author={Alon Gorenshtein and Mahmud Omar and Yiftach Barash and Jonathan B. Kruskal and Muneeb Ahmed and Olga R. Brook and Ben Illigens and Girish N. Nadkarni and Eyal Klang},
  journal={Research Square},
  year={2026},
  doi={10.21203/rs.3.rs-9488729/v1}
}

Alon Gorenshtein et al. (2026). Small Language Models in Clinical Medicine: A Systematic Review of Performance, Safety, and Deployment Feasibility. Research Square. https://doi.org/10.21203/rs.3.rs-9488729/v1

Alon Gorenshtein, et al. "Small Language Models in Clinical Medicine: A Systematic Review of Performance, Safety, and Deployment Feasibility." Research Square, 2026.

[9] Alon Gorenshtein, Mahmud Omar, Yiftach Barash, Jonathan B. Kruskal, Muneeb Ahmed, Olga R. Brook, Ben Illigens, Girish N. Nadkarni, Eyal Klang. Small Language Models in Clinical Medicine: A Systematic Review of Performance, Safety, and Deployment Feasibility [Systematic Review]. Research Square, 2026.
