Small Language Models in Clinical Medicine: A Systematic Review of Performance, Safety, and Deployment Feasibility
Abstract
Large language models are increasingly used in clinical medicine, but their reliance on cloud servers conflicts with patient-privacy requirements and excludes resource-limited healthcare systems. Small language models (SLMs) of up to four billion parameters can run locally on a single commodity GPU, keeping data inside the institution while reaching performance comparable to much larger systems. Here we systematically review 14 studies that deploy SLMs for clinical prediction, information extraction, and medical question answering. Domain-adapted small models reached a median 91% of the best reported performance of larger baselines, and we found no significant correlation between parameter count and task accuracy. Only half of the studies evaluated hallucination rates, and none reported calibration or epistemic uncertainty. The computational case for on-premise clinical AI is therefore strong, but the safety engineering required for responsible deployment, particularly in agentic sub-agent pipelines, is largely absent.
Citation Information
@article{alongorenshtein2026,
title={Small Language Models in Clinical Medicine: A Systematic Review of Performance, Safety, and Deployment Feasibility},
author={Alon Gorenshtein and Mahmud Omar and Yiftach Barash and Jonathan B. Kruskal and Muneeb Ahmed and Olga R. Brook and Ben Illigens and Girish N. Nadkarni and Eyal Klang},
journal={Research Square},
year={2026},
doi={10.21203/rs.3.rs-9488729/v1}
}