A comparative study on the accuracy of different machine learning methods for lung cancer prediction
Abstract
Background: Lung cancer (LC) incidence and mortality are extremely high and continue to rise. Early diagnosis combined with timely intervention can effectively reduce mortality in LC patients. However, because early-stage LC is often asymptomatic, most patients are already at an advanced stage when they first seek clinical care. This study uses machine learning (ML) methods to develop a clinically applicable diagnostic model for LC based on clinical hematological data, thereby providing a theoretical basis for LC screening.
Methods: This study included hospitalized LC and non-LC patients from 2015 to 2024. Six methods for handling imbalanced data were compared, and adaptive synthetic sampling (ADASYN) was selected. Feature selection was performed using the Boruta algorithm, the Least Absolute Shrinkage and Selection Operator (LASSO), and Random Forest-based Recursive Feature Elimination (RF-RFE). Individual ML models and stacked ensemble models were then built on the selected features. Model performance and clinical utility were evaluated using receiver operating characteristic (ROC) curves, precision-recall (PR) curves, calibration curves, and decision curve analysis (DCA). The optimal model was subsequently used to identify the best combination of serum tumor markers. Finally, the SHapley Additive exPlanations (SHAP) algorithm was applied to identify key features and interpret the predictive model.
Findings: The stacked ensemble model integrating support vector machines (SVM), extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost) outperformed all other models compared, achieving an accuracy of 0.8833, precision of 0.8852, recall of 0.9971, F1-score of 0.9378, and area under the receiver operating characteristic curve (AUC-ROC) of 0.7286. Among the LC tumor marker combinations, the NSE + SCCAg + CEA + CA-125 panel showed the best diagnostic efficacy, significantly outperforming individual biomarkers. In addition, SHAP analysis identified albumin as a key factor driving model predictions.
Interpretation: The ensemble learning model developed in this study effectively integrates multi-center clinical datasets and was systematically optimized along four dimensions: high-dimensional feature selection, imbalanced sample handling, heterogeneous model fusion, and clinical interpretability. Through rigorous cross-validation and external validation, this approach identifies an optimal panel of serum biomarkers for early LC diagnosis and translates the algorithm into a user-friendly clinical decision support tool. The methodology shows significant translational potential: integrating advanced ML techniques into routine clinical workflows can facilitate non-invasive early detection of LC, enable timely intervention, and ultimately reduce the risk of disease progression and mortality.
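The F1-score reported in the Findings is, by definition, the harmonic mean of precision and recall, so it can be recomputed from the other two reported values. The short sketch below does that arithmetic; the function name `f1_from_precision_recall` and the variable names are illustrative and not taken from the study, and the check assumes the published figures are rounded to four decimal places.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1-score)."""
    if precision + recall == 0:
        return 0.0  # conventional value when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the stacked ensemble model.
reported_precision = 0.8852
reported_recall = 0.9971
reported_f1 = 0.9378

computed_f1 = f1_from_precision_recall(reported_precision, reported_recall)

# The recomputed value should agree with the reported one up to rounding.
print(f"computed F1 = {computed_f1:.4f}, reported F1 = {reported_f1:.4f}")
```

Under that rounding assumption, the recomputed F1-score matches the reported 0.9378, so the three reported values are mutually consistent.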
Citation Information
@article{jiayizhao2026,
title={A comparative study on the accuracy of different machine learning methods for lung cancer prediction},
author={Jiayi Zhao and Yao Tong and Yunchao Huang and Chao Zhang and Hongjiang Zhang and Zhenghong Yang and Fang Li and Yang Luo and Dinglin Zhang},
journal={European Journal of Medical Research},
year={2026},
doi={https://doi.org/10.21203/rs.3.rs-9145489/v1}
}