Machine Learning Expected Goals Models in Soccer: Reproducible Shot-Outcome Prediction with StatsBomb Data
Abstract
This paper presents a fully reproducible, interpretable machine learning pipeline for expected-goals (xG) modelling, built on open StatsBomb event data for La Liga 2015/2016 and the 2018 FIFA World Cup. The dataset comprised 10,709 non-penalty shots, with features derived from shot distance, shot angle, body part (head vs. foot), and competition. A logistic regression and a generalized linear mixed-effects model with shooter-level random intercepts were fitted to predict goal probability at the shot level. Model performance was evaluated using information criteria and the area under the ROC curve (AUC). Scoring probability fell sharply with distance, headers were markedly less likely to result in goals than footed shots, and World Cup shots had lower baseline conversion than La Liga attempts from comparable locations. AUC values ranged from 0.75 (baseline location model) to 0.79 (mixed-effects model), providing competitive baselines for soccer prediction tasks. The pipeline illustrates how openly available event data can support transparent, interpretable, and practically useful xG models for research, teaching, and applied coaching and performance analysis in soccer. Trial registration: not applicable.
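The location features and the AUC metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, which is not shown here. It assumes the StatsBomb pitch convention (a 120 x 80 grid with the attacking goal centred at (120, 40) and posts at y = 36 and y = 44) and defines shot angle as the angle subtended by the two posts at the shot location, which is one common choice. The function names and the `coefs` dictionary are hypothetical, and any coefficient values passed in are the caller's own, not estimates from the paper.

```python
import math

# Assumed StatsBomb pitch convention: 120 x 80 units, attacking goal
# centred at (120, 40), goal mouth 8 units wide (posts at y = 36 and y = 44).
GOAL_X, GOAL_Y, GOAL_WIDTH = 120.0, 40.0, 8.0


def shot_distance(x, y):
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)


def shot_angle(x, y):
    """Angle in radians subtended by the two goalposts at the shot location."""
    dx = GOAL_X - x
    dy = y - GOAL_Y
    half = GOAL_WIDTH / 2
    # atan2(cross product, dot product) of the vectors from the shot to each post.
    return math.atan2(GOAL_WIDTH * dx, dx * dx + dy * dy - half * half)


def goal_probability(x, y, is_header, is_world_cup, coefs):
    """Logistic-regression prediction for one shot.

    `coefs` is a hypothetical dict of fitted coefficients with the keys
    intercept, distance, angle, header, world_cup.
    """
    eta = (
        coefs["intercept"]
        + coefs["distance"] * shot_distance(x, y)
        + coefs["angle"] * shot_angle(x, y)
        + coefs["header"] * is_header
        + coefs["world_cup"] * is_world_cup
    )
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit


def auc(labels, scores):
    """AUC as the share of (goal, non-goal) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of shots, which is acceptable for a sketch; a rank-based formula or a library routine would be the more usual choice for a dataset of this size. The shooter-level random intercepts of the mixed-effects model are not covered here.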
Citation Information
@article{kofinyantakyiappiah2026,
title={Machine Learning Expected Goals Models in Soccer: Reproducible Shot-Outcome Prediction with StatsBomb Data},
author={Kofi Nyantakyi Appiah and Nathanael Adu and Divyanshu Kumar Singh and Edward Edem Nartey},
journal={International Journal of Data Science and Analytics},
year={2026},
doi={10.21203/rs.3.rs-9175702/v1}
}