Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 7, 2020
Date Accepted: Mar 8, 2021
Date Submitted to PubMed: Apr 15, 2021
Machine Learning Applied to Spanish Clinical Laboratory Data for COVID-19 Outcome Prediction: Model Development and Validation
ABSTRACT
Background:
The pandemic caused by the SARS-CoV-2 virus will probably stand as the greatest health catastrophe of the modern era. The Spanish health care system was exposed to an unmanageable number of patients in a short period of time, causing the system to collapse. Given that diagnosis is not immediate and there is no effective treatment, other tools have had to be developed to identify patients at risk of severe disease complications and thus optimize material and human resources in health care. To date, there are no tools to establish which patients have a worse prognosis than others.
Objective:
In this study, we aimed to process a sample of electronic health records of COVID-19 patients in order to develop a machine learning model to predict the severity of infection and mortality through clinical laboratory parameters. Early patient classification can help optimize material and human resources, and analysis of the most important features of the model could provide insights into the disease.
Methods:
After an initial performance evaluation based on a comparison with several other well-known methods, the extreme gradient boosting (XGBoost) algorithm was chosen as the predictive method for this study. In addition, SHAP (SHapley Additive exPlanations) was used to analyze the importance of the features of the resulting model.
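To illustrate the idea behind SHAP, the following minimal sketch computes exact Shapley values for a single sample of a toy model. It does not use the SHAP library or the study's XGBoost model: the function `toy_risk`, its coefficients, the three scaled feature values, and the all-zero baseline are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one sample.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features, so this
    is only practical for toy examples.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_risk(v):
    """Hypothetical risk score; NOT the model from the study."""
    ldh, crp, urea = v
    return 0.5 * ldh + 0.3 * crp + 0.2 * ldh * urea


patient = [1.0, 2.0, 1.0]    # made-up, scaled feature values
baseline = [0.0, 0.0, 0.0]   # assumed reference patient
phi = shapley_values(toy_risk, patient, baseline)
```

The attributions sum to the difference between the prediction for the patient and the prediction for the baseline, which is the additivity property that SHAP relies on. The interaction term between the first and third features is split equally between them.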
Results:
After data preprocessing, 1823 confirmed COVID-19 patients and 32 predictor features were selected. On bootstrap validation, the XGBoost classifier yielded a value of 0.97 (95% CI 0.96-0.98) for the area under the receiver operating characteristic curve, 0.86 (95% CI 0.80-0.91) for the area under the precision-recall curve, 0.94 (95% CI 0.92-0.95) for accuracy, 0.77 (95% CI 0.72-0.83) for F-score, 0.93 (95% CI 0.89-0.98) for sensitivity, and 0.91 (95% CI 0.86-0.96) for specificity. The four most relevant features for model prediction were lactate dehydrogenase (LDH), C-reactive protein, neutrophils, and urea.
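As a rough illustration of how point estimates with 95% CIs of this kind can be obtained, the sketch below computes sensitivity, specificity, accuracy, and F-score from binary labels and a percentile bootstrap interval for each. It is not the study's validation procedure, which may differ (for example, by refitting the model on each resample); here only fixed predictions are resampled. The labels and predictions are synthetic, and the function names, sample size, and error rate are assumptions.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and F-score for binary labels.

    Assumes y_true contains at least one positive and one negative case.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "f_score": 2 * tp / (2 * tp + fp + fn),
    }


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for one metric (both classes must be present)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # skip resamples that lack one of the classes
        yp = [y_pred[i] for i in idx]
        stats.append(confusion_metrics(yt, yp)[metric])
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic data: about 20% positives, about 10% of predictions flipped
rng = random.Random(42)
y_true = [1 if rng.random() < 0.2 else 0 for _ in range(200)]
y_pred = [1 - t if rng.random() < 0.1 else t for t in y_true]

point = confusion_metrics(y_true, y_pred)
for name in point:
    lower, upper = bootstrap_ci(y_true, y_pred, name)
    print(f"{name}: {point[name]:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With imbalanced outcomes such as mortality, the interval for sensitivity is typically wider than the one for specificity, because it is estimated from the smaller positive class.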
Conclusions:
The predictive model obtained in this work achieved excellent results in discriminating COVID-19 patients who died, mainly by using laboratory parameter values. Analysis of the resulting model identified the set of features with the greatest impact on the prediction, thereby relating them to a higher risk of mortality.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.