Accepted for/Published in: JMIR Formative Research
Date Submitted: Nov 22, 2021
Date Accepted: Apr 27, 2022
Date Submitted to PubMed: Aug 24, 2022
‘Exploring socioeconomic status as a global determinant of COVID-19 prevalence, using statistical, exploratory data analytic, and supervised machine learning techniques.’
ABSTRACT
Background:
The COVID-19 pandemic represents an unprecedented global challenge. As the global community attempts to manage the pandemic in the long term, it is pivotal to understand which factors drive prevalence rates and to predict the future trajectory of the virus.
Objective:
The aim of this study was to investigate whether socioeconomic indicators help predict year-on-year COVID-19 prevalence rates in a cross-sectional sample of 182 countries. Several supervised machine learning techniques were evaluated and compared to determine the most accurate predictive algorithm.
Methods:
This research applied three supervised regression techniques: linear regression, random forest, and AdaBoost. Results were evaluated using k-fold cross-validation and subsequently compared to analyse algorithmic suitability. The analysis involved two models. In the first, the algorithms were trained to predict 2021 COVID-19 prevalence using only 2020 infection data. In the second, socioeconomic indicators were added as features and the algorithms were trained again. The Human Development Index metrics of life expectancy, mean years of schooling, expected years of schooling, and Gross National Income were used to approximate socioeconomic status.
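The two-model comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses synthetic placeholder data rather than the study's dataset, and picks 5 folds and default hyperparameters arbitrarily, since the abstract does not state them. All variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_countries = 182  # sample size reported in the abstract

# Synthetic placeholders, NOT the study's data.
prev_2020 = rng.gamma(shape=2.0, scale=1.0, size=n_countries)
# Four columns stand in for life expectancy, mean years of schooling,
# expected years of schooling, and Gross National Income.
ses = rng.normal(size=(n_countries, 4))
prev_2021 = (
    1.5 * prev_2020
    + ses @ np.array([0.8, 0.5, 0.3, 0.6])
    + rng.normal(scale=0.5, size=n_countries)
)

X_model1 = prev_2020.reshape(-1, 1)          # model 1: 2020 prevalence only
X_model2 = np.column_stack([prev_2020, ses])  # model 2: plus socioeconomic indicators

learners = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}


def mean_cv_r2(model, X, y, n_splits=5):
    """Mean R^2 over k folds (k=5 is an assumption, not taken from the paper)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    return float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())


results = {
    name: (mean_cv_r2(model, X_model1, prev_2021),
           mean_cv_r2(model, X_model2, prev_2021))
    for name, model in learners.items()
}

for name, (r2_m1, r2_m2) in results.items():
    print(f"{name}: model 1 R2={r2_m1:.3f}, model 2 R2={r2_m2:.3f}")
```

Because the data here are simulated, the printed scores say nothing about COVID-19; the sketch only shows the shape of the procedure, in which the same learners and folds are scored on both feature sets.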
Results:
With 2020 infection prevalence as the sole predictor of 2021 prevalence, the average predictive accuracy of the algorithms was low (R²=0.562). When the socioeconomic indicators were added alongside 2020 prevalence rates as features, average predictive performance improved considerably (R²=0.724) and all error statistics decreased. This suggests that socioeconomic indicators, used together with 2020 infection data, substantially improve prediction of COVID-19 prevalence. Linear regression was the strongest learner, with R²=0.713 for the first model and R²=0.762 for the second, followed by random forest (R²=0.533 and 0.733) and AdaBoost (R²=0.441 and 0.676).
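As a quick arithmetic check, the reported averages can be recomputed from the per-algorithm R² values given above. The snippet below is only a worked calculation on the numbers in this abstract; the dictionary name and layout are illustrative.

```python
# R^2 values as reported in the abstract: (model 1, model 2)
reported_r2 = {
    "linear regression": (0.713, 0.762),
    "random forest": (0.533, 0.733),
    "AdaBoost": (0.441, 0.676),
}

mean_m1 = sum(m1 for m1, _ in reported_r2.values()) / len(reported_r2)
mean_m2 = sum(m2 for _, m2 in reported_r2.values()) / len(reported_r2)
print(round(mean_m1, 3), round(mean_m2, 3))  # prints 0.562 0.724

# Per-algorithm gain in R^2 from adding the socioeconomic indicators
gains = {name: round(m2 - m1, 3) for name, (m1, m2) in reported_r2.items()}
print(gains)
```

The means match the averages stated in the Results. The per-algorithm differences also show that the two ensemble methods gained more from the added features (about 0.200 and 0.235) than linear regression did (about 0.049).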
Conclusions:
Understanding the impact of socioeconomic status at the national level will assist with future pandemic management. This paper puts forward new considerations about the application of machine learning techniques to understand and combat the COVID-19 pandemic.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.