Abstract
South Africa was the most affected country in Africa by the coronavirus disease 2019 (COVID-19) pandemic, where over 4 million confirmed cases of COVID-19 and over 102,000 deaths have been recorded since 2019. Aside from clinical methods, artificial intelligence (AI)-based solutions such as machine learning (ML) models have been employed in treating COVID-19 cases. However, limited application of AI for COVID-19 in Africa has been reported in the literature. This study aimed to investigate the performance and interpretability of several ML algorithms, including deep multilayer perceptron (Deep MLP), support vector machine (SVM) and Extreme gradient boosting trees (XGBoost) for predicting COVID-19 mortality risk with an emphasis on the effect of cross-validation (CV) and principal component analysis (PCA) on the results. For this purpose, a dataset with 154 features from 490 COVID-19 patients admitted into the intensive care unit (ICU) of Tygerberg Hospital in Cape Town, South Africa, during the first wave of COVID-19 in 2020 was retrospectively analysed. Our results show that Deep MLP had the best overall performance (F1 = 0.92; area under the curve (AUC) = 0.94) when CV and the synthetic minority oversampling technique (SMOTE) were applied without PCA. By using the Shapley Additive exPlanations (SHAP) model to interpret the mortality risk predictions, we identified the Length of stay (LOS) in the hospital, LOS in the ICU, Time to ICU from admission, days discharged alive or death, D-dimer (blood clotting factor), and blood pH as the six most critical variables for mortality risk prediction. Also, Age at admission, Pf ratio (PaO2/FiO2 ratio), troponin T (TropT), ferritin, ventilation, C-reactive protein (CRP), and symptoms of acute respiratory distress syndrome (ARDS) were associated with the severity and fatality of COVID-19 cases. The study reveals how ML could assist medical practitioners in making informed decisions on handling critically ill COVID-19 patients with comorbidities. It also offers insight into the combined effect of CV, PCA, and SMOTE on the performance of ML models for COVID-19 mortality risk prediction, which has been little explored.
Similar content being viewed by others
Introduction
The effect of the coronavirus disease 2019 (COVID-19) pandemic on healthcare systems all over the world has been devastating1. As a result, various clinical intervention methods have been employed in the detection, diagnosis, and prognosis of COVID-19 cases, including clinical/laboratory methods and medical image-based diagnosis2,3. Lately, the use of artificial intelligence (AI) methods for tackling the challenges of COVID-19 has received significant attention. Some of the efforts so far reported include the application of machine learning (ML) as a subbranch of AI to detect COVID-19 infections2,4,5, the prognosis of COVID-19 progression6,7,8,9,10, and the determination of COVID-19 treatment outcomes11,12,13,14. However, limited instances of applying AI to the prognosis of COVID-19 cases in Africa have been reported in the literature.
South Africa recorded the highest number of COVID-19 cases in Africa (http://www.worldometers.info/coronavirus). A large percentage of the population also has pre-existing comorbidities such as human immunodeficiency virus (HIV), tuberculosis (TB), and diabetes. comorbidities in patients with COVID-19 complicate prevention strategies and are more difficult to manage
15. AI-enabled decision-making for treating COVID-19 can enhance healthcare quality, particularly in settings with a paucity of medical expertise16. However, this is not common in many African countries. Healthcare practitioners mostly rely on observations from clinical examinations of patients and their own experiences when making clinical decisions.
Recently, many instances of prediction and prognosis of COVID-19 using ML have been reported in the literature. ML algorithms have been shown to possess the ability to predict COVID-19 outcomes accurately. Specific areas of application of ML include the prediction of future situation of COVID-19 regarding emergence of new cases (infection), deaths within a country or across regions4,5,13,17,18; detection of COVID-19 infection, including prediction of COVID-19 outcome (death/survival) of critically ill patients12,19,20,21,22; the use of X-ray images2,23,24,25; and prediction of mortality risk and determination of the prognostic value of specific clinical variables6,7,8,9,10,11,14. Accurate COVID-19 mortality prediction is essential because it will enable health management systems to allocate adequate and appropriate healthcare resources to the most critical COVID-19 cases. However, most of the previous studies on COVID-19 mortality prediction adopted an approach that involves feature selection, model training, and comparison of the performance of different ML algorithms. Studies that investigate the effect of the selective application of treatments, such as cross-validation (CV), principal component analysis (PCA) based dimensionality reduction, and the synthetic minority oversampling technique (SMOTE), or their combinations, on the performance of ML models regarding COVID-19 mortality prediction are rare. Thus, an understanding of the effect of these treatments on the performance of ML models during predictive modelling of COVID-19 mortality is still lacking. This knowledge gap requires to investigate the effect of PCA, CV and SMOTE on the performance of ML models when they are used for COVID-19 mortality prediction.
CV assesses how much a trained ML model can generalise on an independent dataset26. It is a method designed to improve a predictive model's generalizability. PCA is a dimensionality reduction technique that enables large datasets to be transformed into one with reduced dimensions (fewer variables) that is easier to explore and can be processed more efficiently and faster by ML algorithms26. Generally, applying PCA can lead to accuracy trade-offs, but it could still be worthwhile in many cases. Hence, it is an option for ML researchers when dealing with large datasets. Also, we went further in this study to interpret the mortality risk predictions using the shapley additive exPlanations (SHAP) model to determine the most critical variables with prognostic value regarding COVID-19 mortality. This is to provide a basis for improved decision-making by medical experts.
Summarily, this study has two objectives. The first is to perform predictive modelling of the mortality risk of COVID-19 patients, which could guide decision-making on their treatment by experimenting with some specific treatments (CV, PCA, and SMOTE). The second is identifying the most critical variables determining mortality risk in critically ill COVID-19 patients.
The rest of this paper is organised as follows. Section "Methodology" presents the methodology of the study. Section "Results" provides the experimental results obtained from the three selected ML models when different options of CV and PCA were applied. Section "Discussion" discusses the results, while the paper concludes in Section "Conclusions" with a summary and plan for future work.
Methodology
Study design and setting
The dataset used in this study pertains to a cohort of COVID-19 patients admitted into the ICU of Tygerberg Hospital in Cape Town, South Africa, during the first wave of COVID-19 from March to November 202027. Figure 1 shows the process workflow of the experimentation that we did. Firstly, data was collected from the hospital. The dataset was then split into training and testing sets. The training set was used to train the Deep MLP, XGBoost and a SVM. Later, we evaluated the performance of the ML models on the test set using standard classification metrics. We benchmarked the performance of the models using the F1-score, accuracy, recall, precision and AUC score. The Classification Report method of Scikit Learn framework was used to generate the F1-score, accuracy, recall, and precision of the ML models. The report also contains the Macro Average (the average score across all classes for precision, recall, and F1) and the Weighted Average (the mean value of a metric per class (e.g., F1, recall, precision) while considering the support of each class).
After this, we applied the SHAP model, a unified framework for interpreting predictions by ML models to investigate the model interpretability28. SHAP is able to determine the level of the global importance of each feature to the prediction generated by a ML model. SHAP is a mathematical method based on game theory that can be used to explain any machine learning model's predictions by calculating each feature's contribution to the prediction28. We then compared the important features extracted through SHAP for each ML model to derive our conclusion.
Predictor and outcome variables
The dataset consists of several aspects, such as:
-
Socio-demographic and lifestyle characteristics (e.g. age, gender, ethnicity, residential area, occupation, workplace, smoking status, alcohol use, drug use etc.).
-
Exposure history and symptoms (previous contact with infected persons or places; symptom types, severity score, etc.). This may not be available for critically ill patients.
-
Comorbidities (such as Hypertension, chronic obstructive pulmonary disease (COPD), diabetes, heart diseases, TB, HIV, as well as the discharged alive/death outcome)
-
Laboratory data such as full blood count (FBC), Ferritin, Procalcitonin (PCT), MicroRNAs, N-terminal pro–B-type natriuretic peptide (NT-proBNP) and inflammatory cytokine)
-
Medication data (e.g., Steroids, antibiotics, antiviral, antifungal, medication for glycaemic control, antihypertensive etc.)
-
Clinical monitoring – status (i.e., response to treatment) of the patient at the time of clinical examination, mechanical ventilation (e.g. Pulmonary Oxygen, Carbon Dioxide (CO2), Potassium, Calcium etc.)
The dataset consists of 154 features and 490 rows (records). There is one target variable–Discharged alive or Death. A snapshot of the dataset is presented in Tables 1 and 2. The dataset was imbalanced in terms of the target outcome (death or discharge) in the ratio of 60:35.
Data pre-processing
The data pre-processing techniques were applied to clean up the data and transform the data to numerical form (the numeric data points were scaled [0–1] by using the min–max normalisation function, while categorical variables were encoded using one-hot encoding). The dataset contained some missing values; therefore, the multivariate imputation technique was applied to fix missing values. Features (variables) that had 30% or more missing values and those considered redundant due to duplication were removed. A total of 55 variables were removed from the dataset that was used for analytics. These consist of 44 variables with more than 30% missing values and 11 variables regarded as redundant. Thus, after pre-processing, we were left with 99 variables and 490 rows used for our experimentation. The numeric variables in the dataset are shown in Table 1, while the categorical variables are presented in Table 2.
Principal component analysis (PCA)
PCA is a statistical process that converts the observations of correlated variables into a set of linearly uncorrelated variables with the help of orthogonal transformation, leading to a smaller version of the dataset with fewer variables considered significant(dimensionality reduction). These newly transformed (significant) features are called the Principal Components. Although reducing the number of variables in a dataset could impact the accuracy of results, the trade-off in easier visualisation of results and less computational complexity makes the training of machine learning models faster29. Using PCA is particularly useful when dealing with a dataset with high dimensionality where the number of columns is more than the number of rows of data. The process of PCA involves:
-
I.
Ensuring that the ranges of continuous initial variables are standardised
-
II.
Identifying correlations by computing the covariance matrix
-
III.
Identify the principal components by determining the eigenvectors and eigenvalues of the covariance matrix
-
IV.
Creating a feature vector to decide which principal components to keep, and
-
V.
Recasting the data along the principal components’ axes.
Cross-validation
Cross-validation enables the training of ML models by using different subsets of the training dataset and evaluating them using a unique portion of that training dataset per time to enhance the generalisation of a trained ML model. Some of the more known techniques of cross-validation include30:
-
I.
Leave One Out Cross-Validation (LOOCV) This entails leaving out one data point from the dataset while the rest is used to train the ML model for each learning set. The process is repeated for each data point so that k samples will have k training sets and k test sets. The process can help to minimise bias since all data points are used.
-
II.
K-fold cross-validation This entails dividing the training set into k-folds (k equal groups), where each group is called a fold. At any given time, k-1 folds are used to train the model while the reserved fold is used as the test set. The process is repeated over k iterations and ensures that a unique fold is used as the testing set for variant compositions of k-1 folds that are used for training. For example, assume that k = 10 on the first iteration, fold-2 to fold-10 are used for training, while fold-1 is used as the test set. On the second iteration, fold − 1 + fold − 3 + fold − 4 + ‚Ķ fold-10 is used for training, while fold-2 is used as the test set. The same procedure of k-1 folds for training and a unique (reserved) fold for testing is repeated for all k (10) iterations.
-
III.
Stratified K-Fold Cross-validation This method is a variation of the k-fold cross-validation designed for cases of imbalanced datasets. It involves selecting the value k and splitting the dataset into k-folds while ensuring that each fold contains approximately the same proportional representation of target classes as the complete data set. Then, selecting k-1 folds as the training set and the remaining fold as the test set;, and training the model with a particular k-1 folds training set per time. Thereafter, validating the model with the unique test set, and continuing the process for k iterations. In the end, obtaining the final average score of the model's performance.
Models training
This section presents theoretical perspectives on the three MLP Deep MLP, XGBoost, and SVM used in this study.
To overcome above-mentioned challenges, three ML algorithms, including deep multilayer perceptron (Deep MLP), extreme gradient boosting trees (XGBoost), and support vector machines (SVM) were selected for our predictive analytics. Although no particular ML algorithm works best for every problem or dataset, the selected methods rank among the best classification algorithms when dealing with tabular datasets based on evidence from the literature and ML competitions like Kaggle31,32. For example, the multiple hidden layers in Deep ML allow it to model highly non-linear relationships in data. This is crucial for tasks where the relationship between inputs and outputs is complex and not easily captured by linear models32,33. SVM has good generalisation performance and error bounds in real-life application problems, while XGBoost supports k-fold CV during model training, helping to assess the model's performance and reduce the risk of overfitting32. The selected algorithms more detailed as follow:
XGBoost
Boosting is an ensemble technique that combines a set of weak learners and aggregates their predictions to deliver improved prediction accuracy. Generally, gradient boosting represents an ensemble of machine learning models composed of decision trees to solve classification and prediction tasks. The Extreme Gradient Boosted Trees (XGBoost) algorithm works by growing a set of classification and regression trees (CART) sequentially to ensure the misclassification rate is reduced during subsequent iterations32,33. Compared to general gradient boosting, XGBoost can prevent model over-fitting by incorporating two other features apart from the regularised loss function. Firstly, it uses the learning rate hyper-parameter to scale down the weights of each new tree to ensure that a single tree does not dominate leading to the final score. Also, it uses column sampling by using only column-wise data from the training data set to build each tree. Combining these three attributes makes XGBoost outperform traditional gradient boosting methods.
Deep MLP
The MLP is a model of an ANN. ANN emulates the human biological neural system to solve computational problems using adaptive learning. The nodes of an ANN called artificial neurons correspond to the human neuron. The MLP can be used to perform functional approximation to solve classification and regression problems.
Different variants of the backpropagation algorithm can be used to train the MLP network by exposing it to samples of input–output pairs. The difference between the network's output and the target output is fed back into the network to update the weights that connect the hidden-output layers and the input-hidden layers. The knowledge of the network is stored in the set of weights (w) that connect the neurons34,35. The computation at each neuron is the linear combination of all inputs and weights that feed into it. A Deep MLP is a feedforward ANN with multiple (more than one) hidden layers fully connected in a dense architecture. Various activation functions such as rectifier (ReLU), sigmoid, and hyperbolic activation functions can be used in the Deep MLP.
SVM
The SVM is a supervised learning model that can perform classification and regression tasks and is capable of handling both linear and non-linear problems. It establishes a margin to determine data points that fall into specific classes. A typical SVM operation entails finding the best hyperplane (separating line) that separates the classes in a dataset into distinct classes. The hyperplanes represent decision boundaries that determine the class to which specific data points are classified. The support vectors are the data points from different classes closest to a hyperplane. The distance between the support vectors and the hyperplane is called the margin. The SVM algorithm aims to find the optimal hyperplane (decision boundary) with the maximum (longest) margin to enable a maximum number of points to be correctly classified in the training set. According to34, the SVM is a function that correctly classifies two classes with a maximum margin. SVM works with two kernel parameters: the box constraint (c) and the curvature weight (Gamma). The c is a regularisation term set before training the model and used to control the penalty for misclassification. It is the trade-off between having a smooth decision boundary and being able to classify all data points in the training set correctly. The Gamma is a hyperparameter that determines the level of influence of a single training point.
Hyperparameters for training the ML models
The three ML models–DMLP, XGBoost, and SVM- were trained using optimal parameters generated by the grid search function in Scikit Learn. The optimal parameters that were automatically selected for training the models are shown in Table 3.
At first, we considered four options in our experimentation with the three ML models (Deep MLP, SVM, and XGBoost):
-
i.
PCA + No cross-validation
-
ii.
PCA + cross-validation
-
iii.
No-PCA + No cross-validation
-
iv.
No-PCA + cross-validation
After determining the two options that produced the best performance, we then applied the synthetic minority oversampling technique (SMOTE) to these two options to investigate the model performance for a balanced dataset. Thus, for the three ML models, we considered the options:
-
xxii.
No-PCA + No cross-validation + SMOTE
-
xxiii.
No-PCA + cross-validation + SMOTE
Metrics for evaluation of the machine learning models
After training machine learning models, it is always essential to measure the performance of the trained model on the test data set. In this study, we used standard classification metrics such as precision, recall, F1-score, and the Area Under the Receiver Operating curve (AUC) score for each ML model selected for the study. These metrics are explained as follows:
-
i.
Precision This is the measure of positive predictions made that are truly positive. It is the ratio of the number of true positives (TP) by the sum of true positives (TP) and false positives (FP). \(Precision\left(P\right)=\frac{TP}{TP}+FP\)
-
ii.
Recall This measures how well the ML model can predict the true positive instances. It is the total number of true positives (TP) divided by the sum of TP and false negatives (FN). It is a measure of how well the most important occurrence is predicted. \(Recall\left(R\right)=\frac{TP}{TP}+FN\)
-
iii.
Accuracy the percentage of prediction that is correct. It is measured by dividing the number of correct predictions by the total number of predictions. It is not a good measure to assess a ML model when the dataset is imbalanced4.
$$Accuracy=Noofcorrectlyclassified\frac{samples}{total}numberofsamples$$ -
iv.
F1-score The harmonic mean of precision and recall gives a more balanced description of model performance. It is a value between 0 and 1. The F1 score is a suitable metric for assessing model performance for an imbalanced dataset. \(F-Measure=2\frac{PR}{P}+R\). According to36, FI score is interpreted as follows: F1 > 0.9 (very good); 0.8–0.9 (good); 0.5–0.8 (okay); < 0.5 (not good).
-
v.
AUC score The Area Under the Curve (AUC) measures how well a classifier can distinguish between classes and is used as a summary of the receiver operating curve (ROC). The AUC score is rated as follows: excellent (0.9–1), good (0.8–0.9), fair (0.7–0.8), poor (0.6–0.7), and failed (0.5–0.6).
Ethics approval and consent to participate
All methods used in this study followed relevant guidelines and regulations. Ethical approval and waiver of consent were obtained from the Health Research Ethics Committee of Stellenbosch University (Approval number: N20/04/002_COVID-19). The National Health Laboratory Service (NHLS) of South Africa also approved the use of their data for the study (Reference: SRSR3982441). The data analytics experimentation was approved by the Research Ethics Committee of the Faculty of Informatics and Design of the Cape Peninsula University of Technology (30/Daramola/2021).
Results
The results obtained from evaluating the ML models when the six options were implemented are presented next.
PCA without cross-validation
Table 4 shows the performance of the three models measured using precision, recall, F1-score, and accuracy when trained with data obtained after applying PCA on the original data without k-fold cross-validation. Figure 2 shows the AUC scores of the three models when trained under the same condition.
PCA with cross-validation
Table 5 shows the performance of the three ML models after PCA plus k-fold cross-validation was applied to the training data, while Fig. 3 shows the AUC scores.
Non-PCA without cross-validation
Table 6 shows the performance of the three models when trained with the original data without using PCA or k-fold cross-validation, while Fig. 4 shows the AUC scores of the three models for the same scenario.
Non-PCA with cross-validation
Table 7 shows the performance of the three models measured using precision, recall, F1-score, and accuracy when trained without applying PCA, but k-fold cross-validation was used. Figure 5 also shows the corresponding AUC scores of the three models.
Effect of applying the synthetic minority oversampling technique (SMOTE)
The results obtained from our experiments on the initial four options showed that the ML models generally performed better when PCA was not applied. Since the dataset was imbalanced in terms of the target value in the ratio of 60:35, we applied SMOTE to increase instances of the minority class. This resulted in a training dataset of 488 records with an even distribution (50:50) of the target variable. We then compared instances of combining cross-validation plus SMOTE without PCA and when cross-validation and SMOTE without PCA were used. Tables 8 and 9 show the results obtained in these instances, while the corresponding AUC scores are shown in Figs. 6 and 7.
The summary overview of the different experiments is shown in Table 10.
The results in Table 10 show that in terms of F1 score and AUC score, the Deep MLP had a generally good performance when cross-validation (CV) without PCA was applied but had the best overall performance when CV and SMOTE were applied without PCA (F1 = 0.92; AUC = 0.94). The SVM (F1 = 0.83; AUC = 0.82) had the second-best performance when CV and SMOTE were applied without PCA. Thus, we found that, generally, the performance of both SVM and Deep MLP can be enhanced through CV, and they also perform very well without PCA. The XGBoost model (F1 = 081; AUC = 0.79) had the best performance of the three models when we used the raw data without applying any of CV, PCA or SMOTE. XGBoost also performed well with SMOTE, but it is not affected by CV and performed worse in all cases when PCA was applied.
Analysis of feature importance generated by ML models based on SHAP values
Based on the result of the experiments, we extracted instances where each of the three ML models had their best performance and examined the features that had the most influence on their prediction (feature importance) based on SHAP values. These are shown in Figs. 8, 9 and 10.
A comparison of the best-performing instances of the trained models is highlighted in Table 11, while the frequency of occurrence of specific features across the three ML models is shown in Table 12.
From the results in Table 12, the six most critical variables for predicting the mortality or survival of COVID-19 patients were: Length of stay in the hospital, Duration in ICU, Time to ICU from Admission, Days discharged alive or death, D-dimer, and Blood pH. In addition, we also found that other features such as Age at admission, PaO2/FiO2 ratio (Pf_ratio), TropT, Ferritin, ventilation, C-reactive protein (CRP), and Symptom of Acute respiratory distress syndrome (ARDS) (Symp_Acute_Res) are also critical determinants of outcome in terms of survival or death of a patient.
Discussion
We will discuss the results of our experiments from two perspectives that align with the goals of our study.
Lessons from applying PCA and cross-validation
We noticed that applying PCA in combination with cross-validation has not been commonly used for predicting the mortality risk of COVID-19. Most previous approaches have focused on applying feature selection techniques, such as the Pearson correlation37, Chi-square38, and recursive feature elimination22, to identify the relevant predictors to use for prediction without considering applying PCA. Table 13 shows an overview of recent papers on mortality prediction of COVID-19. Our observation is that although our accuracy, F1-score, and AUC results compare favourably with those of previous studies, the instances where the effect of application of PCA + cross-validation + SMOTE are not yet prevalent. However, the results of our research and those of others, as shown in Table 13, emphasise the potency of machine learning as a viable tool to aid decision-making on handling of COVID-19 cases and other diseases whenever a quality dataset is available.
Based on the results of our study, we learnt the following about the three selected ML models:
-
i.
The performance of deep MLP and SVM
Cross-validation can enhance the model performance of Deep MLP and SVM if the proper training parameters are selected. The Deep MLP had a generally good performance in terms of F1-score and AUC when cross-validation was applied (F1 = 0.92 (very good); AUC = 0.94 (excellent)). This observation was consistent irrespective of whether SMOTE was applied or not (see Table 10). The SVM also showed the same pattern because it had a good F1-score (0.83) and a good AUC score (0.82) with or without SMOTE when cross-validation was applied. These observations align with the postulations in the literature that model training can be generally enhanced through cross-validation to prevent overfitting and improve model generalisation on unseen data26,30. Thus, cross-validation is recommended when using Deep MLP and SVM.
-
ii.
The effect of PCA on the performance of ML models
We found a reduction in model performance when PCA was applied to the three ML models (Deep MLP, SVM, XGBoost). However, the Deep MLP performed well when PCA plus cross-validation was applied. Considering that we did not have a large data set, the drop in model performance when PCA was applied compared to when it was not was quite significant (see Table 9). This observation suggests that the choice of whether to apply PCA on a tabular dataset should be weighed carefully depending on the choice of ML model for predictive analytics. Based on our observation, PCA plus cross-validation could be a good choice when using deep learning models. Our findings in this study also support this notion. However, for the other ML models, there may be better results in accuracy. Our perspective on this, which is reinforced by the result of our experimentation, aligns with an established school of thought that the application of PCA can lead to loss of accuracy but has benefits in terms of easier model visualisation and faster training, which could be a desirable trade-off for model accuracy in certain types of applications26,30. For the medical domain, where model prediction accuracy is paramount, applying PCA on a tabular medical dataset may not be the first option. Instead, feature selection based on correlations between the independent variable and the target variable to work with a reduced number of variables, particularly for large data sets, is a better option to achieve better model accuracy. In all cases, the right choice that suits a specific dataset in terms of either applying PCA, feature selection through correlation, or using the whole feature set should be investigated through experimentation.
-
iii.
Performance of XGBoost
XGBoost is one of the most powerful ML algorithms, particularly when dealing with tabular (non-image) data. It is acclaimed for being successful in several Kaggle competitions. Based on our experimentation, we found that XGBoost (F1 = 0.81 (good); AUC = 0.79 (fair)) performed best when used with the raw dataset without any PCA, cross-validation or SMOTE when compared with the other two ML models (Deep MLP, SVM). This observation aligns with the acknowledged characteristics of XGBoost as profiled in the literature38,39. According to39, XGBoost has inbuilt regularisation, which avoids data overfitting problems. It also has an internal function for cross-validation and is well-equipped to deal with missing values. These features make XGBoost capable of dealing with raw data without needing external treatments that other ML algorithms require. Our experiment shows that an attempt to apply cross-validation externally to XGBoost leads to lower performance. Also, because it is sparse-aware, SMOTE yielded no significant performance improvement, while PCA caused a reduced performance.
Lessons from COVID-19 mortality prediction
From our study, we found that the six most critical variables for the prediction of mortality or survival of the COVID-19 patients admitted into the ICU (see Table 12) are (i) length of stay in the hospital; (ii) the Length of time spent in the ICU (Duration in ICU); (iii) Time to ICU from Admission; (iv) the number of days spent in the hospital before discharge or death; (v) the blood clotting factor (D-dimer), and (vi) the blood pH of the patient.
The first four variables collectively indicate the severity of the patient's illness level at the time of admission and its effect on the outcome. In comparison, the last two variables indicate the comorbidity condition of a patient. D-dimer, which is a measure of the blood clotting factor of a patient, could be indicative of the existence or absence of a comorbidity. Studies have shown that D-dimer has prognostic value in detecting severity and fatality associated with COVID-19 cases40,41,42,43. According to40, D-dimer can predict severe and fatal cases of COVID-19 with moderate accuracy. In44, it was reported that elevated D-dimers (> 2590 ng mL−1) and the absence of anticoagulant therapy could predict pulmonary embolism (PE) in hospitalised COVID-19 patients with severe infections.
Similarly45, established a relationship between blood pH and fatal outcomes in critically ill COVID-19 patients at an intensive care unit. This also confirms the prognostic value of blood pH for COVID-19, which our ML models also detected. In addition45, found that Blood pH value, mean arterial pressure, base excess, troponin, and procalcitonin were highly significant prognostic factors of in-hospital mortality.
Apart from D-dimers and Blood pH, other factors include age at admission, PaO2/FiO2 ratio (Pf_Ratio), TropT, Ferritin, ventilation, CRP, and Symptom of Acute respiratory distress syndrome (ARDS) also have a strong association with the severity and fatality of COVID-19 cases. This observation is supported by studies that have shown that the most frequently reported biological anomalies in COVID-19 patients include elevations of inflammatory markers such as C-reactive protein, D-dimers, Ferritin and interleukin-645,46,47. In addition, these findings also significantly complement the results of three studies that were previously conducted on the dataset of the same cohort in South Africa, as reported in48,49,50. In48, the authors found that raised neutrophil count and neutrophil/lymphocyte ratio (NLR) were associated with a worse outcome. At the same time, age, female sex, and D-dimer levels were independent risk factors associated with mortality risk. The results of49 found seventeen categorical variables were associated with mortality among patients admitted to the ICU. These are: age, intubation, HIV positive status, arterial blood gas results, lactate, PF ratio, urea, neutrophil count, C-reactive protein (CRP), Procalcitonin (PCT), D-Dimer, NT-proBNP, troponin T, HbA1c, magnesium, aspartate transaminase (AST), and alkaline phosphatase (ALP). Lastly, in50, the result showed that age (above 48 years), requiring intubation, HIV status, procalcitonin (PCT), Troponin T, Aspartate Aminotransferase (AST), and a low pH on admission were significant predictors of mortality.
Incidences of abnormal indicators of D-dimer, Blood pH, C-reactive protein (CRP), Ferritin, and TropT are usually associated with patients with comorbidities such as Diabetes, obesity, coronary artery disease, hypertension, atrial fibrillation, and chronic heart failure (CHF), cancer, previous angina, peripheral arterial disease, previous myocardial infarction (MI), chronic kidney disease (CKD), cerebrovascular disease, and chronic obstructive pulmonary disease (COPD), HIV and many more51,52. Thus, our findings have identified some prognostic factors that strongly relate to comorbidities and the prediction of severity and outcomes of critically ill COVID-19 patients.
Study limitations
This study used a dataset of critically ill patients under intensive care in a public South African hospital. A limitation of this study was the limited size of the dataset. It was occasioned by the fact that South Africa had the highest number of intensive care admissions during the first wave of COVID-19 and, to a lesser extent, during the second wave, when vaccines were not available, and there was generally limited knowledge of COVID-19 preventive care strategies. After the first and second waves of COVID-19, ICU admissions in most hospitals dwindled nationwide (South Africa), which resulted in reduced access to data on intensive care patients. For this reason, the findings of this study have limited generalizability in that the performance of these ML models may vary when applied to different datasets or healthcare settings. Still, we consider our study’s findings profound and valuable for scholarship because they align with the findings from several clinical studies on the prognosis of COVID-19 mortality40,41,42,43. Also, very few case studies pertaining to COVID-19 patients under intensive care from the African context have been reported in the literature.
Conclusions
In this study, we investigated mortality risk prediction using a dataset of COVID-19 patients admitted into a South African hospital's ICU during the first wave of COVID-19. We found that the six most critical variables associated with predicting the severity of COVID-19 in the patients were Length of stay in the hospital, Duration in ICU, Time to ICU from Admission, Days discharged alive or dead, D-dimer, and Blood pH. In addition, we also found that other features such as Age at admission, PaO2/FiO2 (the ratio of arterial oxygen partial pressure (PaO2 in mmHg) to fractional inspired oxygen (FiO2)), TropT, Ferritin, ventilation, C-reactive protein (CRP), and Symptom of Acute respiratory distress syndrome (ARDS). To do this, we compared the performance of three machine learning models, namely Extreme Gradient Boosting Trees (XGBoost), Support Vector Machines (SVM) and Deep Multi-Layer Perceptron (Deep MLP) when specific treatments: PCA, cross-validation and SMOTE were applied during the model training.
This study makes a theoretical contribution by providing insight into the effect of the combination of CV, PCA, and SMOTE on the performance and interpretability of MLP, XGBoost, and SVM, which has not been previously explored to the best of our knowledge. Also, the study demonstrates how predictive modelling can be applied to identify variables with prognostic value for predicting the severity and outcome of critically ill COVID-19 patients from a study setting in South Africa, which is not prevalent in the literature.
This study has practical implications. Firstly, it reveals how machine learning could be used to identify the critical variables with prognostic value for predicting the severity and mortality of critically ill COVID-19 patients with comorbidities. It can aid healthcare service delivery in Africa, where medical decision-support mechanisms that can lead to increased productivity and efficiency of medical practice are not common. Secondly, the study offers an intellectual guide to help machine learning practitioners make informed decisions during popular ML models like Deep MLP, SVM, and XGBoost. It provides practical insights on how cross-validation, PCA, and SMOTE could be applied during model training and the likely expected outcomes.
Since AI relies on complex mathematics and statistics, it is essential to ensure that the model produces accurate and clinically relevant results. Decision-tracing practices should be developed based on the computations behind the model for user implementation. In future work, we will look into the creation of explicit inference rules and the development of a decision support tool (web app) that can guide medical practitioners when dealing with critical cases of COVID-19. Also, we will explore the adaptation of such an approach to other infectious diseases.
Data availability
The code implementation can be shared on request. Peter Nyasulu (nyasulu@sun.ac.za) can be contacted for the data.
References
Zoabi, Y., Deri-Rozov, S. & Shomron, N. Machine learning-based prediction of COVID-19 diagnosis based on symptoms. NPJ Digit. Med. 4(1), 3 (2021).
Alazab, M., Awajan, A., Mesleh, A., Abraham, A., Jatana, V., & Alhyari, S. 2020. COVID-19 prediction and detection using deep learning. http://www.softcomputing.net/ijcisim_1.pdf. Accessed 13 Jan 2023.
Allwood, B. W. et al. Predicting COVID-19 outcomes from clinical and laboratory parameters in an intensive care facility during the second wave of the pandemic in South Africa. IJID Reg. 3, 242–247 (2022).
Muhammad, L. J. et al. Supervised machine learning models for prediction of COVID-19 infection using epidemiology dataset. SN Comput. Sci. 2(1), 11 (2021).
Alakus, T. B. & Turkoglu, I. Comparison of deep learning approaches to predict COVID-19 infection. Chaos Solitons Fractals 140(110120), 110120 (2020).
Chowdhury, M. E. H. et al. An early warning tool for predicting mortality risk of COVID-19 patients using machine learning. Cognit. Comput. https://doi.org/10.1007/s12559-020-09812-7 (2021).
Domínguez-Olmedo, J. L., Gragera-Martínez, Á., Mata, J. & Pachón Álvarez, V. Machine learning applied to clinical laboratory data in Spain for COVID-19 outcome prediction: Model development and validation. J. Med. Internet Res. 23(4), e26211 (2021).
Aljameel, S. S., Khan, I. U., Aslam, N., Aljabri, M. & Alsulmi, E. S. Machine learning-based model to predict the disease severity and outcome in COVID-19 patients. Sci. Program. 2021, 1–10 (2021).
Sun, C., Hong, S., Song, M., Li, H. & Wang, Z. Predicting COVID-19 disease progression and patient outcomes based on temporal deep learning. BMC Med. Inform. Decis. Mak. 21(1), 45 (2021).
An, C. et al. Machine learning prediction for mortality of patients diagnosed with COVID-19: A nationwide Korean cohort study. Sci. Rep. 10(1), 18716 (2020).
Pourhomayoun, M. & Shakibi, M. Predicting mortality risk in patients with COVID-19 using machine learning to help medical decision-making. Smart Health 20(100178), 100178 (2021).
Subudhi, S. et al. Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19. NPJ Digit. Med. 4(1), 87 (2021).
Shahid, F., Zameer, A. & Muneeb, M. Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos Solitons Fractals 140(110212), 110212 (2020).
Assaf, D. et al. Utilization of machine-learning models to accurately predict the risk for critical COVID-19. Intern. Emerg. Med. 15(8), 1435–1443 (2020).
Davies, M. -A. HIV and risk of COVID-19 death: A population cohort study from the Western Cape Province, South Africa. medRxiv (2020).
Daramola, O. et al. Towards AI-enabled multimodal diagnostics and management of COVID-19 and comorbidities in resource-limited settings. Informatics (MDPI) 8(4), 63 (2021).
Rustam, F. et al. COVID-19 future forecasting using supervised machine learning models. IEEE Access 8, 101489–101499 (2020).
Sarumi, O. A., Aouedi, O. & Muhammad, L. J. Potential of Deep Learning Algorithms in Mitigating the Spread of COVID-19. In Understanding COVID-19: The Role of Computational Intelligence Vol. 963 (eds Nayak, J. et al.) 225–244 (Springer International Publishing, Cham, 2022).
Nemati, M., Ansary, J. & Nemati, N. Machine-learning approaches in COVID-19 survival analysis and discharge-time likelihood prediction using clinical data. Patterns 1(5), 100074 (2020).
Das, A. K., Mishra, S. & Saraswathy Gopalan, S. Predicting CoVID-19 community mortality risk using machine learning and development of an online prognostic tool. PeerJ 8, e10083 (2020).
Zakariaee, S. S., Naderi, N., Ebrahimi, M. & Kazemi-Arpanahi, H. Comparing machine learning algorithms to predict COVID-19 mortality using a dataset including chest computed tomography severity score data. Sci. Rep. 13(1), 11343 (2023).
Khadem, H., Nemat, H., Eissa, M. R., Elliott, J. & Benaissa, M. COVID-19 mortality risk assessments for individuals with and without diabetes mellitus: Machine learning models integrated with interpretation framework. Comput. Biol. Med. 144, 105361 (2022).
Mahanty, C., Kumar, R., Asteris, P. G. & Gandomi, A. H. COVID-19 patient detection based on fusion of transfer learning and fuzzy ensemble models using CXR images. Appl. Sci. 11(23), 11423 (2021).
Mahanty, C., Kumar, R. & Patro, S. G. K. Internet of medical things-based COVID-19 detection in CT images fused with fuzzy ensemble and transfer learning models. N. Gener. Comput. 40(4), 1125–1141 (2022).
Abd Elaziz, M. et al. Boosting COVID-19 image classification using MobileNetV3 and aquila optimizer algorithm. Entropy 23(11), 1383 (2021).
Brownlee, J. Machine Learning Mastery with Python: Understand Your Data, Create Accurate Models, and Work Projects End-to-End (Machine Learning Mastery, 2016).
Allwood, B. W. et al. Clinical evolution, management and outcomes of patients with COVID-19 admitted at Tygerberg Hospital, Cape Town, South Africa: A research protocol. BMJ Open 10(8), e039455 (2020).
Lundberg, S. & Lee, S. -I. A unified approach to interpreting model predictions. arXiv [cs.AI] (2017).
Douglass, M. J. J. Book review: Hands-on machine learning with scikit-learn, keras, and tensorflow, 2nd edition by Aurélien géron. Phys. Eng. Sci. Med. 43(3), 1135–1136 (2020).
Lyashenko, V. & Jha, A. Cross-Validation in Machine Learning: How to Do It Right (Neptune Blog, 2021).
Olson, R. S., Cava, W. L., Mustahsan, Z., Varik, A. & Moore, J. H. Data-Driven Advice for Applying Machine Learning to Bioinformatics Problems. in Biocomputing 2018 (Kohala Coast, 2018).
Chen, T. & Guestrin, C. XGBoost, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco California USA, (2016).
Luckner, M., Topolski, B. & Mazurek, M. Application of XGBoost Algorithm in Fingerprinting Localisation Task. In Computer Information Systems and Industrial Management Vol. 10244 (eds Saeed, K. et al.) 661–671 (Springer, 2017).
Awad, M. & Khanna, R. 2015. Efficient learning machines: Theories, concepts, and applications for engineers and system designers. Available: https://library.oapen.org/bitstream/handle/20.500.12657/28170/1001824.pdf?sequen. Accessed 13 Jan 2023.
Daramola, O., Ajala, I. & Akinyemi, I. An experimental comparison of three machine learning techniques for web cost estimation. Int. J. Softw. Eng. Appl. 10(2), 191–206 (2016).
Allwright, S. What is a good F1 score and how do I interpret it? (2022).
Moulaei, K., Shanbehzadeh, M., Mohammadi-Taghiabad, Z. & Kazemi-Arpanahi, H. Comparing machine learning algorithms for predicting COVID-19 mortality. BMC Med. Inform. Decis. Mak. 22(1), 1–12 (2022).
Reinstein, I. 2017. Xgboost, a top machine learning method on kaggle, explained. Disponível em: https://www.kdnuggets.com/2017/10/xgboost-top-machine-learningmethod-kaggle-explained.html. Accessed 01 Aug 2018
Dhaliwal, S., Nahid, A.-A. & Abbas, R. Effective intrusion detection system using XGBoost. Information 9(7), 149 (2018).
Zhan, H. et al. Diagnostic value of D-dimer in COVID-19: A meta-analysis and meta-regression. Clin. Appl. Thromb. Hemost. 27, 10760296211010976 (2021).
Zhao, R. et al. Associations of D-dimer on admission and clinical features of COVID-19 patients: A systematic review, meta-analysis, and meta-regression. Front. Immunol. 12, 691249 (2021).
Varikasuvu, S. R. et al. D-dimer, disease severity, and deaths (3D-study) in patients with COVID-19: A systematic review and meta-analysis of 100 studies. Sci. Rep. 11(1), 21888 (2021).
Mouhat, B. et al. Elevated D-dimers and lack of anticoagulation predict PE in severe COVID-19 patients. Eur. Respir. J. 56(4), 2001811 (2020).
Sartini, S. et al. Role of SatO2, PaO2/FiO2 ratio and PaO2 to predict adverse outcome in COVID-19: A retrospective, cohort study. Int. J. Environ. Res. Public Health 18(21), 11534 (2021).
Kieninger, M. et al. Lower blood pH as a strong prognostic factor for fatal outcomes in critically ill COVID-19 patients at an intensive care unit: A multivariable analysis. PLoS One 16(9), e0258018 (2021).
Wang, D. et al. Clinical characteristics of 138 hospitalized patients with 2019 novel Coronavirus-infected pneumonia in Wuhan, China. JAMA 323(11), 1061–1069 (2020).
Zhou, F. et al. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: A retrospective cohort study. Lancet 395(10229), 1054–1062 (2020).
Chapanduka, Z.C., Abdullah, I., Allwood, B., Koegelenberg, C.F., Irusen, E., Lalla, U., Zemlin, A.E., Masha, T.E., Erasmus, R.T., Jalavu, T.P. & Ngah, V.D. Haematological predictors of poor outcome among COVID-19 patients admitted to an intensive care unit of a tertiary hospital in South Africa. PloS one 17(11), p.e0275832. (2022).
Nyasulu, P.S. et al. Clinical characteristics associated with mortality of COVID-19 patients admitted to an intensive care unit of a tertiary hospital in South Africa. PloS one 17(12), p.e0279565. (2022)
Chimbunde, E., Sigwadhi, L. N., Tamuzi, J. L., Okango, E. L., Daramola, O., Ngah, V. D., & Nyasulu, P. S. Machine learning algorithms for predicting determinants of COVID-19 mortality in South Africa. Frontiers in Artificial Intelligence 6, 1171256. (2023).
Sundaram, V. et al. Impact of comorbidities on peak troponin levels and mortality in acute myocardial infarction. Heart 106(9), 677–685 (2020).
Brojakowska, A. et al. Comorbidities, sequelae, blood biomarkers and their associated clinical outcomes in the Mount Sinai Health System COVID-19 patients. PLoS One 16(7), e0253660 (2021).
Acknowledgements
The authors acknowledge the support of the management authorities at Cape Peninsula University of Technology (CPUT), and Stellenbosch University (SU) for creating an environment that enabled the execution of this study.
Funding
The research reported in this article was supported by South African Medical Research Council with funds received from National Treasury (under the project titled “Data Analytics for Clinical Guidance on Immune Response to COVID19 Vaccination for Patients with Comorbidities and Prior Infection”).
Author information
Authors and Affiliations
Contributions
O.D., M.J.K and P.N., conceptualized the idea. O.D. was responsible for the research design. O.D, T.K. performed the code implementation, and experiments, and wrote the first draft of the paper. P.N., F.D., M.N collected the data. O.S drafted the literature review. All authors (O.D., T. K., M.J.K., J.V.M., O.S., B.K., T.M., K.S., I.F., F. D, M.N., S.J.R.,P.N.) provided a critical review of the manuscript and approved the final draft for publication. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Daramola, O., Kavu, T.D., Kotze, M.J. et al. Predictive modelling and identification of critical variables of mortality risk in COVID-19 patients. Sci Rep 15, 2184 (2025). https://doi.org/10.1038/s41598-023-46712-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-023-46712-w