Abstract

Predicting death among COVID-19 patients can help healthcare providers manage these patients better. We aimed to develop machine learning models to predict in-hospital death among these patients. We developed models using different feature sets and datasets created with a data balancing method. We used demographic and clinical data from a multicenter COVID-19 registry. We extracted 10,657 records for patients with COVID-19 confirmed by PCR or CT scan who had been hospitalized for at least 24 hours by the end of March 2021. The death rate was 16.06%. Generally, models with 60 and 40 features performed better. Among the 240 models, the C5 models with 60 and 40 features performed well. The C5 model with 60 features outperformed the rest on all evaluation metrics; however, in external validation, C5 with 32 features performed better. This model had high accuracy (91.18%), F-score (0.916), Area under the Curve (0.96), sensitivity (94.2%), and specificity (88%). The model suggested in this study uses simple and readily available data and can be applied to predict death among COVID-19 patients. Furthermore, we concluded that machine learning models may perform differently in different subpopulations defined by gender and age group.

1. Introduction

Although more than 2 years have passed since the start of the COVID-19 pandemic and vaccination is under way in many countries, the disease’s prevalence and mortality have not slowed down, and many countries are still experiencing high peaks [1]. In addition, multiple mutations in the virus have become a new challenge to controlling the disease, leading to its spread and to increased mortality [2–4]. As of April 16, 2022, more than 500 million cases of the disease and more than 6 million deaths due to COVID-19 had been reported globally, with more than 7 million cases and 140,000 deaths in Iran [1].

Since the beginning of the COVID-19 pandemic, one of the most critical challenges for healthcare systems has been the increasing number of patients with severe symptoms and the growing demand for hospitalization. In developing countries, which do not have sufficient healthcare infrastructure, the increase in inpatients has placed a heavy burden on the healthcare system. Moreover, numerous studies have reported various risk factors for the deterioration of COVID-19 patients, such as old age, male gender, and underlying medical conditions (such as hypertension, cardiovascular disease, diabetes, COPD, cancer, and obesity) [5–9].

The use of modern and noninvasive methods to triage patients into specific and known categories at the early stages of the disease is beneficial [10]. One of these approaches is the use of predictive models based on machine learning [11, 12]. For example, predictive models based on mortality risk factors can help prevent mortality by supporting the control of acute conditions and planning in intensive care units [13]. Furthermore, machine learning can classify patients by risk of deterioration and predict the likelihood of death so that resources can be managed optimally [14, 15].

To date, several studies have been published on the application of machine learning to develop diagnostic models or predict the death of patients due to COVID-19 [14–23]. For example, several deep learning models have been reported to diagnose COVID-19 based on images [24]. In one study, researchers developed an enhanced fuzzy-based deep learning model to differentiate between COVID-19 and infectious pneumonia (non-COVID-19) based on portable CXRs and achieved up to 81% accuracy. Their fuzzy model had only three misclassifications on the validation dataset [24].

As for death prediction, several studies have also been published [16, 25–28]. The results of studies on machine learning-based predictive methods indicated that those methods had reliable predictive ability and could identify the correlation between intervening variables in the complex and ambiguous conditions caused by COVID-19. Therefore, they can be used to predict such situations in the future. Although those techniques have been tested on some regional datasets of risk factors, the performance of the models may be improved when they are applied to datasets from other countries such as Iran, where the prevalence of COVID-19 and related deaths is high.

Iran was one of the first countries to face a widespread outbreak of the disease and has experienced more than four major epidemic waves with the highest mortality rates [29, 30]. Because of the high prevalence and mortality rate of COVID-19 in Iran and the limitation of healthcare resources [31, 32], it is vital to have a prediction model based on Iranian conditions and local data. Therefore, this study aimed to fit a model for predicting death caused by COVID-19 based on machine learning algorithms. Many previous models are based on laboratory, imaging, or treatment data [16, 25–28]; in contrast, we propose models based on readily available demographic data, symptoms, and comorbidities that can be easily collected. We also conducted a bias analysis of the machine learning models across subgroups of the patient population to show the bias of these models.

2. Materials and Methods

2.1. Population and Data

We extracted data from the Khuzestan COVID-19 registry system belonging to Ahvaz Jundishapur University of Medical Sciences (AJUMS). Since the beginning of the pandemic, this registry has collected data from suspected (based on clinical signs) and confirmed (based on the results of PCR or CT scan) outpatients and inpatients in Khuzestan province, Iran. The registry collects demographic data, signs and symptoms, patient outcomes, PCR and CT results, and comorbidities from 38 hospitals. The details of data collection and data quality control were published elsewhere [30].

We included only patients with a confirmed diagnosis of COVID-19 based on PCR test or CT scan results for this modeling study. Furthermore, we included only patients who were hospitalized for more than 24 hours. Because outpatients and hospitalized patients with a short stay (less than 24 hours) had a lot of missing data, we excluded these cases from the final analysis. We also included patients from all age groups. Finally, we extracted data for 10,657 patients. The frequency of nonsurviving patients (until discharge) was 1711 (16.06%); 8946 patients (83.94%) were discharged alive. Figure 1 shows the steps of this study.

2.2. Data Preprocessing
2.2.1. Imputing Missing Variables

Because of the data quality controls in the registry, the database had a low rate of missing data. The missing rate for 28 variables was below 4% (Supplement 1, Table S1). In machine learning, data imputation is a standard approach to improve the models’ performance. Imputation with the mean, median, or mode is common. We imputed missing values with the mean for age and with the most frequent value for nonnumerical variables [11, 33].
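The imputation rule described above (mean for age, most frequent value for nonnumerical variables) can be illustrated with a minimal sketch. This is not the study's implementation, which was carried out in IBM SPSS; the function name `impute`, the field names, and the toy records are hypothetical and serve only to show the rule.

```python
from statistics import mean, mode

def impute(records, numeric_fields, categorical_fields):
    """Return copies of the records with None replaced by the column mean
    (numeric fields) or the most frequent value (categorical fields)."""
    fill = {}
    for field in numeric_fields:
        fill[field] = mean(r[field] for r in records if r[field] is not None)
    for field in categorical_fields:
        fill[field] = mode(r[field] for r in records if r[field] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in records
    ]

# Toy records with hypothetical values (not registry data).
patients = [
    {"age": 60, "cough": "yes"},
    {"age": None, "cough": "yes"},
    {"age": 40, "cough": None},
    {"age": 50, "cough": "no"},
]
imputed = impute(patients, ["age"], ["cough"])
```

In this toy example the missing age is filled with the mean of the observed ages and the missing cough value with the most frequent observed category.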

2.2.2. Features and Feature Selection

The outcome measure of the study is in-hospital mortality until discharge, which is recorded as a binary variable (yes/no). The dataset contains 60 input variables. Age and the number of comorbidities are numerical; oxygen saturation level (PO2) takes two values, below and above 93%. We created three dummy variables for the diagnosis method (only positive PCR; only abnormal CT; both positive PCR and abnormal CT). Other variables have two values: yes or no.

For feature selection, we applied univariate analysis using Chi-square or Fisher exact tests for nonnumerical variables and the Mann-Whitney U test for age and number of comorbidities (due to their non-normal distribution). We created different feature sets to build the prediction models. The first set included all 60 variables. The second set consisted of variables that were significant in univariate analysis (P value <0.05). The third feature set included the marginal variables based on univariate analysis (P value <0.2). To create the fourth feature set, we used the feature selection node in the IBM SPSS modeler. This node identifies important features based on univariate analysis as well as the frequency of missing values and the percentage of records with the same value. Table 1 shows the variables in each of these feature sets.
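As an illustration of the P value thresholds used to build the second and third feature sets, the following sketch screens binary features with a Pearson Chi-square test on a 2×2 table. It is a simplified stand-in for the SPSS analysis: it assumes no continuity correction, omits the Fisher exact and Mann-Whitney U tests, and the function names, feature names, and counts are hypothetical. For one degree of freedom, the P value equals erfc(sqrt(statistic / 2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def assign_feature_sets(tables, strict=0.05, marginal=0.2):
    """Group features by univariate P value threshold."""
    sets = {"p<0.05": [], "p<0.2": []}
    for name, cells in tables.items():
        _, p = chi_square_2x2(*cells)
        if p < strict:
            sets["p<0.05"].append(name)
        if p < marginal:
            sets["p<0.2"].append(name)
    return {key: sorted(names) for key, names in sets.items()}

# Hypothetical counts: (present & died, present & survived,
#                       absent & died,  absent & survived)
tables = {
    "respiratory_distress": (60, 40, 40, 60),
    "fever": (55, 45, 45, 55),
    "headache": (50, 50, 50, 50),
}
feature_sets = assign_feature_sets(tables)
```

With these toy counts, only the first feature passes the stricter threshold, the second is marginal, and the third is excluded from both sets.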

2.2.3. Data Balancing

We first developed our models with a variety of machine learning algorithms on the original dataset (dataset 1). These models had poor sensitivity because of the small number of samples in the death class (83.94% surviving vs. 16.06% nonsurviving, ratio = 5.23), so they did not predict death well. Various methods, such as oversampling the minority class or undersampling the majority class, can address this problem [11, 12]. We oversampled the death cases to create more balanced datasets. Datasets 2 and 3 included 5,133 (36.5%, ratio = 1.74) and 8,938 (49.98%, ratio = 1) nonsurviving patients, respectively. We developed our models with all four feature sets on these three datasets.
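Oversampling of the minority class can be sketched as follows. The text does not specify the oversampling procedure, so this example assumes simple random duplication of death records with replacement; the function name `oversample`, the field name `died`, and the toy counts are hypothetical.

```python
import random

def oversample(records, label_key, minority_label, target_minority_count, seed=0):
    """Randomly duplicate minority-class records (with replacement) until the
    minority class reaches target_minority_count; majority records are kept."""
    rng = random.Random(seed)
    minority = [r for r in records if r[label_key] == minority_label]
    majority = [r for r in records if r[label_key] != minority_label]
    n_extra = max(0, target_minority_count - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy data: 84 survivors and 16 deaths, roughly mirroring the 84%/16% split.
records = [{"id": i, "died": i < 16} for i in range(100)]
balanced = oversample(records, "died", True, target_minority_count=84)
n_deaths = sum(r["died"] for r in balanced)
n_survivors = len(balanced) - n_deaths
```

Here the class ratio moves from 84:16 to 1:1, analogous to the step from dataset 1 to dataset 3; a smaller target would give an intermediate ratio like that of dataset 2.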

2.3. Model Development and Evaluation

We randomly divided the data into two sets, training (70%) and testing (30%), and developed our models using common machine learning algorithms that are usually reported to perform well in medicine, including Multilayer Perceptron (MLP) neural networks [11, 12, 34], Chi-squared Automatic Interaction Detection (CHAID), C5, and Random Forest (RF) decision trees [11, 12, 33, 34], Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel [12, 35, 36], and Bayesian network [12, 37–39].
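The random 70%/30% partition can be illustrated with a short sketch. The study performed this step in IBM SPSS modeler; the function name `split_train_test`, the seed, and the toy records below are assumptions made for illustration.

```python
import random

def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy record identifiers (hypothetical).
records = list(range(100))
train, test = split_train_test(records)
```

Fixing the seed makes the partition reproducible, so that all algorithms are compared on the same training and testing records.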

We first developed models based on the default parameter settings. We developed CHAID decision trees with a maximum depth of five and a minimum of two records in the nodes. Moreover, we implemented the C5 tree with a minimum of two records in nodes. RF was implemented with a maximum depth of 10 and a minimum of five records in nodes, using 100 models. The SVM model was implemented with a regularization parameter of 10 and a gamma of 0.1. We additionally developed MLPs using different numbers of neurons (5, 10, 15, and 20) in one and two hidden layers and also with the number of neurons suggested by the software. We also implemented the best CHAID, C5, and MLP models with the boosting ensemble method and 10-fold cross-validation. Furthermore, we implemented stack models (combining individual models) [40]. Our analysis showed that models developed on dataset 3 generally had better performance. Therefore, we developed stack models, based on the best individual models, on this dataset with different feature sets.

2.4. External Validation

For external validation, we extracted 1734 records from the Khuzestan COVID-19 registry system. These data are from four different hospitals in different timeframes. Therefore, these data were not used in training or testing the models. This dataset contained 1425 surviving and 309 nonsurviving patients. Inclusion and exclusion criteria were similar to the training/testing dataset, described in Section 2.1. The best performing models selected from the previous step and also ensemble models were validated using this dataset.

2.5. Subpopulation Bias Analysis

Previous studies show that predictive models may have different performances against different subpopulations, for example, in different sex or age groups [41, 42]. To assess this effect, we adopted the method suggested by Seyyed-Kalantari et al. They suggested the use of false-positive rate (FPR) and false-negative rate (FNR) in subpopulations to assess the underdiagnosis and overdiagnosis of machine learning models [41]. We similarly calculated FNR and FPR to assess the underprediction or overprediction of death in our models. To this end, we used the best performing models in external evaluation and the external dataset.
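The subgroup calculation can be sketched as follows, where FPR = FP / (FP + TN) corresponds to overprediction of death and FNR = FN / (FN + TP) to underprediction. This is a minimal illustration rather than the study's code; the function names, field names, and toy counts are hypothetical.

```python
from collections import defaultdict

def subgroup_error_rates(rows, group_key):
    """Compute FPR and FNR per subgroup from actual and predicted outcomes."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for r in rows:
        c = counts[r[group_key]]
        if r["died"]:
            c["tp" if r["predicted_death"] else "fn"] += 1
        else:
            c["fp" if r["predicted_death"] else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # patients who survived
        positives = c["tp"] + c["fn"]  # patients who died
        rates[group] = {
            "FPR": c["fp"] / negatives if negatives else None,
            "FNR": c["fn"] / positives if positives else None,
        }
    return rates

def make_rows(sex, tp, fn, fp, tn):
    """Build toy rows for one subgroup from confusion-matrix counts."""
    rows = [{"sex": sex, "died": True, "predicted_death": True}] * tp
    rows += [{"sex": sex, "died": True, "predicted_death": False}] * fn
    rows += [{"sex": sex, "died": False, "predicted_death": True}] * fp
    rows += [{"sex": sex, "died": False, "predicted_death": False}] * tn
    return rows

# Hypothetical predictions (not study results).
rows = make_rows("female", tp=3, fn=1, fp=1, tn=3) + make_rows("male", tp=2, fn=2, fp=2, tn=2)
rates = subgroup_error_rates(rows, "sex")
```

A gap in FNR or FPR between subgroups, as between the two toy groups here, is the signal used to flag under- or overprediction in one group relative to another.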

2.6. Analysis

We used IBM SPSS statistical software version 23 for statistical analysis and IBM SPSS modeler version 18 to develop and evaluate the machine learning models. We evaluated and compared the models using the confusion matrix, accuracy, precision, sensitivity, specificity, F-score, and Area under the Curve (AUC). To select the best performing models, we compared the models obtained for each combination of dataset and feature set with each other based on AUC and F-score.
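For reference, the threshold-based metrics listed above follow directly from the four cells of the confusion matrix, as in the sketch below. The function name and the counts are hypothetical, and AUC is omitted because it requires predicted scores rather than a single confusion matrix.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive standard metrics from confusion-matrix counts,
    treating death as the positive class."""
    sensitivity = tp / (tp + fn)   # recall for the death class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
    }

# Hypothetical counts (not study results).
metrics = classification_metrics(tp=90, fn=10, fp=20, tn=80)
```

On imbalanced data a model can reach high accuracy while sensitivity stays low, which is why sensitivity and F-score are examined alongside accuracy.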

2.7. Ethical Considerations

This study received ethical approvals from the Ethics Research Committee of Ahvaz Jundishapur University of Medical Sciences (IR.AJUMS.REC.1400.325).

3. Results

3.1. Descriptive Data

We extracted data for 10,657 patients from the Khuzestan COVID-19 registry [30]. The frequency of nonsurviving patients (until discharge) was 1711 (16.06%); 8946 patients (83.94%) were discharged alive. Table 2 shows that the death due to COVID-19 was significantly higher among men, older patients, and those who have been in contact with infected individuals. In addition, respiratory distress, convulsion, altered consciousness, and paralysis were more common among the nonsurviving patients. Conversely, cough, headache, diarrhea, and dizziness were less prevalent among them. Furthermore, oxygen saturation status was better among the recovered patients versus the dead. Moreover, the comorbidities and risk factors (excluding pregnancy) as well as the intubation, oxygen therapy at the beginning of hospitalization, and ICU admission were significantly higher among the dead.

3.2. The Machine Learning Algorithms and Their Evaluation

The results of performing various models with different settings on three datasets and four feature groups are reported as follows.

3.2.1. The Machine Learning Algorithms on Original Dataset 1

The details on the performance of the models are given in Supplement 1 (Tables S2–S5). The results showed that the lowest and highest accuracy of the models based on the original dataset 1 were 84.52% (RF with 32 features) and 91.12% (Bayesian network with 32 features), respectively. In addition, the minimum and maximum AUC were 0.757 (C5 with 32 features) and 0.914 (Bayesian network with 32 features), respectively. The sensitivity for predicting death based on original dataset 1 was low, between 0.484 (MLP network with 60 features) and 0.775 (RF with 32 features), which indicates that the sensitivity of the models on imbalanced data is inadequate. Table 3 shows the performance of the top 10 models based on the test data of dataset 1. According to the table, the best two models were the Bayesian network and the CHAID tree on 32 features, respectively. The ROC curve for the best models is presented in Supplementary Figure S1.

3.2.2. The Machine Learning Algorithms on Dataset 2

The details on the performance of the models based on dataset 2 are given in Supplement 1, Tables S6–S9. The findings showed that the lowest and highest accuracy were 82.64% (MLP with 60 features) and 87.86% (RF with 60 features), respectively. Moreover, the minimum and maximum values of the AUC were 0.888 (MLP with 60 features) and 0.942 (SVM with 60 features), respectively. The sensitivity for predicting death was between 0.658 (MLP network) and 0.861 (CHAID tree with 32 features). The best results obtained for each algorithm based on dataset 2 are shown in Supplementary Figure S2. According to Table 4, SVM and C5 models had the best performance on 60 and 40 features, respectively.

3.2.3. The Machine Learning Algorithms on Dataset 3

The details on the performance of the models based on dataset 3 are given in Supplement 1, Tables S10–S13. The results showed that the lowest and highest accuracy were 81.27% (CHAID tree with 32 features) and 92.77% (C5 with 60 features), respectively. Moreover, the minimum and maximum AUC were 0.899 (CHAID with 32 features) and 0.972 (C5 with 60 features), respectively. The sensitivity for predicting death was between 0.752 (MLP with 60 features) and 0.951 (C5 tree with 60 features). The best results obtained for each algorithm based on dataset 3 are shown in Supplementary Figure S3. According to Table 5, the C5 model had the best performance with different features, and SVM with 60 features was also one of the optimal models.

3.3. Ensemble Models

Table 6 indicates that the best ensemble model had 89.13% accuracy and 0.961 AUC. However, the comparison of these models with the corresponding individual models (Table 5) shows that C5 models have better performance than these ensemble models, even though these ensemble models are better than other individual models.

3.4. External Validation

We evaluated all ensemble models (Table 6) and the top 10 models developed on dataset 3 (Table 5) using an external dataset. As shown in Table 7, C5 boosting models with feature sets 1 and 2 have better scores.

3.5. Subpopulation Bias Analysis

We selected the four best models based on external validation for subpopulation bias analysis (Supplement 1, Table S14). Figures 2 and 3 show the FPR and FNR of these models. As these figures indicate, most of these models perform better on female patients than on male patients. Furthermore, the performance of these models decreases in older patients. As for FPR, Figure 2 indicates that SVM and C5 (feature set 2) have a less biased prediction in terms of gender and age groups. Additionally, Figure 3 shows that C5 (feature set 2) has a less biased prediction.

3.6. Comparison of the Models

A comparison of the models showed that balancing the data increased the sensitivity and AUC. Accuracy decreased on dataset 2 but increased on dataset 3. Furthermore, models with 60 and 40 features performed better. In general, the C5 model with 60 features outperformed the rest based on all evaluation indicators; however, based on the external validation, C5 boosting models with feature sets 1 (17 features) and 2 (32 features) have better external validity. Subpopulation analysis suggests that the C5 boosting model with 32 features has less bias.

3.7. Variable Importance

Figure 4 shows the importance of each variable in the selected model (C5). As indicated, intubation, number of comorbidities, age, gender, respiratory distress, blood oxygen saturation level, ICU admission, cough, unconsciousness, positive PCR, and abnormal CT are considered the most important death predictors by this model.

4. Discussion

In the first stage of the study, the risk factors for death due to COVID-19 were discovered using univariate analysis. Then, based on the important features, different machine learning models were developed to predict death. The results showed significant differences between recovered and nonrecovered patients in terms of age, sex, contact with infected people, respiratory distress, convulsion, altered consciousness, paralysis, blood oxygen saturation level, the number of comorbidities, intubation, oxygen therapy, and the need for ICU services.

We found that intubation, number of comorbidities, age, gender, respiratory distress, blood oxygen saturation level, ICU admission, cough, unconsciousness, positive PCR, and abnormal CT are the most important death predictors. Other studies showed that age [17, 18, 23, 27, 28, 43], male gender [43], respiratory disease [16, 17], the number of comorbidities [43], and low oxygen saturation [17, 18, 23, 43] increased cases of death due to COVID-19. Some researchers indicate that high blood pressure, heart disease, cancer, kidney disease [16, 17], diabetes [18], cerebrovascular diseases [28], smoking [18, 23], and asthma [16] increased mortality from COVID-19. However, our model did not consider these factors significant. It is worth mentioning that these risk factors increased the number of comorbidities in a patient and this factor was also considered significant in the C5 model.

We developed various models with different features to predict death from COVID-19. Based on the results, the best performance was obtained with the C5 decision tree with 32 features. Several other studies have likewise tried to develop machine learning models for predicting death from COVID-19 [16–23, 25–28, 43–45]. Since a variety of variables (demographic, laboratory, radiographic, therapeutic, signs and symptoms, and comorbidities) and datasets are used, it is not easy to compare the studies. For example, some researchers used laboratory data to develop models in addition to other variables [17, 23, 28, 43], and one study applied only laboratory variables [45]. In another study, vital signs and imaging results were used to develop models [23]. However, the variables used in our study were similar to those in most of the studies. A comparison of our study with previous studies showed that the performance of our selected model was better than that of most of those models (Table 8). The model developed by Gao et al. [43] has better performance (AUC = 0.976 vs. AUC = 0.972); however, this model was developed with a small sample size. In addition, the F-score (F = 0.97) of the model developed by Yan et al. [19] was higher than that of our selected model. However, Barish et al. [46] showed that Yan’s model did not have a good result in external validation. Khan’s model [26] also has a higher F-score than our model. Khan et al. and Gao et al. used unbalanced data; Barish et al. [46] have shown that models developed on unbalanced data to predict death from COVID-19 may not give accurate results in a real environment.

We found that machine learning models perform differently in subpopulations in terms of gender and age groups. Other studies similarly show that predictive models have different performances in different ethnic groups, genders, and age groups of patients and patients with different insurance [41, 42]. Therefore, researchers and clinicians should apply these models to different population groups cautiously. Moreover, developing models for different patient groups may be necessary.

A strength of our model is its use of demographic data, symptoms, and comorbidities that can be easily collected. Unlike some previous studies, we did not use laboratory, treatment, and imaging data. This can be considered a limitation. However, we assumed that all patients received almost similar treatments. Moreover, applying models that are developed based on treatment data may be difficult because of changes in patients’ treatment. Furthermore, models that depend on laboratory and imaging data require considerable time and cost to gather these data before the model can be used in a real clinical environment. A comparison of our study with those that used laboratory and imaging data (Table 8) indicates that our selected model outperforms many of these models. One study also indicated that imaging data did not affect the performance of machine learning models to predict death from COVID-19 [23]. In addition, the data used in our study were collected from 38 hospitals, which is a strength of the study. A similar study indicated that up to 20% of missing data in COVID-19 studies is acceptable for developing machine learning models [18]; the missing rate in our study was under 4%.

Despite the strengths, some limitations should be considered. Firstly, we only analyzed the subpopulation bias based on gender and age groups. Future studies should consider other variables in this analysis. Furthermore, there are several well-established models such as APACHE and SOFA [41, 42]. Researchers are recommended to compare the performance of machine learning models with these models to predict deaths from COVID-19.

5. Conclusions

Different machine learning models were developed to predict the likelihood of death caused by COVID-19. The best prediction model was the C5 decision tree (accuracy = 91.18%, AUC = 0.96, and F = 0.916). Therefore, this model can be used to detect high-risk patients and improve the use of facilities, equipment, and medical practitioners for patients with COVID-19.

Data Availability

The data used to support the findings of this study are restricted by the Ethics Research Committee of Ahvaz Jundishapur University of Medical Sciences in order to protect patient privacy.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

J. Zarei and A. Jamshidnezhad contributed to conceptualization, data curation, and writing – review and editing. M. H. Shoushtari contributed to conceptualization, methodology, and writing – review and editing. A. Hadianfard and M. Cheraghi contributed to conceptualization and writing – review and editing. A. Sheikhtaheri contributed to conceptualization, methodology, data analysis, supervision, writing – original draft, and writing – review and editing. All authors reviewed the final version of the manuscript and approved it for submission. This study was conducted based on the Khuzestan COVID-19 registry data. We would like to thank the Khuzestan COVID-19 registry for providing data for this study.

Acknowledgments

This study was supported by Ahvaz Jundishapur University of Medical Sciences. The funder had no role in the study design; data collection, analysis, and interpretation; writing of the report; and the decision to submit.

Supplementary Materials

Supplement 1: detailed Tables S1–S14. Supplement 2: Figures S1–S3.