Introduction

Management of the surging infections of the coronavirus disease (COVID-19) is a huge clinical challenge. Currently, the pandemic is pummeling the global health system, with 18,902,735 people infected as of August 7, 20201,2,3. The overwhelmed health facilities are unable to curb the increasing mortality of COVID-193. Moreover, without proven effective treatments to date, patients who rapidly deteriorate into a refractory state harbor significantly higher risks of death4,5. Third, advanced COVID-19 is characterized by heterogeneous clinical features and multiorgan damage5,6, which requires an effective triage and intensive monitoring. Therefore, an early warning system that enables stratification of COVID-19 patients by risk of death on admission holds enormous promise to assist in the management of COVID-19.

Electronic health records (EHRs) abound with valuable information generated from routine clinical practices7,8, which can be useful for mortality risk prediction of COVID-19. However, data in EHRs are complex, multidimensional, nonlinear, and heterogeneous. Using models more effective than traditional statistical methods (univariate or multivariate Cox regressions and logistic regression (LR)) for analysis can help to fully utilize the clinical data in EHRs. Machine learning (ML), a subfield of artificial intelligence, encapsulates statistical and mathematical algorithms that enable facts interrogation and complex decision-making9,10. Therefore, combinatory uses of ML algorithms and EHRs for prognosis prediction in the context of COVID-19 pandemic are worth exploring.

ML algorithms have been explored in myriad fields of COVID-19 including, but not limited to, detecting outbreaks, identification and classification of COVID-19 medical images, rapid diagnosis, severity risk prediction, and prognosis prediction11,12,13,14,15. For COVID-19 patients and clinicians, the greatest concern is whether the patients can survive. Available ML models that focus on this exhibit promising prognostic implications, but are still impeded by the paucity of external validations and limited follow-ups, and lack the capability of predicting prognosis as early as the time of admission.

In this study, we aim to develop a mortality risk prediction model for COVID-19 (MRPMC) that utilizes clinical data in EHRs to stratify patients by mortality risk on admission. The validated capability of enabling expeditious and accurate mortality risk stratification of COVID-19 may facilitate more responsive health systems that are conducive to high-risk COVID-19 patients via early identification, and ensuing instant intervention as well as intensive care and monitoring, thus, hopefully assisting to save lives during the pandemic.

Results

Study design and baseline characteristics

To train and validate the MRPMC for prognosis prediction of COVID-19, we included 2520 consecutive COVID-19 patients with known outcomes (discharge or death) from two affiliated hospitals of Tongji Medical College, Huazhong University of Science and Technology, including Sino-French New City Campus of Tongji Hospital (SF) and Optical Valley Campus of Tongji Hospital (OV), and The Central Hospital of Wuhan (CHWH) between January 27, 2020 and March 21, 2020. As a total of 360 patients were excluded with definite reasons, 2160 COVID-19 patients met eligibilities. For detailed exclusions, see Fig. 1 and “Methods,” participants. We randomly partitioned 50 and 50% of participants from SF into the training cohort (SFT cohort) and internal validation cohort (SFV cohort), respectively. Participants from OV and CHWH were used as two external validation cohorts (OV cohort and CHWH cohort). Compositions of the four cohorts are displayed in Fig. 1 and “Methods,” cohorts. The study design has been schematically presented in Fig. 1 and Supplementary Fig. 1.

Fig. 1: Study design.
figure 1

MRPMC mortality risk prediction model for COVID-19, SFT training cohort of Sino-French New City Campus of Tongji Hospital, SFV internal validation cohort of Sino-French New City Campus of Tongji Hospital, OV Optical Valley Campus of Tongji Hospital, CHWH The Central Hospital of Wuhan.

Table 1 shows the baseline characteristics of the four cohorts. The median age of the participants was 62 years (interquartile range [IQR]: 51–71) in the SFT cohort, 63 years (IQR: 51–70) in the SFV cohort, 63 years (IQR: 50–70) in the OV cohort, and 62.5 years (IQR: 55–72) in the CHWH cohort. The male patients accounted for 50.7, 50.0, 46.7, and 54.3% of all participants in the SFT, SFV, OV, and CHWH cohorts, respectively. Hypertension (37.1–40.3%) was the most prevalent comorbidity and fever (61.2–86.0%) remained the most common symptom. The median time from admission to death or discharge ranged from 17 to 23 days among all four cohorts.

Table 1 Baseline characteristics of individuals by cohort.

Features selected by least absolute shrinkage and selection operator (LASSO)

Among 53 raw features extracted from EHRs (Supplementary Table 1), those with a proportion of missing values greater than or equal to 5% in each cohort were filtered (Supplementary Fig. 2), resulting in 34 features, including 18 categorical features and 16 continuous ones (Supplementary Fig. 3 and 4) that underwent feature selection by the LASSO (Fig. 2a). Only 14 of the 34 features were eventually chosen for modeling (Fig. 2b), among which 8 features had a positive association with mortality (high risk: consciousness, male sex, sputum, blood urea nitrogen [BUN], respiratory rate [RR], D—dimer, number of comorbidities, and age) and 6 features were negatively correlated with mortality (low risk: platelet count [PLT], fever, albumin [ALB], SpO2, lymphocyte, and chronic kidney disease [CKD]). Multivariable Cox analysis using raw data of the 34 features proved that the features selected by LASSO exhibited similar prognostic implications (Supplementary Fig. 5 and Supplementary Table 2). High-risk features identified by LASSO were also significant unfavorable prognostic indicators recognized via multivariable Cox analysis (hazard ratio [HR] > 1 and p < 0.05). Similarly, low-risk features accorded with favorable prognostic indicators (HR < 1 and p < 0.05).

Fig. 2: Feature selection by LASSO.
figure 2

a LASSO variable trace profiles of the 34 features whose intracohort missing rates were less than 5%. The vertical dashed line shows the best lambda value 0.014 chosen by tenfold cross validation. b Feature coefficient of LASSO with best lambda value 0.014. High-risk (positive coefficient) and low-risk (negative coefficient) features are colored in red and blue, respectively. Gray features with coefficient 0 were considered redundant and removed, resulting in 14 features left for downstream prognosis modeling. LASSO least absolute shrinkage and selection operator, BUN blood urea nitrogen, RR respiratory rate, COPD chronic obstructive pulmonary disease, Hb hemoglobin, WB, white blood cell count, Cr creatinine, GGT gamma-glutamyl transferase, TB total bilirubin, AST aspartate aminotransferase, ALT alanine transaminase, MAP mean arterial pressure, ALB albumin, SpO2 oxygen saturation, CKD chronic kidney disease.

Model performance

In general, six ML models including LR, support vector machine (SVM), gradient boosted decision tree (GBDT), neural network (NN), K-nearest neighbor (KNN), and random forest (RF) all displayed varying but promising performances to predict mortality risk in the three validation cohorts in terms of discrimination and calibration. To build a predictive model with augmented prognostic implications, we integrated the top four best predictive models (LR, SVM, GBDT, and NN) to create an ensemble model called MRPMC. MRPMC outputted a normalized probability of mortality risk ranging from 0 to 1. We selected the threshold of 0.6 to assign the predicted mortality risk label by optimizing F1 score on the training cohort (Supplementary Fig. 6). Probabilities of less than 0.6 were assigned to low risk and otherwise to high risk for all ML methods across all cohorts. The procedures of establishing the MRPMC are elaborated in Methods, Model development. As expected, MRPMC exhibited greater capability of predicting mortality risk of COVID-19 than the four contributive models alone in the SFV and CHWH cohorts, though the differences between SVM and MRPMC were nuanced in the OV cohort (Fig. 3a–c).

Fig. 3: Predictive performance of models across cohorts.
figure 3

AUC to assess the performance of mortality risk prediction of models (LR, SVM, GBDT, NN, and MRPMC) in a SFV cohort, b OV cohort, and c CHWH cohort, respectively. Source data are provided as a Source Data file. Kaplan–Meier curves indicating overall survival of patients with high and low mortality risk in d SFV cohort, e OV cohort, and f CHWH cohort, respectively. The tick marks refer to censored patients. The dark red or blue line indicates the survival probability, and the light red or blue areas represent the 95% confidence interval of survival probability (p < 0.0001). AUC area under the receiver operating characteristics curve, SFV internal validation cohort of Sino-French New City Campus of Tongji Hospital, OV Optical Valley Campus of Tongji Hospital, CHWH The Central Hospital of Wuhan, LR logistic regression, SVM support vector machine, GBDT gradient boosted decision tree, NN neural network, MRPMC mortality risk prediction model for COVID-19.

MRPMC achieved an area under the receiver operating characteristics (ROC) curve (AUC) of 0.9621 (95% confidence interval [CI]: 0.9464–0.9778) in identification of nonsurvivors with an accuracy of 92.4% (95% CI: 90.1–94.4%) in SFV cohort. For OV cohort, MRPMC demonstrated an AUC of 0.9760 (95% CI: 0.9613–0.9906) and an accuracy of 95.5% (95% CI: 93.8–96.8%) to predict prognosis of COVID-19. An AUC of 0.9246 (95% CI: 0.8763–0.9729) and an accuracy of 87.9% (95% CI: 80.6–93.2%) for prognosis prediction were observed for CHWH cohort (Table 2). The calibration curve of MRPMC in the three validation cohorts are depicted in Supplementary Fig. 7, showing that MRPMC displayed a Brier score of 0.051 for SFV cohort, 0.029 for OV cohort, and 0.083 for CHWH cohort. Performances of four contributing algorithms are listed in Table 2, and that of the other two ML models (KNN and RF) in Supplementary Fig. 8 and Supplementary Table 3.

Table 2 Performance for mortality risk prediction of models in validation cohorts.

Moreover, with the time from admission to death or discharge as the endpoint, Kaplan–Meier analysis further confirmed that MRPMC could robustly stratify patients by mortality risk. High-risk COVID-19 patients labeled by MRPMC were significantly less likely to survive than low-risk patients in the SFV, OV, and CHWH validation cohorts (Fig. 3d–f; p < 0.0001) with an HR of 26.85 (95% CI: 17.41–41.42), 32.83 (95% CI: 19.70–54.70), and 12.81 (95% CI: 5.09–32.24), respectively, highlighting the capability of MRPMC to accurately predict prognosis of COVID-19.

Analyzing features included in models

Eight continuous features included in MRPMC exhibited correlation to varying degrees (Fig. 4a). Relative importance rank of all 14 variables for mortality prediction in MRPMC and the four contributive models are illustrated in Fig. 4b and Supplementary Table 4. The top weighted features (elevated D-dimer, decreased SpO2, increased RR, and lymphocytopenia) coincided with previously reported risk factors that were highly correlated with poor outcome in COVID-194,5. Standard box plots presented all differential continuous variables between survivors and nonsurvivors (Fig. 4c). Nonsurvivors had significantly (p < 0.001) advanced age, higher levels of BUN and D-dimer, and lower levels of SpO2, lymphocyte, ALB, and PLT (Fig. 4c and Supplementary Table 5). These findings were also parallel to risk factors of mortality of COVID-19 delineated previously16, indicating that the selected features were highly relevant to prognosis.

Fig. 4: Statistical analysis of features included in models.
figure 4

a Heatmap representing the correlation between continuous features included in MRPMC using Spearman’s correlation coefficient. The colors in the plot represent the correlation coefficients. The redder the color, the stronger the positive monotonic relationship. The bluer the color, the stronger the negative monotonic relationship. The size of the circle represents the absolute value of the correlation coefficient, where a larger circle represents a stronger correlation. The numbers in the lower triangle represent the value of correlation coefficient. b Scaled importance rank of all features included in MRPMC for identifying high mortality risk COVID-19 patients included in the models. The size of circles represents the value of relative importance. The different color of circles represents the feature importance in different models. c Box and jitter plots showing distribution of continuous features included in MRPMC between deceased patients (n = 254) and discharged patients (n = 1906). The center line represents the median of the feature. Box limits represent upper and lower quartiles. Whiskers represent 1.5 times interquartile range. Gray points represent outliers. The median [IQR] of the features shown in Fig. 4c were listed in Supplementary Table 4. Wilcoxon test was used in the univariate comparison between groups and a two-tailed p < 0.05 was considered as statistically significant. ***p < 0.001. Source data are provided as a Source Data file. MRPMC mortality risk prediction model for COVID-19, ALB albumin, SpO2 oxygen saturation, BUN blood urea nitrogen, RR respiratory rate, LYM lymphocyte count, PLT platelet count, No. comorbidities number of comorbidities, CKD chronic kidney disease, IQR interquartile range.

Discussion

In this multicenter retrospective study, we built the MRPMC, an ensemble model derived from four ML algorithms (LR, SVM, GBDT, and NN), that enabled accurate prediction of physiological deterioration and death for COVID-19 patients up to 20 days in advance using clinical information in EHRs on admission, and validated it both internally and externally. Importantly, the MRPMC displayed an AUC ranging from 0.9186 to 0.9762 in the three validation cohorts. The prognostic implications of MRPMC might facilitate more responsive health systems that are conducive to high-risk COVID-19 patients via early identification, and ensuing instant intervention as well as intensive care and monitoring, thus, hopefully assisting to save lives during the pandemic.

Generalizability was the first advantage of MRPMC. Initially, the SFV and OV cohorts comprised patients from two designated campuses for COVID-19, where 40 top-level medical teams across China collaborated to eradicate the crisis. Patients in the CHWH cohort were treated in a general hospital. Therefore, medical records on admission were more comprehensive in SFV and OV cohorts than in CHWH, and the treatments that patients received throughout hospitalization were more parallel between SFV and OV cohorts. Second, 44% of participants in CHWH cohort were COVID-19 patients with malignancy who were more vulnerable to COVID-19 and less likely to survive than noncancerous COVID-19 patients17,18. Validation of MRPMC in the CHWH cohort offered us opportunities not only to predict mortality risk in COVID-19 patients with cancer, a group where prognosis prediction is particularly pivotal and challenging, but also to assess MRPMC in an external validation cohort with heterogenous baseline characteristics. Importantly, although the settings of the validation cohorts varied, MRPMC exhibited an AUC of 0.9186 (95% CI: 0.8686–0.9687) to identify high-risk patients in the CHWH cohort, indicating that the prognostic implications of MRPMC were not confined to cohorts similar to SFT, but could also be successfully validated in an inhomogeneous cohort.

Strengths of MRPMC also include its stability and practicability upon COVID-19 patients with several missing features. To begin with, the 14 features for prognosis prediction were readily accessible and frequently monitored in routine clinical practice. Age and sex were basic information. Fever, sputum, and consciousness were easily observed symptoms, while RR and SpO2 were physical signs available at hand. Presence of CKD and number of comorbidities could be ascertained by referring to previous EHRs and patients or their family doctors. PLT, BUN, D-dimer, ALB, and lymphocytes were low-cost laboratory tests and conveniently determined. Unlike self-reported symptoms, these features were relatively more objective and solid, and less susceptible to memory bias. Though the 14 features were readily accessible, we appreciated the differences in medical procedures and uneven distribution of medical resources among different regions, countries, and continents. The missing features may thwart those who imminently need MRPMC. Importantly, with the imputation method we adopted (see Methods), MRPMC could still perform well in patients with several missing features.

In addition, MRPMC had certain interpretability. Features contributing to mortality risk prediction in this study were tangible and many of them had been proven intimately correlated with mortality in COVID-19 patients. Advanced age, male sex, and presence of multiple comorbidities were identified as risk factors associated with death in COVID-19 patients4,5. Sputum, supraphysiologic RR, and decreased SpO2 were directly related to pulmonary abnormalities in COVID-19. Elevated BUN, increased D-dimer, and lymphocytopenia might indicate extrapulmonary disorders and were potentially correlated with multiorgan damage caused by COVID-194,5.

Available ML-based studies on prognosis prediction of COVID-19 patients are impeded by limited sample size, category of variables for prediction, short-term follow-ups for outcomes, and paucity of independent external validation19,20,21,22,23,24,25,26. To overcome these obstacles, we included 2520 consecutive inpatients with definite outcomes and detailed baseline characteristics within a specific time period for training and multiple validations of MRPMC to avoid overfitting and ensure general applicability, reproducibility, and credibility. Meanwhile, the features contributing to prognosis prediction were collected and proposed by a multidisciplinary team including experienced clinicians, epidemiologists, and informaticians, which guaranteed the representativeness of features. Importantly, time from admission to death or discharge was 21 (IQR: 15–29) days, 19 (IQR: 14–26) days, and 17 (IQR: 12–24) days in the SFV, OV, and CHWH validation cohorts, respectively. As MRPMC displayed impressive AUCs to predict mortality risk in the validation cohorts, it could predict death ~20 days in advance. Last, since the characteristics of datasets could affect the validity of the classification strategies of ML algorithms, we proposed an ensemble model derived from four ML algorithms for more accurate prediction of mortality risk in COVID-19 patients.

Although most cases of COVID-19 are not life-threatening, those that underwent physiological deterioration harbored significantly higher mortality (49.0% for critically ill patients versus 2.3% for overall patients)27. As the pandemic causes more infections, our understandings of the risk factors for mortality and the role that supportive, targeted, and immunological therapies play in treating COVID-19 continue to improve16,28,29. The aim of developing MRPMC is to mitigate the huge burden derived from COVID-19 on global health system and help to optimize clinical decision makings. MRPMC could automatically identify patients having high mortality risk as early as the time of admission when related symptoms are mild and nonspecific. This group of patients needs intensive monitoring and instant treatment when unfavorable prognostic indicators are observed, thus, hopefully improving patient outcomes. However, multiple evaluations of MRPMC in larger cohorts, prospective settings, and clinical trials are needed before elucidating its contribution to improving outcome of COVID-1915.

This study had some limitations. Patients included were primarily local residents from Wuhan, China. The predictive performance of the ML models merits investigation in other regions and ethnicities. Besides, the prognostic implications of MRPMC have not been evaluated in prospective cohorts due to the retrospective nature of this study.

In conclusion, combinatorial applications of MRPMC and EHRs with readily available features can enable timely and accurate risk stratification of COVID-19 patients on admission. MRPMC can potentially assist clinicians to promptly target the high-risk patients on admission, and accurately predict physiological deterioration and death up to 20 days in advance.

Methods

Participants

We included 2520 consecutive COVID-19 patients with known outcomes (discharge or death) from two affiliated hospitals of Tongji Medical College, Huazhong University of Science and Technology (Sino-French New City Campus of Tongji Hospital, SF and Optical Valley Campus of Tongji Hospital, OV) and The Central Hospital of Wuhan (CHWH) between January 27, 2020 and March 21, 2020. A total of 360 patients were excluded for various reasons, including 72 patients who failed to accord with the defined diagnosis of COVID-19 in the 7th edition of the Diagnosis and Treatment Protocol of COVID-19 released by the National Health Commission of China30, 217 patients who were transferred from Fangcang shelter hospitals for isolation, 33 patients who died within 24 h of admission, and 38 patients who were under 18 years of age, were pregnant, or were re-hospitalized or discharged for special reasons such as dialysis (Fig. 1). Eventually, 2160 patients were included for model training and validations.

Cohorts

We randomly partitioned 50 and 50% of participants from SF into training cohort (SFT cohort) and internal validation cohort (SFV cohort), respectively. Participants from OV and CHWH were used as two external validation cohorts (OV cohort and CHWH cohort). Specifically, as Fig. 1 indicates, SFT cohort comprised 621 patients (535 survivors and 86 nonsurvivors); SFV, 622 patients (533 survivors and 89 nonsurvivors); OV, 801 patients (741 survivors and 60 nonsurvivors); and CHWH, 116 patients (97 survivors and 19 nonsurvivors). Patients with malignancy were reportedly more susceptible and vulnerable to COVID-19 owing to their immunocompromised states caused by the cancer itself, cachexia, and antitumor treatment31. They were also less likely to survive than noncancerous COVID-19 patients17,18, making COVID-19 patients with cancer an intriguing group of population for prognosis prediction. To investigate the capability of ML models to predict prognosis in this population, we consecutively included 54 malignant COVID-19 patients from the Cancer Center of CHWH and 62 noncancerous COVID-19 patients from the Department of Respiratory of CHWH to constitute another external validation cohort. The detailed baseline characteristics of the cohorts are shown in Table 1.

Ethics

This study was approved by the Research Ethics Commission of Tongji Medical College, Huazhong University of Science and Technology (TJ-IRB20200406) with waived informed consent by the Ethics Commission mentioned above. This study was part of the observational clinical trial titled “A retrospective study for evolution and clinical outcomes study of novel coronavirus pneumonia (COVID-19) patients,” which was registered in the Chinese Clinical Trial Registry (ChiCTR2000032161). The clinical trial partly aimed to investigate the independent risk factors for adverse outcomes of COVID-19. The detailed information can be accessed in http://www.chictr.org.cn/showprojen.aspx?proj=52561.

Data collection

Under the guidance of a multidisciplinary team including experienced clinicians, epidemiologists, and informaticians, we extracted 53 features including epidemiological, demographic, clinical, laboratory, radiological, and outcome data from EHRs using identical data collection forms on the first day of admission (Supplementary Table 1). Trained researchers entered and double-checked the data independently. To ensure the alarming function and subjective initiative of models, we abandoned variables generated in the late admission and variables regarding treatment. For patients with multiple features, we included only the first episode in various categories at admission.

Feature filtering and imputation

First, we removed 19 features that harbored a proportion of missing values greater than or equal to 5% in each cohort (Supplementary Fig. 2). Filtering features with a large fraction of missing entries is common when dealing with clinical data32. Then, we imputed the missing entries with R-package missForest in the three cohorts separately33. Imputation of clinical data with RF has been widely adopted34,35, which displayed the capability of handling mix-type missing values including continuous and categorical variables. Although we visualized the imputation result of categorical and continuous variables separately (Supplementary Fig. 3 and 4), the imputation was conducted with the 34 mix-type features together for three validation cohort, respectively.

Feature selection

After filtering 19 features and data imputation, there were 34 features remaining for feature selection. To eliminate redundant collinear features and diminish cost of clinical testing, we performed feature selection by recognizing the most predictive variables using LASSO LR (Fig. 2a)32,36. LASSO added the L1 norm of the feature coefficients as a penalty term to the loss function, which forced the coefficients corresponding to those weak features to become zero. Herein, we considered features whose coefficients were equal to zero as redundant features and abandoned them, resulting in 14 selected features for model constructions (Fig. 2b).

Model development

We trained the models to predict mortality risk with the 14 variables and outcomes of COVID-19 patients. During model training, we fitted six baseline ML models, including LR, SVM, KNN, RF, GBDT, and NN, into the SFT cohort with tenfold cross validation to fine-tune the model parameters. Increasing the weight of minority categories in the model can increase the punishment for wrong classification of minority categories during training, and improve the model’s ability to recognize minority categories37. Therefore, we adopted weighted cross-entropy and increased the weight of class death for probability-based classifiers (LR, RF, GBDT, and NN). Subsequently, an ensemble model derived from four baseline models of best predictive performance (LR, SVM, GBDT, and NN), named MRPMC, was proposed by weighted voting. Specifically, the mortality risk probability of each individual estimator (LR, SVM, GBDT, and NN) was integrated by manually assigning weights with 0.25, 0.3, 0.1, and 0.35, respectively. After all ML models were well fitted, they were internally and externally evaluated in SFV, OV, and CHWH cohorts. Herein, we modeled the mortality prediction task as a binary classification problem. All included ML models output a normalized probability of mortality risk range from 0 to 1. We selected the threshold of 0.6 to assign the predicted mortality risk label by optimizing F1 score on the training cohort (Supplementary Fig. 6). Probabilities of less than 0.6 were assigned to low risk and otherwise to high risk for all ML methods across all cohorts. R library caret was utilized for model training and prediction. The LR, SVM, KNN, RF, GBDT, and NN models were called with method bayesglm, svmLinear, knn, rf, gbm, and avNNet with default settings, respectively. We standardized the features data with BoxCox, center, and scale function before training and prediction. Especially, we first adopted BoxCox transformation to make the data distribution more Gaussian-like38, and then standardized features by subtracting the mean and scaling to unit variance. Variable z was calculated as: \(z = (x-u)/s\), where u was the mean and s was the standard deviation of the variable.

Model evaluation

The predictive performance of the models was evaluated by ROC curve, Kaplan–Meier curve, calibration curve, and evaluation metrics including area under the ROC curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score, Cohen’s Kappa coefficient (Kappa), and Brier score. The relative feature importance of each model was calculated using varImp function in caret R package. As SVM and KNN classifier had no built-in importance score, the AUC for each feature was utilized as the importance score.

Statistical analysis

Statistical analysis was performed in R (version 3.6.2). For descriptive analysis, median (IQR) and frequencies (%) were assessed for continuous and categorical variables, respectively. The ROC curve and AUC analysis were conducted with R pROC package. Accuracy, sensitivity, specificity, PPV, NPV, Kappa, and F1 score were calculated with R caret and epiR packages. The calibration curve and Brier score were obtained with R-package rms. Relative feature importance was calculated using R-package caret. Survival curves were developed by Kaplan–Meier method with log-rank test, and plotted with R-package survival and survminer. Comparison of continuous variables was achieved by the Mann–Whitney U test using R-package table1. Odds ratio and corresponding 95% CI from LR were calculated with R-package stats. The significance level was set at a two-sided p value below 0.05. Univariate and multivariate Cox regression was utilized to calculate the HR with R-package survival. All dry-lab experiments were conducted in three different computing servers with consistent result.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.