[Retracted] Identifying COVID-19-Specific Transcriptomic Biomarkers with Machine Learning Methods

Chen, Lei; Li, Zhandong; Zeng, Tao; Zhang, Yu-Hang; Feng, KaiYan; Huang, Tao; Cai, Yu-Dong

doi:https://doi.org/10.1155/2021/9939134

BioMed Research International

On this page

Abstract Introduction Materials and Methods Results Discussion Conclusion Abbreviations Data Availability Conflicts of Interest Authors’ Contributions Acknowledgments Supplementary Materials References Copyright Related Articles

Research Article Retraction

!

This article has been Retracted. To view the article details, please click the ‘Retraction’ tab above.

Special Issue

Applications of Computational Methods for Prediction, Prevention, Diagnosis, and Treatment of Neurodegenerative Diseases

View this Special Issue

Research Article | Open Access

Volume 2021 | Article ID 9939134 | https://doi.org/10.1155/2021/9939134

[Retracted] Identifying COVID-19-Specific Transcriptomic Biomarkers with Machine Learning Methods

Lei Chen,^1,2Zhandong Li,³Tao Zeng,⁴Yu-Hang Zhang,⁵KaiYan Feng,⁶Tao Huang,^4,7and Yu-Dong Cai¹

Academic Editor: Min Tang

Received17 Mar 2021

Revised03 Jun 2021

Accepted24 Jun 2021

Published07 Jul 2021

Abstract

COVID-19, a severe respiratory disease caused by a new type of coronavirus SARS-CoV-2, has been spreading all over the world. Patients infected with SARS-CoV-2 may have no pathogenic symptoms, i.e., presymptomatic patients and asymptomatic patients. Both patients could further spread the virus to other susceptible people, thereby making the control of COVID-19 difficult. The two major challenges for COVID-19 diagnosis at present are as follows: (1) patients could share similar symptoms with other respiratory infections, and (2) patients may not have any symptoms but could still spread the virus. Therefore, new biomarkers at different omics levels are required for the large-scale screening and diagnosis of COVID-19. Although some initial analyses could identify a group of candidate gene biomarkers for COVID-19, the previous work still could not identify biomarkers capable for clinical use in COVID-19, which requires disease-specific diagnosis compared with other multiple infectious diseases. As an extension of the previous study, optimized machine learning models were applied in the present study to identify some specific qualitative host biomarkers associated with COVID-19 infection on the basis of a publicly released transcriptomic dataset, which included healthy controls and patients with bacterial infection, influenza, COVID-19, and other kinds of coronavirus. This dataset was first analysed by Boruta, Max-Relevance and Min-Redundancy feature selection methods one by one, resulting in a feature list. This list was fed into the incremental feature selection method, incorporating one of the classification algorithms to extract essential biomarkers and build efficient classifiers and classification rules. The capacity of these findings to distinguish COVID-19 with other similar respiratory infectious diseases at the transcriptomic level was also validated, which may improve the efficacy and accuracy of COVID-19 diagnosis.

1. Introduction

COVID-19 is recognized as starting from the end of 2019. It is a severe respiratory disease caused by a new type of coronavirus SARS-CoV-2 and has been spreading all over the world [1–3]. By the end of January 2021, approximately 100 million cases and 2 million deaths have been reported worldwide [4], making COVID-19 one of the most widespread and deadly infectious diseases in human history. In the US alone, more than 26 million cases were reported [4]. Different from other severe diseases, COVID-19 hardly has typical symptoms that could be used for diagnosis. A wide range of disease-associated symptoms, such as respiratory or systematic, were reported to be associated with COVID-19, including fever, cough, headache, diarrhea, and muscle or body aches [5, 6]. Moreover, patients infected with SARS-CoV-2 may have no pathogenic symptoms, i.e., presymptomatic patients and asymptomatic patients. In the early stage (first 2 days) of SARS-CoV-2 infection, patients may not have any COVID-19 associated symptoms, and they could be clustered as presymptomatic patients [7]. However, some patients may never have any symptoms but still have been infected by SARS-CoV-2, and they could be defined as asymptomatic patients. Both types of patients could further spread the virus to other susceptible people, thereby making the control of the COVID-19 pandemic difficult [8].

The two major challenges for COVID-19 diagnosis at present are as follows: (1) patients could share similar symptoms with other respiratory infections, and (2) patients may not have any symptoms but could still spread the virus. Therefore, identifying new biomarkers at different omics levels (genomic, transcriptomic, or proteomic levels) may be helpful for large-scale screening and diagnosis of COVID-19. Genomic analyses on COVID-19 mainly focused on the genomics of the virus and not the host by identifying the typical sequence of the ORF1ab, spike, ORF3a, envelope, membrane, and nucleocapsid of SARS-CoV-2 [9]. Meanwhile, many transcriptomic and proteomic analyses focused on the host, especially on the host–virus interaction-associated alterations in the host system. For example, in April 2020, a systematic study (GSE150728) on the expression pattern of immune-associated genes in lung tissue or related human lung cells during the infection of SARS-CoV-2 was presented, revealing that the selective death of type II pneumocytes caused by abnormal immune responses caused high morbidity and mortality in COVID-19 cases [10]. However, despite the encouraging results presented, this study has two obvious shortcomings: (1) the major findings were based on in vitro-cultured cell lines and only two patients each group were enrolled, and (2) only immune-associated genes were taken into consideration. As for other transcriptomic analyses, only single-cell subgroups, such as human lung cell lines [10], cardiomyocyte cells [11], and human bronchial organoids [12], have been analysed and discussed, and systematic transcriptomic analyses on lung tissue are lacking.

Although some initial analyses on such transcriptomic datasets could identify a group of candidate gene biomarkers, such as IFI6, TIMP1, and LGR6, for COVID-19 in the previous study [13], the dataset used did not contain normal controls and only divided patients into three rough groups: patients with COVID-19, those with other viral infections, and those without viral infections. Thus, the previous work could not fully identify biomarkers capable for clinical use in COVID-19, which requires disease-specific diagnosis compared with other multiple infectious diseases. As an extension of the previous study, a recent dataset released on the Gene Expression Omnibus (GEO) database (GSE161731) [14] was introduced for further analyses. These blood sample transcriptomic data of 195 subjects include 19 healthy controls and 23, 17, 77, and 59 patients with bacterial infection, influenza, COVID-19, and other kinds of coronavirus, respectively. The new dataset could be used to screen out potential transcriptomic biomarkers from the comprehensive lung tissue, and a comparison between COVID-19 and other infectious respiratory diseases could further help identify disease-specific biomarkers to distinguish COVID-19 from other similar diseases.

In this study, on the basis of the publicly released dataset, optimized machine learning models were applied to identify some specific qualitative host biomarkers associated with COVID-19 infection. Two powerful feature selection methods (Boruta [15] and Max-Relevance and Min-Redundancy (mRMR) [16]), were applied on this dataset one by one. A feature list was generated, which was further fed into the incremental feature selection (IFS) method [17]. Four classic classification algorithms were tried in the IFS method. As a result, we accessed some essential biomarkers, efficient classifiers, and classification rules. The capacity of these findings to distinguish COVID-19 with other similar respiratory infectious diseases at the transcriptomic level was validated, which could improve the efficacy and accuracy of COVID-19 diagnosis.

2. Materials and Methods

2.1. Data

The blood expression profiles of 15,379 genes in acute respiratory infection samples were downloaded from the GEO database under accession number GSE161731 [14]. A total of 195 samples with demographic information were included as follows: 19 healthy controls, 23 patients with bacterial pneumonia, 17 patients with influenza, 59 patients with seasonal coronavirus, and 77 patients with SARS-CoV-2 infection. The 15,379 genes are listed in Table S1. The processed transcript-per-million expression data were used for further analysis.

2.2. Boruta Feature Filtering

The investigated dataset involved lots of features/genes. Evidently, some are relevant to acute respiratory infection, whereas others are not. To extract the relevant features, the Boruta [15] method was employed.

Boruta is a random forest- (RF-) based feature select method. Given a dataset, a shuffled feature is added for each original feature. A RF classifier is built on a dataset with original and added features. According to the performance of RF, calculate the score of all features and find the maximum score among shuffled features (MZSA). Determine the original features as “important” if their scores are significantly higher than MZSA; whereas when scores of some features are signicantly lower than MZSA, they are labelled as “unimportant.” The above procedures are executed several times until all original features are labelled as “important” or “unimportant,” or the times of RF runs have reached a predefined number.

In this study, we adopted the program of Boruta retrieved from https://github.com/scikit-learn-contrib/boruta_py. It was run with its default parameters.

2.3. Max-Relevance and Min-Redundancy (mRMR) Feature Selection

mRMR [16] is a mutual information- (MI-) based feature selection approach to evaluate the importance of features. This method has wide applications in tackling several biological and medical problems [13, 18–23]. For variables and , their MI can be calculated bywhere denotes the marginal probabilistic density of , represents the joint probabilistic density of and , respectively. A high MI means two variables have high associations. For a feature, its importance is reflected by its rank in a feature list. To generate such list, a loop procedure is included in the mRMR method. Initially, this list is empty. A feature is selected in each round and appended to this list. Such feature is selected by the following manner. For each nonselected feature, calculate its relevance to class labels, which is defined as the MI of it and class labels, and its redundancies to already-selected features, which is defined as the average MI of it and already-selected features. The feature with maximum difference of above two values is selected. The loop procedure stops until all features are selected. For convenience, this list was called mRMR feature list in this study.

In present study, the mRMR program downloaded from http://penglab.janelia.org/proj/mRMR/ was used. Such program was executed with its default parameters.

2.4. Incremental Feature Selection (IFS)

IFS is a widely used approach integrated with supervised classifier (e.g., SVM) to determine the optimal feature number for classification model construction [17]. On the basis of the mRMR feature list available from mRMR, IFS could produce step-wise feature subsets in a given step interval (i.e., 1). For instance, the first feature subset has the top-ranked features, and then the second feature subset has the top-ranked features, and so on. For each candidate feature subset, a classifier could be built on the basis of the training sample data within such feature subset. In IFS, the optimal feature subset is obtained when a classifier could achieve the best performance measurement, evaluated by Matthew’s correlation coefficient (MCC) [24], within 10-fold cross-validation [25] on such feature subset.

2.5. Candidate Classification Algorithms

The four classification algorithms were tried in the IFS method. Their brief descriptions are as follows.

2.5.1. RF

RF is an assembly prediction model that uses average prediction [26], which predicts the class label of a test sample dependent on the consensus prediction results from a series of decision trees (DTs). It is widely used in bioinformatics researches [27–31].

2.5.2. Support Vector Machine (SVM)

SVM [32–38] consists of several computational steps. First, it transforms the original data from a low-dimensional data space to a high-dimensional data space. It could also transform the original nonlinear data pattern to new linear data pattern [39, 40]. Second, it divides the data points in the high-dimensional data space by maximizing the space interval among data points from different classes/labels. Finally, it predicts the test sample’s class label by judging which space interval this new data point belongs to. Here, the SVM model construction adopted the SMO in Weka.

2.5.3. -Nearest Neighbor (kNN)

The computational steps of kNN [41] are as follows: first, it calculates the sample distance between a new sample and all training samples. Then, it ranks all training samples in accordance with these distance measurements. Next, it chooses the -nearest training samples and estimates the class label distribution of these samples. Finally, it predicts the class label of new sample as the one with the largest distribution frequency. Here, the kNN model building adopted the Ibk in Weka.

2.5.4. DT

As a rule-based white-box classification and regression model, DT [42, 43] generally applies IF-TEHN format to indicate each feature’s role and weight in a model and corresponding rule, which thereby provides interpretative rules. Here, the DT model learning adopted the CART algorithm with the Gini index in the Scikit-learn package.

2.6. MCC

MCC [24] can evaluate the classification performance of different models. For the multiclass problem faced in this work, MCC could be calculated using the following formula:where data matrix has binary values representing the predicted sample classes, data matrix has binary values indicating the true sample classes, and calculates the two matrices’ covariance. The value of MCC ranges from −1 to +1 [19], and it is equal to +1 when the classification model has the best performance.

3. Results

In this study, we applied several advanced computational methods to the blood expression profiles of acute respiratory infection samples. The whole procedures are illustrated in Figure 1. The detailed results are listed in this section.

3.1. Results of Boruta and mRMR Methods

Each acute respiratory infection sample was represented by the blood expression level of 15,379 genes, which are provided in Table S1. These features (genes) were first analysed by the Boruta method. 604 relevant features were extracted, which are listed in Table S2. Then, these features were evaluated by the mRMR method. A feature list, called mRMR feature list, was produced, which is also provided in Table S2.

3.2. Results of IFS Method

The mRMR feature list was fed into the IFS method, which incorporated one of four classification algorithms (RF, SVM, KNN, and DT). 604 feature subsets were constructed in the IFS method, each of which contained some top features in the mRMR feature list. On each feature subset, a classifier was built based on a given classification algorithm, which was further assessed by 10-fold cross-validation. The accuracy on each category, overall accuracy, and MCC were counted. The above measurements obtained by all classification algorithms and constructed feature subsets are available in Table S3. For an easy observation, a curve was plotted for each classification algorithm, in which MCC was set as the -axis and number of features was set as the -axis. These four curves are shown in Figure 2. For SVM, the highest MCC was 0.917, which was obtained by using top 168 features. Thus, the SVM classifier with these features was deemed as the optimum SVM classifier. The overall accuracy of such classifier was 0.938 (Table 1). The accuracies on five categories yielded by such classifier are illustrated in Figure 3. Samples in three categories were all correctly predicted. These results indicated the excellent performance of the optimum SVM classifier.

As for KNN and RF, the highest MCCs were 0.845 and 0.896 when the top 183 and 565, respectively, features were used. These MCCs were lower than that of the optimum SVM classifier. Likewise, the optimum KNN and RF classifiers were built with the corresponding top features. The overall accuracies of these two classifiers are listed in Table 1. They were also lower than that of the optimum SVM classifier. The accuracies on five categories yielded by these two classifiers were also generally lower than those of the optimum SVM classifier (see Figure 3).

In addition to the above-mentioned three black-box classification algorithms, we also employed a white-box classification algorithm, DT. The same procedure was done for this algorithm. The curve is shown in Figure 2. The highest MCC was 0.818 when top 511 features were adopted. Such MCC was lower than that of the optimum SVM/KNN/RF classifier. The overall accuracy was 0.867 (Table 1), also lower than that of the optimum SVM/KNN/RF classifier. Furthermore, the accuracies on five categories, as shown in Figure 3, were also generally lower than those of other three optimum classifiers. Although such DT classifier did not provide good performance, we can obtain more insights from such classifier, which would be listed in the following subsection.

3.3. Classification Rules

The best DT classifier adopted top 511 features. Thus, we used these 511 features to build a DT using all acute respiratory infection samples. 21 rules were extracted from this DT, which are listed in Table S4. Among these 21 rules, eight rules were for prediction of SARS-CoV-2 infection samples, which were most, followed by rules for seasonal coronavirus, influenza, healthy control, and bacterial pneumonia (see Figure 4). The discussion on these rules can be found in Discussion.

3.4. Functional Enrichment Analyses

The optimum SVM classifier adopted top 168 features (genes). Using these selected COVID19 associated genes as gene of interest and all genes in analyses as gene background, we performed GO and KEGG enrichment analyses using DAVID website (https://david.ncifcrf.gov/). The FDR threshold for significant enriched results is set as 0.05. All the significant results are presented in Table 2.

4. Discussion

The top-ranked features (genes/transcripts) and rules were identified by applying these optimal machine learning models. According to recent publications [44–51], several identified top-ranked features and rule-involved features have been confirmed to be associated with the infection of a specific kind of pathogen, thus validating the efficacy and accuracy of the prediction in the current work. The detailed discussion can be found below.

4.1. Transcripts Associated with Disease-Specific Diagnosis of Different Pathogens

The first identified gene in the prediction list is RPL6. Together with some other ribosomal proteins, such as RPL3 and RPS20, RPL6 has already been reported to have differential expression patterns under specific physical and pathological conditions [52, 53]. Early in 2006, ribosomal proteins have been shown to be associated with lung bacterial infections caused by pneumococcal pneumonia in a mouse model [44]. As for influenza virus infections, in 2015, another independent study [45] at transcriptomic level has identified a group of ribosomal proteins, including RPL6, RPL15, RPL17, and RPL22, to have differential expression levels during influenza infections. As for coronavirus, including SARS-CoV-2, in 2020, a study [54] on the interactions between viral envelope protein and host cells confirmed that papain-like proteases, which are quite conserved in the coronavirus family, interact with the host ribosomal proteins. Therefore, ribosomal proteins, such as RPL6, RPL3, and RPS20, have differential expression levels during bacterial infection, influenza, and coronavirus infections, including COVID, thus making such transcripts potential biomarkers to distinguish patients with viral infections and normal controls.

The next identified gene is ZNF496, an effective DNA-binding transcription factor in the lung under physical and pathological conditions [55]. With few validated reports on its associations with infections, it has only been shown to be associated with SARS-CoV-2 in a recent transcriptional regulatory network study [46] and identified as a potential therapeutic target, implying its potential significance for COVID-19 [46]. Therefore, such gene may also be a potential biomarker distinguishing patients with COVID-19 from others.

DYNLRB1, as another predicted biomarker candidate, has previously been reported to be associated with linking dynein to cargos and regulatory adapters for dynein functions [56]. Early in 2011, DYNLRB1 has been confirmed to be associated with multiple viral infections in lung, including influenza virus but not coronavirus, in mouse models [47]. However, no direct evidence has shown that such gene is associated with bacterial or coronavirus infections (including COVID-19), indicating it may be a potential biomarker for influenza virus infection, which is also in agreement with the prediction.

TRBV20-1 is a transcript of the variable domain of T cell receptor, which participates in the antigen recognition and varies for different potential antigens, such as those from different pathogens, including influenza, bacteria, or coronavirus [48, 49]. Although gene TRBV20-1 does not have tissue specificity, considering that T cell-mediated immune responses have shown to be associated with COVID infections and the predicted gene TRBV20-1 has been confirmed to be expressed in lung, it is reasonable to speculate that TRBV20-1 may participate in the COVID-mediated lung infections.

Apart from another ribosomal protein associated transcript RPL36AL, PHOSPHO1, as a potential regulator for phosphatase activity and phosphocholine phosphatase activity regulations in cells, has been predicted to have differential expression levels during infection with different pathogens. Phosphatase activity has been shown to be essential for the infection of bacteria [57], influenza [58], and SARS-CoV-2 [50]. In particular, in the study associated with SARS-CoV-2, PHOSPHO1 has also been shown to be associated with immunomodulatory effects of the host against such virus [50]. Therefore, PHOSPHO1 may also be one of the potential biomarker candidates with disease-specific diagnosis capacity.

TMEM165, as a widely reported transmembrane protein expressed in fibroblasts, has also been predicted to be associated with bacterial infections in lungs. Different from the genes discussed above, TMEM165 has been shown to be not associated with viral infections, including influenza or coronavirus infection. In 2019, researchers have shown that TMEM165 is associated with bacterial infections in yeast [59]. Further, another study confirmed that such gene is effective in the lung and associated with chronic bacterial infection and inflammation [51], thus corresponding with the prediction in the present work.

4.2. Quantitative Rules Associated with Disease-Specific Diagnosis of Different Pathogens

Apart from the above qualitative analyses, quantitative analyses were performed to establish accurate rules for disease classification. Here, the top rules of each group were selected for follow-up detailed discussion.

The first rule aims to identify patients with COVID-19 infection with decreased expression levels of SORT1, RPL21P28, SIDT2, and TKT and a relatively high expression of GZMB. SORT1 has been shown to be upregulated in almost all lung infections due to its specific relationships with neutrophil recruitment in lung tissues/surrounding vascular against pathogens, especially for bacterial infections [60–62]. By contrast, specifically, in COVID-19, a network based analyses has shown that such gene is associated with the infection of SARS-CoV-2 with a relatively low expression, corresponding with our predictions [63]. Similar decreased expression levels of RPL21P28, SIDT2, and TKT have also been validated in the transcriptomic analyses of COVID-19 host cells [64, 65]. Generally, GZMB has been widely reported to be expressed within cytotoxic CD8+ T cells. However, recent publications have also confirmed that in anti-virus CD4+ T cells, GZMB is also highly expressed which is detected using intracellular staining [66]. As specific for SARS-CoV-2 associated infections, in 2020, a specific single-cell transcriptomic analyses on SARS-CoV-2 host cells revealed that in reactive CD4+ T cells, GZMB turned out to be upregulated [67], corresponding with the prediction in the present study. Although based on our bulk analyses, we cannot confirm whether detected GZMB is derived from CD4+ or CD8+ T cells; however, as an SARS-CoV-2 viral infection-associated gene, the identification of such gene may also prove the validity of the prediction to a certain extent.

The next rule is aimed at identifying patients with other coronavirus infection with decreased expression levels of HK3, CDKN1A, HMGN3, CACNA1l, and ATP6V1D and an increased expression of SORT1. As discussed above, SORT1 has been shown to be associated with lung infections induced by multiple pathogens [60–62], including other coronavirus, thus explaining the high expression of such gene in this rule. HK3, which encodes the effective hexokinase 3 protein and participates in glucose metabolism pathways, has been predicted to be downregulated during coronavirus infection [68], including SARS-CoV-2 infection [69]. As for the remaining four genes, CKDN1A has been directly reported to be positively associated with coronavirus infection and related complications [70]. Although no direct evidence confirmed the relationship between coronavirus infection and HMGN3, CACNA1l, and ATP6V1D, all these genes have been shown to be associated with infection-associated inflammation responses [71, 72], indicating their potential capacity for the prediction of coronavirus infections.

In rules associated with bacterial lung infections, the high expression levels of SORT1, HK3, and BAZ1A may be enough to identify patients with bacterial lung infections. As discussed above, a high expression of SORT1 indicates the activation of neutrophil recruitment, which is quite common for bacterial infections [73] and different from COVID-19 infection. Meanwhile, HK3 seems to be upregulated in lungs during bacterial infection, and such gene has been screened out as a host transcriptomic biomarker for the classification of bacteria and virus [74], thus corresponding with the prediction in the present work. Although no direct reports indicated the expression patterns of BAZ1A during bacterial infections, as mentioned above, neutrophil recruitment is quite common for bacterial infections.

With the involvement of effective biomarker candidates, such as SORT1, HK3, CDKN1A, NLRC5, and DACH1, the next rule contributes to the identification of influenza virus infections. Similar with the previous rules, SORT1, HK3, and CDKN1A have been predicted to be associated with the identification of influenza virus infections. As discussed above, a high expression of SORT1, a downregulated HK3, and a high expression of CDKN1A are associated with viral infections [60–62, 68, 71, 72]. The upregulation of NLRC5 and the activation of related pathways triggered by interactions between NLRC5 and RIG-I initiate a robust antiviral response against influenza virus infection [75]. Therefore, a relatively increased expression of NLRC5 during influenza virus infections is reasonable. As for DACH1, a recent comparable study [76] on COVID-19 infection, influenza virus infection, and normal controls revealed that after transcriptional regulation, the expression of DACH1 was relatively increased in patients infected with influenza, thus validating the efficacy and accuracy of the newly presented computational methods.

Increased expression levels of RPL21P28 and RTN1 and a decreased expression of SORT1 contribute to the rule for identifying healthy controls. The decreased expression of SORT1 indicated no remarkable neutrophil recruitment, corresponding with the physical conditions of normal controls. RPL21P28 has shown to be significantly differentially expressed in normal controls and tissues after infections, especially in human macrophages [77], which are generally activated during infections. Therefore, such gene could be summarized in this rule for the identification of normal controls. Similarly, RTN1 has been shown to be associated with macrophage-mediated immune suppressants, different from the immune activators in the previously discussed rules [78], thereby validating the predictions on normal control.

4.3. Functional Enrichment Analyses Using DAVID (DAVID Bioinformatics Resources 6.8)

Here, with the 168 selected COVID19 associated genes as gene of interest and all candidate genes as gene background, we performed functional enrichment analyses on GO terms and KEGG pathways using DAVID (Table 2) and selected the significant enriched results with FDR threshold as 0.05. According to the enriched results, multiple GO terms and KEGG pathways associated with RNA binding and replication via reverse transcription processes have been identified, meaning that selected genes are shown to be enriched in the RNA viral replication. Considering that COVID19 is a typical RNA virus, the enrichment results validated the reliability of the selected genes. Apart from that, we also identified multiple GO/KEGG terms associated with extracellular exosome/matrix. According to recent publications, extracellular microenvironment, especially for the vesicles outside the cells, is associated with the proliferation and spread of COVID-19 virus [79], validating our enrichment results.

All in all, the optimal blood-oriented features identified for the disease-specific diagnosis of COVID-19 and similar respiratory infectious pathogens have been validated. They are associated with their respective pathogens, and they even directly contribute to the pathogenesis according to recent publications. Therefore, the newly presented computational method in this study could be effective for the identification of COVID-19-associated biomarkers, and they could lay a solid foundation for further pathogenesis exploration on COVID-19-associated diseases.

5. Conclusion

In this study, a computational analysis was performed on an existing dataset of acute respiratory infection samples. The results included three parts. The first part was a set of genes/transcripts. They were highly related to one or more types of acute respiratory infection and can be latent biomarkers. The second part was the efficient classifiers, which can quickly identify the type of acute respiratory infection for a query sample. The third part was a set of classification rules, indicating different expression patterns on five types, giving more information to help us understand different types of acute respiratory infection.

Abbreviations

mRMR:	Max-Relevance and Min-Redundancy
IFS:	Incremental feature selection
RF:	Random forest;
MI:	Mutual information
MCC:	Matthew’s correlation coefficient
DT:	Decision tree
SVM:	Support vector machine
kNN:	-nearest neighbor

Data Availability

The data used to support the findings of this study have been deposited in the Gene Expression Omnibus repository (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161731).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Authors’ Contributions

Lei Chen and Zhandong Li contributed equally to this work.

Acknowledgments

This research was funded by the Strategic Priority Research Program of Chinese Academy of Sciences (XDB38050200), the National Key R&D Program of China (2017YFC1201200 and 2018YFC0910403), Shanghai Municipal Science and Technology Major Project (2017SHZDZX01), the National Natural Science Foundation of China (31701151), Shanghai Sailing Program (16YF1413800), the Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) (2016245), and the Fund of the Key Laboratory of Tissue Microenvironment and Tumor of Chinese Academy of Sciences (202002).

Supplementary Materials

Supplementary 1. Table S1: 15937 features (genes) to represent each acute respiratory infection sample.

Supplementary 2. Table S2: mRMR feature list generated by mRMR method.

Supplementary 3. Table S3: performance of IFS with different classifiers.

Supplementary 4. Table S4: rules generated from DT analysis.

References

K. Yuki, M. Fujiogi, and S. Koutsogiannaki, “COVID-19 pathophysiology: a review,” Clinical Immunology, vol. 215, p. 108427, 2020.
View at: Publisher Site | Google Scholar
T. P. Velavan and C. G. Meyer, “The COVID-19 epidemic,” Tropical Medicine & International Health, vol. 25, no. 3, pp. 278–280, 2020.
View at: Publisher Site | Google Scholar
K. Dhama, S. Khan, R. Tiwari et al., “Coronavirus disease 2019-COVID-19,” Clinical Microbiology Reviews, vol. 33, no. 4, 2020.
View at: Google Scholar
E. Dong, H. Du, and L. Gardner, “An interactive web-based dashboard to track COVID-19 in real time,” The Lancet Infectious Diseases, vol. 20, no. 5, pp. 533-534, 2020.
View at: Publisher Site | Google Scholar
H. A. Rothan and S. N. Byrareddy, “The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak,” Journal of Autoimmunity, vol. 109, p. 102433, 2020.
View at: Publisher Site | Google Scholar
G. Pascarella, A. Strumia, C. Piliego et al., “COVID-19 diagnosis and management: a comprehensive review,” Journal of Internal Medicine, vol. 288, no. 2, pp. 192–206, 2020.
View at: Publisher Site | Google Scholar
L. C. Tindale, M. Coombe, J. E. Stockdale et al., “Transmission interval estimates suggest pre-symptomatic spread of COVID-19,” MedRxiv, 2020.
View at: Publisher Site | Google Scholar
H. Han, Z. Xu, X. Cheng et al., “Descriptive, retrospective study of the clinical characteristics of asymptomatic COVID-19 patients,” mSphere, vol. 5, no. 5, 2020.
View at: Publisher Site | Google Scholar
M. Chiara, D. S. Horner, C. Gissi, and G. Pesole, “Comparative genomics suggests limited variability and similar evolutionary patterns between major clades of SARS-Cov-2,” BioRxiv, 2020.
View at: Publisher Site | Google Scholar
D. Blanco-Melo, B. E. Nilsson-Payant, W. C. Liu et al., “Imbalanced host response to SARS-CoV-2 drives development of COVID-19,” Cell, vol. 181, no. 5, pp. 1036–1045.e9, 2020.
View at: Publisher Site | Google Scholar
A. Sharma, G. Garcia Jr., Y. Wang et al., “Human iPSC-derived cardiomyocytes are susceptible to SARS-CoV-2 infection,” Cell Reports Medicine, vol. 1, no. 4, article 100052, 2020.
View at: Publisher Site | Google Scholar
T. Suzuki, Y. Itoh, Y. Sakai et al., “Generation of human bronchial organoids for SARS-CoV-2 research,” BioRxiv, 2020.
View at: Publisher Site | Google Scholar
Y.-H. Zhang, H. Li, T. Zeng et al., “Identifying transcriptomic signatures and rules for SARS-CoV-2 infection,” Frontiers in Cell and Developmental Biology, vol. 8, p. 627302, 2021.
View at: Publisher Site | Google Scholar
M. T. McClain, F. J. Constantine, R. Henao et al., “Dysregulated transcriptional responses to SARS-CoV-2 in the periphery,” Nature Communications, vol. 12, no. 1, p. 1079, 2021.
View at: Publisher Site | Google Scholar
M. Kursa and W. Rudnicki, “Feature selection with the Boruta package,” Journal of Statistical Software, Articles, vol. 36, no. 11, pp. 1–13, 2010.
View at: Google Scholar
Hanchuan Peng, Fuhui Long, and C. Ding, “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
View at: Publisher Site | Google Scholar
H. A. Liu and R. Setiono, “Incremental feature selection,” Applied Intelligence, vol. 9, no. 3, pp. 217–230, 1998.
View at: Publisher Site | Google Scholar
S. Zhang, T. Zeng, B. Hu et al., “Discriminating origin tissues of tumor cell lines by methylation signatures and Dys-methylated rules,” Frontiers in Bioengineering and Biotechnology, vol. 8, p. 507, 2020.
View at: Publisher Site | Google Scholar
S. Zhang, X. Y. Pan, T. Zeng et al., “Copy number variation pattern for discriminating MACROD2 states of colorectal cancer subtypes,” Frontiers in Bioengineering and Biotechnology, vol. 7, p. 407, 2019.
View at: Publisher Site | Google Scholar
L. Chen, T. Zeng, X. Pan, Y. H. Zhang, T. Huang, and Y. D. Cai, “Identifying methylation pattern and genes associated with breast cancer subtypes,” International Journal of Molecular Sciences, vol. 20, no. 17, p. 4269, 2019.
View at: Publisher Site | Google Scholar
S. He, F. Guo, Q. Zou, and HuiDing, “MRMD2.0: a Python tool for machine learning with feature ranking and reduction,” Current Bioinformatics, vol. 15, no. 10, pp. 1213–1221, 2021.
View at: Publisher Site | Google Scholar
X. Zhao, L. Chen, and J. Lu, “A similarity-based method for prediction of drug side effects with heterogeneous information,” Mathematical Biosciences, vol. 306, pp. 136–144, 2018.
View at: Publisher Site | Google Scholar
X. Pan, H. Li, T. Zeng et al., “Identification of protein subcellular localization with network and functional embeddings,” Frontiers in Genetics, vol. 11, p. 626500, 2021.
View at: Publisher Site | Google Scholar
B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
View at: Publisher Site | Google Scholar
R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in International joint Conference on artificial intelligence, pp. 1137–1145, Montreal, QC, Canada, 1995.
View at: Google Scholar
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
View at: Publisher Site | Google Scholar
Z. B. Lv et al., “RF-PseU: a random forest predictor for RNA Pseudouridine sites,” Frontiers in Bioengineering and Biotechnology, vol. 8, p. 10, 2020.
View at: Google Scholar
L. Xu, G. Liang, C. Liao, G.-D. Chen, and C.-C. Chang, “k-Skip-n-Gram-RF: a random forest based method for Alzheimer's disease protein identification,” Frontiers in Genetics, vol. 10, p. 7, 2019.
View at: Publisher Site | Google Scholar
Y. Jia, R. Zhao, and L. Chen, “Similarity-based machine learning model for predicting the metabolic pathways of compounds,” IEEE Access, vol. 8, pp. 130687–130696, 2020.
View at: Publisher Site | Google Scholar
H. Liang, L. Chen, X. Zhao, and X. Zhang, “Prediction of drug side effects with a refined negative sample selection strategy,” Computational and Mathematical Methods in Medicine, vol. 2020, Article ID 1573543, 16 pages, 2020.
View at: Publisher Site | Google Scholar
X. Y. Pan, T. Zeng, Y. H. Zhang et al., “Investigation and prediction of human interactome based on quantitative features,” Frontiers in Bioengineering and Biotechnology, vol. 8, p. 730, 2020.
View at: Publisher Site | Google Scholar
C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
View at: Publisher Site | Google Scholar
J. Li, L. Lu, Y. H. Zhang et al., “Identification of leukemia stem cell expression signatures through Monte Carlo feature selection strategy and support vector machine,” Cancer Gene Therapy, vol. 27, no. 1-2, pp. 56–69, 2020.
View at: Publisher Site | Google Scholar
J.-P. Zhou, L. Chen, T. Wang, and M. Liu, “iATC-FRAKEL: a simple multi-label web server for recognizing anatomical therapeutic chemical classes of drugs with their fingerprints only,” Bioinformatics, vol. 36, no. 11, pp. 3568-3569, 2020.
View at: Publisher Site | Google Scholar
H. Liu, B. Hu, L. Chen, and L. Lu, “Identifying protein subcellular location with embedding features learned from networks,” Current Proteomics, vol. 17, 2020.
View at: Publisher Site | Google Scholar
L. Chen, S. Wang, Y. H. Zhang et al., “Identify key sequence features to improve CRISPR sgRNA efficacy,” IEEE Access, vol. 5, pp. 26582–26590, 2017.
View at: Publisher Site | Google Scholar
L. Chen, X. Y. Pan, W. Guo et al., “Investigating the gene expression profiles of cells in seven embryonic stages with machine learning algorithms,” Genomics, vol. 112, no. 3, pp. 2524–2534, 2020.
View at: Publisher Site | Google Scholar
Y. Zhu, B. Hu, L. Chen, and Q. Dai, “iMPTCE-Hnetwork: A Multilabel Classifier for Identifying Metabolic Pathway Types of Chemicals and Enzymes with a Heterogeneous Network,” Computational and Mathematical Methods in Medicine, vol. 2021, Article ID 6683051, 12 pages, 2021.
View at: Publisher Site | Google Scholar
C. Meng, F. Guo, and Q. Zou, “CWLy-SVM: a support vector machine-based tool for identifying cell wall lytic enzymes,” Computational Biology and Chemistry, vol. 87, p. 107304, 2020.
View at: Publisher Site | Google Scholar
M. Tahir and A. Idris, “MD-LBP: an efficient computational model for protein subcellular localization from HeLa cell lines using SVM,” Current Bioinformatics, vol. 15, no. 3, pp. 204–211, 2020.
View at: Publisher Site | Google Scholar
T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
View at: Publisher Site | Google Scholar
S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660–674, 1991.
View at: Publisher Site | Google Scholar
Y.-H. Zhang, T. Zeng, L. Chen, T. Huang, and Y. D. Cai, “Determining protein-protein functional associations by functional rules based on gene ontology and KEGG pathway,” Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, vol. 1869, no. 6, article 140621, 2021.
View at: Publisher Site | Google Scholar
T. Sawa, S. Kimura, N. H. Honda et al., “Diagnostic usefulness of ribosomal protein L7/L12 for pneumococcal pneumonia in a mouse model,” Journal of Clinical Microbiology, vol. 51, no. 1, pp. 70–76, 2013.
View at: Publisher Site | Google Scholar
E. E. Davenport, R. D. Antrobus, P. J. Lillie, S. Gilbert, and J. C. Knight, “Transcriptomic profiling facilitates classification of response to influenza challenge,” Journal of Molecular Medicine, vol. 93, no. 1, pp. 105–114, 2015.
View at: Publisher Site | Google Scholar
C. Su, S. Rousseau, and A. Emad, “Identification of COVID-19-relevant transcriptional regulatory networks and associated kinases as potential therapeutic targets,” bioRxiv, 2020.
View at: Publisher Site | Google Scholar
J. Merino-Gracia, M. F. García-Mayoral, and I. Rodríguez-Crespo, “The association of viral proteins with host cell dynein components during virus infection,” The FEBS Journal, vol. 278, no. 17, pp. 2997–3011, 2011.
View at: Publisher Site | Google Scholar
J. Rossjohn, S. Gras, J. J. Miles, S. J. Turner, D. I. Godfrey, and J. McCluskey, “T cell antigen receptor recognition of antigen-presenting molecules,” Annual Review of Immunology, vol. 33, no. 1, pp. 169–200, 2015.
View at: Publisher Site | Google Scholar
L. Rowen, B. F. Koop, and L. Hood, “The complete 685-kilobase DNA sequence of the human beta T cell receptor locus,” Science, vol. 272, no. 5269, pp. 1755–1762, 1996.
View at: Publisher Site | Google Scholar
M. J. Corley, C. Sugai, M. Schotsaert, R. E. Schwartz, and L. C. Ndhlovu, “Comparative in vitro transcriptomic analyses of COVID-19 candidate therapy hydroxychloroquine suggest limited immunomodulatory evidence of SARS-CoV-2 host response genes,” bioRxiv, 2020.
View at: Google Scholar
D. Polineni, H. Dang, P. J. Gallins et al., “Airway mucosal host defense is key to genomic regulation of cystic fibrosis lung disease severity,” American Journal of Respiratory and Critical Care Medicine, vol. 197, no. 1, pp. 79–93, 2018.
View at: Publisher Site | Google Scholar
A. Anirudhan, P. I. Angulo-Bejarano, P. Paramasivam et al., “RPL6: a key molecule regulating zinc- and magnesium-bound metalloproteins of Parkinson's disease,” Frontiers in Neuroscience, vol. 15, p. 631892, 2021.
View at: Publisher Site | Google Scholar
Q. Wu, Y. Gou, Q. Wang et al., “Downregulation of RPL6 by siRNA inhibits proliferation and cell cycle progression of human gastric cancer cell lines,” PLoS One, vol. 6, no. 10, article e26401, 2011.
View at: Publisher Site | Google Scholar
G. Nallur, “Interaction of the SARS-COV2 envelope protein (E) with lysophosphatidic acid receptor 1 (LPAR1) and additional human proteins involved in inflammation, immunity, ADP ribosylation and vesicular transport. Immunity, ADP Ribosylation and Vesicular Transport,” SSRN Electronic Journal, 2020.
View at: Google Scholar
J. A. Browne, M. NandyMazumdar, A. Paranjapye, S. H. Leir, and A. Harris, “The Bromodomain Containing 8 (BRD8) Transcriptional Network in Human Lung Epithelial Cells,” Molecular and Cellular Endocrinology, vol. 524, article 111169, 2021.
View at: Publisher Site | Google Scholar
B. Wanschers, R. van de Vorstenbosch, M. Wijers, B. Wieringa, S. M. King, and J. Fransen, “Rab6 family proteins interact with the dynein light chain protein DYNLRB1,” Cell Motility and the Cytoskeleton, vol. 65, no. 3, pp. 183–196, 2008.
View at: Publisher Site | Google Scholar
M. Alhariri, M. A. Majrashi, A. H. Bahkali et al., “Efficacy of neutral and negatively charged liposome-loaded gentamicin on planktonic bacteria and biofilm communities,” International Journal of Nanomedicine, vol. Volume 12, pp. 6949–6961, 2017.
View at: Publisher Site | Google Scholar
L. M. Al-Dalawi, Effect of lipids on the infectivity of influenza A viruses, University of Nottingham, 2019.
E. Lebredonchel, M. Houdou, H. H. Hoffmann et al., “Investigating the functional link between TMEM165 and SPCA1,” Biochemical Journal, vol. 476, no. 21, pp. 3281–3293, 2019.
View at: Publisher Site | Google Scholar
C. L. Vázquez, A. Rodgers, S. Herbst et al., “The proneurotrophin receptor sortilin is required for _Mycobacterium tuberculosis_ control by macrophages,” Scientific Reports, vol. 6, no. 1, article 29332, 2016.
View at: Publisher Site | Google Scholar
Z. Zeng, H. B. Huang, L. L. Huang et al., “Regulation network and expression profiles of Epstein-Barr virus-encoded microRNAs and their potential target host genes in nasopharyngeal carcinomas,” Science China Life sciences, vol. 57, no. 3, pp. 315–326, 2014.
View at: Publisher Site | Google Scholar
J. Ma, C. Chen, A. S. Barth, C. Cheadle, X. Guan, and L. Gao, “Lysosome and cytoskeleton pathways are robustly enriched in the blood of septic patients: a meta-analysis of transcriptomic data,” Mediators of Inflammation, vol. 2015, Article ID 984825, 15 pages, 2015.
View at: Publisher Site | Google Scholar
A. M. Alshabi, I. A. Shaikh, B. M. Vastrad, and C. M. Vastrad, “Identification of differentially expressed genes and enriched pathways in SARS-CoV-2/COVID-19 using bioinformatics analysis,” Research Square, 2020.
View at: Publisher Site | Google Scholar
S. Di Giorgio, F. Martignano, M. G. Torcia, G. Mattiuz, and S. G. Conticello, “Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2,” Science Advances, vol. 6, no. 25, article eabb5813, 2020.
View at: Publisher Site | Google Scholar
J. Sun, F. Ye, A. Wu et al., “Comparative transcriptome analysis reveals the intensive early stage responses of host cells to SARS-CoV-2 infection,” Frontiers in Microbiology, vol. 11, p. 2881, 2020.
View at: Publisher Site | Google Scholar
L. Hua, S. Yao, D. Pham et al., “Cytokine-dependent induction of CD4+ T cells with cytotoxic potential during influenza virus infection,” Journal of Virology, vol. 87, no. 21, pp. 11884–11893, 2013.
View at: Publisher Site | Google Scholar
B. J. Meckiff, C. Ramírez-Suástegui, V. Fajardo et al., “Single-cell transcriptomic analysis of SARS-CoV-2 reactive CD4⁺ T cells,” SSRN Electronic Journal, 2020.
View at: Publisher Site | Google Scholar
S. Miyamoto, A. N. Murphy, and J. H. Brown, “Akt mediates mitochondrial protection in cardiomyocytes through phosphorylation of mitochondrial hexokinase-II,” Cell Death & Differentiation, vol. 15, no. 3, pp. 521–529, 2008.
View at: Publisher Site | Google Scholar
I. Ortea and J.-O. Bock, “Re-analysis of SARS-CoV-2 infected host cell proteomics time-course data by impact pathway analysis and network analysis. A potential link with inflammatory response,” BioRxiv, 2020.
View at: Publisher Site | Google Scholar
J. Cinatl Jr., G. Hoever, B. Morgenstern et al., “Infection of cultured intestinal epithelial cells with severe acute respiratory syndrome coronavirus,” Cellular and Molecular Life Sciences CMLS, vol. 61, no. 16, pp. 2100–2112, 2004.
View at: Publisher Site | Google Scholar
Y. Xia, N. Liu, X. Xie et al., “The macrophage-specific V-ATPase subunit ATP6V0D2 restricts inflammasome activation and bacterial infection by facilitating autophagosome-lysosome fusion,” Autophagy, vol. 15, no. 6, pp. 960–975, 2019.
View at: Publisher Site | Google Scholar
D. Cornblath, “DS3. 1 neuromuscular manifestations of HIV infection,” Clinical Neurophysiology, vol. 117, pp. 21–21, 2006.
View at: Publisher Site | Google Scholar
I. E. Galani and E. Andreakos, “Neutrophils in viral infections: current concepts and caveats,” Journal of Leukocyte Biology, vol. 98, no. 4, pp. 557–564, 2015.
View at: Publisher Site | Google Scholar
T. E. Sweeney, H. R. Wong, and P. Khatri, “Robust classification of bacterial and viral infections via integrated host gene expression diagnostics,” Science Translational Medicine, vol. 8, no. 346, p. 346ra91, 2016.
View at: Publisher Site | Google Scholar
P. Ranjan, N. Singh, A. Kumar et al., “NLRC5 interacts with RIG-I to induce a robust antiviral response against influenza virus infection,” European Journal of Immunology, vol. 45, no. 3, pp. 758–772, 2015.
View at: Publisher Site | Google Scholar
A. C. Yang, F. Kern, P. M. Losada et al., “Broad transcriptional dysregulation of brain and choroid plexus cell types with COVID-19,” BioRxiv, 2020.
View at: Publisher Site | Google Scholar
A. M. Filip, J. Klug, S. Cayli et al., “Ribosomal protein S19 interacts with macrophage migration inhibitory factor and attenuates its pro-inflammatory function,” The Journal of Biological Chemistry, vol. 284, no. 12, pp. 7977–7985, 2009.
View at: Publisher Site | Google Scholar
J. Li, E. Abosmaha, C. S. Coffin, P. Labonté, and T. N. Bukong, “Reticulon-3 modulates the incorporation of replication competent hepatitis C virus molecules for release inside infectious exosomes,” PLoS One, vol. 15, no. 9, article e0239153, 2020.
View at: Publisher Site | Google Scholar
M. Hassanpour, J. Rezaie, M. Nouri, and Y. Panahi, “The role of extracellular vesicles in COVID-19 virus infection,” Infection, Genetics and Evolution, vol. 85, p. 104422, 2020.
View at: Publisher Site | Google Scholar

Copyright

Copyright © 2021 Lei Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

1343

Downloads

1325

Citations