Introduction

The world has been suffering from the COVID-19 pandemic since its outbreak in December 20191,2,3. COVID-19 is highly contagious, and infected patients can be asymptomatic yet infectious4. As of July 11, 2020, there had been over 12 million confirmed COVID-19 cases and 556,335 deaths worldwide5. Community transmission has been increasingly reported in more than 180 countries5. Before any effective and safe COVID-19 vaccine becomes available in clinical settings, improving the efficiency of current clinical pathways and the capacity of patient management is crucial to successfully combat the COVID-19 pandemic and possible future resurgences6,7. Case identification is an important first step for subsequent clinical triage and treatment optimization. The reference detection method is the real-time reverse transcription PCR (RT-PCR) assay, which detects viral RNA1. Several limitations of this assay may curb its prompt large-scale application8,9,10.

Chest computed tomography (CT) can effectively capture the manifestations of COVID-19 infections, including asymptomatic infections10,11,12. Deep learning, an artificial intelligence (AI) technology, has achieved impressive performance in the analysis of CT images13,14,15,16. Chest CT with the aid of deep learning offers promise for reducing the burden of prompt mass case detection, especially given the shortage of RT-PCR17. We developed an automated, robust deep learning model, COVIDNet, that directly analyzes 3D CT images to assist in the screening and diagnosis of COVID-19 infected patients. Furthermore, as of March 23, 2020, the COVIDNet system had been deployed in six hospitals in China, with diagnoses confirmed by PCR. We provide clinical insights into the image features extracted by COVIDNet and propose a practical scenario for how the developed tool might improve clinical efficiency.

Results

Two independent cohorts totaling 2,800 patients were retrospectively recruited for model development and the secondary test (Fig. 1). After exclusions (Fig. 1), the model development cohort enrolled 920 COVID-19 patients and 1,073 non-COVID-19 patients, and all patients in this cohort were randomly divided at the patient level into three non-overlapping sets, training, validation, and initial test, at an approximate 3:1:1 ratio (Fig. 1, Supplementary Tables 1–4). The secondary test cohort consisted of 233 COVID-19 patients and 289 non-COVID-19 pneumonia patients (Fig. 1, Supplementary Tables 5–7). In both cohorts, the training and validation datasets included the images of all scans of each patient to train and fine-tune the COVIDNet system, whereas the initial and secondary test datasets used only the first CT scan of each patient to evaluate the model performance at first diagnosis.
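
As a concrete illustration of this patient-level partition, the sketch below randomly splits a list of patient identifiers at an approximate 3:1:1 ratio so that all scans of a patient stay in a single set. It is a minimal example with illustrative names, not the code used in this study.

```python
# Minimal sketch of a patient-level 3:1:1 random split; illustrative only.
import numpy as np

def split_patients(patient_ids, ratios=(3, 1, 1), seed=0):
    ids = np.array(sorted(set(patient_ids)))
    np.random.RandomState(seed).shuffle(ids)
    bounds = (np.cumsum(ratios) / np.sum(ratios) * len(ids)).astype(int)
    return ids[:bounds[0]], ids[bounds[0]:bounds[1]], ids[bounds[1]:]

train_ids, val_ids, test_ids = split_patients(range(1993))
```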

Figure 1

Flow diagram illustrating the division of the model development cohort into training, validation, and initial test datasets, and the secondary test performed on the independent secondary test cohort. The model development cohort consisted of 1,197 COVID-19 patients and 1,081 non-COVID-19 patients. The secondary test cohort enrolled 233 COVID-19 patients and 289 non-COVID-19 patients. A total of 285 patients were excluded because of an interval of ≥ 2 weeks between the first CT scan and the first positive nucleic acid test (244 patients) or severe artifacts (41 patients). The model development cohort was then divided into a training dataset (1,316 patients), a validation dataset (305 patients), and an initial test dataset (372 patients) to train and fine-tune COVIDNet. A secondary test was then performed to assess model generalizability by comparing the diagnostic performance of COVIDNet with that of 8 expert radiologists on the secondary test dataset. *COVID-19 radiologists: radiologists working at the COVID-19 designated hospitals; §non-COVID-19 radiologists: radiologists not working at the COVID-19 designated hospitals.

The initial test dataset included 372 patients, 41.4% (154/372) of whom were confirmed COVID-19 cases. Based on the first CT scan of each patient, COVIDNet yielded a remarkable diagnostic performance, with an accuracy of 96.0% and an AUC of 0.986. Additional performance measures are shown in Table 1 and Extended Data Fig. 1. Note that our development dataset might overlap with the dataset of Li et al.18 by 17 patients, including 14 patients in the training set and 3 patients in the initial test set. Retraining and retesting without these patients yielded similar results (Supplementary Table 8 and Extended Data Fig. 2).

Table 1 COVIDNet diagnostic performance on the initial test dataset.

In the secondary test, COVIDNet correctly diagnosed 492 of 522 patients, with an accuracy of 94.3% (CI 92.1%–96.1%) and an AUC of 0.981 (CI 0.969–0.990), significantly outperforming all eight radiologists (Table 2). Cohen’s κ coefficient19 was used to assess inter-rater agreement between COVIDNet and the three radiologist groups (all radiologists, the radiologists from the COVID-19 designated hospitals, and the other radiologists) (Supplementary Table 9). Median inter-rater agreement among the radiologists in each group was good, with κ = 0.658 [IQR 0.539–0.776], κ = 0.738 [IQR 0.680–0.797], and κ = 0.600 [IQR 0.502–0.699], respectively. The agreement between each group of radiologists and our model was also good (κ = 0.745, 0.746, and 0.678, respectively). The agreement between the COVID-19 radiologists and the non-COVID-19 radiologists was excellent (κ = 0.853).

Table 2 Diagnostic performance for each patient among COVIDNet and eight radiologists on the secondary test cohort.

The t-SNE representation of the chest CT features showed two clear clusters, color-coded by the class labels (Fig. 2). Most cases were located within their respective clusters, suggesting that COVIDNet successfully extracted distinct CT features of COVID-19 pneumonia. We selected three groups of representative cases (G1, G2, G3 in Fig. 2) and present their CT manifestations along with the predicted probability of COVID-19 in Supplementary Table 10. A typical manifestation of COVID-19 pneumonia is multiple ground-glass opacities (GGO) in the subpleural areas of both lungs. Radiologists confirmed a similar manifestation in the COVID-19 cluster (for example, the G1 red points). As for the misclassified COVID-19 cases (the G3 red points), three cases had no definite findings and were difficult to identify from the images alone; the other cases had extensive GGO with partial consolidation or combined with pleural effusion and interstitial edema, which are not typical manifestations of COVID-19. The misclassified non-COVID-19 cases (the G2 blue points) consisted of one bacterial pneumonia patient and two influenza B patients, who were classified as COVID-19 because of the appearance of extensive GGO.

Figure 2

t-SNE of CT image features of COVID-19 and other causes of pneumonia on the secondary test dataset. G1 depicts chest CT images of COVID-19 pneumonia with the highest differentiation. G2 marks three false-positive predictions of COVIDNet. G3 marks nine false-negative predictions of COVIDNet. The CT image manifestations of G1 to G3 are illustrated in Supplementary Table 10.

After the secondary test, we deployed the COVIDNet system in six hospitals to assist radiologists in screening suspected patients upon initial contact. The application pipeline of COVIDNet is illustrated in Fig. 3 and Extended Data Fig. 3. The pipeline improved the efficiency of the clinical pathway and offers generalizable insights for regions under considerable strain on nucleic acid test kits, with limited testing facilities, or facing epidemic community transmission. As of March 23, 2020, the COVIDNet system had been used to process 11,966 CT scans with PCR confirmation in the six hospitals, resulting in a sensitivity of 90.52% and a specificity of 88.50% (Fig. 4 and Supplementary Table 11).

Figure 3

The pipeline of COVIDNet application in the real world.

Figure 4

The performance of COVIDNet in the real-world application. (a) The confusion matrix of COVIDNet across the 6 hospitals; (b–g) the confusion matrices of COVIDNet in each hospital (b–g correspond to Hospitals 1–6, respectively).

Discussion

The COVID-19 pandemic continues to spread widely around the world. Until an effective vaccine becomes available for clinical use, we will remain in combat with SARS-CoV-2 for the foreseeable future. Accurate and prompt diagnosis of COVID-19 infection is essential for patient management. The criteria specified in the current COVID-19 clinical management guideline have faced several challenges20. As the primary diagnostic tool, the nucleic acid test has several disadvantages21. The early clinical manifestations of COVID-19, such as fever, cough, and dyspnea, are similar to those of non-COVID-19 viral pneumonia. Although the chest radiograph is the initial imaging screening tool in some countries, chest CT has become a vital part of the COVID-19 diagnostic pathway. COVID-19 infection with a main CT presentation of GGO can easily be confused with other viral pneumonia and fungal pneumonia, and COVID-19 infection with a main CT manifestation of consolidation may be confused with bacterial infection.

Our research showed that COVIDNet offers a powerful tool for screening patients suspected of COVID-19: it can distinguish COVID-19 from other pneumonia infections promptly and accurately. The secondary test demonstrated COVIDNet’s robustness in distinguishing COVID-19 from seven other types of pneumonia with confirmed pathogen evidence, across various CT devices, as well as its faster and more accurate performance compared with expert radiologists. Our results also showed that the radiologists from the COVID-19 designated hospitals performed better than those from the non-epidemic regions. The good to excellent inter-rater reliability among the radiologists, together with their overall poorer performance relative to COVIDNet, suggests that COVIDNet provided more unbiased results and captured clinically important features of COVID-19 infection that might not have been detected by the human experts, given that all COVID-19 cases were confirmed by nucleic acid testing.

One recent study developed a deep learning screening tool for COVID-1918. The authors extracted features from each axial CT slice of a patient independently and aggregated the stack of features just before making the classification decision. In contrast, our COVIDNet model directly extracts spatial features from the entire 3D CT scan using a truly three-dimensional deep learning model. We also demonstrated the generalizability of our model on a secondary test dataset through comparison with expert radiologists. Most importantly, COVIDNet had been deployed for clinical use in six hospitals in China, with PCR confirmation, as of March 23, 2020.

The lack of methods for visualizing how deep learning models work has been one of the major bottlenecks for their application in medical settings. To further investigate how COVIDNet made its classification decisions, we visualized the extracted features using t-Distributed Stochastic Neighbor Embedding (t-SNE)22. The results showed that COVIDNet indeed extracted image features that could separate COVID-19 from the other types of pneumonia. We reported image signatures from representative cases of correctly classified COVID-19, misclassified COVID-19, and misclassified non-COVID-19. Such image signatures could offer useful insights for clinical decisions. However, because lesion regions in CT images often have indistinct outlines, classifying COVID-19 and non-COVID-19 pneumonia through manual image labeling and segmentation, the traditional approach to illustrating differences between diseases, would be subjective.

When facing a COVID-19 outbreak, often accompanied by a severe shortage of medical personnel, prompt and accurate image review and interpretation may be a key limiting factor for appropriate clinical decision making. COVIDNet can rapidly detect clinically relevant lung lesions from hundreds of CT images. Together with probability sorting, COVIDNet may greatly improve screening and diagnosis efficiency. In addition, COVIDNet can automatically quantify the proportion of image abnormalities, supporting further clinical decisions. However, CT presentations of COVID-19 vary dramatically with disease stage, especially in patients with underlying diseases and complications, and other types of pneumonia may share image abnormalities with COVID-19. Deep learning technology may not perform well under these circumstances. Therefore, a patient’s epidemiological and clinical information needs to be closely integrated for further clinical decision making in the diagnosis and treatment of COVID-19. Moreover, the scores produced by the model are not calibrated; although they can serve as a proxy for classification confidence, their interpretation rests on the experience accumulated by integrating the model into clinical practice. The slight overlap in hospitals between the model development cohort and the external validation cohort may also weaken the external validation. Above all, the virus is evolving in directions that we do not yet know23. Under these circumstances, COVIDNet could serve as an effective tool for routine screening in clinical settings where chest CT is prescribed. Its screening role may be limited in regions where chest radiography, rather than CT, is the primary investigation method.

In conclusion, we have developed an automated classification neural network model, COVIDNet, specifically designed to distinguish COVID-19 from seven other types of pneumonia with confirmed pathogens by analyzing patients’ 3D chest CT scans. In principle, the model can be deployed anywhere in the world with CT imaging capability at a low cost and provide radiological decision support where COVID-19 imaging diagnosis expertise is scarce, especially when facing COVID-19 outbreaks. Our results warrant further validation in future studies.

Methods

Datasets

We retrospectively recruited two cohorts for model development and the secondary test, with a total of 2,800 patients (1,430 COVID-19 patients and 1,370 non-COVID-19 patients). Only non-contrast scans were included in this study. The model development dataset consisted of CT scans from 2,278 pneumonia patients who suffered from either COVID-19 or other types of pneumonia. We collected 1,197 COVID-19 cases between January 5, 2020 and March 1, 2020 from ten designated COVID-19 hospitals in China (Supplementary Table 1). These COVID-19 cases were confirmed by positive results from RT-PCR assays of nasal or pharyngeal swab specimens. We also randomly selected 1,081 non-COVID-19 patients with chest CT abnormalities, according to the criteria listed in Supplementary Table 2, from patients hospitalized between November 18, 2018 and February 21, 2020 in three other general hospitals in China (Supplementary Table 4).

After two senior radiologists, each with 30 years of work experience, screened all images, we excluded 285 patients under the following two circumstances (Fig. 1): 244 COVID-19 patients for whom the time between the first CT scan and the first positive nucleic acid test exceeded two weeks; and 41 patients with large breathing or body motion artifacts, including 33 COVID-19 patients and 8 non-COVID-19 patients.

For the secondary test cohort, we collected 233 COVID-19 cases between March 2, 2020 and March 13, 2020 from four COVID-19 designated hospitals. We also randomly selected 289 non-COVID-19 patients hospitalized between February 22, 2020 and March 1, 2020 in two general hospitals in China. Two of the COVID-19 hospitals and one non-COVID-19 hospital were also included in the model development cohort. The inclusion criteria for COVID-19 and non-COVID-19 patients were the same as those described above.

All CT scans of the two cohorts were performed upon the first contact, with patients in the supine position at full inspiration, and covered the whole chest.

Study ethics

This study is compliant with the “Guidance of the Ministry of Science and Technology (MOST) for the Review and Approval of Human Genetic Resources”. All CT image data were obtained from Chinese PLA General Hospital, Suizhou Central Hospital, Wuhan Third Hospital, Wenzhou Central Hospital, Xiantao First People’s Hospital affiliated to Yangtze University, The First People's Hospital of Jiangxia District, Wuhan Jinyintan Hospital, Affiliated Hospital of Putian University, Chengdu Public Health Clinical Medical Center, Wuhan Huangpi People's Hospital, Dazhou Central Hospital, Beijing Daxing District People's Hospital, Shaoxing People's Hospital, The People’s Hospital of Zigui, Anshan Central Hospital, Guizhou Provincial People’s Hospital, and the 5th Medical Center of Chinese PLA General Hospital. This study was approved by the Ethics Committee (EC) of each of these hospitals, and written informed consent was waived by the ECs of all the hospitals because the data under evaluation had been de-identified and the research posed no potential risk to patients. The study follows the Declaration of Helsinki. The trial registration number is ChiCTR2000030390 in the Chinese Clinical Trial Registry, http://www.chictr.org.cn/showproj.aspx?proj=50224.

CT image collection and preprocessing

CT images were obtained from different scanners at multiple imaging centers. The detailed CT scan settings and device distribution are listed in Supplementary Table 12 and Extended Data Fig. 4. COVIDNet is a classification neural network that classifies 3D CT images as either the COVID-19 pneumonia class or the non-COVID-19 pneumonia class. CT scans from different sources had various numbers of slices and slice dimensions. For unification, each 3D CT scan volume was preprocessed as follows. We first removed extreme voxel intensities by clipping values outside the range [−1024, 1024] to the interval edges and linearly scaled the clipped intensities to [0, 1]. We then resized each 3D scan to a stack of 64 square axial images of dimension 512 × 512 through linear interpolation and cropped the central 384 × 384 square region of each 2D axial slice, yielding a stack of 64 axial images of size 384 × 384 as the input to the model. Preprocessed image examples are included in Extended Data Fig. 5.
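
A minimal sketch of this preprocessing, assuming the scan is already loaded as a NumPy array of raw CT intensities, is shown below; function and variable names are illustrative and this is not the authors’ implementation.

```python
# Minimal sketch of the preprocessing described above, for illustration only:
# clip to [-1024, 1024], rescale to [0, 1], resample to 64 x 512 x 512 by
# linear interpolation, then centre-crop each axial slice to 384 x 384.
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume):
    """volume: 3D array of raw CT intensities, shaped (slices, height, width)."""
    vol = np.clip(volume.astype(np.float32), -1024, 1024)
    vol = (vol + 1024) / 2048.0                       # linear scaling to [0, 1]
    factors = (64 / vol.shape[0], 512 / vol.shape[1], 512 / vol.shape[2])
    vol = zoom(vol, factors, order=1)                 # order=1: linear interpolation
    top = (vol.shape[1] - 384) // 2
    left = (vol.shape[2] - 384) // 2
    return vol[:, top:top + 384, left:left + 384]     # model input: 64 x 384 x 384
```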

Architecture of COVIDNet

The structure of COVIDNet is illustrated in Extended Data Fig. 6; it is a modified DenseNet-264 model consisting of four dense blocks24. Each dense block has a different number of composition units. Each unit consists of two sequentially connected stacks, each comprising an instance normalization layer25, a ReLU activation layer, and a convolution layer, and receives the feature maps of all preceding units in the same dense block through dense connections. The training batch size was 8. We adopted the Adam optimizer26 with a learning rate of 0.001 to minimize the binary cross-entropy loss. The model was developed using TensorFlow (version 1.8 with CUDA V9.1.85 and cuDNN 7.0.5) on 16 Tesla P100 GPUs.
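
The dense connectivity within a block can be sketched as follows. This re-expression in PyTorch is illustrative only (the authors implemented COVIDNet in TensorFlow 1.8), and the 1 × 1 × 1 bottleneck width in the first stack follows the standard DenseNet recipe rather than a detail stated here.

```python
# Illustrative PyTorch sketch of one composition unit of a 3D dense block:
# two stacks of (InstanceNorm -> ReLU -> Conv3d), fed by the concatenation of
# all preceding feature maps in the block. The bottleneck width is an
# assumption; the authors built COVIDNet in TensorFlow 1.8.
import torch
import torch.nn as nn

class DenseUnit3D(nn.Module):
    def __init__(self, in_channels, growth_rate=32, bottleneck=4):
        super().__init__()
        inter = bottleneck * growth_rate
        self.stacks = nn.Sequential(
            nn.InstanceNorm3d(in_channels), nn.ReLU(inplace=True),
            nn.Conv3d(in_channels, inter, kernel_size=1, bias=False),
            nn.InstanceNorm3d(inter), nn.ReLU(inplace=True),
            nn.Conv3d(inter, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, preceding_features):
        # Dense connection: concatenate the block input and the outputs of all
        # preceding units along the channel axis.
        x = torch.cat(preceding_features, dim=1)
        return self.stacks(x)
```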

Model evaluation

The performance of the model was evaluated using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. The receiver operating characteristic (ROC) curve and confusion matrix were generated from the classification results, and the area under the ROC curve (AUC) was calculated. Bootstrap with 10,000 replications was used to calculate the 95% confidence interval of each metric. Evaluation results were obtained and visualized using Python libraries, including NumPy (v.1.16.4), pandas (v.0.25.3), scikit-learn (v.0.19.2), and Matplotlib (v.2.1.2).
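
For instance, the 95% confidence interval of the AUC can be obtained by resampling patients with replacement; the sketch below follows the bootstrap procedure described above but uses illustrative names rather than the authors’ evaluation code.

```python
# Minimal sketch of a percentile bootstrap 95% CI for the AUC with 10,000
# replications; illustrative only.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=10000, alpha=0.05, seed=0):
    rng = np.random.RandomState(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.randint(0, len(y_true), len(y_true))  # resample patients
        if np.unique(y_true[idx]).size < 2:             # skip one-class resamples
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lower, upper)
```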

Furthermore, the performance of COVIDNet was compared with that of eight independent expert radiologists with 6–23 years of experience on the diagnosis of COVID-19, using the secondary test dataset. Four radiologists were from the COVID-19 designated hospitals and the other four were not. To ensure that the radiologists could concentrate on the trial, each of them read CT images for no more than two hours per day under the surveillance of one research assistant. Before the radiologists started reading the CT images, the research assistant informed each radiologist of the CT signs in the guidelines to eliminate knowledge bias. All radiologists were blinded to the true pneumonia class. We also used Cohen’s κ coefficient to evaluate the inter-rater agreement among COVIDNet and the eight radiologists (Supplementary Table 9)19. We categorized κ coefficients as follows: poor (0 < κ ≤ 0.20), fair (0.20 < κ ≤ 0.40), moderate (0.40 < κ ≤ 0.60), good (0.60 < κ ≤ 0.80), and excellent (0.80 < κ ≤ 1.00).
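
The pairwise agreement analysis can be reproduced with scikit-learn’s implementation of Cohen’s κ; the short sketch below applies the categorization above to two raters’ binary calls, with illustrative names only.

```python
# Minimal sketch of pairwise inter-rater agreement using Cohen's kappa and the
# categorisation defined above; illustrative only.
from sklearn.metrics import cohen_kappa_score

def rate_agreement(calls_a, calls_b):
    """calls_a, calls_b: binary COVID-19 / non-COVID-19 calls from two raters."""
    kappa = cohen_kappa_score(calls_a, calls_b)
    for upper, label in [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "good"), (1.00, "excellent")]:
        if kappa <= upper:
            return kappa, label
    return kappa, "excellent"
```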

To further understand the model’s classification decisions, we visualized the distribution of the model’s extracted features using t-Distributed Stochastic Neighbor Embedding (t-SNE)22, an unsupervised non-linear dimension reduction algorithm commonly used to visualize high-dimensional data. It projects the high-dimensional feature maps taken just before the final fully connected layer of COVIDNet onto a two-dimensional space, converting similarities between pairs of original data points into similarities between the corresponding pairs of projected points. Because it preserves local structure, the projection can reveal meaningful clusters in the data.
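
A minimal sketch of this visualization, assuming the penultimate-layer features have already been extracted into an array, is given below; the perplexity and other parameter values are illustrative defaults rather than the authors’ settings.

```python
# Minimal sketch of the t-SNE projection of penultimate-layer features;
# `features` is an (n_patients, n_features) array, `labels` the 0/1 class of
# each patient. Parameter values are illustrative, not the authors' settings.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="coolwarm", s=10)
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.show()
```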