Introduction
Coronavirus disease 2019 (COVID-19) has spread worldwide since December 2019 [1], [2]. It is highly contagious, and severe cases can lead to acute respiratory distress or multiple organ failure [3]. On 11 March 2020, the WHO made the assessment that COVID-19 can be characterised as a pandemic. As of 8 April 2020, a total of 1,391,890 cases of COVID-19 had been recorded, and the death toll had reached 81,478, with a rapid increase of cases in Europe and North America.
The disease can be confirmed by using the reverse-transcription polymerase chain reaction (RT-PCR) test [4]. While it is the gold standard for diagnosis, confirming COVID-19 patients using RT-PCR is time-consuming, and its high false-negative rate (i.e., low sensitivity) may hinder presumptive patients from being identified and treated early [3], [5], [6].
As a non-invasive imaging technique, computed tomography (CT) can detect the characteristics manifested in COVID-19 infected lungs, e.g., bilateral patchy shadows or ground-glass opacity (GGO) [7], [8]. Hence CT may serve as an important tool for COVID-19 patients to be screened and diagnosed early. Despite these advantages, COVID-19 and other types of pneumonia can share common imaging characteristics on CT, making automated distinction difficult.
Recently, deep learning based artificial intelligence (AI) technology has demonstrated tremendous success in the field of medical data analysis due to its capacity to extract rich features from multimodal clinical datasets [9]. Previously, deep learning was developed for diagnosing and distinguishing bacterial and viral pneumonia from thoracic imaging data [10]. In addition, attempts have been made to detect various chest CT imaging features [11]. In the current COVID-19 pandemic, deep learning based methods have been developed for efficient chest CT data analysis and classification [2], [3], [12]. Besides, deep learning algorithms have been proposed for COVID-19 monitoring [13], screening [14] and prediction of the length of hospital stay [15]. A full list of current AI applications for COVID-19 related research can be found elsewhere [16]. In this study, we focus on chest CT image based localisation of infected areas and on classification and diagnosis of COVID-19 patients.
Although initial studies have demonstrated promising results by using chest CT for the diagnosis of COVID-19 and detection of the infected regions, most existing methods are based on the commonly used supervised learning scheme. This requires a considerable amount of manual labelling of the data; however, in such an outbreak situation clinicians have very limited time to perform tedious manual drawing, which may prevent the implementation of such supervised deep learning methods. In this study, we propose a weakly supervised deep learning framework to detect COVID-19 infected regions fully automatically using chest CT data acquired from multiple centres and multiple scanners. Based on the detection results, we can also achieve diagnosis of COVID-19 patients. In addition, we test the hypothesis that, based on CT radiological features, the deep neural networks we developed can classify COVID-19 cases against community acquired pneumonia (CAP) and non-pneumonia (NP) scans.
Materials and Methods
A. Patients and Data
This retrospective study was approved by the institutional review board of the participating hospitals in accordance with local ethics procedures. Informed consent was waived with approval. This study included 150 3D volumetric chest CT exams each of COVID-19, CAP and NP patients. In total, 450 patient scans acquired from two participating hospitals between September 2016 and March 2020 were included for further analysis.
All the COVID-19 patients were confirmed as positive by RT-PCR testing and were scanned between December 2019 and March 2020. According to the diagnosis and treatment program of COVID-19 (Trial Version 6) issued by the National Health Commission of China [17], the clinical classification of COVID-19 patients can be categorised as mild, moderate, severe, and critical. All our COVID-19 patients were at the severe or critical stage, and all the CT scans had been performed within 3 days of hospitalisation.
CAP and other NP (no lung disease, lung nodules, chronic inflammation, chronic obstructive pulmonary disease) patients were randomly chosen from the participating hospitals between September 2016 and January 2020. The inclusion criteria for CAP patients are in accordance with the guidelines on the management of community-acquired pneumonia in adults published by the Infectious Diseases Society of America/American Thoracic Society [18]. CAP diagnosis is based on the presence of identified clinical characteristics (e.g., cough, fever, sputum production, and pleuritic chest pain) and is supported by pulmonary imaging, typically chest X-ray and, in our case, CT. In the routine examination of patients suspected of having CAP, a chest radiograph is needed to establish the diagnosis and to better distinguish CAP from other specific causes of cough and fever, such as acute bronchitis. Although various CT manifestations might be observed due to different pathogens, all our CAP patients were laboratory confirmed, either as bacterial culture positive cases or as culture negative cases, e.g., with mycoplasma or viral pneumonia. Our assumption is that the proposed weakly supervised deep learning method can sense subtle discrepancies in CT images acquired from CAP and COVID-19 patients. NP patients were diagnosed with either no lung disease or non-pneumonia lung disease, e.g., lung nodules, chronic inflammation, chronic obstructive pulmonary disease and others. It is of note that the criterion for a normal CT in this context is that the CT examination showed no obvious lesions in either lung.
Demographic statistics of the patients are reported in Table 1. One-way ANOVA (analysis of variance) was conducted on the gender and age distributions across the three patient groups.
COVID-19 patients were admitted from two hospitals in China, including 138 patients from the Hospital of Wuhan Red Cross Society (WHRCH) and 12 patients from Shenzhen Second Hospital (SZSH). Both CAP and NP patients were recruited from SZSH. COVID-19 scans were obtained on either a SIEMENS SOMATOM go.Now16 (WHRCH) or a GE Revolution 256 (SZSH) CT system. For the SIEMENS SOMATOM go.Now16 CT system, the scanning parameters were as follows: tube voltage = 130 kVp, automatic tube current modulation = 50 mAs, pitch = 1.5 mm, matrix
B. Dataset for Lung Segmentation
In order to achieve a highly accurate lung segmentation that can facilitate the subsequent infection detection and classification, we utilised an open dataset (TCIA dataset) [19] to train a deep neural network for lung delineation. The data can be accessed from the Cancer Imaging Archive (TCIA) Public Access.1 In total, 60 3D CT lung scans with manual delineations of the lung anatomy were retrieved. These open data were made publicly accessible from scans obtained by three different institutions: MD Anderson Cancer Centre, Memorial Sloan-Kettering Cancer Centre, and the MAASTRO clinic, with 20 cases from each institution. All the data were scanned with matrix
C. Pre- and Post-Processing for Lung Segmentation
Data pre-processing steps were performed to standardise data acquired from multiple centres and multiple scanners. Instead of normalising input slices into a pre-defined Hounsfield unit (HU) window, we designed a more flexible scheme based on previously proposed image enhancement methods [20], [21]. Rather than clipping based on HU windows, we proposed to use a fixed-sized sliding window
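For context, the following is a minimal sketch of the conventional pre-defined HU-window normalisation that the proposed scheme replaces; the window bounds and output range here are common illustrative values, not parameters taken from this study, and the exact configuration of the fixed-sized sliding-window variant is not reproduced here.

```python
import numpy as np

def hu_window_normalise(ct_slice, hu_min=-1000.0, hu_max=400.0):
    """Clip a CT slice to a pre-defined HU window and rescale to [0, 1].

    Note: the window bounds are illustrative defaults, not the values
    used by the proposed sliding-window scheme.
    """
    clipped = np.clip(ct_slice.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)
```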
D. Detection and Classification Network
Inspired by the VGG architecture [24], we adopted a configuration that increases CNN depth using small convolution filters stacked with non-linearities injected in between, as depicted in Figure 1. All convolution layers consisted of
Network architecture of our proposed weakly supervised multi-scale learning framework for COVID-19/NP/CAP classification and lesions detection.
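A minimal sketch of a VGG-style backbone with stacked 3x3 convolutions and per-level classification heads is given below, assuming tf.keras; the channel widths, input size, global-average-pooling heads and logit averaging are illustrative assumptions rather than the exact configuration in Figure 1.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters, n_convs, name):
    # Stack small 3x3 convolutions with ReLU, then downsample.
    for i in range(n_convs):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu',
                          name=f'{name}_conv{i + 1}')(x)
    return layers.MaxPooling2D(2, name=f'{name}_pool')(x)

def build_backbone(input_shape=(256, 256, 1), n_classes=3):
    inputs = layers.Input(shape=input_shape)
    x = conv_block(inputs, 64, 2, 'conv1')
    x = conv_block(x, 128, 2, 'conv2')
    c3 = conv_block(x, 256, 3, 'conv3')   # mid-level features
    c4 = conv_block(c3, 512, 3, 'conv4')  # higher-level features
    c5 = conv_block(c4, 512, 3, 'conv5')  # highest-level features
    # One auxiliary classification head per feature level (multi-scale),
    # aggregated here by simple averaging of the logits (an assumption).
    logits = [layers.Dense(n_classes, name=f'head_{n}')(
                  layers.GlobalAveragePooling2D()(f))
              for n, f in zip(['c3', 'c4', 'c5'], [c3, c4, c5])]
    joint = layers.Average(name='joint_logits')(logits)
    return Model(inputs, [joint] + logits)
```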
E. Multi-Scale Learning
From previous findings using CT [25]–[27], it is known that COVID-19 infections share similar radiographic features with CAP, such as GGO and airspace consolidation. The lesions are frequently distributed bilaterally and peripherally with lower-zone predominance, and the infected areas can vary significantly in size depending on the condition of the patient. For example, in mild cases the lesions tend to be small, whereas in severe cases they are scattered over a large area. Therefore, we proposed a multi-scale learning scheme to cope with variations in the size and location of the lesions. To implement this, we fed the intermediate CNN representations, i.e., the feature maps at Conv3, Conv4 and Conv5, respectively, into the weakly supervised classification layers, in which the classification loss is computed as
\begin{equation*} L = -\frac{1}{N}\sum_{i=1}^{N} w_{i} f_{i}\left(S_{c}(x_{i}) - \log\sum_{k=1}^{K} e^{S_{k}(x_{i})}\right). \tag{1}\end{equation*}
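A minimal NumPy sketch of the weighted log-softmax loss in Eq. (1) is given below; the per-sample terms w and f are taken as written in the formula, and their precise definitions are assumptions not reproduced from the text.

```python
import numpy as np

def weighted_logsoftmax_loss(scores, labels, w, f):
    """Eq. (1): L = -(1/N) * sum_i w_i f_i (S_c(x_i) - log sum_k exp S_k(x_i)).

    scores : (N, K) class scores S_k(x_i) for each sample
    labels : (N,)  index c of the target class per sample
    w, f   : (N,)  per-sample weighting terms as written in Eq. (1)
             (their exact definitions are assumed, not taken from the text)
    """
    n = scores.shape[0]
    # Log-sum-exp over classes, computed in a numerically stable way.
    m = scores.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).ravel()
    s_c = scores[np.arange(n), labels]
    return -np.mean(w * f * (s_c - lse))
```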
F. Weakly Supervised Lesions Localisation
After determining the class score maps and the image category in a forward pass through the network, the discriminative patterns corresponding to that category can then be localised in the image. A coarse localisation could already be achieved by directly relating each of the neurons in the class score maps to its receptive field in the original image. However, it is also possible to obtain pixel-wise maps containing information about the location of class-specific target structures at the resolution of the original input images. This can be achieved by calculating how much each pixel influences the activation of the neurons in the target score map. Such maps can be used to obtain a much more accurate localisation, like the examples shown in Figure 2.
Examples of saliency maps for COVID-19 lesions localisation: (a) shows an example input image, (b) shows the saliency map obtained at Conv3, (c) shows the saliency map obtained at Conv4, (d) shows the saliency map obtained at Conv5, (e) shows the overlay of the joint saliency map (pixel-wise multiplication of the Conv3, Conv4 and Conv5 saliency maps) with the input image, and (f) shows the resulting bounding boxes.
In the following, we show how category-specific saliency maps can be obtained through integrated gradients. We also show how to post-process the saliency maps to extract bounding boxes around the detected lesions.
1) Category-Specific Saliency
Generally, suppose we have a flattened input image denoted as $x$, a baseline input $x'$, and let $S$ denote the class score function of the network. The integrated gradient along the $i$-th input dimension is defined as
\begin{equation*} \phi_{i}(S(x),x,x') = (x_{i}-x'_{i}) \times \int_{\alpha=0}^{1}\frac{\partial S(x'+\alpha(x-x'))}{\partial x_{i}}\,d\alpha. \tag{2}\end{equation*}
In practice, the integral is approximated by an $m$-step Riemann sum:
\begin{equation*} \phi_{i}(S(x),x,x') \approx (x_{i}-x'_{i}) \times \sum_{n=1}^{m}\frac{\partial S\left(x'+\frac{n}{m}(x-x')\right)}{\partial x_{i}} \times \frac{1}{m}. \tag{3}\end{equation*}
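A minimal sketch of the Riemann approximation in Eq. (3), assuming a user-supplied grad_fn that returns the gradient of the target class score with respect to the input:

```python
import numpy as np

def integrated_gradients(x, x_baseline, grad_fn, m=50):
    """Approximate Eq. (3) with an m-step Riemann sum.

    x, x_baseline : flattened input image and baseline (same shape)
    grad_fn       : callable returning dS/dx at a given input
                    (assumed to be provided by the trained network)
    """
    total = np.zeros_like(x, dtype=np.float64)
    for n in range(1, m + 1):
        point = x_baseline + (n / m) * (x - x_baseline)
        total += grad_fn(point)
    return (x - x_baseline) * total / m
```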
2) Bounding Box Extraction
Next, we post-processed the joint saliency map from which a bounding box can be extracted. Firstly, we took the absolute value of the joint saliency map and blurred it with a
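A minimal sketch of this post-processing pipeline (absolute value, Gaussian blur, thresholding, connected components, bounding boxes) is given below; the blur width, threshold quantile and minimum area are illustrative assumptions, since the exact values are not reproduced above.

```python
import numpy as np
from scipy import ndimage

def saliency_to_boxes(saliency, sigma=2.0, quantile=0.95, min_area=20):
    """Extract bounding boxes from a joint saliency map (illustrative parameters)."""
    blurred = ndimage.gaussian_filter(np.abs(saliency), sigma=sigma)
    mask = blurred > np.quantile(blurred, quantile)  # keep the most salient pixels
    labelled, _ = ndimage.label(mask)                # connected components
    boxes = []
    for sl in ndimage.find_objects(labelled):
        rows, cols = sl
        if (rows.stop - rows.start) * (cols.stop - cols.start) >= min_area:
            boxes.append((rows.start, cols.start, rows.stop, cols.stop))
    return boxes
```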
G. Implementation Details
1) Experiments Setup
We trained the proposed model for both a three-way classification (i.e.,
2) Training Configurations
We implemented the proposed model (as depicted in Figure 1) using TensorFlow 1.14.0. All models were trained from scratch on four Nvidia GeForce GTX 1080 Ti GPUs with an Adam optimiser (learning rate: $10^{-4}$,
3) Data Augmentation
We applied several random on-the-fly data augmentation strategies during training, including (1) cropping square patches at the centre of the input frames with a scaling factor randomly chosen between 0.7 and 1, and resizing the crops to the size of
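A minimal sketch of the on-the-fly augmentation described above, combining a random centre crop with scale in [0.7, 1] resized to a fixed output size and a random contrast adjustment (used later for contrast invariance); the output size of 256 and the contrast gain range are placeholder assumptions.

```python
import numpy as np
from scipy import ndimage

def augment(image, out_size=256, rng=np.random):
    """Random centre crop (scale 0.7-1), resize, and random contrast adjustment."""
    h, w = image.shape
    scale = rng.uniform(0.7, 1.0)
    ch, cw = int(h * scale), int(w * scale)
    top, left = (h - ch) // 2, (w - cw) // 2          # crop at the centre
    crop = image[top:top + ch, left:left + cw]
    crop = ndimage.zoom(crop, (out_size / ch, out_size / cw), order=1)
    gain = rng.uniform(0.8, 1.2)                       # illustrative contrast range
    return np.clip(crop.mean() + gain * (crop - crop.mean()), 0.0, 1.0)
```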
Dice scores of the lung segmentation using different pre-processing and post-processing methods on the TCIA dataset. Left Panel: without any pre-processing; Middle Panel: normalising using a pre-defined Hounsfield unit (HU) window; Right Panel: normalising using the proposed fixed-sized sliding window. W/O P: without multi-view learning based post-processing; W P: with multi-view learning based post-processing.
H. Evaluation Metrics
Using positive results of the RT-PCR testing as the ground truth labels for the COVID-19 group, together with the diagnosis results of the CAP and NP patients, the accuracy, precision, sensitivity and specificity [34], [35] of our classification framework were calculated. We also carried out analysis of the area under the receiver operating characteristic curve (AUC) to quantify our classification performance. For the lung segmentation, we used the Dice score [36] to evaluate accuracy.
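A minimal sketch of these evaluation metrics, assuming binary ground-truth and predicted labels for a one-vs.-rest comparison, a continuous score for the AUC, and binary masks for the Dice score:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def classification_metrics(y_true, y_pred, y_score):
    """Accuracy, precision, sensitivity, specificity and AUC for one class vs. rest."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        'accuracy':    (tp + tn) / (tp + tn + fp + fn),
        'precision':   tp / (tp + fp),
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        'auc':         roc_auc_score(y_true, y_score),
    }

def dice_score(seg, gt):
    """Dice overlap between a binary segmentation and its ground truth."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(seg, gt).sum() / (seg.sum() + gt.sum())
```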
Experiments and Results
A. Lung Segmentation
In order to evaluate the lung segmentation network, we randomly split the 60 TCIA data with ground truth into 40 training, 10 validation and 10 independent testing datasets. Ablation study results of different pre-processing and post-processing methods using Dice scores are shown in Figure 3.
B. Infection Detection
1) Class Activation Mapping
As a result of multi-scale learning, Figure 4 illustrates some examples of COVID-19 class activation maps (CAMs) obtained at the different feature levels, i.e., Conv3, Conv4 and Conv5. The CAMs depict the spatial distribution of the classification probability, on which the hot areas indicate where infected areas are; the hotter the area, the more likely it is infected. Of note from the multi-scale CAMs, our proposed model learns to capture the distributions of lesions at different scales: for instance, the large patchy-like lesions, such as the crazy paving sign and consolidation, and also small nodule-like lesions, such as ground-glass opacities (GGO) and bronchovascular thickening. Although the CAMs can indicate where the diseased regions are, they are still too coarse to localise and estimate the extent of lesions precisely. The saliency maps shown in Figure 5, on the other hand, can provide pixel-level information that delineates the exact extent of the lesions, and can therefore yield a precise localisation of the lesions. Notably from the saliency maps, the mid-level layer, i.e., Conv3, learns to detect small lesions (most frequently GGO), especially those distributed peripherally and subpleurally. However, Conv3 is not able to capture larger patchy-like lesions, which may be because of the limited receptive field at the mid-level layer. On the contrary, the higher-level layers, e.g., Conv4 and Conv5, have sufficiently large receptive fields to detect the diffuse and patchy-like lesions, such as the crazy paving sign and consolidation, which are often distributed centrally and peribronchially. However, Conv4 and Conv5 tend to overestimate the extent of small lesions. The multi-scale features complement each other and result in more precise localisation and estimation of the extent of the lesions, as shown by the joint saliency maps.
Multi-scale detection of COVID-19 lesions with varied size. Green box: small lesions. Yellow box: mix of small and large patchy or strip like lesions. Red box: large lesions.
2) Category-Specific Saliency
Figure 6 shows examples of the category-specific joint saliency computed by integrated gradients, with the original inputs on the left and the overlaid saliency on the right. The CAMs shown in Figure 4 only depict the spatial distribution of infection and cannot be used for precise localisation of the lesions. The saliency maps, on the other hand, provide pixel-level information that delineates the exact extent of the lesions, thus providing precise localisation.
The saliency maps can also be useful for diagnosis, as the percentage of infected lung area can be estimated from them automatically. These saliency maps highlight the pixels that contribute to increasing the category-specific scores: the brighter the pixels, the more significant the contribution. Intuitively, one can also interpret this as the brighter the pixels are, the more critical the features are for the network to make its decision (prediction). It is of note that in Figure 4 and Figure 6, there is not only inter-class contrast variation (because the data were collected from multiple institutions) but also intra-class contrast variation, especially in the COVID-19 group. In our experiments, we found that histogram matching can suppress lesions, especially on COVID-19 images; for instance, GGOs disappear or become less apparent. This also leads to inferior detection performance. Therefore, instead of directly applying histogram matching, we applied random on-the-fly contrast adjustment for data augmentation at training time. This turns out to be very effective: as demonstrated in Figure 6, our proposed model learns to be invariant to image contrast and precisely captures the lesions.
In particular, in Figure 8, we randomly selected typical example images to illustrate the variation of image contrast in COVID-19 cases and compared the saliency maps obtained from models trained with and without contrast augmentation (CA vs. NCA). We found that without contrast augmentation, the saliency maps tend to be noisy and poor in localisation, as mis-detection is often observed, with either only some instances of infection being captured or regions without infection being captured. With contrast augmentation, in contrast, the learned models generate more discriminative saliency maps, and the localisation of infected areas is robust and more accurate against contrast variation. As can be seen (enclosed by the green box), our model with contrast augmentation is capable of capturing all the diseased regions and highlighting their extent precisely, regardless of whether there are single or multiple instances of infection.
Bounding boxes extracted from saliency for COVID-19 and CAP examples. (Corresponding to the examples in Figure 6).
Effect of applying random contrast augmentation (in data augmentation). Contrast adjustment leads to better saliency quality (less noisy) and more precise and contrast-invariant detection of infected areas. Cyan arrows: false positives of the saliency maps; Pink arrows: false negatives of the saliency maps; NCA: No Contrast Adjustment; CA: with Contrast Adjustment.
In addition, from the COVID-19 and CAP saliency, we found that the CAP lesions are generally smaller and more locally constrained compared to COVID-19 cases, which often have multiple infected regions with massive and scattered lesions. It should also be noted that COVID-19 and CAP lesions do share similar radiographic features, such as GGO and airspace consolidation. Besides, GGOs also appear frequently in subpleural regions in CAP cases. Interestingly, from the saliency maps of the NP cases, we found that the network takes the pulmonary arteries as the salient feature. Finally, Figure 7 shows the bounding boxes extracted from COVID-19 and CAP saliency maps (corresponding to the examples in Figure 6). The results agree with our primary findings: CAP cases have fewer infected areas, often with a single instance of infection, whereas COVID-19 cases often have more infected areas (multiple instances of infection), and the COVID-19 lesions vary greatly in extent. Overall, CAP infection areas are smaller compared to those of COVID-19.
C. Classification Performance
The performance of our proposed model for each specific task was evaluated with 5-fold cross-validation, and the results on the test set are reported and summarised in Table 3. We used five evaluation metrics: accuracy (ACC), precision (PRC), sensitivity (SEN), specificity (SPE) and the area under the ROC curve (AUC). We report the mean of the 5-fold cross-validation results for each metric with the 95% confidence interval. We also compared our proposed method with a reimplementation of the Navigator-Teacher-Scrutinizer Network (NTS-NET) [37].
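A minimal sketch of how a mean and 95% confidence interval can be reported from the five fold-level metric values, assuming a normal approximation; the paper's exact CI procedure is not specified here.

```python
import numpy as np

def mean_with_ci(fold_values, z=1.96):
    """Mean and normal-approximation 95% CI from per-fold metric values."""
    v = np.asarray(fold_values, dtype=float)
    mean = v.mean()
    half_width = z * v.std(ddof=1) / np.sqrt(len(v))
    return mean, (mean - half_width, mean + half_width)

# e.g., mean_with_ci([0.95, 0.97, 0.96, 0.96, 0.97])
```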
As described earlier in the experimental settings, we have two groups of tasks: three-way classification tasks (indicated by *) and binary classification tasks (indicated by ≀), and two learning configurations: single-scale learning (indicated by †), which assigns an auxiliary classifier to a specific feature level, and multi-scale learning (indicated by ‡), which aggregates the multi-level prediction scores and trains a joint classifier. All the binary tasks listed were trained with multi-scale learning. In terms of three-way classification, we found that multi-scale learning with the joint classifier achieves superior overall performance compared with any of the single-scale learning configurations. It is of note that, among the single-scale learning tasks, classification with Conv4 and Conv5 features achieves very similar performance in every metric, which is significantly better than classification with mid-level (i.e., Conv3) features. One possible explanation is that the mid-level features are not sufficiently semantic compared with the higher-level features, i.e., Conv4 and Conv5. As we know, high-level CNN representations are semantically strong but poor at preserving spatial details, whereas mid- and lower-level CNN representations preserve local features well but lack semantic information.
Furthermore, it is of note that, overall, the binary classification tasks achieve significantly better performance than three-way classification, especially for tasks such as NP/COVID-19 and NP/CAP. The results suggest that our proposed model is reasonably good at distinguishing COVID-19 cases from NP cases, achieving a mean ACC of 96.2%, PRC of 97.3%, SEN of 94.5%, SPE of 95.3% and AUC of 0.970, respectively. One explanation is that binary classification is less complicated and carries less uncertainty than three-way classification. It may also be because COVID-19 and CAP image features are intrinsically discriminative compared with NP cases; for instance, as the COVID-19 cases demonstrated earlier, there is often a combination of various diseased patterns and large areas of infection on the scans.
Last but not least, we found that the performance of COVID-19/CAP classification is the lowest among all the binary classification tasks. One possible reason is that COVID-19 shares similar radiographic features with CAP, such as GGO and airspace consolidation, and the network capacity may not be sufficient to learn disease-specific representations. Nevertheless, the results obtained using our proposed method outperformed those obtained by the NTS-NET.
We also break down the overall performance of the joint classifier by class, and the classification metrics are reported for each class in Table 4 and Figure 9. We found that the models learned without contrast augmentation are biased, in that the classification performance for COVID-19 is significantly better than for the other two classes. This may be because the models learn to discriminate the classes based on image style (contrast) rather than content (normal or disease patterns), and the COVID-19 class in our data has the most discriminative contrast style (high variability in brightness) among the three classes. In comparison, learning with contrast augmentation results in superior overall classification performance (Table 3) and no class bias (Table 4). In addition, the COVID-19 and NP classes achieve comparable performance in each metric, and the NP class has higher sensitivity (91.3%) than COVID-19 (87.6%) and CAP (83.0%). Overall, COVID-19 remains the best performing and most discriminative class, with a mean AUC of 0.923, compared with CAP (0.864) and NP (0.901). It can also be noted that the overall results for the CAP class are moderately lower than those of NP and COVID-19. This could be correlated with our finding in the COVID-19/CAP classification that, because of their similar appearance, CAP cases are sometimes misclassified as COVID-19. Another possible reason is that the network could have been distracted by a few "NP noise" slices, as there might be a small number of non-infected slices among the CAP training samples. This is because we sampled all the available slices from each subject, and a few slices might show no infection.
Receiver operating characteristic (ROC) curves of the individual categories for three-way classification (5-fold cross-validated). (a) NP with AUC of 0.90 ± 0.03 (mean ± standard deviation); (b) CAP with AUC of 0.86 ± 0.03; (c) COVID-19 with AUC of 0.92 ± 0.02. The green region indicates the 95% CI. COVID-19: coronavirus disease 2019; CAP: community acquired pneumonia; NP: non-pneumonia; CI: confidence interval.
Discussion
In this work, we have presented a novel weakly supervised deep learning framework that is capable of learning to detect and localise lesions on COVID-19 and CAP CT scans from image-level labels only. Different from other works, we leverage representation learning at multiple feature levels and have explained what features can be learned at each level. For instance, the high-level representation, i.e., Conv5, captures the patch-like lesions that generally have a large extent, but tends to discard small local lesions. This is well complemented by the mid-level representations (Figure 4), i.e., Conv3 and Conv4, from which the detected lesions also correspond to our clinical findings that the infections are usually located in the peripheral lung (95%), mainly in the inferior lobe of the lungs (65%), and especially in the posterior segment (51%). We speculate that this is mainly because there are more well-developed bronchioles and alveoli, richer blood flow and more immune cells such as lymphatic cells in the periphery; these immune cells play a vital role in the inflammation caused by the virus. We have also demonstrated that combining the multi-scale saliency maps generated by integrated gradients is key to achieving a precise localisation of multi-instance lesions.
Furthermore, from a clinical perspective, the joint saliency is useful in that it provides a reasonable estimation of the percentage of infected lung area, which is a crucial factor that clinicians take into account when evaluating the severity of a COVID-19 patient. Besides, the classification performance of the proposed network has been studied extensively: we conducted not only three-way classification but also binary classification between each pair of classes.
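A minimal sketch of how such a percentage could be estimated from a binarised joint saliency map and a lung mask; the binarisation threshold is an illustrative assumption, not a value used in this study.

```python
import numpy as np

def infected_lung_percentage(saliency, lung_mask, threshold=0.5):
    """Fraction of the lung area flagged as infected by the joint saliency map."""
    infected = (np.abs(saliency) > threshold) & lung_mask.astype(bool)
    return 100.0 * infected.sum() / max(lung_mask.astype(bool).sum(), 1)
```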
We found that one limitation of the proposed network is that it is not discriminative enough when it comes to separating CAP from COVID-19. We suspect this is due to the limited capacity of the backbone CNN; a straightforward way of boosting CNN capacity is to increase the number of feature channels at each level. Another future direction would be to employ a more advanced backbone architecture, such as ResNet or Inception. Another limitation of this work is that we trained the networks on individual slices (images), using all available slices for each subject. However, for the CAP or COVID-19 subjects, there might be a small number of non-infected slices in between, which could introduce noise in training; this has been confirmed by scrutiny from our clinicians. In the future, we can address this limitation with attention-based multiple instance learning: instead of training on individual slices, we put the patient-specific slices into a bag and train on bags. The network will learn to assign weights to individual slices in a COVID-19 or CAP positive bag and automatically sample the highly weighted slices for infection detection. Further supervision via labelled non-infection slices may also boost the performance of our proposed model, but at the cost of a time-consuming manual labelling procedure.
Conclusion
In this study, we designed a weakly supervised deep learning framework for fast and fully automated detection and classification of COVID-19 infection using retrospectively collected CT images from multiple scanners and multiple centres. Our framework can accurately distinguish COVID-19 cases from CAP and NP patients. It can also pinpoint the exact positions of the lesions or inflammation caused by COVID-19, and therefore can potentially provide information on patient severity to guide subsequent triage and treatment. Experimental findings have indicated that the proposed model achieves high accuracy, precision and AUC for the classification, as well as promising qualitative visualisation for the lesion detection. Based on these findings, we can envisage a large-scale deployment of the developed framework.
ACKNOWLEDGMENT
(Shaoping Hu, Yuan Gao, and Zhangming Niu contributed equally to this work.)