
Weakly Supervised Deep Learning for COVID-19 Infection Detection and Classification From CT Images




Abstract:

An outbreak of a novel coronavirus disease (i.e., COVID-19) has been recorded in Wuhan, China since late December 2019, which subsequently became a pandemic around the world. Although COVID-19 is an acutely treated disease, it can also be fatal, with a case fatality rate of 4.03% in China and as high as 13.04% in Algeria and 12.67% in Italy (as of 8th April 2020). The onset of serious illness may result in death as a consequence of substantial alveolar damage and progressive respiratory failure. Although laboratory testing, e.g., using reverse transcription polymerase chain reaction (RT-PCR), is the gold standard for clinical diagnosis, the tests may produce false negatives. Moreover, under the pandemic situation, shortage of RT-PCR testing resources may also delay subsequent clinical decisions and treatment. Under such circumstances, chest CT imaging has become a valuable tool for both diagnosis and prognosis of COVID-19 patients. In this study, we propose a weakly supervised deep learning strategy for detecting and classifying COVID-19 infection from CT images. The proposed method can minimise the requirements of manual labelling of CT images but still be able to obtain accurate infection detection and distinguish COVID-19 from non-COVID-19 cases. Based on the promising results obtained qualitatively and quantitatively, we can envisage a wide deployment of our developed technique in large-scale clinical studies.
Topic: Emerging Deep Learning Theories and Methods for Biomedical Engineering
Network architecture of our proposed weakly supervised multi-scale learning framework for COVID-19/NP/CAP classification and lesions detection.
Published in: IEEE Access ( Volume: 8)
Page(s): 118869 - 118883
Date of Publication: 29 June 2020
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Coronavirus disease 2019 (COVID-19) has been widespread worldwide since December 2019 [1], [2]. It is highly contagious, and severe cases can lead to acute respiratory distress or multiple organ failure [3]. On 11 March 2020, the WHO made the assessment that COVID-19 could be characterised as a pandemic. As of 8th April 2020, a total of 1,391,890 cases of COVID-19 had been recorded, and the death toll had reached 81,478, with a rapid increase of cases in Europe and North America.

The disease can be confirmed by using the reverse-transcription polymerase chain reaction (RT-PCR) test [4]. While it is the gold standard for diagnosis, confirming COVID-19 patients using RT-PCR is time-consuming, and high false-negative rates and low sensitivities may hinder presumptive patients from being identified and treated early [3], [5], [6].

As a non-invasive imaging technique, computed tomography (CT) can detect the characteristics, e.g., bilateral patchy shadows or ground glass opacity (GGO), manifested in the COVID-19 infected lung [7], [8]. Hence CT may serve as an important tool for COVID-19 patients to be screened and diagnosed early. Despite these advantages, COVID-19 shares some common CT imaging characteristics with other types of pneumonia, making the automated distinction difficult.

Recently, deep learning based artificial intelligence (AI) technology has demonstrated tremendous success in the field of medical data analysis due to its capacity for extracting rich features from multimodal clinical datasets [9]. Previously, deep learning was developed for diagnosing and distinguishing bacterial and viral pneumonia from thoracic imaging data [10]. In addition, attempts have been made to detect various chest CT imaging features [11]. In the current COVID-19 pandemic, deep learning based methods have been rapidly developed for chest CT data analysis and classification [2], [3], [12]. Moreover, deep learning algorithms have been proposed for COVID-19 monitoring [13], screening [14] and prediction of the length of hospital stay [15]. A full list of current AI applications for COVID-19 related research can be found elsewhere [16]. In this study, we focus on chest CT image based localisation of the infected areas, together with disease classification and diagnosis for COVID-19 patients.

Although initial studies have demonstrated promising results by using chest CT for the diagnosis of COVID-19 and detection of the infected regions, most existing methods are based on commonly used supervised learning schemes. These require a considerable amount of manual labelling of the data; however, in such an outbreak situation clinicians have very limited time to perform tedious manual drawing, which may hinder the implementation of such supervised deep learning methods. In this study, we propose a weakly supervised deep learning framework to detect COVID-19 infected regions fully automatically using chest CT data acquired from multiple centres and multiple scanners. Based on the detection results, we can also achieve the diagnosis of COVID-19 patients. In addition, we test the hypothesis that, based on the CT radiological features, we can classify COVID-19 cases against community acquired pneumonia (CAP) and non-pneumonia (NP) scans using the deep neural networks we developed.

SECTION II.

Materials and Methods

A. Patients and Data

This retrospective study was approved by the institutional review boards of the participating hospitals in accordance with local ethics procedures, and the requirement for further consent was waived. This study included 150 3D volumetric chest CT exams each for COVID-19, CAP and NP patients. In total, 450 patient scans acquired from two participating hospitals between September 2016 and March 2020 were included for further analysis.

All the COVID-19 patients, scanned between December 2019 and March 2020, were confirmed positive by RT-PCR testing. According to the diagnosis and treatment program of COVID-19 (trial version 6) issued by the National Health Commission in China [17], the clinical classification of COVID-19 patients can be categorised as mild, moderate, severe, and critical. All our COVID-19 patients were at the severe or critical stage, and all the CT scans had been performed within 3 days of hospitalisation.

CAP and other NP (no lung disease, lung nodules, chronic inflammation, chronic obstructive pulmonary disease) patients were randomly chosen from the participating hospitals between September 2016 and January 2020. The inclusion criteria for CAP patients are in accordance with the guidelines on the management of community-acquired pneumonia in adults published by the Infectious Diseases Society of America/American Thoracic Society [18]. CAP diagnosis is based on the presence of identified clinical characteristics (e.g., cough, fever, sputum production, and pleuritic chest pain) accompanied by pulmonary examination, typically by chest X-ray and in our case using CT. In the regular examination of patients suspected to have CAP, a chest radiograph is needed to establish the diagnosis and to better distinguish CAP from other specific causes of cough and fever, such as acute bronchitis. Although various CT manifestations might be observed due to different pathogens, all our CAP patients were laboratory confirmed: either bacterial culture positive cases, or culture negative cases, e.g., with mycoplasma or viral pneumonia. Our assumption is that the proposed weakly supervised deep learning method can sense subtle discrepancies in CT images acquired from CAP and COVID-19 patients. NP patients were diagnosed with either no lung disease or non-pneumonia lung conditions, e.g., lung nodules, chronic inflammation, chronic obstructive pulmonary disease and others. It is of note that the criterion for a normal CT in this context is that the CT examination showed no obvious lesions in either lung.

Demographic statistics of the patients are reported in Table 1. One-way ANOVA (ANalysis Of VAriance) was conducted on the gender and age distributions over the three patient groups, and the $p$-values obtained suggest that there were no significant differences among the three groups in terms of gender and age distribution ($p > 0.05$).
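For reference, a minimal Python sketch of such a group-comparability check using scipy; the age values below are placeholders for illustration only, not the study data.

```python
# One-way ANOVA across the three patient groups (placeholder ages, not study data).
from scipy.stats import f_oneway

ages_covid19 = [54, 61, 47, 68, 59, 72, 63]
ages_cap     = [52, 63, 49, 70, 57, 66, 61]
ages_np      = [50, 60, 48, 66, 58, 69, 62]

f_stat, p_value = f_oneway(ages_covid19, ages_cap, ages_np)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")  # p > 0.05 suggests no significant group difference
```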

TABLE 1. Summary of the Patient Demographic Statistics

COVID-19 patients were admitted from two hospitals in China, including 138 patients from the Hospital of Wuhan Red Cross Society (WHRCH) and 12 patients from Shenzhen Second Hospital (SZSH). Both CAP and NP patients were recruited from SZSH. COVID-19 scans were acquired on either a SIEMENS SOMATOM go.Now16 (WHRCH) or a GE Revolution 256 (SZSH) CT system. For the SIEMENS SOMATOM go.Now16 CT system, the scanning parameters were as follows: tube voltage = 130 kVp, automatic tube current modulation = 50 mAs, pitch = 1.5 mm, matrix = $512\times 512$, slice thickness = 0.7 mm, field of view = 350 mm $\times$ 350 mm, and reconstructed slice thickness = 1 mm. For the GE Revolution 256 CT system, the scanning parameters were: tube voltage = 120 kVp, automatic tube current modulation = 150 mAs, pitch = 1.375 mm, matrix = $512\times 512$, slice thickness = 0.625 mm, field of view = 400 mm $\times$ 400 mm, and reconstructed slice thickness = 2 mm. All the CAP and NP patients were scanned using a SIEMENS SOMATOM Emotion CT system with the main imaging parameters of tube voltage = 110 kVp, automatic tube current modulation = 70 mAs, pitch = 1.2 mm, matrix = $512\times 512$, slice thickness = 1.2 mm, field of view = 260 mm $\times$ 260 mm, and reconstructed slice thickness = 1.5 mm. Details are shown in Table 2.

TABLE 2. Imaging Parameters of the CT Systems Used for COVID-19, CAP and NP Patients

B. Dataset for Lung Segmentation

In order to achieve a highly accurate lung segmentation that can facilitate the subsequent infection detection and classification, we utilised an open dataset (TCIA dataset) [19] for training a deep neural network for lung delineation. The data can be accessed from The Cancer Imaging Archive (TCIA) Public Access. In total, 60 3D CT lung scans with manual delineations of the lung anatomy were retrieved. The scans were obtained at three different institutions: the MD Anderson Cancer Centre, the Memorial Sloan-Kettering Cancer Centre, and the MAASTRO clinic, with 20 cases from each institution. All the data were scanned with matrix = $512\times 512$ and field of view = 500 mm $\times$ 500 mm, with a reconstructed slice thickness of either 1 mm, 2.5 mm or 3 mm.

C. Pre- and Post-Processing for Lung Segmentation

Data pre-processing steps were performed to standardise data acquired from multiple centres and multiple scanners. Instead of normalising input slices into a pre-defined Hounsfield unit (HU) window, we designed a more flexible scheme based on previously proposed image enhancement methods [20], [21]. Rather than clipping based on HU windows, we proposed to use a fixed-sized sliding window $W_{Q,S}$ (where $Q$ denotes the size of the window and $S$ denotes the step length of the sliding procedure) to find the range that covers most of the pixel values. This can reduce the bias of data acquired from different centres and different scanners. Loosely inspired by [22], we proposed a multi-view U-Net [23] based segmentation network for lung segmentation. Our multi-view U-Net based segmentation network consisted of a multi-window voting post-processing procedure and a sequential information attention module in order to utilise the information from each view of the 3D volume and reinforce the integrity of the 3D lung structure in the delineation results. Our lung segmentation model was trained, cross-validated and tested on the TCIA dataset with manual ground truth. The trained lung segmentation model was then used for inferring the delineation of the lung anatomy of the COVID-19, CAP and NP patients included in this study.
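To make the intensity normalisation concrete, the following Python sketch illustrates one possible implementation of the fixed-sized sliding window $W_{Q,S}$; the window width Q and step length S shown here are assumptions, as the exact values are not reported in the text.

```python
import numpy as np

def sliding_window_normalise(ct_slice, Q=1500, S=50):
    """Illustrative fixed-sized sliding-window normalisation W_{Q,S}.

    Q (window width, in HU) and S (step length) are assumed values; the window
    position covering the most pixel values defines the intensity range used
    for clipping and rescaling to [0, 1].
    """
    lo, hi = float(ct_slice.min()), float(ct_slice.max())
    starts = np.arange(lo, max(hi - Q, lo) + S, S)       # candidate window positions
    counts = [np.count_nonzero((ct_slice >= s) & (ct_slice < s + Q)) for s in starts]
    best = starts[int(np.argmax(counts))]                # window covering most pixels
    clipped = np.clip(ct_slice, best, best + Q)
    return (clipped - best) / Q
```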

D. Detection and Classification Network

Inspired by the VGG architecture [24], we adopted a configuration that increases CNN depth using small convolution filters stacked with non-linearities injected in between, as depicted in Figure 1. All convolution layers consisted of $3\times 3$ kernels, batch normalisation and Rectified Linear Units. The proposed CNN was fully convolutional, consisting of five convolutional blocks, i.e., Conv1, Conv2, Conv3, Conv4 and Conv5, in the backbone architecture. The full architecture, using shorthand notation, is $2\times C(32,3,1)$-MP-$2\times C(64,3,1)$-MP-$3\times C(128,3,1)$-MP-$3\times C(256,3,1)$-MP-$3\times C(256,3,1)$-MP, where $C(d,f,s)$ indicates a convolution layer with $d$ filters of spatial size $f\times f$, applied to the input with stride $s$. MP represents a non-overlapping max-pooling operation with a kernel size of $2\times 2$.
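For illustration, a minimal Keras sketch of this backbone; the layer counts, filter sizes and Conv1-Conv5 naming follow the shorthand above, while the single-channel 224x224 input and other implementation details are assumptions (the original model was implemented in TensorFlow 1.14, not Keras).

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, n_convs, name):
    # n x C(filters, 3, 1): 3x3 conv (stride 1) + batch norm + ReLU, then 2x2 max-pooling
    for i in range(n_convs):
        x = layers.Conv2D(filters, 3, strides=1, padding="same",
                          name=f"{name}_conv{i + 1}")(x)
        x = layers.BatchNormalization(name=f"{name}_bn{i + 1}")(x)
        x = layers.ReLU(name=f"{name}_relu{i + 1}")(x)
    return layers.MaxPooling2D(pool_size=2, name=f"{name}_mp")(x)

inputs = layers.Input(shape=(224, 224, 1))
x     = conv_block(inputs, 32, 2, "Conv1")   # 2 x C(32, 3, 1) - MP
x     = conv_block(x, 64, 2, "Conv2")        # 2 x C(64, 3, 1) - MP
conv3 = conv_block(x, 128, 3, "Conv3")       # 3 x C(128, 3, 1) - MP
conv4 = conv_block(conv3, 256, 3, "Conv4")   # 3 x C(256, 3, 1) - MP
conv5 = conv_block(conv4, 256, 3, "Conv5")   # 3 x C(256, 3, 1) - MP
backbone = tf.keras.Model(inputs, [conv3, conv4, conv5], name="backbone")
```

The three returned feature maps are the ones fed into the multi-scale classification heads described in the next subsection.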

FIGURE 1. Network architecture of our proposed weakly supervised multi-scale learning framework for COVID-19/NP/CAP classification and lesions detection.

E. Multi-Scale Learning

From previous findings using CT [25]–[27], it is known that COVID-19 infections share similar and common radiographic features with CAP, such as GGO and airspace consolidation. They are frequently distributed bilaterally and peripherally, with lower-zone predominance, and the infectious areas can vary significantly in size depending on the condition of the patients. For example, in mild cases the lesions appear small, but in severe cases they appear scattered and spread over a large area. Therefore, we proposed a multi-scale learning scheme to cope with variations in the size and location of the lesions. To implement this, we fed the intermediate CNN representations, i.e., the feature maps at Conv3, Conv4 and Conv5, respectively, into the weakly supervised classification layers, in which a $1\times 1$ convolution was applied to map the feature maps down to the class score maps (i.e., class activation maps). We then applied spatial aggregation with a Global Max Pooling (GMP) operation to obtain categorical scores. The score vectors at the Conv3, Conv4 and Conv5 levels were aggregated by summation to make a final prediction with a Softmax function. We then trained the proposed model end-to-end by minimising the following objective function \begin{equation*} L= -\frac {1}{N}\sum _{i=1}^{N}w_{i}f_{i}\left({S_{c}(x_{i})-\log \sum _{k=1}^{K}e^{S_{k}(x_{i})}}\right), \tag{1}\end{equation*}

where there are $N$ training images $x_{i}$ and $K$ training classes. $S_{k}$ is the $k$-th component in the score vector $\in \Re ^{K}$, and $c$ is the true class of $x_{i}$. As we encountered an imbalanced classification problem, we added a class-balanced weighting factor $w_{i}$ to the cross-entropy loss, set by the inverse class frequency, i.e., $w_{i}= \frac {1}{freq(c)}$. While this emphasises the importance of a rare class during training, it does not differentiate between easy and hard examples. For instance, in mild COVID-19 slices, infectious or diseased regions are often very small and not prominent; thus, they are prone to being misclassified as NP examples. To address this, we introduced another modulating factor to down-weight easy examples and therefore focus the training on hard examples [28]: $f_{i}=(1-P_{c})^{\gamma }$, where $P_{c}$ is the true class posterior probability of $x_{i}$. Intuitively, the modulating factor reduces the loss contribution from easy examples, which in turn increases the importance of correcting misclassified examples. When an example is misclassified and $P_{c}$ is small, the factor $f$ is near 1 and the loss is unaffected. As $P_{c}\to 1$, the factor goes to 0 and the loss for well-classified examples is down-weighted. The parameter $\gamma$ is a positive integer which smoothly adjusts the rate at which easy examples are down-weighted; as $\gamma$ increases, the modulating effect of the factor $f$ increases.
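For clarity, a small NumPy sketch of the class-balanced, focally modulated cross-entropy in Eq. (1); the function interface and the dummy values at the end are illustrative assumptions, with `scores` standing for the summed multi-scale class scores before the Softmax.

```python
import numpy as np

def weighted_focal_ce(scores, labels, class_freq, gamma=1):
    """Eq. (1): class-balanced cross-entropy with focal modulation.

    scores:     (N, K) aggregated class scores before the Softmax
    labels:     (N,)   true class indices c
    class_freq: (K,)   training frequency of each class (for w_i = 1/freq(c))
    """
    shifted = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_p_c = log_probs[np.arange(len(labels)), labels]       # S_c - log sum_k e^{S_k}
    p_c = np.exp(log_p_c)                                      # true-class posterior P_c
    w = 1.0 / class_freq[labels]                               # inverse class frequency
    f = (1.0 - p_c) ** gamma                                   # focal modulating factor
    return -np.mean(w * f * log_p_c)                           # L in Eq. (1)

# example usage with dummy values
scores = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
labels = np.array([0, 2])
print(weighted_focal_ce(scores, labels, class_freq=np.array([0.2, 0.5, 0.3])))
```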

F. Weakly Supervised Lesions Localisation

After determining the class score maps and the image category in a forward pass through the network, the discriminative patterns corresponding to that category can then be localised in the image. A coarse localisation could already be achieved by directly relating each of the neurons in the class score maps to its receptive field in the original image. However, it is also possible to obtain pixel-wise maps containing information about the location of class-specific target structures at the resolution of the original input images. This can be achieved by calculating how much each pixel influences the activation of the neurons in the target score map. Such maps can be used to obtain a much more accurate localisation, like the examples shown in Figure 2.

FIGURE 2. Examples of saliency maps for COVID-19 lesions localisation: (a) shows an example input image, (b) shows the saliency map obtained at Conv3, (c) shows the saliency map obtained at Conv4, (d) shows the saliency map obtained at Conv5, (e) shows the overlay of the joint saliency map (pixel-wise multiplication of the Conv3, Conv4 and Conv5 saliency maps) with the input image, and (f) shows the resulting bounding boxes.

In the following, we show how category-specific saliency maps can be obtained through integrated gradients. In addition, we show how to post-process the saliency maps in order to extract bounding boxes around the detected lesions.

1) Category-Specific Saliency

Generally, suppose we have a flattened input image denoted as $x=(x_{1}, \ldots,x_{n})\in \Re ^{n}$ (number of pixels = $n$); a category-specific saliency map can be obtained by calculating the gradient of the predicted class score $S(x)$ at the input $x$: $g=\frac {\partial S(x)}{\partial x}=(g_{1},\ldots,g_{n})\in \Re ^{n}$, where $g_{i}$ represents the contribution of the individual pixel $x_{i}$ to the prediction. The gradient can be estimated by back-propagating the final prediction score through each layer of the network. There are many state-of-the-art back-propagation approaches, including Guided-Backpropagation [29], DeepLift [30] and Layer-wise Relevance Propagation (LRP) [31]. However, the Guided-Backpropagation method may break gradient sensitivity because it back-propagates through a ReLU node only if the ReLU is turned on at the input. In particular, the lack of sensitivity causes gradients to focus on irrelevant features and results in undesired saliency localisation. The DeepLift and LRP methods tackle the sensitivity issue by computing discrete gradients instead of instantaneous gradients at the input. However, they fail to satisfy implementation invariance because the chain rule does not hold for discrete gradients in general, so the back-propagated gradients are potentially sensitive to unimportant features of the models. To deal with these limitations, we employ a feature attribution method named “Integrated Gradients” [32] that assigns an importance score $\phi _{i}(S(x),x)$ (similar to a pixel-wise gradient) to the $i$-th pixel, representing how much the pixel value adds to or subtracts from the network output. A large positive score indicates that the pixel strongly increases the prediction score $S(x)$, while an importance score close to zero indicates that the pixel does not influence $S(x)$. To compute the importance score, we need to introduce a baseline input representing the “absence” of the feature input, denoted as $x'=(x_{1}',\ldots,x_{n}')\in \Re ^{n}$, which in our study was a null image (filled with zeros) with the same shape as the input image $x$. We considered the straight-line path from the baseline $x'$ to the input $x$ and computed the gradients at all points along the path. Integrated gradients can be defined as \begin{equation*} \phi _{i}(S(x),x,x')=(x_{i}-x'_{i})\times \int _{\alpha =0}^{1}\frac {\partial S(x'+\alpha (x-x'))}{\partial x_{i}}d\alpha, \tag{2}\end{equation*}

where $\alpha \in [0,1]$. Intuitively, integrated gradients obtain importance scores by accumulating gradients on images interpolated between the baseline value and the current input. The integral in Eq. 2 can be efficiently approximated via a summation of the gradients:
\begin{equation*} \phi _{i}(S(x),x,x')\approx (x_{i}-x'_{i})\times \sum _{n=1}^{m}\frac {\partial S\left({x'+\frac {n}{m}\times (x-x')}\right)}{\partial x_{i}}\times \frac {1}{m}, \tag{3}\end{equation*}
where $m$ is the number of steps in the Riemann approximation of the integral. We compute the approximation in a loop over the set of inputs, i.e., for $n=1,\ldots,m$. The integrated gradients are computed at different feature levels, which in our experiments are Conv3, Conv4 and Conv5 respectively, as shown in Figure 2(b), Figure 2(c) and Figure 2(d). Then, a joint saliency can be obtained, as depicted in Figure 2(e), by pixel-wise multiplication of the multi-scale integrated gradients.
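A minimal Python sketch of the approximation in Eq. (3); `grad_fn` is a hypothetical callable returning the gradient of the target class score with respect to the input (e.g., obtained with automatic differentiation), and the zero-image baseline follows the text.

```python
import numpy as np

def integrated_gradients(x, grad_fn, m=25):
    """Riemann-sum approximation of integrated gradients (Eq. 3).

    x:       input image (array)
    grad_fn: callable returning dS/dx for the target class score at a given input
    m:       number of interpolation steps along the straight-line path
    """
    baseline = np.zeros_like(x)                           # "absence" baseline x'
    grad_sum = np.zeros_like(x, dtype=float)
    for n in range(1, m + 1):                             # points along the path
        grad_sum += grad_fn(baseline + (n / m) * (x - baseline))
    return (x - baseline) * grad_sum / m
```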

2) Bounding Box Extraction

Next, we post-processed the joint saliency map, from which a bounding box can be extracted. Firstly, we took the absolute value of the joint saliency map and blurred it with a $5\times 5$ Gaussian kernel. Then, we thresholded the blurred saliency map using the Isodata thresholding method [33], which iteratively determines a threshold segmenting the image into foreground and background, where the threshold is midway between the mean intensities of the sampled foreground and background pixels. In doing so, we obtained a binary mask on which we applied morphological operations (dilation followed by erosion) to close the small holes in the foreground. Finally, we took the connected components with areas above a certain threshold and fit minimum rectangular bounding boxes around them. An example is shown in Figure 2(f).
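A sketch of this post-processing chain using SciPy and scikit-image; the Gaussian sigma, the structuring element size and the minimum-area threshold are assumptions, as the exact values are not given in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.filters import threshold_isodata
from skimage.measure import label, regionprops
from skimage.morphology import binary_closing, square

def extract_boxes(joint_saliency, min_area=50):
    """Extract lesion bounding boxes from a joint saliency map."""
    blurred = gaussian_filter(np.abs(joint_saliency), sigma=1.0)   # ~5x5 Gaussian blur
    mask = blurred > threshold_isodata(blurred)                    # Isodata threshold
    mask = binary_closing(mask, square(3))                         # dilation then erosion
    regions = regionprops(label(mask))                             # connected components
    return [r.bbox for r in regions if r.area >= min_area]         # (min_r, min_c, max_r, max_c)
```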

G. Implementation Details

1) Experiments Setup

We trained the proposed model for both a three-way classification (i.e., $K=3$ for NP, CAP and COVID-19) and three binary classification tasks ($K=2$), i.e., NP vs. COVID-19, NP vs. CAP and CAP vs. COVID-19, respectively. In the three-way classification setting, we first trained individual classifiers at different convolution blocks; in our experiments, we chose Conv3, Conv4 and Conv5, respectively. Then, we trained a joint classifier on the aggregated prediction scores (as described in the “Multi-Scale Learning” Section). All the classifiers were trained with the loss in Eq. 1. Finally, we conducted a 5-fold cross-validation on all tasks, in which, for each category, we split the datasets into training, validation and test sets. This ensured that no samples (images) originating from validation and test patients were used for training. In each fold, we held out 20% of all samples for validation and test, and the remainder were used for training.
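An illustrative patient-level split is sketched below; the paper does not specify the splitting code, so GroupKFold and the placeholder arrays are used here purely to show how slices from validation and test patients can be kept out of the training fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# placeholder per-slice metadata: 20 slices from 10 patients across 3 classes
slices      = np.zeros((20, 224, 224))
labels      = np.repeat([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], 2)
patient_ids = np.repeat(np.arange(10), 2)

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, heldout_idx) in enumerate(
        gkf.split(slices, labels, groups=patient_ids)):
    # split the held-out ~20% of patients into validation and test subsets
    heldout_patients = np.unique(patient_ids[heldout_idx])
    val_patients = heldout_patients[: len(heldout_patients) // 2]
    is_val = np.isin(patient_ids[heldout_idx], val_patients)
    val_idx, test_idx = heldout_idx[is_val], heldout_idx[~is_val]
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val / {len(test_idx)} test slices")
```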

2) Training Configurations

We implemented the proposed model (as depicted in Figure 1) using TensorFlow 1.14.0. All models were trained from scratch on four NVIDIA GeForce GTX 1080 Ti GPUs with an Adam optimiser (learning rate: $10^{-4}$, $\beta _{1}=0.5$, $\beta _{2}=0.9$ and $\epsilon =10^{-8}$). We set $\gamma$ to 1 in the focal modulator $f$, and the total number of training iterations was set to 20,000. Early stopping was enabled to terminate training automatically when the validation loss stopped decreasing for 1,000 iterations. We ran validation once every 500 iterations of training, and a checkpoint was saved automatically if the current validation accuracy exceeded the previous best validation accuracy. Once training was terminated, we generated a frozen graph from the latest checkpoint and saved it in .pb format. For testing, we simply loaded the frozen graphs and retrieved the required nodes. Empirically, we found that 20 to 30 steps were sufficient to approximate the integral when computing the integrated gradients; thus, we fixed $m=25$ in Eq. 3.
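For reference, the stated optimiser settings expressed with Keras equivalents; this is a sketch only, since the original implementation used TensorFlow 1.14 training loops, and the epoch-level early-stopping patience below only approximates the iteration-based rule above.

```python
import tensorflow as tf

# Adam settings as reported in the text
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-4, beta_1=0.5, beta_2=0.9, epsilon=1e-8)

# approximate "validate every 500 iterations, stop after 1,000 iterations
# without improvement" with an epoch-level callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True)
```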

3) Data Augmentation

We applied several random on-the-fly data augmentation strategies during training, including (1) cropping square patches at the centre of the input frames with a scaling factor randomly chosen between 0.7 and 1, and resizing the crops to the size of $224\times 224$ (input resolution); (2) rotation with an angle randomly selected within $-25^{\circ}$ to $25^{\circ}$; (3) random horizontal reflection, i.e., flipping the images in the left-right direction with a probability $p = 0.5$; and (4) contrast adjustment by randomly darkening or brightening with a factor ranging between 0.5 and 1.5.
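A sketch of these four augmentation steps applied to a single 2D slice using NumPy/SciPy; the parameter ranges follow the text, while the implementation itself is an assumption.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(image, rng=None):
    """On-the-fly augmentation for a 2D slice with values in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape
    # (1) random centre crop (scale 0.7-1.0), resized back to 224x224
    s = rng.uniform(0.7, 1.0)
    ch, cw = int(h * s), int(w * s)
    top, left = (h - ch) // 2, (w - cw) // 2
    out = image[top:top + ch, left:left + cw]
    out = zoom(out, (224 / out.shape[0], 224 / out.shape[1]), order=1)
    # (2) random rotation between -25 and +25 degrees
    out = rotate(out, angle=rng.uniform(-25, 25), reshape=False, order=1, mode="nearest")
    # (3) horizontal flip with probability 0.5
    if rng.random() < 0.5:
        out = np.fliplr(out)
    # (4) random darkening/brightening with a factor in [0.5, 1.5]
    return np.clip(out * rng.uniform(0.5, 1.5), 0.0, 1.0)
```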

FIGURE 3. Dice scores of the lung segmentation using different pre-processing and post-processing methods on the TCIA dataset. Left Panel: without any pre-processing; Middle Panel: normalising using a pre-defined Hounsfield unit (HU) window; Right Panel: normalising using the proposed fixed-sized sliding window. W/O P: without multi-view learning based post-processing; W P: with multi-view learning based post-processing.

H. Evaluation Metrics

Using positive RT-PCR test results as the ground truth labelling for the COVID-19 group, and the diagnosis results for the CAP and NP patients, the accuracy, precision, sensitivity and specificity [34], [35] of our classification framework were calculated. We also carried out an area under the receiver operating characteristic curve (AUC) analysis to quantify the classification performance. For the lung segmentation, we used the Dice score [36] to evaluate accuracy.
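For reference, a short scikit-learn sketch showing how these metrics can be computed for one binary task; the arrays below are placeholders, not study results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = COVID-19, 0 = NP
y_score = np.array([0.9, 0.2, 0.8, 0.6, 0.3, 0.4, 0.7, 0.1])   # predicted probabilities
y_pred  = (y_score >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
prc = precision_score(y_true, y_pred)
sen = recall_score(y_true, y_pred)                 # sensitivity (recall of positives)
spe = recall_score(y_true, y_pred, pos_label=0)    # specificity (recall of negatives)
auc = roc_auc_score(y_true, y_score)
print(f"ACC {acc:.3f} PRC {prc:.3f} SEN {sen:.3f} SPE {spe:.3f} AUC {auc:.3f}")

def dice(pred_mask, gt_mask):
    """Dice score for binary segmentation masks."""
    return 2.0 * np.logical_and(pred_mask, gt_mask).sum() / (pred_mask.sum() + gt_mask.sum())
```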

SECTION III.

Experiments and Results

A. Lung Segmentation

In order to evaluate the lung segmentation network, we randomly split the 60 TCIA cases with ground truth into 40 training, 10 validation and 10 independent testing cases. Ablation study results for the different pre-processing and post-processing methods, measured by Dice scores, are shown in Figure 3.

B. Infection Detection

1) Class Activation Mapping

As a result of multi-scale learning, Figure 4 illustrates some examples of COVID-19 class activation maps (CAMs) obtained at the different feature levels, i.e., Conv3, Conv4 and Conv5. The CAMs depict the spatial distribution of the classification probability, on which the hot areas indicate where the infected areas are: the hotter the areas, the more likely they are infected. Of note from the multi-scale CAMs, our proposed model learns to capture the distributions of lesions at different scales: for instance, large patchy-like lesions, such as crazy paving sign and consolidation, and also small nodule-like lesions, such as ground-glass opacities (GGO) and bronchovascular thickening. Although the CAMs can indicate where the diseased regions are, they are still too coarse to localise and estimate the extent of lesions precisely. The saliency maps shown in Figure 5, on the other hand, can provide pixel-level information that delineates the exact extent of the lesions, and therefore support a precise localisation of the lesions. Notably from the saliency maps, the mid-level layer, i.e., Conv3, learns to detect small lesions (GGO most frequently), especially those distributed peripherally and subpleurally. However, Conv3 is not able to capture larger patchy-like lesions, which may be because of the limited receptive field at the mid-level layer. On the contrary, the higher-level layers, e.g., Conv4 and Conv5, have sufficiently large receptive fields to detect the diffuse and patchy-like lesions, such as crazy paving sign and consolidation, which are often distributed centrally and peribronchially. However, Conv4 and Conv5 tend to overestimate the extent of small lesions. The multi-scale features complement each other and result in more precise localisation and estimation of the lesion extent, as shown in the joint saliency maps.

FIGURE 4. Results of the multi-scale COVID-19 class activation mapping.

FIGURE 5. Multi-scale detection of COVID-19 lesions with varied sizes. Green box: small lesions. Yellow box: mix of small and large patchy or strip-like lesions. Red box: large lesions.

2) Category-Specific Saliency

Figure 6 shows examples of the category-specific joint saliency computed by integrated gradients, with the original inputs on the left and the overlaid saliency on the right. The CAMs shown in Figure 4 only depict the spatial distribution of infection and cannot be used for precise localisation of the lesions. The saliency maps, on the other hand, provide pixel-level information that delineates the exact extent of the lesions, thereby providing a precise localisation of the lesions.

FIGURE 6. Results of the category-specific joint saliency.

The saliency maps can also be useful for diagnosis, in that the percentage of the lung areas that are infected can be estimated automatically. These saliency maps highlight the pixels that contribute to increasing the category-specific scores: the brighter the pixels, the more significant the contribution. Intuitively, one can also interpret this as: the brighter the pixels, the more critical the features are for the network to make its decision (prediction). It is of note that in Figure 4 and Figure 6 there is not only inter-class contrast variation (because the data were collected from multiple institutions) but also intra-class contrast variation, especially in the COVID-19 group. In our experiments, we found that histogram matching can suppress lesions, especially in COVID-19 images; for instance, GGO disappears or becomes less apparent, and this leads to inferior detection performance. Therefore, instead of directly applying histogram matching, we applied random on-the-fly contrast adjustment for data augmentation at training time. This turned out to be very effective: as demonstrated in Figure 6, our proposed model learns to be invariant to image contrast and precisely captures the lesions.

In particular, in Figure 8, we randomly selected typical example images to illustrate the variations of image contrast in COVID-19 cases and compared the saliency maps obtained from models trained with and without contrast augmentation (CA vs. NCA). We found that without contrast augmentation, the saliency maps tend to be noisy and poor in localisation, as mis-detections are often observed, such as only partial instances of infection being captured, or regions without infection being captured. In contrast, with contrast augmentation, the learned models generate more discriminative saliency maps, and the localisation of infected areas is robust and more accurate against contrast variation. As can be seen (enclosed by the green box), our model with contrast augmentation is capable of capturing all the diseased regions and highlighting their extent precisely, regardless of whether there are single or multiple instances of infection.

FIGURE 7. Bounding boxes extracted from saliency for COVID-19 and CAP examples (corresponding to the examples in Figure 6).

FIGURE 8. Effect of applying random contrast augmentation (in data augmentation). Contrast adjustment leads to better saliency quality (less noisy) and more precise and contrast-invariant detection of infected areas. Cyan arrows: false positives of the saliency maps; Pink arrows: false negatives of the saliency maps; NCA: No Contrast Adjustment; CA: with Contrast Adjustment.

In addition, from the COVID-19 and CAP saliency maps, we found that the CAP lesions are generally smaller and more locally constrained compared to the COVID-19 cases, which often have multiple infected regions with massive and scattered lesions. It should also be noted that COVID-19 and CAP lesions do share similar radiographic features, such as GGO and air space consolidation, and GGOs also appear frequently in subpleural regions in CAP cases. Interestingly, from the saliency maps for the NP cases, we found that the network takes the pulmonary arteries as the salient feature. Finally, Figure 7 shows the bounding boxes extracted from COVID-19 and CAP saliency maps (corresponding to the examples in Figure 6). The results agree with our primary findings that CAP cases have fewer infected areas and often a single instance of infection, whereas COVID-19 cases often have more infected areas (multiple instances of infection), and the COVID-19 lesions vary greatly in extent. Overall, CAP infection areas are smaller compared to those of COVID-19.

C. Classification Performance

The performance of our proposed model for each specific task was evaluated with 5-fold cross-validation, and the results on the test set are reported and summarised in Table 3. We used five evaluation metrics: accuracy (ACC), precision (PRC), sensitivity (SEN), specificity (SPE) and the area under the ROC curve (AUC). We report the mean of the 5-fold cross-validation results for each metric with the 95% confidence interval. We also compared our proposed method with a reimplementation of the Navigator-Teacher-Scrutinizer Network (NTS-NET) [37].

TABLE 3. The Overall Classification Performance Comparison Between Different Tasks on the Test Set. Values in Brackets are 95% Confidence Intervals. AUC: Area Under the Receiver Operating Characteristic Curve, COVID-19: Coronavirus Disease 2019, CAP: Community Acquired Pneumonia, NP: Non-Pneumonia. *: Three-Way Classification Tasks (i.e., NP/CAP/COVID-19). ≀: Binary Classification Tasks. †: Single-Scale Learning. ‡: Multi-Scale Learning. NCA: No Contrast Adjustment (Data Augmentation)

As described earlier in the experimental settings, we have two groups of tasks: three-way classification tasks (indicated by *) and binary classification tasks (indicated by ≀), and two learning configurations: single-scale learning (indicated by †), which assigns an auxiliary classifier to a specific feature level, and multi-scale learning (indicated by ‡), which aggregates the multi-level prediction scores and trains a joint classifier. All the binary tasks listed were trained with multi-scale learning. In terms of the three-way classification, we found that multi-scale learning with the joint classifier achieves superior overall performance compared to any of the single-scale learning tasks. It is of note that among the single-scale learning tasks, classification with Conv4 and Conv5 features achieves very similar performance in every metric, which is significantly better than classification with mid-level, i.e., Conv3, features. One possible explanation is that the mid-level features are not sufficiently semantic compared to the higher-level features, i.e., Conv4 and Conv5. As we know, high-level CNN representations are semantically strong but poor at preserving spatial details, whereas mid- to lower-level CNN representations preserve local features well but lack semantic information.

Furthermore, it is of note that, overall, the binary classification tasks achieve significantly better performance than the three-way classification, especially for tasks such as NP/COVID-19 and NP/CAP. As suggested by the results, our proposed model is reasonably good at distinguishing COVID-19 cases from NP cases, achieving a mean ACC of 96.2%, PRC of 97.3%, SEN of 94.5%, SPE of 95.3% and AUC of 0.970, respectively. One explanation is that binary classification is less complicated and involves less uncertainty than three-way classification. It may also be because COVID-19 and CAP image features are intrinsically discriminative compared to the NP cases; for instance, as the COVID-19 cases demonstrated earlier, there is often a combination of various diseased patterns and large areas of infection in the scans.

Last but not least, we found that the performance of the COVID-19/CAP classification is the lowest among all the binary classification tasks. One possible reason is that COVID-19 shares similar radiographic features with CAP, such as GGO and airspace consolidation, and the network capacity may not be sufficient to learn disease-specific representations. Nevertheless, the results obtained using our proposed method outperformed those obtained by the NTS-NET.

We also break down the overall performance of the joint classifier into classes, and the classification metrics are reported for each class, as shown in Table 4 and Figure 9. We found that the models learned without contrast augmentation are biased, in that the classification performance for COVID-19 is significantly better than for the other two classes. This may be because the models learn to discriminate the classes based on image style (contrast) rather than content (normal or disease patterns), and the COVID-19 class in our data has the most discriminative contrast style (high variability in brightness) among all three classes. In comparison, learning with contrast augmentation results in superior overall classification performance (Table 3) and no class bias (Table 4). In addition, the “COVID-19” and “NP” classes achieve comparable performance in each metric, and the “NP” class has a higher sensitivity (91.3%) than COVID-19 (87.6%) and CAP (83.0%). Moreover, we found that, overall, “COVID-19” remains the best performing and most discriminative class with a mean AUC of 0.923, compared to “CAP” (0.864) and “NP” (0.901). It can also be noted that the overall results for the “CAP” class are moderately lower than those of the “NP” and “COVID-19” classes. This could be correlated with our finding in the COVID-19/CAP classification that, because of the similar appearance, the “CAP” class is sometimes misclassified as “COVID-19”. Another possible reason is that the network could have learned and been distracted by a few “NP noises”: because we sampled all the available slices from each subject, there might be a fractional number of non-infected slices among the CAP training samples.

TABLE 4. The Performance (Broken Down Into Each Individual Class) of Three-Way Classification on the Test Set. Values in Brackets are 95% Confidence Intervals. AUC: Area Under the Receiver Operating Characteristic Curve, COVID-19: Coronavirus Disease 2019, CAP: Community Acquired Pneumonia, NP: Non-Pneumonia. *: No Random Contrast Adjustment. †: With Random Contrast Adjustment
FIGURE 9. Receiver operating characteristic (ROC) of individual categories for three-way classification (5-fold cross-validated). (a) NP with AUC of 0.90 ± 0.03 (mean ± standard deviation); (b) CAP with AUC of 0.86 ± 0.03; (c) COVID-19 with AUC of 0.92 ± 0.02. The green region indicates the 95% CI. COVID-19: coronavirus disease 2019, CAP: Community Acquired Pneumonia, NP: Non-Pneumonia, CI: Confidence Interval.

SECTION IV.

Discussions

In this work, we have presented a novel weakly supervised deep learning framework that is capable of learning to detect and localise lesions on COVID-19 and CAP CT scans from image-level labels only. Different from other works, we leverage representation learning at multiple feature levels and have explained what features can be learned at each level. For instance, the high-level representation, i.e., Conv5, captures the patchy-like lesions that generally have a large extent, but tends to discard small local lesions. This is well complemented by the mid-level representations (Figure 4), i.e., Conv3 and Conv4, from which the detected lesions also correspond to our clinical findings that the infections are usually located in the peripheral lung (95%), mainly in the inferior lobe of the lungs (65%), especially in the posterior segment (51%). We speculate that this is mainly because there are more well-developed bronchioles and alveoli, rich blood flow and immune cells such as lymphatic cells in the periphery; these immune cells play a vital role in the inflammation caused by the virus. We have also demonstrated that combining multi-scale saliency maps, generated by integrated gradients, is the key to achieving a precise localisation of multi-instance lesions.

Furthermore, from a clinical perspective, the joint saliency is useful in that it provides a reasonable estimation of the percentage of infected lung areas, which is a crucial factor that clinicians take into account when evaluating the severity of a COVID-19 patient. In addition, the classification performance of the proposed network has been studied extensively: we have conducted not only three-way classification but also binary classification by combining any two of the classes.

We found that one limitation of the proposed network is that it is not discriminative enough when it comes to separating CAP from COVID-19. We suspect this is due to the limited capacity of the backbone CNN; a straightforward way of boosting CNN capacity is to increase the number of feature channels at each level. Another future attempt would be to employ more advanced backbone architectures, such as ResNet and Inception. A further limitation of this work is that we trained the networks on individual slices (images), using all available samples for each subject. For the CAP or COVID-19 subjects, however, there might be fractional non-infection slices in between, which could introduce noise into training; this has been confirmed by scrutiny from our clinicians. In the future, we can address this limitation with attention-based multiple instance learning: instead of training on individual slices, we put the patient-specific slices into a bag and train on bags. The network will learn to assign weights to individual slices in a COVID-19 or CAP positive bag and automatically sample the highly weighted slices for infection detection. Further supervision via labelled non-infection slices may also boost the performance of our proposed model, but at the cost of a time-consuming manual labelling procedure.

SECTION V.

Conclusion

In this study, we designed a weakly supervised deep learning framework for fast and fully automated detection and classification of COVID-19 infection using retrospectively collected CT images from multiple scanners and multiple centres. Our framework can accurately distinguish COVID-19 cases from CAP and NP patients. It can also pinpoint the exact positions of the lesions or inflammation caused by COVID-19, and therefore can potentially provide advice on patient severity to guide subsequent triage and treatment. Experimental findings have indicated that the proposed model achieves high accuracy, precision and AUC for the classification, as well as promising qualitative visualisation of the lesion detections. Based on these findings, we can envisage a large-scale deployment of the developed framework.

ACKNOWLEDGMENT

(Shaoping Hu, Yuan Gao, and Zhangming Niu contributed equally to this work.)
