Abstract

Aim. COVID-19 has caused large death tolls all over the world. Accurate diagnosis is of significant importance for early treatment. Methods. In this study, we proposed a novel PSSPNN model for classification between COVID-19, secondary pulmonary tuberculosis, community-acquired pneumonia, and healthy subjects. PSSPNN entails five improvements: first, we proposed the n-conv stochastic pooling module. Second, a novel stochastic pooling neural network was proposed. Third, PatchShuffle was introduced as a regularization term. Fourth, an improved multiple-way data augmentation was used. Fifth, Grad-CAM was utilized to interpret our AI model. Results. Ten runs with random seeds on the test set showed our algorithm achieved a microaveraged F1 score of 95.79%. Moreover, our method outperforms nine state-of-the-art approaches. Conclusion. The proposed PSSPNN will help radiologists diagnose COVID-19 cases more quickly and accurately.

1. Introduction

COVID-19 is a disease caused by a new strain of coronavirus: “CO” stands for corona, “VI” for virus, and “D” for disease. As of 22 December 2020, COVID-19 had caused more than 78.0 million confirmed cases and over 1.71 million deaths (US 326.5 k, Brazil 188.2 k, India 146.1 k, Mexico 119.4 k, Italy 69.8 k, and UK 68.3 k).

To diagnose COVID-19, two methods exist: (i) real-time reverse transcriptase polymerase chain reaction (rRT-PCR) with nasopharyngeal swab samples to test for the existence of viral RNA fragments and (ii) chest imaging (CI), which examines the visual evidence of COVID-19. The rRT-PCR method requires waiting a few days for results, while the CI approach can produce results within minutes. CI has several advantages over rRT-PCR. First, the swab may be contaminated [1, 2]. Second, CI can detect the lung lesions in which “ground-glass opacity (GGO)” is observed, distinguishing COVID-19 from healthy subjects. Third, CI provides an immediate result as soon as imaging is complete. Fourth, reports show that chest computed tomography (CCT), one type of CI approach, can detect 97% of COVID-19 infections [3].

Currently, there are three main CI approaches: chest X-ray (CXR) [4], chest CT (CCT), and chest ultrasound (CUS) [5]. Among the three, CCT provides the finest resolution, allowing visualization of extremely small nodules in the lung. An additional advantage of CCT is that it provides high-quality, three-dimensional chest data in which radiologists can clearly view COVID-19 lesions that may be obscure in the other two CI approaches.

However, manual labeling by human experts is tedious, labor-intensive, and time-consuming [6]. Besides, labeling performance is easily affected by interexpert and intraexpert factors (e.g., emotion, tiredness, and lethargy). Moreover, the labeling throughput of radiologists is not comparable with that of artificial intelligence (AI) models: a senior radiologist may diagnose one scan within five minutes, but AI can analyze thousands of samples within one minute. Particularly, early lesions are small and similar to surrounding normal tissues, making them more challenging to measure and hence easy for radiologists to neglect.

Traditional AI methods have been applied successfully in many medical fields. For instance, Wang et al. [7] chose wavelet packet Tsallis entropy as a feature descriptor and employed a real-coded biogeography-based optimization (RCBO) classifier. Jiang and Zhang [8] proposed a 6-level convolutional neural network (6L-CNN) for therapy and rehabilitation; their performance was improved by replacing the traditional rectified linear unit with a leaky rectified linear unit. Fulton et al. [9] used ResNet-50 (RN-50) to classify Alzheimer’s disease with and without imagery and found that ResNet-50 models help identify AD patients. Guo and Du [10] utilized a ResNet-18 (RN-18) model to recognize the thyroid ultrasound standard plane (TUSP), achieving a classification accuracy of 83.88%; the experiments verified the effectiveness of RN-18. The aforementioned four algorithms can be transferred to the multiclass classification task of COVID-19 diagnosis.

On COVID-19 datasets, several recent publications reported promising results. For example, Cohen et al. [11] proposed a COVID severity score network (CSSNet), which achieved a mean absolute error (MAE) of 1.14 on the geographic extent score and an MAE of 0.78 on the lung opacity score. Li et al. [12] developed a fully automatic model (COVNet) to detect COVID-19 using chest CT and evaluated its performance. Wang et al. [13] proposed a 3D deep convolutional neural network to detect COVID-19 (DeCovNet). Zhang et al. [14] proposed a seven-layer convolutional neural network for COVID-19 diagnosis (7L-CNN-CD) and evaluated it on the binary classification task (COVID-19 against healthy subjects). Ko et al. [15] proposed a fast-track COVID-19 classification network (FCONet). Wang et al. [16] presented CCSHNet via transfer learning and discriminant correlation analysis. Owing to the page limit, the details of those methods are not described here, but we compare our method with these state-of-the-art methods in the following sections.

Our study’s inspiration is to improve the recognition performance of COVID-19 infection in CCT images by developing a novel deep neural network, PSSPNN, short for PatchShuffle stochastic pooling neural network. Our contributions entail the following five angles:
(i) The “n-conv stochastic pooling module (NCSPM)” is proposed, which comprises n-times repetitions of convolution layers and batch normalization layers, followed by stochastic pooling.
(ii) A novel “stochastic pooling neural network (SPNN)” is proposed, the structure of which is inspired by VGG-16.
(iii) A more advanced neural network, PatchShuffle SPNN (PSSPNN), is proposed, where PatchShuffle is introduced as the regularization term in the loss function of SPNN.
(iv) An improved multiple-way data augmentation is utilized to help the network avoid overfitting.
(v) Grad-CAM is used to show the explainable heatmap, which displays the association with lung lesions.

2. Dataset

This retrospective study was exempted by the Institutional Review Boards of the local hospitals. Four types of CCT were used: (i) COVID-19-positive patients, (ii) community-acquired pneumonia (CAP), (iii) secondary pulmonary tuberculosis (SPT), and (iv) healthy control (HC). The three diseased classes (COVID-19, CAP, and SPT) were chosen since they are all infectious diseases of the chest region. We intend to include a fifth category (chest tumors) in future studies.

For each subject, slices of CCT were chosen via a slice level selection (SLS) method. For the three diseased groups (COVID-19, CAP, and SPT), the slice displaying the largest number and size of lesions was chosen. For HC subjects, any slice within the 3D image was randomly chosen. The slice-to-subject ratio is defined as $r = N_{\mathrm{slice}} / N_{\mathrm{subject}}$, where $N_{\mathrm{slice}}$ stands for the number of slices chosen via the SLS method and $N_{\mathrm{subject}}$ is the number of patients.

In all, we enrolled 521 subjects and produced 1164 slice images using the SLS method. Table 1 lists the demographics of the four-category subject cohort with the values of the triplet $(N_{\mathrm{subject}}, N_{\mathrm{slice}}, r)$, where $r$ of the total set equals $1164 / 521 \approx 2.23$.

Three experienced radiologists (two juniors, $R_1$ and $R_2$, and one senior, $R_3$) were convened to curate all the images. Suppose $x$ denotes one CCT scan and $L_k(x)$ stands for the labeling given by radiologist $R_k$. The final label $L(x)$ of the CCT scan is obtained by

$$L(x) = \begin{cases} L_1(x), & L_1(x) = L_2(x), \\ \operatorname{MAV}\{L_1(x), L_2(x), L_3(x)\}, & L_1(x) \neq L_2(x), \end{cases}$$

where MAV denotes majority voting over the labelings of all three radiologists.

The above equation means that in situations of disagreement between the analyses of the two junior radiologists $(R_1, R_2)$, we consult the senior radiologist $R_3$ to reach an MAV-type consensus. Data is available on request due to privacy/ethical restrictions.

3. Methodology

Table 2 lists the abbreviations and their full meanings used in this study. Section 3.1 shows the preprocessing procedure. Sections 3.2–3.5 offer four improvements. Finally, Section 3.6 gives the implementation, measurement indicators, and explainable technology used in our method.

3.1. Preprocessing

The original raw dataset $S_0 = \{s_0(i)\}, i = 1, \dots, 1164$, contained the 1164 slice images produced by the SLS method, all of the same raw size. Figure 1 presents the pipeline for preprocessing this dataset.

First, the color CCT images of the four classes were converted into grayscale by retaining the luminance channel, obtaining the grayscale dataset $S_1 = \mathcal{G}(S_0)$, where $\mathcal{G}$ means the grayscale operation.

In the second step, histogram stretching (HS) was utilized to increase the contrast of all images. Take the $i$-th image $s_1(i)$ as an example; its minimum and maximum grayscale values $g_{\min}(i)$ and $g_{\max}(i)$ were calculated as

$$g_{\min}(i) = \min_{w, h, c} s_1(w, h, c \mid i), \qquad g_{\max}(i) = \max_{w, h, c} s_1(w, h, c \mid i),$$

where $(w, h, c)$ means the index along the width, height, and channel directions of image $s_1(i)$, and the extrema are taken over all width, height, and channel positions of the image. The new histogram-stretched image set $S_2 = \mathcal{H}(S_1)$ was calculated as

$$s_2(i) = \frac{s_1(i) - g_{\min}(i)}{g_{\max}(i) - g_{\min}(i)},$$

where $\mathcal{H}$ stands for the HS operation.

In the third step, cropping was performed to remove the checkup bed at the bottom area and eliminate the texts in the margin regions. The cropped dataset is yielded as $S_3 = \mathcal{C}(S_2; c_{\mathrm{top}}, c_{\mathrm{bottom}}, c_{\mathrm{left}}, c_{\mathrm{right}})$, where $\mathcal{C}$ represents the crop operation and the parameters $(c_{\mathrm{top}}, c_{\mathrm{bottom}}, c_{\mathrm{left}}, c_{\mathrm{right}})$ mean the numbers of pixels to be cropped from the four directions of top, bottom, left, and right, respectively (unit: pixel).

In the fourth step, each image was downsampled to a smaller fixed size, obtaining the resized image set $S_4 = \mathcal{D}(S_3)$, where $\mathcal{D}$ represents the downsampling (DS) procedure and $s_4(i)$ stands for the downsampled image derived from the raw image $s_0(i)$.
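Putting the four steps together, the pipeline amounts to a few array operations. The following is a minimal Python/OpenCV sketch (the study itself was implemented in Matlab, per Section 4); the crop margins and output size are illustrative placeholders rather than the exact values used in this study.

```python
import cv2
import numpy as np

def preprocess(img_bgr, crop=(60, 60, 60, 60), out_size=(256, 256)):
    """Four-step pipeline: grayscale -> histogram stretching ->
    crop -> downsample. The crop margins and out_size are
    illustrative placeholders, not the study's exact values."""
    # Step 1: retain the luminance channel only (grayscale operation G).
    g = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # Step 2: histogram stretching H to the full [0, 1] range.
    g_min, g_max = g.min(), g.max()
    g = (g - g_min) / (g_max - g_min + 1e-8)

    # Step 3: crop operation C, removing the checkup bed (bottom)
    # and marginal text; margins are (top, bottom, left, right).
    t, b, l, r = crop
    g = g[t:g.shape[0] - b, l:g.shape[1] - r]

    # Step 4: downsampling D to the network input size.
    return cv2.resize(g, out_size, interpolation=cv2.INTER_AREA)
```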

Figure 2 displays example images of the four categories, in which three are diseased and one is healthy. Each raw image in $S_0$ is reduced by the above pipeline to the much smaller preprocessed size of the images in $S_4$.

3.2. Improvement I: n-conv Stochastic Pooling Module

First, stochastic pooling (SP) [17] is introduced. In standard convolutional neural networks, pooling is an essential component after each convolution layer, applied to reduce the size of the feature maps (FMs). SP was shown to give better performance than average pooling and max pooling in recent publications [18–21]. Recently, strided convolution (SC) has become commonly used, which can also shrink the FMs [22, 23]. Nevertheless, SC can be considered a simple pooling method, which always outputs the value at a fixed position of each region [24].

Suppose we have a postconvolution FM $F$, which can be separated into nonoverlapping blocks, each of size $k \times k$. We now focus on one block $B$ at the $u$-th block row and $v$-th block column, whose entries are denoted $b(i, j)$.

The strided convolution (SC) traverses the input activation map with a stride equal to the block size $k$, so for each block it outputs the value at a fixed position, e.g., $\mathrm{SC}(B) = b(1, 1)$.

The l2-norm pooling (L2P), average pooling (AP), and max pooling (MP) output the l2-norm value, average value, and maximum value within the block $B$, respectively.

SP provides a solution to the shortcomings of AP and MP. AP outputs the average, so it downscales the largest value, where the important features may sit. On the other hand, MP reserves the maximum value but worsens the overfitting problem. SP is a three-step procedure. First, it generates the probability map (PM) for each entry in the block $B$:

$$p(i, j) = \frac{b(i, j)}{\sum_{i', j'} b(i', j')},$$

where $p(i, j)$ stands for the PM value at pixel $(i, j)$.

In the second step, a random variable $\xi$ is created that takes the discrete probability distribution (DPD) as $P[\xi = (i, j)] = p(i, j)$, where $p(i, j)$ represents the probability of position $(i, j)$.

In the third step, a sample is drawn from the random variable $\xi$, giving the corresponding position $(i^\ast, j^\ast)$. Then, the output of SP is the activation at that location, namely, $\mathrm{SP}(B) = b(i^\ast, j^\ast)$.

Figure 3 presents the comparison of the four pooling techniques. The top left shows the raw FM, over which pooling takes place one kernel-sized block at a time. Taking the top-right block (in the orange rectangle) as an example, L2P outputs 4.96, while AP and MP output 4.08 and 9.3, respectively. The SP method first generates the PM and then samples a position based on the PM (see the red fonts); thus, SP outputs the value of 6.
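The four pooling rules and SP's sampling step are easy to reproduce on a single block. Below is a NumPy sketch; the fixed top-left position for SC and the root-mean-square convention for L2P are assumptions, chosen to be consistent with the relative ordering of the values in Figure 3 (AP ≤ L2P ≤ MP).

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_block(block, mode="sp"):
    """Pool one k-by-k block of a post-ReLU (non-negative) FM.
    'sc' mimics strided convolution (fixed top-left position),
    'l2p'/'ap'/'mp' are l2-norm/average/max pooling, and 'sp'
    samples a position from the block's probability map."""
    flat = np.asarray(block, dtype=float).ravel()
    if mode == "sc":
        return flat[0]                      # fixed-position output
    if mode == "l2p":
        return np.sqrt(np.mean(flat ** 2))  # root-mean-square convention
    if mode == "ap":
        return flat.mean()
    if mode == "mp":
        return flat.max()
    total = flat.sum()
    if total == 0.0:                        # all-zero block: output 0
        return 0.0
    pm = flat / total                       # probability map p(i, j)
    return flat[rng.choice(flat.size, p=pm)]  # draw a position from the DPD
```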

A new “n-conv stochastic pooling module” (NCSPM) is proposed in this study based on the SP layer discussed in the previous paragraphs. The NCSPM entails $n$ repetitions of a convolution layer and a batch normalization layer, followed by an SP layer. Figure 4 shows the schematic of the proposed NCSPM module. In this study, $n$ is at most 2, since we experimented with larger $n$, but the performance did not improve.
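A PyTorch sketch of the NCSPM follows. The stochastic pooling layer samples during training and outputs the probability-weighted average at test time, following [17]; the 3×3 kernels and the ReLU after each batch normalization are standard assumptions not spelled out above (the ReLU also keeps activations nonnegative, which the SP probability map requires).

```python
import torch
import torch.nn as nn

class StochasticPool2d(nn.Module):
    """2x2 stochastic pooling [17]: at training time, sample one
    activation per window from its probability map; at test time,
    output the probability-weighted average."""
    def __init__(self, k=2):
        super().__init__()
        self.k = k

    def forward(self, x):
        k = self.k
        # Non-overlapping k*k windows -> (N, C, H/k, W/k, k*k).
        p = x.unfold(2, k, k).unfold(3, k, k).flatten(-2)
        s = p.sum(dim=-1, keepdim=True)
        # Probability map; all-zero windows fall back to uniform.
        pm = torch.where(s > 0, p / s.clamp_min(1e-8),
                         torch.full_like(p, 1.0 / p.shape[-1]))
        if self.training:
            n, c, h, w, kk = p.shape
            idx = torch.multinomial(pm.reshape(-1, kk), 1)
            return p.reshape(-1, kk).gather(1, idx).reshape(n, c, h, w)
        return (p * pm).sum(dim=-1)

class NCSPM(nn.Module):
    """n repetitions of (conv + batch norm + ReLU), then stochastic
    pooling. Kernel size and activation are assumptions (see above)."""
    def __init__(self, in_ch, out_ch, n=2):
        super().__init__()
        layers = []
        for i in range(n):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                 3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)
        self.pool = StochasticPool2d()

    def forward(self, x):
        return self.pool(self.body(x))
```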

3.3. Improvement II: Stochastic Pooling Neural Network

The second improvement of this study is to propose a stochastic pooling neural network (SPNN), whose structure is inspired by VGG-16 [25]. VGG-16 uses small 3×3 convolution kernels instead of large ones and always uses 2×2 pooling with a stride of 2. At the end, VGG-16 stacks fully connected layers (FCLs) for classification.

The proposed SPNN follows the same structural design as VGG-16 but uses the NCSPM module to replace the convolution blocks of VGG-16. The details of SPNN are shown in Table 3, where NWL means the number of weighted layers, HS is the hyperparameter size, and FMS is the feature map size.

Compared to an ordinary CNN, the advantages of SPNN are threefold: (i) SPNN helps prevent overfitting; (ii) the SP layer is parameter-free; (iii) SPNN can be easily combined with other advanced techniques, such as batch normalization and dropout. In total, we created this 11-layer SPNN; we attempted to insert more NCSPMs or more FCLs, which showed no performance improvement but added computational burden. The structure of the proposed model is summarized in Table 3. The hyperparameter entry of an NCSPM row gives the number of repetitions together with the size and number of its filters; for an FCL row, it gives the sizes of the weight matrix and the bias vector. In the last column of Table 3, the FMS format represents the feature map’s size in three dimensions: height, width, and channel. Directly using transfer learning is another alternative.

In this study, we chose to create a custom neural network, designing its structure ourselves and training the whole network on our own data. The reason is that some reports have shown that building a network from scratch can achieve better performance than transfer learning [26, 27].
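For concreteness, the following skeleton stacks the NCSPM sketch above into an 11-weighted-layer SPNN: 9 convolution layers across five NCSPMs (matching the count given in Section 3.4) plus two FCLs. The repetition pattern, channel widths, hidden width, and 256×256 input size are VGG-style assumptions, not the exact Table 3 configuration.

```python
import torch
import torch.nn as nn

class SPNN(nn.Module):
    """Skeleton of the 11-weighted-layer SPNN: five NCSPMs holding
    9 conv layers in total, followed by two FCLs. Widths and input
    size are illustrative assumptions, not Table 3's exact values."""
    def __init__(self, num_classes=4, in_size=256):
        super().__init__()
        cfg = [(1, 1, 64), (2, 64, 128), (2, 128, 256),
               (2, 256, 512), (2, 512, 512)]       # (n, in_ch, out_ch)
        self.features = nn.Sequential(*[NCSPM(i, o, n) for n, i, o in cfg])
        side = in_size // 2 ** len(cfg)            # five 2x2 poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * side * side, 128),     # FCL-1
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes))           # FCL-2

    def forward(self, x):
        # Example: SPNN()(torch.randn(2, 1, 256, 256)) -> (2, 4) logits
        return self.classifier(self.features(x))
```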

3.4. Improvement III: PatchShuffle SPNN

Kang et al. [28] presented a new PatchShuffle method. In each minibatch, images and feature maps undergo a transformation such that the pixels within each patch are shuffled. By generating fake images/feature maps with interior order-less patches, PatchShuffle creates local variations and reduces the possibility of the AI model overfitting. Therefore, PatchShuffle is a beneficial supplement to various existing training regularization methods [28].

Assume there is a matrix $X$ (an image or a feature map). A random variable $\varepsilon$ controls whether the matrix is to be PatchShuffled or not. The random variable obeys the Bernoulli distribution $\varepsilon \sim \mathrm{Bernoulli}(\rho)$.

Namely, $\varepsilon = 1$ with probability $\rho$, and $\varepsilon = 0$ with probability $1 - \rho$. The resulting matrix is written as

$$\tilde{X} = \begin{cases} X, & \varepsilon = 0, \\ \mathcal{T}(X), & \varepsilon = 1, \end{cases}$$

where $\mathcal{T}$ is the PatchShuffle operation. Suppose the size of each patch is $m \times m$; we can express the matrix as $X = [X_{uv}]$, where $X_{uv}$ stands for the nonoverlapping patch at the $u$-th patch row and $v$-th patch column. The PatchShuffle transformation works on every patch.

The shuffled patch is formulated as $\tilde{X}_{uv} = P_r X_{uv} P_c$, where $P_r$ is a row permutation matrix and $P_c$ is a column permutation matrix [29]. In practice, a random shuffle operation is used to replace the row and column permutation operations, so each patch undergoes one of its possible pixel permutations.
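In array form, the per-patch shuffle reduces to the following NumPy sketch; the patch size and shuffle probability defaults are illustrative, not the tuned values discussed in Section 4.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_shuffle(x, m=2, rho=0.05):
    """PatchShuffle a 2-D image or feature map: with Bernoulli
    probability rho, randomly permute the pixels inside every
    non-overlapping m-by-m patch. The defaults for m and rho are
    illustrative, not the study's tuned values."""
    if rng.random() >= rho:                  # epsilon = 0: keep x as is
        return x
    h, w = x.shape
    out = x.copy()
    for u in range(0, h - h % m, m):
        for v in range(0, w - w % m, m):
            block = out[u:u + m, v:v + m].ravel()  # copy of the patch
            rng.shuffle(block)                     # one random permutation
            out[u:u + m, v:v + m] = block.reshape(m, m)
    return out
```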

We propose integrating PatchShuffle into our SPNN; this new network model is named the PatchShuffle stochastic pooling neural network (PSSPNN). The PatchShuffle operation acts on both the input image layer (see the grayscale image in Figure 5 and the color image in Figure 6 under different hyperparameter settings) and the feature maps of all the convolutional layers (the 9 conv layers from NCSPM-1 to NCSPM-5).

The schematics of building PSSPNN from SPNN are drawn in Figure 7, where either inputs or feature maps are randomly selected to undergo the PatchShuffle operation. To reach the best bias-variance trade-off, only a small percentage $\rho$ of the images/feature maps undergo the operation.

For simplicity, we consider PatchShuffling the input images as an example. The training loss function of the proposed PSSPNN is written as

$$\ell_{\mathrm{PS}}(x, \tilde{x}, y; w) = (1 - \varepsilon)\, \ell(x, y; w) + \varepsilon\, \ell(\tilde{x}, y; w),$$

where $\ell$ stands for the ordinary loss function and $\ell_{\mathrm{PS}}$ for the loss function of PSSPNN; $x$ represents the original images and $\tilde{x}$ the PatchShuffled images. The label is symbolized as $y$, and the weights are symbolized as $w$.

Considering the extreme situations of $\rho$, we have

$$\ell_{\mathrm{PS}} = \begin{cases} \ell(x, y; w), & \rho = 0, \\ \ell(\tilde{x}, y; w), & \rho = 1, \end{cases}$$

which means the loss function degrades to the ordinary loss function when $\rho = 0$ and equals training with all images PatchShuffled when $\rho = 1$. Taking the mathematical expectation over $\varepsilon$, the loss turns to

$$\mathbb{E}_\varepsilon[\ell_{\mathrm{PS}}] = (1 - \rho)\, \ell(x, y; w) + \rho\, \ell(\tilde{x}, y; w),$$

where the second term $\rho\, \ell(\tilde{x}, y; w)$ acts as a regularization term.
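In training code, the per-minibatch Bernoulli draw realizes this expectation; the sketch below instead evaluates the expected loss directly, assuming a cross-entropy base loss. The value of rho is illustrative.

```python
import torch.nn.functional as F

def psspnn_loss(model, x, x_shuffled, y, rho=0.05):
    """Expected PatchShuffle-regularized loss: a (1 - rho) / rho
    mixture of the ordinary cross-entropy on clean inputs and the
    cross-entropy on PatchShuffled inputs (the regularization term).
    rho = 0.05 is an illustrative value, not the study's setting."""
    loss_clean = F.cross_entropy(model(x), y)
    loss_shuffled = F.cross_entropy(model(x_shuffled), y)  # regularizer
    return (1.0 - rho) * loss_clean + rho * loss_shuffled
```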

3.5. Improvement IV: Improved Multiple-Way Data Augmentation

This small four-category dataset makes our AI model prone to overfitting. To alleviate overfitting and handle the low sample-size problem, the multiple-way data augmentation (MDA) [30] method was chosen and further improved. In the original 14-way MDA [30], the authors applied seven different data augmentation (DA) techniques to the raw image and its horizontally mirrored image. The seven DA techniques are as follows: noise injection, horizontal shear, vertical shear, rotation, Gamma correction, scaling, and translation.

Figure 8 shows the 16-way data augmentation method. The difference between the proposed 16-way DA and the traditional 14-way DA is that we add salt-and-pepper noise (SAPN). Although SAPN defies intuition, as it never occurs in realistic CCT images, we found that it can increase performance. The same observation was reported by Li et al. [31], where the authors used salt-and-pepper noise for the identification of early esophageal cancer. Table 4 shows the pseudocode of this proposed 16-way improved DA, in which each DA operation generates $n_1$ new images. We tested greater values of $n_1$, but they did not bring about significant improvement. Hence, one image will generate $16 n_1 + 2$ images (including the original image and its mirror), as shown in Figure 8.

Step 1. Eight geometric or photometric DA transforms are utilized on the raw (preprocessed) image $s(i)$, as shown in Figure 8. We use $O_g, g = 1, \dots, 8$, to denote the DA operations. Note that each DA operation yields $n_1$ new images. So, for a given image $s(i)$, we produce an enhanced dataset $\operatorname{concat}\{O_1[s(i)], O_2[s(i)], \dots, O_8[s(i)]\}$, where $\operatorname{concat}$ stands for the concatenation function.

Step 2. The horizontal mirror image is generated as $s^{m}(i) = \mathcal{M}[s(i)]$, where $\mathcal{M}$ means the horizontal mirror function.

Step 3. The raw image $s(i)$, the mirrored image $s^{m}(i)$, all the above 8-way DA results of the raw image, and the 8-way DA results of the horizontally mirrored image are combined. Mathematically, one training image $s(i)$ generates a dataset containing $16 n_1 + 2$ images.

Taking Figure 2(a) as an example raw image, Figure 9 shows the 8-way DA results, i.e., $O_1[s(i)], \dots, O_8[s(i)]$. Due to the page limit, the mirrored image and its corresponding 8-way DA results are not shown here.
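A Python/OpenCV sketch of the 16-way DA is given below. All transform parameter ranges and the default $n_1$ are illustrative assumptions, since Table 4 fixes the exact settings.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def _warp(img, m):
    h, w = img.shape
    return cv2.warpAffine(img, np.float32(m), (w, h), borderValue=0.0)

def _salt_pepper(img, amount):
    out, mask = img.copy(), rng.random(img.shape)
    out[mask < amount / 2] = 0.0                    # pepper
    out[mask > 1.0 - amount / 2] = 1.0              # salt
    return out

def _center_rot(img, angle, scale):
    h, w = img.shape
    return _warp(img, cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale))

# The eight DA operations O_1..O_8 on images in [0, 1]; all parameter
# ranges are illustrative assumptions, not the study's exact settings.
OPS = [
    lambda x: np.clip(x + rng.normal(0, 0.01, x.shape), 0, 1),        # noise injection
    lambda x: _salt_pepper(x, rng.uniform(0.005, 0.02)),              # SAPN (the new 16th/8th way)
    lambda x: _warp(x, [[1, rng.uniform(-.15, .15), 0], [0, 1, 0]]),  # horizontal shear
    lambda x: _warp(x, [[1, 0, 0], [rng.uniform(-.15, .15), 1, 0]]),  # vertical shear
    lambda x: _center_rot(x, rng.uniform(-30, 30), 1.0),              # rotation
    lambda x: np.clip(x, 0, 1) ** rng.uniform(0.7, 1.5),              # Gamma correction
    lambda x: _center_rot(x, 0.0, rng.uniform(0.8, 1.2)),             # scaling
    lambda x: _warp(x, [[1, 0, rng.uniform(-20, 20)],
                        [0, 1, rng.uniform(-20, 20)]]),               # translation
]

def sixteen_way_da(img, n1=30):
    """One image -> 16 * n1 + 2 images: the raw image, its horizontal
    mirror, and n1 variants per DA operation applied to both."""
    mirror = img[:, ::-1].copy()
    out = [img, mirror]
    for base in (img, mirror):
        for op in OPS:
            out.extend(op(base) for _ in range(n1))
    return out
```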

3.6. Implementation, Measure, and Explainability

Table 5 itemizes the nontest and test sets for each category. The whole dataset contains the four nonoverlapping categories $S(c), c = 1, \dots, 4$. For each category, the dataset is split into a nontest set $N(c)$ and a test set $T(c)$.

Our experiment entails two phases. In phase I, 10-fold cross-validation was used on the nontest set $N$ to select the best hyperparameters and the best network structure; the 16-way DA was applied to the training folds of the cross-validation, and the hyperparameters of the proposed PSSPNN were determined over $N$. In phase II, we trained our model on the nontest set $N$ 10 times with different initial seeds and obtained the test results over the test set $T$. After combining the 10 runs, we obtain a summed test confusion matrix (TCM) $M$. Table 5 shows the dataset splitting, where $|\cdot|$ stands for the number of elements in a dataset.

The ideal TCM is a diagonal matrix, in which all the off-diagonal elements are zero, meaning no prediction errors. In realistic scenarios, the AI model will make errors, and the performance is calculated per category. For each class $c$, we set the label of that class as positive and the labels of all the other classes as negative, giving per-class counts of true positives $\mathrm{TP}_c$, false positives $\mathrm{FP}_c$, and false negatives $\mathrm{FN}_c$. Three performance metrics (sensitivity, precision, and F1 score) per category are defined as

$$S_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c}, \qquad P_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c}, \qquad F1_c = \frac{2 P_c S_c}{P_c + S_c}.$$

The performance can also be measured over all four categories. The microaveraged (MA) F1, symbolized as $F1^{\mathrm{MA}}$, is used since our dataset is slightly unbalanced:

$$F1^{\mathrm{MA}} = \frac{2 P^{\mathrm{MA}} S^{\mathrm{MA}}}{P^{\mathrm{MA}} + S^{\mathrm{MA}}},$$

where

$$P^{\mathrm{MA}} = \frac{\sum_{c=1}^{4} \mathrm{TP}_c}{\sum_{c=1}^{4} (\mathrm{TP}_c + \mathrm{FP}_c)}, \qquad S^{\mathrm{MA}} = \frac{\sum_{c=1}^{4} \mathrm{TP}_c}{\sum_{c=1}^{4} (\mathrm{TP}_c + \mathrm{FN}_c)}.$$
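All of these metrics follow directly from the summed TCM, as in this NumPy sketch:

```python
import numpy as np

def confusion_metrics(tcm):
    """tcm[i, j] counts test samples of true class i predicted as
    class j, summed over the 10 runs. Returns per-class sensitivity,
    precision, F1, and the microaveraged F1."""
    tcm = np.asarray(tcm, dtype=float)
    tp = np.diag(tcm)
    fn = tcm.sum(axis=1) - tp          # misses per true class
    fp = tcm.sum(axis=0) - tp          # false alarms per predicted class
    sen = tp / (tp + fn)
    prc = tp / (tp + fp)
    f1 = 2 * sen * prc / (sen + prc)
    s_ma = tp.sum() / (tp + fn).sum()  # microaveraged sensitivity
    p_ma = tp.sum() / (tp + fp).sum()  # microaveraged precision
    return sen, prc, f1, 2 * p_ma * s_ma / (p_ma + s_ma)
```

Note that for a single-label multiclass TCM, every misclassification is simultaneously one false positive and one false negative, so the microaveraged precision and sensitivity coincide, and $F1^{\mathrm{MA}}$ equals the overall accuracy over the summed runs.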

Finally, gradient-weighted class activation mapping (Grad-CAM) [32] was employed to explain how our model makes its decisions. It exploits the gradient of the categorization score with respect to the convolutional features of the deep model to visualize the regions of the image that are the most vital for the classification task [33]. The output of NCSPM-5 in Table 3 was used for Grad-CAM.
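A minimal hook-based Grad-CAM sketch in PyTorch follows; `feature_layer` would be the NCSPM-5 module, and the model should be in evaluation mode so that stochastic pooling uses its deterministic test-time averaging. This is one common way to realize [32], not the study's exact code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class, feature_layer):
    """Grad-CAM [32]: weight the feature maps of feature_layer by the
    spatially averaged gradients of the target class score, then ReLU
    and upsample to the input size. x is a (1, C, H, W) tensor."""
    feats, grads = [], []
    h1 = feature_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feature_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    try:
        model.zero_grad()
        model(x)[0, target_class].backward()   # d(score) / d(feature maps)
    finally:
        h1.remove()
        h2.remove()
    fmap, grad = feats[0], grads[0]            # each (1, K, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)       # GAP of gradients
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max().clamp_min(1e-8)).squeeze()  # heatmap in [0, 1]
```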

4. Experiments, Results, and Discussions

The experiments were carried out on the Matlab 2020b programming platform. The programs ran on Windows 10 with 16 GB RAM and a 2.20 GHz Intel Core i7-8750H CPU. The performances are reported over the test set with 10 runs.

4.1. Comparison of SPNN and Other Pooling Methods

In the first experiment, we compared the proposed SPNN with four otherwise-identical CNNs using different pooling methods. The first comparison CNN uses strided convolution in the five modules to replace stochastic pooling; the second to fourth use L2P, AP, and MP, respectively. These four baselines are called SC-CNN, L2P-CNN, AP-CNN, and MP-CNN. The results of 10 runs over the test set are shown in Table 6, and the bar plot is displayed in Figure 10, where, for each category, the three bars stand for its sensitivity, precision, and F1 score.

The results in Table 6 and Figure 10 are coherent with our expectation that SPNN obtains the best results among all FM reduction approaches. With SPNN, the sensitivities of the four categories are 98.07%, 91.79%, 94.92%, and 95.25%; the precisions are 94.59%, 95.19%, 94.92%, and 95.40%; and the F1 scores are 96.30%, 93.45%, 94.92%, and 95.32%, respectively. The overall microaveraged F1 is 95.02%.

In terms of microaveraged F1, the second-best algorithm is MP-CNN, with a microaveraged F1 score of 93.91%. The third best is AP-CNN, with 93.18%. The two worst-performing algorithms are SC-CNN and L2P-CNN, with microaveraged F1 scores of 92.40% and 92.53%, respectively.

SPNN obtains the best results because SP can prevent overfitting [17], which is the main shortcoming of max pooling. On the other hand, AP and L2P average out the maximum activation values, which impairs the performance of convolutional neural network models. SC-CNN uses only one quarter of the information of the input FM and therefore may neglect the greatest values [34]. In all, the proposed SPNN can be regarded as an improved version of vanilla CNN models, in which SP replaces traditional MP.

4.2. PSSPNN versus SPNN

In this second experiment, we compared our two proposed network models, PSSPNN against SPNN, to validate the effectiveness of PatchShuffle. The results of 10 runs over the test set with different combinations of hyperparameters are shown in Table 7, and the 3D bar chart is shown in Figure 11. The optimal hyperparameters (the shuffle probability $\rho$ and the patch size) found via 10-fold cross-validation on the nontest set are coherent with reference [28].

In addition, the PSSPNN with the optimal hyperparameters is compared with SPNN. The results are shown in Table 8, from which we can observe that PSSPNN provides better F1 values for all four categories and for the overall microaverage, demonstrating the effectiveness of PatchShuffle. The reason is that PatchShuffle adds a regularization term [28] to the loss function and thus improves the generalization ability of our SPNN model.

4.3. Comparison to State-of-the-Art Approaches

We compared our proposed PSSPNN method with 9 state-of-the-art methods: RCBO [7], 6L-CNN [8], RN-50 [9], RN-18 [10], CSSNet [11], COVNet [12], DeCovNet [13], 7L-CNN-CD [14], and FCONet [15]. All comparisons were carried out on the same test set with 10 runs. The comparison results are shown in Table 9.

For ease of comparison, Figure 12 compares only the microaveraged F1 (MA F1) score of all algorithms, from which we can observe that the proposed PSSPNN achieves the best performance among all the algorithms. This experiment is a simulation-based comparison; in the future, we will subject our algorithm to rigorous clinical testing and verification.

4.4. Explainability of Proposed PSSPNN

Taking the Figure 2 images as examples, the heatmaps of those four images are shown in Figures 13(a)–13(d), and the manual delineations, shown in Figures 13(e)–13(h), mark the lesions of the three disease samples. Note there are no lesions in the healthy control (HC) image. The NCSPM-5 feature map in PSSPNN was used to generate the heatmaps with the Grad-CAM approach.

We can observe from Figure 13 that the heatmaps produced by our PSSPNN model and Grad-CAM capture the lesions effectively while neglecting the nonlesion regions. Traditionally, AI is regarded as a “black box,” which impairs its widespread usage; e.g., the black-box properties of traditional AI are problematic for the FDA. Nevertheless, with the help of the explainability of modern AI techniques [35], radiologists and patients will gain confidence in our proposed AI model, as the heatmap provides a clear and understandable interpretation of how the AI distinguishes COVID-19 and other infectious chest diseases from healthy subjects, as also stated in reference [36]. Many new AI-based anatomic pathology systems, such as those based on whole slide images (WSI) [37], now pass through FDA approval because the doctors know the relationship between the diagnosis and the explained answer.

In the future, the explainability of our proposed AI model can be used in patient monitoring [38] and health big data [39]. Some novel network improvements and signal processing techniques may also help our AI model in future research, such as filtering [40, 41], fuzzy methods [42, 43], edge computing [44], knowledge aid [45, 46], autofocus [47], graph integration, and cross-domain knowledge exploitation [48–50].

5. Conclusion

In this paper, we proposed PSSPNN, which entails five improvements: (i) the proposed NCSPM module, (ii) the usage of stochastic pooling, (iii) the usage of PatchShuffle, (iv) improved multiple-way data augmentation, and (v) explainability via Grad-CAM. These five improvements enable our AI model to deliver improved performance compared with 9 state-of-the-art approaches. Ten runs on the test set showed that our algorithm achieved a microaveraged F1 score of 95.79%.

There are three shortcomings of our method, which will be resolved in the future: (i) the dataset currently contains three chest infectious diseases; we shall try to include more classes of chest disease, such as thoracic cancer. (ii) Some new network techniques and models, such as transfer learning, wide network module design, attention mechanisms, and graph neural networks, have not been tested; these advanced AI technologies will be studied. (iii) Our model has not gone through strict clinical validation, so we will attempt to release our software to hospitals and obtain feedback from radiologists and consultants.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Shui-Hua Wang and Yin Zhang contributed equally to this paper.

Acknowledgments

We appreciate Qinghua Zhou for helping us revise our English. This paper is partially supported by the Royal Society International Exchanges Cost Share Award, UK (RP202G0230); Medical Research Council Confidence in Concept Award, UK (MC_PC_17171); Hope Foundation for Cancer Research, UK (RM60G0680); British Heart Foundation Accelerator Award, UK; Open fund for Jiangsu Key Laboratory of Advanced Manufacturing Technology (HGAMTL-1703); Guangxi Key Laboratory of Trusted Software (kx201901).