Introduction

Chest radiography is a widely used imaging modality in routine medical practice for the evaluation, management, and follow-up of patients with a variety of diseases, including COVID-19 pneumonia1. However, chest radiography is a complex imaging modality to interpret2, and its evaluation requires experience and expertise3. Deep learning (DL) algorithms have the potential to improve the quality of radiographic interpretation and lead to more accurate diagnoses3.

Several DL algorithms for chest radiograph (CXR) classification have been published in recent years, especially in the context of the COVID-19 pandemic4,5,6. However, these algorithms often show poor generalization performance when the data distribution shifts7,8,9,10,11,12.

This generalization deficiency is often overlooked during algorithm evaluation, because algorithms are usually assessed on test subsets drawn from the same population sample as the training and validation subsets. As a result, only internal validation performance is typically estimated, and generalizability is rarely evaluated properly. For this reason, the performance of DL algorithms should also be assessed on test subsets from a source different from the one that provided the training and validation subsets7,8,9,10,11,12.

In medical data, the gap between the source of the training, validation, and (usually) test subsets and the real-world deployment environment can cause particularly large distribution shifts, which may substantially reduce the generalization performance of DL networks9,10,12. This issue remains a major challenge in machine learning research13.

Nevertheless, the generalization deficiency of DL networks for image classification has been discussed by only a few authors in the medical field. While its causes remain unclear, most of these authors have studied generalization only by testing different DL networks on external datasets from other institutions8,10,12,14. Unlike previous works, this research separately assesses the influence of the institution and the X-ray device on both the internal validation and the generalization performance of an algorithm.

For this purpose, factors that may potentially affect the performance of DL networks were divided into two categories:

  • X-ray device-related factors: aspects that affect image pixel values, including the acquisition protocol, the type of response function of the radiology device, and the image processing applied by the X-ray device.

  • Institution-related factors: differences among hospitals that do not change pixel values, such as labeling criteria, population demographics, disease epidemiology, and radiology workflow.

Thus, this work aims to study the role of the aforementioned factors; it is not intended to provide a new DL algorithm for CXR classification, as the literature already offers plenty of algorithms with satisfactory performance, particularly in internal validation. Instead, this study evaluates the influence of X-ray device-related and institution-related factors on DL algorithm performance in both internal validation and generalization.

To achieve this purpose, one DL network was trained and tested using different subsets in which the variables institution and X-ray device (including the device’s image processing and response function) were completely controlled. This distinguishes the present work from previous publications: to the best of our knowledge, it is the first time these variables have been fully controlled.

Materials and methods

Three experiments were carried out to study the influence of institution-related and X-ray device-related factors on the internal validation and generalization performance of a DL network for CXR classification (Fig. 1).

Figure 1

Experimental design.

Ethical approval

This research involved patients from two different medical institutions: Hospital Universitario Marqués de Valdecilla, located in Santander, Cantabria, Spain (referred to as institution 1 in the text), and Hospital de Sierrallana in Torrelavega, Cantabria, Spain (referred to as institution 2). The ethics committee of both institutions, the Comité de Ética de la Investigación con Medicamentos y Productos Sanitarios de Cantabria, approved this research. Because the study involved no direct interaction with patients or use of tissue samples, and used only retrospective, anonymized images, informed consent was not required15. All methods reported in this work were carried out in accordance with the pertinent guidelines and regulations.

Dataset and subsets

The images used in this research were all frontal-view CXRs manually labeled into two classes (COVID-19 and Control) by three expert radiologists, each with more than 5 years of experience, according to the inclusion criteria summarized in Table 1. In the text, these classes are referred to as target classes.

Table 1 Inclusion criteria for target classes.

The main image dataset was created by simple random sampling from four image databases, as described in Supplementary Appendix A1. This main dataset contained images acquired by three different X-ray devices in two institutions: 394 images acquired by a Fujifilm FDR smart FGX device in institution 1; 244 images acquired by the same device model (Fujifilm FDR smart FGX) in institution 2; 192 images acquired by a General Electric (GE) Revolution XRD device in institution 2; and 94 images acquired by a Carestream DRX Evolution Plus device in institution 2. Note that the Fujifilm and Carestream devices have the same type of response function (logarithmic), whereas the GE device has a different type (linear)16,17.

Finally, the eight subsets described in Table 2 were created by random sampling without replacement from the main dataset. Sampling was stratified to ensure an equal distribution of COVID-19 and Control images within each subset (50% of each class).

Table 2 Image subsets.
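As an illustration, this kind of class-stratified sampling without replacement can be sketched with pandas; the file and column names below are hypothetical, not taken from the study.

```python
import pandas as pd

# Hypothetical index of the main dataset: one row per image with its
# target class ("COVID-19" or "Control"), X-ray device, and institution.
pool = pd.read_csv("main_dataset_index.csv")

def draw_subset(pool: pd.DataFrame, n_per_class: int, seed: int):
    """Sample a class-balanced subset (50% of each class) and remove
    the sampled rows from the pool so images are never reused."""
    subset = (
        pool.groupby("target_class", group_keys=False)
        .apply(lambda g: g.sample(n=n_per_class, random_state=seed))
    )
    return subset, pool.drop(subset.index)

# e.g. one 300-image training subset: 150 COVID-19 + 150 Control images.
train_subset, pool = draw_subset(pool, n_per_class=150, seed=42)
```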

Image preprocessing

Images were collected as 16-bit unsigned integer monochrome pixels stored in DICOM format. After data collection, images were resized to 512 × 512 pixels using cubic spline interpolation, and pixel values were rescaled to [0, 1].
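A minimal sketch of this preprocessing step, assuming pydicom, NumPy, and SciPy are available; per-image min-max rescaling is our assumption of how pixel values were mapped to [0, 1]:

```python
import numpy as np
import pydicom
from scipy.ndimage import zoom

def preprocess_cxr(dicom_path: str) -> np.ndarray:
    """Load a 16-bit monochrome DICOM CXR, resize it to 512 x 512 with
    cubic spline interpolation, and rescale pixel values to [0, 1]."""
    ds = pydicom.dcmread(dicom_path)
    img = ds.pixel_array.astype(np.float32)

    # Cubic spline interpolation (order=3) to a fixed 512 x 512 grid.
    img = zoom(img, (512 / img.shape[0], 512 / img.shape[1]), order=3)

    # Per-image min-max rescaling to [0, 1] (assumption).
    img = (img - img.min()) / (img.max() - img.min())
    return img
```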

Experiments to test the influence of institutional and device-related factors

Experiment 1: evaluation of internal validation performance

The first experiment analyzed the influence of institutional and device-related factors on the internal validation performance of a DL algorithm (Fig. 1). For this purpose, the same DL network (a VGG16) was trained three times for the classification of CXR images, each time with the same architecture, hyperparameters (details in Supplementary Appendix A2), and number of training images (300), but with different training subsets (Table 3).

Table 3 Trainings and models.
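For orientation, a minimal transfer-learning sketch of such a setup in Keras; the classifier head, optimizer, learning rate, and channel handling are our assumptions, since the study’s actual hyperparameters are given in Supplementary Appendix A2:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg16_classifier(input_shape=(512, 512, 3)) -> tf.keras.Model:
    """VGG16-based binary classifier (COVID-19 vs Control)."""
    # ImageNet-pretrained backbone; monochrome CXRs would be replicated
    # to three channels to match it (assumption).
    base = tf.keras.applications.VGG16(
        include_top=False, weights="imagenet", input_shape=input_shape
    )
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),  # hypothetical settings
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

# The same architecture and hyperparameters are reused for all three
# trainings; only the 300-image training subset changes.
model_f1a_f1b = build_vgg16_classifier()
```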

The first training was performed with subsets Fuji_Inst1_TRAIN_A and Fuji_Inst1_TRAIN_B, so it included 300 images acquired by a Fujifilm FDR smart FGX device in institution 1. The resulting model was named Model-F1A_F1B after the subsets used (F1A = Fuji_Inst1_TRAIN_A, etc.).

The second training included subsets Fuji_Inst1_TRAIN_A and Fuji_Inst2_TRAIN, so all of its images were acquired by a Fujifilm FDR smart FGX device, half in institution 1 and half in institution 2. This model was named Model-F1A_F2.

The third and last training was performed with 101 random images from Fuji_Inst1_TRAIN_A, 101 random images from Fuji_Inst2_TRAIN, and the 98 images from GE_Inst2_TRAIN. The resulting model, named Model-F1A’_F2’_GE2, was therefore trained with images acquired by devices from two different manufacturers (a Fujifilm FDR smart FGX and a General Electric (GE) Revolution XRD) in two different institutions (1 and 2).

The internal validation performance of these three models was then tested on the Fuji_Inst1_TEST subset, composed exclusively of images acquired by a Fujifilm FDR smart FGX device in institution 1. This was the only test subset that could provide internal validation results for all three models, since every model was trained on a set that contained CXRs acquired by a Fujifilm FDR smart FGX device in institution 1. Finally, the internal validation performances of the three models were compared to assess the influence of institutional and X-ray device-related factors.

The influence of institutional factors was studied by comparing the performances of Model-F1A_F1B and Model-F1A_F2: both were trained with 300 images from the same X-ray device model, but the images of Model-F1A_F1B came only from institution 1, whereas those of Model-F1A_F2 came from institutions 1 and 2. Any performance difference between these models is therefore probably attributable to institution-related factors.

The influence of X-ray device-related factors was assessed by comparing the performance of Model-F1A’_F2’_GE2 with that of Model-F1A_F1B and Model-F1A_F2. While Model-F1A_F1B and Model-F1A_F2 were trained with 300 images acquired by a Fujifilm FDR smart FGX device, Model-F1A’_F2’_GE2 was trained with 300 images, some acquired by a Fujifilm FDR smart FGX device and the rest by a GE Revolution XRD device. This comparison therefore revealed the effect of adding images from a different manufacturer to the training sample.

The metric used to evaluate performance was the area under the receiver operating characteristic curve (AUC)18. Additionally, gradient-weighted class activation mapping (Grad-CAM) heatmaps were used to identify the image regions most relevant to each prediction19. This analysis evaluated how adding images acquired in a different institution and by a different X-ray device affected the DL network’s ability to learn causal relationships.
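A compact Grad-CAM sketch in Keras, assuming a model in which VGG16’s last convolutional layer (block5_conv3) is reachable by name; the layer naming and class-score indexing are our assumptions:

```python
import numpy as np
import tensorflow as tf

def grad_cam(model: tf.keras.Model, image: np.ndarray,
             conv_layer: str = "block5_conv3") -> np.ndarray:
    """Grad-CAM heatmap for one image of shape (H, W, C)."""
    grad_model = tf.keras.Model(
        inputs=model.input,
        outputs=[model.get_layer(conv_layer).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]  # predicted COVID-19 probability
    grads = tape.gradient(score, conv_out)

    # Channel weights: global average of the gradients (Grad-CAM step 1).
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted sum of feature maps, ReLU-rectified (Grad-CAM step 2).
    cam = tf.nn.relu(tf.einsum("bhwc,bc->bhw", conv_out, weights))[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```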

Experiment 2: evaluation of generalization

The second experiment studied the influence of both institutional and device-related factors on the DL network’s generalization (Fig. 1). Model-F1A_F1B was evaluated on the four test subsets, which included images from two different institutions and three distinct X-ray devices (Table 2). Four AUCs were therefore obtained: the AUC on Fuji_Inst1_TEST (internal validation); the AUC on Fuji_Inst2_TEST (generalization to an external institution with the same X-ray device); the AUC on Care_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing but the same type of response function); and the AUC on GE_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing and a different type of response function).

These AUCs were then compared pairwise, according to the factor that shifted between subsets, to evaluate the influence on generalization of institutional factors (AUC-Fuji_Inst1_TEST vs AUC-Fuji_Inst2_TEST), of the device’s image processing (AUC-Fuji_Inst2_TEST vs AUC-Care_Inst2_TEST), and of the device’s type of response function (AUC-Care_Inst2_TEST vs AUC-GE_Inst2_TEST).
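In code, this evaluation reduces to scoring one trained model on each test subset; `load_subset` is a hypothetical loader returning preprocessed images and binary labels:

```python
from sklearn.metrics import roc_auc_score

subset_names = ["Fuji_Inst1_TEST", "Fuji_Inst2_TEST",
                "Care_Inst2_TEST", "GE_Inst2_TEST"]

aucs = {}
for name in subset_names:
    x, y = load_subset(name)  # hypothetical loader: (images, labels)
    y_score = model_f1a_f1b.predict(x).ravel()  # COVID-19 probabilities
    aucs[name] = roc_auc_score(y, y_score)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```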

Experiment 3: evaluation of the dependence of CNN features on textures

The third experiment investigated the influence of institutional and device-related factors on the values of the features extracted by the pre-trained CNN (Fig. 1). The hypothesis was that radiological image textures depend on the X-ray device used to acquire the image: differences in the devices’ image processing and response functions produce image textures that vary with the X-ray device.

Therefore, the feature values extracted by CNNs could also be highly dependent on the X-ray device that acquired the image. This issue could hinder the generalization of DL networks, making them suitable only for images obtained from the same device models as those used in training.

In this context, Model-F1A_F1B was used to extract features from the test subsets. Features from the last convolutional layer were then clustered with a hierarchical clustering algorithm20 implemented in the Python statistical graphics library seaborn (version 0.11.1)21. This unsupervised approach allowed us to examine which image classes were more evident to the CNN: the target classes (COVID-19 and Control) or hidden classes (such as the X-ray device that acquired the image, or the institution where it was obtained). To ensure that the results were not biased by the metallic tokens present on the CXRs, the experiment was repeated after cropping the images to preserve only their central part.
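A sketch of this step under the VGG16 setup assumed earlier; `test_images` is a hypothetical array of preprocessed test images, and the spatial pooling of the last convolutional layer and the clustering linkage/metric are our assumptions, as the paper does not specify them:

```python
import seaborn as sns
import tensorflow as tf

# Feature extractor: output of the backbone's last convolutional layer,
# averaged over the spatial dimensions to one vector per image.
backbone = model_f1a_f1b.get_layer("vgg16")  # layer name assumed
extractor = tf.keras.Model(
    backbone.input, backbone.get_layer("block5_conv3").output
)
features = extractor.predict(test_images).mean(axis=(1, 2))  # (N, 512)

# seaborn's clustermap runs agglomerative hierarchical clustering on the
# rows (images) and columns (features) and draws the dendrograms.
sns.clustermap(features, method="average", metric="euclidean")
```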

This experiment aimed to evaluate whether the difference in pixel values between a COVID-19 image from a Fujifilm device and a COVID-19 image from a GE device is greater or smaller than the difference between a COVID-19 image and a control image, both acquired by the same X-ray device.

Statistical analysis

Cross-validated 95% confidence intervals (CIs) for the AUCs were computed with the R package cvAUC22. AUC differences and their 95% CIs were calculated using the bootstrap method23. Any difference whose CI excluded 0 was considered statistically significant (p < 0.05).
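The authors performed this analysis in R; purely as an illustration of the bootstrap comparison, a Python sketch might look as follows (the stratified resampling and number of replicates are our choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def resample(y, s):
    """Stratified bootstrap resample so both classes stay represented."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=int((y == c).sum()))
        for c in (0, 1)
    ])
    return y[idx], s[idx]

def auc_diff_ci(y1, s1, y2, s2, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for AUC(subset 1) - AUC(subset 2)."""
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        ya, sa = resample(y1, s1)
        yb, sb = resample(y2, s2)
        diffs[b] = roc_auc_score(ya, sa) - roc_auc_score(yb, sb)
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# A CI that excludes 0 corresponds to p < 0.05, as in the criterion above.
```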

Results

Patients

This research included 874 patients: 45.1% (394) from institution 1 and the remainder from institution 2. The study sample comprised 42.6% (372) females and 57.4% (502) males. The median age was 62 years, and the mean age was 60 ± 17 years (range 5–96). Detailed descriptive statistics of the population are summarized in Table 4.

Table 4 Descriptive statistics of population age and gender.

Experiments to test the influence of institutional and device-related factors

Experiment 1: evaluation of internal validation performance

The internal validation performances of Model-F1A_F1B and Model-F1A_F2 did not differ significantly (Fig. 2 and Table 5). Thus, adding images acquired in a different institution by the same X-ray device model to the training sample did not have a significant impact on the algorithm’s internal validation performance.

Figure 2

Receiver operating characteristic (ROC) curves on subset Fuji_Inst1_TEST (internal validation).

Table 5 AUC differences on subset Fuji_Inst1_TEST (internal validation).

Grad-CAM heatmaps showed similar activation patterns for the two models: activations over lung opacities in COVID-19 patients, and no activations in any region of the image in Control patients (Fig. 3). This result suggests that both models were able to learn the radiological findings of COVID-19 and, thus, made predictions based on causal relationships.

Figure 3

Comparison of four Grad-CAM heatmaps based on the predictions of the three models.

In contrast, the internal validation performances of Model-F1A_F1B and Model-F1A_F2 were significantly higher, by 8% (p < 0.05) and 5.2% (p < 0.05) respectively, than that of Model-F1A’_F2’_GE2 (Table 5). Hence, adding images acquired by an X-ray device from a different manufacturer to the training sample decreased the algorithm’s internal validation performance.

Table 6 Generalization of Model-F1A_F1B across institutions and devices from different manufacturers.

This was also evident in the Grad-CAM heatmaps, where Model-F1A’_F2’_GE2 showed activation areas located outside the lungs, without any clinical or radiological interpretation. Unlike Model-F1A_F1B and Model-F1A_F2, Model-F1A’_F2’_GE2 learned spurious relationships (confounding factors) instead of causal relationships (Fig. 3).

Experiment 2: evaluation of generalization

Model-F1A_F1B generalized to Fujifilm and Carestream images from institution 2 (Fuji_Inst2_TEST and Care_Inst2_TEST) with performance decreases of 9.8% (p < 0.05) and 18.9% (p < 0.05), respectively. In contrast, the model did not generalize to GE images from institution 2 (GE_Inst2_TEST): it showed a 33.5% loss in AUC (p < 0.05), which left it performing at chance level (Fig. 4 and Table 6).

Figure 4

ROC curves of Model-F1A_F1B on the test subsets: Fuji_Inst1_TEST (internal validation); Fuji_Inst2_TEST (generalization to an external institution with the same X-ray device); Care_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing but the same type of response function); and GE_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing and a different type of response function).

Thus, Model-F1A_F1B generalized across institutions and across X-ray devices from different manufacturers with the same type of response function; however, it did not generalize across X-ray devices with different types of response function. A hierarchy of the factors influencing the generalization capability of the DL network is presented in Fig. 5.

Figure 5

Hierarchy of factors that affect the generalization of a deep learning network in medical image classification.

Experiment 3: evaluation of the dependence of CNN features on textures

The hierarchical clustering algorithm grouped the images from the test subsets into three clear clusters, corresponding to the three X-ray devices used to acquire them (Fujifilm, GE, and Carestream). Radiographs acquired by the two Fujifilm devices (subsets Fuji_Inst1_TEST and Fuji_Inst2_TEST) were intermixed, despite having been acquired in different institutions. In contrast, images from the two target classes (COVID-19 and Control) were not separated (Fig. 6).

Figure 6

Hierarchical clustering of test subset images (Fuji_Inst1_TEST, Fuji_Inst2_TEST, GE_Inst2_TEST, Care_Inst2_TEST) based on features extracted by Model-F1A_F1B. The clustering was generated with the Python statistical graphics library seaborn (version 0.11.1)21.

Moreover, the two clusters corresponding to images acquired by the X-ray devices with the same type of response function (Fujifilm and Carestream) were adjacent and merged at a higher level of the hierarchy, separate from the cluster containing the GE images, whose device had a different type of response function. The same results were observed when the experiment was repeated with the cropped version of the images.

In summary, the hierarchical clustering showed that the feature values extracted by the ImageNet-pretrained CNN separated the hidden classes (X-ray device and type of response function) more strongly than the actual target classes (COVID-19 and Control).

Discussion

Experiment 1: evaluation of internal validation performance

The similar performance of Model-F1A_F1B and Model-F1A_F2 suggests that institution-related factors may not have a significant impact on the internal validation performance of the algorithm. In contrast, the addition of images acquired by a different model of X-ray device to the training set led to a significant performance reduction in the internal validation of Model-F1A’_F2’_GE2. This result indicates a potentially important influence of device-related factors on the algorithm’s internal validation performance.

Grad-CAM heatmaps were in line with these results. Heatmaps of Model-F1A_F1B and Model-F1A_F2 showed activations over lung opacities in COVID-19 images and no activations in Control images, activation patterns that suggest those two models learned causal relationships. Conversely, Model-F1A’_F2’_GE2 did not show human-recognizable activations: COVID-19 lung consolidations were not properly identified, and several activations without clinical meaning appeared in both COVID-19 and Control images. In summary, Grad-CAM heatmaps also provided evidence of the influence of device-related factors on the internal validation performance of the algorithm.

Experiment 2: evaluation of generalization

This study found that a DL network can generalize across institutions and across X-ray devices with the same type of response function; however, it may suffer a variable decrease in performance when deployed on external datasets. In contrast, generalization across X-ray devices with different types of response function was not observed in this research.

Generalization of DL networks for CXR classification to external datasets has been discussed by only a few authors. Pooch et al.9 concluded that state-of-the-art DL algorithms do not generalize to external data that differ from the training data. Similarly, Zech et al.12 and Sathitratanacheewin et al.10 argue that CNNs do not generalize to external sites. Additionally, Zech et al.12 and Maguolo and Nanni7 warn that neural networks can often distinguish the dataset or the hospital from which the images come. For Maguolo and Nanni7, this issue is particularly important because many papers obtain the images of each class from different datasets. Seeking to understand how CNNs distinguish the source of a dataset, Cohen et al.24 proposed discrepancies in image labeling criteria among medical centers as a potential cause. In contrast, for Rajpurkar et al.14 and Pan et al.8, DL algorithms for CXR classification can generalize to datasets from external institutions, albeit with a decrease in performance. Our results agree with these last two studies.

In an attempt to shed light on the controversy surrounding the generalization of DL networks, we separately assessed the influence of multiple factors on generalization. Our research found that the X-ray device’s type of response function is probably the most important factor for generalization, followed by the device’s image processing, which hindered but did not prevent generalization.

Furthermore, institution-related factors were also found to reduce the algorithm’s performance, but to a lesser extent than X-ray device-related factors (Fig. 5).

Experiment 3: evaluation of the dependence of CNN features on textures

Hierarchical clustering showed that the feature values extracted by a CNN can be highly dependent on the X-ray device that acquired the image. The reason is that each X-ray device model applies its own image processing and has a distinct response function, which generate different textures in the radiographic images. These textural differences may produce disparities in CNN feature values among images from different devices and vendors, hindering generalization. Therefore, applying DL networks to images acquired by devices from manufacturers other than those used to acquire the training images should be done with caution.

In fact, the results of experiment 3 also suggest that the influence of the X-ray device on CNN feature values may be even greater than that of the target classes or the institution. Nevertheless, the impact of this issue is likely to be greater in challenging classification tasks; in relatively easy tasks, such as body-part classification in radiography, it may not pose a significant obstacle.

The dependence of CNN feature values on the X-ray device indicates that at least some pre-trained CNNs extract features based mainly on textures rather than shapes. This is an important issue for generalization, since shape-based features are potentially more robust and invariant than texture-based features. Accordingly, outside the medical field, Geirhos et al.25 previously argued that ImageNet-trained CNNs are biased towards recognizing textures rather than shapes. These authors also suggest that shape-biased networks are inherently more robust than texture-biased networks25.

Finally, this research introduces hierarchical clustering as a potentially useful tool for detecting hidden classes in a dataset that may be more salient than the target classes. In such cases, training a separate algorithm for each hidden class, rather than a single algorithm for the whole dataset, may be a prudent approach.

Taking all these findings into account, this paper argues that generalization across institutions is possible, but that the influence of the X-ray device on the performance of DL networks is highly significant. In light of these findings, we propose a new strategy for developing algorithms for interpreting radiological images: training a different algorithm for each device model. We believe this strategy could yield higher-performing DL models from smaller training datasets. It could also help algorithms learn causal relationships, as the Grad-CAM heatmaps in our research showed. Where the acquisition equipment is unknown, hierarchical clustering can help separate the images into homogeneous clusters. This strategy should be studied in future work.

Conclusion

The performance of DL algorithms in medical imaging is influenced mainly by two types of factors: institution-related and device-related. On the one hand, institution-related factors are those that do not modify pixel values (labeling criteria, radiology workflow, etc.). Although these factors do not prevent generalization, they can produce a relevant performance decrease when an algorithm is deployed in an external institution.

On the other hand, device-related factors (the device’s image processing, response function, and acquisition protocol) modify image pixel values and can have a significant impact on both internal validation and generalization performance. The device’s type of response function was found to be the most critical factor, as a change in it prevented the algorithm from generalizing, while other device-related factors hindered, but did not prevent, generalization.

Each radiography device applies its own image processing and response function, which generate different textures in the radiographic images. Hence, the feature values extracted by CNNs were found to be highly dependent on the X-ray device that acquired the image (a hidden class). This is an especially relevant issue, as it may compromise generalization to external X-ray device models. Clustering algorithms are useful for identifying hidden classes in a dataset, and we propose them as a potential strategy for evaluating CNN feature values.