Introduction

Chest radiography is a widely used imaging modality in routine medical practice for the evaluation, management, and follow-up of patients with a variety of diseases, including COVID-19 pneumonia1. However, chest radiography is a complex imaging modality to interpret2, and its evaluation requires experience and expertise3. Deep learning (DL) algorithms have the potential to improve the quality of radiographic interpretation and lead to more accurate diagnoses3.

Several DL algorithms for chest radiograph (CXR) classification have been published in recent years, especially in the context of the COVID-19 pandemic4,5,6. However, these algorithms often show poor generalization performance when the data distribution shifts7,8,9,10,11,12.

This generalization deficiency is often overlooked during algorithm evaluation, because algorithms are usually assessed on test subsets drawn from the same population sample as the training and validation subsets. As a result, only internal validation performance is typically estimated, and generalizability is rarely evaluated properly. For this reason, the performance of DL algorithms should also be assessed on test subsets from a source different from the one that provided the training and validation subsets7,8,9,10,11,12.

In medical data, the gap between the source of the training, validation, and (usually) test subsets and the real-world deployment environment can cause particularly large distribution shifts, which may substantially reduce the generalization performance of DL networks9,10,12. This issue remains a major challenge in machine learning research13.

Nevertheless, the generalization deficiency of DL networks for image classification has been discussed by only a few authors in the medical field. While its causes remain unclear, most of these authors have studied generalization only by testing different DL networks on external datasets from other institutions8,10,12,14. Unlike previous works, this research separately assesses the influence of the institution and the X-ray device on both the internal validation and the generalization performance of an algorithm.

For this purpose, factors that may potentially affect the performance of DL networks were divided into two categories:

  • X-ray device-related factors: aspects that affect image pixel values, including the acquisition protocol, the type of response function of the radiology device, and the image processing applied by the X-ray device.

  • Institution-related factors: differences among hospitals that do not change pixel values, such as labeling criteria, population demographics, disease epidemiology, and radiology workflow.

Thus, this work aims to study the role of the aforementioned factors; it is not intended to provide a new DL algorithm for CXR classification, as the literature already offers plenty of algorithms with satisfactory performance, particularly in internal validation. Instead, this study evaluates the influence of X-ray device-related and institution-related factors on DL algorithm performance in both internal validation and generalization.

To achieve this purpose, one DL network was trained and tested using different subsets in which the variables institution and X-ray device (including the device’s image processing and response function) were completely controlled. This distinguishes the present work from previous publications: to the best of our knowledge, it is the first time these variables have been fully controlled.

Materials and methods

Three experiments were carried out to study the influence of institution-related and X-ray device-related factors on the internal validation and generalization performance of a DL network for CXR classification (Fig. 1).

Figure 1

Experimental design.

Ethical approval

This research involved patients from two different medical institutions: Hospital Universitario Marqués de Valdecilla, located in Santander, Cantabria, Spain (referred to as institution 1 in the text), and Hospital de Sierrallana in Torrelavega, Cantabria, Spain (referred to as institution 2). The ethics committee of both institutions, the Comité de Ética de la Investigación con Medicamentos y Productos Sanitarios de Cantabria, approved this research. Because the study involved no direct interaction with patients or use of tissue samples, and used only retrospective, anonymized images, informed consent was not required15. All methods reported in this work were carried out in accordance with the pertinent guidelines and regulations.

Dataset and subsets

The images used in this research were all frontal-view CXRs manually labeled into two classes (COVID-19 and Control) by three expert radiologists, each with more than 5 years of experience, according to the inclusion criteria summarized in Table 1. In the text, these classes are referred to as target classes.

Table 1 Inclusion criteria for target classes.

The main image dataset was created by simple random sampling from four image databases, as described in Supplementary Appendix A1. This main dataset contained images acquired by three different X-ray devices in two institutions: 394 images acquired by a Fujifilm FDR smart FGX device in institution 1; 244 images acquired by the same device model (Fujifilm FDR smart FGX) in institution 2; 192 images acquired by a General Electric (GE) Revolution XRD device in institution 2; and 94 images acquired by a Carestream DRX Evolution Plus device in institution 2. Note that the Fujifilm and Carestream devices have the same type of response function (logarithmic), whereas the GE device has a different type (linear)16,17.

Finally, the eight subsets described in Table 2 were created by random sampling without replacement from the main dataset. Sampling was stratified to ensure an equal distribution of COVID-19 and Control images within each subset (50% of each class).

Table 2 Image subsets.
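As an illustration, this kind of class-stratified sampling without replacement can be sketched with pandas; the file and column names below are hypothetical, not taken from the study.

```python
import pandas as pd

# Hypothetical index of the main dataset: one row per image with its
# target class ("COVID-19" or "Control"), X-ray device, and institution.
pool = pd.read_csv("main_dataset_index.csv")

def draw_subset(pool: pd.DataFrame, n_per_class: int, seed: int):
    """Sample a class-balanced subset (50% of each class) and remove
    the sampled rows from the pool so images are never reused."""
    subset = (
        pool.groupby("target_class", group_keys=False)
        .apply(lambda g: g.sample(n=n_per_class, random_state=seed))
    )
    return subset, pool.drop(subset.index)

# e.g. one 300-image training subset: 150 COVID-19 + 150 Control images.
train_subset, pool = draw_subset(pool, n_per_class=150, seed=42)
```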

Image preprocessing

Images were collected as 16-bit unsigned integer monochrome pixels stored in DICOM format. After data collection, images were resized to 512 × 512 pixels using cubic spline interpolation, and pixel values were rescaled to [0, 1].
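A minimal sketch of this preprocessing step, assuming pydicom, NumPy, and SciPy are available; per-image min-max rescaling is our assumption of how pixel values were mapped to [0, 1]:

```python
import numpy as np
import pydicom
from scipy.ndimage import zoom

def preprocess_cxr(dicom_path: str) -> np.ndarray:
    """Load a 16-bit monochrome DICOM CXR, resize it to 512 x 512 with
    cubic spline interpolation, and rescale pixel values to [0, 1]."""
    ds = pydicom.dcmread(dicom_path)
    img = ds.pixel_array.astype(np.float32)

    # Cubic spline interpolation (order=3) to a fixed 512 x 512 grid.
    img = zoom(img, (512 / img.shape[0], 512 / img.shape[1]), order=3)

    # Per-image min-max rescaling to [0, 1] (assumption).
    img = (img - img.min()) / (img.max() - img.min())
    return img
```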

Experiments to test the influence of institutional and device-related factors

Experiment 1: evaluation of internal validation performance

The first experiment analyzed the influence of institutional and device-related factors on the internal validation performance of a DL algorithm (Fig. 1). For this purpose, the same DL network (a VGG16) was trained three times for the classification of CXR images, each time with the same architecture, hyperparameters (details in Supplementary Appendix A2), and number of training images (300), but with different training subsets (Table 3).

Table 3 Trainings and models.
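For orientation, a minimal transfer-learning sketch of such a setup in Keras; the classifier head, optimizer, learning rate, and channel handling are our assumptions, since the study’s actual hyperparameters are given in Supplementary Appendix A2:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg16_classifier(input_shape=(512, 512, 3)) -> tf.keras.Model:
    """VGG16-based binary classifier (COVID-19 vs Control)."""
    # ImageNet-pretrained backbone; monochrome CXRs would be replicated
    # to three channels to match it (assumption).
    base = tf.keras.applications.VGG16(
        include_top=False, weights="imagenet", input_shape=input_shape
    )
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),  # hypothetical settings
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

# The same architecture and hyperparameters are reused for all three
# trainings; only the 300-image training subset changes.
model_f1a_f1b = build_vgg16_classifier()
```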

The first training was performed with subsets Fuji_Inst1_TRAIN_A and Fuji_Inst1_TRAIN_B, so it included 300 images acquired by a Fujifilm FDR smart FGX device in institution 1. The resulting model was named Model-F1A_F1B after the subsets used (F1A = Fuji_Inst1_TRAIN_A, etc.).

The second training included subsets Fuji_Inst1_TRAIN_A and Fuji_Inst2_TRAIN, so all of its images were acquired by a Fujifilm FDR smart FGX device, half in institution 1 and half in institution 2. This model was named Model-F1A_F2.

The third and last training was performed with 101 random images from Fuji_Inst1_TRAIN_A, 101 random images from Fuji_Inst2_TRAIN, and the 98 images from GE_Inst2_TRAIN. The resulting model, named Model-F1A’_F2’_GE2, was therefore trained with images acquired by devices from two different manufacturers (a Fujifilm FDR smart FGX and a General Electric (GE) Revolution XRD) in two different institutions (1 and 2).

The internal validation performance of these three models was then tested on the Fuji_Inst1_TEST subset, composed exclusively of images acquired by a Fujifilm FDR smart FGX device in institution 1. This was the only test subset that could provide internal validation results for all three models, since every model was trained on a set that contained CXRs acquired by a Fujifilm FDR smart FGX device in institution 1. Finally, the internal validation performances of the three models were compared to assess the influence of institutional and X-ray device-related factors.

The influence of institutional factors was studied by comparing the performances of Model-F1A_F1B and Model-F1A_F2: both were trained with 300 images from the same X-ray device model, but the images of Model-F1A_F1B came only from institution 1, whereas those of Model-F1A_F2 came from institutions 1 and 2. Any performance difference between these models is therefore probably attributable to institution-related factors.

The influence of X-ray device-related factors was assessed by comparing the performance of Model-F1A’_F2’_GE2 with that of Model-F1A_F1B and Model-F1A_F2. While Model-F1A_F1B and Model-F1A_F2 were trained with 300 images acquired by a Fujifilm FDR smart FGX device, Model-F1A’_F2’_GE2 was trained with 300 images, some acquired by a Fujifilm FDR smart FGX device and the rest by a GE Revolution XRD device. This comparison therefore revealed the effect of adding images from a different manufacturer to the training sample.

The metric used to evaluate performance was the area under the receiver operating characteristic curve (AUC)18. Additionally, gradient-weighted class activation mapping (Grad-CAM) heatmaps were used to identify the image regions most relevant to each prediction19. This analysis evaluated how adding images acquired in a different institution and by a different X-ray device affected the DL network’s ability to learn causal relationships.
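A compact Grad-CAM sketch in Keras, assuming a model in which VGG16’s last convolutional layer (block5_conv3) is reachable by name; the layer naming and class-score indexing are our assumptions:

```python
import numpy as np
import tensorflow as tf

def grad_cam(model: tf.keras.Model, image: np.ndarray,
             conv_layer: str = "block5_conv3") -> np.ndarray:
    """Grad-CAM heatmap for one image of shape (H, W, C)."""
    grad_model = tf.keras.Model(
        inputs=model.input,
        outputs=[model.get_layer(conv_layer).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]  # predicted COVID-19 probability
    grads = tape.gradient(score, conv_out)

    # Channel weights: global average of the gradients (Grad-CAM step 1).
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted sum of feature maps, ReLU-rectified (Grad-CAM step 2).
    cam = tf.nn.relu(tf.einsum("bhwc,bc->bhw", conv_out, weights))[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```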

Experiment 2: evaluation of generalization

The second experiment studied the influence of both institutional and device-related factors on the DL network’s generalization (Fig. 1). Model-F1A_F1B was evaluated on the four test subsets, which included images from two different institutions and three distinct X-ray devices (Table 2). Four AUCs were therefore obtained: the AUC on Fuji_Inst1_TEST (internal validation); the AUC on Fuji_Inst2_TEST (generalization to an external institution with the same X-ray device); the AUC on Care_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing but the same type of response function); and the AUC on GE_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing and a different type of response function).

These AUCs were then compared pairwise, according to the factor that shifted between subsets, to evaluate the influence on generalization of institutional factors (AUC-Fuji_Inst1_TEST vs AUC-Fuji_Inst2_TEST), of the device’s image processing (AUC-Fuji_Inst2_TEST vs AUC-Care_Inst2_TEST), and of the device’s type of response function (AUC-Care_Inst2_TEST vs AUC-GE_Inst2_TEST).
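In code, this evaluation reduces to scoring one trained model on each test subset; `load_subset` is a hypothetical loader returning preprocessed images and binary labels:

```python
from sklearn.metrics import roc_auc_score

subset_names = ["Fuji_Inst1_TEST", "Fuji_Inst2_TEST",
                "Care_Inst2_TEST", "GE_Inst2_TEST"]

aucs = {}
for name in subset_names:
    x, y = load_subset(name)  # hypothetical loader: (images, labels)
    y_score = model_f1a_f1b.predict(x).ravel()  # COVID-19 probabilities
    aucs[name] = roc_auc_score(y, y_score)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```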

Experiment 3: evaluation of the dependence of CNN features on textures

The third experiment investigated the influence of institutional and device-related factors on the values of the features extracted by the pre-trained CNN (Fig. 1). The hypothesis was that radiological image textures depend on the X-ray device used to acquire the image: differences in the devices’ image processing and response functions produce image textures that vary with the X-ray device.

Therefore, the feature values extracted by CNNs could also be highly dependent on the X-ray device that acquired the image. This issue could hinder the generalization of DL networks, making them suitable only for images obtained from the same device models as those used in training.

In this context, Model-F1A_F1B was used to extract features from the test subsets. Features from the last convolutional layer were then clustered with a hierarchical clustering algorithm20 implemented in the Python statistical graphics library seaborn (version 0.11.1)21. This unsupervised approach allowed us to examine which image classes were more evident to the CNN: the target classes (COVID-19 and Control) or hidden classes (such as the X-ray device that acquired the image, or the institution where it was obtained). To ensure that the results were not biased by the metallic tokens present on the CXRs, the experiment was repeated after cropping the images to preserve only their central part.
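A sketch of this step under the VGG16 setup assumed earlier; `test_images` is a hypothetical array of preprocessed test images, and the spatial pooling of the last convolutional layer and the clustering linkage/metric are our assumptions, as the paper does not specify them:

```python
import seaborn as sns
import tensorflow as tf

# Feature extractor: output of the backbone's last convolutional layer,
# averaged over the spatial dimensions to one vector per image.
backbone = model_f1a_f1b.get_layer("vgg16")  # layer name assumed
extractor = tf.keras.Model(
    backbone.input, backbone.get_layer("block5_conv3").output
)
features = extractor.predict(test_images).mean(axis=(1, 2))  # (N, 512)

# seaborn's clustermap runs agglomerative hierarchical clustering on the
# rows (images) and columns (features) and draws the dendrograms.
sns.clustermap(features, method="average", metric="euclidean")
```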

This experiment aimed to evaluate whether the difference in pixel values between a COVID-19 image from a Fujifilm device and a COVID-19 image from a GE device is greater or smaller than the difference between a COVID-19 image and a control image, both acquired by the same X-ray device.

Statistical analysis

Cross-validated 95% confidence intervals (CIs) for the AUCs were computed with the R package cvAUC22. AUC differences and their 95% CIs were calculated using the bootstrap method23. Any difference whose CI excluded 0 was considered statistically significant (p < 0.05).
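The authors performed this analysis in R; purely as an illustration of the bootstrap comparison, a Python sketch might look as follows (the stratified resampling and number of replicates are our choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def resample(y, s):
    """Stratified bootstrap resample so both classes stay represented."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=int((y == c).sum()))
        for c in (0, 1)
    ])
    return y[idx], s[idx]

def auc_diff_ci(y1, s1, y2, s2, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for AUC(subset 1) - AUC(subset 2)."""
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        ya, sa = resample(y1, s1)
        yb, sb = resample(y2, s2)
        diffs[b] = roc_auc_score(ya, sa) - roc_auc_score(yb, sb)
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# A CI that excludes 0 corresponds to p < 0.05, as in the criterion above.
```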

Results

Patients

This research included 874 patients: 45.1% (394) from institution 1 and the remainder from institution 2. The study sample comprised 42.6% (372) females and 57.4% (502) males. The median age was 62 years, and the mean age was 60 ± 17 years (range 5–96). Detailed descriptive statistics of the population are summarized in Table 4.

Table 4 Descriptive statistics of population age and gender.

Experiments to test the influence of institutional and device-related factors

Experiment 1: evaluation of internal validation performance

The internal validation performances of Model-F1A_F1B and Model-F1A_F2 did not differ significantly (Fig. 2 and Table 5). Thus, adding images acquired in a different institution by the same X-ray device model to the training sample did not have a significant impact on the algorithm’s internal validation performance.

Figure 2

Receiver operating characteristic (ROC) curves on subset Fuji_Inst1_TEST (internal validation).

Table 5 AUC differences on subset Fuji_Inst1_TEST (internal validation).

Grad-CAM heatmaps showed similar activation patterns for the two models: activations over lung opacities in COVID-19 patients, and no activations in any region of the image in Control patients (Fig. 3). This result suggests that both models were able to learn the radiological findings of COVID-19 and, thus, made predictions based on causal relationships.

Figure 3

Comparison of four Grad-CAM heatmaps based on the predictions of the three models.

In contrast, the internal validation performances of Model-F1A_F1B and Model-F1A_F2 were significantly higher, by 8% (p < 0.05) and 5.2% (p < 0.05) respectively, than that of Model-F1A’_F2’_GE2 (Table 5). Hence, adding images acquired by an X-ray device from a different manufacturer to the training sample decreased the algorithm’s internal validation performance.

Table 6 Generalization of Model-F1A_F1B across institutions and devices from different manufacturers.

This was also evident in the Grad-CAM heatmaps, where Model-F1A’_F2’_GE2 showed activation areas located outside the lungs, without any clinical or radiological interpretation. Unlike Model-F1A_F1B and Model-F1A_F2, Model-F1A’_F2’_GE2 learned spurious relationships (confounding factors) instead of causal relationships (Fig. 3).

Experiment 2: evaluation of generalization

Model-F1A_F1B generalized to Fujifilm and Carestream images from institution 2 (Fuji_Inst2_TEST and Care_Inst2_TEST) with performance decreases of 9.8% (p < 0.05) and 18.9% (p < 0.05), respectively. In contrast, the model did not generalize to GE images from institution 2 (GE_Inst2_TEST): it showed a 33.5% loss in AUC (p < 0.05), which left it performing at chance level (Fig. 4 and Table 6).

Figure 4

ROC curves of Model-F1A_F1B on the test subsets: Fuji_Inst1_TEST (internal validation); Fuji_Inst2_TEST (generalization to an external institution with the same X-ray device); Care_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing but the same type of response function); and GE_Inst2_TEST (generalization to an external institution with an X-ray device that has different image processing and a different type of response function).

Thus, Model-F1A_F1B generalized across institutions and across X-ray devices from different manufacturers with the same type of response function; however, it did not generalize across X-ray devices with different types of response function. A hierarchy of the factors influencing the generalization capability of the DL network is presented in Fig. 5.

Figure 5

Hierarchy of factors that affect the generalization of a deep learning network in medical image classification.

Experiment 3: evaluation of the dependence of CNN features on textures

The hierarchical clustering algorithm grouped the images from the test subsets into three clear clusters, corresponding to the three X-ray devices used to acquire them (Fujifilm, GE, and Carestream). Radiographs acquired by the two Fujifilm devices (subsets Fuji_Inst1_TEST and Fuji_Inst2_TEST) were intermixed, despite having been acquired in different institutions. In contrast, images from the two target classes (COVID-19 and Control) were not separated (Fig. 6).

Figure 6

Hierarchical clustering of test subset images (Fuji_Inst1_TEST, Fuji_Inst2_TEST, GE_Inst2_TEST, Care_Inst2_TEST) based on features extracted by Model-F1A_F1B. The clustering was generated with the Python statistical graphics library seaborn (version 0.11.1)21.

Moreover, the two clusters corresponding to images acquired by the X-ray devices with the same type of response function (Fujifilm and Carestream) were adjacent and merged at a higher level of the hierarchy, separate from the cluster containing the GE images, whose device had a different type of response function. The same results were observed when the experiment was repeated with the cropped version of the images.

In summary, the hierarchical clustering showed that the feature values extracted by the ImageNet-pretrained CNN separated the hidden classes (X-ray device and type of response function) more strongly than the actual target classes (COVID-19 and Control).

Discussion

Experiment 1: evaluation of internal validation performance

The similar performance of Model-F1A_F1B and Model-F1A_F2 suggests that institution-related factors may not have a significant impact on the internal validation performance of the algorithm. In contrast, the addition of images acquired by a different model of X-ray device to the training set led to a significant performance reduction in the internal validation of Model-F1A’_F2’_GE2. This result indicates a potentially important influence of device-related factors on the algorithm’s internal validation performance.

Grad-CAM heatmaps were in line with these results. Heatmaps of Model-F1A_F1B and Model-F1A_F2 showed activations over lung opacities in COVID-19 images and no activations in Control images, activation patterns that suggest those two models learned causal relationships. Conversely, Model-F1A’_F2’_GE2 did not show human-recognizable activations: COVID-19 lung consolidations were not properly identified, and several activations without clinical meaning appeared in both COVID-19 and Control images. In summary, Grad-CAM heatmaps also provided evidence of the influence of device-related factors on the internal validation performance of the algorithm.

Experiment 2: evaluation of generalization

This study found that a DL network can generalize across institutions and across X-ray devices with the same type of response function; however, it may suffer a variable decrease in performance when deployed on external datasets. In contrast, generalization across X-ray devices with different types of response function was not observed in this research.

Generalization of DL networks for CXR classification to external datasets has been discussed by only a few authors. Pooch et al.9 concluded that state-of-the-art DL algorithms do not generalize to external data that differ from the training data. Similarly, Zech et al.12 and Sathitratanacheewin et al.10 argue that CNNs do not generalize to external sites. Additionally, Zech et al.12 and Maguolo and Nanni7 warn that neural networks can often distinguish the dataset or the hospital from which the images come. For Maguolo and Nanni7, this issue is particularly important because many papers obtain the images of each class from different datasets. Seeking to understand how CNNs distinguish the source of a dataset, Cohen et al.24 proposed discrepancies in image labeling criteria among medical centers as a potential cause. In contrast, for Rajpurkar et al.14 and Pan et al.8, DL algorithms for CXR classification can generalize to datasets from external institutions, albeit with a decrease in performance. Our results agree with these last two studies.

In an attempt to shed light on the controversy surrounding the generalization of DL networks, we separately assessed the influence of multiple factors on generalization. Our research found that the X-ray device’s type of response function is probably the most important factor for generalization, followed by the device’s image processing, which hindered but did not prevent generalization.

Furthermore, institution-related factors were also found to reduce the algorithm’s performance, but to a lesser extent than X-ray device-related factors (Fig. 5).

Experiment 3: evaluation of the dependence of CNN features on textures

Hierarchical clustering showed that the feature values extracted by a CNN can be highly dependent on the X-ray device that acquired the image. The reason is that each X-ray device model applies its own image processing and has a distinct response function, which generate different textures in the radiographic images. These textural differences may produce disparities in CNN feature values among images from different devices and vendors, hindering generalization. Therefore, applying DL networks to images acquired by devices from manufacturers other than those used to acquire the training images should be done with caution.

In fact, the results of experiment 3 also suggest that the influence of the X-ray device on CNN feature values may be even greater than that of the target classes or the institution. Nevertheless, the impact of this issue is likely to be greater in challenging classification tasks; in relatively easy tasks, such as body-part classification in radiography, it may not pose a significant obstacle.

The dependence of CNN feature values on the X-ray device indicates that at least some pre-trained CNNs extract features based mainly on textures rather than shapes. This is an important issue for generalization, since shape-based features are potentially more robust and invariant than texture-based features. Accordingly, outside the medical field, Geirhos et al.25 previously argued that ImageNet-trained CNNs are biased towards recognizing textures rather than shapes. These authors also suggest that shape-biased networks are inherently more robust than texture-biased networks25.

Finally, this research introduces hierarchical clustering as a potentially useful tool for detecting hidden classes in a dataset that may be more salient than the target classes. In such cases, training a separate algorithm for each hidden class, rather than a single algorithm for the whole dataset, may be a prudent approach.

Taking all these findings into account, this paper argues that generalization across institutions is possible, but that the influence of the X-ray device on the performance of DL networks is highly significant. In light of these findings, we propose a new strategy for developing algorithms for interpreting radiological images: training a different algorithm for each device model. We believe this strategy could yield higher-performing DL models from smaller training datasets. It could also help algorithms learn causal relationships, as the Grad-CAM heatmaps in our research showed. Where the acquisition equipment is unknown, hierarchical clustering can help separate the images into homogeneous clusters. This strategy should be studied in future work.

Conclusion

The performance of DL algorithms in medical imaging is influenced mainly by two types of factors: institution-related and device-related. On the one hand, institution-related factors are those that do not modify pixel values (labeling criteria, radiology workflow, etc.). Although these factors do not prevent generalization, they can produce a relevant performance decrease when an algorithm is deployed in an external institution.

On the other hand, device-related factors (the device’s image processing, response function, and acquisition protocol) modify image pixel values and can have a significant impact on both internal validation and generalization performance. The device’s type of response function was found to be the most critical factor, as a change in it prevented the algorithm from generalizing, while other device-related factors hindered, but did not prevent, generalization.

Each radiography device applies its own image processing and response function, which generate different textures in the radiographic images. Hence, the feature values extracted by CNNs were found to be highly dependent on the X-ray device that acquired the image (a hidden class). This is an especially relevant issue, as it may compromise generalization to external X-ray device models. Clustering algorithms are useful for identifying hidden classes in a dataset, and we propose them as a potential strategy for evaluating CNN feature values.