Abstract
Generalization of deep learning (DL) algorithms is critical for the secure implementation of computer-aided diagnosis systems in clinical practice. However, broad generalization remains to be a challenge in machine learning. This research aims to identify and study potential factors that can affect the internal validation and generalization of DL networks, namely the institution where the images come from, the image processing applied by the X-ray device, and the type of response function of the X-ray device. For these purposes, a pre-trained convolutional neural network (CNN) (VGG16) was trained three times for classifying COVID-19 and control chest radiographs with the same hyperparameters, but using different combinations of data acquired in two institutions by three different X-ray device manufacturers. Regarding internal validation, the addition of images from an external institution to the training set did not modify the algorithm’s internal performance, however, the inclusion of images acquired by a device from a different manufacturer decreased the performance up to 8% (p < 0.05). In contrast, generalization across institutions and X-ray devices with the same type of response function was achieved. Nonetheless, generalization was not observed across devices with different types of response function. This factor was the key impediment to achieving broad generalization in our research, followed by the device’s image-processing and the inter-institutional differences, which both reduced generalization performance to 18.9% (p < 0.05), and 9.8% (p < 0.05), respectively. Finally, clustering analysis with features extracted by the CNN was performed, revealing a substantial dependence of feature values extracted by the pre-trained CNN on the X-ray device which acquired the images.
Similar content being viewed by others
Introduction
Chest radiography is a widely used imaging modality in the regular medical practice for the evaluation, management and follow-up of patients with several diseases, such as COVID-19 pneumonia1. However, chest radiography is a complex imaging modality to interpret2, and its evaluation requires experience and expertise3. Deep learning (DL) algorithms have potential to improve the quality of radiographic interpretation and lead to more accurate diagnoses3.
Several DL algorithms for chest radiographs (CXRs) classification have been published in the past years, especially in the context of COVID-19 pandemic4,5,6. However, these algorithms often show low generalization performance when data distribution shifts7,8,9,10,11,12.
This generalization deficiency is often overlooked during algorithm evaluation, because algorithms are often assessed on test subsets that come from the same population sample as the training and validation subsets. Therefore, typically only the internal validation performance is estimated, and generalizability is not usually properly evaluated. For this reason, DL algorithms’ performance should also be assessed on test subsets coming from a different source from which training and validation subsets were obtained7,8,9,10,11,12.
In medical data, the difference between the source from which training, validation, and usually test subsets are obtained and the real-world environment can cause particularly considerable distribution shifts, which may lead DL networks to significantly reduce their generalization performance9,10,12. This issue remains a major challenge in machine learning research13.
Nevertheless, generalization deficiency for image classification in DL networks has only been discussed by a few authors in the medical field. While the causes remain unclear, most of these authors have only studied generalization through testing different DL networks on external datasets from external institutions8,10,12,14. Unlike previous works, this novel research separately assesses the influence of institution and X-ray device on both algorithm’s internal validation and generalization performances.
For this purpose, factors that may potentially affect the performance of DL networks were divided into two categories:
-
X-ray device related factors: aspects that affect image pixel values, which include the acquisition protocol, the type of response function of the radiology device, and the image processing applied by the X-ray device.
-
Institutional related factors: differences among hospitals that do not change the pixel values, such as labeling criteria, population demographics, disease epidemiology, and radiology workflow.
Thus, this work is aimed to study the role of the aforementioned factors, but is not intended to provide a new DL algorithm for CXR classification, as there are already plenty of these in the literature which show satisfactory performance, particularly, in internal validation. In contrast, this novel study aims to evaluate the influence of X-ray device associated and institutional related factors on DL algorithms performance both in internal validation and generalization.
To achieve this purpose, one DL network will be trained and tested using different subsets in which the variables institution and X-ray device, (including device’s image processing and response function) will be completely controlled. This represents a differentiating issue from other previous publications, as to the best of our knowledge, it is the first time that these variables are fully controlled.
Materials and methods
Three experiments were carried out to study the influence of institutional and X-ray device related factors on the internal validation and generalization performance of a DL network for CXR classification (Fig. 1).
Ethical approval
This research involved patients from two different medical institutions: Hospital Universitario Marqués de Valdecilla, located in Santander, Cantabria, Spain—referred to as Institution 1 in the text; and Hospital de Sierrallana in Torrelavega, Cantabria, Spain—referred to as Institution 2. The Ethics Committee of both institutions, Comité de Ética de la Investigación con Medicamentos y Productos Sanitarios de Cantabria, approved this research. Since this study was approved by the Comité de Ética de la Investigación con Medicamentos y Productos Sanitarios de Cantabria, without direct interaction with patients or use of tissue samples, and using retrospective images acquired in the past and anonymized, informed consent was not required15. All methods reported in this work were carried out in accordance with the pertinent guidelines and regulations.
Dataset and subsets
Images for this research were all frontal view CXRs manually labeled by three expert radiologists, with more than 5 years of experience, in two classes (COVID-19 and Control), according to the inclusion criteria summarized in Table 1. In the text, these classes are named as target classes.
The main image dataset was created by simple random sampling from four image databases, as described in Supplementary Appendix A1. This main dataset contained images acquired by three different X-ray devices in two institutions: 394 images acquired by a Fujifilm FDR smart FGX device in institution 1; 244 images acquired by the same device model (Fujifilm FDR smart FGX) in institution 2; 192 images acquired by a general electric (GE) revolution XRD device in institution 2; and 94 images acquired by a Carestream DRX Evolution Plus device in institution 2. Note that Fujifilm and Carestream devices have the same type of response function (logarithmic), while GE has a different type (linear)16,17.
Finally, eight subsets described in Table 2 were created by random sampling without repetition of the main dataset. Random sampling was performed with stratification to ensure an equal distribution of COVID-19 and Control images within each subset (50 percent of each class).
Image preprocessing
Images were collected as 16-bit unsigned integer monochrome pixels stored in DICOM format. After data collection, images were resized to 512 × 512 pixels using cubic spline interpolation, and pixel values were rescaled to [0, 1].
Experiments to test the influence of institutional and device related factors
Experiment 1: evaluation of internal validation performance
The first experiment analyzed the influence of institutional and device related factors on the internal validation performance of a DL algorithm (Fig. 1). For this purpose, the same DL network (a VGG16) was trained three times for the classification of CXR images. Each time, with the same architecture, hyperparameters (details in Supplementary Appendix A2), and number of images (300), but using different training subsets (Table 3).
First training was performed with subsets Fuji_Inst1_TRAIN_A and Fuji_Inst1_TRAIN_B, so it includes 300 images acquired by a Fujifilm FDR smart FGX device from institution 1. The resulting model was named Model-F1A_F1B, because of the subsets used (F1A = Fuji_Inst1_TRAIN_A, etc.).
Second training included subsets Fuji_Inst1_TRAIN_A and Fuji_Inst2_TRAIN, so it was performed with all images acquired by a Fujifilm FDR smart FGX device, but half in institution 1 and half in institution 2. This model was named Model-F1A_F2.
The third and last training was performed with 101 random images from Fuji_Inst1_TRAIN_A, 101 random images from Fuji_Inst2_TRAIN, and the 98 images from GE_Inst2_TRAIN. This means that the resulting model, named Model-F1A’_ F2’_GE2, was trained including images acquired by two different manufacturers, Fujifilm FDR smart FGX and general electric (GE) revolution XRD, and in two different institutions (1 and 2).
Later, internal validation performance of these three models were tested on Fuji_Inst1_TEST subset, which contained all images acquired by a Fujifilm FDR smart FGX device in institution 1. This subset was the only test subset that could give results of internal validation for the three models, since all models were trained using sets of images which contained CXRs acquired by a Fujifilm FDR smart FGX device in institution 1. Finally, the internal validation performances of the three models were compared to assess the influence of institutional and X-ray machine related factors.
The influence of institutional factors was studied by comparing performances of Model-F1A_F1B and Model-F1A_F2, as both of them were trained with 300 images of the same X-ray device model, but Model-F1A_F1B coming from institution 1, and Model-F1A_F2 coming from institutions 1 and 2. Therefore, performance differences among these models could be probably attributable to institutional related factors.
The influence of X-ray machine related factors was assessed comparing the performances of Model-F1A’_ F2’_GE2 with Model-F1A_F1B and Model-F1A_F2. While Model-F1A_F1B and Model-F1A_F2 were trained with 300 images acquired by a Fujifilm FDR smart FGX, Model-F1A’_ F2’_GE2 was trained with 300 images, some of them acquired by a Fujifilm FDR smart FGX, and the rest by a GE revolution XRD device. Thus, this comparison revealed the effect of adding images from a different manufacturer to the training sample.
The metric used to evaluate the performances was the area under the receiver operating characteristic curve (AUC)18. Additionally, gradient-weighted class activation mapping (Grad-CAM) heatmaps were used to identify the significant regions in the image for making the prediction19. This analysis evaluated how the addition of images acquired in a distinct institution and by a different X-ray device affected the DL network’s ability to learn causal relationships.
Experiment 2: evaluation of generalization
The second experiment studied the influence of both institutional and device-related factors on a DL network’s generalization (Fig. 1). Model-F1A_F1B was evaluated on the four test subsets that included images from two different institutions and three distinct X-ray devices (Table 2). Therefore, four AUCs were obtained: AUC on subset Fuji_Inst1_TEST (internal validation); AUC on Fuji_Inst2_TEST (generalization to an external institution with the same X-ray device); AUC on Care_Inst2_TEST (generalization to an external institution with a X-ray device which has a different image processing but the same type of response function); and AUC on GE_Inst2_TEST (generalization to an external institution with a X-ray device which has a different image processing and different type of response function).
Later, considering the shifting factor, AUCs were compared to evaluate the influence of institutional factors on generalization (AUC-Fuji_Inst1_TEST vs AUC-Fuji_Inst2_TEST), as well as X-ray device factors, including device’s image processing (AUC-Fuji_Inst2_TEST vs AUC-Care_Inst2_TEST) and device’s type of response function (AUC-Care_Inst2_TEST vs AUC-GE_Inst2_TEST).
Experiment 3: evaluation of the dependence of CNN-features on textures
The third experiment investigated the influence of institutional and device related factors on the values of the features extracted by the pre-trained CNN (Fig. 1). The hypothesis was that radiological image textures depend on the X-ray device used to acquire the image. Differences in device´s image processing and response functions result in differences among imaging textures based on the X-ray device.
Therefore, feature values extracted by CNNs could also be highly dependent on the X-ray device that acquires the image. This issue could hinder the generalization of DL networks, making them only suitable to images obtained from the same devices as the ones used in training.
In this context, Model-F1A_F1B was used to extract features from the test subsets. Later, features from the last convolutional layer were clustered using a hierarchical clustering algorithm20, which was implemented using python’s scientific graphics library, seaborn (version 0.11.1)21. This unsupervised approach allowed us to examine which image classes were more evident for the CNN, namely target classes (COVID-19 and control) or other hidden classes (such as the X-ray device which acquired the image, or the institution where the images were obtained). To ensure that results were not biased by CXRs metallic tokens, this experiment was repeated by cropping the images to preserve only their central part.
This experiment aimed to evaluate whether the difference in pixel values between a COVID-19 image from a Fujifilm device and a COVID-19 image from a GE device is greater or smaller than the difference between a COVID-19 image and a control image, both acquired by the same X-ray device.
Statistical analysis
Cross-Validated AUCs 95% confidence intervals (CIs) were computed with the R package cvAUC22. The AUC differences with their 95% CIs were calculated using the bootstrap method23. Any difference where the CI excluded the 0 value was considered to be statistically significant with a p-value < 0.05.
Results
Patients
This research included 874 patients, 45.1% (394) from institution 1 and the remaining from institution 2. The study sample comprised 42.6% (372) females and 57.4% (502) males. The median age was 62 years, while the average age was 60 ± 17 years (5–96). Detailed population descriptive statistics are summarized in Table 4.
Experiments to test the influence of institutional and device related factors
Experiment 1: evaluation of internal validation performance
Internal validation performances of Model-F1A_F1B and Model-F1A_F2 did not show significant statistical differences (Fig. 2 and Table 5). Thus, the addition of images to the training sample acquired in a different institution by the same X-ray model did not have a significant impact on the algorithm’s internal validation.
Grad-CAM heatmaps showed similar activation patterns between the two models. Heatmaps depicted activations over COVID-19 lung opacities, and absence of activations within any region of the image in Control patients (Fig. 3). This result suggests that both models were able to learn the radiological findings of COVID-19 and, thus, made predictions based on causal relationships.
In contrast, internal validation performances of Model-F1A_F1B and Model-F1A_F2 were 8% (p < 0.05) and 5.2% (p < 0.05) significantly higher than Model-F1A’_ F2’_GE2 internal validation performance (Table 6). Hence, the addition of images acquired by another X-ray device of a different manufacturer to the training sample decreased the algorithm’s internal validation performance.
This was also evidenced in Grad-CAM heatmaps, where Model-F1A’_ F2’_GE2 showed activation areas located outside the lungs without any clinical or radiological interpretation. Unlike Model-F1A_F1B and Model-F1A_F2, Model-F1A’_ F2’_GE2 learnt spurious relationships (confounding factors) instead of causal relationships (Fig. 3).
Experiment 2: evaluation of generalization
Model-F1A_F1B generalized to Fujifilm and Carestream images from institution 2 (Fuji_Inst2_TEST and Care_Inst2_TEST) with a performance decrease of 9.8% (p < 0.05) and 18.9% (p < 0.05), respectively. In contrast, this model did not generalize to GE images from institution 2 (GE_Inst2_TEST), as it showed a loss in the AUC of 33.5% (p < 0.05) which caused the model to perform randomly (Fig. 4 and Table 6).
ROC curves of Model-F1A_F1B on the tests subsets: Fuji_Inst1_TEST (internal validation); Fuji_Inst2_TEST (generalization to an external institution with same X-ray device); Care_Inst2_TEST (generalization to an external institution with a X-ray device which has a different image processing but the same type of response function); and GE_Inst2_TEST (generalization to an external institution with a X-ray device which has a different image processing and a different type of response function).
Thus, Model-F1A_F1B generalized across institutions and across X-ray devices from different manufacturers with the same type of response function, however, it did not generalize across X-ray devices with different types of response function. A hierarchy of factors influencing the generalization capability of the DL network is presented in Fig. 5.
Experiment 3: evaluation of the dependence of CNN-features on textures
The hierarchical clustering algorithm grouped images from the test subsets into three evident clusters, which corresponded to images from each of the three X-ray devices used to acquire the images (Fujifilm, GE and Carestream). Radiographies acquired by both Fujifilm devices (subsets Fuji_Inst1_TEST and Fuji_Inst2_TEST) were mixed, despite being acquired in different institutions. Ultimately, images from different target classes (COVID-19 and control) were not separated (Fig. 6).
Hierarchical clustering of test subset images (Fuji_Inst1_TEST, Fuji_Inst2_TEST, GE_Inst2_TEST, Care_Inst2_TEST) based on features extracted by Model-F1A_F1B. Hierarchical clustering was generated using python's scientific graphics library, seaborn (version 0.11.1)21.
Besides, the two clusters corresponding to images acquired by the two X-ray devices with the same type of response function (Fujifilm and Carestream) were next to each other, grouped together in a higher cluster level, and separated from the cluster containing GE images, which had a different type of response function. Same results were observed when the experiment was repeated using a cropped version of the images.
In summary, the hierarchical clustering algorithm found that the feature values extracted by the pretrained ImageNet CNN were more dissimilar regarding the hidden classes (X-ray device and type of response function) than the real target classes (COVID-19 and control).
Discussion
Experiment 1: evaluation of internal validation performance
The similarity in performance between Model-F1A_F1B and Model-F1A_F2 suggests that institutional related factors may not have a significant impact on the internal validation performance of the algorithm. Besides, the addition of images acquired by a different model of X-ray device to the training set led to a significant performance reduction in the internal validation of Model-F1A’_ F2’_GE2. This result indicates a potentially important influence of device related factors on the algorithm’s internal validation performance.
Grad-CAM heatmaps were in line with the aforementioned results. Heatmaps of Model-F1A_F1B and Model-F1A_F2 showed activations over lung opacities in COVID-19 images and absence of activations in control images. These activation patterns suggest that those two models were able to learn causal relationships. Conversely, Model-F1A’_ F2’_GE2 did not show human-recognizable activations. COVID-19 lung consolidations were not properly identified and several activations without clinical meaning appeared both in COVID-19 and control images. In summary, Grad-CAM heatmaps also provided evidence of a variable level of influence of device related factors on the internal validation performance of the algorithm.
Experiment 2: evaluation of generalization
This study found that a DL network can generalize across institutions and X-ray devices with the same type of response function, however, it may suffer a variable decrease in performance when deployed on external datasets. In contrast, generalization across X-ray devices with a different type of response function was not observed in this research.
Generalization of DL networks for CXRs classification to external datasets has been argued by a few authors. Pooch et al.9 concluded that state-of-the-art DL algorithms do not generalize to external data which differs from the data used for training. Similar to this, Zech et al.12 and Sathitratanacheewin et al.10 defend that CNNs do not generalize to external sites. Additionally, Zech et al.12 and Maguolo and Nanni7 warn that neural networks can often distinguish the dataset or the hospital where the images come from. For Maguolo and Nanni7, this issue is very important since most papers obtain images of each class to predict from different datasets. Trying to understand how CNNs distinguish the source of the dataset, Cohen et al.24 proposed discrepancies in image labeling criteria among medical centers to be the potential cause. Furthermore, for Rajpurkar et al.14 and Pan et al.8 DL algorithms for CXR classification can generalize to datasets from external institutions with a decrease in their performance. Our results agree with these last two authors.
In an attempt to shed light on the controversy surrounding the generalization of DL networks, we separately assessed the influence of multiple factors on generalization. Our research found that the X-ray device’s response function is probably the most important factor to enable generalization, followed by the device’s image processing, which hindered but did not impede the algorithm to generalize.
Furthermore, institutional related factors were also found to reduce algorithm’s performance, but to a minor extent than X-ray device related factors (Fig. 5).
Experiment 3: evaluation of the dependence of CNN-features on textures
Hierarchical clustering showed that feature values extracted by a CNN could be highly dependent on the X-ray device that acquires the image. The reason is that each X-ray device model applies a unique image processing and has a distinct response function, which generates different textures in radiographic images. These textural differences may lead to disparities in CNN-feature values among images from different devices and vendors that could hinder generalization. Therefore, the application of DL networks to images acquired by devices from manufacturers that are different from those used to acquire the training images should be carefully accomplished.
In fact, the results of experiment 3 also suggest that the influence of X-ray devices on CNN-feature values could be even higher than the influence of the target classes or the institution. Nevertheless, the impact of this issue may be more significant in challenging classification tasks, while in relatively easy tasks, such as body parts classification in radiography, it may not pose a significant obstacle.
The dependence of CNN-feature values on the X-ray device indicates that at least some of the pre-trained CNNs extract features mainly based on textures rather than shapes. This is an important issue for generalization since shape-based features are potentially more robust and invariant than texture-based features. Accordingly, outside the medical field, Geirhos et al.25 previously argued that ImageNet-trained CNNs are biased towards recognizing textures rather than shapes. These authors also suggest that shape biased networks are inherently more robust than texture biased networks25.
Finally, this research introduces hierarchical clustering as a potentially useful tool to detect hidden classes in a dataset which can be more relevant than target classes. Therefore, training a different algorithm for each hidden class to classify the target classes instead of training a single algorithm may be a prudent approach to consider.
Taking all these findings into account, this paper argues that generalization across institutions is possible; however, the influence of the X-ray device on the performance of DL networks is highly significant. In light of these findings, we propose a new strategy for developing algorithms for interpreting radiological images: training a different algorithm for each device model. We believe that this strategy could achieve higher-performing DL models, using a smaller training dataset. This strategy could also help the algorithms to learn causal relationships, as in our research Grad-CAM heatmaps showed. In cases where acquisition equipment is unknown, hierarchical clustering can help to separate images into homogeneous clusters. This new strategy should be studied in future papers.
Conclusion
The performance of DL algorithms in medical imaging can be influenced by mainly two types of factors: institutional and device related. On the one hand, institutional related factors are those that do not modify pixel values (labeling criteria, radiology workflow, etc.). Although these factors do not impede generalization, they can produce a relevant performance decrease when adopted in an external institution.
On the other hand, device related factors (device’s image processing, response function, and acquisition protocol) modify image pixel values, and they can have a significant impact on internal validation and generalization performances. The device’s type of response function was found to be the most critical factor, as a change on it prevented the algorithm from generalizing, while other device related factors hindered, but did not impede, generalization.
Thereby, radiography devices apply a unique image processing and response function which generate different textures in radiographic images. Hence, feature values extracted by CNNs were found to be highly dependent on the X-ray device from which the image was acquired (hidden class). This is an especially relevant issue, which may compromise generalization to external X-ray models. Clustering algorithms are deemed useful to identify hidden classes in the dataset, and we propose them as a potential strategy to evaluate CNN feature values.
Data availability
Data from this study were anonymized and can be made available upon reasonable request by contacting the first author at pablomenendezfernandezmiranda@gmail.com or the corresponding author.
References
Borghesi, A. & Roberto, M. Covid-19 outbreak in Italy: Experimental chest x-ray scoring system for quantifying and monitoring disease progression. Radiol. Med. 125, 509–513 (2020).
Al Aseri, Z. Accuracy of chest radiograph interpretation by emergency physicians. Emerg. Radiol. 16, 111–114 (2009).
Hwang, E. J. et al. Deep learning for chest radiograph diagnosis in the emergency department. Radiology 293, 573–580 (2019).
Irvin J et al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. Proceedings of the AAAI Conference on Artificial Intelligence. 33, 590–597. https://stanfordmlgroup.github.io/competitions/chexpert/. (2019). Accessed 12 March 2022.
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Wang, L., Lin, Z. Q. & Wong, A. COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 10, 19549 (2020).
Maguolo, G. & Nanni, L. A critic evaluation of methods for covid-19 automatic detection from x-ray images. Inf. Fusion 76, 1–7 (2021).
Pan, I., Agarwal, S. & Merck, D. Generalizable inter-institutional classification of abnormal chest radiographs using efficient convolutional neural networks. J. Digit. Imaging 32, 888–896 (2019).
Pooch, E. H. P., Ballester, P. & Barros, R. C. Can we trust deep learning based diagnosis? The impact of domain shift in chest radiograph classification. In Thoracic Image Analysis, TIA 2020. Lecture Notes in Computer Science (eds Petersen, J. et al.) 74–83 (Springer, 2020).
Sathitratanacheewin, S., Sunanta, P. & Pongpirul, K. Deep learning for automated classification of tuberculosis-related chest x-ray: Dataset distribution shift limits diagnostic performance generalizability. Heliyon 6, 04614 (2020).
Subbaswamy, A. & Saria, S. Counterfactual normalization: Proactively addressing dataset shift using causal mechanisms. In 34th Conference on Uncertainty in Artificial Intelligence 2018, Vol. 2 (eds Silva, R. et al.) 947–957 (Association For Uncertainty in Artificial Intelligence (AUAI), 2018).
Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med. 15, 1–17 (2018).
Eche, T., Schwartz, L. H., Mokrane, F. Z. & Dercle, L. Toward generalizability in the deployment of artificial intelligence in radiology: Role of computation stress testing to overcome underspecification. Radiol. Artif. Intell. 3, e210097 (2021).
Rajpurkar P et al. CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting. https://arxiv.org/abs/2002.11379. (2020). Accessed 12 December 2022.
World Medical Association. World medical association declaration of Helsinki: Ethical principles for medical research involving human subjects. JAMA 310, 2191–2194 (2013).
Lanca, L. & Silva, A. Digital Imaging Systems for Plain Radiography (Springer, 2013).
KCARE Reports. Technical Report 05078: Quantitative evaluation of digital detectors for general radiography. https://kcare.co.uk. (2005). Accessed 8 March 2022.
Calì, C. & Longobardi, M. Some mathematical properties of the ROC curve and their applications. Ricerche Mat. 64, 391–402 (2015).
Selvaraju RR et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision (ICCV). https://ieeexplore.ieee.org/document/8237336. (2017). Accessed 22 June 2022.
Guess, M. J. & Wilson, S. Introduction to hierarchical clustering. J. Clin. Neurophysiol. 19, 144–151 (2002).
Waskom, M. L. seaborn: Statistical data visualization. J. Open Source Softw. 6, 3021 (2021).
LeDell E, Petersen M, van der Laan M. cvAUC: Cross-Validated Area Under the ROC Curve Confidence Intervals. R package. http://CRAN.R-project.org/package=cvAUC. (2014). Accessed 14 December 2022.
Efron, B. & Tibshirani, R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1, 54–57 (1986).
Cohen JP, Hashir M, Brooks R, Bertrand H. On the limits of cross-domain generalization in automated x-ray prediction. Proceedings of the Third Conference on Medical Imaging with Deep Learning (PMLR). 121, 136–155. https://proceedings.mlr.press/v121/cohen20a. (2020). Accessed 05 December 2022.
Geirhos R et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. https://arxiv.org/abs/2002.02497. (2020). Accessed 20 November 2022.
Sitaula, C. & Hossain, M. B. Attention-based vgg-16 model for covid-19 chest x-ray image classification. Appl. Intell. 17, 1–14 (2020).
Mason, D. SU-E-T-33: Pydicom: An open source DICOM library. Med. Phys. 38, 3493–3493 (2011).
Abadi M et al. TensorFlow: Large-scale machine learning on heterogeneous systems. Python Software. https://tensorflow.org. (2015). Accessed 01 March 2022.
Chollet F Keras. Python library. https://keras.io. (2015). Accessed 01 March 2022.
Virtanen, P. et al. Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Acknowledgements
The authors would like to acknowledge the support from the project AI4EOSC ‘‘Artificial Intelligence for the European Open Science Cloud” that has received funding from the European Union’s Horizon Europe research and innovation program under grant agreement number 101058593. The authors would like to thank the institutions that participated in this study, the patients who made this research possible by providing their images, the radiologists who generously labeled the images, and CSIC (Spanish National Research Council) Global Health Platform. M.C. acknowledges the support received by the Ministry of Education of Spain (FPU grant, reference FPU21-04458).
Funding
This research work was funded by the European Commission—NextGenerationEU (Regulation EU 2020/2094), through CSIC Global Health Platform (PTI+ Salud Global), 202050E107.
Author information
Authors and Affiliations
Contributions
P.M.F-M: original idea; study design; data collection and cleaning; image labeling assistance; deep learning models training; results interpretation; elaboration of the manuscript draft; and manuscript revision. E.M.F.: original idea; study design; contribution of the data referring to the technical specifications of the radiology equipment; results interpretation. M.A.L.A.: data cleaning; elaboration of the manuscript draft; and manuscript revision. M.C.: training of deep learning models; and manuscript revision. A.P.B.: data collection and cleaning; labeling assistance. D.R.G.: results interpretation. J.V.A.: supervision of the medical part of the project and manuscript revision. L.L.I.: supervision of the deep learning part of the project and manuscript revision.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fernández-Miranda, P.M., Fraguela, E.M., de Linera-Alperi, M.Á. et al. A retrospective study of deep learning generalization across two centers and multiple models of X-ray devices using COVID-19 chest-X rays. Sci Rep 14, 14657 (2024). https://doi.org/10.1038/s41598-024-64941-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-024-64941-5