1 Introduction

In December 2019, a new viral infection caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), a single-stranded RNA virus of the β-coronavirus genus [1], was discovered in China. In March 2020, the World Health Organization (WHO) declared coronavirus disease 2019 (COVID-19) a pandemic. The disease is highly contagious, which has led to its arrival in almost all corners of the planet. To date, more than 180 million people are reported to have been infected and more than 3 million have died. This situation has caused the total or partial lockdown of many regions, leading to widespread, adverse public health, economic and social outcomes.

Isolation of positive patients is key to cutting off the chain of infection. The gold standard for diagnosing COVID-19 is the identification of viral RNA by Reverse Transcription-Polymerase Chain Reaction (RT-PCR). However, this method has some limitations, such as its modest diagnostic performance and the delay in obtaining results. For example, the method may take between 6 and 9 h to confirm infection [2]. In addition, sampling can be quite variable, depending on the site, the personnel and the viral load of the individual at the time [3]. Furthermore, the sensitivity of this test decreases if it is not applied within a specific window of time [4, 5].

The rapid spread of the coronavirus and the serious effects it causes in humans make early diagnosis of the disease imperative [6]. The fact that COVID-19 often presents with pulmonary pathology has led to a large number of studies on the utility of chest radiography to determine the presence of disease [4]. However, prestigious radiological societies have questioned the role of chest imaging alone as a diagnostic method [7, 8].

A large number of studies that apply computer vision, often based on deep learning (DL), to chest radiography (CXR) to determine the presence of disease have been reported [9]. These papers have reported performance rates much higher than those of human expert observers. One of the potential shortcomings of these techniques is the introduction of the bias referred to as “shortcut learning” [10]. That is, the models may rely on features that are not related to the pathology they are trying to classify. This bias can lead to models with very high performance rates when evaluated on sets coming from the same distribution as the training set (called independent and identically distributed, or iid). However, the same may not hold true when the model is applied to a data set that does not come from the same distribution (called out-of-distribution, or ood). In such a case, the generalizability of the model may be severely limited.
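
To make the iid/ood distinction concrete, the following self-contained sketch (ours, not drawn from any reviewed paper) trains a linear classifier on synthetic images in which a bright corner marker, standing in for a hospital text label, correlates with the class in the training source but not in an external source. The high iid accuracy and much lower ood accuracy mimic the failure mode discussed throughout this review:

```python
# Minimal demonstration of shortcut learning: a corner "hospital tag"
# correlates with the label in the training source (iid) but not in an
# external source (ood), so the learned shortcut stops working on ood data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_images(n, marker_correlates_with_label):
    """Synthetic 32x32 'radiographs': a weak genuine class signal plus a
    bright corner marker that may or may not be tied to the label."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 32, 32))
    X += y[:, None, None] * 0.05                      # weak, genuine signal
    marker = y if marker_correlates_with_label else rng.integers(0, 2, n)
    X[marker == 1, :4, :4] += 3.0                     # bright corner tag
    return X.reshape(n, -1), y

X_train, y_train = make_images(2000, marker_correlates_with_label=True)
X_iid,   y_iid   = make_images(500,  marker_correlates_with_label=True)
X_ood,   y_ood   = make_images(500,  marker_correlates_with_label=False)

clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print("iid accuracy:", clf.score(X_iid, y_iid))  # near-perfect: shortcut works
print("ood accuracy:", clf.score(X_ood, y_ood))  # close to chance: shortcut gone
```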

In this research we focus on the review of articles based on CXR, as this imaging modality is widely used in the diagnosis and follow-up of patients and has some advantages compared to CT, especially in COVID-19 positive patients, as will be explained in this report. This is a continuation of previous work [11], providing further evidence of errors and biases made by researchers in the automatic identification of COVID-19 using CXR. This report seeks to reveal the weaknesses of the models proposed so far to diagnose COVID-19 from CXR autonomously using artificial intelligence (AI). In particular, we seek to alert researchers and reviewers to the problem of shortcut learning, which has been ignored in almost all the papers reviewed in the context of COVID-19 [12], as well as in other fields of image classification [10]. Two analyses are proposed to verify whether the methods are being affected by this issue. Specifically, studies that use explainable artificial intelligence to determine the regions that contribute the most to the classification are reviewed. Similarly, works that make use of an external validation set are analyzed to determine the generalizability of methods based on DL. In addition, studies based on traditional computer vision approaches are examined.

This paper is organized as follows. Firstly, we discuss the “IMPORTANCE OF CXR AND CT IMAGING IN THE TIMELY MANAGEMENT OF COVID-19”. After that, we state the criteria of important radiological societies regarding the “USE OF CXR AND CT AS A DIAGNOSTIC METHOD FOR COVID-19”. Afterwards, in the section entitled “THE USE OF AI IN RADIOLOGICAL IMAGING” we show the performance indices achieved by radiologists and by artificial intelligence methods for classifying COVID-19, which reveal contradictions. Thereafter, the section “DEEP LEARNING TECHNIQUES AND SHORTCUT LEARNING” introduces deep learning techniques and how shortcut learning affects these methods. The next section, “EVIDENCE OF SHORTCUT LEARNING IN CXR CLASSIFICATION”, discusses studies that show the effect of this phenomenon, specifically on CXR images. One of the ways to determine the presence of shortcut learning is by using explainable artificial intelligence; these methods are reviewed in the section “EXPLAINABLE AI METHODS IN THE IDENTIFICATION OF COVID-19 USING CXR”. Another way to detect shortcut learning is by using an external dataset to validate the models; the few such studies reported in the scientific literature are discussed in the section “EXTERNAL VALIDATION SET TO DETERMINE GENERALIZATION CAPABILITY OF THE MODELS”. Because deep learning algorithms compute their own features and may therefore exacerbate the shortcut learning phenomenon, the section “BEHAVIOR OF TRADITIONAL COMPUTER VISION METHODS” discusses the results achieved using traditional computer vision methods. Subsequently, the section “DISCUSSION AND FUTURE WORK” explains the main limitations of the analyzed research and suggests ways to avoid the encountered limitations. Finally, the “CONCLUSIONS” reached in the research are provided.

2 Importance of CXR and CT imaging in the timely management of COVID-19

Undoubtedly, medical imaging of the lungs is an important tool to assist specialists, both in the management of patients with acute respiratory infections (ARIs) and in other diseases. In the case of COVID-19, studies confirm visible abnormalities in the lung region for some patients, thus serving as a decision-making tool for human specialists [13]. It is important to take into account that there are patients with a positive PCR who do not develop signs or symptoms, so it is likely impossible to make the diagnosis using CXR alone.

CT images present greater sensitivity as a diagnostic and follow-up method compared to CXR. For example, there are reported cases of COVID-19 with lesions visible on CT but not on CXR [14]. In fact, among the main CT findings in patients with COVID-19 are ground-glass opacities in the peripheral regions of the lower lobes, which may not be seen on CXR (Fig. 1). However, CT imaging capability may not be available in many medical centers where COVID-19 is diagnosed around the world. In addition, where CT equipment does exist, it is not possible to dedicate it exclusively to COVID-19 diagnosis, given the highly contagious nature of the disease and the pressure of care. On the other hand, CXR has the advantage of being available in most healthcare facilities. Its cost is much lower compared to CT imaging, and the portability of CXR can prevent the patient from moving about the medical center, thus minimizing the possibility of spreading the virus. In many instances, this makes CXR preferable, even though it may be less sensitive for diagnosis and patient follow-up.

Fig. 1 Example of a CXR image (A) and a CT image (B) for a COVID-19 positive patient. Red arrows show a lesion visible on CT but not detectable using CXR. Extracted from [15]

The most frequent findings on CXR for COVID-19 are bilateral consolidation, absence of pleural effusion, and a bilateral, peripheral ground-glass pattern in the basal lobes, which appear as the clinical disease progresses, from ten to twelve days after the onset of symptoms [16]. However, the use of this technique as a diagnostic method has shown low sensitivity and specificity in current radiological practice in asymptomatic patients and those with mild to moderate disease [17]. For example, according to Ref. [18], the sensitivity of CXR for detecting SARS-CoV-2 pneumonia is 57%. In older patients the sensitivity was slightly higher than in younger patients, but in both cases it was low. On the other hand, in [19], radiologists recorded a higher sensitivity of 65%. These values demonstrate the difficulty radiologists face in making a diagnosis of COVID-19 using CXR alone.

3 Use of CXR and CT as a diagnostic method for COVID-19

Due to the increase in COVID-19 positive cases, since March 2020 prestigious radiology organizations (the Fleischner Society [7], the American College of Radiology (ACR), the Canadian Association of Radiologists (CAR) [8], the Canadian Society of Thoracic Radiology (CSTR) and the British Society of Thoracic Imaging (BSTI) [20]) have issued recommendations on the use of CT and CXR as methods of screening, diagnosis and patient management for COVID-19. These organizations agree that chest imaging alone should not be used to diagnose COVID-19, nor should it be used routinely in all patients with suspected COVID-19. These imaging techniques should also not be used to inform the decision to test a patient for COVID-19, as normal chest imaging findings do not exclude the possibility of COVID-19 infection. In addition, abnormal chest imaging findings are not specific for the diagnosis of COVID-19. In general, chest imaging findings in COVID-19 are nonspecific and can overlap with other infections, such as influenza, H1N1, SARS and MERS. There are patients who present with a positive PCR but do not develop signs and symptoms of disease, and thus may have a normal CXR or CT; they cannot be diagnosed as positive using lung imaging alone [21]. Consequently, CXR or CT may be used to assess the status of patients at risk of disease progression and with worsening respiratory status, but should not be used as the primary diagnostic screening tool.

4 The use of AI in radiological imaging

The nonspecific, subtle and difficult-to-detect manifestations of COVID-19 on both CT and CXR make it difficult to achieve a high success rate in diagnosis. Despite this, investigators from the fields of AI, DL and computer vision have published a number of reports in this arena [9]. Articles in peer-reviewed journals have reported on automatic disease identification using CT and CXR. Such investigations should be considered with caution so as not to create false expectations, since, in many cases, the reported results far exceed those achieved by expert observers such as radiologists [22]. In general, radiologists may consider AI a useful diagnostic support tool, but are concerned that the diagnostic accuracy of these techniques alone is overstated [23].

Advances in automatic COVID-19 identification using CXR and CT imaging are reviewed in many studies [9, 24,25,26,27,28,29,30,31,32,33], which report an average accuracy of AI methods of approximately 90%. Moreover, the values reported for CXR are even higher than those for CT, at 96%. In some studies, the reported sensitivity is 100% [34, 35]. These reported values contradict, firstly, the fact that CT, in general, has higher sensitivity than CXR. Secondly, the lack of specificity of CXR has led many radiology experts to the opinion that these images should not be used as a diagnostic tool for these patients. Note that radiologists only achieve a sensitivity between 57 and 65% [19] in the review of similar cases. None of the aforementioned review papers question or analyze these high reported results. In fact, one report [36] states that its method was able to identify, with 100% effectiveness, patients presenting lesions visible on CT but not detectable on CXR.

According to one report [37], the advances achieved in the automatic classification of COVID-19 from CXR have little or no utility in clinical practice. Despite the encouraging results reported, the use of these models as specialist decision support systems must undergo more rigorous investigation and meet regional regulatory and quality control requirements. In particular, their performance must be validated and their efficacy demonstrated in the clinical workflow. Moreover, in many investigations, the image sets used were small and poorly balanced. One review of the current limitations of studies using CXR to perform diagnosis [11] indicated that the use of datasets from different sources leads to models that learn features not related to the disease they are trying to identify, i.e., they demonstrate the phenomenon known as shortcut learning.

5 Deep learning techniques and shortcut learning

Most of the techniques used in the COVID-19 automatic identification task based on CXR rely on DL, specifically on convolutional neural networks (CNNs). These approaches have achieved substantial recent success in biomedical applications [38]. CNNs specialize in classifying images autonomously, without the need for previously defined features to perform the classification, as in traditional computer vision (CV) methods. As a result, feature extraction and classification can be performed in a single stage. In short, a CNN consists of the serial connection of a feature extraction network and a classification network, and the weights of both networks are determined through the training process. The feature extraction stage contains the filters for convolution, pooling, normalization, evaluation of an activation function, and so on. Meanwhile, the fully connected layers in the last stage act similarly to a conventional Multi-Layer Perceptron (MLP) [39]. That is, a CNN in its training phase learns the coefficients that minimize the classification error, adjusting millions of parameters in the process. This explosion in the use of CNNs, even for complicated applications such as the analysis of medical images, has been made possible by the increase in computational power and capability [40, 41].
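
As a concrete illustration of this two-stage structure, the following minimal PyTorch sketch (ours; layer counts and shapes are illustrative and not taken from any reviewed model) chains a small convolutional feature extractor to an MLP-style classification head:

```python
# Minimal sketch of the two-stage CNN structure described above: a
# convolutional feature extractor followed by an MLP-style classifier.
import torch
import torch.nn as nn

class TinyCXRNet(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        # Feature extraction: convolution, normalization, activation, pooling.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
            nn.ReLU(), nn.MaxPool2d(2),
        )
        # Classification: fully connected layers acting like a conventional MLP.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A batch of four 224x224 single-channel stand-in radiographs.
logits = TinyCXRNet()(torch.randn(4, 1, 224, 224))
print(logits.shape)  # torch.Size([4, 3])
```

Training such a network end to end adjusts both stages jointly, which is precisely why the features it discovers, useful or confounded, are not under the designer's direct control.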

However, these DL methods are beginning to be evaluated critically, and some limitations have been reported. One of the difficulties studied is the bias known as shortcut learning, where models learn decision rules that perform well on standard benchmark sets but do not transfer to more difficult test conditions, such as real-world scenarios. For example, models achieve superhuman performance in object recognition, but fail under even small changes invisible to humans [42] or modifications to the image background [43, 44]. Furthermore, models can correctly classify an image but, worryingly, may do so without taking into account what actually confers that classification [45]. That is, the models use features that are capable of correctly separating the classes but are not directly related to the task at hand; for instance, they may rely on differences in the background rather than the object to be classified. Similarly, an approach may be able to recognize faces accurately, yet show high error rates for faces from marginalized groups that were not adequately represented in the training set [46]. These types of observations are beginning to cause concern in the scientific community.

Shortcut learning can present a major obstacle to achieving more reliable models. Overcoming this issue in its entirety may be exceedingly difficult, if not impossible, but any progress in mitigating it will lead to more reliable solutions. The hope is that the models behave in a similar way even in situations outside their training distribution. In other words, we want the model to generalize well beyond its training set. Currently, research on shortcut learning and its mitigation remains fragmented. Many studies do not address these limitations and do not take this important issue into account, as is evident in their research. However, others attempt to foster discussion and raise awareness of these issues among researchers, trying to make the rule out of what has so far been only the exception. For example, it is recommended [10] that results be examined carefully using explainable AI techniques. It is also proposed, as an essential rule for determining the generalization power of a model, to evaluate it on a set that does not come from any of the sources used in the training stage. These recommendations apply to the COVID-19 identification task using CXR, as will be discussed.

6 Evidence of shortcut learning in CXR classification

The use of DL methods has been extensively studied in the field of CXR imaging [30]. As mentioned above, evaluation of the generalizability of proposed methods on an ood set has been limited. However, some studies have reported evidence of the existence of shortcut learning [47,48,49]. In one case [47], irregularities were reported when training a model on one image set and evaluating it on an ood set. Specifically, given four sets A, B, C and D, it was observed that when training and evaluating on set A, the results were superior to those obtained when training on sets B, C and D and evaluating on set A. Another study demonstrated the presence of shortcut learning [50] when a model was able to identify the originating hospital with more than 95% accuracy. According to the network activation map, the model relied on the text labels on the CXR images, instead of the lung region, to achieve this result. This demonstrates that the performance of CNNs in disease diagnosis using radiographs may reflect not only their ability to identify specific disease findings in the image, but also their ability to exploit confounding information such as text labels.
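
A simple way to probe for such confounds is to train a classifier to predict the source dataset rather than the disease. The following sketch (ours, inspired by the experiment in [50]; the toy confound is a faint global intensity offset standing in for scanner or labeling differences) shows how a source-prediction accuracy well above chance reveals exploitable dataset signatures:

```python
# Source-prediction probe: if a model can tell which dataset an image came
# from, the pooled data contain confounds a disease classifier could exploit.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Source A carries a faint global intensity offset (a stand-in for scanner
# calibration or acquisition differences); source B does not.
source_a = rng.normal(0.1, 1.0, (300, 64 * 64))
source_b = rng.normal(0.0, 1.0, (300, 64 * 64))
X = np.vstack([source_a, source_b])
y = np.array([0] * 300 + [1] * 300)           # target = source, not disease

probe = LogisticRegression(max_iter=2000)
score = cross_val_score(probe, X, y, cv=5).mean()
print(f"source-prediction accuracy: {score:.2f}")  # well above 0.5 => confounds
```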

Current DL models for identifying COVID-19 using CXR do not escape shortcut learning. For example, one study [51] performed a classification with more than 90% accuracy without using the lung region, demonstrating that the models have a great deal of information to exploit that is not related to the disease manifestations in the lung region, particularly when the entire image is utilized. The same study strongly criticizes the absence of an ood set to assess generalizability. Similarly, another report [52] recognized that most published work has not performed any analysis to demonstrate the reliability of network predictions. In the context of medical tasks, this is particularly relevant. Moreover, most state-of-the-art studies have validated their results with datasets containing tens or a few hundred COVID-19 samples, which may limit the general impact of the proposed solutions. As proposed in that study, one of the ways to obtain greater reliability is to use techniques that visualize the regions on which the findings of the models are centered.

Most of the research published on the application of AI and DL in the context of COVID-19 is based on images from different sources. After the publication of the GitHub-Cohen image dataset [53], in which a set of COVID-19 positive images was made freely available to the international scientific community, there have been numerous reports applying AI techniques for automatic disease classification. To date, this has been the most widely used source of COVID-19 positive images in the scientific community. The approach used in most research to increase the number of negative (non-COVID-19) images has been to add images from sets available from other sources. A detailed explanation of the current sets, as well as their limitations, has been reported [54, 55]. In fact, one study [54] determined that only five of the 256 datasets identified met the criteria for an adequate assessment of the risk of bias. In that study, it was observed that most of the data sets used in 78 published articles are not among these five, resulting in models with a high risk of shortcut learning and other forms of bias.

7 Explainable AI methods in the identification of COVID-19 using CXR

Unlike expert human observers, automatic diagnostic methods do not naturally reveal the interpretations on which they base their decisions. One of the current lines of research is the development of explainable artificial intelligence (XAI) methods [56]. Specifically, in the field of image-based medical applications, an adequate explanation of the decision obtained is essential. That is, a decision support system should be able to suggest a diagnosis and show, to the best of its ability, what image content contributed to the decision reached by the algorithm. Such methods allow the veracity of the models to be assessed. Through these techniques it is possible to verify whether the decisions of the models are centered on regions that should be used for diagnosis. For example, is the determination of the presence of pulmonary complications from COVID-19 based on an analysis of CXR findings in the lungs?

XAI techniques have also been applied in the context of automatic detection of COVID-19 from CXR. Table 1 lists some of the papers published to date that make use of these XAI tools. As can be seen, several techniques are reported; among the most used are LIME [57], Grad-CAM [58] and Grad-CAM++ [59]. The table also records the presence of segmentation methods to determine the lung region. This is of vital importance since, as will be shown below, when more than the lung region is used, the models tend to focus on regions whose association with the disease in question is unclear. Figure 2 (extracted from [52]) shows an example of how, when using the whole image, CNNs may treat areas that are not within the lungs as the most important regions for classification. This means that there are regions that provide enough information to adequately separate the classes using features not related to the disease being classified. This is likely a case where the model is using shortcut learning.

Table 1 Main studies using XAI techniques to identify COVID-19 using CXR
Fig. 2 Activation map for a modification of the CNN COVID-Net [60], obtained with the Grad-CAM method, using the whole image to perform the classification. Image “a” belongs to the normal class, “b” to the pneumonia class and “c” to the COVID-19 class. In all cases, the regions on which the network bases its decision are outside the lungs
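
For readers unfamiliar with how such activation maps are produced, the following is a minimal sketch of the Grad-CAM mechanics [58], using PyTorch hooks on a torchvision ResNet-18. The network here is untrained and the input is random, so the sketch only illustrates the computation; in practice a CXR-trained model and a real radiograph would be used:

```python
# Minimal Grad-CAM: capture activations and gradients of the last conv block,
# weight the activation channels by the pooled gradients, ReLU and upsample.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()   # untrained stand-in for a CXR model
acts, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: acts.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)                       # stand-in image
logits = model(x)
logits[0, logits.argmax()].backward()                 # gradient of top class

w = grads["v"].mean(dim=(2, 3), keepdim=True)         # channel weights
cam = F.relu((w * acts["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
print(cam.shape)  # torch.Size([1, 1, 224, 224]) heatmap to overlay on the CXR
```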

An example of the necessity of lung segmentation is reported in one study [61], in which the resulting attention maps were evaluated by two radiologists. Reportedly, the model focused on regions outside the lungs to perform the classification in half of the cases. The recommendation put forward by the authors was to train on a much larger data set, so that the model would show more robust performance in this respect. In response to this recommendation, another study [12] used image sets with a greater number of images. The objective was to determine which regions were most used by the models to assign an image to a class. It was evident that, at times, saliency maps marked lung fields as important, suggesting that the models took into account genuine pathology of COVID-19. However, in some cases the saliency maps highlighted regions outside the lung fields that may represent confounds. For example, the saliency maps frequently highlighted laterality markers as differing between the COVID-19 negative and COVID-19 positive datasets, and similarly highlighted arrows and other annotations that appear exclusively in the GitHub-Cohen dataset. Also, by applying the CycleGAN technique, images were generated that revealed textual markings as important patterns for determining class. It is worth noting that this study made use of an external validation set, and the performance of the models decreased drastically when evaluated on this external (ood) set.

On the other hand, when lung segmentation is not applied prior to classification, regions outside the lungs remain available, leading to models that perform the classification based on features unrelated to the disease. Therefore, studies that use the complete image to perform the classification and achieve spectacular results (more than 30 percentage points above radiology specialists) are, in truth, not valid.

For example, in one report [52] the imaging regions that contributed most to the identification of COVID-19 were evaluated using three input variants. In the first experiment, the full image was used, and again it was observed that the model used regions outside the lungs to perform the classification. In the second experiment, the bounding box region of the lungs was used, and the same problem appeared. Finally, in the third experiment, an image of the segmented lungs was used, which forced the method to find features within these regions. This time the results indicated lower performance than the previous variants, demonstrating that, when using the previous variants, the models exploit features that are not related to the pathology in question.
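
The three input variants are straightforward to reproduce. The sketch below (ours; the lung mask is a toy rectangle standing in for the output of a segmentation model) constructs the full-image, bounding-box and segmented-lung inputs from a single radiograph:

```python
# Construct the three input variants compared in [52] from one image and a
# binary lung mask. The random image and rectangular mask are placeholders.
import numpy as np

image = np.random.rand(224, 224)            # stand-in CXR
mask = np.zeros_like(image, dtype=bool)
mask[40:200, 30:194] = True                 # toy "lungs" region

# Variant 1: full image, text labels and borders included.
full = image

# Variant 2: bounding box of the lung mask; peripheral markers are cropped out.
rows, cols = np.where(mask)
bbox = image[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

# Variant 3: segmented lungs; everything outside the mask is zeroed, forcing
# the classifier to rely on features within the lung fields.
segmented = np.where(mask, image, 0.0)

print(full.shape, bbox.shape, segmented.shape)
```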

Another attempt to visually assess the regions a model uses to determine class was reported in [63], showing that, when using the whole image as input, the CNNs identified regions that did not belong to the lungs as important. In this case, the models focused their attention on the text labels present on the images. As a result, the models were able to identify with high accuracy the site where the images were acquired and thus whether they were likely to be cases of COVID-19. This occurred even after applying lung segmentation. Therefore, there are hidden features in the images that can be exploited by the models to perform the classification, and these need to be handled cautiously to achieve reliable models.

In other cases, the demographic characteristics of the population can be a strong confounding factor. Several papers have used image sets where one of the classes consists of children [55]. In addition, patients with COVID-19 often showed artifacts such as electrodes and their wires, while other patients were intubated. The position of the patients can also have an effect: healthy patients were usually imaged erect in the PA view, while COVID-19 patients were more often supine, where the view is AP.

XAI methods have been used to determine the regions of the image that contribute most to the classification and thus to build more reliable models. It has become evident that, when the whole image is used, the regions marked as important may not be related to the classification label, invalidating the results achieved. On the other hand, segmentation of the lungs does not guarantee that the models truly focus on appropriate regions, as the images may contain underlying features that still inflate the apparent performance of the models. Nevertheless, as can be seen in Table 1, there are studies that report the regions on which their models base their decisions, acknowledge that these regions do not correspond to the disease they are trying to identify, and still report high classification effectiveness. Again, this is evidence of the presence of shortcut learning, as well as of the omission of this issue by the scientific community. Hence, an external evaluation set is needed as a complement, to demonstrate that the models maintain their behavior. Despite this, such an evaluation methodology is not reported in any of the studies using XAI techniques.

8 External validation set to determine generalization capability of the models

One way to reduce biases in CXR image sets is to use image processing techniques to pre-process the images before applying AI and DL methods. One approach is to automatically limit the portion of the image to be analyzed to a bounding box region enclosing the lungs. A second approach is to segment the lung region automatically. With these techniques, spurious labeling marks that could artificially assist the model with classification are removed. However, the removal of these marks does not guarantee an improvement in the model’s generalizability. One way to test the validity and generalizability of a model is to evaluate it with an external, ood data set. To date, few studies report the use of an ood validation set.

A discussion regarding validation on an external, ood image set has been reported. Table 2 presents an update of the published studies that, when evaluated on external validation sets, showed evidence of a lack of generalizability. These studies demonstrate that the algorithms learn features related to the source dataset rather than the disease they are trying to classify; that is, the studies are affected by shortcut learning. Note that the results of studies using an internal validation set report extremely good performance. However, when the external evaluation set is used, performance decreases considerably. In fact, the reported performance measures are close to those of a random classifier in most cases. Table 2 also provides links to the image sets used by these investigations, which can constitute a starting point for a more rigorous evaluation of proposed models.

Table 2 Summary of research using an external image set (ood) as a method of evaluating their models

The creation of an appropriate evaluation strategy to address such biases is imperative. In other words, making a correct assessment reveals the existence of an issue that may otherwise remain hidden. Understanding the existence of the problem is the first step towards a solution. This issue needs to be taken seriously, especially since these systems are intended for use in clinical settings for the identification of COVID-19.

9 Behavior of traditional computer vision methods

According to the review studies analyzed, the majority of investigations (27 articles) used CNNs to identify COVID-19, most commonly ResNet with different numbers of layers. DL techniques may tend to overfit the classification models because they generate their own features during the training process. Therefore, the use of traditional computer vision (CV) methods could lead to models with greater generalizability, especially when using data sets that present marked differences [55].

Traditional CV algorithms involve four main stages: 1) image preprocessing is performed by applying noise filtering, enhancement, resizing and similar techniques; 2) regions of interest are detected based on different sampling strategies or using segmentation techniques; 3) feature extraction is performed by means of a descriptor that is generally hand-constructed, e.g., SIFT [81] or Local Binary Patterns (LBP) [82], among others; 4) the features describing the image are used by automatic classification algorithms to find the boundaries separating each class. These computed features can have high dimensionality, which can degrade the performance of the methods; one way to address this problem has been through feature selection techniques [83]. A compact sketch of this pipeline is given below.
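
The following self-contained sketch (ours; the random “images” are placeholders for preprocessed, lung-cropped CXR patches, i.e., stages 1 and 2 are assumed done upstream) strings the remaining stages together using a uniform LBP histogram [82] as the hand-crafted descriptor and an SVM as the classifier:

```python
# Traditional CV pipeline sketch: hand-crafted LBP features plus an SVM.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def lbp_histogram(img, P=8, R=1.0):
    """Stage 3: hand-crafted descriptor - normalized uniform LBP histogram."""
    codes = local_binary_pattern(img, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

# Stand-in for 200 preprocessed, lung-cropped patches with toy labels.
images = (rng.random((200, 64, 64)) * 255).astype(np.uint8)
labels = rng.integers(0, 2, 200)
X = np.array([lbp_histogram(im) for im in images])

# Stage 4: classification with an SVM on the extracted features.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))  # ~chance on random data
```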

Such approaches have also been used in the COVID-19 automatic classification task using CXR. Table 3 presents a summary of some of the studies that make use of this methodology. There is a tendency in these studies to use pre-trained CNNs as the feature extraction method, although other studies use traditional descriptors such as LBP and GLCM, among others. Likewise, in one study [84], a new descriptor based on orthogonal moments is proposed. The use of algorithms for dimensionality reduction has also been studied, although it has not been common practice. A great diversity of classification methods is also observed in Table 3; Support Vector Machines (SVM) and Random Forests (RF) are the most used [85] and are reported as the best performing classifiers. The performance indices achieved are comparable to those of the CNNs analyzed in previous sections, again with values above what is reported by expert observers such as radiologists.
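
The hybrid variant common in these studies, in which a pre-trained CNN supplies the features and a classical classifier makes the decision, can be sketched as follows (ours; the backbone, weights and toy labels are illustrative, and in practice CXR-appropriate weights and real images would be used):

```python
# Hybrid pipeline sketch: a frozen CNN backbone as feature extractor, with a
# Random Forest classifier on top of the extracted deep features.
import torch
from torch import nn
from torchvision.models import resnet18
from sklearn.ensemble import RandomForestClassifier

backbone = resnet18(weights=None)           # CXR-appropriate weights in practice
backbone.fc = nn.Identity()                 # drop the classification head
backbone.eval()

with torch.no_grad():                       # frozen feature extraction
    images = torch.randn(32, 3, 224, 224)   # stand-in CXR batch
    feats = backbone(images).numpy()        # one 512-dim feature vector/image

labels = (torch.rand(32) > 0.5).long().numpy()  # toy COVID/non-COVID labels
clf = RandomForestClassifier(n_estimators=100).fit(feats, labels)
print(feats.shape, clf.score(feats, labels))
```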

Table 3 Summary of works using a traditional computer vision approach to identify COVID-19 using chest X-ray imaging

These studies have not taken into consideration the elimination of features unrelated to the disease since, in all cases, the complete image was used to extract the features. Thus, the same mistakes associated with the use of the whole image and the marks it contains can be made. Only one study addressed this issue [86]; there, a manual segmentation of the images was performed so that a bounding box region enclosed the lungs, thus eliminating the labels from the analysis. That study also addressed class imbalance issues using resampling techniques. However, the authors of that study themselves, in a new investigation [63], state that, although the experimental results achieved in [86] showed that it may be possible to identify COVID-19 using CXR, it was a challenge to ensure that patterns not belonging to the lungs did not contribute to the classification.

Finally, Table 3 also shows that none of these investigations make use of an external validation set; in all cases a partition of the training set was used. It should be noted that, in all cases, the image sets were obtained in a similar way to the studies using CNNs. That is, the image datasets present the same issues and biases discussed above.

10 Discussion and future work

Automatic COVID-19 classification using CXR imaging is an active topic in the scientific community. Most papers report high performance (Tables 1, 2 and 3). The majority of these studies use a DL approach, although the use of traditional CV methods to address the task has also been reported. In both cases, the reported results are far superior to those achieved by experienced radiologists. However, most of the studies using automated approaches utilized internationally available image sets in which the positive and negative cases may have come from different sources, so the methods may learn to recognize the source rather than the disease. This can result in a lack of generalizability of the models, as seen in Table 2.

The main concern has been the absence of a correct evaluation protocol for the proposed models. In the studies analyzed, results obtained using images that do not belong to any of the sources used in training are rarely presented. Where an ood set has been used, a notable decrease in performance has been reported. In one review [94], it was determined that none of the articles analyzed met the requirements to be considered reliable: the authors found no sufficiently documented manuscript describing a reproducible method, nor any method that follows best practices for developing a machine learning model with sufficient external validation to justify its applicability. These issues should be taken into account to ensure the development of better quality, reproducible models that are free from biases such as shortcut learning.

An important step in this process lies in the proper selection of an adequate training set by computer vision specialists working together with radiologists and medical physicists. Special care must be taken to select a training set that minimizes the potential for biases in the resulting models, such as shortcut learning. Otherwise, the models may yield good results on iid sets but poor results on ood sets, as has often been the case. In addition, one should be aware of the need to demonstrate, as well as possible, what the decisions reached by the model are based upon. This will make the decision process of AI techniques more transparent, and human specialists can learn from it. So far, it seems unlikely that CXR alone can provide an accurate diagnosis of COVID-19; indeed, radiologists typically rely on other patient characteristics and information to make a diagnosis. Thus, the combination of several clinical features seems to be the way forward to achieve a system that truly helps human specialists.

11 Conclusions

This paper reviewed the main approaches presented in the scientific literature to address automatic COVID-19 classification using CXR. According to the reviewed papers, the performance rates reported by automatic classifiers outperform human specialists by more than 30 percentage points. However, a review of published papers using XAI shows that CNNs often base their classifications on regions outside the lung area, suggesting that these networks are performing shortcut learning. One approach to test the generalizability of these models is to base the evaluation on an external, ood data set; however, this methodology has not been applied in most of the studies reviewed. In fact, the papers that have evaluated models on ood sets report performance rates close to random classification. This is evidence that the models proposed so far learn patterns that are not related to the disease they are trying to classify. That is, evaluating the performance of the models on an iid validation set (as most current benchmark tests do) is insufficient to determine the generalization power of the models. Therefore, a fundamental step in model evaluation is to require the use of an external, ood data set. Studies based on traditional computer vision methods showed the same issues as DL approaches. Hence, ood generalization tests should become the rule rather than the exception, especially in biomedical solutions, where inadequate diagnoses may negatively impact the choice of treatment for serious diseases such as COVID-19. When properly validated, AI and DL methods can provide the radiologist with valuable tools to assist in the diagnosis and classification of these diseases.