Introduction

The novel coronavirus disease 2019 (COVID-19) has become a serious health hazard across the globe. Because of its high infectivity, spreading capability, and ubiquitous nature, COVID-19 has been categorized as a pandemic by the World Health Organization (WHO). Recent reports have shown that relying on classic infection-control and public-health measures alone is not sufficient to tackle this pandemic [17, 23]. Consequently, several research initiatives have been undertaken across the globe to fight this pandemic by leveraging recent advancements in science and technology.

Fig. 1

Example showing semantic relatedness between climate type of spatial region-2 and that of the others, as per the commonality in climate type (denoted by colors)

Analyzing the impact of geographic climate variations in modulating the COVID-19 outbreak is one such research concern that has gained substantial attention, and numerous publications on the topic can be found in the literature. Nevertheless, owing to diverse screening strategies, poor quality of collected data, and an inadequate number of COVID tests, these research outcomes often contradict each other. For example, a collaborative China-USA research team has observed that low humidity, low temperature, and a mild diurnal temperature range may promote disease transmission [13], whereas another group of researchers from the USA has found that a hot and humid climate does not help control the COVID-19 outbreak [2].

In our proposed impact analysis scheme, we aim at addressing such uncertainty issues by incorporating theoretical knowledge from the domain of epidemiology and semantic knowledge from the domain of climatology. The theoretical guidance from the epidemiological model helps our impact analysis remain consistent with the underlying theory of infectious disease spread. The study of semantic relatedness in the datasets, in turn, aids in reducing uncertainty when learning causal relationships between disease development and climate variability. Unlike the majority of existing models, instead of considering individual climate factors, we deal with the overall climatic pattern of the geographical areas. Though our present approach of semantics-driven theory-guided impact analysis is influenced by our recently introduced framework [9], the two differ fundamentally in two aspects. Firstly, in contrast to [9], the semantic analysis in the presently proposed framework is probabilistically augmented to account for the real-life scenario where the same climate zone may extend over multiple spatial regions. Secondly, the impact analysis in our present scheme is performed with respect to the expected values of relative-recovered cases and per million new infected/confirmed cases over the various regions, which makes our present scheme capable of providing a more realistic view.

Motivation In essence, our research is motivated by the fact that epistemic uncertainty [7] can be reduced using more information and knowledge regarding the relevant domain. For example, added insights from mathematical models can help in handling the uncertainty that emerges from our unawareness of the basic tenets of epidemiology [16]. Further, the use of additional data samples can also help in tackling uncertainty by reducing sampling error during inference generation. In this regard, we may exploit the semantic relatedness between data collected from different spatial regions [9]. For instance, as explained in Fig. 1, the climate class (BWk) of spatial region-2 is quite similar to that of region-3 (BWh) and region-4 (BSk), since all of these belong to the arid type of climate. Accordingly, when analyzing the climatological impact on COVID-19 spread in region-2, we may utilize the data samples from region-3 and region-4 as well, along with the relevant measures of semantic relatedness. However, designing an appropriate measure of semantic relatedness becomes challenging, especially when a region contains multiple climate types in its various parts. Our earlier work [9] assumes that each spatial region belongs to strictly one climate zone. Accordingly, it cannot measure the semantic relatedness between the data collected from region-1 (climate type: BSh + Aw) and that collected from the other considered spatial regions, as depicted in Fig. 1b, c. To overcome this limitation and to make the impact analysis model more appropriate for real-life scenarios, in this work we introduce the concepts of ‘regional semantic relatedness’ and ‘semantic generic index’, which aid in enhancing the prediction model developed in [9].

Contributions It may be noted from the motivational example that the primary challenges involved in this research are, firstly, to define a suitable measure for quantifying semantic relatedness, and secondly, to develop a scheme for combining climate domain semantics and epidemiological knowledge into a data-driven model. In this work, we address both challenges by considering the real-life scenario where a spatial region may partly belong to multiple climate zones. Accordingly, the key contributions of our work are as follows.

Fig. 2

Proposed framework: overall process flow. [The red boxes indicate our major contributing steps.]

  • Exploring effective means for representing climatological domain knowledge/semantics.

  • Defining probabilistic measure of regional semantic relatedness (regSR) to account for the existence of multiple climate types within a spatial region.

  • Developing an advanced version of data-driven framework that can offer theory-guided semantics-driven analysis of climatological impact on COVID-19 spread, while considering real-life scenario for regional distribution of climate.

  • Proposing Semantic-GI as a semantics-driven generic index for analyzing the influence of climate variability on the regional outbreak of COVID-19.

The proposed model is validated using daily time series of COVID-19 data over 15 states (provinces) belonging to diverse climate regions in India. Our experimental findings indicate that dry (arid/semi-arid) climate zones, like BSh, BWh etc., are the most susceptible to COVID-19 transmission. The temperate climate zones, e.g. Cwa, are also quite vulnerable, as the daily relative-recovery (Footnote 1) in these zones is comparatively lower than that in tropical climate zones. Our study also identifies humid climate to be a principal factor favoring the daily relative-recovery in India.

Incidentally, our proposed framework is applicable not only to the present purpose of assessing climatological impact on COVID-19 spread, but also to analyzing the impact of any other categorical factor on the possible transmission of COVID-19. For example, given the classification of cities in terms of house rent allowance (HRA) grade, a similar framework can be used for analyzing the impact of population density on COVID-19 transmission on a sub-regional basis. More significantly, the proposed measures of regional semantic relatedness and Semantic-GI are generic notions, which can be utilized for semantics-driven explainable analyses in diverse application areas of machine learning and computational intelligence.

The rest of the paper is structured as follows. Section “Problem Scenario” describes the overall problem scenario. Section “Methodological Overview” discusses the methodological details of the proposed impact analysis model. Section “Experimental Evaluation” presents the experimental evaluation of our proposed model in comparison with state-of-the-art approaches. Section “Related Works” provides a summary of the various related works, and finally, we conclude in Section “Conclusions”.

Fig. 3

Semantic network corresponding to the climatological concept defined in Fig. 1a [9]

Problem Scenario

Given the daily statistics of COVID-19 case count, including confirmed cases, recovered cases, and active cases over a set of regions with known climate patterns, the prime goal of this research is to explore any possible correlation between the various climate types and the development of COVID-19 in the regions. Accordingly, we aim at answering the following research questions (RQs).

  • RQ1: Do the climate patterns of the regions have any correlation with the daily statistics of confirmed/ recovered case counts?

  • RQ2: In case such correlation is identified, which climate type(s) help(s) increasing/decreasing the infected/recovered case count the most?

By the term “climate pattern” here we indicate spatial distribution of various climate classes, including equatorial/ tropical (e.g. ‘Am’, ‘As’ etc.), arid (e.g. ‘BWh’, ‘BWk’ etc.), and temperate (e.g. ‘Cwa’, ‘Cfa’ etc.) etc., as defined in [12] (refer Fig. 1a).

Methodological Overview

An overview of the process flow within our proposed framework is shown in Fig. 2. The overall process comprises five major activities: (i) representing climate domain semantics, (ii) estimating semantic relatedness on a regional basis, (iii) modeling the probabilistic relationship between COVID development and regional climate types, based on enhanced semantics-driven theory-guided analysis, (iv) conducting predictive analysis for COVID-19 cases, and (v) assessing climatological impact on the disease outbreak. The various steps are discussed in the subsequent subsections. Both research questions (RQ1 and RQ2) are primarily answered in the final step.

Representing Climate Domain Semantics

The objective here is to represent the climatological domain knowledge in the form of a semantic graph/network, which can be utilized in the subsequent step to reduce the epistemic uncertainty in learning the probabilistic relationship between regional variability of climate types and COVID development.

Formation of Semantic Network

The typical semantic network used in our proposed model is shown in Fig. 3. The network is generated by combining semantic hierarchies rooted at the various major concepts related to climate pattern, namely precipitation (P), main climate (MC), and temperature (T), as mentioned in Köppen Geiger climate classification [12]. More specific concepts (such as ‘summer dry’(s), ‘desert’ (W), ‘hot arid’ (h), ‘cold arid’ (k), etc.), which can be directly related to any of these root concepts, are represented by the intermediate nodes in the network. Finally, the exact climate classes (such as As, BWh, Cwa, etc.) which are indirectly related to multiple root concepts via intermediate ones, are represented by the leaf nodes in the semantic network (refer Fig. 3).

Measuring Semantic Relatedness

To measure the semantic relatedness (SR) between any pair of climate classes (say \(CT_1\) and \(CT_2\)), the conceptual relationship as represented through the semantic network is utilized [9]. Formally, SR can be presented as follows.

$$\begin{aligned} SR\left( CT_1,CT_2\right) =e^{-\lambda \cdot \mathrm{max}_{l}}\cdot \frac{e^{\delta \cdot {D\cdot \mathrm{max}_d}}-e^{-\delta \cdot {D\cdot \mathrm{max}_d}}}{e^{\delta \cdot {D\cdot \mathrm{max}_d}}+e^{-\delta \cdot {D\cdot \mathrm{max}_d}}}. \end{aligned}$$
(1)

Here, \(\mathrm{max}_l={\rm max}(l_1,\ldots ,l_\mathcal {R})\) represents the maximum of the shortest path lengths \(l_1,l_2,\ldots , l_\mathcal {R}\) between \(CT_1\) and \(CT_2\) via each of the \(\mathcal {R}\) root concepts; \(\mathrm{max}_d={\rm max}(d_1,\ldots ,d_\mathcal {R})\) represents the maximum of the subsumer depths \(d_1,d_2,\ldots , d_\mathcal {R}\) relevant to \(CT_1\) and \(CT_2\), measured with regard to each of the \(\mathcal {R}\) root concepts. \(\lambda\) and \(\delta\) are scaling parameters that adjust the contributions of \(\mathrm{max}_l\) and \(\mathrm{max}_d\), respectively. D represents the degree of conceptual overlap, defined as follows [9].

$$\begin{aligned} D= {\left\{ \begin{array}{ll} 2, & \text {if } CT_1 \text { and } CT_2 \text { overlap in terms of } MC\\ 1, & \text {if the overlap is in terms of } P \text { and/or } T\\ 0, & \text {if } CT_1 \text { and } CT_2 \text { have no overlapping concept}. \end{array}\right. } \end{aligned}$$
(2)

Definition 1

Semantic relatedness \(SR(CT_i,CT_j)\) is a quantitative measure of commonality between climate type \(CT_i\) and \(CT_j\), computed with respect to the semantics of main climate as well as temperature and precipitation pattern. To be noted, given the underlying semantics in the form of hierarchical representation, the SR measure can also be applied on any pair of concepts from domains beyond climatology.
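
As a minimal sketch of Eqs. 1 and 2, the following Python snippet evaluates SR for a given path length and subsumer depth. The fraction in Eq. 1 is the hyperbolic tangent of \(\delta \cdot D\cdot \mathrm{max}_d\), so `math.tanh` is used directly. The function names, the default values of `lam` and `delta`, and the letter-matching rule used to derive D from Köppen–Geiger codes are illustrative assumptions, not settings taken from [9].

```python
import math


def overlap_degree(ct1: str, ct2: str) -> int:
    """Degree of conceptual overlap D (Eq. 2).

    Assumption: climate classes are Köppen-Geiger codes whose first letter is
    the main climate (MC) and whose remaining letters encode the
    precipitation (P) and temperature (T) sub-types.
    """
    if ct1[0] == ct2[0]:
        return 2  # overlap in terms of main climate
    if set(ct1[1:]) & set(ct2[1:]):
        return 1  # overlap in terms of precipitation and/or temperature
    return 0      # no overlapping concept


def semantic_relatedness(max_l: float, max_d: float, d: int,
                         lam: float = 0.2, delta: float = 0.6) -> float:
    """SR of Eq. 1: exp(-lambda * max_l) * tanh(delta * D * max_d).

    lam and delta are hypothetical scaling values chosen for illustration.
    """
    return math.exp(-lam * max_l) * math.tanh(delta * d * max_d)


# Hypothetical path length and subsumer depth for the pair (BWk, BWh).
d_overlap = overlap_degree("BWk", "BWh")
print(d_overlap, semantic_relatedness(max_l=2, max_d=1, d=d_overlap))
```

With D = 0 the tanh term vanishes, so two climate classes without any shared concept always receive SR = 0 regardless of the path length.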

Measuring Semantic Relatedness on Regional Basis

In this stage, we aim at determining the semantic relatedness of data collected from various spatial regions belonging to different climate types. As per the semantic relatedness measure defined above, the semantic relatedness between data collected from a spatial region r under climate type \(CT^{r}\) and that collected from regions under climate type \(CT_j\) can be estimated as regSR \((r,CT_j)=SR(CT^r,CT_j)\). Note that here we use a subscript to denote a specific climate type and a superscript to denote the spatial region with which the climate type is associated. However, it is quite common that a large spatial region is covered by multiple climate zones in its various parts. That means \(CT^r\) can be a set of climate types, and therefore we may define it as \(CT^r=\left\{ CT^r_1,CT^r_2, \ldots \right\}\). In such a case, the semantic relatedness must also consider the percentage of area covered by each climate type within the region. The higher the covered area percentage, the higher the probability of the data being collected from that particular climate zone. Accordingly, to handle the presence of multiple climate types within a region, we enhance the regional semantic relatedness measure in the following manner.

$$\begin{aligned} regSR(r,CT_j)= & {} \frac{1}{(n-1)}\sum _{r'(\ne r)} \mathrm{max}\left( P\left( CT^{r}_i\right) \cdot P\left( CT^{r'}_j\right) \cdot SR\left( CT^r_i,CT^{r'}_j\right) \right) ,\nonumber \\&\forall CT^{r}_i \in CT^{r}, \end{aligned}$$
(3)

where \(P(CT^{r}_i)\) denotes the probability of presence of a climate type \(CT_i\) (\(i=1,2,\ldots\)) in a region r, \(P(CT^{r'}_j)\) denotes the probability of presence of a climate type \(CT_j\) in region \(r'(\ne r)\), and n is the total number of spatial regions (including r). The \(P(CT^{r}_i)\) can be estimated as the area covered by climate type \(CT_i\) per unit area of the region r. Similarly, \(P(CT^{r'}_j)\) can be estimated as the area covered by climate type \(CT_j\) per unit area of the region \(r'\).
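
The computation in Eq. 3 can be sketched as follows. This is only an illustration under stated assumptions: the region names, area fractions, and SR values below are made-up toy numbers, and `reg_sr` and `sr` are hypothetical helper names.

```python
def reg_sr(r, ct_j, climate_share, sr):
    """Regional semantic relatedness regSR(r, CT_j) following Eq. 3.

    climate_share[region][climate] is the area fraction P(CT^region_climate);
    sr(a, b) returns the semantic relatedness SR(a, b) of Eq. 1.
    """
    others = [q for q in climate_share if q != r]  # the n - 1 regions r' != r
    total = 0.0
    for q in others:
        p_j = climate_share[q].get(ct_j, 0.0)  # P(CT^{r'}_j), 0 if absent
        # Maximum over the climate types CT^r_i present in region r.
        total += max(p_i * p_j * sr(ct_i, ct_j)
                     for ct_i, p_i in climate_share[r].items())
    return total / len(others)


# Toy example (all numbers are hypothetical).
climate_share = {
    "region-1": {"BSh": 0.6, "Aw": 0.4},
    "region-2": {"BWk": 1.0},
    "region-3": {"BWh": 1.0},
}
toy_sr = {frozenset(("BSh", "BWh")): 0.5, frozenset(("Aw", "BWh")): 0.1}


def sr(a, b):
    return toy_sr[frozenset((a, b))]


print(reg_sr("region-1", "BWh", climate_share, sr))
```

In this toy setting only region-3 contains BWh, so region-2 contributes zero and the result is the region-3 term (the larger of 0.6 x 0.5 and 0.4 x 0.1) divided by n - 1 = 2.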

Definition 2

Regional semantic relatedness regSR \((r,CT_i)\) is a quantitative measure of semantic relatedness between the data collected from region r and that collected from regions having climate type \(CT_i\). The \(regSR(r,CT_i)\) assumes that the region r may be associated with multiple climate types which may or may not include \(CT_i\). Moreover, here, the data collected from \(CT_i\) may be associated with multiple regions.

Semantic Analysis of Relationship Between Climate Type and COVID Case Development

This step aims at modeling the causal influence of climate variability on the dynamics of new confirmed cases, active cases, and recovered cases of COVID-19. In this context, we employ a semantically enhanced Bayesian network [6], where the Bayesian model handles uncertainty through its probabilistic analysis for learning causal relationships. The incorporation of climate domain semantics further helps in tackling the uncertainty by reducing sampling error during the inference generation step. The Bayesian network structure used in our model is shown in Fig. 4. This directed acyclic graph primarily represents the causal dependency among climate type (CT), confirmed case (CC), recovered case (RC) and active case (AC). To generate this dependency structure, we employed structure learning based on a combination of HPC (hybrid parents and children) and pairwise mutual information algorithms [10, 14].

Fig. 4

Causal dependency between climate type and COVID-19 case counts

Semantically-Enhanced Bayesian Modeling of Causal Relationships

Given the causal dependency graph, the conditional probability distributions for confirmed case count (CC), recovered case count (RC) and active case count (AC) in any spatial region r for each timestamp can be obtained through semantic Bayesian analysis as follows.

$$\begin{aligned} P^{\dagger }\Big (CC|CT^r\Big )=\psi \cdot \Big (\sum _{j}{regSR\Big (r,CT_j\Big )\times P\Big (CC|CT_j\Big )\Big )} \end{aligned}$$
(4)
$$\begin{aligned} P^{\dagger }\Big (RC|CC,CT^r\Big )=\nonumber \\ \psi \cdot \Big (\sum _{j}{regSR\Big (r,CT_j\Big )\times P\Big (RC|CC,CT_j\Big )\Big )} \end{aligned}$$
(5)
$$\begin{aligned} P^{\dagger }\Big (AC|CC,RC,CT^r\Big )=\nonumber \\ \psi \cdot \Big (\sum _{j}{regSR\Big (r,CT_j\Big )\times P\Big (AC|CC,RC,CT_j\Big )\Big )}, \end{aligned}$$
(6)

where,

$$\begin{aligned} P\Big (CC|CT_j\Big )=\, & {} \frac{1}{\sigma _{CC}\sqrt{2\pi }}e^{-\frac{1}{2}\left( \frac{CC-\theta _{0j}}{\sigma _{CC}}\right) ^2} \end{aligned}$$
(7)
$$\begin{aligned} P\Big (RC|CC,CT_j\Big )=\, & {} \frac{1}{\sigma _{RC}\sqrt{2\pi }}e^{-\frac{1}{2}\left( \frac{RC-(\theta _{2j}\cdot CC+\theta _{1j})}{\sigma _{RC}}\right) ^2} \end{aligned}$$
(8)
$$\begin{aligned} P\Big (AC|CC,RC,CT_j\Big )=\, & {} \frac{1}{\sigma _{AC}\sqrt{2\pi }}e^{-\frac{1}{2}\left( \frac{AC-(\theta _{5j}\cdot CC+\theta _{4j}\cdot RC+\theta _{3j})}{\sigma _{AC}}\right) ^2}. \end{aligned}$$
(9)

In Eqs. 4–9, \(\psi\) denotes the normalization constant, \(CT_j\) denotes the j-th climate type, and \(regSR(r,CT_j)\) denotes the semantic relatedness between COVID data collected from region r and those collected from regions having climate type \(CT_j\); the \(\sigma _{CC}\), \(\sigma _{RC}\), and \(\sigma _{AC}\) are the standard deviations corresponding to confirmed case count, recovered case count, and active case count, respectively; the \(\left\{ \theta _{0j}\right\}\), \(\left\{ \theta _{1j}, \theta _{2j}\right\}\), and \(\left\{ \theta _{3j},\theta _{4j},\theta _{5j}\right\}\) are parameters regulating the means of confirmed case count, recovered case count, and active case count, respectively. These parameters can be estimated through maximum likelihood analysis using the Expectation Maximization (EM) algorithm [11].
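
To make Eqs. 4 and 7 concrete, the sketch below evaluates the semantically weighted density of the confirmed case count for one region. It assumes that the normalization constant \(\psi\) is the reciprocal of the sum of the regSR weights, which is one natural choice that makes the mixture integrate to one; the text above does not spell out \(\psi\), so this is an assumption. All parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Normal density used in Eqs. 7-9."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def semantic_cc_density(cc, reg_sr_r, theta0, sigma_cc):
    """P-dagger(CC | CT^r) of Eq. 4, with P(CC | CT_j) taken from Eq. 7.

    reg_sr_r[j] holds regSR(r, CT_j); theta0[j] holds theta_{0j}.
    Assumption: psi = 1 / sum_j regSR(r, CT_j).
    """
    psi = 1.0 / sum(reg_sr_r.values())
    return psi * sum(w * gaussian_pdf(cc, theta0[j], sigma_cc)
                     for j, w in reg_sr_r.items())


# Hypothetical weights and climate-specific means for one region.
reg_sr_r = {"BSh": 0.30, "Aw": 0.10}
theta0 = {"BSh": 120.0, "Aw": 80.0}
print(semantic_cc_density(100.0, reg_sr_r, theta0, sigma_cc=15.0))
```

Eqs. 5 and 6 follow the same pattern; only the mean changes to the linear forms \(\theta _{2j}\cdot CC+\theta _{1j}\) and \(\theta _{5j}\cdot CC+\theta _{4j}\cdot RC+\theta _{3j}\).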

Table 1 Symbols and notations used in the Section “Methodological Overview”

Theoretical Analysis for Data Sample Generation

It is worth noting that every epidemic/pandemic, like COVID-19, has a particular temporal development pattern which is governed by several factors, including susceptible population size and the contagiousness or transmissibility of the disease. Accordingly, if the infected case counts predominantly show an upsurge during the initial phase, then purely data-driven techniques, including dynamic Bayesian models, cannot properly anticipate the declining trend of the infected case count in the long run. It is therefore necessary to provide theoretical guidance to the data-driven models so that the learnt parameters remain consistent with the underlying physics of epidemiological development.

To incorporate theoretical knowledge, we first utilize the kinetic scheme defined by the Kermack-McKendrick SIR model [22]. This epidemiological model mathematically expresses (refer Eqs. 10–12) the dynamic interaction among the susceptible (S), infected (I), and recovered/removed (R) fractions of the population in a region. Subsequently, we follow this system of differential equations to generate training samples of COVID-19 cases, ensuring that the parameters learnt through our semantic Bayesian analyses are compatible with the underlying theory of epidemiology.

$$\begin{aligned} \frac{\mathrm{d}S}{\mathrm{d}t}=\, & {} -\beta S I \end{aligned}$$
(10)
$$\begin{aligned} \frac{\mathrm{d}I}{\mathrm{d}t}=\, & {} \beta S I - \gamma I \end{aligned}$$
(11)
$$\begin{aligned} \frac{\mathrm{d}R}{\mathrm{d}t}= & {} \gamma I. \end{aligned}$$
(12)

In Eqs. (10)–(12), \(\beta\) and \(\gamma\) indicate the effective contact rate and the mean recovery rate, respectively, and t indicates the time. Given the new confirmed case count (CC) and the new recovered case count (RC) as recorded at every time step t, R(t) can be estimated as cusum(RC), and I(t) can be estimated as \(\left( cusum(CC)-cusum(RC)\right)\) for each t, where cusum() is the function to compute the cumulative sum [9]. As per the SIR model, \(S(t)+I(t)+R(t)\) is assumed to remain constant for all t, and the sum is equal to the population size of the region. Note that, to be consistent with the terminology used by the Mathematical Association of America (MAA), we sometimes use the word “Recovered” for ‘R’, although it includes both recovered and death case counts.
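
For illustration, Eqs. 10–12 can be integrated numerically with a simple forward Euler scheme, as sketched below. The compartments are expressed as population fractions (so that S + I + R = 1), and the values of `beta`, `gamma`, the initial infected fraction, and the step size are hypothetical choices rather than fitted values. The `cusum` helper mirrors the estimation of R(t) and I(t) from daily counts described above; the daily counts used here are invented.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, steps_per_day=10):
    """Forward Euler integration of the SIR system (Eqs. 10-12).

    Returns one (S, I, R) tuple per day, including day 0.
    """
    dt = 1.0 / steps_per_day
    s, i, r = s0, i0, r0
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i * dt   # -dS
            new_recoveries = gamma * i * dt      # dR
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


def cusum(values):
    """Cumulative sum, as used to estimate R(t) and I(t) from daily counts."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out


# Hypothetical parameters: beta = 0.3, gamma = 0.1.
traj = simulate_sir(beta=0.3, gamma=0.1, s0=0.9999, i0=0.0001, r0=0.0, days=200)
peak_day = max(range(len(traj)), key=lambda t: traj[t][1])
print("infected fraction peaks on day", peak_day)

# Estimating R(t) and I(t) from (invented) daily new confirmed/recovered counts.
daily_cc, daily_rc = [5, 8, 13, 21], [0, 1, 3, 6]
r_t = cusum(daily_rc)
i_t = [c - r for c, r in zip(cusum(daily_cc), r_t)]
print(r_t, i_t)
```

Because every amount leaving one compartment is added to another, the scheme preserves S + I + R up to floating-point rounding, matching the constancy assumption stated above.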

Fig. 5

Typical example of theoretically derived temporal pattern of COVID cases in Maharashtra, India: a overall development of susceptible, recovered, and infected cases, b development patterns of new confirmed and new recovered cases, c development pattern of latest active cases

Typical pattern of temporal development of COVID-19 cases, as obtained by employing SIR model, is shown in Fig. 5a. Our framework utilizes these theory-governed temporal distributions of active cases, new confirmed cases, and new recovered cases (refer Fig. 5b, c) to produce appropriate training samples that can eventually help the parameter learning process to remain congruous with the physical understanding of COVID-19 spread. Note that, despite the availability of various enhanced versions of SIR model, we choose the basic one for our epidemiological modeling, since the recent findings [19] demonstrate that this simplest version can also adequately model COVID-19 dynamics.

Predictive Analysis for COVID-19 Cases

After the causal relationships are learnt, the semantically-enhanced theory-guided model can be used for inferring the COVID-19 case counts (confirmed case count, active case count and recovered case count) given the evidence on climate variability in a region r. For this purpose, our framework employs semantic Bayesian inference generation [6, 8] in following manner.

$$\begin{aligned} P^{\dagger } \left( CC|CT^r \right) = \sum _{RC}\sum _{AC} \left\{ P\left( CT^r\right) .P^{\dagger }\left( CC|CT^r\right) .\right. \nonumber \\ \left. P^{\dagger }\left( RC|CC,CT^r \right) .P^{\dagger } \left( AC|CC,RC,CT^r \right) \right\} \end{aligned}$$
(13)
$$\begin{aligned} P^{\dagger } \left( RC|CT^r \right) = \sum _{CC}\sum _{AC} \left\{ P\left( CT^r \right) .P^{\dagger } \left( CC|CT^r \right) .\right. \nonumber \\ \left. P^{\dagger } \left( RC|CC,CT^r \right) .P^{\dagger } \left( AC|CC,RC,CT^r \right) \right\} \end{aligned}$$
(14)

Here, \(P(CT^r)\) denotes the marginal probability of presence of various climate types in region r, and this can be estimated as \(\sum _i P(CT^r_i)\), where \(CT^r_i \in CT^r\) (refer Section “Measuring Semantic Relatedness on Regional Basis”). Note that this semantically enhanced inference generation for COVID-19 cases mitigates the uncertainty arising from sample scarcity, and also maintains the theoretical guidelines which are utilized when estimating \(P^{\dagger }(RC|CC,CT^r)\), \(P^{\dagger }(AC|CC,RC,CT^r)\) etc., in the parameter learning phase.

Assessing Climatological Impact on COVID-19 Outbreak

This is the final step, which aims at assessing whether the regional climate variability has any correlation with the patterns of confirmed and recovered case development for COVID-19. Since our framework deals with the categorical values of climate types instead of continuous climatic factors, we use the ANOVA test [21] for analyzing the correlation with climate variability.
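
As a reminder of what the one-way ANOVA test computes, the sketch below derives the F statistic (between-group mean square divided by within-group mean square) for case counts grouped by climate type. The grouping and the numbers are invented for illustration, and the p-value lookup from the F distribution is omitted.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))


# Invented per-million case counts for three hypothetical climate groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(groups))
```

A large F value relative to the F distribution with (k - 1, N - k) degrees of freedom suggests that the mean case count differs across climate types.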

Additionally, we propose Semantic-GI as a generic index (correlation measure) so as to utilize the underlying semantics from climate domain. The Semantic-GI is measured as follows.

$$\begin{aligned} \text {Semantic-GI }=\frac{n\cdot \sum _{r=1}^{n}\sum _{q=1}^{n}sw_{rq}\cdot (x_r-x')(x_q-x')}{\left( \sum _{r=1}^{n}\sum _{q=1}^{n}sw_{rq}\right) \cdot \left( \sum _{r=1}^{n}(x_r-x')^2\right) }, \end{aligned}$$
(15)

where n is the total count of study regions considered (these belong to diverse climate zones), \(x'\) is the mean of COVID case counts (confirmed case count or recovered case count) per million individuals over all the regions, \(x_r\) is the COVID case count per million individuals in a particular spatial region r, and \(sw_{rq}\) is the semantic weight between region r and region q. The semantic weight between any pair of regions r and q associated with climate classes \(CT^r\) and \(CT^q\) is defined as follows.

$$\begin{aligned} sw_{rq}={\left\{ \begin{array}{ll} 1, & \text {if } r=q \\ \mathrm{max}\left( P\left( CT^{r}_i\right) \cdot P\left( CT^{q}_j\right) \cdot SR\left( CT^r_i,CT^{q}_j\right) \right) ,\ \forall CT^{r}_i \in CT^{r},CT^{q}_j \in CT^{q}, & \text {otherwise}. \end{array}\right. } \end{aligned}$$
(16)

In case each of the considered spatial regions (r and q) is associated with a single climate type, i.e., if \(CT^r\) and \(CT^q\) contain only one element each, then Eq. 16 simplifies to \(sw_{rq}=SR(CT^r,CT^q)\).

It can be seen from Eq. 15 that, similar to the Semantic-I measure [9], our presently proposed Semantic-GI is also founded on the concept of semantic auto-correlation. However, our computation of the semantic weight (sw) takes into account the presence of multiple climate types within each spatial region (refer Eq. 16), which makes our model more appropriate for real-world scenarios. The proposed Semantic-GI can be used to analyze whether the COVID-19 case counts associated with regions of semantically related climate types form a cluster pattern or a disperse pattern. Specifically, a positive value of Semantic-GI indicates a cluster pattern, whereas a negative value indicates a disperse pattern. In the same way as defined for Moran’s index of spatial auto-correlation [5], the significance of the Semantic-GI value can be quantified through a Z-test. If Z-score \(>2.58\), it can be claimed with 99% confidence that there is a cluster pattern, whereas if Z-score \(<-2.58\), it can be claimed with the same level of confidence that there is a disperse pattern. Otherwise, the pattern is random. Together, the ANOVA test and the Semantic-GI test help answer RQ1.
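
Eqs. 15 and 16 can be sketched in a few lines of Python. The snippet is a minimal illustration: the function names are hypothetical, the SR values and area fractions are toy numbers, and the Z-score computation is left out.

```python
def semantic_weight(share_r, share_q, sr):
    """sw_rq of Eq. 16 for two distinct regions r and q (sw_rr is 1 by definition).

    share_r[ct] and share_q[ct] are area fractions P(CT^r_i) and P(CT^q_j);
    sr(a, b) returns SR(a, b).
    """
    return max(p_i * p_j * sr(ct_i, ct_j)
               for ct_i, p_i in share_r.items()
               for ct_j, p_j in share_q.items())


def semantic_gi(x, sw):
    """Semantic-GI of Eq. 15.

    x[r] is the case count per million in region r; sw[r][q] is the
    semantic weight between regions r and q.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    numerator = n * sum(sw[r][q] * dev[r] * dev[q]
                        for r in range(n) for q in range(n))
    denominator = sum(sum(row) for row in sw) * sum(d * d for d in dev)
    return numerator / denominator


# Toy SR values (hypothetical) and a two-region weight.
toy_sr = {frozenset(("BSh", "BWh")): 0.5, frozenset(("Aw", "BWh")): 0.1}
w = semantic_weight({"BSh": 0.6, "Aw": 0.4}, {"BWh": 1.0},
                    lambda a, b: toy_sr[frozenset((a, b))])
print(w)

# Three regions whose weights are all equal to 1: no semantic structure.
print(semantic_gi([1.0, 2.0, 3.0], [[1, 1, 1], [1, 1, 1], [1, 1, 1]]))
```

When all weights are equal, the deviations from the mean cancel in the numerator and the index is zero; weights concentrated on regions with similar case counts push the index towards positive values.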

To resolve RQ2, our framework draws more specific conclusions regarding the influence of climate variability. Accordingly, the Semantic-GI analysis is followed by a comparative study of the semantically averaged case count (daily relative-recovered case count and daily new confirmed case count) for each climate type. The semantically averaged case count (\(sAvg_i\)) corresponding to a climate type \(CT_i\) is measured with consideration to the presence of multiple climate types within each spatial region, in the following manner.

$$\begin{aligned} sAvg_i=\alpha _1\cdot \sum _j \left( SR(CT_i,CT_j) \times regsAvg_j\right) , \end{aligned}$$
(17)

where \(\alpha _1=\frac{1}{\sum _j SR(CT_i,CT_j)}\) is the normalization constant, and \(regsAvg_j\) is the semantically averaged case count corresponding to a climate type \(CT_j\) over all the considered regions r. The \(regsAvg_j\) can be mathematically presented as follows, where \(x_r\) indicates the latest case count (daily new confirmed cases or relative-recovered cases) per million people in the r-th spatial region, and \(\alpha _2=\frac{1}{\sum _r regSR(r,CT_j)}\) is the normalization constant.

$$\begin{aligned} regsAvg_j=\alpha _2\cdot \sum _r \left( regSR(r,CT_j) \times x_r\right) . \end{aligned}$$
(18)

Once the \(sAvg_i\) for each climate type \(CT_i\) is estimated, the values can be plotted to draw conclusions on which particular climate type has a higher/lower impact on the development of COVID-19. The qualitative estimate of ‘higher’ or ‘lower’ can be decided based on the expected value (refer Table 2). For the confirmed case, the expected value is computed considering all the spatial regions, irrespective of the climate types. For the relative-recovered case, on the other hand, the expected value is 1, indicating \(\frac{\text {daily recovered case count}}{\text {daily confirmed case count}}=1\).
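
The two-level weighted averaging of Eqs. 17 and 18 can be sketched as below. The region names, climate labels "A" and "B", regSR values, SR values, and case counts are all invented toy inputs, and the function names are hypothetical.

```python
def regs_avg(ct_j, x, reg_sr):
    """regsAvg_j of Eq. 18: regSR-weighted mean of x_r over all regions r.

    x[r] is the case count per million in region r;
    reg_sr[r][ct] is regSR(r, ct).
    """
    weights = {r: reg_sr[r][ct_j] for r in x}
    return sum(weights[r] * x[r] for r in x) / sum(weights.values())


def s_avg(ct_i, climate_types, x, reg_sr, sr):
    """sAvg_i of Eq. 17: SR-weighted mean of regsAvg_j over climate types j."""
    weights = {ct_j: sr(ct_i, ct_j) for ct_j in climate_types}
    return (sum(w * regs_avg(ct_j, x, reg_sr) for ct_j, w in weights.items())
            / sum(weights.values()))


# Toy inputs (all values hypothetical).
x = {"R1": 100.0, "R2": 200.0}
reg_sr = {"R1": {"A": 0.75, "B": 0.25}, "R2": {"A": 0.25, "B": 0.75}}


def sr(a, b):
    return 1.0 if a == b else 0.5


print(regs_avg("A", x, reg_sr), regs_avg("B", x, reg_sr))
print(s_avg("A", ["A", "B"], x, reg_sr, sr))
```

Since both steps are normalized weighted averages, each \(sAvg_i\) stays within the range of the observed regional counts, which keeps the comparison against the expected value in Table 2 meaningful.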

Table 2 Qualitative estimate of COVID-19 case development with respect to expected value

The various symbols used in this section (Section “Methodological Overview”) are summarized in Table 1.

Fig. 6

Various Indian states (along with state codes) considered in the present case study. The color codes follow the Köppen–Geiger classification of regional climate [12]

Experimental Evaluation

This section evaluates our proposed impact analysis framework with consideration to the COVID-19 spread scenario in India, which is presently found to be one of the most adversely affected countries in the world.

Table 3 Summary of the considered states in India

Dataset and Study Area

The experimentation is carried out using daily data (Footnote 2) on COVID-19 case counts. This includes active cases, new confirmed cases, and recovered cases over 15 different states in India (Fig. 6). These states belong to various climate zones. Moreover, a single state may have multiple climate zones in its various parts, which can be handled by our model. The details of all the considered states are presented in Tables 3 and 4, which show that the considered states are associated with 6 different climate types/classes: Am, As, Aw, BSh, BWh, and Cwa [12]. The semantic relatedness (SR) among these climate types, and the regional semantic relatedness (regSR) with various climate types, have been calculated as per our proposed approach and are summarized in Tables 5 and 6. The entire experiment is carried out considering the daily time series of active, confirmed, and recovered case counts from mid-March 2020 to mid-November 2020.

Baselines and Experimental Set-Up

Since there is no agreed gold standard in the literature for assessing climatological impact on the COVID-19 outbreak, we primarily compare only the prediction power of our enhanced semantics-driven approach with the existing linear regression (LR) [1] and nonlinear regression (NLR) [13] models. We also consider our recently introduced semantically-enhanced theory-guided model (SETG) [9] as one of the baselines, since our presently proposed enhanced semantics-driven theory-guided model is inspired by SETG.

The proposed enhanced semantics-driven theory-guided predictive analysis and all the baselines are executed in the R-tool (Footnote 3) (version 4.0.0) on a Windows 64-bit OS (3.1 GHz CPU and 4 GB RAM). The SIR-based theoretical modeling of COVID-19 and the structural learning of the Bayesian network have been conducted using the ‘SimInf’ and ‘bnlearn’ packages of the R-tool.

Table 4 Area of different major climate types per unit area in various Indian states
Table 5 Semantic relatedness (SR) between various climate types

Performance Metrics

The prediction performance has been measured in terms of root mean squared error (RMSE) and mean absolute error (MAE) [20], as defined below.

$$\begin{aligned} \mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum ^{N}_{i=1}{\left( \mathcal {V}_{o_i}-\mathcal {V}_{p_i}\right) ^2}} \end{aligned}$$
(19)
$$\begin{aligned} \mathrm{MAE} &= \frac{1}{N}\sum ^{N}_{i=1}{\left| \mathcal {V}_{o_i}-\mathcal {V}_{p_i} \right| }. \end{aligned}$$
(20)
where \(\mathcal {V}_{o_i}\) denotes the COVID-19 case count (confirmed or recovered) actually observed on the i-th day of prediction, \(\mathcal {V}_{p_i}\) is the corresponding predicted value, and N is the number of days in the prediction period. In our experimental study, the prediction is made for the succeeding 2 months, based on the daily observed data till 17-Sep-2020. Further, apart from the comparative study with respect to prediction performance, we also perform impact analysis using the ANOVA test as well as the proposed Semantic-GI test.

Table 6 Regional semantic relatedness (regSR) with data collected from various climate zones in India

Fig. 7 Observed vs. predicted count of daily new confirmed COVID-19 cases in some specific states in India

Fig. 8 Observed vs. predicted count of daily new recovered COVID-19 cases in some specific states in India
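The RMSE and MAE definitions in Eqs. (19) and (20) can be sketched as follows. The function names and the sample observed/predicted values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error over N paired daily counts (Eq. 19)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error over N paired daily counts (Eq. 20)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


# Made-up daily counts for four prediction days (not real data).
observed = [100, 120, 130, 150]
predicted = [110, 115, 140, 145]

rmse_value = rmse(observed, predicted)  # sqrt((100 + 25 + 100 + 25) / 4)
mae_value = mae(observed, predicted)    # (10 + 5 + 10 + 5) / 4
```

Because RMSE squares each error before averaging, it penalizes large single-day misses more heavily than MAE, which is why the two metrics are reported together.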

Results and Discussions

The results of comparative prediction performance are presented in Tables 7, 8 and in Figs. 7, 8, whereas the summary of impact analysis is presented through Table 9 and in Figs. 9, 10. Our interpretations from the results are discussed below.

Table 7 Comparative study of performance regarding confirmed case count prediction
Table 8 Comparative study of performance regarding recovered case count prediction

Comparative Study of Predictive Analytics

Tables 7 and 8 show that the proposed semantically-enhanced theory-guided model has better prediction potential than the other baselines. In respect of both MAE and RMSE, the proposed model outperforms all the baselines in predicting recovered and confirmed case counts for 10–11 of the 15 Indian states considered. More importantly, the superiority of the proposed model is prominent for those states (e.g. MH: Maharashtra) where the confirmed case counts have been substantially high in recent days. This is primarily because our proposed model has a built-in mechanism for following epidemiological development theory, which is completely ignored by the considered LR and NLR approaches. Accordingly, it can be anticipated that, for longer-range prediction over several future months, our proposed model would be more appropriate than the others. Though SETG also works as per theoretical guidance, our presently proposed model outperforms SETG with an average 11% improvement (reduction) in prediction error (Footnote 4). This demonstrates the effectiveness of enhancing our model by introducing the concept of regional semantic relatedness, which can handle the presence of multiple climate types within each spatial region.

Our predicted counts of daily new confirmed cases and new recovered cases from 18-Sep-2020 to 16-Nov-2020 are graphically presented in Figs. 7 and 8. It is evident from the figures that the predicted values from our proposed semantics-driven theory-guided model match well with the observed values for both the daily confirmed and the daily recovered case counts, given the climate type of the region.

Fig. 9 Assessment for specific impact of climate variability on daily confirmed/recovered case count

Impact Analysis Based on ANOVA Test and Semantic-GI

As mentioned earlier, in our proposed framework, the impact analysis is conducted using both the statistical ANOVA test and the Semantic-GI test (refer to Table 9). The ANOVA test is performed to analyze the significance of the correlation between the daily count of confirmed/recovered cases and the variability of the climate types. The Semantic-GI test is conducted to analyze the same with regard to the semantic relatedness among the climate types, where a region may contain more than one type of climate in its different parts. To handle the uncertainty issue, both tests are carried out on the data predicted by our enhanced semantics-driven theory-guided model.

As indicated by the small F-values and large p-values obtained from the ANOVA test (refer to Table 9), neither the confirmed nor the recovered case counts are significantly correlated with the climate variability over the various states or provinces.
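For readers unfamiliar with the ANOVA F-value, the following is a minimal sketch of the one-way ANOVA F-statistic, assuming that the case counts are grouped by climate type (the grouping and the sample numbers are illustrative assumptions, not values from Table 9). A small F means that the variation between group means is small relative to the variation within groups.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Made-up observations for three hypothetical climate groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_value, df_between, df_within = one_way_anova_f(groups)
```

The p-value is the upper-tail probability of the F distribution with (`df_between`, `df_within`) degrees of freedom; it is omitted here because it requires a statistics library (for example, `scipy.stats.f.sf` in SciPy).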

Table 9 Summary of correlation analyses: climate variability vs. development of COVID-19 cases (confirmed and recovered)

However, with respect to the Semantic-GI measures and the associated Z-scores (refer to Table 9), we can infer that the daily counts of both confirmed cases and recovered cases have significantly high semantic correlations with the climate variations. Given the very high Z-scores, it can be claimed with 99% confidence that semantically similar climate zones have very similar statistics for the daily confirmed/recovered case count. This answers RQ1.
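The exact definition of Semantic-GI is given earlier in the paper and is not reproduced here. As a stand-in, the sketch below uses a standard Moran's-I-style global autocorrelation index in which the usual spatial weights are replaced by a semantic weight matrix, together with a permutation-based Z-score. The function names, the weight matrix, and the values are all hypothetical; this is an illustration of the general idea under those assumptions, not the authors' formula.

```python
import random
import statistics


def global_index(values, weights):
    """Moran's-I-style global index with an arbitrary (e.g. semantic) weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    total_weight = sum(weights[i][j] for i, j in pairs)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    return (n / total_weight) * cross / sum(d * d for d in dev)


def permutation_z_score(values, weights, n_perm=999, seed=0):
    """Z-score of the observed index against indices of randomly shuffled values."""
    rng = random.Random(seed)
    observed = global_index(values, weights)
    shuffled = list(values)
    simulated = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        simulated.append(global_index(shuffled, weights))
    return (observed - statistics.mean(simulated)) / statistics.pstdev(simulated)


# Hypothetical example: regions 0 and 1 are semantically related,
# as are regions 2 and 3; related regions have similar values.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
values = [10.0, 12.0, 2.0, 3.0]

index = global_index(values, weights)
z_score = permutation_z_score(values, weights)

# Two-sided 99% confidence corresponds to |Z| > 2.576 under a normal approximation.
significant_at_99 = abs(z_score) > 2.576
```

A positive index indicates that semantically related regions carry similar values (clustering), which is the pattern the 99% confidence claim above refers to. With only four regions the toy example cannot reach that threshold; it only shows the mechanics.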

Fig. 10 Semantic weight matrix for the considered states (a), and graphical illustration for semantic neighborhood (b)

Subsequently, to answer RQ2, i.e. to understand the impact of region-specific climate, we perform in-depth analyses considering the semantically averaged confirmed/recovered case count (\(sAvg_i\)) for each climate type separately (refer to Fig. 9). From the figure we notice that, compared to the expected value (indicated by the red line), the daily confirmed case counts in the BSh (hot semi-arid) and BWh (hot arid) climate zones are substantially high (\(\approx\) 150%). However, the confirmed case counts in the humid subtropical (Cwa) and tropical (Am, As, Aw) zones are not as high. This forms a significant cluster pattern, which is also reflected by the positive Semantic-GI with a high Z-score. It can therefore be inferred that the arid/dry climate, characterized by very low humidity or little precipitation, is more vulnerable to COVID infection. A contrasting scenario can be observed for COVID-19 recovered cases. For example, the daily relative-recovery in the tropical/equatorial climate zones (Am, As, Aw etc.) is extremely high, whereas that in the hot arid (BWh) and hot semi-arid (BSh) climate zones is quite low (\(\approx\) 20%), compared to the expected value. Thus, humidity is found to show a significant positive correlation with recovery. In other words, the higher the humidity, the higher the relative-recovery from COVID-19. Furthermore, as can be noted from Fig. 9, the temperate climate zones (e.g. Cwa) are also quite vulnerable to severe COVID transmission since, unlike in the tropical climate, the relative-recovery in temperate zones is not too high (only 60% above the expected value), while the confirmed cases per million individuals in these zones are prominently (around 85%) higher than the expected value. Hence, based on the extent of the vulnerability to COVID-19 transmission, we can arrange our studied climate zones as follows: Tropical\(\left\{ Am, As, Aw\right\}\) < Temperate\(\left\{ Cwa\right\}\) < Arid/Semi-arid \(\left\{ BSh, BWh\right\}\). This answers RQ2.
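The comparison against the expected value can be illustrated with a small sketch. The formula for \(sAvg_i\) is defined earlier in the paper and is not reproduced here; the code below assumes, purely for illustration, that it is a regSR-weighted mean of the regional counts and that the expected value is the plain mean over all regions. The region names, climate-type labels, weights, and counts are all made up.

```python
def semantic_average(counts, relatedness, climate_type):
    """Relatedness-weighted mean of regional counts for one climate type.

    counts      : {region: cases per million}
    relatedness : {region: {climate_type: weight in [0, 1]}}
    This weighting scheme is an assumption, not the paper's definition.
    """
    weights = {region: relatedness[region].get(climate_type, 0.0)
               for region in counts}
    total = sum(weights.values())
    return sum(weights[region] * counts[region] for region in counts) / total


def percent_deviation(value, expected):
    """Signed deviation of value from expected, in percent of expected."""
    return 100.0 * (value - expected) / expected


# Hypothetical data: three regions, two climate types.
counts = {"R1": 300.0, "R2": 200.0, "R3": 100.0}
relatedness = {
    "R1": {"type_a": 1.0},
    "R2": {"type_a": 0.5, "type_b": 0.5},
    "R3": {"type_b": 1.0},
}
expected = sum(counts.values()) / len(counts)

deviation = {
    climate: percent_deviation(semantic_average(counts, relatedness, climate), expected)
    for climate in ("type_a", "type_b")
}

# Climate types ordered from least to most affected relative to the expectation.
ranking = sorted(deviation, key=deviation.get)
```

Ordering the climate types by their deviation from the expected value gives the same kind of ranking as the Tropical < Temperate < Arid/Semi-arid arrangement stated above.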

The semantic weight matrix used in our case study is depicted in Fig. 10a, and the same is represented as a semantic neighborhood graph (for 7 selected states) in Fig. 10b to aid interpretation. Interestingly, the graph shows that although a pair of states (e.g. GJ: Gujarat and DL: Delhi) may not be neighbors from a spatial perspective, they can still be semantic neighbors of each other if their semantic weight, i.e. the degree of semantic relatedness in their climate types, is non-zero.
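The non-zero-weight rule for semantic neighbors can be sketched as follows. The region labels and weights are hypothetical placeholders and are not taken from Fig. 10.

```python
def semantic_neighbors(weight_matrix, region):
    """Regions whose semantic weight with `region` is non-zero."""
    return sorted(other
                  for other, weight in weight_matrix[region].items()
                  if other != region and weight > 0.0)


# Hypothetical symmetric semantic weights between three regions.
weight_matrix = {
    "R1": {"R2": 0.8, "R3": 0.0},
    "R2": {"R1": 0.8, "R3": 0.3},
    "R3": {"R1": 0.0, "R2": 0.3},
}

neighbors_of_r1 = semantic_neighbors(weight_matrix, "R1")
neighbors_of_r2 = semantic_neighbors(weight_matrix, "R2")
```

Here R1 and R2 are semantic neighbors regardless of whether they share a border, because the rule depends only on the relatedness of their climate types.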

Significance of Our Research Outcomes

Overall, our research provides insights into the climatological impact on infection with, as well as recovery from, the novel coronavirus disease. Our semantically enhanced theory-guided analyses reveal that the regions with a dry climate are the most susceptible to infection on an everyday basis. At the same time, we also find that the daily relative-recovery in dry regions is quite unfavorable. Accordingly, there remains considerable scope to control the pandemic scenario in India more effectively, not only by imposing stronger isolation measures but also by improving the health-care facilities in the dry/arid regions (e.g. Maharashtra, Rajasthan, Delhi, Gujarat etc.). Though the promising recovery pattern in tropical/equatorial regions indicates that intense isolation/quarantine measures may be sufficient to help control the COVID-19 pandemic in Indian states like West Bengal, Orissa, Tamil Nadu, Chhattisgarh, etc., the upcoming winter season (December, January, February) may increase vulnerability, since winter is dry in these states. To combat the COVID-19 outbreak in India, additional care must also be given to strengthening the health-care infrastructure in the states having a temperate climate, such as Assam, Bihar etc., since the relative-recovery in temperate climate is found to be quite low compared to the infection.

It should be noted that our proposed impact analysis based on Semantic-GI is more meaningful than that based on the ANOVA test. This is because the consideration of semantic knowledge effectively handles the uncertainty in the data that arises from unawareness of other influencing factors that may also affect the COVID-19 spread within each state or province. Considering data from multiple states having semantically related climates can indirectly neutralize the impact of these unknown or hidden factors to a certain extent. Thus, the Semantic-GI test helps us achieve a more robust outcome of impact analysis.

Related Works

Recent research on the climatological effect on COVID-19 outbreak dynamics can be split into two groups based on the respective research conclusions. In the works of the first group, at least one climatic factor, such as humidity, minimum temperature, or average temperature, has been identified as significantly correlated with the COVID-19 pandemic, whereas the second group has not found any such evidence.

Regarding the first group of research, the works of Pani et al. [15], Bashir et al. [3], Liu et al. [13], Auler et al. [1], Ward et al. [25], and Tosepu et al. [24] are worth mentioning. Interestingly, though in a generic sense all these research works notice some association between the COVID outbreak and climatic variables, the specific results are not consistent. For example, using Spearman and Kendall rank correlation tests, Pani et al. [15] have found temperature, dew point, and humidity to be significantly and positively associated with COVID-19 transmission. In contrast, by employing the Spearman correlation measure, Ward et al. [25] have noticed a significant negative association between relative humidity and novel coronavirus transmission. Moreover, they have found no association with temperature. The works of Bashir et al. [3] and Tosepu et al. [24], who identified average temperature to be one of the climatic factors influencing COVID-19 spread, therefore contradict the work of Ward et al. [25]. The outcomes of the relevant research by Liu et al. [13] and Auler et al. [1] are also quite inconsistent. In the former work, primarily using a non-linear regression model, the authors noted that low humidity, low temperature, and mild diurnal temperature range were possibly favorable for COVID-19 transmission. In the latter case, based on a combination of linear regression and multivariate statistical analysis, the authors noticed that higher mean temperatures and average relative humidity can also favor the transmission. The key limitation of these works lies in their purely data-driven approaches, which ignore the physical understanding of infectious disease dynamics. Though our previously introduced SETG model [9] takes into account the theoretical principles of epidemic development, it has its own limitations in real-world application scenarios, since it does not take into account the presence of multiple climate types within a region.

The second group of research works is primarily based on either theoretical models or data-driven models with nonlinear analysis. For example, with the help of pandemic simulation using the SIRS (Susceptible-Infected-Recovered-Susceptible) model, Baker et al. [2] found that the summer weather would not substantially limit pandemic growth. This observation also conforms to the findings of Zhu et al. [26] and Briz et al. [4], who employed a generalized additive model and an approximated Bayesian inference technique, respectively, to serve the purpose. However, recent research also indicates that these results are highly sensitive to the uncertainty underlying the data.

Existing work vs. Proposed Approach: As per the findings, our semantically-enhanced theory-guided research primarily belongs to the first group. However, in contrast to the majority of those works, we consider the theoretical guidance as well. Moreover, our enhanced semantics-driven theory-guided analysis primarily utilizes the overall climate patterns of the regions, rather than considering individual climate factors. Based on our research outcomes, we find that the dry/arid and semi-arid climate zones are the most vulnerable to increasing infection from COVID-19, followed by the temperate climate zones. Our observations on arid/semi-arid climate and temperate climate are supported by the works of Liu et al. [13] and Auler et al. [1], respectively. Additionally, the present work also reveals a significantly positive correlation of humidity with the daily relative-recovery from this disease, which can eventually help in making administrative decisions to effectively control COVID-19 transmission on a regional basis.

Conclusions

Motivated by the semantically-enhanced theory-guided framework introduced in [9], in this paper we have proposed an improved data-driven model to provide a more realistic analysis of how regional climate patterns impact the COVID-19 outbreak. The novelty of this work is primarily embedded in the following three aspects: (1) introducing the concept of “regional semantic average” to account for the relatedness of data from the same climate zone expanded over multiple spatial regions; (2) enhancing interpretation of the causal relationship between climate variability and COVID case development, considering semantic relatedness of the data on a regional basis; and (3) upgrading the impact analysis with consideration to the expected values of relative-recovered cases and per million new infected/confirmed cases over the various regions. Consideration of regional semantic relatedness at the time of learning the causal relationship between climate variability and the COVID-19 outbreak not only helps to deal with the underlying uncertainty but also enables us to better assess the climatological impact on the development of infected and recovered cases of the disease on a regional basis. Moreover, the theoretical guidance from the epidemiological model helps our model in attaining a generalizable solution. At the end of the study we find that both arid/semi-arid and temperate climates are evidently susceptible to COVID-19 transmission. We also observe that a humid climate positively influences the recovery from this novel coronavirus disease in India.

Ample scope remains for further upgrading the framework with added knowledge on genetic aspects of the virus, and also for exploring the impact of other factors. It may be noted that, though our proposed framework has been illustrated with respect to analyzing the impact of climate variability on the COVID-19 outbreak, it can also be extended easily for semantics-driven theory-guided analyses in various other domains, including bio-medical science, material science, quantum chemistry etc., by incorporating appropriate domain knowledge.