1 Introduction

The coronavirus disease 2019 (COVID-19) has spread very rapidly since its first reported case in December 2019 in Wuhan, China. According to the World Health Organization, it had infected over 3,181,642 people in 215 countries and caused 224,301 deaths by 1st May 2020 [1]. Common symptoms of infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) identified so far from recognized cases include fever, tiredness, dry cough, sore throat, and diarrhea [2, 3]. However, the increasing number of asymptomatic patients in some countries is dangerous for society and a challenge for doctors and the health care system [4]. Some studies identify the local seafood market of Wuhan, China as a source of COVID-19, from which the virus was transmitted from bats to humans [1, 3, 5, 6]. Research to explore the transmission routes of COVID-19 is still ongoing. Most human-to-human infections occur through the respiratory tract, and the disease spreads through human contact at gatherings, in meetings with relatives and friends, and between patients and healthcare workers [7]. Besides surfaces, the presence of the coronavirus in blood and fecal swabs [8] and in the air [9] around hospital areas indicates transmission through multiple routes, which is another challenge for the healthcare system. Several approaches are in use for the detection of COVID-19, but a specific treatment is still lacking. Many drugs are being tested and several vaccines are still under development [10]; therefore, social distancing, isolation, personal hygiene, and following the instructions of government organizations and doctors are among the precautions that reduce the spread of COVID-19. In India, 26,167 active cases, 9950 recovered cases, 1218 deaths, and 1 migrated case were reported up to 1st May 2020 [11].

It is commonly observed that changes in the weather affect the health of many people, as with the seasonal cold and flu at the beginning of winter and summer. This is because weather conditions affect the transmission of most viruses. Similarly, climate conditions affect the transmission rate of epidemic viruses, as established in studies of previous epidemics. Severe acute respiratory syndrome (SARS), which started in January 2002 and mostly ended in July of the same year, has a genetic sequence quite similar to that of COVID-19 and was affected by warm weather conditions [12]. The transmission rate of influenza is strongly correlated with atmospheric conditions and increases significantly at low daily temperature and humidity [13]. Some recent studies [14,15,16,17,18,19,20,21] have established the effect of climate conditions on the transmission rate of COVID-19. However, it is hard to find any study of the impact of atmospheric factors, including temperature and humidity, on the transmission rate of COVID-19 in different states of India during the lockdown period.

Machine learning-based approaches have been widely implemented in the health care system for disease diagnosis, monitoring, and prediction to reduce the workload of doctors and hospital workers [22,23,24]. In some recent research reports, machine learning approaches have been applied successfully to the identification of COVID-19 [25,26,27,28,29]. However, a machine learning approach that uses atmospheric factors to predict COVID-19 cases has not yet been reported. With this motivation, the gradient boosting machine (GBM) approach has been implemented to establish the relationship between atmospheric factors (temperature and humidity) and the daily spread rate of COVID-19 in different states of India. The present study makes the following contributions: it (a) explores the correlation between atmospheric parameters and the transmission rate of COVID-19 in different states of India, (b) predicts the active and recovered cases of COVID-19, and (c) establishes an efficient tree-based machine learning approach to explore the effect of temperature and humidity on the transmission rate of COVID-19.

2 COVID-19 and atmospheric data collection and compilation

The meteorological data of all states of India were collected daily from the Indian Central Pollution Control Board (CPCB) [30] and the Indian Meteorological Department (IMD) [31] from 28th March 2020 to 22nd April 2020. The COVID-19 data were collected from the Ministry of Health and Family Welfare, Government of India [11], and an open-access source [32]. The meteorological parameters include the minimum temperature, maximum temperature, minimum humidity, and maximum humidity of all states of India. In addition, minimum pressure, maximum pressure, minimum wind speed, maximum wind speed, PM10, and PM2.5 were collected but not used in the analysis because they were less correlated with the COVID-19 information. The COVID-19 information includes daily new infection cases, active cases (accumulated total cases up to the previous days minus recovered cases minus deceased cases), recovered cases to date, and mortality to date. Finally, the meteorological parameters and COVID-19 information were combined for further analysis. Missing values of the meteorological parameters were imputed with the median values. The variations of the imputed values of minimum and maximum temperature, and of minimum and maximum humidity, are shown in Figs. 1 and 2, respectively. The measurement sample represents the total number of measurements of temperature and humidity in different states of India over 26 days. The four meteorological parameters were used as the input of the GBM approach, and the active and recovered cases of COVID-19 were used, independently, as its output.
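
The two data-preparation steps described above (deriving active cases and imputing missing meteorological values with the median) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the sample readings are hypothetical, and the sketch assumes the median is taken over the observed values of a single parameter.

```python
from statistics import median


def active_cases(cumulative_total, recovered, deceased):
    """Active cases = accumulated total cases - recovered - deceased."""
    return cumulative_total - recovered - deceased


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a parameter with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical maximum-temperature readings (deg C) with two missing days.
max_temp = [31.0, None, 35.5, 33.0, None, 29.5]
print(impute_median(max_temp))   # [31.0, 32.0, 35.5, 33.0, 32.0, 29.5]
print(active_cases(120, 30, 5))  # 85
```

The median is less sensitive to extreme readings than the mean, which is one plausible reason to prefer it for weather data that ranges widely across states.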

The data collected over 26 days comprise a total of 702 instances, of which 467 (about 2/3rd of the instances of all states) were used for training and 235 (about 1/3rd) were used for the combined testing. Moreover, final tests were performed for individual states to check the model performance. Table 1 presents the basic statistics of the combined dataset. Figure 3 shows the variation in the number of recovered and active cases of COVID-19 in different states of India. The samples in Fig. 3 represent the total number of COVID-19 cases collected in different states of India from 28th March to 22nd April 2020.
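
The instance counts above are consistent with one record per state per day: the Discussion reports 27 states, and 27 × 26 = 702. The sketch below reproduces the 467/235 partition under the assumption of a seeded random shuffle; the text does not state how instances were assigned to the two sets, so the shuffling strategy and the helper name `split_indices` are illustrative only.

```python
import random

N_STATES, N_DAYS = 27, 26   # counts reported in the text
N_TRAIN = 467               # training instances reported in the text


def split_indices(n_total, n_train, seed=0):
    """Shuffle row indices reproducibly and cut them into train and test."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


n_total = N_STATES * N_DAYS
train_idx, test_idx = split_indices(n_total, N_TRAIN)
print(n_total, len(train_idx), len(test_idx))  # 702 467 235
```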

Fig. 1 The variation in the minimum and the maximum temperature

Table 1 Basic statistics of the dataset

Fig. 2 The variation in the minimum and the maximum humidity

Fig. 3 The variation in the active and recovered cases of COVID-19

Fig. 4 Performance of GBM in the prediction of active cases of COVID-19 using the combined dataset of all states of India

3 Gradient boosting machine (GBM) learning approach

GBM is a forward, stage-wise ensemble learning model that is used to solve regression as well as classification problems. Rather than relying on a single strong predictor, it combines many weak predictors, typically shallow decision trees, in sequence. It improves on a single decision tree in that every successive tree is comparatively analyzed to build an optimally satisfying tree structure by using the structure score, gain calculation, and increasingly refined approximations.

The prediction performance of GBM can be boosted by invoking an additional classifier. This modification improves the accuracy of the tree without affecting its speed. It also makes the model easy to distribute and parallelize, and provides a convenient environment for model tuning and selection. This version of GBM is capable of handling big data with optimal accuracy. It has rarely been used in COVID-19 prediction modeling. In the present study, the GBM approach is implemented with the H2O package in R [33]. The GBM model was optimized over the number of trees k = 1, 2, 3,…50. The maximum number of trees, K = 50, was selected arbitrarily. The algorithm table of GBM is as follows [34].

Algorithm Table of Gradient Boosting Machine (GBM)

Four atmospheric parameters (minimum temperature, maximum temperature, minimum humidity, and maximum humidity) were used in the analysis. Specifically, to reduce the computational complexity, the average of the maximum and minimum temperature and the average of the maximum and minimum humidity were used as inputs to the GBM model to predict the number of recovered and active cases in all states and also in some individual states of India.
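
To illustrate the general boosting idea described in this section, the following is a minimal sketch of gradient boosting for regression. It is not the H2O implementation used in the study and is not a reproduction of the algorithm table in [34]: it assumes a squared-error loss and depth-one trees (stumps), and all function names and data values are hypothetical. The two inputs are the averaged temperature and humidity described above; 50 trees and a learning rate of 0.8 mirror values reported elsewhere in the paper.

```python
def fit_stump(x, residuals):
    """Fit a depth-one regression tree: the one split with least squared error."""
    best = None
    for j in range(len(x[0])):
        values = sorted(set(row[j] for row in x))
        for thr in values[:-1]:  # each distinct value except the largest
            left = [r for row, r in zip(x, residuals) if row[j] <= thr]
            right = [r for row, r in zip(x, residuals) if row[j] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    if best is None:  # no split possible: predict a constant
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, thr, lmean, rmean = stump
    return lmean if row[j] <= thr else rmean


def fit_gbm(x, y, n_trees=50, learning_rate=0.8):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared error the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, x)]
    return base, learning_rate, stumps


def predict_gbm(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Hypothetical rows: (min_temp, max_temp, min_humidity, max_humidity).
raw = [(18, 30, 20, 60), (22, 36, 30, 70), (10, 20, 15, 45), (25, 40, 35, 80)]
# Average the extremes to obtain the two model inputs.
x = [((tmin + tmax) / 2.0, (hmin + hmax) / 2.0) for tmin, tmax, hmin, hmax in raw]
y = [40.0, 90.0, 5.0, 150.0]  # made-up case counts

model = fit_gbm(x, y)
print([round(predict_gbm(model, row), 1) for row in x])
```

Each new stump corrects what the current ensemble still gets wrong, so the training error shrinks as trees are added. A production library such as H2O adds deeper trees, other loss functions (for example Poisson), and cross-validation on top of this basic loop.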

4 Analysis of outcomes

4.1 Statistical analysis of the COVID-19 dataset

The results of the analysis of variance (ANOVA) of the atmospheric parameters and the active and recovered cases of COVID-19 are shown in Table 2. They indicate that the atmospheric and COVID-19 data sets included in the present study are statistically significant and can be used for further processing.
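
For readers unfamiliar with the test, the sketch below computes the F statistic of a one-way ANOVA. The text does not specify the ANOVA design behind Table 2, so the one-way form, the grouping of case counts by temperature band, and the sample values are all assumptions made for illustration.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical daily case counts grouped by low, medium, and high temperature.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(groups))  # approximately 13.0
```

A large F value means the group means differ by more than the within-group scatter would explain; the corresponding p-value is obtained from the F distribution with (k - 1, n - k) degrees of freedom.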

Fig. 5 Performance of GBM in the prediction of recovered cases of COVID-19 using the combined dataset of all states of India

Table 2 ANOVA analysis results of atmospheric factors and COVID-19 data

4.2 Results of gradient boosting machine analysis

The GBM model was tuned over the number of trees, learning rate, number of folds, and distribution functions (Gaussian, Tweedie, Huber, Laplace, Poisson, Quantile, and Gamma). The training prediction performance of the GBM model for the active and recovered cases of COVID-19 is summarized in Table 3. The performance of GBM is evaluated based on the mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), mean residual deviance (MRD), and coefficient of determination (R2). The optimal prediction performance of the GBM was achieved with the Poisson distribution (R2 = 0.99), number of trees = 50, learning rate = 0.8, and number of folds = 10 in the prediction of both active and recovered cases of COVID-19 in all states of India in the selected duration. The predictive performance of the GBM method for active and recovered cases of COVID-19 on the combined test dataset of all states is visualized in Figs. 4 and 5, respectively.
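
Four of the evaluation measures named above have standard closed forms, sketched below with hypothetical values. The mean residual deviance is left out because its formula depends on the chosen distribution function; the function name `regression_metrics` is illustrative and not part of any library.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, and R2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum(e ** 2 for e in errors)
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": mae,
        "R2": 1.0 - ss_residual / ss_total,
    }


# Hypothetical actual and predicted case counts.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)  # MSE 0.5, RMSE about 0.707, MAE 0.5, R2 0.9
```

R2 compares the model's squared error with that of always predicting the mean, so a value near 1 indicates a close fit and a value near 0 indicates no improvement over the mean.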

Fig. 6 Performance of GBM in the prediction of active cases of COVID-19 of Delhi, India

Table 3 Overall performance of GBM in training using the combined dataset of all states of India

Figures 4 and 5 also provide a comprehensive comparison of the different distribution functions of GBM in the prediction of COVID-19 cases. The Poisson distribution function performs better than the other functions on the combined test dataset of all states. The test performance of the Poisson distribution function in the prediction of COVID-19 cases for ten states is summarized in Table 4. The detailed statewise prediction results of the GBM using different distribution functions for Delhi, Maharashtra, and Gujarat are summarized in Tables 5, 6 and 7, respectively. Figures 6 and 7 show the predicted results of GBM for the active and recovered cases of COVID-19 in Delhi. Figures 8 and 9 show the qualitative performance of GBM in the prediction of active and recovered cases of COVID-19 in Maharashtra. The performance of GBM in the prediction of active and recovered cases of COVID-19 in Gujarat is presented in Figs. 10 and 11, respectively.

Table 4 Performance of GBM using Poisson distribution in the prediction of recovered and active cases of COVID-19 in different states of India

5 Discussion

Tree-based machine learning approaches have proved quite useful in the modeling of small as well as big datasets in past studies [35, 36]. The GBM can be used for pandemic prediction and has high efficiency [36]. For this reason, the GBM approach was selected for modeling the transmission rate of COVID-19 in India using atmospheric factors. India covers a large geographical region, so there is a huge variation in the weather parameters (Table 1, Figs. 1 and 2). This is evident from the statistical description of the weather parameters: the minimum temperature fluctuates between −18 °C and 41 °C, the maximum temperature between −8 °C and 44 °C, the minimum humidity between 1% and 77%, and the maximum humidity between 7% and 99%. In addition, a huge variation in the number of COVID-19 cases (both active and recovered) has been noticed in different parts of India: between 0 and 4591 for active cases and between 0 and 789 for recovered cases (Fig. 3). Considering these variations in the weather parameters and the number of COVID-19 cases, a total of 702 instances from 27 different states of India over 26 days was used in the GBM analysis. The statistical analysis of the parameters of the dataset suggests that they are unequally distributed.

The ANOVA test results (Table 2) reject the null hypothesis and suggest that all parameters of the datasets are significant. The GBM was trained using the combined dataset of all states and optimized over the distribution functions and their parameters. The overall performance results of GBM in terms of R2, MSE, RMSE, MAE, and MRD (Table 4, Figs. 4 and 5) suggest that the Poisson distribution performs better in the prediction of both active and recovered cases of COVID-19. Based on the performance measures, the distribution functions rank as follows in the training of the GBM: Poisson > Gaussian > Tweedie > Gamma > Huber > Quantile. This may be because the numbers of recovered and active cases of COVID-19 follow the Poisson distribution; further research is required to establish this. Because the Poisson distribution gave the highest accuracy, it was used in the prediction of COVID-19 cases in individual states of India (Table 4). The best performance of GBM was achieved in the prediction of recovered cases in Kerala (R2 = 0.98). GBM has R2 values between 0.61 and 0.95 for other states (Maharashtra, Gujarat, Andhra Pradesh, Haryana, and Rajasthan). In some states, such as Kerala, Haryana, West Bengal, Telangana, and Rajasthan, the GBM predicts the recovered cases better than the active cases. For other states, such as Gujarat, Uttar Pradesh, and Madhya Pradesh, GBM predicts the active cases more accurately than the recovered cases. The recovery rate of COVID-19 was very low in Uttar Pradesh and Madhya Pradesh in the last week of March and the first week of April 2020. This is one reason for the modest performance (R2 of 0.43 and 0.21, respectively) of the GBM model in the prediction of recovered cases of COVID-19 in these two states. Similarly, the low number of reported active cases of COVID-19 in West Bengal and Telangana up to the first week of April 2020 results in the modest prediction performance of the GBM (R2 of 0.29 and 0.19, respectively).

For the states in which GBM predicts the active and recovered cases of COVID-19 with high accuracy using the average temperature and humidity, the results indicate a minor effect of atmospheric factors on the transmission rate of COVID-19. For the remaining states, in which the GBM predicts active and recovered cases less accurately, the atmospheric factors might have a major effect on the transmission rate. The role of atmospheric factors such as temperature and humidity in the transmission rate of COVID-19 is still uncertain and may vary by location. A negative correlation between the transmission rate of COVID-19 and temperature and humidity has nevertheless been discussed in some recent studies. Ahmadi et al. [20] concluded that the transmission rate of COVID-19 in Iran was high at low humidity and temperature; Wang et al. [16] reported a low transmission rate in China at high temperature and humidity; Qi et al. [19] described a negative correlation between the transmission rate of COVID-19 and the average temperature and average humidity; and Tosepu et al. [21] established a positive Spearman rank correlation (r = 0.392) between average temperature and cases of COVID-19 in Indonesia. In addition, the outcomes of an analysis with the ARIMA model and a polynomial function [37] suggested future scope for humidity and other atmospheric factors in the prediction of COVID-19 cases in different geographic regions. Based on the collected atmospheric parameters and the number of COVID-19 cases in different states of India during the study period, states with low average temperature and humidity, such as Himachal Pradesh, Uttarakhand, Jammu and Kashmir, Arunachal Pradesh, Mizoram, and Sikkim, have a low number of active cases, while states with high average temperature and humidity, such as Delhi, Gujarat, Maharashtra, Kerala, and Rajasthan, have a high number of active cases. This means that the average temperature and humidity were mostly positively correlated with the increasing number of COVID-19 cases in most states of India during the selected period. The higher population density may also be a reason for this observation. However, an increase in temperature and humidity also increased the rate of recovery in most of the states. More data sets need to be combined and analyzed to reach a concrete conclusion about the impact of the weather parameters on the transmission rate of COVID-19.
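
The Spearman rank correlation cited in this discussion can be computed as the Pearson correlation of the ranks of the two variables. The sketch below shows one way to do this with average ranks for ties; the temperature and case values are hypothetical and are not taken from any of the cited studies.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical average temperatures (deg C) and daily case counts.
temperature = [20, 25, 30, 35, 40]
cases = [5, 3, 10, 12, 11]
print(round(spearman(temperature, cases), 3))  # 0.8
```

Because it works on ranks, the coefficient captures any monotonic relationship and is not distorted by the very large case counts of a few states.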

GBM was found to have high prediction accuracy for both the active and the recovered cases of some states of India, specifically Delhi, Maharashtra, and Gujarat, the three states worst hit by the pandemic, with the maximum number of active cases compared with the rest of the states of India. Tables 5, 6 and 7 and Figs. 7, 8, 9 and 10 show the detailed performance of GBM using different distribution functions. The test results for Delhi are significant with the Poisson and Gaussian distributions and reflect the actual recovery rate. The active cases captured by the Poisson and Huber distributions also reflect the real data, but some peaks show the spreading tendency. The prediction results for Maharashtra and Gujarat have high variability and show sudden peaks at irregular and short intervals, which also match the real conditions in these two states. The performance of the GBM approach implemented in the present study is comparable to or better than that of some previously implemented approaches to predicting transmission rates of COVID-19 from weather parameters. The weather dataset was not available (NA) for some states at the time of collection. The NA data were replaced by the median value of the instances during the analysis, which may also be a cause of the poor performance of the GBM model in the prediction of COVID-19 cases for these states. The transmission speed of COVID-19 was very low in some states of the country before 15th April 2020, which results in the non-availability of COVID-19 data and also affects the prediction performance of the GBM model. The performance of the Gaussian distribution-based GBM was compared with deep neural network and random forest (RF) models using a similar dataset. The comparative experimental results (Table 8) suggest that GBM performs better than the other models. The deep neural network performs poorly in the prediction of active and recovered cases (R2 equal to 0.22 and 0.02, respectively). The RF approach has an average performance in the prediction of active and recovered cases (R2 equal to 0.59 and 0.33, respectively). The comparative performance of the three models in terms of R2 is as follows: GBM > RF > deep neural network. In addition, the GBM performs better than the other two approaches in terms of the other evaluation measures, namely MSE, RMSE, MAE, and MRD.

Table 5 Performance of GBM in the prediction of recovered and active cases in Delhi, India

Table 6 Performance of GBM in the prediction of recovered and active cases in Maharashtra, India

Table 7 Performance of GBM in the prediction of recovered and active cases in Gujarat

Fig. 7 Performance of GBM in the prediction of recovered cases of COVID-19 of Delhi, India

Fig. 8 Performance of GBM in the prediction of active cases of COVID-19 of Maharashtra, India

Fig. 9 Performance of GBM in the prediction of recovered cases of COVID-19 of Maharashtra, India

Fig. 10 Performance of GBM in the prediction of active cases of COVID-19 of Gujarat, India

Fig. 11 Performance of GBM in the prediction of recovered cases of COVID-19 of Gujarat, India

Table 8 Comparative performance of GBM, random forest, and deep neural network in the analysis of combined training dataset of all states of India

The deep neural network is one of the most useful techniques in image processing and has achieved better performance in several past studies, for example in emotion recognition using the combination of a deep convolutional neural network and a kernel learning classifier [38]. Moreover, strategies to improve the performance of the deep neural network in the analysis of data from different experimental domains have been discussed in some past studies, such as using a training approach [39], generalized maxout networks [40], and transfer learning [41]. Deep learning performs better in the analysis of categorical feature datasets, whereas tree-based learning is better on dense numerical feature datasets [42]. Moreover, the performance of deep learning approaches varies according to the nature and dimensionality of the dataset [43]. In the case of a relational dataset, its performance is inferior to that of tree-based learning algorithms [44]. This may be because a tree-based algorithm, although prone to overfitting, gives better results in the case of high dimensionality. The COVID-19 dataset used in the present analysis has high dimensionality and is relational in nature; this may be the reason for the poor performance of the deep neural network method. The poor performance of the deep neural network in the present analysis, even after optimization of its parameters, may also be due to the small size, randomness, noise, and missing values of the dataset.

6 Conclusions and future research scope

The present study established an association between the number of COVID-19 cases and meteorological parameters in different states of India. The study implemented an efficient method of predictive modeling using the GBM-based machine learning approach. The experimental results suggest that the GBM model is capable of capturing the correlation between COVID-19 cases and atmospheric parameters. The maximum achieved values of R2 and the minimum values of the errors of the GBM suggest a certain association between the atmospheric factors and the transmission rates of COVID-19 in some states of India, specifically Delhi, Maharashtra, and Gujarat. Future research will include additional meteorological parameters for a better understanding of the dependence of the transmission rate of COVID-19 on atmospheric conditions by using an efficient and robust machine learning approach. In addition, the performance of the deep neural network in handling pandemic data needs to be improved.