Introduction

COVID-19 (coronavirus disease 2019), caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), is a rapidly spreading communicable disease that has drawn attention all over the world. It has immense potential to generate explosive outbreaks in confined settings. The World Health Organization (WHO) declared it a “Public Health Emergency of International Concern” (PHEIC) on January 30th, 2020, and characterized it as a pandemic on March 11th, 2020 (Lai et al. 2020; Wang et al. 2020; WHO 2020; Yang and Wang 2020). Its origin is thought to be linked to the seafood and live animal market in Wuhan, Hubei Province, China. The virus spread across the whole of China in just 30 days and then devastated the rest of the world, with 857,641 confirmed cases and 42,006 deaths (WHO data as of April 2nd, 2020). The rapid global spread of this pandemic has produced near-exponential growth in the number of new cases, reaching almost every country, territory, and area (Huang et al. 2020; Shen et al. 2020; Singh et al. 2020).

The clinical picture of COVID-19 ranges from mild upper respiratory tract infection with fever, dry cough, and typical radiographic changes, through lower respiratory tract infection with non-life-threatening pneumonia, to life-threatening pneumonia with acute respiratory distress syndrome (Hellewell et al. 2020; Prem et al. 2020; Mandal et al. 2020; Choi and Ki 2020). Medical staff all over the globe are making unprecedented efforts to contain the outbreak, although only limited guidelines are available on the acute management of critically ill patients with severe illness due to this virus. The COVID-19 crisis poses a global health threat that calls for advance preparation, evidence-based collective action, and coordinated scientific measures among all countries (Wu et al. 2020a, 2020b; Binti Hamzah et al. 2020; Lin et al. 2020). The best way to prevent and slow down transmission is to stay well informed and to practice social distancing, because the disease spreads primarily through droplets of saliva or nasal discharge when an infected person coughs or sneezes. To prevent community transmission, governments all over the globe are taking extreme steps such as suspending all public transport and imposing complete lockdowns or curfews in some areas. These steps are provoking strong emotional responses of fear and panic in the public and are adversely affecting societies (Abdirizak et al. 2019; Kucharski et al. 2020; Zhao et al. 2020a). With each day bringing new developments that add to public anxiety, speculation circulates widely on social media. In this hour of crisis, modeling and simulation are indispensable tools for providing estimates based on scientific methods rather than mere guesses (Chen et al. 2020).

To understand the nature of transmission of COVID-19 infections, time series analysis is a useful tool that can provide insight into the epidemiological situation and predict future growth.

India is the second most populous country, and in summer the maximum temperature in many states reaches around 50 °C. Temperatures in India are currently rising (40–47 °C), which makes the country a suitable setting for examining the effect of temperature on the novel coronavirus pandemic. Time series analysis deals with the modeling and forecasting of time series data by understanding the relationships among the variables. The autoregressive integrated moving average (ARIMA) model, based on the Box-Jenkins (1976) method, is a commonly used time series model. ARIMA models are applied to a wide variety of time series data, including stationary, non-stationary, and seasonal (periodic) series (Melard and Pasteels 2000; Valenzuela et al. 2008; Parmar and Bhardwaj 2013, 2014, 2015; Soni et al. 2014, 2015, 2016, 2017; Kumar et al. 2015).

The present study develops appropriate ARIMA models for predicting the cumulative number of confirmed COVID-19 cases in six states of India and analyzes how daily confirmed cases vary with temperature. Accurate time series prediction models that capture the growth of a pandemic can be a vital tool in combating the spread of viruses (Ma et al. 2004; Kisi and Parmar 2016, 2017; Wu et al. 2020a, 2020b; Zhao et al. 2020b; Bashir et al. 2020; Gupta et al. 2020; Prata et al. 2020; Tomar and Gupta 2020). For this purpose, daily data on confirmed COVID-19 cases from 1st March 2020 to 16th May 2020 in six states of India are used (Data Source: https://www.covid19india.org). Applying ARIMA models to these datasets gives the future trend and growth of the disease, which will help government authorities to plan well in advance. The effect of temperature on COVID-19 is of interest to all countries, and the WHO in particular has called for it to be studied. This study can also help in estimating and preparing the number of hospitals, ventilators, quarantine centers, and other infrastructure that patients will need in the near future, and in making decisions on lockdowns.

Methodology

Descriptive analysis

COVID-19 infections have been prevalent in India since March 2020 and spread to almost all states of the country within less than a month. Maharashtra is one of the most affected states, with more than 30,000 cases and around 1250 casualties. To study the pervasiveness of COVID-19 in six states of India, namely Delhi, Madhya Pradesh, Maharashtra, Punjab, Rajasthan, and Uttar Pradesh, data on confirmed cases from 1st March 2020 to 16th May 2020 are taken and an appropriate ARIMA model is developed for each state. The data under study consist of 77 sample points, of which the first 60 are used for the modeling phase and the remaining 17 for validating the model. The developed ARIMA models are used to make a 1-month ahead prediction, with 95% confidence intervals, of confirmed COVID-19 cases in the six states. A descriptive analysis of the COVID-19 data from the six states is given in Table 1. The daily mean number of cases for Maharashtra is 5755, against only 410 for Punjab, which indicates a wider spread of infection in Maharashtra. Positive values of kurtosis for the data of each state indicate heavy tails in the data distributions.
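
As a minimal sketch of the split and the summary statistics described above, the following Python fragment works on a made-up series rather than the study data. The function names, the synthetic `cases` list, and the use of the population (moment-based) definition of excess kurtosis are assumptions for illustration; the software used for Table 1 may apply a bias-corrected kurtosis formula and give slightly different values.

```python
from statistics import mean


def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


def split_series(values, n_train=60):
    """Split a series into a modeling part and a validation part."""
    return values[:n_train], values[n_train:]


# Hypothetical cumulative counts for 77 days (NOT data from the study).
cases = [round(2 * 1.08 ** t) for t in range(77)]

train, valid = split_series(cases)
print(len(train), len(valid))   # sizes of the modeling and validation parts
print(mean(cases))              # mean of the series
print(excess_kurtosis(cases))   # positive values point to heavy tails
```

With 77 observations and `n_train=60`, the validation part holds the last 17 points, which matches the split used in this study.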

Table 1 Descriptive analysis of COVID-19 cases in the selected states of India

In Fig. 1, the time series plot of confirmed COVID-19 cases in the six states shows a sharp rise in cases in Maharashtra compared with the other five states. At present, Maharashtra is the most affected state of India, with recovery and mortality rates of 24.07% and 3.56% respectively.

Fig. 1 Time series plot of COVID-19 confirmed cases in some states of India

Autoregressive integrated moving average (ARIMA) model

The autoregressive integrated moving average (ARIMA) model is a widely used econometric model based on the Box-Jenkins approach and is capable of handling non-stationary time series data. An ARIMA \((p,d,q)\) model combines an AR (autoregressive) component, an MA (moving average) component, and an I (integrated) component. In an ARIMA model, the future value of a variable is assumed to be a linear combination of past observations and past random errors, as given by Eq. (1).

$$f(t)=\theta_0+\sum_{i=1}^{p}\varphi_i f(t-i)+\sum_{j=1}^{q}\theta_j\varepsilon_{t-j} \tag{1}$$

where the non-negative integers \(p\) and \(q\) are the orders of the autoregressive and moving average polynomials respectively; \(d\) is the order of non-seasonal differencing required to make the data stationary; \(f(t)\) are the actual observations of the time series; \(\varepsilon_t \sim N(0,\sigma^2)\) are random errors with constant variance \(\sigma^2\); and \(\varphi_i\ (i=1,2,3,\ldots,p)\) and \(\theta_j\ (j=1,2,3,\ldots,q)\) are coefficients.
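
To illustrate how Eq. (1) is evaluated, the sketch below computes its right-hand side for one future time step. The function name `arma_one_step` and all numerical values are hypothetical and are not coefficients estimated in this study. For a model with \(d>0\), the inputs would be the differenced series rather than the raw counts.

```python
def arma_one_step(past_values, past_errors, theta0, phi, theta):
    """Evaluate the right-hand side of Eq. (1) for the next time step.

    past_values[-1] is f(t-1), past_values[-2] is f(t-2), and so on.
    past_errors uses the same ordering for the random errors.
    phi has p entries and theta has q entries.
    """
    if len(past_values) < len(phi) or len(past_errors) < len(theta):
        raise ValueError("not enough history for the requested orders")
    ar_part = sum(phi[i] * past_values[-(i + 1)] for i in range(len(phi)))
    ma_part = sum(theta[j] * past_errors[-(j + 1)] for j in range(len(theta)))
    return theta0 + ar_part + ma_part


# Hypothetical example with p = 2 and q = 1.
forecast = arma_one_step(
    past_values=[10.0, 12.0],   # f(t-2), f(t-1)
    past_errors=[0.5],          # error at t-1
    theta0=1.0,
    phi=[0.5, 0.2],
    theta=[0.4],
)
print(forecast)
```

Here the autoregressive part contributes 0.5 × 12 + 0.2 × 10 = 8, the moving average part contributes 0.4 × 0.5 = 0.2, and adding the constant 1 gives 9.2.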

The ARIMA model building approach usually involves three steps: (a) model identification, (b) parameter estimation, and (c) diagnostic checking to determine the overall adequacy of the model (Diebold 1998; Box et al. 2015). In the model identification stage, the time series is tested for stationarity by computing the autocorrelation function (ACF), partial autocorrelation function (PACF), and cross-correlations. A slowly decaying ACF plot indicates non-stationarity, which is removed by an appropriate differencing transformation. The ARIMA model parameters are commonly estimated by the method of least squares or by maximum likelihood (Chatfield 1996; Brockwell and Davis 2002). Finally, the validity of the selected model is checked with the Ljung-Box test; if the test finds no significant autocorrelation in the residuals, no further modeling of the time series is required. If the selected model is found inadequate, the three-step process is repeated until a valid model is obtained (McNeil et al. 2005; Peng et al. 2014).
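
The identification and diagnostic steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the procedure or software used in this study: the function names and the example series are hypothetical, and the parameter estimation step is not shown. The Ljung-Box statistic is computed as \(Q=n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)\), where \(r_k\) is the lag-\(k\) sample autocorrelation of the residuals.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    m = sum(series) / n
    dev = [x - m for x in series]
    c0 = sum(v * v for v in dev)
    if c0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    r = acf(residuals, max_lag)
    return n * (n + 2) * sum(r[k] ** 2 / (n - k) for k in range(1, max_lag + 1))


# Hypothetical daily increments and their cumulative sum (NOT study data).
increments = [3, 5, 4, 6, 2, 7, 3, 5, 4, 6] * 3
cumulative = []
total = 0
for inc in increments:
    total += inc
    cumulative.append(total)

print(acf(cumulative, 3))                 # trending series: inspect for slow decay
print(acf(difference(cumulative, 1), 3))  # same check after first differencing
print(ljung_box_q(difference(cumulative, 1), 5))
```

For residuals of a fitted model with orders \(p\) and \(q\), \(Q\) is compared with a chi-square quantile with \(h-p-q\) degrees of freedom; a value below the quantile gives no evidence of remaining autocorrelation.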

Performance of prediction

The predictive performance of the ARIMA models developed for confirmed COVID-19 cases in the six states of India is evaluated by the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (\(R^2\)). The observed and predicted values of the COVID-19 data from these states are used to measure the accuracy of prediction. MAE, MSE, RMSE, and MAPE are defined by

$$\text{MAE}=\frac{1}{n}\sum_{t=1}^{n}\left|f(t)-\hat{f}(t)\right|$$
$$\text{MSE}=\frac{1}{n}\sum_{t=1}^{n}\left[f(t)-\hat{f}(t)\right]^2$$
$$\text{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left[f(t)-\hat{f}(t)\right]^2}$$
$$\text{MAPE}=\frac{100}{n}\sum_{t=1}^{n}\left|\frac{f(t)-\hat{f}(t)}{f(t)}\right|$$

where \(\hat{f}(t)\) is the predicted value of \(f(t)\) and \(n\) is the number of predictions compared.
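
The four error measures defined above translate directly into code. The sketch below is a minimal illustration; the function name `forecast_errors` and the observed and predicted values are hypothetical and are not results from this study. Note that MAPE is undefined when an observed value is zero, so the sketch rejects such input.

```python
import math


def forecast_errors(observed, predicted):
    """Return MAE, MSE, RMSE, and MAPE for paired observations and predictions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mape = 100.0 / n * sum(abs(e / o) for e, o in zip(errors, observed))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical observed and predicted counts (NOT study data).
metrics = forecast_errors([100, 200, 400], [110, 190, 380])
print(metrics)
```

Because MAPE is scale-free, it allows the fit to be compared across states with very different case counts, whereas MAE, MSE, and RMSE depend on the magnitude of the series.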

Correlation analysis

Correlation analysis is used to study the strength and direction of the linear relationship between two variables. Karl Pearson’s coefficient of correlation is the most widely used measure of correlation. It is given by the formula

$$\rho_{xy}=r=\text{Correlation coefficient}=\frac{\operatorname{cov}(X,Y)}{\sigma_x\sigma_y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{\left(E(X^2)-(E(X))^2\right)\left(E(Y^2)-(E(Y))^2\right)}}$$

Here, \(\sigma_x\) and \(\sigma_y\) represent the standard deviations of X and Y, while E(X), E(Y), and E(XY) represent the expected values of X, Y, and XY (Makkhan et al. 2020; Parmar and Bhardwaj 2013).

The daily confirmed COVID-19 cases (X) and the corresponding daily temperature (Y) are considered in the present study. Data collected from 1st March 2020 to 16th May 2020 for the six most affected states of India are used for the correlation analysis. The analysis is applied in two stages: first to the whole dataset, and second to the last 10 days only, because the temperature in the last 10 days is higher than in the preceding period.
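
A minimal sketch of the two-stage calculation, using the expectation form of the Pearson formula given above, might look as follows. The function names, the `last_n` parameter, and the sample values are assumptions for illustration and are not data from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation via (E[XY] - E[X]E[Y]) / (sigma_x * sigma_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    ex = sum(x) / n
    ey = sum(y) / n
    exy = sum(a * b for a, b in zip(x, y)) / n
    var_x = sum(a * a for a in x) / n - ex ** 2
    var_y = sum(b * b for b in y) / n - ey ** 2
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for a constant series")
    return (exy - ex * ey) / math.sqrt(var_x * var_y)


def two_stage_correlation(cases, temps, last_n=10):
    """Stage I uses the whole period; stage II uses only the last `last_n` days."""
    stage_one = pearson_r(cases, temps)
    stage_two = pearson_r(cases[-last_n:], temps[-last_n:])
    return stage_one, stage_two


# Hypothetical daily cases and temperatures (NOT study data).
daily_cases = [1, 2, 3, 4, 5, 6]
daily_temps = [2, 4, 6, 8, 7, 6]
stage_one, stage_two = two_stage_correlation(daily_cases, daily_temps, last_n=3)
print(stage_one, stage_two)
```

In this made-up example the last three days have rising cases and falling temperature, so stage II returns a correlation of −1 even though the full-period correlation is positive. This shows why the two stages can lead to different conclusions.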

Results and discussion

In this study, datasets of confirmed COVID-19 cases in six states of India, namely Delhi, Madhya Pradesh, Maharashtra, Punjab, Rajasthan, and Uttar Pradesh, are used as input to predict future cases with appropriate ARIMA models. The first step in the ARIMA modeling approach is to check the stationarity of the time series. For a stationary time series, the mean, variance, and autocorrelation structure remain constant over time. Accurate prediction requires a stationary series, and stationarity can be assessed from the ACF and PACF plots. The slowly decaying ACF plots indicate non-stationarity in the time series of all six states (Fig. 2). The series show significant correlations, as the red bars extend beyond the two-standard-deviation limits marked by blue lines in the ACF plots. The series are therefore differenced to achieve stationarity. If non-stationarity still prevails, second-order differences are taken to remove the trend completely. Regenerating the ACF and PACF plots for the appropriately differenced series helps in estimating the parameters of the ARIMA model. The ACF and PACF plots of the residuals of the fitted ARIMA models are shown in Fig. 3, and the fit of the models to the COVID-19 data of the six states is shown in Fig. 4. After fitting, predictions are generated to validate the models (Fig. 5). The accuracy of the prediction is evaluated by comparing the observed and predicted values (Table 2 and Fig. 6). The MAPE values for Uttar Pradesh, Maharashtra, Rajasthan, and Madhya Pradesh are 3.3662, 4.8327, 5.8952, and 6.8025 respectively, which are low compared with those for Delhi and Punjab. This means that the COVID-19 data for Uttar Pradesh, Maharashtra, Rajasthan, and Madhya Pradesh are fitted more closely by their respective ARIMA models.

The one-month ahead prediction of confirmed COVID-19 cases in the six states is shown in Fig. 7 and exhibits an increasing trend in all of them. Maharashtra shows an exponential rise in cases compared with the other states, and the ARIMA model estimates approximately 1200 cases per day there. In the capital city Delhi, there is a significant increase in the number of cases, and the ARIMA model estimates around 300 cases per day. For Madhya Pradesh, Rajasthan, and Uttar Pradesh, the predicted daily confirmed cases are in the range of 200 to 300, and Punjab has the lowest predicted count at 10 to 20 cases per day. The governments of Maharashtra, Delhi, Uttar Pradesh, Rajasthan, and Madhya Pradesh therefore need effective strategies to control the rise of the curve of confirmed COVID-19 cases. Given the increasing trend estimated by the ARIMA models in these states, there is an urgent need for measures and facilities such as mass testing, isolation wards, and life support systems to control the spread of the epidemic.

Fig. 2 ACF plots of six states of India: (i) Delhi, (ii) Madhya Pradesh, (iii) Maharashtra, (iv) Punjab, (v) Rajasthan, and (vi) Uttar Pradesh

Fig. 3 ACF and PACF plots of the residuals of appropriately fitted ARIMA models for data of six states of India in the order Delhi, Madhya Pradesh, Maharashtra, Punjab, Rajasthan, and Uttar Pradesh

Fig. 4 Fitting of COVID-19 data with ARIMA models for six states of India: (i) Delhi, (ii) Madhya Pradesh, (iii) Maharashtra, (iv) Punjab, (v) Rajasthan, and (vi) Uttar Pradesh

Fig. 5 Forecast and 95% forecast intervals of COVID-19 data for six states of India: (i) Delhi, (ii) Madhya Pradesh, (iii) Maharashtra, (iv) Punjab, (v) Rajasthan, and (vi) Uttar Pradesh

Table 2 Predictive performance of ARIMA models

Fig. 6 Comparison of observed and forecast values of COVID-19 data for six states of India: (i) Delhi, (ii) Madhya Pradesh, (iii) Maharashtra, (iv) Punjab, (v) Rajasthan, and (vi) Uttar Pradesh

Fig. 7 One-month ahead prediction of COVID-19 cases

Table 3 describes the two-stage correlation analysis of COVID-19 cases with respect to temperature variations. In stage I, the whole dataset is used to calculate the correlation coefficient; in stage II, only the last 10 days are used. In stage I, the variation in temperature is high, and no effect of temperature on the daily COVID-19 cases is evident. In stage II, the minimum temperature rises along with the maximum, and a small negative correlation appears. This suggests that, as temperatures continue to rise in many states of India, daily COVID-19 cases may respond inversely.

Table 3 Two-stage correlation analysis

Conclusions

In this study, ARIMA models were developed and applied to predict the daily data of confirmed COVID-19 cases in six states of India, namely Delhi, Madhya Pradesh, Maharashtra, Punjab, Rajasthan, and Uttar Pradesh. The input dataset of 60 days was modeled with appropriate ARIMA models, and a 17-day prediction was made and compared with the observed values for those 17 days to assess the accuracy of prediction. One-month ahead predictions of daily confirmed COVID-19 cases were then made on the basis of the developed ARIMA models. The MAPE values for Uttar Pradesh, Maharashtra, Rajasthan, and Madhya Pradesh are more than 50% lower than those for Delhi and Punjab, which indicates the accuracy of the developed models for these states. Moreover, the 1-month ahead forecast of confirmed COVID-19 cases suggests a large incidence of daily cases in Maharashtra and Delhi and a low incidence in Punjab. In stage II of the correlation analysis, when both the minimum and maximum temperatures had risen, a small negative correlation (−0.11054) was observed between temperature and daily cases, which suggests that daily cases decrease slightly as temperature increases. As temperatures continue to rise in many states of India, this points to an inverse effect on daily COVID-19 cases. The prediction results can help the governments of these states to plan well in advance and take preventive measures to flatten the growth curve of confirmed COVID-19 cases.