1 Introduction

On 08-Dec-2019, a novel coronavirus disease (COVID-19), a member of the family of the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), started to infect people in the city of Wuhan, China [1]. COVID-19 was declared as pandemic by the World Health Organization (WHO) on 11-Mar-2020, and since then it invaded almost all countries of the world [2].

Essentially, COVID-19 is an infectious viral disease that is transmitted from human-to-human through droplets whether direct; during coughing, sneezing of the patient or the carrier of the disease or indirect; through getting in contact with the patient’s saliva on close contact, shaking hands, using his personal articles or touching surfaces soaked with his droplets containing the virus. The virus finds its way into the human body through the mucus membranes of the mouth, nose and eyes [2,3,4].

Clinical picture of the COVID-19 infected patients varies significantly, from being asymptomatic to having severe form of the disease. In most cases, high fever, cough, sore throat, general weakness, fatigue and muscular pain are manifested in many patients. In the severe cases, pneumonia, acute respiratory distress syndrome, micro-coagulopathies, sepsis and septic shock are highly manifested, and in many instances, it could lead to death. Reports show that clinical deterioration occurs rapidly, often during the second week of the course of the disease [5, 6]. Patients with underlying medical conditions such as cardiovascular disease, diabetes, chronic respiratory disease, cancer and old-aged people are more likely to experience serious illness [7].

Since it has been first reported, the COVID-19 invaded 210 countries and territories around the world [8]. As for 10-Aug-2020, in more or less seven months, a total of 20,173,775 confirmed cases of COVID-19 were reported and its death toll showed about 736,300 deaths.

Many research works are going on to study the infectious nature of COVID-19, and every day we learn something new about it through the flooding of the huge data that are accumulating hourly rather than daily [9]. However, currently, some information is known about COVID-19; its full characteristics are still unclear. One of the COVID-19 features is that due to its accelerated genetic mutations, it changes its behaviour very quickly. Therefore, scientists are continuously performing observational studies just to establish facts about COVID-19 that will help in ending its pandemic. However, the viral genetic mutations increase the likelihood of having a second wave of the pandemic in future [9].

After recognizing the high rates of spread of COVID-19, the severity of cases and its related high death rates, governments followed the advice of the WHO and took decisions of lock-down cities, banning local and international flights, restricting movements of millions and suspending schools, universities and business operations. Such decisions made the people feel stressed, depressed and/or anxious, with variable degrees of psychological impacts. Moreover, with the long stay at home, the people are getting anxious and looking forward to returning to their normal life, work and activities [10, 11].

The ARIMA and SARIMA models are widely used statistical approaches for time-series analysis and forecasting. The non-seasonal ARIMA \((p, d, q)\) method is employed to build the pure seasonal SARIMA \((p, d, q) \times (P, D, Q)_s\) model. Currently, the public’s concern is to find answers for two questions; (1) When this COVID-19 pandemic will be over? and (2) After coming to its end, will COVID-19 return again in what is known as a second rebound of the pandemic? In this work, we have used the SARIMA statistical model to answer both questions on the scientific basis of algorithmic modelling.

The main contributions of our research work include:

  • Finding the best prediction models for daily confirmed cases in countries with the highest number of COVID-19 cases in the world to have more readiness in health care systems to forecast of the confirmed cases.

  • Analysis the risk of second rebound of COVID-19 pandemic

  • Estimating the pandemic life cycle and selecting the optimal parameter of the model using the grid search method. The proposed method outcomes matched the updated daily data.

  • Significant results are achieved when compared with the state-of-the-art models. Hence, the proposed SARIMA model can be extended and used to predict other countries as it is giving an acceptable performance when observed its accuracy.

  • Mathematical model presents the statistical estimation of the slowdown period of the pandemic which is extracted based on the concept normal distribution.

This paper is organized as follows: Sect. 2 presents the related works. Section 3 presents dataset description with current statistics. Section 4 introduces the proposed methodology. Section 5 presents the experimental observations and detailed discussion. Finally, the conclusions and possible future works are introduced in Sect. 6.

2 Related work

Lai et al. [12] studied the epidemic nature of COVID-19 incidence in terms of daily cumulative index, mortality rate and associative status of the countries health care resources and economy. With the catastrophic outbreak of COVID-19 globally, a huge volume of data is generated instantly and opens a hot research avenue for machine learning and artificial intelligence researchers.

Luo [13] provided a simple figure for each country to show the estimated pandemic life cycle together with the actual data to date and reveals the rate of spread of the infection and ending phase. The predictions were started purely driven by personal curiosity regarding when COVID-19 will end. However, this work needs more update with more analyses and cases, as well as sharing of learning and reflections from this exercise, and they did not use any mathematical model to show the predictive model’s behaviour.

Dandekar and Barbastathis [14] proposed a method to capture the current infected curve growth and predict a halting of infection spread by 20-Apr-2020. This method has shown that reversing quarantine measures right at this time can lead to an exponential explosion in the infected case count, thus annulling the part played by all measures implemented in the USA since 15-Mar-2020. However, the model used data of one-month period following the current US policy, that implies it has lack of sufficient data to make strong predicts.

The Institute for Health Metrics and Evaluation (IHME) COVID-19 health service utilization forecasting team, Christopher [15] peaked daily deaths varies from 30-Mar-2020 through 12-May-2020 by state in the USA and 27-Mar-2020 through 04-May-2020 by countries in the European Economic Area (EEA). They have estimated that through the end of July, there will be 60,308 deaths from COVID-19 in the USA and 143,088 deaths in the EEA. Deaths from COVID-19 are estimated to drop below 0.3 per million between 04-May-2020 and 29-Jun-2020 by state in the USA and between 04-May-2020 and 13-Jul-2020 by country in the EEA. Timing of the peak required for hospital resources highly varies across states in the USA and regions of Europe.

According to the WHO report on guidelines to protect COVID-19 [16], it infects humans by entering the body via different parts such as eyes, nose and/or mouth. It shall be noted that to avoid this infection, the guideline by WHO suggests not to touch the face with unwashed hands. Proper washing of hands with detergents such as soap and water for at least 20 s or cleaning hands thoroughly with alcohol-based solutions is recommended in all settings. It is also recommended to stay one meter or more away from one another to reduce the risk of infection through respiratory droplets. COVID-19 spreads rapidly in droplets and somehow surfaces.

Lutfi and Burcu [17] performed Auto-Regressive Integrated Moving Average (ARIMA) model on the European Centre for Disease Prevention and Control (ECDC) COVID-19 data to predict the number of confirmed cases and deaths of COVID-19. The limitation of this particular study is that a limited number of countries were considered. However, Tandon et al. [18] developed a model to use for forecasting future COVID-19 cases in India. The study indicates an ascending trend for the cases in the coming days.

Previous researchers were focused on developing methods to achieve an accurate and time-efficient model for prediction of the spread of COVID-19. The main drawbacks of the previous research works were less accurate prediction in most cases. In reference to the related work on COVID-19, there were great ideas to improve and indicate an ascending trend for the cases in the future. Generally, previous works lack promising features that could enable us to predict the spread of COVID-19 with better accuracy and manifest the time when it will slow down.

3 Dataset description

To validate our work, we used the records of COVID-19 data from WHO and Johns Hopkins university official websites [8]. The data shows confirmed cases, daily recovery and death rates. In our work, we have considered COVID-19 datasets for 20 countries that have a maximum spread of the pandemic as shown in Table 1 that indicates the updated data as of 10-Aug-2020, and the pie chart in Fig. 1 shows the distribution of confirmed cases whereby the top 11 countries confirmed cases are presented in percentage. The remaining countries show a small number of confirmed cases; hence, we do not present them in percentage. Table 2 describes the currently active and closed cases where out of the total infected cases, 99% of the patients are in mild condition and 1% are in critical condition. In the cases of closed cases, 95% of the patients have been recovered and 5% of them have died.

Table 1 The top 20 countries sorted by the number of confirmed cases as of 10-Aug-2020 [8]
Fig. 1
figure 1

Distribution of cases as of 10-Aug-2020 for top 11 countries [8]

Table 2 A sample of the top countries sorted by the number of confirmed cases in 10-Aug-2020 [8]

3.1 Current statistics

The age factor and death rate due to COVID-19: Table 3 presents the collected data from New York City (NYC) Health as of 14-Apr-2020 and 13-May-2020, [8, 19]. All data in this report are preliminary and are subject to change as cases continue to be investigated. These data include cases in NYC residents and foreign residents treated in NYC facilities. This table shows only confirmed deaths. A death is considered confirmed when a person dies after positive COVID-19 laboratory test has been confirmed. The main underlying illnesses that lead to high risk of death if one has got infected by COVID-19 include diabetes, lung disease, cancer, immunodeficiency, heart disease, hypertension, asthma, kidney disease and liver disease. The death rate is computed as shown in Eq. 1.

$$\begin{aligned} \text {Death Rate}= &\, \text {number of deaths} / \text {number of cases}\nonumber \\= &\, \text {probability of dying if infected by the virus (\%)}. \end{aligned}$$
(1)

Preexisting medical conditions (comorbidities) put patients at higher risk of death from COVID-19 pandemic. Patients who have no preexisting (comorbidities) medical conditions are having a fatality rate of 0.9%. Table 3 depicts the rate of death due to COVID-19 for various age range in New York City. For people in the age range from 0 to 17 years old, the rate of death is insignificant if the patients do not have an underlying health condition. In the case of elderly people whose age is 75+ years old, the rate of death rate reaches 14.3%. Generally, as the age increases and if the patient has an underlying health condition, there is a high risk of death due to the COVID-19.

Table 3 The age factor and death rate due to COVID-19 in New York city health on 13-May-2020 [8]

Moreover, the data depicts men are highly susceptible to death compared to that of women. Out of the total death rates, 61.8% men and 38.2% women die due to COVID-19 in New York City as of 13-May-2020 as shown in Table 4.

Table 4 Sex ratio of death rate due to COVID-19 in New York city health on 13-May-2020 [8]

Table 5 shows the COVID-19 fatality rate by age in China. The fatality rate varies depending on the age group. The percentages shown do not have to add up to 100%, as they do not represent the share of deaths by age group. It presents the risk of dying if one is infected with COVID-19 for a person in a given age group. In general, relatively few fatality cases are seen among children [19].

Table 5 Death rate in China due to COVID-19 by age group [19]

Table 6 shows the fatality rate in China in terms of sex ratio. Like the cases in other countries, the probability of fatality rate by sex ratio in China varies. When reading these numbers, it must be taken into account that smoking in China is much more prevalent among males. Smoking increases the risks of respiratory complications. Hence, males are highly susceptible to death when compared to females which are evidenced empirically as 4.7% and 2.8%, respectively.

Table 6 Sex ratio of death rate due to COVID-19 in China on 13-May-2020 [8]

Table 7 shows COVID-19 fatality rate by comorbidity in China. This probability differs depending on the preexisting condition. The percentage shown in the table does not represent in any way the share of deaths by a preexisting condition. Rather, it represents, for a patient with a given preexisting condition, the risk of dying if infected by COVID-19.

Table 7 Fatality rate by comorbidity in China [8]

4 Methodology

In the subsequent subsections, the proposed Auto-Regressive Integrated Moving Average (ARIMA) have been described. The ARIMA is a statistical and econometric model applicable in time-series analysis-related problems mainly to understand the data or to predict future points in the series [20].

4.1 The ARIMA models

A time-series \(Y_t\) is described as a series of independent variables based on time, where \(t\) is a time step [21]. A deterministic time-series is expressed by the function, \(Y_t= f (t)\). While the stochastic time series is expressed by \(Y_t= X (t)\), where \( X\) is a random variable. The ARMA model developed by Box et al. [22] has been used for the forecasting process in the stationary time series. Box-Jenkins (ARMA) forecasting model is very popular as it has high prediction efficiency in the stationary time series analysis [23]. An autoregression AR \( (p) \) is a known time series method used to predict the future value by using observations from previous \( p \)-time steps as inputs to the regression equation multiplied by the appropriate coefficients \(\phi\) of AR [24, 25]. Besides, the sum is extended by adding the mean of the series \(\mu\) and white noise \(\omega\) that is a random error. The AR  \( (p) \) model is given in the form shown in Eq. 2.

$$\begin{aligned} AR(p): y_t = \mu + \sum _{i=1}^{p}(\phi _i y_{t-i}) + \omega _t \end{aligned}$$
(2)

The polynomial function of the Moving Average MA \( (q) \) method is not included for any variable from a time-series [26]. It consists of three parts that include: the first part is the mean of the series \(\mu\), the second part is the summation of the multiplication of a finite number of MA coefficients, \(\theta\), and model residuals \(\omega\), and the third part is the white noise \(\omega _t\). The MA \( (q) \) model is given in Eq. 3.

$$\begin{aligned} MA(q): y_t = \mu + \sum _{i=1}^{q}(\theta _i \omega _{t-i}) + \omega _t \end{aligned}$$
(3)

The ARMA \( (p, q) \) model composes of two main polynomials which are AR \( (p) \) and MA \( (q) \) [27]. Mathematically it is represented as shown in Eq. 4.

$$\begin{aligned} y_t = \mu + \sum _{i=1}^{p}(\phi _i y_{t-i}) + \sum _{j=1}^{q}(\theta _j \omega _{t-j}) + \omega _t \end{aligned}$$
(4)

or

$$\begin{aligned} \phi (B) y_t= \mu + \theta (B) \omega _t \end{aligned}$$
(5)

The notation ARMA \( (p, q) \) represents the order of an ARMA method, described as follows:

  • \(y_t\) stands for predicted value at time \( t \),

  • \( p \): is the order of AR polynomial indicating number of autoregressive lags,

  • \( q \): stands for the order of MA model presenting the number of moving average model lags,

  • \(\phi _i\): The AR \( (p) \) coefficients has to estimate \((i=1,2,\ldots ,p)\),

  • \(\theta _j\): MA \( (q) \) coefficients (parameters) that need to estimate, \((j=0,1,2,\ldots ,q)\),

  • \(\mu\): represents the mean value of the time series data,

  • \( d \): represents the number of differences and is calculated based on the equation \(\Delta y_t =y_t -y_{t-1}\)

  • \(\omega _t\): represents the white noise of the time-series at time t.

The ARIMA \( (p, d, q) \) model is a widely used statistical method used in stationary time-series analysis such as forecasting [28]. To build such a model, the primary step is to investigate whether the statistical stationery of a time-series can be satisfied or not. Then, the next phase is estimating the numerical values of \( p \) and \( q \) parameters for AR and MA models. Thus, the essential idea of the ARIMA model is based on the assumption that the predicted value of the variable \(y_t\) is generated from a linear equation of several previous observations with random errors [29]. A process \(X_t\) is an ARIMA \( (p, d, q) \) when it satisfies the form in Eq. 6.

$$\begin{aligned} \nabla ^d X_t = (1-B)^d X_t \end{aligned}$$
(6)

In other words, the process \(X_t\) should be stationary after differencing a non-seasonal process d times. During the training step of the ARIMA model using the available dataset, the values of pd, and q are continually changing until the end and the last values are considered for the forecasting of the future values. The mathematical description of the model is presented as shown in Eq. 7.

$$\begin{aligned} \phi _p (B) (1-B)^d X_t = \mu + \theta (B) \omega _t \end{aligned}$$
(7)

4.2 Seasonal ARIMA model

The non-seasonal ARIMA model (pdq) is vital in building pure seasonal SARIMA\((p, d, q) \times (P, D, Q)_s\) model, whereby the term (pdq) presents the non-seasonal part of the model and \((P, D, Q)_s\) describes the seasonal part of the model [30, 31]. The mathematical description of the model is presented as shown in Eq. 8.

$$\begin{aligned} \phi _p (B) \Phi _P (B^s) W_t = \theta _q(B) \Theta _Q(B^s)\omega _t \end{aligned}$$
(8)

The notation of Eq. 8 is described as follows: pd  and q are represented in the previous Eq. 4, P presents the order of seasonal AR model, D indicates the number of seasonal differencing, Q refers to the order of seasonal MA, and s is the length of the season (periodicity). Besides, the \(\omega _t\) and B are the white noise value at period t, and the backward shift operator, respectively.

Equation 8 presents the seasonal components of SARIMA which can be expanded mathematically after substituting the value of \(W_t =\nabla ^d (B)\nabla ^D_s (B) X_t\).

$$\begin{aligned} \phi _p (B) \Phi _P(B^s) (1-B)^d (1-B^s)^D X_t = \theta _q (B) \Theta _Q(B^s) \omega _t \end{aligned}$$
(9)

The components of seasonal SARIMA can be written as:

  • non-seasonal AR:\(\phi _p (B) = 1- \phi _1 B - \phi _2 B^2 - \phi _3 B^3- \cdots - \phi _p B^p,\)

  • non-seasonal MA: \(\theta _q (B) = 1- \theta _1 B - \theta _2 B^2 - \theta _3 B^3- \cdots - \theta _q B^q,\)

  • seasonal AR: \(\Phi _P(B^s) = 1- \Phi _1 B^s - \Phi _2 B^{2s} - \Phi _3 B^{3s}- \cdots - \Phi _p B^{ps},\)

  • seasonal MA: \(\Theta _Q(B^s)= 1- \Theta _1 B^s - \Theta _2 B^{2s} - \Theta _3 B^{3s}- \cdots - \Theta _Q B^{Qs}\)

and

  • \(B^s X_t = X_{t-s}\),

  • \(\nabla _s X_t = \nabla _s (B) X_t = (1-B^s) X_t = X_t -B^s X_t = X_t - X_{t-s},\)

  • \(\nabla ^d (B)X_t = (1-B)^d X_t,\)

  • \(\nabla ^D_s (B) X_t = (1-B^s)^D X_t\)

Considering the relationship within the data, SARIMA\((p, d, q) \times (P, D, Q)_s\) model is successfully applied to different time-series because of the order of SARIMA is a relatively small number. The period value of time-series s (seasonality) is based on the dataset. For instance, \(s =\)7,30,365 for weekly, monthly and yearly data respectively. The d and D indicate the order of the non-seasonal and seasonal differencing and its values are not more than 1 and 2 total of seasonal difference, respectively (i.e., \(0 \le d, D \le 1\)).

4.3 Model selection

There are three steps in ARIMA model creation namely identification, parameter estimation, and diagnostic checking [32]. The identification process of the model deals with determining proper differencing to get stationary time-series, the order of the model desired and the autocorrelation (ACF) and partial autocorrelation (PACF) functions that are used to recognize the temporal correlation structure of the transformed data. ACF is a statistical metric of the correlation that is used to check if previous values in the time-series analysis have a certain relationship with the latest values or not. For all low order lags, PACF represents the value of the correlation coefficient between the variable and its time lag [33].

The two main methods commonly used to select appropriate models are Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) of Schwarz which are presented in Eqs. 10 and 11 for AIC and BIC, respectively [34, 35].

$$\begin{aligned} AIC= & - 2 \log ( L) + 2 k = - 2 \log ( L) \nonumber \\&+ 2 (p+q+P+Q) \end{aligned}$$
(10)
$$\begin{aligned} BIC= & - 2 \log ( L) + k \ln (n) = - 2 \log ( L) \nonumber \\&+ (p+q+P+Q) \ln (n) \end{aligned}$$
(11)

In this regard, n refers to the size of the series, and k presents the number of the parameters of the ARIMA method. It is experimentally proved that our model becomes efficient when the value of AIC is smaller. According to [22], an optimal forecasting model is selected based on the best fitting that has the minimum AIC value of the group.

4.4 Data normalization

In this work, data normalization using the min-max scalar function which is available in the scikit-learn library has been applied. Scaling data is a vital task to stabilize the value of variance. Generally, data normalization enhances performance and minimal computational complexity. Equation 12 is used to normalize all datasets before starting to train the model where \(Y_i\) presents the scaled datasets, \(x_i\) refers to the actual data, and the terms \(min (x_i)\) and \(max (x_i)\) presents the minimum and maximum values of the actual dataset, respectively.

$$\begin{aligned} Y_i = \frac{x_i- min (x_i)}{max (x_i) - min (x_i)} \end{aligned}$$
(12)

5 Experimental results and evaluation

In the subsequent subsections, the experimental results of the proposed method are presented. The experimental results are presented in terms of simulated results and tabular form and comparative study with state-of-the art methods also are carried out.

5.1 Experimental results

To carry out the experiments, the following machine learning libraries such as scikit-learn and Stat are used. The experimentations are executed on the Kaggle environment that provides the required packages. The COVID-19 dataset is collected starting from 22-Jan-2020 to the present time from official websites and data repositories such as WHO and world meter [3, 8]. To attain the best prediction, different parameters of the proposed model are tuned using a grid search technique. The values of parameters have been selected based on the collected data from the corresponding countries. For each country, the best parameters of the SARIMA model are identified and used to forecast for the next 60 days.

The SARIMA model can predict the current time and forecasts the future. In this study, the model is used to forecast the number of confirmed in the next few weeks. It can estimate the full pandemic life cycle and visualize its corresponding curves. The model is fitted with the training data set followed by validation using the test set. After estimating the full life cycle curve for each country, it determines the peak point in the bell-shaped curve to show when the pandemic will stop. For each model, the initial phase creates a set of parameters and initializes them with a bunch of values. Then, the grid search is applied to find out the optimal model that has minimum values of AIC. Next, the model selects the best combination of parameters that can provide minimum error (AIC) and assigned to the best model.

The proposed method is used to estimate the pandemic life cycle. To select the best parameter of the model, the grid search method is applied to each country’s data. The proposed method updates the daily data with the newest version. Table 8 presents the experimental results of the proposed method for the diagnostics test on the global dataset. Moreover, Table 9 shows the experimental results of the diagnostic test using the SARIMA model for the global data that have p-values \(\le 0.05\), that indicates minimum values of the AIC of each model.

In this work, we have experimentally proved that the model parameters vary from country to country as the data for each country substantially differs. Considering the relationship within the data, the SARIMA model \((p, d, q) \times (P, D, Q)_s\) is successfully applied to different time-series data. The period value of time-series s (seasonality) is considered based on the dataset. Since the daily data for a few months have been used, the value of s is assigned to be 3,7,12. The best forecasting SARIMA model parameters are selected based on the minimum values of AIC, and P-values that are less than 0.05. Table 8 presents the AIC values of different forecasting models. The following SARIMA\((9, 0, 8)\times (0, 0, 0, 3)\) model has the lowest AIC values as shown in Table 9. The best combination of the parameters \((9, 0, 8)\times (0, 0, 0, 3)\) is considered to be the best for the corresponding model.

Table 8 The experimental results of the diagnostics test on the global COVID-19 data using the proposed SARIMA model
Table 9 Experimental results of the diagnostics test for SARIMA models that have p-values less than 0.05 for Global

To train and validate the proposed SARIMA model, We have split the COVID-19 data into training and testing dataset on the basis of 70% and 30% ratio for training and validation for testing for each country. The training set comprises data from 22-Jan-2020 to  15-Jun-2020 and the testing set is from 15-Jun-2020 to current day. Table 10 presents the forecasting values with lower and upper confidence limits that are calculated using the proposed model for the period from 15-Jun-2020 to current day. Figure  2 shows the observed (marked in blue line) or training set from 22-Jan-2020 to 15-Jun-2020 and the testing set from 15-Jun-2020 to present-day and values for one step ahead forecasting is presented by the red line. In Fig.  3, the forecasted values marked in the red line, actual values marked in blue line and grey shading area are used for the confidence intervals with lower and upper confidence limits.

Fig. 2
figure 2

Comparison between the observed and predicted values (one-step ahead result) for SARIMA model on COVID-19 dataset

Fig. 3
figure 3

The forecasted values for the COVID-19 new cases over the globe until 15-Nov-2020

The proposed model predicts the number of the confirmed cases of the next few days or months using the previously observed data as shown in Table 11 with lower and upper confidence limits. Although the increasing trend is visible, the proposed model has better performance for the testing set. Generally, the forecast performance is acceptable when the MSE and RMSE values for the testing set from  15-Jun-2020 to present day are 2.48513 and 1.57643, respectively.

Table 10 Experimental results for the proposed SARIMA\((9, 0, 8)\times (0, 0, 0, 3)\) Model (from  14-Jul-2020 until  12-Aug-2020) with 95% CI
Table 11 The forecasted values of daily confirmed cases for 60 days using SARIMA\((9, 0, 8)\times (0, 0, 0, 3)\) model with 95% CI

5.2 The risk of second rebound of COVID-19 pandemic

Epidemiologically, the history of the deadly pandemic viral infection demonstrates that after getting to the end, they are usually followed by waves of significant spread and deaths. For instance, the Spanish flu first appeared in the USA and then transmitted to Europe via World War I participant soldiers in early Mar-1918. It had all the hallmarks of the seasonal flu, that is highly contagious and infectious strains. Yet the first wave of the virus did not appear to be particularly deadly, with symptoms like high fever and malaise usually lasting only three days. There was hope at the beginning that the virus had finalized its course. However, somewhere in Europe, a mutated strain of the Spanish flu virus had emerged. This mutated virus got spread by the end of wartime troop movements from England to France, Africa and the USA causing the fatal severity of the Spanish flu’s “second rebound” [36, 37].

Another example was the H7N9 pandemic. Since its emergence in Mar-2013, novel avian influenza A H7N9 virus has triggered five epidemics of human infections in China. This raises concerns about the pandemic threat of this quickly evolving H7N9 subtype for humans [38,39,40,41].

The worrying thing is that many countries are preparing to ease their lockdowns while planning to continuously monitor potential new cases to prevent a second deadly outbreak. The uneven progress of countries’ efforts to control the virus has led health researchers to warn that nations will have to monitor closely for new infections and adjust the measures in place until the availability of vaccine. China’s aggressive control over the daily life have nearly brought the first wave of COVID-19 to an end; however, the danger of a second wave remains uncertain [3, 4].

While these control measures appear to have reduced the number of infections to some extent, without herd immunity against COVID-19, cases could easily resurge as businesses, factory operations and schools gradually resume and increase social mixing, particularly given the increased risk of imported cases from overseas as COVID-19 continues to spread globally. World leaders and health officials are warning that hard-won gains must not be risked by people relaxing physical distancing measures [42, 43].

From the outset of this worldwide pandemic, multiple models have been developed by different organizations and research institutions. Generally, models present the worst-case and best-case scenarios, under different sets of circumstances. With each model, the timing, height, and width of the peak of confirmed COVID-19 cases and deaths rates are uncertain. This is due to complexity and randomness in the dynamics of virus transmission and uncertainty in key epidemiological parameters [44].

As presented in Fig. 4, the green line depicts the health care system capacity. The part of the red line of the bell curve above the ideal green line shows that if social distancing is not respected, millions of people may die due to the pandemic. On the other hand, if the social distancing measures are strictly followed, only thousands of people may die before the end of the pandemic (as depicted by the blue coloured bell-shaped line). Besides lowering the morbidity and mortality indices, social distancing measures aim to ensure there is less burden to the health care system [44, 45].

Fig. 4
figure 4

Death flatten curve

With due acknowledgement to the uncertain nature of the ongoing COVID-19 pandemic and our growing inter-connected and complex world, what is eventually and fundamentally required are the flexibility, robustness and resilience to deal with unexpected future events and scenarios.

Moreover, the proposed model forecasts that there is a chance of the second rebound of the pandemic in a year time if the prevention guidelines and precautions are not followed. We have to consider the uncertain nature of the current COVID-19 pandemic and the growing inter-connected and complex world, that are ultimately demanding flexibility, robustness and resilience to cope with the unexpected future events and scenarios. Our study shows the pandemic rebound is in line with the current scenario in some countries such as India, Brazil and the USA as their social distancing and related measures are relaxed (See Tables 12 and 14).

5.3 Estimation of slowdown of COVID-19

The COVID-19 is similar to other pandemics in terms of life cycle pattern which includes the outbreak, slowdown, stoppage phases and infection peak point. Based on the various phases of the life cycles of COVID-19 at a specific point in time, each country has a different starting date of the first phase based on the first confirmed case. For example, the first confirmed cases in the USA and Italy is on 15-Jan-2020, and on 31-Jan-2020, respectively [8].

The basic idea of our assessment is based on the assumption that the data follows the concept of normal distribution. The proposed predictive model enables to estimate the expected period that the virus can be slowed down and ultimately stopped. The inflection’s peak point is specified as it appears like the peak point in the bell-shaped curve that depicts a possible slowdown and stoppage of the pandemic based on the normal distribution as shown in Fig. 5. However, estimating the ending date varies based on different considerations such as the first confirmed case and protective measures. Theoretically, one can define the end date as the one with the last predicted case in the pandemic life cycle curve, and others may consider an early date as the end date from businesses, schools or governments when most of the predicted infections (indicated by the regressed pandemic life cycle curve) have been actualized and only a small portion of the total predicted epidemic population is left.

Fig. 5
figure 5

A normal distribution within 1 standard deviation (\(\sigma\)) from the mean (\(\mu\)) using SARIMA

The following mathematical Equations present the statistical estimation of the slow down period of the pandemic which is extracted based on the concept of normal distribution. It explains how to calculate the area under the curve between \(\mu +2\sigma\) and \(\mu +3\sigma\) corresponding to the period that the pandemic can stop.

$$\begin{aligned} p(\mu +2\sigma<X<\mu +3\sigma )= & \,p(\frac{\mu +2\sigma -\mu }{\sigma } \\< &\, Z<\frac{\mu +3\sigma -\mu }{\sigma }) \\= &\, p(\frac{2\sigma }{\sigma }< Z<\frac{3\sigma }{\sigma })\\= &\, p(2< Z <3) = 2.1\% \end{aligned}$$

Figure 6 shows the confidence intervals (CI) for the expected total cases that have been identified and calculated as follows:

$$\begin{aligned} p(\mu -2\sigma< Z< \mu +2\sigma )= 95.46\% \\ p(\mu -3\sigma< Z <\mu +3\sigma )=99.73\% \end{aligned}$$

The final predictions of the proposed model provide the following three estimates of end dates: (1) The estimated period from \(\mu +2\sigma\) to \(\mu +3\sigma\) with probability 2.1% presents the last expected cases have identified. (2) The estimated period from \(\mu -2\sigma\) to \(\mu +2\sigma\) presents 95.46% of the expected total cases that have been identified. (3) The estimated period from \(\mu -3\sigma\) to \(\mu +3\sigma\) presents the date when 99.73% of the expected cases have been identified as shown in Fig. 6.

Fig. 6
figure 6

A normal distribution within 1 standard deviation (\(\sigma\)) from the mean (\(\mu\))

Table 12 presents the experimental results of the proposed model that shows the expected deadline of specified countries. The topmost affected countries in the first are the USA, Spain, Italy, France, the UK, Germany and Russia. In the second rebound of the pandemic, the model generates the countries namely the US, Brazil, India, Spain, Italy, France, the UK, Germany and Russia. Table 12 presents estimation without forecasting or with forecasting (for one month ahead) in the first rebound. Similarly, in the second rebound, the model generates estimation without forecasting or with forecasting (for one month ahead). The table has the names of the attributes such as country, date of the first confirmed case, the peak point (top of the bell-shaped graph), the start date is the first expected date with a confidence interval of 95%, the end date which is the last expected date with a confidence interval of 99%, start value (the corresponding value of the start date) and the end value is the corresponding value of the end date.

Table 12 Expected deadline for some countries in the first and second rebounds

The proposed method exhibits different forecasting results for the first and second rebounds of the pandemic for various countries. To make the forecasted results more updated and in line with reality, we are describing the second rebound cases. Table 12 shows the estimated time for the USA by applying forecasting approach. The expected number of confirmed cases for the USA will be 701996 on  08-Dec-2020, and after one and a half month that is on  05-Feb-2020, the number of confirmed cases will decrease to 13 as shown in Fig. 7. Moreover, for the second rebound when forecasting approach is applied, the expected number of confirmed cases will be 1072667 on  24-Jan-2020, and after three months that is on  02-Apr-2020, the number of confirmed cases will decrease to 15 as shown in Fig. 8.

Fig. 7
figure 7

Expected dead line for the USA without forecasting

Fig. 8
figure 8

Expected deadline for the USA in the second rebound with forecasting

Table 12 presents the estimated values of the end date of the pandemic in India. When forecasting approach is applied, the proposed method exhibited different results. Hence, the expected number of confirmed cases will be 21370 on  25-Nov-2020, and after two months, that is on  21-Jan-2021, the number of confirmed cases will decrease to 3 as shown in Fig. 9

Fig. 9
figure 9

Expected deadline for the India in the second rebound without forecasting

As presented in Table 12, for the case of Brazil, when forecasting approach is applied, the proposed method exhibits various results. The expected number of confirmed cases will be 135773 on  14-Oct-2020, and after two months, that is on  01-Dec-2020, the number of confirmed cases will decrease to 793 as shown in Fig. 10.

Fig. 10
figure 10

Expected deadline for the Brazil in the second rebound without forecasting

Table 12 shows the prediction of the deadline to end the pandemic for France using the real data, and the results showed that expected number of confirmed cases will be 148084 on  02-Dec-2020, and after two months that is on  28-Jan-2021, the number of confirmed cases will decrease to 12 without applying forecasting approach. When forecasting approach is applied, the proposed method exhibits different results. The expected number of confirmed cases will be 12758 on  20-Aug-2020, and after a month that is on  27-Sep-2020, the number of confirmed cases will decrease to 11 as shown in Fig. 11.

Fig. 11
figure 11

Expected deadline for the France in the second rebound without forecasting

China was successful in halting the COVID-19 epidemic as the government applied early quarantine strategy. The confirmed cases trend in China becomes stable and frequently remains between zero and one. This fact indicates that quarantine worked well to reduce human exposure and succeeded to control the epidemic. Moreover, the study shows Brazil and India had unstable trends. Finally, the expected confirmed cases for the top countries will be manifested between Dec-2020 to Apr-2020 as shown in Table 12. Moreover, these predictions may vary based on many factors such as the lockdown period or developing an effective vaccine against COVID-19.

5.4 Comparison with state-of-the-art models

In our work, we have carried out a comparative study with state-of-the-art methods as presented in Table 13. The comparative study is carried in the following models namely ARIMA, Machine learning (Random Forest) and deep learning model (LSTM). The performance of each model is evaluated using various metrics such as root-mean-square error (RMSE) and mean absolute error (MAE) on the test dataset. Based on the experimental results of the proposed SARIMA model, it is indicated that significant results are achieved when compared with the state-of-the-art models. Hence, the proposed SARIMA model can be extended and used to predict other countries as it is giving an acceptable performance when observed its accuracy.

Table 13 A comparative study of the proposed method with the state-of-the-art models in terms of confirmed cases

Table 14 presents the comparison with the state-of-the-art model for the top countries on the first rebound. The estimation of COVID-19 end dates for top countries with forecasting approach as of  Oct-2020 is 99.73% percentage. For example, the end date based on a the-state-of-art method for the USA is 27-Aug-2020 [13] while our model’s prediction date is on 15-Oct-2020 which is statistically more accurate. In any case, prediction and specifying an end date is arbitrary. Alternatively, estimation as a range of dates might make sense for such uncertain predictions. The estimated date range is expected to become narrower as the countries continually evolve along the pandemic life cycle curve to its end date.

Table 14 Comparison of the proposed model with the state-of-the-art method on the first rebound

In any prediction tasks, more data are needed to achieve better performance from the models underuse. The best predictive models can help in predicting future confirmed cases if the spread of the virus does not change radically. It is known that the pandemic COVID-19 virus is novel and can be transmitted easily. This can affect all the predictions, but to the best of our knowledge and in the time of writing, our proposed model is best compared to the state-of-the-art methods.

6 Conclusion

This research work investigates the answer to the most important questions raised today: when will the COVID-19 pandemic end and is there a possibility for the second rebound in case of returning to daily routine life. Despite accelerated virus mutation and the nature of the dataset based on time and date, the work done tried to reduce the variability of the data by taking only the dataset from WHO and John Hopkins University. The proposed model provides a statistical estimate of the slowing down of the pandemic, which is derived based on the normal distribution principle. The work done helped in estimating the life cycle of the pandemic and selecting the optimal model parameter using the grid search method. The experimental results of the proposed method match with the daily data to show the realistic nature of the proposed model.

The results pointed out to the likelihood that there will be a second rebound of the pandemic in a year time if the currently taken precautions are eased completely. This study will have a significant benefit in helping governments in making decisions and planning for the future to reduces anxiety and prepare the minds of people for the next phases of the pandemic. The proposed work has some limitations. Hence, we believe that it could lead to the next research avenue on COVID-19 pandemic and can be a good starting point considering the uncertain nature of the pandemic and our growing inter-connected and complex world. What is eventually and fundamentally needed is the flexibility, robustness and resilience to deal with unexpected future events and scenarios. The future work of this research will focus on improving the performance of our model by using a huge data and applying the proposed model to more countries. Moreover, we plan to update this study with more analyses and cases, by fine-tuning the prediction and visualization methodology.