1 Introduction

Since the outbreak of the COVID-19 pandemic [1], the estimate of a time-scale for the end of a wave of pandemic outbreak has undoubtedly become an outstanding challenge. Nevertheless, due to the variations of a wide range of parameters, such as the rate of spreading, the contact network of the individuals, various mitigation measures, etc., it is very difficult to make such an estimate [211]. However, a multidimensional set of data is often used in statistical learning approaches for making predictions [12, 13]. Indeed, such predictions have been attempted in a wide range of cases, such as financial time series, weather data, medical applications and many other physical systems [1417]. There have been multiple earlier attempts in using machine learning approaches for predicting epidemic spreading in the context of COVID-19 (see e.g., [21, 22]) as well. However, to have a proper estimate, a large set of training data is needed to be fed to the supervised learning algorithm. This is often a major hurdle to overcome for a pandemic such as the present one, the like of which is not seen in a century.

To address this issue, we first consider a simplified model of epidemic spreading, called the Susceptible-Infected-Removed (SIR) model [1820] and estimate its predictability using a supervised learning algorithm, by varying various parameters of the SIR model. We then find the condition under which the model is best suited to make ‘predictions’ about the first wave of the COVID-19 pandemic in different states in India. Since in most of these cases, the first and the second waves are separated by a period of low infection rates, the end-time of the first wave can be well defined. Therefore, it is possible to make an error estimate for the ‘prediction’ of the first wave end-time. The optimal condition of the model that ‘predicts’ the end of the first wave can then be used to generate a ‘synthetic’ training data set of substantial size. This ‘synthetic’ training data set can then be used to make predictions for the ongoing second wave. The use of synthetic data for enhancing prediction capability of ML algorithms is a well known technique (see e.g., [23]). By increasing the size of the training data set substantially, this technique enables the ML algorithm to make stable predictions.

Certainly, there are multiple issues in using the first-wave data for the optimization of the training set. Particularly, the two waves are, of course, different in several aspects: effects of vaccinations, changes in the norms of travel restrictions, presence of mutant variants of the virus, etc. One outcome of these variations would be the changes in the maximum values of the daily infection rates between the two waves. As is evident from the data in India [24], the peak height of the second wave was about four times larger than the peak height of the first wave. Therefore, we normalize the data by the peak height, which necessarily assumes that the peak for the second wave has passed, in each of the cases where we make the end-time predictions.

The rest of the paper is organized as follows: First we describe the SIR model and the machine learning methods used for making the predictions in the model (Sect. 2). In Sect. 3, we present the simulation results, describing the variations in predictability of the SIR model under different conditions (testing rate, site dilution). Then in Sec. 4, we use the ML algorithm to make predictions of the end-time of the second wave of the pandemic for some states in India. Finally, we discuss and conclude in Sect. 5.

2 Model and methods

The SIR model is a well studied model for epidemic spreading [18]. Although a simple model, this and its variants have been widely popular for epidemic spreading studies [19, 20]. The model assumes that the total population is divided into three groups - Susceptible: denoting the individuals who can get infected by the virus but are not yet infected, Infected: denoting the individuals who are currently infected and can infect the susceptible population and Removed: denoting the population who were already infected by the virus and do not affect the evolution of the spreading dynamics any more. For the case of COVID-19, it is not entirely conclusive whether an individual can get infected more than once. However, in our study we have assumed it to be the case.

There are several variants of SIR-like models that are applied in the case of COVID-19 spreading, specifically those concerning the imposition of social distancing measures and travel restrictions (see e.g., [2528], see also [29] for a review on effects of travel restrictions). However, in our case we have considered the simplest version of the SIR model, since the data for the additional states (e.g., exposed, isolated individuals) are not available for comparisons.

We consider the model on a two dimensional square lattice, where each site represents the location of an individual who can be in one of the three states mentioned above. An infected individual can, with certain probability, infect one of their eight neighbors (four nearest neighbors and four diagonal neighbors) if that neighbor is in the susceptible state. An infected individual remains in that state for a given duration of time, during which they can infect other susceptible individuals. Following that duration, the infected individual enters the Removed state, where they no longer participate in the spreading dynamics.

In one time step of the simulation, every individual is selected once, and their states are attempted for a possible update. The updates are done in a parallel updating scheme, such that a given update comes into effect in the following time step. If an individual is in the susceptible state and one of the eight neighbors is infected, then the susceptible individual can be infected with probability p. If an infected individual has remained in that state for \(\tau\) time steps, then that individual is put in the removed state. Here we keep \(\tau =14\) , and the value of p is fixed at a randomly chosen value between 0.3 to 0.8 from a uniform distribution for each realization of the model.

The number of infected individuals at a time t, denoted by I(t), represents what is usually referred to as the ‘active cases’. The time derivative of the susceptible (S(t)) individuals, dS(t)/dt, is essentially the number of new infections in a day (at t). Both of these quantities start from a low value. Initially, these quantities are identical and start from a low value. The initial infection is chosen randomly and uniformly between 10 and 20 for each simulation, independent of the system size. Both of these quantities then increase with time, reach a peak and then eventually decrease to zero. The model does not show multiple waves of infection rates, nor does it account for the effects of vaccination or restriction imposed in interactions.

While it is straightforward to get an exact solution for the mean field version of the model and also to numerically estimate the above mentioned quantities in other topologies (including the square lattice considered here), the actual situation is far more complex, and the available data sets are limited. Particularly, just the absence of tests for the individuals without symptoms and/or access to such medical facilities, would distort the data for the number of infections and other related quantities. Also, the topology of a square lattice is a simplified one and would at the very least include ‘disorder’ in terms of unoccupied sites. We first investigate the effects of these factors in predictability of the SIR model. For the prediction, we use a supervised machine learning algorithm, particularly the Random Forest algorithm. This is an ensemble of decision tress, and the predictions in the model are made using the majority of the predictions of the decision trees. The various attributes that we use for the training of the algorithm are: daily infection, daily recovery and the number of active cases at a particular time. The target variable for the prediction is the remaining time before the daily infection number goes below 5% of the peak value. In the Random-forest, we used 1000 estimators and kept the maximum depth at 15. The results are stable with small variations around these parameters. Following the training of the algorithm with a training set of 200 ensembles (each ensemble represents the full time series of the different quantities mentioned from the start to the end of the spreading dynamics), each having a randomly chosen infection rate and initial infection number, in the way outlined above. Then the trained algorithm is used to make predictions for a different set of 100 ensembles. Of course, there are multiple other factors that are not captured in the model or data; for example the effects of social distancing measures, quarantines, asymptomatic individuals, exposed individuals who were not showing symptoms yet and so on. However, since the relevant data for these parameters are not available for comparisons, we could not use it for this study.

In Fig. 1, a typical time series of the infection rate (normalized by the peak height) is shown. The actual remaining time and the machine learning (ML) predicted remaining times, at every instance, are also shown. The root-mean-squared fluctuations between the actual remaining time and the predicted remaining time at every instance give an estimate for the error in prediction. In the following, we first estimate the error in predictions i.e., the efficiency of the ML predictions under different conditions of variable testing rate and disorder in the SIR model. Then we use the model as training set to make predictions for the end of the second wave of the COVID-19 pandemic in different states in India.

Fig. 1
figure 1

The time variation of the daily infection rate (normalized) that starts from a low values, reaches a peak and then declines and the corresponding ML predictions for the time remaining before the end of infections are shown for a single run of the SIR model. The actual remaining time (straight line) is also shown, and the root-mean-square deviation between the predicted and the actual time gives the error in the prediction (Color figure online)

3 Simulation results

Here we describe the simulation results of the SIR model of epidemic spreading and estimate the variations of its predictability with different parameters of the model, using supervised machine learning algorithm.

3.1 ML predictability of SIR model with variable testing

As indicated before, a major source of distortion in the data for the pandemic is the limited testing resources available. This was especially apparent during the first wave of the pandemic (see e.g., [31]). Therefore, it is useful to understand, even for this simplified model, how does incomplete testing and/or variable testing rate affect the measurements in the model so as to affect, in turn, the predictability of the model.

We check this effect in two different ways. First we assume that only a fraction (v) of the total population can get tested i.e., only that fraction of the population have access to the necessary medical facilities. While the underlying SIR dynamics runs with the three possible states (S, I and R) for each individual, the measurements are made, at each time, only on the v fraction of the total population, chosen randomly and kept fixed in time for that realization.

Figure  2 depicts the effect of the various values of v, between 10% (\(v=0.1\)) to 100% (\(v=1\)) testing. While the (apparent) number of daily infection is very sensitive to the value of v, when scaled by this factor (Fig. 2b), the curves fall on top of each other, with a varying degree of fluctuations. The consequent error in the predictions using the ML algorithm, however, only varies weakly (Fig. 2c) with v. This gives an important conclusion that while drastically sub-sampling, the predictability remains almost the same in the model, as long as a macroscopic fraction of the randomly chosen population is tested.

Fig. 2
figure 2

The simulation results for the SIR model in two dimensional square lattice (\(200 \times 200\)) for a fixed fraction v of the population tested (from \(v=1\) to \(v=0.1\) showing the highest to the lowest peak heights). (a) The time variation of the daily infection rate for a particular realization of the model is shown for different values of v. (b) When scaled by v, the daily infection curved fall on top of each other, with a variation in fluctuations. (c) The variation in the root-mean-squared error in prediction made by the ML algorithm (for 100 sets, after training by 200 sets) with v is shown. No significant variation is noted (Color figure online)

Secondly, it is also known that the rate of testing is not a fixed quantity over the duration of the pandemic. Particularly, it is often dependent on the rate of positive results obtained in daily testing. Therefore, we also look at the variation due to a time dependent value of v. We vary v between a lower bound \(v_{min}\) and an upper bound \(v_{max}=0.5\), linearly dependent on the daily infection rate. Other than the linear dependence of v within this range, it is not allowed to fall below or increase above, the fixed threshold values. The individuals are again randomly selected for testing at each step, but now with a time dependent value of v. Fig. 3 shows the results for this case. In Fig. 3a, the time variation of the daily infections are shown for various values of \(v_{min}\) and \(v_{max}\). In Fig. 3b, the actual data for number of testing and the corresponding daily infections are shown for various states in India, justifying the choice of the linear variation (indicated by the straight line). Nevertheless, there is hardly any systematic variation in the error (hence also the predictability) with \(v_{min}\). This is also an interesting observation that while the time dependent testing rate can introduce fluctuations in the data, ultimately it does not translate to a lower predictability, as long as it is made sure that a macroscopic fraction of the individuals is always tested.

Fig. 3
figure 3

The simulation results for the SIR model in two dimensional square lattice (\(200 \times 200\)) for a time dependent testing rate are shown. (a) The time variation of the daily infections (scaled by the peak value) is shown for various ranges of allowed variations in v. The curves are of different fluctuations, but of similar nature. (b) The variations in the error in ML predictions with the lower bound in testing fraction (\(v_{min}\)), showing no systematic variation in predictability. (c) The actual dependence of the testing and daily positive cases is shown for different states in India that shows an approximate linear variation (indicated by the straight line for a guide to the eye). This is for the justification of the choice of the linear dependence of v with daily (apparent) infection rate in the simulations (Color figure online)

3.2 ML predictability of SIR model with site dilution

As mentioned before, topology of the contact network of the individuals can play a crucial role in the spreading dynamics. So far we have kept that to be very simple fully occupied two dimensional square lattice. But such an orderly arrangement is not realistic. As a simple way to introduce disorder, we remove a fraction q of the sites i.e., there are no individuals occupying that fraction of the sites This modification, of course, introduces a fluctuation that diverges near the critical point \(q_c\approx 0.4\) of site percolation [32] (see also [33] for percolation threshold with longer than nearest neighbor connections). It is generally known that a system with higher disorder is relatively more predictable through machine learning, compared to the systems having less disorder [17]. It is also known that the distribution of population in a city follow a fractal character [34], which will happen here near the percolation threshold. It is, therefore, interesting to study the variation of the predictability when the SIR model is simulated in a site diluted lattice.

Figure 4 shows the simulation results for the site diluted lattice. It is interesting to note that even when the infection curves are scaled by the corresponding maximum values, they do not overlap. Indeed, there is a non-monotonic variation in the duration upto which the dynamics run, with the dilution fraction. The corresponding errors, scaled by the error obtained without ML i.e., just considering the average duration of the training set as the predicted duration for each testing set, show a non-monotonic variation with the dilution fraction. A system size dependence shows that the minimum point of the error tends toward the critical percolation threshold, as the system size is increased. Therefore, we conjecture that the highest disorder in the model i.e., the percolation critical point, is the maximally predictable point as well. This is an interesting observation, since as mentioned before, at the percolation point, the occupied site form a fractal structure. As mentioned before, this mimics the fractal nature of the population distributions in cities [34], although with a different fractal dimension.

Fig. 4
figure 4

The simulation results for the SIR model in two dimensional square lattice (\(200 \times 200\)) for different values of site dilution fraction. (a) The duration for which the daily infection rate is finite, varies non-monotonically with site dilution. Therefore, the curves, even when scaled, do not overlap with each other. (b) The relative errors for various system sizes show a minimum that tends toward the percolation threshold with increasing system size (Color figure online)

In Fig. 5, the locations of infected sites and the corresponding infection times are shown (in color gradient). It is clear that for the dilution fraction around 0.6, the infected locations look like a fractal. It is known that the spatial arrangements of COVID-19 spreading indeed follow a fractal structure [35]. Therefore, it is significant that near the point where the infection spreading is fractal-like, the matching with first-wave prediction is the highest (see below). Finally, the marginally connected sites are infected much later, delaying the decay process of the daily infection curve (see Fig. 4).

Fig. 5
figure 5

The simulation results for the SIR model in two dimensional square lattice (\(200 \times 200\)) for different values of site dilution fraction. The three figures are for dilution fractions 0.5, 0.55 and 0.6 (from left to right) with colors indication the time of infection. The infected locations look like a fractal (see [35]) with the marginally connected locations getting infected at a much later time, delaying the decay of daily infections for dilute lattices (see Fig. 4) (Color figure online)

4 Application: Predictions of end-time of second wave in some Indian states

So far we have discussed the predictability of the SIR model using supervised machine learning. We have also seen that the predictability depends on the site dilution fraction in the model when simulated on a square lattice. Here we attempt in using the SIR model as a training set and then make predictions for the end-time of the second waves in eight Indian states where the total infection has crossed one million. These states are: Andhra Pradesh (AP), Delhi (DL), Karnataka (KA), Kerala (KL), Maharashtra (MH), Tamil Nadu (TN), Uttar Pradesh (UP) and West Bengal (WB).

In Fig. 6, we see that the first and second waves are more or less distinct in these states–separated by a low daily infection rate. First we use the SIR model with various site dilution fractions and make ‘predictions’ about the end-time of the first wave. Knowing the actual end-time of the first waves in these states, it is possible to estimate the errors in those predictions (Fig. 6(inset)). It is seen that the error is minimum for the dilution fraction 0.55. We therefore use the SIR model with dilution fraction 0.55 to generate a training set (500 sets) and then feed the data for the second wave to make a prediction about the end-time. In doing so, one obvious issue is with the peak height, which are very much different between the first and the second waves and also among the different states. We, therefore, normalize the training as well as the testing data by the corresponding peak heights. One obvious assumption, therefore, is that the peak for the second wave has passed, which is obvious in many of the states and are also indicative in the rest of the states.

Fig. 6
figure 6

The seven day rolling average of the daily infection from March 14, 2020 to May 20, 2021 in eight Indian states. The inset shows the errors in ‘predicting’ the first wave using training data from SIR model with various levels of site dilution. The minimum error is noted for the dilution fraction 0.55, which is then used as the training set for the predictions of the second wave (Color figure online)

We make another set of predictions by using the first waves as the training data. It is remarkable that the two sets of predictions are very close to each other. However, in making the final prediction (Fig. 7, Table 1), we use the training set of the SIR model.

The errors in the remaining time were estimated from the errors of the straight-line fit of the remaining time line. This has two components, the errors in the pre-factor and the errors in the slope. The final errors were calculated by taking the combinations of the extreme values of these two errors, which has resulted in asymmetric error bars in some cases.

Fig. 7
figure 7

The predicted end-time for the second wave in the eight Indian states, where the total infection has crossed one million. The errors are also indicated, which are estimated from the least squared fit of the straight lines of the final points of prediction. In cases where the infection rates are very close to the peak, the predictions are less accurate. The green dots are the predictions using SIR model as training set, and the purple dots are the predictions using the first wave data as the training set (Color figure online)

Table 1 The predicted end-dates for the second wave in different states and the corresponding errors. The errors are higher for the states where the infection rates are close to the peak (data as of May 20, 2021 [24, 30])

5 Conclusions

We have reported the variations in the predictability of the SIR model of epidemic spreading with different parameters of the model, using supervised machine learning algorithm. It is interesting to note that the predictions for the end-time in the model are remarkably stable, even when only a small fraction (10%) of the individuals are tested for the infection. The predictions are also stable when the testing rate vary with time–linearly with the positivity rate of the testing within a given range. However, the predictability changes substantially when a disorder is introduced in terms of site dilution in the model i.e., some positions are not occupied by any individual. Particularly, the relative predictability of the model is the highest (error in prediction is the lowest) when the site dilution fraction approaches the percolation critical point (see Fig. 4). In that case, the underlying lattice structure approaches a fractal and the fluctuations in the cluster size diverges with system size [32]. It is seen before that the predictability using ML approaches increases with the increase in the disorder in the system (see e.g., [17]). Indeed, the fluctuations in the time series of the various attributes used for the ML algorithm have richer characteristics (carrying more information), and consequently, the training of the algorithm is better. Also, it is interesting to note that the spatial distribution of population in cities is fractal in nature [34], although not necessarily of the same fractal dimension as that of the site percolation. Nevertheless, the fluctuations introduced in the daily infection rate and other related quantities due to the delayed spreading of the infections in marginally connection regions, would introduce qualitatively similar effects in any fractal geometry. Therefore, a fractal geometry is perhaps better suited to model the epidemic spreading.

We then use the model and the ML approach to make predictions of the end-time of the ongoing second wave of the COVID-19 pandemic in eight Indian states, where the total number of infections are over one million (see Table 1). In doing so, we first need to overcome the lack of training data for the ML algorithm. We first note that the first and second waves of the pandemic in India are somewhat separated by a relatively low daily infection rate. Therefore, we take the data for the first wave and make ‘predictions’ about its end-time using the SIR model as the training set. As the predictability of the SIR model is already shown to be sensitive to the lattice dilution, we estimate the dilution fraction for which the error in the ‘predictions’ for the first wave is minimum. We then use the SIR model with that dilution fraction to generate the training data set for making predictions about the end-time of the ongoing second wave. This approach assumes that the statistical nature of the fluctuations in the first and second waves would be similar once those are scaled by the respective peak infection rates. This in turn necessarily assumes that the peak infection for the second wave has already past, which indeed seems to be the case (see Fig. 6). In places where the peak has just reached (until May 20, 2021), the errors in predictions are higher. It also does not consider the effects of vaccinations, new mutant variants of the virus and changes in the travel restriction norms. Nevertheless, use of this ‘synthetic’ training data set enables the ML algorithm to make predictions for the end-time, which is otherwise difficult to do due to the lack of training data sets.

It is useful to note here that ML algorithms have been used widely in various aspects of COVID-19, including diagnosis to predictions on infections (see e.g., [36, 37]). It was also used for imposition of social distancing measures and the possible effects of early lifting of such measures (see e.g., [29] for a review). In our case, we do not implement such measures in the simulation due to the lack of data for the corresponding comparisons.

In conclusion, we note that the epidemic spreading in the SIR model on a two dimensional square lattice can be well predicted by supervised machine learning algorithms. The predictability of the model is sensitive to the site dilution fraction of the model and becomes the highest near the percolation critical point. An optimal condition for predictability can be obtained by tuning the site dilution fraction in the model that minimizes the prediction errors for the first wave of the COVID-19 pandemic. The optimized model can then be used to make predictions of the end-times for the ongoing second wave of the pandemic in different states in India with reasonable success.

Table 2 The infection numbers in states where the predicted end-date (when infection was supposed to be 5% of peak) has passed

Note added in the revised version: Since the work presented here (in the first submission) represent predictions made using data up to May 20, 2021, several states have now passed the predicted end-dates. In Table 2, therefore, we provide the six states for which the predicted end-date has passed and the value of the infection number on that particular date and what percentage of the peak was reached (predicted end-time was defined as 5% of peak).