Abstract

The onset of the COVID-19 pandemic and the subsequent transmission among communities has made the entire human population extremely vulnerable. Due to the virus’s contagiousness, even the most powerful economies in the world are struggling with inadequate resources. As the number of cases continues to rise and the healthcare industry is overwhelmed by the increasing needs of the infected population, there is a need to estimate the potential future number of cases using prediction methods. This paper leverages data-driven estimation methods, namely linear regression (LR), random forest (RF), and the XGBoost (extreme gradient boosting) algorithm. All three algorithms are trained using the COVID-19 data of Pakistan from 24 February to 31 December 2020 at daily resolution. Essentially, this paper postulates that, with the help of the values of daily new positive cases, medical swabs, and daily deaths, it is possible to predict the progression of the COVID-19 pandemic and demonstrate future trends. Linear regression tends to oversimplify the relationships it models and to neglect practical challenges present in the real world, which is often cited as its primary disadvantage. In this paper, we therefore also use an enhanced random forest algorithm. It is a supervised learning algorithm that is used for classification. This algorithm works well for an extensive range of data items, is very flexible, and achieves high accuracy. For higher accuracy, we have also applied the XGBoost algorithm to the dataset. XGBoost is a relatively recent machine learning algorithm that provides high prediction accuracy, and it is observed to perform well in short-term prediction.
This paper discusses various factors such as total COVID-19 cases, new cases per day, total COVID-19 related deaths, new deaths due to the COVID-19, the total number of recoveries, number of daily recoveries, and swabs through the proposed technique. This paper presents an innovative approach that assists health officials in Pakistan with their decision-making processes.

1. Introduction

COVID-19 was declared a pandemic by the World Health Organization (WHO) [1, 2]. There is a need for countries to act in unison to prevent further transmission of the disease. A pandemic is a disease that spreads worldwide [3]. Throughout history, the world has witnessed many pandemics. The most recent was in the year 2009 due to the H1N1 flu. The first few cases of COVID-19 were reported to the WHO on 31 December 2019 in the city of Wuhan, Hubei province, China, where several people were afflicted with pneumonia whose cause could not be determined. In January 2020, officials identified a novel virus that had not yet been named [4, 5] and that subsequently became known as the 2019 novel Coronavirus [6]. Upon obtaining samples and analyzing the virus genetics, it was established that this virus caused the outbreak. The disease was named coronavirus disease 2019 (COVID-19) by the WHO in February 2020 [7], and the virus that causes it was identified as SARS-CoV-2 [8, 9]. With its population of 204 million, Pakistan saw its first case in February 2020 [10]. Because Pakistan has the 5th largest population globally, it became essential to understand how the virus will progress in this vast population. Therefore, it has become essential to address the problem of the future trend of COVID-19-positive cases in Pakistan by using the COVID-19 dataset from [11]. Machine learning is widely used to handle large data, and it can help in this regard. We specifically test three methods, namely, linear regression, random forest, and the XGBoost algorithm. In this paper, we predict positive COVID-19 cases in the Pakistani regions of Sindh, Punjab, Gilgit Baltistan, Balochistan, Khyber Pakhtunkhwa, and Azad Jammu and Kashmir using three ML algorithms, and we compare the results in order to find the algorithm that gives the highest accuracy for the forecast of COVID-19-positive cases on this dataset.
A real-time forecasting scheme is presented based on ML models, which provides real-time prediction, allowing citizens and the government of Pakistan to take actions proactively [12, 13]. This paper predicts future COVID-19 pandemic trends by employing open-source data science libraries and machine learning tools in Python. The primary objectives of this study are as follows:
(i) To source [12], preprocess, visualize, and analyze the data of COVID-19 in Pakistan
(ii) To recognize the various parameters required for COVID-19 modeling and derive these variables for all three forecasting algorithms used
(iii) To rectify and eliminate biases
(iv) To model and predict the future trend of the COVID-19 pandemic
(v) To visualize and discuss the results

The COVID-19 dataset of Pakistan has not been tested on a large scale by using machine learning algorithms. This paper contributes by applying machine learning algorithms to indigenous datasets from Pakistan, which can significantly help in assessing the situation and planning actions accordingly. The paper is structured in the following manner: Section 2 presents an overview of literature related to COVID-19 forecasting; Section 3 explains the methodology for predicting COVID-19; Section 4 shows the results for all three machine learning models; and Section 5 illustrates the relationship between parameters. The final section summarizes this work and presents the main results.

Kavadi et al. [1] developed a mathematical model to assess and estimate the growth of the worldwide COVID-19 pandemic. A machine learning generalized inverse Weibull model was implemented to evaluate the potential risks associated with the Coronavirus. In order to ensure precise and real-time prediction of the growth of the pandemic, cloud computing was employed. A model was implemented by Nemati et al. [3] to highlight the efforts of the Pakistan government to fight COVID-19. Their paper presents the current scenario of the Coronavirus situation in Pakistan and provides information about the hospital facilities provided for COVID-19 patients. The results show that the recovery rate is higher than the mortality rate in Pakistan, and that Balochistan has more hospitals for COVID-19 patients, whereas Azad Jammu and Kashmir has the fewest. Isolation zones were built in Pakistan, and this study shows that the Punjab and Khyber Pakhtunkhwa regions have more isolation wards and better medical facilities. Ardabili et al. [4] proposed the PDR-NML method (partial derivative regression and nonlinear machine learning) to predict the pandemic trends of COVID-19. The results show that the proposed ML method is more effective than other state-of-the-art methods in the Indian population. Thus, it can be an innovative tool in helping other countries make their predictions. The authors of this study also used PPDLR for normalizing the features required for timely prediction and PDLFR for robust and accurate prediction, and observed that machine learning performed better for data analysis than artificial intelligence. Lalmuanawma et al. [5] predicted the trend of COVID-19. The Fb-prophet model was used to establish the pandemic curve and forecast its direction. A limitation of this study is that it used a limited dataset; the work is integrated into the logistic model.
Three significant points have been summarized based on the modeling results related to Indonesia, Peru, Brazil, India, and Russia. According to estimations based purely on mathematical aspects, the peak of the virus was expected globally in late October, and it was expected that 14.12 million people would be impacted on a cumulative basis. Rustam et al. [7] implemented the autoregressive integrated moving average (ARIMA) model to predict the new COVID-19 cases each day in Saudi Arabia for four weeks. The authors summarized four different prediction models in this study, namely the autoregressive model, the moving average, a combination of both (ARMA), and integrated ARMA (ARIMA), to identify the best model fit. The results show that the ARIMA model is more effective than the other models. Pandey et al. [8] aimed to forecast the COVID-19-positive cases in India and Odisha by using linear regression and multiple linear regression. It was observed that both models provided remarkable accuracy for the prediction of the COVID-19 pandemic. Roy et al. [10] summarized four machine learning algorithms to forecast COVID-19-infected people. The COVID-19 data between 20/01/2020 and 18/09/2020 for the USA, Germany, and the world were obtained from the World Health Organization. The performance of all algorithms was compared according to the RMSE, APE, and MAPE criteria, and it was observed that these models could be used to analyze the COVID-19 data over time. To forecast COVID-19-positive cases, Ayyoubzadeh et al. [11] used XGBoost, k-means, and long short-term memory (LSTM) neural networks to construct a prediction model. It was observed that k-means–LSTM provides higher accuracy with an error score of 601.20%.

3. Methodology

In this study, machine learning algorithms were applied, and each algorithm was evaluated on the different parameters shown in Figure 1. This research work involves a few significant steps: data collection, data preprocessing, applying machine learning algorithms, evaluation, and comparative analysis.

3.1. Data Collection

The data used in this work is accessed from http://covid.gov.pk [12, 14]. The information related to COVID-19 cases in Pakistan has been compiled from different sources, including Kaggle and the World Health Organization (WHO) [6, 15–17]. A cumulative dataset is created from a mix of the above resources. The data taken from http://covid.gov.pk/ is not in the required CSV format. It also contained some unnecessary data that was not needed to predict positive cases in Pakistan, so data preprocessing was done. The dataset includes the hospital data of COVID-19-positive patients, deceased patients, recovered patients, total deaths of patients, and the number of swab tests conducted every day in each region of Pakistan. The dataset contains all the COVID-19 data of the patients in the specified data collection period.

3.2. Data Preprocessing

After the collection of information, the data was transformed into the required CSV format. In order to rectify the issue of systemic bias, a feasible methodology was adopted. The moving-average method, which is typically used to assess time series through the computation of averages of various subsets within the complete dataset, was adopted for this purpose. In this context, seven days were taken as the size of each subset. Initially, the moving average was computed by finding the average of the first seven-day subset. Then, the subset was shifted forward, and the average of the following fixed-size subset was computed. This went on until all the subsets had been processed. Essentially, this method smooths the data by mitigating anomalies such as the weekend bias. In Figures 2–5, the dataset variables are plotted as time series depicting total COVID-19-positive cases across Pakistan, total COVID-19 deaths across Pakistan, new COVID-19-positive cases in Pakistan regions, and COVID-19 patients who are in serious condition. Figure 2 displays the daily new COVID-19-positive cases in Pakistan, as this is essential for forecasting. Figure 3 displays the weekly average of COVID-19-positive cases. Figure 4 represents total COVID-19-positive cases across Pakistan, and Figure 5 represents total deaths across Pakistan. Figure 6 displays daily new reported COVID-19 cases in Pakistan regions, whereas Figure 7 illustrates the data of COVID-19 patients who are in serious condition.
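The seven-day moving average described above can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `moving_average` and the daily counts are hypothetical and are not taken from the Pakistan dataset.

```python
from statistics import mean

def moving_average(values, window=7):
    """Trailing moving average: one averaged value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily new cases; days 6-7 and 13-14 mimic weekend under-reporting.
daily_new_cases = [100, 110, 120, 130, 140, 60, 50,
                   150, 160, 170, 180, 190, 80, 70]

smoothed = moving_average(daily_new_cases, window=7)

# The spread of the smoothed series is much narrower than that of the raw series.
raw_spread = max(daily_new_cases) - min(daily_new_cases)
smoothed_spread = max(smoothed) - min(smoothed)
print(smoothed)
print(raw_spread, smoothed_spread)
```

Because every window contains exactly one weekend, the weekend dips are averaged out and only the underlying upward trend remains.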

3.3. Applying Machine Learning Algorithms

After preprocessing, random forest, XGBoost, and linear regression models were applied to predict COVID-19-positive cases in Pakistan [18]. A linear regression model was employed to model the COVID-19 trend. It was trained using positive cases and new positive cases data at both the national and provincial levels in Pakistan. In regression, the coefficient of determination $R^2$ is a statistical measure of how precisely the regression predictions match the underlying data points. If the value of $R^2$ is 1, the regression predictions align exactly with the data. Thus, the closer the value of $R^2$ is to 1, the more effective the model is in predicting trends [19]. The random forest algorithm is a popular supervised machine learning algorithm, and it is employed for classification [20, 21]. It is an ensemble machine learning method. A random forest is built from many decision trees: $N$ outputs are obtained from $N$ decision trees using this algorithm.

3.4. Forecasting the Trend of Positive COVID-19 Cases across Pakistan Regions

The COVID-19 outbreak has badly affected the essential aspects of life around the world. In order to control this outbreak, smart lockdowns have been imposed all over the country and in the highly affected areas of Pakistan. This study provides an idea of the growth of COVID-19 in Pakistan and its provinces. It will also help Pakistan and its citizens make appropriate decisions to handle the situation by following proper SOPs and guidelines.

3.5. Forecasting the Trend of Positive COVID-19 Cases Using Linear Regression Algorithm

In this study, a detailed description of linear regression is presented. In addition, all the tests performed for the validity of linear regression are analyzed and discussed. We have used linear regression to forecast the value of a dependent variable from the provided independent variable data [22–24]. It was observed that there is a linear relationship between the independent and dependent variables. In our study, we considered $x$ as the independent variable and $y$ as the dependent variable, and the value of $y$ is predicted by using the following equation:

$$y = f(x), \tag{1}$$

where $x$ is a vector of input parameters and $y$ is a vector of output parameters. The inputs $x$ are also called independent variables, and the outputs $y$ are called response variables. In machine learning, regression is a method to find the relation between $x$ and $y$. When the relationship is modeled using a linear predictor function, that is, assuming the system is linear, equation (1) is represented by

$$y = X\beta + \varepsilon. \tag{2}$$

Here, $\beta$ is the vector of regression coefficients and $\varepsilon$ represents the vector of model errors. If we expand the above equation, it is represented as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i. \tag{3}$$

In equation (3), the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ are estimated by using standard methods. Let us assume that the estimated coefficients are denoted by $\hat{\beta}$; the fitted response is then represented by

$$\hat{y} = X\hat{\beta}. \tag{4}$$

The coefficient of determination $R^2$ is given by

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. \tag{5}$$

Here, $y$ represents the forecast of total positive cases in Pakistan and the variable $x$ represents the date; $\beta_0$ denotes the $y$-intercept and $\beta_1$ indicates the slope. The linear regression model is built by learning the values of $\beta_0$ and $\beta_1$ from a given dataset, where $R^2$ is the measure of the proportion of variation in $y$ explained by the input parameters. In equation (5), $y_i$ are the observed values, $\hat{y}_i$ are the fitted values, $n$ is the number of observations, and $\bar{y}$ is the mean of all observations.
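As a worked illustration of the intercept, slope, and coefficient of determination discussed in this subsection, the following minimal sketch fits a single-predictor line by ordinary least squares. Ordinary least squares is assumed here as the "standard method" of estimation; the function names and the case counts are hypothetical, with the day index standing in for the date.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: day index (independent) and total positive cases (dependent).
days = [0, 1, 2, 3, 4, 5]
total_cases = [12, 19, 31, 42, 48, 61]

b0, b1 = fit_line(days, total_cases)
fitted = [b0 + b1 * d for d in days]
score = r_squared(total_cases, fitted)
forecast_day_6 = b0 + b1 * 6    # extrapolate one day ahead

print(b0, b1, score, forecast_day_6)
```

A score close to 1 indicates that the straight line explains most of the variation in the hypothetical series; on real epidemic curves, which are rarely linear over long periods, the score would be expected to be lower.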

3.6. Forecasting the Trend of Positive COVID-19 Cases Using Random Forest Algorithm

To implement the random forest model, we first took the COVID-19 dataset of Pakistan as input. Then, the random forest model was trained on that dataset. The input features are treated as independent variables, and the actual number of COVID-19 cases is regarded as the dependent variable [25]. The random forest model was used for forecasting the COVID-19-positive cases in Pakistan territories. Implementation of this model is described in the following flowchart.
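The train-then-predict procedure described above can be sketched in a simplified form. The code below is not the implementation used in this study: it is a stand-in that bags depth-one regression trees on a single hypothetical feature (the day index), so it shows bootstrap sampling and averaging but omits the random feature selection of a full random forest. All names and data are illustrative assumptions.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Depth-one regression tree: choose the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    if best is None:  # every x in the bootstrap sample was identical
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]  # (threshold, left prediction, right prediction)

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict(forest, x):
    """Average the predictions of all trees (the regression analogue of voting)."""
    return mean([left if x <= t else right for t, left, right in forest])

# Hypothetical series: low daily counts for days 0-9, high counts for days 10-19.
days = list(range(20))
cases = [10.0] * 10 + [50.0] * 10

forest = fit_forest(days, cases)
early = predict(forest, 2)
late = predict(forest, 17)
print(early, late)
```

Averaging over trees trained on different bootstrap samples is what makes the ensemble less sensitive to any single sample than one tree would be.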

A random forest consists of many decision trees. Accuracy generally improves as the number of decision trees grows, although the gain levels off beyond a certain number of trees [26–28]. The primary purpose of this algorithm is to improve prediction accuracy by aggregating multiple classifiers. The random forest algorithm is widely used for classification and prediction. It can be applied to many fields such as forecasting, data analysis, text classification, and face recognition [29]. This algorithm combines multiple decision trees and classifier models. The construction process of the random forest is described in Figure 8. In our study, the prediction process is divided into two significant parts: the first part is the growth of the decision trees, and the second part is the voting process. The growth process is divided into three steps: the first is the random selection of the training set, the second is the random forest construction, and the third is node splitting. In the node splitting process, the feature with the smallest Gini coefficient is selected for the split. The Gini coefficient is calculated as follows:

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{m} P_i^2, \tag{6}$$

where $P_i$ represents the probability of category $C_i$ in the sample set $S$.

$$\mathrm{Gini}_{\mathrm{split}}(S) = \frac{|S_1|}{|S|}\,\mathrm{Gini}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Gini}(S_2), \tag{7}$$

where $|S|$ represents the number of samples in the set $S$, and $|S_1|$ and $|S_2|$ represent the numbers of samples in subsets $S_1$ and $S_2$. It was observed that the random forest algorithm provides better performance due to the random selection of the feature set and training set. In this study, $R^2$ and the mean square error for random forest were calculated using evaluation metrics [30]. The formula for calculating $R^2$ is given below:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. \tag{8}$$

In the above equation, $y_i$ represents the actual values, $\hat{y}_i$ represents the predicted values, and $\bar{y}$ represents the average of all values. If the value of $R^2$ is close to 1, then the model is good for forecasting.

The formula for the calculation of the root mean square error (RMSE) is shown below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \tag{9}$$

Here, $y_i$ represents the actual value, $\hat{y}_i$ represents the predicted value, $n$ indicates the number of samples, and $i = 1, 2, \ldots, n$.
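The Gini-based node splitting described earlier in this subsection can be sketched as follows. This is a minimal illustration rather than the implementation used in the study; the feature values, the class labels ("low" and "high"), and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini coefficient of a sample set: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini coefficient of the two subsets produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Return (threshold, split Gini) for the split with the smallest Gini coefficient."""
    best = None
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        g = gini_split(left, right)
        if best is None or g < best[1]:
            best = (t, g)
    return best

# Hypothetical feature (e.g. daily swab count in hundreds) and class labels.
values = [5, 8, 12, 20, 35, 40]
labels = ["low", "low", "low", "high", "high", "high"]

parent_gini = gini(labels)
threshold, split_gini = best_threshold(values, labels)
print(parent_gini, threshold, split_gini)
```

In this toy example the split at the third value separates the two classes completely, so the weighted Gini coefficient of the children drops to zero, which is why that threshold is selected.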

3.7. Forecasting the Trend of Positive COVID-19 Cases Using Extreme Gradient Boost (XGBOOST) Algorithm

Extreme gradient boosting (XGBoost) is a widely used and highly effective machine learning algorithm.

It combines weak classifiers into a robust classifier. The process is repeated according to the needs of the desired model, which in this study is to forecast the positive cases of COVID-19 in Pakistan territories. The XGBoost algorithm is a tree learning model which takes the decision tree as its basic unit, and the final learning model of XGBoost consists of many decision trees. It is an improved algorithm based on the gradient boosting tree. It uses CART or a linear classifier as the base learner of the gradient boosting algorithm. The XGBoost algorithm has several advantages for prediction problems:
(i) It supports parallelization
(ii) It handles missing values
(iii) It supports iteration based on the existing model
(iv) It provides controllable model complexity
(v) It provides high flexibility
(vi) It supports shrinkage

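The calculation of XGBoost is as follows:

Suppose a sample set $D = \{(x_i, y_i)\}$ with $n$ samples and $m$-dimensional characteristics ($x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$). A model containing $K$ decision trees can be represented by

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F. \tag{10}$$

Here, $F$ represents the function space formed by all tree models and $f_k$ represents a regression tree:

$$F = \left\{ f(x) = w_{q(x)} \right\}, \quad q : \mathbb{R}^m \to \{1, \ldots, T\}, \quad w \in \mathbb{R}^T. \tag{11}$$

Here, $q$ shows the mapping relationship from $x$ to the leaf node, and $w$ represents the weight of the leaf node.

The objective function of the defined model is as follows:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k). \tag{12}$$

In the above equation, $y_i$ represents the actual value and $\hat{y}_i$ represents the forecast value. The first part is the learning loss, and the second part represents the sum of the complexity of each tree; the complexity of the $k$th tree is given by

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \tag{13}$$

where $T$ is the number of leaf nodes, $w$ is the leaf weight, $\lambda$ is the penalty coefficient of the leaf weight, and $\gamma$ is the penalty coefficient of the profit function of a segmented leaf node.

The gradient boosting strategy is used in the XGBoost algorithm to generate a regression tree after every iteration, which is added to the existing model.

Assume that the forecast of the $i$th sample in the $t$th iteration is $\hat{y}_i^{(t)}$ and the newly added regression tree is $f_t$; then we get

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), \tag{14}$$

where $\hat{y}_i^{(t-1)}$ is the forecasted result of the model in round $t-1$ and $f_t(x_i)$ is the newly added function in round $t$.

By combining equations (14), (11), and the objective function, we get

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C, \tag{15}$$

where $C$ is the constant term.

Now, applying the second-order Taylor expansion to the above equation, we get

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C. \tag{16}$$

Here, $g_i$ and $h_i$ represent the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$, and $C$ is a constant.

By removing constants, we get

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t). \tag{17}$$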
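The derivation above can be checked numerically with a small sketch. It assumes a squared-error loss, for which the first derivative is $2(\hat{y} - y)$ and the second derivative is $2$, so the second-order Taylor expansion is exact. The targets, previous-round forecasts, leaf weights, and the values of $\gamma$ and $\lambda$ are hypothetical, and the complexity of earlier trees is taken as zero for simplicity.

```python
def squared_loss(y, y_hat):
    """Assumed loss function l(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2

def grad_hess(y, y_hat):
    """First and second derivatives of the squared loss with respect to y_hat."""
    return 2.0 * (y_hat - y), 2.0

def tree_complexity(leaf_weights, gamma, lam):
    """gamma * (number of leaves) + 0.5 * lam * (sum of squared leaf weights)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

# Hypothetical targets and forecasts from the previous boosting round.
y = [10.0, 14.0, 30.0, 34.0]
prev = [20.0, 20.0, 20.0, 20.0]

# New tree with two leaves: leaf_of[i] is the leaf assigned to sample i.
leaf_weights = [-8.0, 12.0]
leaf_of = [0, 0, 1, 1]
f_t = [leaf_weights[j] for j in leaf_of]

gamma, lam = 1.0, 1.0
omega = tree_complexity(leaf_weights, gamma, lam)

# Exact objective after adding the new tree.
exact = sum(squared_loss(yi, pi + fi) for yi, pi, fi in zip(y, prev, f_t)) + omega

# Second-order Taylor approximation around the previous forecasts.
taylor = omega
reduced = omega  # same expression with the constant loss term removed
for yi, pi, fi in zip(y, prev, f_t):
    g, h = grad_hess(yi, pi)
    taylor += squared_loss(yi, pi) + g * fi + 0.5 * h * fi * fi
    reduced += g * fi + 0.5 * h * fi * fi

# The removed term depends only on the previous round, not on the new tree.
constant = sum(squared_loss(yi, pi) for yi, pi in zip(y, prev))
print(exact, taylor, reduced, constant)
```

The exact and Taylor objectives coincide for this loss, and the reduced objective differs from them only by a term that does not depend on the new tree, which is why dropping it does not change which tree minimizes the objective.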

4. Results and Discussion

The COVID-19 virus has infected many people, and the number of infected people may increase in the future. The machine learning approach shows promising results for forecasting COVID-19-positive cases. Statistical models are essential techniques for analyzing infectious disease data in real time. In this research, a real-time COVID-19 forecast is built for the regions of Pakistan. Our prediction models performed very well in predicting the daily new confirmed COVID-19-positive cases in the regions of Pakistan. All the steps involved in building the proposed model are implemented in Python: the Pandas library is used for data loading and preprocessing, Matplotlib is used to plot the curves, and the Scikit-learn library is used to implement the classifier. The experiments were executed on a Dell system with an i7 processor and 64 GB of RAM. For further evaluation, metrics (accuracy, precision, support, recall, F1-score, and sensitivity) are used to measure the quality of the machine learning models. We have proposed a prediction model that works over six months to predict COVID-19 activity by incorporating the previous incidence of COVID-19. Our proposed model performed well for all regions of Pakistan. The performance of the algorithms was evaluated by using mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) [15, 31], and the $R^2$ evaluation metric. This proposed model has several advantages compared to other reported works on a similar topic. All of our forecasting models performed well for the COVID-19 forecast, but random forest and XGBoost provide better accuracy. We have used a large amount of data, which improved the performance of all ML models. Figures 9–14 show the results of the COVID-19 trend using linear regression in the regions of Pakistan. The red bars are the training data, whereas the blue bars are the predicted trend with indicated model scores.
If the blue bars are increasing, positive cases are increasing day by day. In Figure 9, red bars represent the actual COVID-19-positive cases in Sindh, Pakistan, whereas blue bars represent the predicted COVID-19-positive cases. According to the prediction, Sindh may have a higher number of positive cases in May. In Table 1, the error metrics show that, for Sindh, the MSE is 0.202, MAE is 2.024, RMSE is 4.859, and MAPE is 0.011. Furthermore, in Table 2, the linear regression model evaluated on the Sindh region has an accuracy of 86%, support of 28%, precision of 82%, recall of 1%, F1-score of 82%, and sensitivity of 1%. Figure 10 represents the forecast for Punjab, Pakistan. According to the prediction, Punjab may have a higher number of positive cases in May. In Table 1, the error metrics show that, for Punjab, the MSE is 0.202, MAE is 2.024, RMSE is 4.859, and MAPE is 0.011. Furthermore, in Table 2, the linear regression model evaluated on the Punjab region has an accuracy of 82%, support of 27%, precision of 72%, recall of 1%, F1-score of 83%, and sensitivity of 1%. Figure 11 represents the forecast for Gilgit Baltistan, Pakistan. According to the prediction, Gilgit Baltistan may have a higher number of positive cases in January and February, after which cases slowly decrease. In Table 1, the error metrics show that, for Gilgit Baltistan, the MSE is 0.202, MAE is 2.024, RMSE is 4.859, and MAPE is 0.011.
Furthermore, in Table 2, the linear regression model evaluated on the Gilgit Baltistan region has an accuracy of 84%, support of 94%, precision of 84%, recall of 1%, F1-score of 96%, and sensitivity of 1%. Figure 12 represents the forecast for Khyber Pakhtunkhwa, Pakistan. In Table 1, the error metrics show that, for Khyber Pakhtunkhwa, the MSE is 0.202, MAE is 2.024, RMSE is 4.859, and MAPE is 0.011. Furthermore, in Table 2, the linear regression model evaluated on the Khyber Pakhtunkhwa region has an accuracy of 86%, support of 27%, precision of 76%, recall of 1%, F1-score of 82%, and sensitivity of 1%. Figure 13 represents the forecast for Balochistan, Pakistan. In Table 1, the error metrics show that, for Balochistan, the MSE is 0.202, MAE is 2.025, RMSE is 4.860, and MAPE is 0.011. Furthermore, in Table 2, the linear regression model evaluated on the Balochistan region has an accuracy of 82%, support of 28%, precision of 71%, recall of 1%, F1-score of 82%, and sensitivity of 1%. Figure 14 represents the forecast for Azad Jammu and Kashmir, Pakistan. In Table 1, the error metrics show that, for Azad Jammu and Kashmir, the MSE is 0.202, MAE is 2.024, RMSE is 4.859, and MAPE is 0.011. Furthermore, in Table 2, the linear regression model evaluated on the Azad Jammu and Kashmir region has an accuracy of 74%, support of 30%, precision of 5%, recall of 1%, F1-score of 83%, and sensitivity of 1%.

Using the above random forest methodology, a visualization of the records in terms of actual versus predicted values is shown below in graphs. Figures 15–20 show the results of the COVID-19 trend using random forest in the regions of Pakistan. The red bars are the training data, whereas the blue bars are the predicted trend with indicated model scores. In Figure 15, red bars represent the actual COVID-19-positive cases in Sindh, Pakistan, whereas blue bars represent the predicted COVID-19-positive cases. According to the prediction, Sindh may have a higher number of positive cases in May. In Table 3, the error metrics show that, for Sindh, the MSE is 0.006, MAE is 2.035, RMSE is 3.389, and MAPE is 0.006. Furthermore, in Table 4, the random forest model evaluated on the Sindh region has an accuracy of 93%, support of 136%, precision of 84%, recall of 82%, F1-score of 90%, and sensitivity of 92%. Figure 16 shows that Punjab may have a higher number of positive cases in March, April, and May. In Table 3, the error metrics show that, for Punjab, the MSE is 0.149, MAE is 2.035, RMSE is 3.389, and MAPE is 0.006. Furthermore, in Table 4, the random forest model evaluated on the Punjab region has an accuracy of 93%, support of 154%, precision of 85%, recall of 75%, F1-score of 88%, and sensitivity of 92%. Figure 17 represents the forecast for Khyber Pakhtunkhwa, which may have a higher number of positive cases in April and May. In Table 3, the error metrics show that, for Khyber Pakhtunkhwa, the MSE is 0.022, MAE is 2.035, RMSE is 3.389, and MAPE is 0.006.
Furthermore, in Table 4, the random forest model evaluated on the Khyber Pakhtunkhwa region has an accuracy of 93%, support of 154%, precision of 84%, recall of 84%, F1-score of 89%, and sensitivity of 92%. Figure 18 represents the forecast for Gilgit Baltistan. In Table 3, the error metrics show that, for Gilgit Baltistan, the MSE is 0.002, MAE is 2.035, RMSE is 3.389, and MAPE is 0.006. Furthermore, in Table 4, the random forest model evaluated on the Gilgit Baltistan region has an accuracy of 95%, support of 117%, precision of 90%, recall of 76%, F1-score of 92%, and sensitivity of 90%. Figure 19 represents the forecast for Balochistan, where the blue bars indicate that Balochistan may have a higher number of COVID-19-positive cases in April, May, and June. In Table 3, the error metrics show that, for Balochistan, the MSE is 0.013, MAE is 2.035, RMSE is 3.389, and MAPE is 0.006. Furthermore, in Table 4, the random forest model evaluated on the Balochistan region has an accuracy of 93%, support of 156%, precision of 92%, recall of 79%, F1-score of 86%, and sensitivity of 92%. Figure 20 represents the forecast for Azad Jammu and Kashmir, where the blue bars indicate that Azad Jammu and Kashmir may have a higher number of COVID-19-positive cases in April, May, and June. In Table 3, the error metrics show that, for Azad Jammu and Kashmir, the MSE is 0.126, MAE is 2.030, RMSE is 3.389, and MAPE is 0.006.
Furthermore, in Table 4, the random forest model evaluated on the Azad Jammu and Kashmir region has an accuracy of 93%, support of 181%, precision of 85%, recall of 74%, F1-score of 85%, and sensitivity of 92%.
(I) Sindh region
(II) Punjab region
(III) Khyber Pakhtunkhwa region
(IV) Gilgit Baltistan region
(V) Balochistan region
(VI) Azad Jammu and Kashmir region

Using the above XGBoost methodology, a visualization of the records in terms of actual versus predicted values is shown below in graphs. Figures 21–26 show the results of the COVID-19 trend using the XGBoost model in the regions of Pakistan. The red bars are the training data, whereas the blue bars are the predicted trend. In Figure 21, red bars represent the actual COVID-19-positive cases in Sindh, Pakistan, whereas blue bars represent the predicted COVID-19-positive cases. According to the prediction, Sindh may have a higher number of positive cases in May. In Table 5, the error metrics show that, for Sindh, the MSE is 0.074, MAE is 0.579, RMSE is 1.389, and MAPE is 0.003. Figure 22 shows that Punjab may have a higher number of positive cases in April and May. In Table 5, the error metrics show that, for Punjab, the MSE is 0.394, MAE is 1.332, RMSE is 3.17, and MAPE is 0.007. Figure 23 represents the forecast for Balochistan. In Table 5, the error metrics show that, for Balochistan, the MSE is 0.304, MAE is 1.169, RMSE is 2.807, and MAPE is 0.006. Figure 24 represents the forecast for Khyber Pakhtunkhwa. In Table 5, the error metrics show that, for Khyber Pakhtunkhwa, the MSE is 0.198, MAE is 0.836, RMSE is 2.008, and MAPE is 0.004. Figure 25 represents the forecast for Gilgit Baltistan. In Table 5, the error metrics show that, for Gilgit Baltistan, the MSE is 0.049, MAE is 0.944, RMSE is 2.266, and MAPE is 0.005. Figure 26 represents the forecast for Azad Jammu and Kashmir. In Table 5, the error metrics show that, for Azad Jammu and Kashmir, the MSE is 0.049, MAE is 0.472, RMSE is 1.135, and MAPE is 0.002.

4.1. Comparative Analysis

Linear regression, random forest, and XGBoost algorithms are used to predict COVID-19 cases, and it is observed that the random forest algorithm performs better than linear regression. The random forest provides high accuracy for the prediction of positive COVID-19 cases in Pakistan. To compare the performance of the linear regression, XGBoost, and random forest estimation methods, the mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are used [32, 33].
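The four error metrics named above can be computed as in the following minimal sketch. It uses the standard textbook definitions rather than the authors' own implementation; the function name `error_metrics` and the sample values are hypothetical, and skipping zero-valued actuals when computing MAPE is an assumption made here to avoid division by zero.

```python
import math


def error_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, and MAPE for two equal-length sequences."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n   # mean square error
    rmse = math.sqrt(mse)                  # root mean square error
    mae = sum(abs(e) for e in errors) / n  # mean absolute error

    # MAPE is undefined where the actual value is zero; those points are
    # skipped here (an assumption of this sketch, not taken from the paper).
    ratios = [abs(e / a) for e, a in zip(errors, actual) if a != 0]
    mape = sum(ratios) / len(ratios) if ratios else float("nan")

    return {"mse": mse, "rmse": rmse, "mae": mae, "mape": mape}


# Hypothetical daily case counts, for illustration only.
actual = [100, 120, 130, 150]
predicted = [110, 115, 128, 160]
metrics = error_metrics(actual, predicted)
print(metrics)
```

By definition, RMSE is the square root of MSE, so the two always move together, while MAPE is scale-free and expressed here as a fraction rather than a percentage.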

4.2. Evaluation Metrics

Because some prediction error is inevitable [34], the accuracy of every algorithm is checked. To identify the model with the best predictive power, we considered six evaluation metrics: accuracy, precision, sensitivity, recall, support, and F1-score [35, 36]. Tables 2 and 4 show the performance results of the machine learning algorithms for the regions of Pakistan under our proposed model. Linear regression and random forest show comparable results, with random forest performing somewhat better. However, this paper also proposes using the XGBoost algorithm, which performs better than both of these algorithms.

4.2.1. Accuracy

Accuracy is used to check the performance of linear regression and random forest. In this study, it is the number of correctly predicted cases divided by the total number of predictions made by each model.

4.2.2. Precision

Precision is the ratio of true positive (TP) samples to the sum of true positive and false positive (FP) samples. It is used to assess the classification of COVID-19-positive cases in the Pakistan dataset.

4.2.3. Recall

Recall is the ratio of true positive (TP) samples to the sum of true positive and false negative (FN) samples.

4.2.4. F1-Score

The F1-score is the harmonic mean of precision and recall. It balances the two when evaluating the performance of the linear regression and random forest models in the classification of COVID-19 patients.

4.2.5. Sensitivity

Sensitivity is the true positive (TP) rate: the proportion of actual positive cases that are correctly identified. In our study, a true positive is a correctly predicted positive COVID-19 case.
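The metrics defined in Sections 4.2.1 to 4.2.5 can all be derived from the four confusion-matrix counts, as in the minimal sketch below. It uses the standard definitions and is not the authors' code; the function name `classification_metrics` and the counts passed to it are hypothetical. Support is taken here to be the number of actual positive samples, which is an assumption about how the paper uses the term.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1-score, and support from counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one sample is required")

    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall and sensitivity share the same formula: TP / (TP + FN).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Support: number of actual positive samples (assumed meaning).
    support = tp + fn

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": support,
    }


# Hypothetical confusion-matrix counts, for illustration only.
scores = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(scores)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1-score by maximizing only precision or only recall.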

4.3. Correlation

Correlation measures the strength of the relationship between two variables and its direction. The correlation coefficient always lies between -1 and +1. As the coefficient approaches 0, the relationship between the variables becomes weaker. A positive (+) sign indicates a positive relationship between the variables, and a negative (-) sign indicates a negative relationship. There are several types of correlation: point-biserial correlation, Kendall rank correlation, Spearman correlation, and Pearson correlation [37, 38].

4.3.1. Pearson Correlation

Pearson correlation measures the relationship between linearly related variables and is the most widely used type of correlation. The variables whose correlation is to be found are expected to be normalized; if they are not, normalization should be performed first [39]. The relationship between the two variables must be linear, and the data are assumed to be evenly distributed about the regression line. The correlation between dataset features provides detailed information about the features and the degree of influence that they have on the target value. The heat map of the Pearson correlation between the features of the dataset is shown in Figure 27. It reveals a strong positive correlation between new positive cases and patients hospitalized with symptoms. There is also a strong correlation between total cases and deaths. Figure 28 reveals a strong positive correlation between new positive cases and recoveries, as well as between total cases and total recoveries.
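As an illustration of how a single Pearson coefficient is obtained, the following minimal sketch applies the standard formula (covariance divided by the product of the standard deviations) to two feature series. The function name `pearson_r` and the sample values are hypothetical and are not drawn from the Pakistan dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")

    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature values, for illustration only.
new_cases = [10, 20, 30, 40, 50]
recoveries = [12, 18, 33, 41, 48]
r = pearson_r(new_cases, recoveries)
print(round(r, 4))
```

A heat map such as the ones in Figures 27 and 28 is, in effect, this computation repeated for every pair of features and arranged as a matrix.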

5. Conclusion

This deadly virus has killed many people all around the world. It is a dangerous disease that spreads from one human to another and causes severe damage to the lungs. In this paper, we have proposed machine learning methods for forecasting COVID-19-positive cases in the regions of Pakistan. Random forest, XGBoost, and linear regression algorithms were used as prediction models. After evaluating these algorithms, we found that the random forest and XGBoost algorithms provide better accuracy than linear regression and achieve a high prediction rate. The evaluation results of the proposed model indicate that using these variables as predictors can lead to high forecasting accuracy. These predictions will be helpful for researchers, government authorities, and health industry planners in managing services and arranging medical infrastructure accordingly. Additionally, the correlation matrix reveals a strong correlation between positive COVID-19 patients and hospitalized patients. The proposed model may also be helpful to other countries for forecasting COVID-19-positive cases.

In the future, this model can be extended to implement various other ML algorithms and prediction methodologies.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.