Introduction

As the global population grows, we are still battling pathogens that have been with us since the beginning of recorded human history, such as tuberculosis, while also witnessing an increasing emergence of novel pathogens from non-human hosts, which poses a major threat to public health [1]. Within the last ten years, we have witnessed the emergence of new viruses that could spread across international borders and wreak global havoc, the latest being the novel coronavirus SARS-CoV-2, which causes COVID-19. The recent development of machine learning-based tools for healthcare providers offers novel ways to combat such global pandemics. The term machine learning encompasses the collection of tools and techniques for identifying patterns in data [2]. In traditional methods of identifying patterns in data, we approach the system with our own presumptions as to which components of the data (for example age, sex, or pre-existing conditions) affect the outcome of interest (for example patient survival). In machine learning, by contrast, we provide the data and the machine identifies trends and patterns, enabling us to formulate a model that predicts patient outcomes. In this narrative review, we describe such tools, how they are useful in healthcare, and how they are being utilized in the prediction, prevention, and management of COVID-19. We also discuss tools used in past outbreaks caused by viruses such as SARS-CoV-1 and MERS-CoV, and how they may be translated to COVID-19. The discussion is structured under the headings of outbreak detection, prediction of spread, prevention and vaccine development, early case detection and tracking, prognosis prediction for affected patients, and drug development (Fig. 1).

Fig. 1

Machine learning tools, data sources, and interventions that are helpful in different stages of a pandemic. Phase 1: No animal influenza virus is known to have caused disease in humans. Phase 2: An animal influenza virus is known to have caused disease in humans and is hence a potential pandemic threat. Phase 3: An animal influenza virus has caused isolated cases or clusters of disease in humans but has not resulted in human-to-human transmission. Phase 4: Human-to-human transmission sufficient to maintain a community-level outbreak has been identified. Phase 5: The virus has caused sustained community-level outbreaks in two countries within the same WHO region. Phase 6: In addition to the Phase 5 criteria, the virus has caused a sustained community-level outbreak in at least one country in another WHO region. Post-Peak Period: The level of pandemic activity in most countries has dropped below peak levels. Possible New Wave: The level of pandemic activity in most countries is rising again. Post-Pandemic Period: The level of pandemic activity has returned to seasonal influenza levels in most countries.

Outbreak detection

Biosurveillance is the science of early detection and prevention of a disease outbreak in the community [3]. Analytics, machine learning, and natural language processing (NLP) are increasingly used in biosurveillance [4]. Social media, news reports, and other online data can be scanned to detect localized disease outbreaks before they reach the level of pandemics [5]. The Canadian company BlueDot successfully used machine learning algorithms to detect early outbreaks of COVID-19 in Wuhan, China by the end of December 2019 [6, 7]. Analysis of medical records and of satellite imagery (e.g., cars crowding around a hospital) are other ways in which big data has been used in the past to detect localized outbreaks [8, 9]. Google Trends has been used with dynamic forecasting models to detect outbreaks of Zika virus infection in populations [10]. Sentiment analysis is the technique of applying natural language processing to social media to understand the positive and negative emotions of the population [11]. Unsupervised sentiment analysis has been proposed as a method for the early detection of infectious diseases in the population [12]. Sentiment analysis could also be a valuable tool for understanding the public’s reactions, or overreactions, to disease outbreaks, and can provide valuable insights to governments in directing efforts towards public education [13,14,15]. These techniques of sentinel biosurveillance would help detect outbreaks before they become pandemics and can give the health system valuable time to prepare for prevention and management.
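
As a rough illustration of how scanning online data can surface a localized outbreak, the sketch below flags days on which mentions of a symptom keyword rise far above their recent baseline. It is a minimal sketch under stated assumptions, not the method used by any of the systems cited above: the function name `flag_spikes`, the seven-day baseline, the threshold of three standard deviations, and the daily counts are all hypothetical.

```python
from statistics import mean, stdev


def flag_spikes(daily_counts, baseline_days=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing baseline
    mean by more than `threshold` sample standard deviations."""
    flagged = []
    for day in range(baseline_days, len(daily_counts)):
        # Baseline is the window of days immediately before `day`.
        window = daily_counts[day - baseline_days:day]
        mu = mean(window)
        sigma = stdev(window)
        # Guard against a perfectly flat baseline (sigma == 0).
        if sigma == 0:
            sigma = 1.0
        if (daily_counts[day] - mu) / sigma > threshold:
            flagged.append(day)
    return flagged


# Made-up daily counts of posts mentioning a symptom keyword (assumption).
counts = [12, 15, 11, 14, 13, 12, 16, 14, 13, 41, 58]
print(flag_spikes(counts))
```

Because each flagged day later enters the baseline window, a sustained rise inflates the baseline and becomes harder to flag over time; a production system would need a more robust baseline and would combine several data sources.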

Prediction of spread

Various statistical, mathematical, and dynamic predictive models have been used to successfully predict the extent and spread of infectious diseases through the population [16,17,18,19,20,21]. Compared with traditional epidemiological predictive models, big-data-driven models have the added advantages of adaptive learning, trend-based recalibration, flexibility and scope to improve as understanding of the disease process grows, and the ability to estimate the impact of interventions, such as social distancing, in curbing spread [22]. The most common approach is the Susceptible-Exposed-Infectious-Recovered (SEIR) model, which is now being used to predict the areas and extent of COVID-19 spread [23, 24]. These techniques can also be used to estimate other parameters of the epidemic, such as under-reporting of cases, the effectiveness of interventions, and the accuracy of testing methods [25, 26]. For example, one model simulated the conditions under which Ebola could spread in Chinese society and evaluated the effectiveness of four levels of governmental intervention under those conditions [27]. Similar models have also attempted to predict the outbreak and expansion of the Zika virus in the Americas in real time and were found to have close to 85% accuracy in quantitative evaluations [28]. A validation study comparing different machine learning algorithms found that a backward propagation neural network (BPNN) had the highest predictive accuracy in modelling Zika virus transmission [29].
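
To make the SEIR approach concrete, the following is a minimal deterministic sketch that moves a population through the four compartments with simple Euler steps. It is not the model used in any of the cited studies. The function name `simulate_seir` and all parameter values (transmission rate `beta`, incubation rate `sigma`, recovery rate `gamma`, population size) are illustrative assumptions and are not estimates for COVID-19.

```python
def simulate_seir(population, initial_infectious, beta, sigma, gamma,
                  days, dt=0.1):
    """Integrate a basic SEIR model with Euler steps.

    dS/dt = -beta * S * I / N
    dE/dt =  beta * S * I / N - sigma * E
    dI/dt =  sigma * E - gamma * I
    dR/dt =  gamma * I

    Returns one (S, E, I, R) tuple per day, starting with day 0.
    """
    s = float(population - initial_infectious)
    e = 0.0
    i = float(initial_infectious)
    r = 0.0
    history = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Illustrative parameters only (assumptions, not fitted to any data).
history = simulate_seir(population=1_000_000, initial_infectious=10,
                        beta=0.5, sigma=0.2, gamma=1 / 7, days=180)
peak_day = max(range(len(history)), key=lambda d: history[d][2])
print("Peak infectious day:", peak_day)
print("Infectious at peak:", round(history[peak_day][2]))
```

Lowering `beta` partway through a run is one simple way to explore the effect of an intervention such as social distancing, which is the kind of question the data-driven models above are recalibrated to answer.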

Scientists at Johns Hopkins University developed a COVID-19 prediction model based on a previously published stochastic metapopulation epidemic model [30]. A comparison of this model’s predictions with real-life data revealed gaps in the understanding of the virus’s dynamics as well as the model’s limitations [31, 32].

However, a predictive model is only as good as the data it is based on, and in the event of a global pandemic, data sharing across communities is of paramount importance. Limited data sharing was one of the major obstacles to learning about and modeling the 2013–2016 Ebola virus outbreak [33]. The World Health Organization (WHO) has proposed a consensus on expedited data sharing on the COVID-19 outbreak to promote inter-community learning and analytics in this area.

Preventive strategies and vaccine development

Artificial Neural Networks (ANN) were used to predict antigenic regions with a high density of binders (antigenic hotspots) in the viral membrane protein of Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) [34, 35]. This information is critical to the development of vaccines. Using machine learning for this purpose allows rapid scanning of the entire viral proteome, enabling faster and cheaper vaccine development. Reverse vaccinology and machine learning were successfully employed to identify six potential vaccine target proteins in the SARS-CoV-2 proteome [36]. Machine learning has also been used in the past to predict the strains of influenza virus that are most likely to cause infection in a population in the upcoming year and that should, in turn, constitute that year’s seasonal influenza vaccine. Using a reconstructed timed phylogenetic tree, it was possible to predict the future expansion of small subtrees of hemagglutinin (HA), part of the viral antigenic set, by training on H3N2 and testing on H1N1 [37]. Machine learning can also be used to predict the hosts of newly discovered viruses based on analysis of nucleoprotein gene sequences and spike gene sequences, and can be a useful additional tool for tracing viral origins, especially when the data set is large and comparative analysis is difficult or time-consuming [38].

Early case detection and tracking

Early case identification, quarantine, and prevention of community exposure are crucial pillars in managing an epidemic such as COVID-19. Mobile phone-based surveys can be useful in the early identification of cases, especially in quarantined populations [39, 40]. Such methods have shown success in Italy, where influenza patients were identified through a web-based survey [41]. As opposed to traditional methods of survey and analysis, artificial intelligence tools can collect and analyze large amounts of data, identify trends, stratify patients based on risk, and propose solutions for the population rather than the individual. Digital phenotyping is the novel concept of collecting smartphone-based active (surveys) and passive (text, voice, location, screen use) data to produce an individual phenotype [42, 43]. This technique can be used to obtain multiple data points and to stratify individuals based on their risk. The government of India recently launched a mobile application called “Aarogya Setu”, which tracks its users’ exposure to people potentially infected with COVID-19 by using Bluetooth to scan the surrounding area for other smartphone users. If a patient tests positive, the data from the mobile application can be used to track down every app user whom the patient encountered within the last 30 days [44]. Such techniques of digital phenotyping can be performed even on entry-level smartphones and, because smartphones are ubiquitous, would be especially useful in low- and middle-income countries as a cost-effective method of risk stratification [45].
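
The contact lookup described above can be sketched as a filter over logged proximity events. This is only an illustration of the idea and not the implementation of Aarogya Setu or any other application: the log format (pairs of user IDs with a timestamp), the function name `trace_contacts`, and the sample records are hypothetical.

```python
from datetime import datetime, timedelta


def trace_contacts(contact_log, patient_id, test_date, lookback_days=30):
    """Return the set of users who were near `patient_id` during the
    `lookback_days` before `test_date`.

    contact_log: iterable of (user_a, user_b, timestamp) proximity events.
    """
    window_start = test_date - timedelta(days=lookback_days)
    exposed = set()
    for user_a, user_b, seen_at in contact_log:
        # Skip events outside the lookback window.
        if not (window_start <= seen_at <= test_date):
            continue
        if user_a == patient_id:
            exposed.add(user_b)
        elif user_b == patient_id:
            exposed.add(user_a)
    return exposed


# Hypothetical proximity events (made-up IDs and timestamps).
log = [
    ("u1", "u2", datetime(2020, 4, 20, 9, 0)),
    ("u3", "u1", datetime(2020, 3, 1, 18, 30)),   # older than 30 days
    ("u4", "u1", datetime(2020, 4, 28, 12, 15)),
    ("u2", "u5", datetime(2020, 4, 25, 8, 0)),    # does not involve u1
]
contacts = trace_contacts(log, "u1", test_date=datetime(2020, 5, 1))
print(sorted(contacts))
```

A real deployment would also have to weigh contact duration, signal strength, and privacy safeguards, none of which are modeled here.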

Given its close physical and economic proximity to China, Taiwan was expected to experience high morbidity and mortality due to COVID-19. However, with the help of machine learning, Taiwan kept the number of infected patients far lower than initially predicted. The authorities identified the threat early and combined the national health insurance database with the customs and immigration database to generate big data for analytics. Machine learning on this big data helped stratify the population into lower-risk and higher-risk groups based on several factors, including travel history. Persons at higher risk were quarantined at home and tracked through their mobile phones to ensure that they remained in quarantine. This application of big data, in addition to active case-finding efforts, ensured that case numbers were far lower than initially anticipated [46].

Deep learning algorithms have also been used to identify patterns of infectious disease involvement in imaging studies such as CT and MRI. Because CT findings correlate well with PCR positivity in COVID-19 patients, such algorithms have shown great promise in detecting findings consistent with COVID-19 in patients’ CT images [47,48,49].

Prognosis prediction

Machine learning algorithms have previously been used to predict prognosis in patients with MERS-CoV infection [50]. The patient’s age, disease severity on presentation to the healthcare facility, whether the patient was a healthcare worker, and the presence of pre-existing co-morbidities were identified as the four major predictors of the patient’s recovery. These findings are consistent with the trends currently observed in COVID-19 [51, 52]. Using the data visualization tool Mirador, a mobile application called Ebola CARE (Computational Assignment of Risk Estimates) was developed to predict a patient’s outcome after infection with Ebola [53, 54]. The tool identified 24 clinical and laboratory parameters that may affect a patient’s prognosis. These algorithms need to be adapted to assist physicians in their decision-making while managing COVID-19. Recovery prediction tools support resource allocation, triage, treatment decisions, and health system preparedness.

Treatment development

Machine learning tools have been used in drug development, drug testing, and drug repurposing. They enable us to interpret large gene expression profile data sets to suggest new uses for currently available medications. Deep generative models, also known as AI imagination, can design novel therapeutic agents with possible desired activity [55]. These tools help reduce the cost and time of developing drugs, help in developing novel therapeutic agents, and predict possible off-label uses for some therapeutic agents [56]. Bayesian machine learning tools have been used to develop drugs against Ebola in in vitro settings, and the findings translated well to in vivo settings [57].

Conclusion

Machine learning provides an exciting array of tools that are flexible enough to be deployed at any stage of a pandemic. Given the large amount of data generated while studying a disease process, machine learning allows analysis and rapid identification of patterns that traditional mathematical and statistical tools would take a long time to derive. Its flexibility, its ability to adapt to a new understanding of the disease process, its self-improvement as new data become available, and the lack of human prejudice in its approach to analysis make machine learning a highly versatile tool for managing novel infections. However, this enhanced ability to derive meaning from large amounts of data brings a greater demand for quality control during the collection, storage, and processing of the data. In addition, standardization of data structures across populations would allow these systems to adapt and learn from data across the globe, which was not possible in the past and is even more important in learning about and managing a global pandemic like COVID-19. Finally, a new pandemic tends to produce much more “noise” in the data, so feeding immature, outlier-ridden data into an AI algorithm should always be approached with caution.