Patterns
Volume 1, Issue 9, 11 December 2020, 100123

Article
Machine Learning Maps Research Needs in COVID-19 Literature

https://doi.org/10.1016/j.patter.2020.100123
Under a Creative Commons license
open access

Highlights

  • AI/machine learning techniques can analyze coronavirus research at massive scale

  • COVID-19 research has so far focused on non-lab-based (e.g., observational) research

  • COVID-19 lab-based/basic microbiological research is less prevalent than expected

The Bigger Picture

The impact of the COVID-19 pandemic has led scientists to produce a vast quantity of research aimed at understanding, monitoring, and containing the disease; however, it remains unclear whether the research that has been produced to date sufficiently addresses existing knowledge gaps. We use artificial intelligence (AI)/machine learning techniques to analyze this massive amount of information at scale. We find key discrepancies between literature about COVID-19 and what we would expect based on research on other coronaviruses. These discrepancies—namely, the lack of basic microbiological research, which is often expensive and time-consuming—may negatively impact efforts to mitigate the pandemic and raise questions regarding the research community's ability to quickly respond to future crises. Continually measuring what is being produced, both now and in the future, is key to making better resource allocation and goal prioritization decisions as a society moving forward.

Summary

As of August 2020, thousands of COVID-19 (coronavirus disease 2019) publications have been produced. Manual assessment of their scope is an overwhelming task, and shortcuts through metadata analysis (e.g., keywords) assume that studies are properly tagged. However, machine learning approaches can rapidly survey the actual text of publication abstracts to identify research overlap between COVID-19 and other coronaviruses, research hotspots, and areas warranting exploration. We propose a fast, scalable, and reusable framework to parse novel disease literature. When applied to the COVID-19 Open Research Dataset, dimensionality reduction suggests that COVID-19 studies to date are primarily clinical, modeling, or field based, in contrast to the vast quantity of laboratory-driven research for other (non-COVID-19) coronavirus diseases. Furthermore, topic modeling indicates that COVID-19 publications have focused on public health, outbreak reporting, clinical care, and testing for coronaviruses, as opposed to the more limited number focused on basic microbiology, including pathogenesis and transmission.
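
To make the dimensionality-reduction step concrete, the following is a minimal sketch, not the authors' pipeline: it turns a handful of abstracts into TF-IDF vectors and projects them onto two principal components with NumPy. The four abstracts are invented for illustration, and the helper names (`tokenize`, `tfidf_matrix`, `pca_project`) and the smoothed IDF weighting are assumptions rather than details taken from the article.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase and split into alphanumeric tokens, keeping inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def tfidf_matrix(abstracts):
    """Return (matrix, vocab): one L2-normalised TF-IDF row per abstract."""
    counts = [Counter(tokenize(a)) for a in abstracts]
    vocab = sorted(set().union(*counts))
    index = {term: j for j, term in enumerate(vocab)}
    tf = np.zeros((len(abstracts), len(vocab)))
    for i, doc_counts in enumerate(counts):
        for term, n in doc_counts.items():
            tf[i, index[term]] = n
    df = (tf > 0).sum(axis=0)  # number of abstracts containing each term
    idf = np.log((1 + len(abstracts)) / (1 + df)) + 1.0  # smoothed IDF (assumed)
    x = tf * idf
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms), vocab


def pca_project(x, n_components=2):
    """Project rows of x onto the leading principal components via SVD."""
    centered = x - x.mean(axis=0)
    u, s, _vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Hypothetical abstracts, invented for this sketch (not from CORD-19).
abstracts = [
    "Spike protein binding to the ACE2 receptor in cultured cells",
    "Viral replication and pathogenesis in a mouse model of infection",
    "Clinical characteristics of hospitalized patients with COVID-19",
    "Modeling outbreak transmission dynamics and public health interventions",
]

x, vocab = tfidf_matrix(abstracts)
coords = pca_project(x)
for text, (pc1, pc2) in zip(abstracts, coords):
    print(f"{pc1:+.3f} {pc2:+.3f}  {text[:40]}")
```

On a real corpus, the resulting coordinates would typically be plotted and colored by a label such as COVID-19 versus other coronaviruses to look for the kind of separation the Summary describes. With only four toy abstracts the projection carries no such meaning.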

Data Science Maturity

DSML 4: Production: Data science output is validated, understood, and regularly used for multiple domains/platforms

Keywords

coronavirus
COVID-19
SARS-CoV-2
machine learning
natural language processing
PCA
data science
artificial intelligence
topic modeling
dimensionality reduction
