Preprints with The Lancet is part of SSRN's First Look, a place where journals identify content of interest prior to publication. Authors have opted in at submission to The Lancet family of journals to post their preprints on Preprints with The Lancet. The usual SSRN checks and a Lancet-specific check for appropriateness and transparency have been applied. Preprints available here are not Lancet publications or necessarily under review with a Lancet journal. These preprints are early stage research papers that have not been peer-reviewed. The findings should not be used for clinical or public health decision making and should not be presented to a lay audience without highlighting that they are preliminary and have not been peer-reviewed. For more information on this collaboration, see the comments published in The Lancet about the trial period, and our decision to make this a permanent offering, or visit The Lancet's FAQ page, and for any feedback please contact preprints@lancet.com.
Natural Language Processing for Improved COVID-19 Characterization: Evidence from More than 350,000 Patients in a Large Integrated Health Care System
19 Pages Posted: 5 Apr 2022
Abstract
Background: Understanding symptoms of SARS-CoV-2 infection in large, community-based populations can improve clinical screening and COVID-19 surveillance. Most prior studies examining symptoms of COVID-19 are survey-based, biased towards hospitalized patients, or rely on structured data from electronic medical records (EMR). We sought to assess whether natural language processing (NLP) of unstructured text from EMR could improve characterization of COVID-19 symptoms in a large integrated healthcare system.
Methods: This was a retrospective cohort study conducted in Kaiser Permanente Southern California (KPSC), a large integrated health care system, using data from patients with positive SARS-CoV-2 laboratory tests from March 2020 to May 2021. An NLP algorithm was developed to extract free-text mentions of 12 established COVID-19 symptoms from the EMR. Proportions of patients reporting each symptom were described before and after supplementing structured EMR data with NLP-extracted symptoms.
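To illustrate the idea of supplementing structured symptom data with symptoms found in free text, here is a minimal sketch. It is not the KPSC algorithm, whose design the abstract does not describe. It assumes a simple keyword lexicon with a crude same-sentence negation check, and it covers only the symptoms named in this abstract, since the full list of 12 is not given here. The names `SYMPTOM_PATTERNS`, `NEGATION_CUE`, `extract_symptoms`, and `supplement` are hypothetical.

```python
import re
from typing import Set

# Hypothetical lexicon: only symptoms named in the abstract are included.
SYMPTOM_PATTERNS = {
    "cough": r"\bcough(?:s|ing)?\b",
    "fever": r"\b(?:fever(?:s|ish)?|febrile)\b",
    "myalgia": r"\b(?:myalgias?|muscle aches?|body aches?)\b",
    "headache": r"\bheadaches?\b",
    "nausea_or_vomiting": r"\b(?:nausea|nauseous|vomit(?:s|ing|ed)?)\b",
}

# Assumed negation cues; a production system would need a far richer rule set.
NEGATION_CUE = re.compile(r"\b(?:no|denies|denied|without|negative for)\b", re.IGNORECASE)


def extract_symptoms(note: str) -> Set[str]:
    """Return symptoms mentioned in a clinical note and not negated earlier in the same sentence."""
    found = set()
    for sentence in re.split(r"[.;\n]+", note):
        for name, pattern in SYMPTOM_PATTERNS.items():
            for match in re.finditer(pattern, sentence, re.IGNORECASE):
                # Skip the mention if a negation cue precedes it in this sentence.
                if not NEGATION_CUE.search(sentence[:match.start()]):
                    found.add(name)
                    break
    return found


def supplement(structured: Set[str], note: str) -> Set[str]:
    """Union of symptoms coded in structured fields and symptoms found in free text."""
    return set(structured) | extract_symptoms(note)


note = "Pt reports dry cough and body aches for 2 days. Denies fever. No nausea or vomiting."
print(sorted(supplement(set(), note)))  # a patient with no structured symptoms becomes symptomatic
```

In this sketch, a patient with no coded symptoms would be reclassified as symptomatic whenever the note yields at least one non-negated mention, which mirrors the before-and-after comparison described in the Methods.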
Findings: Among 359,938 patients with confirmed SARS-CoV-2 infection, NLP-supplemented analysis identified an additional 55,568 (15%) symptomatic cases that were previously defined as asymptomatic using structured data alone. The most common symptoms identified through NLP-supplemented analyses were cough (61%), fever (52%), myalgia (43%), and headache (40%). The proportion of additional cases with each selected symptom identified in NLP-supplemented analysis varied across symptoms, from 29% of all complaints for cough, to 61% of all records with nausea or vomiting. Of 295,305 symptomatic patients, the median time from symptom onset to testing was 3 days using structured data alone, whereas NLP-supplemented analyses resulted in the identification of COVID-19 symptoms approximately one day earlier.
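As a quick arithmetic check on the headline figures, the proportions can be recomputed from the counts reported in the Findings. The variable names are illustrative, and the share of symptomatic patients among all confirmed cases is derived here; it is not stated in the abstract.

```python
# Counts taken directly from the Findings paragraph.
confirmed = 359_938      # patients with confirmed SARS-CoV-2 infection
reclassified = 55_568    # additional symptomatic cases found with NLP supplementation
symptomatic = 295_305    # symptomatic patients used for the onset-to-testing analysis


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


reclassified_share = pct(reclassified, confirmed)
symptomatic_share = pct(symptomatic, confirmed)

print(f"Reclassified as symptomatic: {reclassified_share:.1f}%")  # 15.4%, reported as 15%
print(f"Symptomatic among confirmed: {symptomatic_share:.1f}%")   # 82.0% (derived, not reported)
```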
Interpretation: These findings demonstrate the value of NLP for characterizing COVID-19 signs and symptoms more fully than traditional surveillance systems. Deploying NLP-based methods in real time could improve disease surveillance without requiring substantial human or technological resources.
Funding: This study was funded by Roche/Genentech, Inc. but was solely conducted at Kaiser Permanente Southern California.
Declaration of Interest: This study was funded by Roche-Genentech but was solely done at KPSC. ST, BA, VH, JS, LQ, HF, SS, SC, and FX received support from Roche-Genentech for the conduct of the study. VY works for Roche-Genentech. The funder did not contribute to the design, conduct, or analysis of this study, or to manuscript development.
Ethical Approval: The study protocol was reviewed and approved by the KPSC Institutional Review Board with a waiver of requirement for informed consent.
Keywords: Natural language processing, Public health surveillance, SARS-CoV-2, artificial intelligence