Preprints with The Lancet is part of SSRN's First Look, a place where journals identify content of interest prior to publication. Authors have opted in at submission to The Lancet family of journals to post their preprints on Preprints with The Lancet. The usual SSRN checks and a Lancet-specific check for appropriateness and transparency have been applied. Preprints available here are not Lancet publications or necessarily under review with a Lancet journal. These preprints are early stage research papers that have not been peer-reviewed. The findings should not be used for clinical or public health decision making and should not be presented to a lay audience without highlighting that they are preliminary and have not been peer-reviewed. For more information on this collaboration, see the comments published in The Lancet about the trial period, and our decision to make this a permanent offering, or visit The Lancet's FAQ page, and for any feedback please contact preprints@lancet.com.

Natural Language Processing for Improved COVID-19 Characterization: Evidence from More than 350,000 Patients in a Large Integrated Health Care System

19 Pages Posted: 5 Apr 2022

Deborah E. Malden

Government of the United States of America - Centers for Disease Control and Prevention (CDC)

Sara Y. Tartof

Kaiser Permanente Southern California - Department of Research & Evaluation

Bradley K. Ackerson

Kaiser Permanente Southern California

Vennis Hong

Kaiser Permanente Southern California - Department of Research & Evaluation

Jacek Skarbinski

Kaiser Permanente Northern California

Vince Yau

Genentech, Inc. - San Francisco

Lei Qian

Kaiser Permanente Southern California

Heidi Fischer

Kaiser Permanente Southern California

Sally Shaw

Kaiser Permanente Southern California

Susan Caparosa

Kaiser Permanente Southern California

Fagen Xie

Kaiser Permanente Southern California

Abstract

Background: Understanding symptoms of SARS-CoV-2 infection in large, community-based populations can improve clinical screening and COVID-19 surveillance. Most prior studies examining symptoms of COVID-19 are survey-based, biased towards hospitalized patients, or rely on structured data from electronic medical records (EMR). We sought to assess whether natural language processing (NLP) of unstructured text from EMR could improve characterization of COVID-19 symptoms in a large integrated healthcare system.

Methods: This was a retrospective cohort study conducted in Kaiser Permanente Southern California (KPSC), a large integrated health care system, using data from patients with positive SARS-CoV-2 laboratory tests from March 2020 to May 2021. An NLP algorithm was developed to extract mentions of 12 established COVID-19 symptoms from free text in the EMR. Proportions of patients reporting each symptom were described before and after supplementing structured EMR data with NLP-extracted symptoms.
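The abstract does not describe how the NLP algorithm works, so the following is only a toy sketch of the general idea of supplementing structured symptom data with symptoms found in free text. Everything in it is an assumption made for illustration: the keyword patterns, the crude negation rule, and the names `SYMPTOM_PATTERNS`, `extract_symptoms`, and `supplement` are hypothetical and are not the study's method.

```python
import re

# Hypothetical keyword patterns for four of the symptoms named in the abstract.
# A real clinical NLP system would need far richer vocabularies and validation.
SYMPTOM_PATTERNS = {
    "cough": re.compile(r"\bcough(?:ing|s)?\b", re.IGNORECASE),
    "fever": re.compile(r"\b(?:fever|febrile)\b", re.IGNORECASE),
    "myalgia": re.compile(r"\b(?:myalgias?|muscle aches?)\b", re.IGNORECASE),
    "headache": re.compile(r"\bheadaches?\b", re.IGNORECASE),
}

# Assumed, very rough negation cues; a mention is skipped when one of these
# appears earlier in the same sentence.
NEGATION_CUE = re.compile(r"\b(?:no|denies|denied|without)\b", re.IGNORECASE)


def extract_symptoms(note):
    """Return the set of symptom names mentioned (and not negated) in a note."""
    found = set()
    for sentence in re.split(r"[.;\n]", note):
        for name, pattern in SYMPTOM_PATTERNS.items():
            for match in pattern.finditer(sentence):
                preceding_text = sentence[:match.start()]
                if not NEGATION_CUE.search(preceding_text):
                    found.add(name)
    return found


def supplement(structured_symptoms, note):
    """Union of symptoms from structured fields and from free text."""
    return set(structured_symptoms) | extract_symptoms(note)


note = "Pt reports cough and headache for 2 days. Denies fever. No muscle aches."
print(sorted(supplement(set(), note)))
```

In this sketch, a patient with no symptoms in structured fields would be reclassified as symptomatic once the free-text note is considered, which is the kind of shift the Findings describe at cohort scale.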

Findings: Among 359,938 patients with confirmed SARS-CoV-2 infection, NLP-supplemented analysis identified an additional 55,568 (15%) symptomatic cases that were previously defined as asymptomatic using structured data alone. The most common symptoms identified through NLP-supplemented analyses were cough (61%), fever (52%), myalgia (43%), and headache (40%). The proportion of additional cases with each selected symptom identified in NLP-supplemented analysis varied across symptoms, from 29% of all complaints for cough, to 61% of all records with nausea or vomiting. Of 295,305 symptomatic patients, the median time from symptom onset to testing was 3 days using structured data alone, whereas NLP-supplemented analyses resulted in the identification of COVID-19 symptoms approximately one day earlier.
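As a quick arithmetic check on the headline figure, the share of newly identified symptomatic cases can be recomputed from the two counts reported above. The variable names are illustrative only.

```python
# Counts taken directly from the Findings paragraph.
total_patients = 359_938
newly_symptomatic = 55_568

# Share of all confirmed cases reclassified as symptomatic after NLP supplementation.
share = newly_symptomatic / total_patients
print(f"{share:.1%}")  # prints 15.4%, reported as 15% in the abstract
```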

Interpretation: These findings demonstrate the value of NLP to facilitate enhanced characterization of COVID-19 signs and symptoms compared to traditional surveillance systems. Deploying NLP-based methods in real time could improve disease surveillance without requiring substantial human or technological resources.

Funding: This study was funded by Roche/Genentech, Inc. but was solely conducted at Kaiser Permanente Southern California.

Declaration of Interest: This study was funded by Roche-Genentech but was solely done at KPSC. ST, BA, VH, JS, LQ, HF, SS, SC, and FX received support from Roche-Genentech for the conduct of the study. VY works for Roche-Genentech. The funder did not contribute to the design, conduct, or analysis of this study, or to manuscript development.

Ethical Approval: The study protocol was reviewed and approved by the KPSC Institutional Review Board with a waiver of requirement for informed consent.

Keywords: Natural language processing, Public health surveillance, SARS-CoV-2, artificial intelligence

Suggested Citation

Malden, Deborah E. and Tartof, Sara Y. and Ackerson, Bradley K. and Hong, Vennis and Skarbinski, Jacek and Yau, Vince and Qian, Lei and Fischer, Heidi and Shaw, Sally and Caparosa, Susan and Xie, Fagen, Natural Language Processing for Improved COVID-19 Characterization: Evidence from More than 350,000 Patients in a Large Integrated Health Care System. Available at SSRN: https://ssrn.com/abstract=4075842 or http://dx.doi.org/10.2139/ssrn.4075842

Deborah E. Malden (Contact Author)

Government of the United States of America - Centers for Disease Control and Prevention (CDC)

Atlanta
United States

Sara Y. Tartof

Kaiser Permanente Southern California - Department of Research & Evaluation

Pasadena, CA
United States

Bradley K. Ackerson

Kaiser Permanente Southern California

Mission Viejo, CA
United States

Vennis Hong

Kaiser Permanente Southern California - Department of Research & Evaluation

Mission Viejo, CA
United States

Jacek Skarbinski

Kaiser Permanente Northern California

Oakland, CA
United States

Vince Yau

Genentech, Inc. - San Francisco

1 DNA Way
South San Francisco, CA 94080-4990
United States

Lei Qian

Kaiser Permanente Southern California

CA
United States

Heidi Fischer

Kaiser Permanente Southern California

CA
United States

Sally Shaw

Kaiser Permanente Southern California

CA
United States

Susan Caparosa

Kaiser Permanente Southern California

CA
United States

Fagen Xie

Kaiser Permanente Southern California

CA
United States
