Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Jan 2, 2022
Open Peer Review Period: Jan 2, 2022 - Jan 17, 2022
Date Accepted: Feb 8, 2022
Date Submitted to PubMed: Feb 10, 2022
The final, peer-reviewed published version of this preprint can be found here:

Identifying COVID-19 Outbreaks From Contact-Tracing Interview Forms for Public Health Departments: Development of a Natural Language Processing Pipeline

Caskey J, McConnell IL, Oguss M, Dligach D, Kulikoff R, Grogan B, Gibson C, Wimmer E, DeSalvo TE, Nyakoe-Nyasani EE, Churpek MM, Afshar M

JMIR Public Health Surveill 2022;8(3):e36119

DOI: 10.2196/36119

PMID: 35144241

PMCID: 8906835

A Natural Language Processing Pipeline to Identify COVID-19 Outbreaks from Contact Tracing Interview Forms for Public Health Departments

  • John Caskey
  • Iain L McConnell
  • Madeline Oguss
  • Dmitriy Dligach
  • Rachel Kulikoff
  • Brittany Grogan
  • Crystal Gibson
  • Elizabeth Wimmer
  • Traci E DeSalvo
  • Edwin E Nyakoe-Nyasani
  • Matthew M Churpek
  • Majid Afshar

ABSTRACT

Background:

In Wisconsin, COVID-19 case interview forms contain free-text fields that need to be mined to identify potential outbreaks for targeted policy making. We developed an automated pipeline that ingests the free text into a pretrained neural language model to identify businesses and facilities as potential outbreak sites.

Objective:

We aim to examine the performance of our pipeline.

Methods:

Data on cases of COVID-19 were extracted from the Wisconsin Electronic Disease Surveillance System (WEDSS) for Dane County between July 1, 2020, and June 30, 2021. Features from the case interview forms were fed into a Bidirectional Encoder Representations from Transformers (BERT) model that was fine-tuned for named entity recognition (NER). We also developed a novel location mapping tool to provide addresses for the relevant named entities. The pipeline was validated against known outbreaks that had already been investigated and confirmed.
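As an illustration of the NER step, token-level predictions from a fine-tuned model are usually grouped into entity spans before any address lookup. The sketch below assumes a BIO tagging scheme and the label names ORG and LOC; the abstract does not state the tagging scheme or label set, so these, the function name `decode_bio`, and the sample sentence are hypothetical.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) spans.

    Assumes one tag per token; the BIO scheme itself is an assumption,
    not something the abstract specifies.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any span that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the open span with a matching type.
            current_tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open span:
            # close the open span (if any) and drop the stray token.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None

    # Flush a span that runs to the end of the sequence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical free-text fragment and tags, for illustration only.
tokens = ["Patient", "works", "at", "Acme", "Foods", "in", "Madison"]
tags = ["O", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = decode_bio(tokens, tags)
```

Each resulting span could then be passed to a location lookup such as the mapping tool described above; that step is not sketched here because the abstract gives no detail on how it works.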

Results:

There were 46,898 cases of COVID-19 with 4,183,273 total BERT tokens and 15,051 unique tokens. The recall and precision of the NER tool were 0.67 (95% CI: 0.66-0.68) and 0.55 (95% CI: 0.54-0.57), respectively. For the location mapping tool, the recall and precision were 0.93 (95% CI: 0.92-0.95) and 0.93 (95% CI: 0.92-0.95), respectively. Across monthly intervals, the NER tool identified more potential clusters than were confirmed in the WEDSS system.
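For readers who want to see how figures of this kind are computed, the sketch below derives precision and recall from true-positive, false-positive, and false-negative counts and attaches a 95% Wilson score interval to each. The abstract does not say which interval method the authors used, so the Wilson interval is an assumption, and the counts are made up for illustration and are not the study's data.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


def precision_recall(tp: int, fp: int, fn: int) -> Dict[str, Tuple[float, float, float]]:
    """Return (point estimate, lower bound, upper bound) for precision and recall."""
    precision = tp / (tp + fp)  # share of predicted entities that are correct
    recall = tp / (tp + fn)     # share of true entities that were found
    return {
        "precision": (precision, *wilson_interval(tp, tp + fp)),
        "recall": (recall, *wilson_interval(tp, tp + fn)),
    }


# Hypothetical counts, NOT taken from the study.
metrics = precision_recall(tp=80, fp=20, fn=40)
for name, (point, low, high) in metrics.items():
    print(f"{name}: {point:.2f} (95% CI: {low:.2f}-{high:.2f})")
```

The Wilson interval is used here because it stays within [0, 1] even for proportions near the boundaries; a normal-approximation or bootstrap interval would be an equally plausible choice.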

Conclusions:

We developed a novel pipeline of tools that identified existing outbreaks and novel clusters with associated addresses. Our pipeline ingests data from a statewide database and may be deployed to assist local health departments for targeted interventions.

Clinical Trial: Not applicable


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.