Accepted for/Published in: JMIR Formative Research

Date Submitted: Jun 20, 2022
Open Peer Review Period: Jun 19, 2022 - Aug 14, 2022
Date Accepted: Nov 11, 2022
Date Submitted to PubMed: Jan 24, 2023

The final, peer-reviewed published version of this preprint can be found here:

Development of a COVID-19–Related Anti-Asian Tweet Data Set: Quantitative Study

Mokhberi M, Biswas A, Masud Z, Kteily-Hawa R, Goldstein A, Gillis JR, Rayana S, Ahmed SI

Development of a COVID-19–Related Anti-Asian Tweet Data Set: Quantitative Study

JMIR Form Res 2023;7:e40403

DOI: 10.2196/40403

PMID: 36693148

PMCID: 9976773

Development of COVID-19-related Anti-Asian Tweet Dataset: A Quantitative Study

  • Maryam Mokhberi
  • Ahana Biswas
  • Zarif Masud
  • Roula Kteily-Hawa
  • Abby Goldstein
  • Joseph Roy Gillis
  • Shebuti Rayana
  • Syed Ishtiaque Ahmed

ABSTRACT

Background:

Since the advent of the COVID-19 pandemic, individuals of Asian descent have been the subject of stigma and hate speech in both offline and online communities. Social networks such as Twitter are among the major venues where such unfair attacks are encountered. As the research community seeks to understand and analyze this content and to implement detection techniques, high-quality datasets are becoming immensely important.

Objective:

In this study, we introduce a manually labeled dataset of Tweets containing anti-Asian stigmatizing content.

Methods:

We sampled over 668M Tweets posted on Twitter between January 2020 and July 2020 and used an iterative data construction approach that includes three different stages of algorithm-driven data selection and manual labeling to finally arrive at 11,263 Tweets with primary labels (unknown/irrelevant, not-stigmatizing, stigmatizing-low, stigmatizing-medium, stigmatizing-high) and Tweet sub-topics (e.g., wet market and eating habits, COVID-19 cases, bioweapon, etc.). Moreover, we selected 5,000 Tweets from that dataset and had them labeled by a second annotator; a third annotator then resolved conflicts between the labels of the first and second annotators. We present this final dataset as a high-quality Twitter dataset on stigma towards Chinese people during the COVID-19 pandemic. The dataset and instructions for labeling can be viewed in the GitHub repository: https://anonymous.4open.science/r/COVID-Stigma-A-Dataset-of-Anti-Asian-Stigmatizing-Tweets-During-COVID-19-65DD.
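
The three-annotator step described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `resolve_label`, the tuple layout of the example data, and the rule that the third annotator may pick any valid primary label on a conflict are all assumptions made for illustration. Only the five primary label names come from the abstract.

```python
from typing import Optional

# Primary labels as listed in the abstract.
LABELS = {
    "unknown/irrelevant",
    "not-stigmatizing",
    "stigmatizing-low",
    "stigmatizing-medium",
    "stigmatizing-high",
}


def resolve_label(first: str, second: str, third: Optional[str] = None) -> str:
    """Return the final label for one Tweet (hypothetical helper).

    If the first two annotators agree, their shared label stands.
    Otherwise the third annotator's label decides. Assumption: on a
    conflict the third annotator always supplies a valid primary label.
    """
    for label in (first, second):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    if first == second:
        return first
    if third is None or third not in LABELS:
        raise ValueError("a conflict requires a valid third-annotator label")
    return third


# Illustrative (made-up) annotations: (first, second, third) per Tweet.
annotations = [
    ("stigmatizing-low", "stigmatizing-low", None),
    ("not-stigmatizing", "stigmatizing-medium", "stigmatizing-medium"),
]
final_labels = [resolve_label(a, b, c) for a, b, c in annotations]
```

In this sketch a conflict without a third label raises an error rather than silently keeping one annotator's choice, so unresolved Tweets cannot slip into the final set unnoticed.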

Results:

To set initial benchmarks for our dataset, we implement several state-of-the-art models for detecting stigmatizing Tweets. Our results show that the Bidirectional Encoder Representations from Transformers (BERT) model achieves the highest accuracy, 79%, when detecting stigma on unseen data, while traditional models such as Support Vector Machine reach 73% accuracy.
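
For readers unfamiliar with the metric, accuracy on unseen data is the fraction of held-out Tweets whose predicted label matches the gold label. The sketch below shows that computation only; the function name `accuracy` and the toy labels are hypothetical, and the abstract does not state whether the benchmarks used all five primary labels or a coarser grouping.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy held-out labels (made up for illustration, not from the dataset).
gold = ["stigmatizing-high", "not-stigmatizing", "stigmatizing-low", "not-stigmatizing"]
predicted = ["stigmatizing-high", "not-stigmatizing", "stigmatizing-medium", "not-stigmatizing"]
score = accuracy(gold, predicted)  # 3 of 4 predictions match
```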

Conclusions:

Our dataset can be used as a benchmark for further qualitative and quantitative research and analysis around the issue. We believe this contribution will significantly help to predict, and hence reduce, the unfair stigma, hate, and discrimination against Asian people during future crises like COVID-19.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.