JMIR Preprints #40403: Development of COVID-19-related Anti-Asian Tweet Dataset: A Quantitative Study

Development of COVID-19-related Anti-Asian Tweet Dataset: A Quantitative Study

Maryam Mokhberi;
Ahana Biswas;
Zarif Masud;
Roula Kteily-Hawa;
Abby Goldstein;
Joseph Roy Gillis;
Shebuti Rayana;
Syed Ishtiaque Ahmed

ABSTRACT

Background:

Since the advent of the COVID-19 pandemic, individuals of Asian descent have been the subject of stigma and hate speech in both offline and online communities. Social networks such as Twitter are among the major venues where such unfair attacks are encountered. As the research community seeks to understand, analyze, and implement detection techniques, high-quality datasets are becoming immensely important.

Objective:

In this study, we introduce a manually labeled dataset of Tweets containing anti-Asian stigmatizing content.

Methods:

We sampled over 668M Tweets posted on Twitter between January 2020 and July 2020 and used an iterative data construction approach comprising three stages of algorithm-driven data selection and manual labeling, arriving at 11,263 Tweets with primary labels (unknown/irrelevant, not-stigmatizing, stigmatizing-low, stigmatizing-medium, stigmatizing-high) and Tweet sub-topics (e.g., wet market and eating habits, COVID-19 cases, bioweapon). Moreover, we selected 5,000 Tweets from that dataset and had a second annotator label them; a third annotator then resolved conflicts between the labels of the first and second annotators. We present this final dataset as a high-quality Twitter dataset on stigma towards Chinese people during the COVID-19 pandemic. The dataset and instructions for labeling can be viewed in the GitHub repository: https://anonymous.4open.science/r/COVID-Stigma-A-Dataset-of-Anti-Asian-Stigmatizing-Tweets-During-COVID-19-65DD.
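The three-annotator step described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `resolve_label` and the rule that the third annotator's label is final whenever the first two disagree are assumptions made for illustration, since the abstract does not specify how conflicts were resolved. Only the five primary label names are taken from the text.

```python
from typing import Optional

# Primary labels as listed in the abstract.
PRIMARY_LABELS = (
    "unknown/irrelevant",
    "not-stigmatizing",
    "stigmatizing-low",
    "stigmatizing-medium",
    "stigmatizing-high",
)


def resolve_label(first: str, second: str, third: Optional[str] = None) -> str:
    """Return the final label for one Tweet (hypothetical adjudication rule).

    If the first and second annotators agree, their shared label stands.
    Otherwise the third annotator's label is taken as final (an assumption).
    """
    for label in (first, second):
        if label not in PRIMARY_LABELS:
            raise ValueError(f"unknown label: {label!r}")
    if first == second:
        return first
    if third is None:
        raise ValueError("annotators disagree; a third label is required")
    if third not in PRIMARY_LABELS:
        raise ValueError(f"unknown label: {third!r}")
    return third
```

For example, `resolve_label("stigmatizing-low", "stigmatizing-high", "stigmatizing-medium")` returns the third annotator's label, while two matching labels are returned unchanged without consulting a third annotator.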

Results:

We implemented several state-of-the-art models for detecting stigmatizing Tweets to set initial benchmarks for our dataset. Our results show that the Bidirectional Encoder Representations from Transformers (BERT) model achieves the highest accuracy, 79%, when detecting stigma on unseen data, while traditional models such as Support Vector Machine reach 73% accuracy.
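For readers unfamiliar with the metric, accuracy on unseen data is the share of held-out Tweets whose predicted label matches the manual label. The sketch below shows that calculation only; the function name `accuracy` and the toy labels in the usage note are illustrative assumptions and do not reproduce the reported benchmarks.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the predicted label equals the manual label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

With four toy held-out labels of which three are predicted correctly, `accuracy` returns 0.75; a reported accuracy of 79% corresponds to 79 correct predictions per 100 held-out Tweets.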

Conclusions:

Our dataset can be used as a benchmark for further qualitative and quantitative research and analysis around the issue. We believe this contribution will significantly help to predict, and hence reduce, the unfair stigma, hate, and discrimination against Asian people during future crises like COVID-19.

Citation

Please cite as:

Mokhberi M, Biswas A, Masud Z, Kteily-Hawa R, Goldstein A, Gillis JR, Rayana S, Ahmed SI

Development of a COVID-19–Related Anti-Asian Tweet Data Set: Quantitative Study

JMIR Form Res 2023;7:e40403

DOI: 10.2196/40403

PMID: 36693148

PMCID: 9976773

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Formative Research

Date Submitted: Jun 20, 2022

Open Peer Review Period: Jun 19, 2022 - Aug 14, 2022

Date Accepted: Nov 11, 2022

Date Submitted to PubMed: Jan 24, 2023
