Offensive Content Detection Via Synthetic Code-Switched Text

Offensive content is widespread on online platforms such as social media and customer-service chat boxes, and it has become a serious concern for those platforms. Classifying offensive and hate-speech content in online settings is therefore an essential task in many applications. However, text from online platforms can contain code-switching, a combination of more than one language in the same text. The lack of labeled code-switched data for low-resource language combinations makes the problem harder. To address this, we release a real-world dataset of around 10k samples for testing on three language combinations (en-fr, en-es, and en-de) and a synthetic code-switched text dataset of 30k samples for training. In this paper, we describe the process for gathering the human-generated data and our algorithm for creating synthetic code-switched offensive content data. We also report the results of a keyword classification baseline and a multilingual transformer-based classification model.
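
To make the idea of a keyword classification baseline concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's implementation: the lexicon, the tokenizer, and the function names are all hypothetical, and the abstract does not specify the actual keyword list or matching rule.

```python
import re

# Hypothetical toy lexicon mixing English, French, Spanish, and German terms.
# The keyword list used in the paper is not given in the abstract.
OFFENSIVE_KEYWORDS = {"idiot", "stupid", "stupide", "tonto", "dummkopf"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_offensive(text, lexicon=OFFENSIVE_KEYWORDS):
    """Flag the text if any token appears in the lexicon (assumed rule)."""
    return any(token in lexicon for token in tokenize(text))
```

Because the lookup is per token and language-agnostic, a code-switched sentence such as "Tu es vraiment stupide, you know" is flagged as long as one of its tokens is in the lexicon, whichever language that token comes from. A baseline like this cannot handle misspellings or context, which is one reason to compare it against a learned multilingual model.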

COLING 2022

Publication date: October 17, 2022

Cesa Salaam, Franck Dernoncourt, Trung Bui, Danda Rawat, David Seunghyun Yoon

Research Areas: AI & Machine Learning, Document Intelligence, Natural Language Processing