Intersectional Bias in

Hate Speech and Abusive Language Datasets

Jae Yeon Kim, Carlos Ortiz, Sarah Nam, Sarah Santiago, Vivek Datta


UC Berkeley

Motivation

What if the training data used to detect hate speech and abusive language on social media were themselves biased?

Data

Definitions

  • Abusive Language: "Any strongly impolite, rude or hurtful language using profanity"
  • Hate Speech: "Language used to express hatred towards a targeted individual or group"

Objectives

Previous work

  • Looking at either
    • racial bias (Waseem 2016; Waseem and Hovy 2016; Davidson, Bhattacharya, and Weber 2019; Sap et al. 2019) or
    • gender bias (Tatman 2017; Park, Shin, and Fung 2018; Dixon et al. 2018)

Our goal

  • Demonstrating intersectional bias (race + gender)
  • Classifying race, gender, and party ID of the 100k tweets
  • Estimating how these language features are associated with the distributions of hateful and abusive tweets

"The Enduring Myth of Black Criminality" by Ta-Nehisi Coates (Atlantic)

https://www.youtube.com/watch?v=cQo-yYhExw0

Hypotheses

  • Racial bias: The first hypothesis concerns between-group differences: Pr(hateful/abusive | Black) > Pr(hateful/abusive | White)
  • Intersectional bias: The second hypothesis concerns within-group differences: Pr(hateful/abusive | Black men) > Pr(hateful/abusive | all other race and gender subgroups)
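A minimal sketch of how these two conditional probabilities could be estimated from a labeled corpus; the pandas data frame, column names, and toy values below are illustrative assumptions, not the study's data or code.

```python
import pandas as pd

# Hypothetical toy data: one row per tweet, with the inferred race and gender
# of the author and the crowdsourced content label.
tweets = pd.DataFrame({
    "race":   ["black", "white", "black", "white", "black"],
    "gender": ["male", "female", "male", "male", "female"],
    "label":  ["hateful", "normal", "abusive", "normal", "normal"],
})
tweets["flagged"] = tweets["label"].isin(["hateful", "abusive"])

# H1 (racial bias): compare Pr(hateful/abusive | Black) and Pr(hateful/abusive | White)
print(tweets.groupby("race")["flagged"].mean())

# H2 (intersectional bias): Black men vs. all other race and gender subgroups
black_men = (tweets["race"] == "black") & (tweets["gender"] == "male")
print(tweets.loc[black_men, "flagged"].mean(),
      tweets.loc[~black_men, "flagged"].mean())
```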

Classifications

Racial demographic classification     

  • Made using the mixed-membership demographic language model developed by Blodgett, Green, and O'Connor (2016).
  • The model is trained on 59.2 million tweets from 2.8 million users, merged with US Census data.
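As one possible illustration of the labeling step, the sketch below assumes the model has already returned, for each tweet, posterior proportions over African American-, Hispanic-, Asian-, and white-aligned language, and assigns a race label only when one proportion clearly dominates. The group names, threshold, and function are assumptions for illustration, not the model's actual API.

```python
from typing import Optional

# Assumed ordering of the posterior proportions (an illustrative convention).
GROUPS = ("aa", "hispanic", "asian", "white")

def assign_race(posterior: tuple[float, float, float, float],
                threshold: float = 0.8) -> Optional[str]:
    """Assign a race label only when one group clearly dominates the posterior."""
    best = max(range(len(posterior)), key=lambda i: posterior[i])
    if posterior[best] >= threshold:
        return GROUPS[best]
    return None  # ambiguous tweets are left unlabeled

print(assign_race((0.91, 0.03, 0.02, 0.04)))  # -> "aa"
print(assign_race((0.40, 0.10, 0.10, 0.40)))  # -> None
```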

Party identification classification

  • Made using a lasso regression on a bag-of-words model with bigrams and stemming
  • Trained on 86,460 tweets from Democratic and Republican candidates in the 2018 US congressional elections.
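A minimal sketch of this kind of classifier in scikit-learn: an L1-penalized logistic regression stands in for the lasso, and the Porter stemmer, toy tweets, and labels are illustrative assumptions rather than the authors' pipeline.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

stemmer = PorterStemmer()

def stem_tokenizer(text: str) -> list[str]:
    # Crude whitespace tokenization followed by Porter stemming.
    return [stemmer.stem(tok) for tok in text.lower().split()]

# Bag of words with unigrams + bigrams over stemmed tokens,
# fed into an L1-penalized (lasso-style) logistic regression.
party_clf = Pipeline([
    ("bow", CountVectorizer(tokenizer=stem_tokenizer, ngram_range=(1, 2))),
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear", C=1.0)),
])

# Toy stand-in for the 86,460 candidate tweets and their party labels.
texts = ["medicare for all now", "cut taxes and secure the border"]
labels = ["D", "R"]
party_clf.fit(texts, labels)
print(party_clf.predict(["expand health care coverage"]))
```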

Gender classification

  • Made using a lasso regression on a bag-of-words model with bigrams and stemming, trained on data provided by the Data For Everyone library
  • Two separate binary models were trained, one for male and one for female authors, to improve accuracy and make better use of the available samples
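A short sketch of the two-binary-model setup, reusing the same bag-of-words-plus-lasso recipe; the helper, toy tweets, and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def make_gender_clf() -> Pipeline:
    # Same recipe as the party model: bag of words with bigrams,
    # L1-penalized (lasso-style) logistic regression.
    return Pipeline([
        ("bow", CountVectorizer(ngram_range=(1, 2))),
        ("lasso", LogisticRegression(penalty="l1", solver="liblinear")),
    ])

# Two separate binary models, one per gender.
texts = ["example tweet one", "another example tweet"]
male_clf = make_gender_clf().fit(texts, [1, 0])      # 1 = male author
female_clf = make_gender_clf().fit(texts, [0, 1])    # 1 = female author
```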


Descriptive analysis

Bootstrapping
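A minimal sketch of a nonparametric bootstrap for the group-level proportions, under the same illustrative data assumptions as the earlier sketch; it is not the study's estimation code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical labeled tweets (see the earlier sketch).
tweets = pd.DataFrame({
    "race":    ["black", "white"] * 50,
    "flagged": rng.random(100) < 0.3,
})

def bootstrap_ci(values: np.ndarray, n_boot: int = 1000) -> tuple[float, float]:
    """95% percentile interval for the mean of a binary indicator."""
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return float(np.percentile(means, 2.5)), float(np.percentile(means, 97.5))

for race, group in tweets.groupby("race"):
    low, high = bootstrap_ci(group["flagged"].to_numpy())
    print(race, round(low, 3), round(high, 3))
```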

Logistic regression analysis
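A hedged sketch of the kind of logistic regression that can test the intersectional hypothesis, using an interaction between race and gender via statsmodels' formula interface; the simulated data frame and variable names are assumptions for illustration. A positive, significant black:male interaction would indicate that tweets attributed to Black men are flagged more often than the main effects alone would predict.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=1)

# Hypothetical analysis data: one row per tweet with inferred covariates.
n = 500
df = pd.DataFrame({
    "black":      rng.integers(0, 2, n),
    "male":       rng.integers(0, 2, n),
    "republican": rng.integers(0, 2, n),
})
df["flagged"] = (rng.random(n) < 0.2 + 0.1 * df["black"] * df["male"]).astype(int)

# Intersectional bias shows up as a positive black:male interaction term.
model = smf.logit("flagged ~ black * male + republican", data=df).fit()
print(model.summary())
```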

Implications

  • The study provides the first systematic evidence on intersectional bias in datasets of hate speech and abusive language.
  • Consistent with broad social science research on the criminalization of African American men (Oliver 2003; Mastro, Blecha, and Seate 2015; Kappeler and Potter 2017; Najdowski, Bottoms, and Goff 2015; Hall, Hall, and Perry 2016; Skinner and Hass 2016)

Limitations

  • Missing variables in the model may introduce omitted variable bias. A better approach would be to design and run an experiment.
  • None of the key predictor variables was directly observed; all were inferred through machine-enabled text classification. Uncertainty in these measurements may destabilize the inference if the effect size is small.