Simple n-gram-based models perform well for gender prediction.
Sometimes.
Evalita 2018 - Torino
Capetown
Milano
Tirana
BUT...
ASSUMPTION: If a model performs well on a gender-labelled data set, then it is (dangerously) modelling gender
Let's give GXG a try!
Build the best possible model
Research Questions
RQ1: Take a state-of-the-art gender prediction system and test it under new conditions
RQ2: In case it performs poorly, try something to improve it
MODEL
text
word n-grams
character n-grams
Linear SVM
PAN 2017
languages? genres?
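The MODEL slide lists text, word n-grams, character n-grams and a Linear SVM. A minimal sketch of how those pieces could be wired together with scikit-learn follows. The TF-IDF weighting, the n-gram ranges, the `build_model` name and the toy documents are all assumptions made for illustration; the slides do not give the actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_model():
    """Return an untrained pipeline: word + character n-grams -> linear SVM."""
    features = FeatureUnion([
        # n-gram ranges and TF-IDF weighting are illustrative assumptions,
        # not values taken from the slides.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])
    return Pipeline([("features", features), ("svm", LinearSVC())])


# Made-up placeholder documents and labels, only to show the call pattern.
texts = [
    "first placeholder document",
    "second placeholder document",
    "third placeholder document",
    "fourth placeholder document",
]
labels = ["F", "M", "F", "M"]

model = build_model()
model.fit(texts, labels)
predictions = model.predict(texts)
```

Concatenating both feature sets with `FeatureUnion` lets the SVM weigh whole-word cues and sub-word cues in a single linear model.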
RESULTS
Results: in-genre (IN) accuracy (%)
| genre | lexical | bleached |
|---|---|---|
| youtube | 62 | 59 |
| | 74 | 67 |
| diaries | 70 | 67 |
| journalism | 62 | 54 |
| children | 54 | 53 |
Results: cross-genre (CROSS) accuracy (%)
| genre | lexical | bleached |
|---|---|---|
| youtube | 57 | 53 |
| | 52 | 50 |
| diaries | 62 | 53 |
| journalism | 56 | 53 |
| children | 60 | 53 |
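The IN and CROSS labels distinguish testing on the training genre from testing on an unseen genre. The sketch below shows one way to build a cross-genre split and score it. It reads CROSS as "train on every genre except the target, test on the target", which is an assumption about the setup. The helper names `cross_genre_split` and `accuracy` and the sample triples are hypothetical.

```python
def cross_genre_split(samples, target_genre):
    """Split (text, label, genre) triples into train and test lists.

    Train: every sample whose genre differs from target_genre.
    Test: every sample whose genre equals target_genre.
    """
    train = [s for s in samples if s[2] != target_genre]
    test = [s for s in samples if s[2] == target_genre]
    return train, test


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up placeholder samples, only to show the call pattern.
samples = [
    ("placeholder text a", "F", "youtube"),
    ("placeholder text b", "M", "diaries"),
    ("placeholder text c", "F", "journalism"),
    ("placeholder text d", "M", "youtube"),
]
train, test = cross_genre_split(samples, "youtube")
```

`accuracy` returns a fraction between 0 and 1; multiplying by 100 gives figures on the same scale as the tables.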
Conclusions
RQ1: Gender prediction is hard!
We don't know if it is dangerous
RQ2: Abstract features produce consistent but low results
also...
github.com/anbasile/gxg
DOWNLOAD TRAINED MODEL
What next?