Simple n-gram-based models perform well for gender prediction.
Sometimes.
Evalita 2018 - Torino

Capetown
Milano
Tirana










BUT...

ASSUMPTION
If a model performs well on a gender-labelled data set, then it is (dangerously) modelling gender

Let's give GXG a try!
Build the best possible model
Research Questions
RQ1: Take a state-of-the-art gender prediction system and test it under new conditions
RQ2: In case it performs poorly, try something to improve it
MODEL


text
word n-grams
character n-grams
Linear SVM

PAN 2017

languages? genres?
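
The feature side of the model slide (text, word n-grams, character n-grams) can be sketched in plain Python. This is only an illustration, not the system's actual code: the n-gram ranges (words 1-2, characters 3-5), the `w:`/`c:` prefixes, and all function names are assumptions, since the slides do not state them. The resulting counts are the kind of sparse features one would hand to a linear SVM.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams for n in [n_min, n_max] (range is an assumption)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams for n in [n_min, n_max] (range is an assumption)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts["c:" + text[i:i + n]] += 1
    return counts


def extract_features(text):
    """Merge both n-gram views into one sparse feature dictionary.

    The "w:" and "c:" prefixes keep word and character features apart.
    """
    features = word_ngrams(text)
    features.update(char_ngrams(text))
    return features
```

Calling `extract_features("ciao a tutti")` returns a `Counter` holding keys such as `w:ciao a` and `c:cia`. Nothing in this sketch depends on the language or genre of the text.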

RESULTS
In-genre accuracy (%)
genre | lexical | bleached |
---|---|---|
youtube | 62 | 59 |
(label missing) | 74 | 67 |
diaries | 70 | 67 |
journalism | 62 | 54 |
children | 54 | 53 |
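
The slides do not define the "bleached" column. A plausible reading, and it is only an assumption here, is that words are replaced by abstract descriptions (character shape, vowel/consonant pattern, length) so that the model never sees the lexical content. The sketch below is a simplified version of that idea; the feature set, the class symbols, and the function names are all hypothetical.

```python
from itertools import groupby

# Vowel inventory is an assumption (plain vowels plus common Italian accents).
VOWELS = set("aeiouàèéìòù")


def shape(token):
    """Map each character to U/L/D/X and collapse runs of the same class."""
    classes = []
    for ch in token:
        if ch.isupper():
            classes.append("U")
        elif ch.islower():
            classes.append("L")
        elif ch.isdigit():
            classes.append("D")
        else:
            classes.append("X")
    return "".join(key for key, _ in groupby(classes))


def vowel_pattern(token):
    """Map vowels to V, other letters to C, everything else to O."""
    out = []
    for ch in token.lower():
        if ch in VOWELS:
            out.append("V")
        elif ch.isalpha():
            out.append("C")
        else:
            out.append("O")
    return "".join(out)


def bleach(text):
    """Replace every whitespace-separated token by its abstract description."""
    return " ".join(
        f"{shape(tok)}_{vowel_pattern(tok)}_{len(tok)}" for tok in text.split()
    )
```

For example, `bleach("Ciao Mondo")` gives `UL_CVVV_4 UL_CVCCV_5`. The bleached string can then go through the same n-gram extraction as the lexical text.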
Cross-genre accuracy (%)
genre | lexical | bleached |
---|---|---|
youtube | 57 | 53 |
(label missing) | 52 | 50 |
diaries | 62 | 53 |
journalism | 56 | 53 |
children | 60 | 53 |
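
The slides do not spell out the evaluation protocol. Assuming CROSS means "train on every other genre, test on the target genre", a leave-one-genre-out loop could look like the sketch below. The toy corpus, its labels, and the function names are made up for illustration; only the genre names come from the tables.

```python
def cross_genre_splits(corpus):
    """Yield (target_genre, train, test) triples.

    Assumed protocol: train on all genres except the target, test on the target.
    `corpus` maps a genre name to a list of (text, label) pairs.
    """
    for target in sorted(corpus):
        train = [
            example
            for genre, examples in corpus.items()
            if genre != target
            for example in examples
        ]
        test = list(corpus[target])
        yield target, train, test


def accuracy(gold, predicted):
    """Percentage of predictions that match the gold labels."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Hypothetical toy data, only to show the shape of the input.
toy_corpus = {
    "youtube": [("text a", "F"), ("text b", "M")],
    "diaries": [("text c", "F")],
    "journalism": [("text d", "M"), ("text e", "F")],
}

for target, train, test in cross_genre_splits(toy_corpus):
    print(target, len(train), len(test))
```

With a binary label, 50% accuracy is what a constant guess would reach on balanced data, which is the baseline to keep in mind when reading the cross-genre table.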
Conclusions
RQ1: Gender prediction is hard!
We don't know if it is dangerous
RQ2: Abstract features produce consistent but low results
also...
github.com/anbasile/gxg
DOWNLOAD TRAINED MODEL

What next?

gxg
By Angelo