Text analysis
Experiment:
identify predominant gender in PISA 2018
Process overview
- Parse and clean up a set of TMX files from MemoryLn
- Build gender word sets, i.e. sets of words that distinguish sentences about men from sentences about women
- Segment the text into sentences, then tokenize each sentence into lowercased words
- Assign gender class to each sentence
- Count the number of sentences and words in each class, sum them and compute percentages
- Statistical analysis
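The first steps of the process can be sketched roughly as follows. This is a minimal sketch, not the code behind the link in the Code section: the function names (`extract_segments`, `split_sentences`, `tokenize`), the sample TMX string and the regex-based sentence splitter are all assumptions made for illustration. It only assumes the standard TMX layout, where each `<tuv>` carries an `xml:lang` attribute and holds one `<seg>`.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample data, only here to make the sketch runnable.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>She reads a book. He's late!</seg></tuv>
      <tuv xml:lang="fr"><seg>Elle lit un livre. Il est en retard !</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_segments(tmx_text, lang="en"):
    """Return the text of every <seg> whose <tuv> language starts with `lang`."""
    root = ET.fromstring(tmx_text)
    segments = []
    for tuv in root.iter("tuv"):
        # Older TMX files may use a plain `lang` attribute instead of xml:lang.
        tuv_lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        if tuv_lang.lower().startswith(lang):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments.append("".join(seg.itertext()).strip())
    return segments


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Lowercase and keep alphabetic words, including forms like "he's"."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())


segments = extract_segments(SAMPLE_TMX)
sentences = [s for seg in segments for s in split_sentences(seg)]
tokenized = [tokenize(s) for s in sentences]
print(tokenized)
```

The tokenizer keeps straight apostrophes so that entries such as "he's" and "men's" in the word lists can match; typographic apostrophes would need to be normalized first.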
MALE_WORDS
guy spokesman chairman men's men him he's his boy boyfriend boyfriends boys brother brothers dad dads dude father fathers fiance gentleman gentlemen god grandfather grandpa grandson groom he himself husband husbands king male man mr nephew nephews priest prince son sons uncle uncles waiter widower widowers
FEMALE_WORDS
heroine spokeswoman chairwoman women's actress women she's her aunt aunts bride daughter daughters female fiancee girl girlfriend girlfriends girls goddess granddaughter grandma grandmother herself ladies lady mom moms mother mothers mrs ms niece nieces priestess princess queens she sister sisters waitress widow widows wife wives woman
Assign gender to sentence
- If a sentence contains only male words, it is classified as a MALE sentence.
- If a sentence contains only female words, it is classified as a FEMALE sentence.
- If a sentence contains both male and female words, it is classified as BOTH.
- If a sentence contains neither male nor female words, it is classified as UNKNOWN.
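The four rules above translate almost directly into code. The sketch below uses the two word lists exactly as given; the function name `classify` and the assumption that sentences arrive as lists of lowercased tokens are illustrative choices, not necessarily how the original code is organized.

```python
MALE_WORDS = set(
    "guy spokesman chairman men's men him he's his boy boyfriend boyfriends "
    "boys brother brothers dad dads dude father fathers fiance gentleman "
    "gentlemen god grandfather grandpa grandson groom he himself husband "
    "husbands king male man mr nephew nephews priest prince son sons uncle "
    "uncles waiter widower widowers".split()
)

FEMALE_WORDS = set(
    "heroine spokeswoman chairwoman women's actress women she's her aunt "
    "aunts bride daughter daughters female fiancee girl girlfriend "
    "girlfriends girls goddess granddaughter grandma grandmother herself "
    "ladies lady mom moms mother mothers mrs ms niece nieces priestess "
    "princess queens she sister sisters waitress widow widows wife wives "
    "woman".split()
)


def classify(tokens):
    """Assign MALE, FEMALE, BOTH or UNKNOWN to a list of lowercased tokens."""
    has_male = any(token in MALE_WORDS for token in tokens)
    has_female = any(token in FEMALE_WORDS for token in tokens)
    if has_male and has_female:
        return "BOTH"
    if has_male:
        return "MALE"
    if has_female:
        return "FEMALE"
    return "UNKNOWN"


print(classify(["he", "reads", "a", "book"]))
print(classify(["she", "called", "her", "brother"]))
```

Because matching is by exact token, a word counts only if it appears in the list in that exact form.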
Questionnaires
94.939% unknown (1234 sentences)
3.085% both (21 sentences)
1.227% female (10 sentences)
0.749% male (7 sentences)
Cognitive
90.967% unknown (11981 sentences)
5.862% male (637 sentences)
2.457% female (270 sentences)
0.714% both (65 sentences)
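Breakdowns like the ones above can be produced by tallying the class labels. The following is a minimal sketch that computes shares over sentences; the function name `class_shares` and the toy label list are assumptions for illustration, and the same function would work for word-level counts if it were fed one label per word instead.

```python
from collections import Counter


def class_shares(labels):
    """Map each class label to (count, percentage), most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        label: (n, 100.0 * n / total)
        for label, n in counts.most_common()
    }


# Hypothetical labels, not the PISA data.
labels = ["UNKNOWN"] * 6 + ["MALE"] * 3 + ["FEMALE"]
for label, (n, share) in class_shares(labels).items():
    print(f"{share:.3f}% {label.lower()} ({n} sentences)")
```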
Statistical analysis would now be necessary to see how relevant these raw figures really are.
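One possible first check, given here only as an illustration since no particular test is prescribed, is to ask whether the male/female split among gendered sentences differs from 50/50. The sketch below uses a normal approximation to the binomial with the standard library only. The function name `proportion_z_test` is an assumption, and the test treats sentences as independent observations, which is doubtful for sentences taken from the same unit of text.

```python
import math


def proportion_z_test(successes, trials, p0=0.5):
    """Two-sided z-test of an observed proportion against p0.

    Uses the normal approximation to the binomial, so it is only
    reasonable when trials * p0 and trials * (1 - p0) are both large.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    expected = trials * p0
    std = math.sqrt(trials * p0 * (1 - p0))
    z = (successes - expected) / std
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Cognitive corpus: 637 MALE vs 270 FEMALE sentences (BOTH and UNKNOWN ignored).
z, p_value = proportion_z_test(637, 637 + 270)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

With counts as small as those in the questionnaire corpus (7 vs 10), the normal approximation is weak and an exact binomial test would be the safer choice.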
Corollary
This technique is naive, but it can be extended to detect other things, such as tense language, tone of text or sentiment (e.g. "awesome", "good", "stupendous" vs "horrible", "tasteless", "bland").
However, when we look at word senses, we need more elaborate, context-based language models (e.g. n-grams, co-occurrences) to cope with ambiguity.
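As a small step in that direction, the context of an ambiguous word can be inspected by counting its immediate neighbours. This is only a sketch of bigram-based co-occurrence counting; the function names `bigrams` and `neighbour_counts` and the example sentences are assumptions for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def neighbour_counts(tokenized_sentences, target):
    """Count the words that appear directly before or after `target`."""
    counts = Counter()
    for tokens in tokenized_sentences:
        for left, right in bigrams(tokens):
            if left == target:
                counts[right] += 1
            if right == target:
                counts[left] += 1
    return counts


# Hypothetical sentences where "god" does not necessarily refer to a male person.
examples = [
    ["oh", "my", "god"],
    ["the", "god", "of", "war"],
    ["thank", "god"],
]
print(neighbour_counts(examples, "god"))
```

Neighbour counts like these could then feed a rule or model that decides whether a given occurrence should count as a gender marker at all.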
Code
https://repl.it/@msoutopico/text-analysis
Data
https://capps.capstan.be/Lab/ling/TM/
https://capps.capstan.be/Lab/ling/corpus_cog.txt
https://capps.capstan.be/Lab/ling/corpus_qq.txt
Text analysis: gender classification
By msoutopico