Text analysis

Experiment:
identify predominant gender in PISA 2018

Process overview

  • Parse and clean up a set of TMX files from MemoryLn
  • Build gender word sets, i.e. sets of words that distinguish sentences about men from sentences about women
  • Segment the text into sentences and split each sentence into words, then tokenize and lowercase each word
  • Assign a gender class to each sentence
  • Count the number of sentences and words in each class, sum them and compute percentages
  • Statistical analysis
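
The first and third steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code linked at the end of this page. The function names, the sample TMX string and the regex-based sentence splitter are all assumptions made for the example; real TMX files may also need extra clean-up (inline tags, curly apostrophes, abbreviations that end in a period).

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_segments(tmx_text, lang="en"):
    """Return the text of every <seg> whose parent <tuv> is in `lang`."""
    root = ET.fromstring(tmx_text)
    segments = []
    for tuv in root.iter("tuv"):
        # Older TMX versions use a plain "lang" attribute instead of xml:lang.
        tuv_lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        if tuv_lang.lower().startswith(lang):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments.append("".join(seg.itertext()).strip())
    return segments

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence):
    """Lowercase and keep runs of letters, allowing one internal apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

# Hypothetical two-language translation unit, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>He's late. Her sister waited.</seg></tuv>
      <tuv xml:lang="fr"><seg>Il est en retard. Sa soeur a attendu.</seg></tuv>
    </tu>
  </body>
</tmx>"""

for segment in extract_segments(SAMPLE_TMX):
    for sentence in split_sentences(segment):
        print(tokenize(sentence))
```

Keeping the apostrophe inside a token matters here because the word sets contain forms such as "he's" and "men's".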

MALE_WORDS

guy spokesman chairman men's men him he's his boy boyfriend boyfriends boys brother brothers dad dads dude father fathers fiance gentleman gentlemen god grandfather grandpa grandson groom he himself husband husbands king male man mr nephew nephews priest prince son sons uncle uncles waiter widower widowers

FEMALE_WORDS

heroine spokeswoman chairwoman women's actress women she's her aunt aunts bride daughter daughters female fiancee girl girlfriend girlfriends girls goddess granddaughter grandma grandmother herself ladies lady mom moms mother mothers mrs ms niece nieces priestess princess queens she sister sisters waitress widow widows wife wives woman

Assign gender to sentence

  • If a sentence contains only male words, it is classified as a MALE sentence.
  • If a sentence contains only female words, it is classified as a FEMALE sentence.
  • If a sentence contains both male and female words, it is classified as BOTH.
  • If a sentence contains neither male nor female words, it is classified as UNKNOWN.
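
The four rules above amount to two set intersections. The sketch below is one possible way to write them, together with the per-class percentage count. The word sets are abbreviated subsets of the lists above, and the function names and sample sentences are assumptions for illustration.

```python
from collections import Counter

# Abbreviated word sets for illustration; the full lists are given above.
MALE_WORDS = {"he", "he's", "him", "his", "man", "men", "father", "king"}
FEMALE_WORDS = {"she", "she's", "her", "woman", "women", "mother", "sister"}

def classify(tokens):
    """Return MALE, FEMALE, BOTH or UNKNOWN for a list of lowercase tokens."""
    words = set(tokens)
    has_male = bool(words & MALE_WORDS)
    has_female = bool(words & FEMALE_WORDS)
    if has_male and has_female:
        return "BOTH"
    if has_male:
        return "MALE"
    if has_female:
        return "FEMALE"
    return "UNKNOWN"

def class_percentages(labels):
    """Map each class to its share of all sentences, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Hypothetical, already tokenized sentences.
sentences = [
    ["he", "left"],
    ["her", "sister", "waited"],
    ["he", "called", "her"],
    ["it", "rained"],
]
labels = [classify(tokens) for tokens in sentences]
print(labels)
print(class_percentages(labels))
```

Because membership is tested on whole tokens, a word such as "they" does not match "he", which a plain substring search would get wrong.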

Questionnaires

94.939% unknown (1234 sentences)
3.085% both (21 sentences)
1.227% female (10 sentences)
0.749% male (7 sentences)

Cognitive

90.967% unknown (11981 sentences)
5.862% male (637 sentences)
2.457% female (270 sentences)
0.714% both (65 sentences)

Now, statistical analysis would be necessary to see how relevant these raw figures really are.
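
As one possible starting point, and not an analysis that was actually carried out here, a chi-square goodness-of-fit test could compare the MALE and FEMALE sentence counts. The sketch below makes two assumptions: the null hypothesis is an equal split between the two classes, and sentences are independent of each other. Both are debatable for this kind of corpus. The function name is hypothetical, and the counts passed in are the Cognitive figures reported above.

```python
import math

def chi_square_equal_split(count_a, count_b):
    """Chi-square goodness-of-fit statistic and p-value (1 degree of freedom)
    for the null hypothesis that both classes are equally likely."""
    expected = (count_a + count_b) / 2
    stat = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # With 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Cognitive: 637 MALE sentences vs 270 FEMALE sentences.
stat, p = chi_square_equal_split(637, 270)
print(f"chi2 = {stat:.1f}, p = {p:.3g}")
```

For two classes the statistic reduces to (637 - 270)^2 / (637 + 270), about 148.5. With counts as small as those for Questionnaires (7 vs 10), an exact binomial test would probably be a safer choice.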

Corollary

This technique is naive, but it can be extended to detect other properties of a text, such as tense language, tone or sentiment (e.g. "awesome", "good", "stupendous" vs. "horrible", "tasteless", "bland").

However, when we look at word senses, we need more elaborate, context-based language models to cope with ambiguity (e.g. n-grams or co-occurrences).
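
To make the idea of context concrete, here is a minimal sketch of n-gram extraction and windowed co-occurrence counting. The function names and the sample sentence are assumptions for illustration. In the sample, "her" appears once as an object pronoun and once as a possessive, and its neighbours are the kind of evidence a context-based model could use to tell the two apart.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cooccurrences(tokens, target, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts

# Hypothetical tokenized sentence.
tokens = ["i", "gave", "her", "the", "book", "and", "her", "sister", "smiled"]
print(ngrams(tokens)[:3])
print(cooccurrences(tokens, "her", window=1))
```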

Code

https://repl.it/@msoutopico/text-analysis

Data

https://capps.capstan.be/Lab/ling/TM/

https://capps.capstan.be/Lab/ling/corpus_cog.txt

https://capps.capstan.be/Lab/ling/corpus_qq.txt