June 11, 2019
as a predictor
of social stratification
So freaking good. That’s all I’m gonna say. Don’t believe me? Walk into the place and smell it. [...] Will definitely go back.
Fresh, hand-made pepperoni rolls... oh yeah. [...] Parking sucks, but I’m not taking off a point for that! Their marinara is dee-lish.
Super tasty!!!
Let me start off saying that 2 years ago my husband and I had a spectacular
dinner at L’Atelier by Joel Robuchon and finally got the "Time"
to visit Joel Robuchon.We got a limo service and a nice tour inside
the mansion of Robuchon which was very memorable and the hostess
escorted us to the dining area. Decore: In comparison to L’Atelier this
place was much more chic and elegant. However, I still loved the idea
to see all the chefs preparing and decorating my plates at L’Atelier.
Patterns of variation in language use are explainable (statistically) at least in part with reference to social class
Do speakers from different social classes use syntax in a different way?
- Build a new prediction tool
- Build better NLP tools, more resilient to bias issues
Given a set of labelled texts, grouped by author, predict the label from text.
Hypothesis: use the price range of a restaurant as a proxy for the social class of its reviewers
(Historical note: see Labov, 1966)
All the reviews written by an author for different restaurants.
$ - 1
$$ - 5
$$$ - 1
$$$$ - 0
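A minimal sketch of how one author's review counts could be turned into a single label. It assumes the label is simply the price range the author reviewed most often; the slides do not spell out the rule, and the function name `author_label` and the numeric encoding (1 = $, 4 = $$$$) are hypothetical.

```python
from collections import Counter

# Review counts for the example author above, keyed by price range
# (assumed encoding: 1 = $, 2 = $$, 3 = $$$, 4 = $$$$).
reviews_per_price = Counter({1: 1, 2: 5, 3: 1, 4: 0})


def author_label(counts):
    """Return the most frequently reviewed price range (assumed labelling rule)."""
    label, _ = max(counts.items(), key=lambda item: item[1])
    return label


print(author_label(reviews_per_price))  # 2, i.e. $$
```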
Hypothesis: scores will be sorted (in increasing or decreasing order) from class 1 to 4
|Automated Readability Index||6.48||6.52||6.59||6.91||0.17|
|Coleman Liau Index||7.58||7.76||8.07||8.41||0.32|
|Gunning Fog score||13.46||13.7||14.08||14.23||0.31|
|Linsear Write Formula||6||5.8||5.83||5.72||0.1|
(all results are significant at p < 0.01)
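The ordering hypothesis can be checked directly on the rounded values in the table. The sketch below tests whether each row is monotonic from class 1 to class 4. It also prints the population standard deviation of the four class values, purely for comparison with the unlabelled last column; it does not assume that this is what the column reports. The helper name `is_sorted` is hypothetical.

```python
from statistics import pstdev

# Per-class scores copied from the table (class 1 to class 4).
scores = {
    "Automated Readability Index": [6.48, 6.52, 6.59, 6.91],
    "Coleman Liau Index": [7.58, 7.76, 8.07, 8.41],
    "Gunning Fog score": [13.46, 13.7, 14.08, 14.23],
    "Linsear Write Formula": [6, 5.8, 5.83, 5.72],
}


def is_sorted(values):
    """True if the values are entirely non-decreasing or entirely non-increasing."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


for name, values in scores.items():
    print(name, is_sorted(values), round(pstdev(values), 2))
```

On these rounded values the first three indices increase with the class, while the Linsear Write row is not strictly ordered (5.8 is followed by 5.83).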
Automatically detect the language of the reviews and assign a language code to an author. Assume that each author writes in only one language.
Work on English only
Use non-English data for multi-lingual experiment
|Author||Reviews per price range||Label|
|1||$: 15 - $$$$: 1||$|
|2||$: 15 - $$$$: 14||$|
Solution: discard authors whose entropy is below mean.
Discard authors that have fewer than 40 reviews.
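A rough sketch of the two filters, assuming "entropy" means the Shannon entropy of an author's distribution of reviews over the four price ranges. The function names and the `min_reviews` parameter are hypothetical, and price ranges not listed for the two example authors are assumed to be zero. An author concentrated in one price range (15 vs 1) has low entropy and a mixed author (15 vs 14) has high entropy, so the sketch returns both sides of the mean to keep the filtering direction explicit.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of review counts per price range."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def split_by_mean_entropy(authors, min_reviews=40):
    """Drop authors with too few reviews, then split the rest at the mean entropy."""
    eligible = {a: c for a, c in authors.items() if sum(c) >= min_reviews}
    if not eligible:
        return set(), set()
    ent = {a: entropy(c) for a, c in eligible.items()}
    mean = sum(ent.values()) / len(ent)
    below = {a for a, h in ent.items() if h < mean}
    at_or_above = set(ent) - below
    return below, at_or_above


# The two example authors, as counts for [$, $$, $$$, $$$$].
authors = {"author_1": [15, 0, 0, 1], "author_2": [15, 0, 0, 14]}

# min_reviews=0 only so that the small example authors are not filtered out.
below, at_or_above = split_by_mean_entropy(authors, min_reviews=0)
print(below, at_or_above)
```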
From ~1 million authors and ~5 million reviews to...
512 authors, 4 balanced classes, more or less clean (i.e. parsable), representative English texts
Features & Modelling
Bag-of-Words (and characters)
Named Entities only
Words and c h a r a c t e r s
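For illustration, surface features of this kind can be counted with the standard library alone. This is only a sketch: whitespace tokenisation, lowercasing, and character trigrams (`n=3`) are assumptions, not settings taken from the slides.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def char_ngrams(text, n=3):
    """Count overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


review = "Super tasty!!!"
print(bag_of_words(review))
print(char_ngrams(review).most_common(3))
```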
NNS CONJ NNS
[(NNS, cc, CONJ), ...
Cvccc - shape
117 - frequency
05 - length
True - alphanumeric?
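The four abstract features above can be sketched as a small function that replaces a word by its shape, corpus frequency, zero-padded length, and an alphanumeric flag. The shape mapping (vowel to "v", other letters to "c", case preserved, digits to "d") and the example word "World" are assumptions chosen to reproduce the values shown; the actual feature extraction may differ.

```python
from collections import Counter

VOWELS = set("aeiou")


def word_shape(word):
    """Map each character to c/v (consonant/vowel), keeping case; digits become d."""
    out = []
    for ch in word:
        if ch.isalpha():
            symbol = "v" if ch.lower() in VOWELS else "c"
            out.append(symbol.upper() if ch.isupper() else symbol)
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation is kept unchanged
    return "".join(out)


def abstract_features(word, corpus_counts):
    """Describe a word by abstract properties instead of its surface form."""
    return {
        "shape": word_shape(word),
        "frequency": corpus_counts[word],
        "length": f"{len(word):02d}",
        "alphanumeric": word.isalnum(),
    }


# Hypothetical corpus count chosen to match the frequency shown above.
corpus_counts = Counter({"World": 117})
print(abstract_features("World", corpus_counts))
# {'shape': 'Cvccc', 'frequency': 117, 'length': '05', 'alphanumeric': True}
```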
CNN w/ abstract
CNN w/ syntax
Abstract features work well
There is significant
between social groups
Across languages too
What about interaction?
Data is still noisy
- Account for interaction
- Manually label some data
- Can humans predict social status?
- Build a POS tagger that accounts for social status