November 15, 2019
as a predictor
of social stratification
Presented as a long paper
at ACL 2019 in Florence
So freaking good. That’s all I’m gonna say. Don’t believe me? Walk into the place and smell it. [...] Will definitely go back.
Fresh, hand-made pepperoni rolls..... oh yeah. [...] Parking sucks, but I’m not taking off a point for that! Their marinara is dee-lish
Super tasty!!!
Let me start off saying that 2 years ago my husband and I had a spectacular dinner at L’Atelier by Joel Robuchon and finally got the "Time" to visit Joel Robuchon.We got a limo service and a nice tour inside the mansion of Robuchon which was very memorable and the hostess escorted us to the dining area. Decore: In comparison to L’Atelier this place was much more chic and elegant. However, I still loved the idea to see all the chefs preparing and decorating my plates at L’Atelier.
patterns of variation in language use are explainable (statistically) at least
in part with reference to social class
Related work 2
Can socio-economic groups be differentiated on the basis of syntactic features, compared to lexical features?
Can socio-economic status be predicted from a person’s writing?
Given a set of labelled texts, grouped by author, predict the label from text.
use the price range of a restaurant as a proxy
for the social class of its reviewers
All the reviews written by an author for different restaurants.
$ - 1
$$ - 5
$$$ - 1
$$$$ - 0
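The per-author counts above can be built with a simple tally. The sketch below is only illustrative: the function names are hypothetical, and the rule of labelling an author with their most frequent price range is an assumption, not something the slides state explicitly.

```python
from collections import Counter

PRICE_RANGES = ("$", "$$", "$$$", "$$$$")

# Price ranges of the restaurants reviewed by one author
# (same counts as the example above: 1, 5, 1, 0).
reviews = ["$"] * 1 + ["$$"] * 5 + ["$$$"] * 1


def price_distribution(price_ranges):
    """Count how many of an author's reviews fall in each price range."""
    counts = Counter(price_ranges)
    return {p: counts.get(p, 0) for p in PRICE_RANGES}


def majority_label(price_ranges):
    """Hypothetical labelling rule: the most frequent price range wins."""
    dist = price_distribution(price_ranges)
    return max(dist, key=dist.get)


distribution = price_distribution(reviews)
label = majority_label(reviews)
print(distribution, label)
```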
Hypothesis: scores will be sorted (in increasing or decreasing order) from class 1 to 4
|Automated Readability Index||6.48||6.52||6.59||6.91||0.17|
|Coleman Liau Index||7.58||7.76||8.07||8.41||0.32|
|Gunning Fog score||13.46||13.7||14.08||14.23||0.31|
|Linsear Write Formula||6||5.8||5.83||5.72||0.1|
(all results are significant at p < 0.01)
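The hypothesis can be checked mechanically against the table. The sketch below assumes that the first four numeric columns are classes 1 to 4 in order; the meaning of the last column is not stated in the slides, so it is left out. The helper name `trend` is hypothetical.

```python
# Rounded scores from the table above, one list per index, classes 1 to 4.
scores = {
    "Automated Readability Index": [6.48, 6.52, 6.59, 6.91],
    "Coleman Liau Index": [7.58, 7.76, 8.07, 8.41],
    "Gunning Fog score": [13.46, 13.7, 14.08, 14.23],
    "Linsear Write Formula": [6, 5.8, 5.83, 5.72],
}


def trend(values):
    """Report whether the values are strictly sorted across classes."""
    pairs = list(zip(values, values[1:]))
    if all(a < b for a, b in pairs):
        return "increasing"
    if all(a > b for a, b in pairs):
        return "decreasing"
    return "not sorted"


trends = {name: trend(values) for name, values in scores.items()}
for name, result in trends.items():
    print(f"{name}: {result}")
```

On the rounded values shown, the first three indices increase from class 1 to class 4, while the Linsear Write Formula is not strictly sorted (5.8 versus 5.83 for classes 2 and 3).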
Automatically detect the language of the reviews and assign a language code to an author. Assume that each author writes in only one language.
Work on English only
|1||$: 15 - $$$$: 1||$|
|2||$: 15 - $$$$: 14||$|
Solution: discard authors whose entropy is below mean.
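To make the entropy criterion concrete, here is a minimal sketch of the Shannon entropy of an author's price-range distribution, applied to the two example authors above. It assumes the counts shown are the only non-zero ones and uses base-2 logarithms; both are assumptions, as is the function name.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


author_1 = [15, 1]   # $: 15, $$$$: 1
author_2 = [15, 14]  # $: 15, $$$$: 14

h1 = entropy(author_1)
h2 = entropy(author_2)
mean_entropy = (h1 + h2) / 2  # threshold candidate over these two authors only
print(round(h1, 3), round(h2, 3), round(mean_entropy, 3))
```

Author 1 reviews almost only cheap restaurants and gets a low entropy, while author 2 is split nearly evenly between the two extremes and gets an entropy close to 1 bit, even though both carry the same label.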
From ~1 million authors and ~5 million reviews to...
512 authors, 4 balanced classes, more or less clean (i.e. parsable) representative English texts
Features & Modelling
Bag-of-Words (and characters)
Words and c h a r a c t e r s
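As a rough illustration of the two bag-of-X views, the sketch below counts word tokens and character n-grams with the standard library. Whitespace tokenisation, lowercasing, and the n-gram size of 3 are assumptions made for the example, not the settings used in the work.

```python
from collections import Counter


def word_counts(text):
    """Bag of words: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())


def char_ngram_counts(text, n=3):
    """Bag of characters: count every overlapping character n-gram."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


words = word_counts("Super tasty super")
chars = char_ngram_counts("tasty", n=3)
print(words)
print(chars)
```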
NNS CONJ NNS
[(NNS, cc, CONJ), ...
Cvccc - shape
117 - frequency
05 - length
True - alphanumeric?
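The four abstract word features listed above could be derived along these lines. This is a sketch under assumptions: the shape alphabet (C/c for consonants, V/v for vowels, d for digits), the two-digit length padding, and the example token "Walls" are all hypothetical choices that merely reproduce the values shown.

```python
from collections import Counter

VOWELS = set("aeiou")


def shape(token):
    """Map each character to a consonant/vowel symbol, preserving case."""
    out = []
    for ch in token:
        if ch.isalpha():
            symbol = "v" if ch.lower() in VOWELS else "c"
            out.append(symbol.upper() if ch.isupper() else symbol)
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # keep punctuation as-is
    return "".join(out)


def abstract(token, frequencies):
    """Return (shape, corpus frequency, zero-padded length, alphanumeric?)."""
    return (shape(token), frequencies[token], f"{len(token):02d}", token.isalnum())


# Hypothetical corpus frequencies; 117 matches the example value above.
frequencies = Counter({"Walls": 117})
features = abstract("Walls", frequencies)
print(features)
```

Replacing each word with such a tuple removes its lexical identity, so a model trained on these features cannot rely on topic words.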
LR BOW (lexical) baseline
CNN pos tags
CNN dependency tree
There is significant variation between the groups in our dataset
While lexical information is highly predictive, it is restricted to topic. In contrast, syntactic information is almost as predictive and is a much better signal for stylistic variation.
What about interaction?
Data is still noisy
RC CliC-it 2019