You Write Like You Eat
Angelo Basile
November 15, 2019
Albert Gatt
Malvina Nissim
Stylistic variation as a predictor of social stratification
Presented as a long paper at ACL 2019 in Florence
So freaking good. That’s all I’m gonna say. Don’t believe me? Walk into the place and smell it. [...] Will definitely go back.
Fresh, handmade pepperoni rolls... oh yeah. [...] Parking sucks, but I’m not taking off a point for that! Their marinara is dee-lish
Super tasty!!!
Let me start off saying that 2 years ago my husband and I had a spectacular dinner at L’Atelier by Joel Robuchon and finally got the "Time" to visit Joel Robuchon.We got a limo service and a nice tour inside the mansion of Robuchon which was very memorable and the hostess escorted us to the dining area. Decore: In comparison to L’Atelier this place was much more chic and elegant. However, I still loved the idea to see all the chefs preparing and decorating my plates at L’Atelier.
The Problem
Language variation
Patterns of variation in language use are explainable (statistically) at least in part with reference to social class
Language variation
(Labov, 1962)
age
gender
location
psychology
register
Related work
Related work 2
social status
Background
fourth floor
Research Questions
RQ1: Can socio-economic status be predicted from a person’s writing?
RQ2: Can socio-economic groups be differentiated on the basis of syntactic features, compared to lexical features?
Framing
Given a set of labelled texts, grouped by author, predict the label from text.
Text Classification
Data
[Diagram labels: TEXT, $$, AUTHOR]
Distant Supervision
hypothesis:
use the price range of a restaurant as a proxy
for the social class of its reviewers
All the reviews written by an author for different restaurants.
Reviewed restaurants: $, $$, $$$, $$, $$, $$, $$
Label counts: $ - 1, $$ - 5, $$$ - 1, $$$$ - 0
Author label: $$
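The labelling example can be sketched in a few lines of Python. This is a minimal sketch, assuming the author label is simply the most frequent price range among the restaurants they reviewed, as the counts in the example suggest. The function name `label_author` and the tie-breaking rule are assumptions made for illustration, not taken from the slides.

```python
from collections import Counter

PRICE_CLASSES = ["$", "$$", "$$$", "$$$$"]

def label_author(review_prices):
    """Return (count per price class, most frequent price class) for one author."""
    counts = Counter(review_prices)
    per_class = {p: counts.get(p, 0) for p in PRICE_CLASSES}
    # max() returns the first maximum, so ties go to the cheaper class.
    # That tie-breaking rule is an arbitrary assumption of this sketch.
    majority = max(PRICE_CLASSES, key=lambda p: per_class[p])
    return per_class, majority

# Price ranges of the restaurants reviewed by the example author.
reviews = ["$", "$$", "$$$", "$$", "$$", "$$", "$$"]
per_class, majority = label_author(reviews)
print(per_class)
print(majority)
```

For the example author this gives the counts 1, 5, 1, 0 and the label `$$`.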
[Diagram labels: X, Y, LEARN, LABELLING]
Readability scores
Readability Metrics
Hypothesis: scores will be sorted (in increasing or decreasing order) from class 1 to 4
Metric | $ | $$ | $$$ | $$$$ | std |
---|---|---|---|---|---|
Automated Readability Index | 6.48 | 6.52 | 6.59 | 6.91 | 0.17 |
Coleman Liau Index | 7.58 | 7.76 | 8.07 | 8.41 | 0.32 |
Dale-Chall Score | 6.65 | 6.76 | 6.94 | 7 | 0.14 |
Flesch-Kincaid Ease | 5.42 | 5.55 | 5.59 | 5.82 | 0.14 |
Gunning Fog score | 13.46 | 13.7 | 14.08 | 14.23 | 0.31 |
Linsear Write Formula | 6 | 5.8 | 5.83 | 5.72 | 0.1 |
Lix index | 30.7 | 31.39 | 31.69 | 32.71 | 0.72 |
Flesch-Reading | 81.06 | 79.93 | 79.1 | 77.39 | 1.34 |
(all results are significant at p < 0.01)
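The ordering hypothesis can be checked directly against the reported class means. The sketch below copies the values from the table and tests each metric for a strictly increasing or strictly decreasing sequence from $ to $$$$. The helper name `is_monotonic` and the choice of a strict comparison are assumptions of this sketch.

```python
# Per-class means copied from the table, ordered $, $$, $$$, $$$$.
SCORES = {
    "Automated Readability Index": [6.48, 6.52, 6.59, 6.91],
    "Coleman Liau Index": [7.58, 7.76, 8.07, 8.41],
    "Dale-Chall Score": [6.65, 6.76, 6.94, 7],
    "Flesch-Kincaid Ease": [5.42, 5.55, 5.59, 5.82],
    "Gunning Fog score": [13.46, 13.7, 14.08, 14.23],
    "Linsear Write Formula": [6, 5.8, 5.83, 5.72],
    "Lix index": [30.7, 31.39, 31.69, 32.71],
    "Flesch-Reading": [81.06, 79.93, 79.1, 77.39],
}

def is_monotonic(values):
    """True if values are strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing

not_sorted = [m for m, v in SCORES.items() if not is_monotonic(v)]
for metric, values in SCORES.items():
    print(f"{metric}: {'sorted' if is_monotonic(values) else 'not sorted'}")
```

Under this strict check, every metric except the Linsear Write Formula is sorted across the four classes.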
LANGUAGES
Automatically detect the language of the reviews and assign a language code to an author. Assume that each author writes in only one language.
Work on English only
Filtering
An example:
id | labels | Y | entropy |
---|---|---|---|
1 | $: 15 - $$$$: 1 | $ | 0.23 |
2 | $: 15 - $$$$: 14 | $ | 0.69 |
Solution: discard authors whose entropy is above the mean, i.e. keep only authors whose labels are consistent.
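The entropy values in the example can be reproduced with a short sketch. It assumes Shannon entropy with the natural logarithm, which matches the reported 0.23 and 0.69, and it assumes the aim of the filter is to keep authors with consistent labels, i.e. those whose entropy does not exceed the mean. The names `label_entropy` and `kept` are illustrative only.

```python
import math

def label_entropy(counts):
    """Shannon entropy (natural log) of an author's price-label counts."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)

# The two authors from the example table.
authors = {
    1: {"$": 15, "$$$$": 1},
    2: {"$": 15, "$$$$": 14},
}

entropies = {a: label_entropy(c) for a, c in authors.items()}
mean_entropy = sum(entropies.values()) / len(entropies)

# Assumption: authors with entropy above the mean are treated as noisy.
kept = [a for a, h in entropies.items() if h <= mean_entropy]

for a, h in entropies.items():
    print(a, round(h, 2))
print("kept:", kept)
```

On these two authors the filter keeps author 1 and drops author 2, whose reviews are split almost evenly between $ and $$$$.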
DATA SET
From ~1 million authors and ~5 million reviews to...
512 authors, 4 balanced classes, more or less clean (i.e. parsable), representative English texts
Features & Modelling
FEATURES
Bag-of-Words (and characters): Words and c h a r a c t e r s
POS Tags: NNS CONJ NNS
Dependency Trees: [(NNS, cc, CONJ), ...
Abstract features: Cvccc_05_True_117...
MODELS
Logistic Regression
Convolutional Network
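The POS-tag and dependency representations can be illustrated on a toy parse. This is a sketch under assumptions: the phrase "cats and dogs" and its hand-written annotation are hypothetical, and reading a triple as (head POS, relation, dependent POS) is an interpretation of the example `(NNS, cc, CONJ)`, not something the slides state. In practice the annotation would come from a tagger and parser.

```python
# Hand-annotated toy parse of "cats and dogs" (hypothetical example).
# "head" is the index of the governing token, or None for the root.
sentence = [
    {"form": "cats", "pos": "NNS", "head": None, "deprel": "root"},
    {"form": "and", "pos": "CONJ", "head": 0, "deprel": "cc"},
    {"form": "dogs", "pos": "NNS", "head": 0, "deprel": "conj"},
]

def pos_sequence(tokens):
    """Replace every word by its POS tag."""
    return " ".join(t["pos"] for t in tokens)

def dependency_triples(tokens):
    """Build (head POS, relation, dependent POS) triples, skipping the root."""
    return [
        (tokens[t["head"]]["pos"], t["deprel"], t["pos"])
        for t in tokens
        if t["head"] is not None
    ]

print(pos_sequence(sentence))
print(dependency_triples(sentence))
```

Both representations drop the words themselves, so a model trained on them sees structure but not topic vocabulary.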
VARIATION?
Words
Cvccc_05_True_117
Cvccc - shape
05 - length
True - alphanumeric?
117 - frequency
Bleaching
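A bleached token such as `Cvccc_05_True_117` can be built roughly as follows. This is a minimal sketch, not the exact implementation behind the slides. It assumes the shape maps consonants to `c` and vowels to `v` while preserving case, that length is zero-padded to two digits, and that frequency is the token's count in some reference corpus. The word "Hands" and its count of 117 are made up to reproduce the example string.

```python
from collections import Counter

VOWELS = set("aeiou")

def shape(token):
    """Map letters to c/v (case preserved), digits to d, anything else to o."""
    out = []
    for ch in token:
        if ch.isalpha():
            sym = "v" if ch.lower() in VOWELS else "c"
            out.append(sym.upper() if ch.isupper() else sym)
        elif ch.isdigit():
            out.append("d")
        else:
            out.append("o")
    return "".join(out)

def bleach(token, freq):
    """Join shape, zero-padded length, alphanumeric flag and corpus frequency."""
    return f"{shape(token)}_{len(token):02d}_{token.isalnum()}_{freq[token]}"

# Hypothetical corpus counts, chosen only to reproduce the example string.
freq = Counter({"Hands": 117})
print(bleach("Hands", freq))
```

The bleached form hides the word itself while keeping coarse cues about its form and how common it is.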
Results
model | F1 |
---|---|
random baseline | 0.25 |
LR BOW (lexical) baseline | 0.53 |
CNN lexical | 0.54 |
CNN pos tags | 0.33 |
CNN dependency tree | 0.52 |
CNN bleaching | 0.46 |
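The slides report a single F1 per model without saying how it is averaged over the four classes. The sketch below assumes macro-averaging, where F1 is computed per class and then averaged. The function name `macro_f1` and the toy gold and predicted labels are illustrative, not taken from the experiments.

```python
def macro_f1(gold, pred, classes):
    """Average of per-class F1 scores; a class with no hits scores 0."""
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

classes = ["$", "$$", "$$$", "$$$$"]
# Toy balanced gold labels, two authors per class.
gold = ["$", "$", "$$", "$$", "$$$", "$$$", "$$$$", "$$$$"]
always_cheapest = ["$"] * len(gold)

print(macro_f1(gold, gold, classes))
print(macro_f1(gold, always_cheapest, classes))
```

Macro-averaging penalises a model that collapses onto a single class, which matters when the classes are balanced as they are here.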
Conclusions
Positive results
There is significant syntactic variation between the groups in our dataset
RQ1
RQ2
While lexical information is highly predictive, it is restricted to topic. In contrast, syntactic information is almost as predictive and is a much better signal for stylistic variation
What about interaction?
Shortcomings
Data is still noisy
$ | $$ | $$$ | $$$$ |
---|---|---|---|
fast | tried | at | excellent |
kids | happy | clubs | gras |
coffee | staff | wynn | we |
customer | won | music | las |
clean | put | pretty | steak |
they | phoenix | night | tasting |
order | find | club | foie |
came | try | vegas | wine |
always | place | buffet | course |
pizza | salsa | hotel | vega |
RC CliC-it 2019