You Write

You Eat

Angelo Basile

November 15, 2019

Albert Gatt

Malvina Nissim

Stylistic variation as a predictor of social stratification

Presented as a long paper at ACL 2019 in Florence

So freaking good. That’s all I’m gonna say. Don’t believe me? Walk into the place and smell it. [...] Will definitely go back.

Fresh, hand-made pepperoni rolls... oh yeah. [...] Parking sucks, but I’m not taking off a point for that! Their marinara is dee-lish.

Super tasty!!!

Let me start off saying that 2 years ago my husband and I had a spectacular dinner at L’Atelier by Joel Robuchon and finally got the "Time" to visit Joel Robuchon.We got a limo service and a nice tour inside the mansion of Robuchon which was very memorable and the hostess escorted us to the dining area. Decore: In comparison to L’Atelier this place was much more chic and elegant. However, I still loved the idea to see all the chefs preparing and decorating my plates at L’Atelier.

The Problem

Language variation

patterns of variation in language use are explainable (statistically) at least in part with reference to social class

Language variation

(Labov, 1962)

age

gender

location

psychology

Johannsen et al., 2015

Ficler et al., 2017

Related work

van der Goot et al., 2018

Wieling et al., 2011

Verhoeven et al., 2016

Related work 2

social status

Labov, 1966

van Dalen et al., 2017

Flekova et al., 2016

Lampos et al., 2016

Background

fourth floor

Research Questions

RQ1: Can socio-economic status be predicted from a person’s writing?

RQ2: Can socio-economic groups be differentiated on the basis of syntactic features, compared to lexical features?

Framing

Given a set of labelled texts, grouped by author, predict the label from text.

Text Classification

Data

TEXT

AUTHOR

Distant Supervision

Hypothesis: use the price range of a restaurant as a proxy for the social class of its reviewers

All the reviews written by an author for different restaurants.

$$$

$ - 1

$$ - 5

$$$ - 1

$$$$ - 0
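
The per-author counts above can be turned into a single label. The following is a minimal sketch, assuming each author is labelled with the price range they review most often (this is consistent with the filtering example in this deck, where an author with 15 "$" and 14 "$$$$" reviews gets Y = $). The function name `label_author` and the tie-breaking rule are illustrative assumptions, not taken from the original work.

```python
from collections import Counter

PRICE_RANGES = ["$", "$$", "$$$", "$$$$"]

def label_author(review_price_ranges):
    """Return the price range an author reviewed most often.

    Assumption: ties are broken in favour of the cheaper range,
    which is an arbitrary choice made for this sketch.
    """
    counts = Counter(review_price_ranges)
    return max(PRICE_RANGES, key=lambda p: (counts[p], -PRICE_RANGES.index(p)))

# The counts shown on the slide: $ - 1, $$ - 5, $$$ - 1, $$$$ - 0
reviews = ["$"] * 1 + ["$$"] * 5 + ["$$$"] * 1
print(label_author(reviews))  # $$
```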

X

Y

LEARN

LABELLING

Readability scores

Readability Metrics

Hypothesis: scores will be sorted (in increasing or decreasing order) from class 1 to 4
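
As one concrete instance of such a metric, the Automated Readability Index combines characters per word and words per sentence. The sketch below uses the standard ARI formula with a deliberately naive tokenizer (regex-based sentence and word splitting); the tokenization is an assumption of this example and is probably not what produced the scores reported in this deck.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive splitting: sentences end at ., ! or ?; words are alphanumeric runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Count only letters and digits as characters.
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

ari = automated_readability_index("The cat sat. The dog ran.")
print(round(ari, 2))
```

Higher ARI values correspond to text that is harder to read, so under the hypothesis the per-class averages should rise or fall steadily from $ to $$$$.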

Readability Metrics

Metric	$	$$	$$$	$$$$	std
Automated Readability Index	6.48	6.52	6.59	6.91	0.17
Coleman Liau Index	7.58	7.76	8.07	8.41	0.32
Dale-Chall Score	6.65	6.76	6.94	7	0.14
Flesch-Kincaid Ease	5.42	5.55	5.59	5.82	0.14
Gunning Fog score	13.46	13.7	14.08	14.23	0.31
Linsear Write Formula	6	5.8	5.83	5.72	0.1
Lix index	30.7	31.39	31.69	32.71	0.72
Flesch-Reading	81.06	79.93	79.1	77.39	1.34

(all results are significant at p < 0.01)
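
The hypothesis that scores are sorted from class 1 to 4 can be checked directly against the table. The snippet below is a small sketch that copies the reported per-class averages and tests each row for strict monotonicity; the helper name `is_monotonic` is hypothetical.

```python
# Per-class averages copied from the readability table ($, $$, $$$, $$$$).
scores = {
    "Automated Readability Index": [6.48, 6.52, 6.59, 6.91],
    "Coleman Liau Index": [7.58, 7.76, 8.07, 8.41],
    "Dale-Chall Score": [6.65, 6.76, 6.94, 7.0],
    "Flesch-Kincaid Ease": [5.42, 5.55, 5.59, 5.82],
    "Gunning Fog score": [13.46, 13.7, 14.08, 14.23],
    "Linsear Write Formula": [6.0, 5.8, 5.83, 5.72],
    "Lix index": [30.7, 31.39, 31.69, 32.71],
    "Flesch-Reading": [81.06, 79.93, 79.1, 77.39],
}

def is_monotonic(values):
    """True if values are strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)

non_monotonic = [name for name, v in scores.items() if not is_monotonic(v)]
print(non_monotonic)  # ['Linsear Write Formula']
```

Seven of the eight metrics are ordered across the four classes; only the Linsear Write Formula breaks the pattern, between $$ and $$$.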

LANGUAGES

Automatically detect the language of the reviews and assign a language code to an author. Assume that each author writes in only one language.

Work on English only

Filtering

An example:

id	labels	Y	entropy
1	$: 15 - $$$$: 1	$	0.23
2	$: 15 - $$$$: 14	$	0.69

Solution: discard authors whose entropy is above the mean.
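
The entropy values in the example can be reproduced with Shannon entropy using the natural logarithm. The sketch below assumes that the goal of the filter is to keep authors whose label distribution is concentrated (low entropy), since author 2 in the example is the noisy one. The mean is computed here over the two example authors only; in practice it would be taken over the whole data set.

```python
import math

def label_entropy(counts):
    """Shannon entropy (natural log) of an author's price-range counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

# The two authors from the filtering example.
authors = {
    1: {"$": 15, "$$$$": 1},
    2: {"$": 15, "$$$$": 14},
}

entropies = {a: label_entropy(c) for a, c in authors.items()}
mean_entropy = sum(entropies.values()) / len(entropies)

# Assumption: authors at or below the mean entropy are kept.
kept = [a for a, h in entropies.items() if h <= mean_entropy]

print({a: round(h, 2) for a, h in entropies.items()})  # {1: 0.23, 2: 0.69}
print(kept)  # [1]
```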

DATA SET

From ~1 million authors and ~5 million reviews to...

512 authors, 4 balanced classes, more or less clean (i.e. parsable), representative English texts

Features & Modelling

FEATURES

Bag-of-Words (and characters): Words and c h a r a c t e r s

POS Tags: NNS CONJ NNS

Dependency Trees: [(NNS, cc, CONJ), ...

Abstract features: Cvccc_05_True_117...

MODELS

Logistic Regression

Convolutional Network

VARIATION?

Bleaching

Words → Cvccc_05_True_117

Cvccc - shape

05 - length

True - alphanumeric?

117 - frequency
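
The bleached representation of a word can be sketched as a small function. This is a minimal illustration that reproduces the example token Cvccc_05_True_117 from the word "Words"; the function names, the treatment of digits and punctuation in the shape, and passing the frequency in as an argument are assumptions of this sketch rather than the authors' implementation.

```python
VOWELS = set("aeiou")

def word_shape(word):
    """Map consonants to c/C and vowels to v/V, preserving case."""
    out = []
    for ch in word:
        if ch.isalpha():
            symbol = "v" if ch.lower() in VOWELS else "c"
            out.append(symbol.upper() if ch.isupper() else symbol)
        elif ch.isdigit():
            out.append("0")  # assumption: all digits collapse to "0"
        else:
            out.append(ch)   # assumption: punctuation is kept as-is
    return "".join(out)

def bleach(word, frequency):
    """Build shape_length_isalnum_frequency for a word.

    `frequency` is the word's corpus frequency, supplied by the caller.
    """
    return f"{word_shape(word)}_{len(word):02d}_{word.isalnum()}_{frequency}"

print(bleach("Words", 117))  # Cvccc_05_True_117
```

Because the output no longer contains the word itself, a model trained on such tokens cannot rely on topical vocabulary and has to pick up on more abstract regularities.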

Results

model	score
random baseline	0.25
LR BOW (lexical) baseline	0.53
CNN lexical	0.54
CNN pos tags	0.33
CNN dependency tree	0.52
CNN bleaching	0.46

Conclusions

Positive results

RQ1: There is significant syntactic variation between the groups in our dataset.

RQ2: While lexical information is highly predictive, it is restricted to topic. In contrast, syntactic information is almost as predictive and is a much better signal for stylistic variation.

What about interaction?

Shortcomings

Data is still noisy

(source: B. Plank)

fast, kids, coffee, customer, clean, they, order, came, always, pizza

tried, happy, staff, won, put, phoenix, find, try, place, salsa

$$$: clubs, wynn, music, pretty, night, club, vegas, buffet, hotel

$$$$: excellent, gras, las, steak, tasting, foie, wine, course, vega

RC CliC-it 2019
