Stylometry

quantitative authorship attribution

and beyond

Daniil Skorinkin, DH Network Potsdam

Agenda for today

  1. What is stylometry & why you might need it
  2. Does it actually work? Live demo 🎥 
  3. How it actually works
  4. Real world applications in authorship attribution
  5. Applications beyond authorship
    1. Stylochronology
    2. Translation studies
    3. Authors' collaboration

1. What is Stylometry

Stylometry in a nutshell

is using frequencies of countable units of texts to detect the 'authorial signal'
 

(that's a narrow definition, but will do for now)

 

Joanne Rowling

The main assumption underlying stylometric studies is that authors have an unconscious as well as conscious aspect to their style

Encyclopaedia of Statistical Sciences

Basic stylometric procedure

(aka what actually happens)

Basic stylometric visualisations

(how to show these distances)

Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)

Hierarchical clustering dendrograms (phylogenetic-tree style)

Weighted graphs (weighted networks)

Stylometry goes

beyond authorship

  1. Intra-authorial stylometry 
    1. Style (stylochronology)
    2. Genres within one author (e.g. Shakespeare)
    3. Heteronymy
    4. ...
  2. Collaboration of authors
    1. Co-authorship
    2. Translation
    3. Influence of the editor

But attribution is still the main purpose

Because authorship disputes are hot topics

Presumably, each national literature has its own famous unsolved attribution case, such as the Shakespearean canon, a collection of Polish erotic poems of the 16th century ascribed to Mikołaj Sęp Szarzyński, the Russian epic poem The Tale of Igor’s Campaign, and many others.

 

Eder M. (2011) Style-markers in authorship attribution: A cross-language study of the authorial fingerprint.

2. Does it actually work?

Yes, here is the live demo 🎥

Disclaimer

  • ⚠️ Stylometric attribution is neither magic nor a silver bullet
  • ⚠️ There are many cases in which stylometry won't help detect the author
  • 🟢 But at the same time, modern state-of-the-art stylometry is not an ad hoc method: it works universally, given certain conditions

BTW this is the main potential issue with the Navalny research by KyivPost

But with enough homogeneous data it works

It works on Russian

It works on Armenian

It works on Chinese

3. How does it work?

Stylometric studies, in all their variety of material and method, have two features in common: the <...> texts they study have to be coaxed to yield numbers, and the numbers themselves have to be processed via statistics.

M. Eder, M. Kestemont, J. Rybicki. ‘Stylo’: a package for stylometric analyses

Frequencies of

  • Words
    • especially function words
  • Lemmas
  • Character n-grams
  • Word n-grams

Less typical, but sometimes still work:

  • POS tags
  • Syntactic structures
  • Metric structures

So behind this picture:

There's the frequency table:

Each text is a column

Each text is a vector
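To make the frequency table concrete, here is a minimal sketch in Python (used for illustration only; stylo itself is an R package). The two toy texts, the naive tokenizer and the helper name `frequency_table` are assumptions made for this example and do not come from the original slides.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_table(texts, n_words):
    """Return (features, table).

    features: the n most frequent words of the whole corpus.
    table:    for each text, a vector of relative frequencies of those words.
    """
    counts = {name: Counter(tokenize(t)) for name, t in texts.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    features = [w for w, _ in total.most_common(n_words)]
    table = {}
    for name, c in counts.items():
        size = sum(c.values())  # text length in tokens
        table[name] = [c[w] / size for w in features]
    return features, table

# Hypothetical toy corpus: each text becomes one vector (one column of the table).
texts = {
    "text_A": "the cat and the dog and the bird and the fish",
    "text_B": "the dog and the cat sat",
}
features, table = frequency_table(texts, n_words=2)
for name, vector in table.items():
    print(name, dict(zip(features, vector)))
```

Relative (not absolute) frequencies are used so that texts of different length remain comparable.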

Let us simplify to just two dimensions

the 'and' axis

the 'the' axis

Now we can

measure distances between texts!

Stylometry does the same, but with many more words, not just 2. The same happens in 100-, 300- or 1000-dimensional space
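A small sketch of the idea, in Python for illustration only: each text is a point on the 'the' and 'and' axes, and the same distance functions work unchanged on longer vectors. All frequency values below are invented for the example.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical frequencies per 100 words: (frequency of 'the', frequency of 'and').
texts = {
    "novel_1": (6.0, 3.0),
    "novel_2": (6.5, 2.5),
    "novel_3": (3.0, 7.0),
}
d12 = euclidean(texts["novel_1"], texts["novel_2"])
d13 = euclidean(texts["novel_1"], texts["novel_3"])
print(f"novel_1 vs novel_2: {d12:.3f}")
print(f"novel_1 vs novel_3: {d13:.3f}")

# The same functions on longer vectors (four words here; hundreds in practice).
long_u = [6.0, 3.0, 2.1, 1.8]
long_v = [6.5, 2.5, 2.0, 1.9]
d_long = manhattan(long_u, long_v)
print(f"4-dimensional Manhattan distance: {d_long:.3f}")
```

Nothing in the functions depends on the number of dimensions, which is why the two-word picture generalises directly.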

Joanne Rowling

But there are ways to compress such spaces and visualize them in 2D

But wait! These features are meaningless!!

How can they contain the 'authorial signal'?

J. F. Burrows:

Most readers and critics behave as though common prepositions, conjunctions, personal pronouns, and articles — the parts of speech which make up at least a third of fictional works in English — do not really exist. But far from being a largely inert linguistic mass which has a simple but uninteresting function, these words and their frequency of use can tell us a great deal <...>

Preface to Computation into Criticism, 1987

Burrows's Delta

  • State of the art in authorship attribution since 2002
  • Measures distances between vectors of the N most frequent words / character n-grams
  • (though more complex features are also possible)
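A minimal sketch of the Delta idea as it is usually described: standardise each word's frequency across the corpus as a z-score, then take the mean absolute difference of z-scores between two texts. Python is used for illustration only; the text names and frequency values are invented, the sample standard deviation is an assumption of this sketch, and it is not meant to reproduce stylo's implementation.

```python
from statistics import mean, stdev

def z_scores(table):
    """Standardise every feature (column) as z = (x - mean) / sd.

    table: {text_name: [frequency of feature 0, feature 1, ...]}
    Assumes each feature's frequency varies across texts (sd > 0).
    """
    names = list(table)
    n_features = len(table[names[0]])
    columns = [[table[n][j] for n in names] for j in range(n_features)]
    mus = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample sd: an assumption of this sketch
    return {
        n: [(table[n][j] - mus[j]) / sds[j] for j in range(n_features)]
        for n in names
    }

def delta(z_a, z_b):
    """Mean absolute difference between two z-score vectors."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))

# Hypothetical frequencies (per 100 words) of three very frequent words.
table = {
    "author_X_1": [6.0, 3.0, 2.0],
    "author_X_2": [6.2, 3.1, 1.9],
    "author_Y_1": [4.0, 4.5, 3.0],
    "disputed":   [6.1, 2.9, 2.1],
}
z = z_scores(table)
candidates = [name for name in table if name != "disputed"]
distances = {name: delta(z["disputed"], z[name]) for name in candidates}
nearest = min(distances, key=distances.get)
print(distances)
print("nearest text:", nearest)
```

The z-scoring step is what lets rarer words contribute as much as the very frequent ones, instead of the distance being dominated by 'the'.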

4. Some history

Quantifying style

  • 1851 — A. De Morgan suggests mean word-length as an authorship feature

  • 1873 — New Shakespeare Society (Furnivall, Fleay et al.)

  • 1887 — T. Mendenhall, The Characteristic Curves of Composition, the first known work on quantitative authorship attribution

Mostly classical studies at first:

  • 1880 — W. Dittenberger, Sprachliche Kriterien für die Chronologie der Platonischen Dialoge

  • 1890 — W. Lutosławski, Principes de stylométrie

  • 1897 — W. Lutosławski, The origin and growth of Plato's logic; with an account of Plato's style and of the chronology of his writings

Federalist papers

  • 12 disputed papers (Hamilton or Madison)
  • Mosteller F., Wallace D., (1963) Inference in an Authorship Problem.  
  • '<...> to solve the authorship question of The Federalist papers; and to propose routine methods for solving other authorship problems'.

 

Mosteller, Wallace, 1963

Mosteller, Wallace, 1963

  • The function words of the language appear to be a fertile source of discriminators, and luckily the high-frequency words are the strongest.
  • <...>it is important to have a variety of sources of material, to allow “between writings” variability to emerge

Mosteller, Wallace, 1963

In summary, the following points are clear:

  • Madison is the principal author. These data make it possible to say far more than ever before that the odds are enormously high that Madison wrote the 12 disputed papers. <...>
  • <...> While choice of underlying constants (choice of prior distributions) matters, it doesn’t matter very much, once one is in the neighborhood of a distribution suggested by a fair body of data.

5. Real world applications

Who wrote 'To Kill a Mockingbird'? 

Harper Lee

Oh, Real-Lee?

The new 'old' book (2015)

Causes for suspicions (external)

  • After publishing 'To Kill a Mockingbird', Lee did not publish another book for 55 years
  • The manuscript was 'accidentally found' by Lee's lawyer
  • In 2015 Lee was 88, blind and severely disabled
  • Alabama authorities actually investigated Lee's legal capacity
  • Contradictory statements regarding the manuscript:
    • a draft of 'To Kill a Mockingbird'
    • or a separate work in the same fictional realm

Causes for suspicions (internal)

  • Many deemed the 'new' work 'poor' compared to the classic 'To Kill a Mockingbird'
  • Plot-wise the new text is a sequel (the main heroine is an adult), though it was claimed to have been written earlier
  • A lot of disappointment in Atticus Finch, who turns out to be something of a racist in the 'new' book

Were the books written by one person?

Harper Lee and Truman Capote

Why Capote?

  • Harper Lee's childhood friend; they grew up together in the town that became the model for the town in 'To Kill a Mockingbird'
  • The model for one of the characters in 'To Kill a Mockingbird'
  • While 'To Kill a Mockingbird' was being written, Capote published nothing major
  • Afterwards Capote wrote his true-crime bestseller "In Cold Blood", which Lee helped him work on
  • Hypothesis: Capote had a hand in 'To Kill a Mockingbird', and Lee then thanked him by helping with "In Cold Blood"

This is what the stylometrists went on to test

Harper Lee homogeneity (dendrogram)

Harper Lee homogeneity (network)

By the way we can reproduce it in stylo:

> library(stylo)

> data(lee)

> stylo(frequencies = lee)

Case 2: Elena Ferrante

Who's Ferrante

  1. Elena Ferrante's books have been published since 1992
  2. In the 2000s, Ferrante became very popular — first in the USA, then in Italy
  3. In 2005, journalist Luigi Galella compared Ferrante's book to Domenico Starnone's novel and found textual similarities.
  4. In 2006 the same journalist published a quantitative study of the books of Ferrante, Starnone and other Italian authors by the physicist Vittorio Loreto; Domenico Starnone was again the closest
  5. In 2016, journalist Claudio Gatti investigated the financial flows of the publishing house E/O and pointed to the translator Anita Raja

Visualisation of this gigantic experiment

What sort of a visualisation is it?

Eder, M. Digital Scholarship in the Humanities 32, 50–64 (2017).

Figure 4 shows a network visualization of this set, and results are quite clear again.

Here, too, Starnone seems to be married to Ferrante rather than to Raja; 

Other outcomes

Partners in Life, Partners in Crime? (J. Rybicki):
A series of stylometric tests for authorship, based on Burrows’s Delta procedure, which compares usage of most frequent words, was run on a corpus of novels by contemporary Italian writers, supplemented with translations by Anita Raja, recently the main suspect for being Elena Ferrante. Rather than to Raja, the tests point overwhelmingly to her husband, the writer Domenico Starnone.

Blended Authorship Attribution: Unmasking Elena Ferrante Combining Different Author Profiling Methods (G. Mikros):
all profiling results were highly accurate (over 90%) indicating that the person behind Ferrante is a male, aged over 60, from the region Campania and the town Saviano. The combination of these characteristics indicate a single candidate (among the authors of our corpus), Domenico Starnone.

All traces lead to Starnone

and only some to Anita Raja... who is Starnone's wife

 Stylometry beyond authorship attribution

But the study of literature and authorship is not only who wrote what, and who didn’t

Maciej Eder, Jan Rybicki (2016). Go Set A Watchman while we Kill the Mockingbird in Cold Blood, with Cats and Other People

 

Beyond authorship

  1. Intra-authorial stylometry 
    1. Style (stylochronology)
    2. Genres within one author (e.g. Shakespeare)
    3. Heteronymy
    4. ...
  2. Collaboration of authors
    1. Co-authorship
    2. Translation
    3. Influence of the editor

Intra-authorial stylometry

Dickens: dating

Maciej Eder, Jan Rybicki

Inside Shakespeare

Agatha Christie: dates and...

...a pseudonym

Mary Westmacott, Robert Galbraith

Pessoa's heteronyms

Skorinkin D., Orekhov B. Hacking stylometry with multiple voices: imaginary writers can override authorial signal in Delta.
In: Digital Scholarship in the Humanities, 2023

Distance table

Tolstoy: dates + 'cycles' of work

Chronological signal is often visible

Stylochronology at scale:

1000 novels from 300 years

This is where we can get back to Ferrante

  • As noted earlier, Maciej Eder's study was not about authorship as such, but about how a writer develops a distinct style for a "virtual author"
  • Rather than simply unmasking the name, the paper will test whether – and if yes, then to which extent – the unmasked author’s own novels differ stylistically from the works published as “Ferrante”.  

Dynamic Delta: rolling.classify()

  • Delta distance with a sliding window
  • Good for collaboration studies
  • More here: rolling stylometry
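The sliding-window idea can be sketched as follows (Python, for illustration only). This is a deliberate simplification and not the stylo API: it uses a plain Manhattan distance on relative frequencies instead of Delta, and the reference 'authors', the toy text and the helper names are all invented for the example.

```python
from collections import Counter

def windows(tokens, size, step):
    """Yield (start_index, window_tokens) for overlapping slices of the token list."""
    for start in range(0, max(len(tokens) - size, 0) + 1, step):
        yield start, tokens[start:start + size]

def profile(tokens, features):
    """Relative frequencies of the chosen feature words in a token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in features]

def nearest(vec, references):
    """Name of the reference profile closest to vec (Manhattan distance)."""
    return min(
        references,
        key=lambda name: sum(abs(a - b) for a, b in zip(vec, references[name])),
    )

features = ["the", "and"]
sample_a = "the cat the dog the bird and fish"  # hypothetical author A
sample_b = "and so and then and now the end"    # hypothetical author B
references = {
    "author_A": profile(sample_a.split(), features),
    "author_B": profile(sample_b.split(), features),
}

# A 'collaborative' text: the first half in A's manner, the second half in B's.
mixed = (2 * (sample_a + " ") + 2 * (sample_b + " ")).split()

labels = [
    (start, nearest(profile(window, features), references))
    for start, window in windows(mixed, size=8, step=4)
]
for start, label in labels:
    print(f"tokens {start:2d}-{start + 7:2d}: {label}")
```

In real use the window would be thousands of words long; the point is only that each window is classified separately, so a change of hand shows up as a change of label along the text.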

(me testing the method)

What Eder saw:

Arguably, a clear pattern appears: while the early novels show little similarity with the assumed virtual “Ferrante”, the late works are assigned to this class with more and more confidence of the classifier. Almost all of the segments of L’amore molesto from 1992 (Fig. 4a) are classified as “Starnone”, with an exception of a relatively short passage at the end of the novel. The voice of the virtual “Ferrante” is more noticeable in I Giorni dell’abbandono from 2002 (Fig. 4b), this time at the beginning of the novel. In La figlia oscura (2006) the share of segments by “Ferrante” is roughly equal to those of “Starnone”. In the novel L’amica geniale. Infanzia, adolescenza (2011) the style of “Ferrante” becomes predominant, which is even more visible in Storia del nuovo cognome published 2012 (Fig. 4c). This novel is a triumph of the virtual author

 

Conclusion:

  • Apparently, Domenico Starnone demonstrates <...> the ability to differentiate his own stylistic profile and the voice of his alter ego.
  • Ferrante has been gradually emerging, to become predominant in the late novels.

Translation &

other forms of collaboration

From French...

...to English

works just a bit worse with Polish

Google Translate and DeepL are just as good:

Though they used to be really bad

Collaborative translation

Night and Day

by Virginia Woolf

Anna Kołyszko -> Magda Heydel

Maciej Eder, Jan Rybicki

rolling stylometry

& the Shakespeare question

Shakespeare

...and Marlowe

Henry VI: sequential analysis

Editor's influence

Choiński, M., Rybicki, J. (2016). Jonathan Edwards and Thomas Foxcroft: In Pursuit of Stylometric Traces of the Editor. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 147-149.

Young Edwards: no editor

Consecutive segments of Edwards's Mind (1723); throughout the work, Edwards's signal (red) dominates over the (absent) signal of Foxcroft.

Old Edwards: editor becomes somewhat visible

Consecutive segments of Edwards's Humble Inquiry (1749); in many fragments otherwise dominated by Edwards (red), Foxcroft's impact is still visible. The lower band shows the strongest signal; the upper, the second strongest.

Bonus topics

Adversarial stylometry

  • deceiving authorship detection
  • countermeasures to deception
  • de-anonymization
  • demographics detection
  • native language identification

  • ...potentially allows you to harrypotterize your fanfic =)

What about the actual generated text?

Stylometry still beats GPT 

But does not beat a neural network specifically trained on author X

Code stylometry

Some references

6. Bonus 1: more Stylo functions

classify()

  • text classification with stylometric features
  • main tool for actual authorship attribution
  • employs standard machine-learning algorithms
  • requires two sets of documents
    • training (primary_set)
    • test (secondary_set) 

rolling.classify()

  • dynamic changes in the text
  • text window of adjustable size

oppose()

  • contrastive analysis 
  • words significantly preferred/avoided
  • comparison studies (e.g. male vs female styles) 
  • when launching with non-Latin script data:
    oppose(corpus.lang="Other")
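What "significantly preferred/avoided" can mean is sketched below with a Zeta-style score in the spirit of Craig's Zeta: for each word, the share of segments from one set that contain it, minus the share of segments from the other set that contain it. This is an illustrative Python sketch under invented toy data; it is not a description of what oppose() computes internally, and the helper names are hypothetical.

```python
def segment(tokens, size):
    """Cut a token list into consecutive, non-overlapping segments of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def document_share(word, segments):
    """Proportion of segments that contain the word at least once."""
    return sum(word in seg for seg in segments) / len(segments)

def zeta_scores(primary, secondary, size):
    """Score in [-1, 1]: +1 = in every primary segment and no secondary one."""
    seg_a = segment(primary, size)
    seg_b = segment(secondary, size)
    vocabulary = set(primary) | set(secondary)
    return {
        w: document_share(w, seg_a) - document_share(w, seg_b)
        for w in vocabulary
    }

# Hypothetical toy 'corpora' (real segments would be thousands of words long).
primary = "she said that she would and she did".split()
secondary = "he said that he could but he did".split()

scores = zeta_scores(primary, secondary, size=4)
preferred = sorted(scores, key=scores.get, reverse=True)[:1]
avoided = sorted(scores, key=scores.get)[:1]
print("preferred in primary set:", preferred)
print("avoided in primary set:  ", avoided)
```

Counting the segments that contain a word, rather than raw occurrences, keeps a single passage that repeats a word many times from dominating the result.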

Oppose

Cyrillic issues on Mac

  • Open Terminal and execute

defaults write org.R-project.R force.LANG en_US.UTF-8 

  • ...or in R execute this:

> system("defaults write org.R-project.R force.LANG en_US.UTF-8")

Z-score

$$ z = \frac{x - \mu}{\sigma} $$

where

  • x: frequency of a word in a given text
  • µ: mean frequency of that word in the whole corpus (collection of texts)
  • σ: standard deviation of that word's frequency across the corpus

remember: texts "have to be coaxed to yield numbers"

so it's mostly about counting frequencies

Stylometry Yerevan

By danilsko

Stylometry SMTB Yerevan lecture