Stylometry
quantitative authorship attribution
and beyond
Daniil Skorinkin, DH Network Potsdam
Agenda for today
- What is stylometry & why you might need it
- Does it actually work? Live demo 🎥
- How it actually works
- Real world applications in authorship attribution
- Applications beyond authorship
- Stylochronology
- Translation studies
- Authors' collaboration
1. What is Stylometry
Stylometry in a nutshell
is using frequencies of countable units of texts to detect the 'authorial signal'
(that's a narrow definition, but will do for now)
Joanne Rowling
The main assumption underlying stylometric studies is that authors have an unconscious as well as conscious aspect to their style
Encyclopaedia of Statistical Sciences
Basic stylometric procedure
(aka what actually happens)
Basic stylometric visualisations
(how to show these distances)
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical clustering dendrograms
Weighted graphs
(Weighted networks)
Basic stylometric visualisation
(how to show these distances)
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical, phylogenetic-tree-style dendrograms
Weighted graphs
(Weighted networks)
Stylometry goes
beyond authorship
- Intra-authorial stylometry
- Style (stylochronology)
- Genres within one author (e.g. Shakespeare)
- Heteronymy
- ...
- Collaboration of authors
- Co-authorship
- Translation
- Influence of the editor
But attribution is still the main purpose
Because authorship disputes are hot topics
Presumably, each national literature has its own famous unsolved attribution case, such as the Shakespearean canon, a collection of Polish erotic poems of the 16th century ascribed to Mikołaj Sęp Szarzyński, the Russian epic poem The Tale of Igor’s Campaign, and many other.
Eder M. (2011) Style-markers in authorship attribution: A cross-language study of the authorial fingerprint.
2. Does it actually work?
Yes, here is the live demo 🎥
Disclaimer
- ⚠️ Stylometric attribution is no magic or silver bullet
- ⚠️ There are many cases in which stylometry won't help detect the author
- 🟢 But at the same time modern state-of-the-art stylometry is not an ad hoc method — it works universally given certain conditions
BTW this is the main potential issue with the Navalny research by KyivPost
But with enough homogeneous data it works
It works on Russian
It works on Armenian
It works on Chinese
3. How does it work?
Stylometric studies, in all their variety of material and method, have two features in common: the <...> texts they study have to be coaxed to yield numbers, and the numbers themselves have to be processed via statistics.
M. Eder, M. Kestemont, J. Rybicki. ‘Stylo’: a package for stylometric analyses
Frequencies of
- Words
- especially function words
- Lemmas
- Character n-grams
- Word n-grams
Less typical but still works sometimes:
- POS tags
- Syntactic structures
- Metric structures
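As a rough illustration of how such features can be counted, here is a minimal Python sketch. Python is used only for illustration and this is not the stylo package; the function names `word_tokens`, `char_ngrams` and `relative_frequencies` are made up for this example, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def word_tokens(text):
    """Lowercase the text and return runs of letters as word tokens."""
    # [^\W\d_] matches any Unicode letter, so non-Latin scripts work too.
    return re.findall(r"[^\W\d_]+", text.lower())


def char_ngrams(text, n):
    """Return all overlapping character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def relative_frequencies(items):
    """Map each distinct item to its share of all items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


sample = "The cat and the dog"
word_freqs = relative_frequencies(word_tokens(sample))
trigram_freqs = relative_frequencies(char_ngrams(sample.lower(), 3))
```

Relative rather than raw counts are used so that texts of different lengths stay comparable.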
So behind this picture:
There's the frequency table:
Each text is a column
Each text is a vector
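To make the frequency table concrete, here is a small Python sketch (not stylo itself). It assumes whitespace tokenization and picks the most frequent words over the whole corpus; `frequency_table` and the two toy texts are hypothetical.

```python
from collections import Counter


def frequency_table(texts, n_mfw):
    """Return (features, table): the n_mfw most frequent words of the whole
    corpus and, for each text, its relative frequencies of those words."""
    # Naive tokenizer: lowercase and split on whitespace.
    tokens = {name: text.lower().split() for name, text in texts.items()}
    corpus_counts = Counter()
    for toks in tokens.values():
        corpus_counts.update(toks)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(corpus_counts, key=lambda w: (-corpus_counts[w], w))
    features = ranked[:n_mfw]
    table = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        # One vector per text: its relative frequency of each feature word.
        table[name] = [counts[w] / len(toks) for w in features]
    return features, table


texts = {
    "text_1": "the cat and the dog",
    "text_2": "the bird and the fish and the frog",
}
features, table = frequency_table(texts, n_mfw=2)
```

Each entry of `table` is one text's vector; its length is the number of most frequent words chosen.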
Let us simplify to just two dimensions
the 'and' axis
the 'the' axis
Now we can
measure distances between texts!
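A tiny Python sketch of the two-dimensional case: each text is a point on the 'and' axis and the 'the' axis, and the distance between two points measures how different the texts are. The three frequency pairs are invented for illustration.

```python
import math

# Hypothetical relative frequencies: (share of 'and', share of 'the').
text_a = (0.030, 0.060)
text_b = (0.020, 0.040)
text_c = (0.031, 0.058)


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)


def manhattan(p, q):
    """Sum of absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))


# text_c lies closer to text_a than text_b does, under both measures.
closer_to_a = min([text_b, text_c], key=lambda t: manhattan(text_a, t))
```

The same two functions work unchanged for vectors with hundreds of dimensions.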
Stylometry does the same, but with many more words, not just two. So the same happens in 100-, 300- or 1000-dimensional space
Joanne Rowling
But there are ways to compress such spaces and visualise them in 2D
But wait! These features are meaningless!!
How can they contain the 'authorial signal'?
J. F. Burrows:
Most readers and critics behave as though common prepositions, conjunctions, personal pronouns, and articles — the parts of speech which make up at least a third of fictional works in English — do not really exist. But far from being a largely inert linguistic mass which has a simple but uninteresting function, these words and their frequency of use can tell us a great deal <...>
Preface to Computation into Criticism, 1987
Burrows Delta
- State of the art in authorship attribution since 2002
- Measures distances between vectors of the N most frequent words or character n-grams
- (though more complex features are also possible)
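A minimal Python sketch of the classic Delta computation: z-score each feature across the corpus, then take the mean absolute difference of z-scores between two texts. This is an illustration, not stylo's code; the use of the sample standard deviation and the toy frequency vectors are assumptions.

```python
import statistics


def zscore_table(table):
    """Z-score every feature (column) across all texts in the table."""
    names = list(table)
    columns = list(zip(*(table[name] for name in names)))
    means = [statistics.mean(col) for col in columns]
    # Assumption: sample standard deviation; a constant column gets sd 1.0.
    sds = [statistics.stdev(col) or 1.0 for col in columns]
    return {
        name: [(x - m) / s for x, m, s in zip(table[name], means, sds)]
        for name in names
    }


def burrows_delta(z, a, b):
    """Mean absolute difference between the z-score vectors of texts a and b."""
    return statistics.mean([abs(x - y) for x, y in zip(z[a], z[b])])


# Hypothetical relative frequencies of two frequent words in three texts.
table = {
    "text_A": [0.05, 0.02],
    "text_B": [0.05, 0.02],
    "text_C": [0.08, 0.05],
}
z = zscore_table(table)
```

The z-scoring step gives every word equal weight, so very frequent words do not drown out less frequent ones.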
4. Some history
Quantifying style
- 1851: A. De Morgan suggests mean word length as an authorship feature
- 1873: New Shakespeare Society (Furnivall, Fleay et al.)
- 1887: T. Mendenhall, The Characteristic Curves of Composition, the first known work on quantitative authorship attribution
Mostly classical studies at first:
- 1880: W. Dittenberger, Sprachliche Kriterien für die Chronologie der Platonischen Dialoge
- 1890: W. Lutosławski, Principes de stylométrie
- 1897: W. Lutosławski, The origin and growth of Plato's logic; with an account of Plato's style and of the chronology of his writings
Federalist papers
- 12 disputed papers (Hamilton or Madison)
- Mosteller F., Wallace D. (1963) Inference in an Authorship Problem.
- '<...> to solve the authorship question of The Federalist papers; and to propose routine methods for solving other authorship problems'.
Mosteller, Wallace, 1963
Mosteller, Wallace, 1963
- The function words of the language appear to be a fertile source of discriminators, and luckily the high-frequency words are the strongest.
- <...>it is important to have a variety of sources of material, to allow “between writings” variability to emerge
Mosteller, Wallace, 1963
In summary, the following points are clear:
- Madison is the principal author. These data make it possible to say far more than ever before that the odds are enormously high that Madison wrote the 12 disputed papers. <...>
- <...> While choice of underlying constants (choice of prior distributions) matters, it doesn’t matter very much, once one is in the neighborhood of a distribution suggested by a fair body of data.
5. Real world applications
Who wrote 'To Kill a Mockingbird'?
Harper Lee
Oh, Real-Lee?
The new 'old' book (2015)
Causes for suspicions (external)
- After publishing 'To Kill a Mockingbird', Lee did not publish another book for 55 years
- The manuscript was 'accidentally found' by Lee's lawyer
- In 2015 Lee was 88, blind and severely disabled
- Alabama authorities actually did an investigation of Lee's legal capacity
- Contradicting statements regarding the manuscript
- a draft of 'To Kill a Mockingbird'
- or a separate work in the same fictional realm
Causes for suspicions (internal)
- Many deemed the 'new' work 'poor' compared to the classic 'To Kill a Mockingbird'
- Plot-wise the new text is a sequel (the main heroine is an adult), though the claim was that it had been written earlier
- A lot of disappointment in Atticus Finch, who turns out to be something of a racist in the 'new' book
Were the books written by one person?
Harper Lee and Truman Capote
Why Capote?
- Harper Lee's childhood friend; they grew up together in the town that became the prototype for the town in 'To Kill a Mockingbird'
- The prototype for one of the characters in 'To Kill a Mockingbird'
- While 'To Kill a Mockingbird' was being written, Capote published nothing major
- Afterwards Capote wrote his true-crime bestseller "In Cold Blood", which Lee helped him work on
- Hypothesis: Capote helped write 'To Kill a Mockingbird', and Lee then thanked him by helping with "In Cold Blood"
This is what the stylometrists went on to test
Harper Lee homogeneity (dendrogram)
Harper Lee homogeneity (network)
By the way we can reproduce it in stylo:
> library(stylo)
> data(lee)
> stylo(frequencies = lee)
Case 2: Elena Ferrante
Who's Ferrante
- Elena Ferrante's books have been published since 1992
- In the 2000s, Ferrante became very popular — first in the USA, then in Italy
- In 2005, journalist Luigi Galella compared Ferrante's book to Domenico Starnone's novel and found textual similarities.
- In 2006 the same journalist published a quantitative study by the physicist Vittorio Loreto comparing the books of Ferrante, Starnone and other Italian authors; Domenico Starnone was again the closest
- In 2016, journalist Claudio Gatti investigated the financial flows of the publishing house E/O and pointed to the translator Anita Raja
Visualisation of this gigantic experiment
What sort of a visualisation is it?
Eder, M. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 50–64 (2017).
Figure 4 shows a network visualization of this set, and results are quite clear again.
Here, too, Starnone seems to be married to Ferrante rather than to Raja;
Other outcomes
Partners in Life, Partners in Crime? (J. Rybicki):
A series of stylometric tests for authorship, based on Burrows’s Delta procedure, which compares usage of most frequent words, was run on a corpus of novels by contemporary Italian writers, supplemented with translations by Anita Raja, recently the main suspect for being Elena Ferrante. Rather than to Raja, the tests point overwhelmingly to her husband, the writer Domenico Starnone.
Blended Authorship Attribution: Unmasking Elena Ferrante Combining Different Author Profiling Methods (G. Mikros):
all profiling results were highly accurate (over 90%) indicating that the person behind Ferrante is a male, aged over 60, from the region Campania and the town Saviano. The combination of these characteristics indicate a single candidate (among the authors of our corpus), Domenico Starnone.
All traces lead to Starnone
and only some to Anita Raja... who is Starnone's wife
Stylometry beyond authorship attribution
But the study of literature and authorship is not only who wrote what, and who didn’t
Maciej Eder, Jan Rybicki (2016). Go Set A Watchman while we Kill the Mockingbird in Cold Blood, with Cats and Other People
Beyond authorship
- Intra-authorial stylometry
- Style (stylochronology)
- Genres within one author (e.g. Shakespeare)
- Heteronymy
- ...
- Collaboration of authors
- Co-authorship
- Translation
- Influence of the editor
Intra-authorial stylometry
Dickens: dating
Maciej Eder, Jan Rybicki
Inside Shakespeare
Agatha Christie: dates and..
...a pseudonym
Mary Westmacott ≈ Robert Galbraith
Pessoa's heteronyms
Skorinkin D., Orekhov B. Hacking stylometry with multiple voices: imaginary writers can override authorial signal in Delta.
In: Digital Scholarship in the Humanities, 2023
Distance table
Tolstoy: dates + 'cycles' of work
Chronological signal is often visible
Stylochronology at scale:
1000 novels from 300 years
This is where we can get back to Ferrante
- Maciej Eder's Ferrante study was less about authorship than about how an author develops a separate style for a virtual author
- Rather than simply unmasking the name, the paper will test whether – and if yes, then to which extent – the unmasked author’s own novels differ stylistically from the works published as “Ferrante”.
Dynamic Delta: rolling.classify()
- Delta distance with a sliding window
- Good for collaboration studies
- More here: rolling stylometry
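The sliding-window idea can be sketched in a few lines of Python. This is a simplified stand-in, not what rolling.classify() does internally: it assigns each window to the nearest reference profile using plain Manhattan distance on relative frequencies, without Delta's z-scoring. All names and the toy token lists are hypothetical.

```python
from collections import Counter


def windows(tokens, size, step):
    """Consecutive (possibly overlapping) slices of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step)]


def profile(tokens, features):
    """Relative frequency of each feature word in the token list."""
    counts = Counter(tokens)
    return [counts[f] / len(tokens) for f in features]


def rolling_nearest_profile(tokens, references, features, size, step):
    """Label each window with the reference whose profile is closest."""
    ref_profiles = {name: profile(toks, features) for name, toks in references.items()}
    labels = []
    for window in windows(tokens, size, step):
        p = profile(window, features)
        best = min(
            ref_profiles,
            key=lambda name: sum(abs(a - b) for a, b in zip(p, ref_profiles[name])),
        )
        labels.append(best)
    return labels


features = ["the", "and"]
references = {
    "author_X": ["the"] * 8 + ["cat"] * 2,
    "author_Y": ["and"] * 8 + ["dog"] * 2,
}
mixed_text = ["the"] * 4 + ["and"] * 4
labels = rolling_nearest_profile(mixed_text, references, features, size=4, step=4)
```

A `step` smaller than `size` makes the windows overlap, which smooths the transitions between assigned labels.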
(me testing the method)
What Eder saw:
Arguably, a clear pattern appears: while the early novels show little similarity with the assumed virtual “Ferrante”, the late works are assigned to this class with more and more confidence of the classifier. Almost all of the segments of L’amore molesto from 1992 (Fig. 4a) are classified as “Starnone”, with an exception of a relatively short passage at the end of the novel. The voice of the virtual “Ferrante” is more noticeable in I Giorni dell’abbandono from 2002 (Fig. 4b), this time at the beginning of the novel. In La figlia oscura (2006) the share of segments by “Ferrante” is roughly equal to those of “Starnone”. In the novel L’amica geniale. Infanzia, adolescenza (2011) the style of “Ferrante” becomes predominant, which is even more visible in Storia del nuovo cognome published 2012 (Fig. 4c). This novel is a triumph of the virtual author
Conclusion:
- Apparently, Domenico Starnone demonstrates <..> the ability to differentiate his own stylistic profile and the voice of his alter ego.
- Ferrante has been gradually emerging, to become predominant in the late novels.
Translation &
other forms of collaboration
From French...
...to English
works just a bit worse with Polish
Google Translate and DeepL are just as good:
Though they used to be really bad
Collaborative translation
Night and Day
by Virginia Woolf
Anna Kołyszko -> Magda Heydel
Maciej Eder, Jan Rybicki
rolling stylometry
& the Shakespeare question
Shakespeare
...and Marlowe
Henry VI: sequential analysis
Editor's influence
Choiński, M., Rybicki, J. (2016). Jonathan Edwards and Thomas Foxcroft: In Pursuit of Stylometric Traces of the Editor. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 147-149.
Young Edwards: no editor
Consecutive segments of Edwards's Mind (1723); throughout the work, Edwards's signal (red) dominates over the (absent) signal of Foxcroft.
Old Edwards: editor becomes somewhat visible
Consecutive segments of Edwards's Humble Inquiry (1749); in many other fragments, dominated by Edwards (red), Foxcroft's impact is still visible. The lower band shows the strongest signal; the upper, the second strongest.
Bonus topics
Adversarial stylometry
- deceiving authorship detection
- countermeasures to deception
- de-anonymization
- demographics detection
- native language identification
- ...potentially allows you to harrypotterize your fanfic =)
What about the actual generated text?
Stylometry still beats GPT
But does not beat a neural network specifically trained on author X
Code stylometry
Some references
- Style-markers in authorship attribution: A cross-language study of the authorial fingerprint (great paper by Maciej Eder)
- His other papers
- ... and the papers of his colleagues Jan Rybicki (including the Translation studies)
- A lecture by Jan Rybicki: youtu.be/XoZ2HMYw2U4
- the Stylo tool: https://computationalstylistics.github.io/
- How Delta measures work: Understanding and explaining Delta measures for authorship attribution
6. Bonus 1: more Stylo functions
classify()
- text classification with stylometry features
- main tool for actual authorship attribution
- employs standard machine-learning algorithms
- requires two sets of documents
- training (primary_set)
- test (secondary_set)
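The training/test split can be illustrated with a nearest-neighbour classifier based on Delta, written here in Python. This is only a sketch under stated assumptions, not stylo's implementation: z-scores are computed from the training set alone, the sample standard deviation is used, and the frequency vectors and author labels are invented.

```python
import statistics


def fit_scaler(train):
    """Per-feature mean and sample standard deviation over the training texts."""
    columns = list(zip(*train.values()))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) or 1.0 for col in columns]  # guard against sd == 0
    return means, sds


def scale(vector, means, sds):
    """Turn raw frequencies into z-scores using the training statistics."""
    return [(x - m) / s for x, m, s in zip(vector, means, sds)]


def classify_by_delta(train, labels, test_vector):
    """Return the label of the training text with the smallest Delta to the test text."""
    means, sds = fit_scaler(train)
    z_test = scale(test_vector, means, sds)

    def delta(name):
        z_train = scale(train[name], means, sds)
        return statistics.mean([abs(a - b) for a, b in zip(z_train, z_test)])

    return labels[min(train, key=delta)]


# Hypothetical training set: relative frequencies of two frequent words.
train = {
    "a1": [0.050, 0.020],
    "a2": [0.055, 0.022],
    "b1": [0.080, 0.050],
    "b2": [0.078, 0.048],
}
labels = {"a1": "author_A", "a2": "author_A", "b1": "author_B", "b2": "author_B"}
guess = classify_by_delta(train, labels, [0.052, 0.021])
```

Scaling the test text with statistics from the training set keeps the disputed text from influencing the feature space it is judged in.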
rolling.classify()
- dynamic changes in the text
- text window of adjustable size
oppose()
- contrastive analysis
- words significantly preferred/avoided
- comparison studies (e.g. male vs female styles)
- when launching with non-latin script data:
oppose(corpus.lang="Other")
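One common way to find preferred and avoided words is a Zeta-style score: for each word, the share of segments in set A that contain it minus the share of segments in set B that contain it. The Python sketch below illustrates that idea only; whether and how oppose() computes it is not shown in these slides, so treat the scoring formula, the function name and the toy segments as assumptions.

```python
def zeta_scores(segments_a, segments_b):
    """Score each word by (share of A segments containing it) minus
    (share of B segments containing it). Range: -1.0 to 1.0."""
    sets_a = [set(seg) for seg in segments_a]
    sets_b = [set(seg) for seg in segments_b]
    vocabulary = set().union(*sets_a, *sets_b)
    scores = {}
    for word in vocabulary:
        share_a = sum(word in s for s in sets_a) / len(sets_a)
        share_b = sum(word in s for s in sets_b) / len(sets_b)
        scores[word] = share_a - share_b
    return scores


# Hypothetical, already tokenized segments from two sets of texts.
segments_a = [["she", "said"], ["she", "went"]]
segments_b = [["he", "said"], ["he", "went"]]
scores = zeta_scores(segments_a, segments_b)

# Words most preferred in A come first, words most avoided in A come last.
ranked = sorted(scores, key=lambda w: (-scores[w], w))
```

Counting segments that contain a word, rather than raw occurrences, keeps one repetitive passage from dominating the result.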
Oppose
Cyrillic issues on Mac
- Open Terminal and execute
defaults write org.R-project.R force.LANG en_US.UTF-8
- ...or in R execute this:
>system("defaults write org.R-project.R force.LANG en_US.UTF-8")
Z-score
$$z = \frac{x - \mu}{\sigma}$$
where
- x: frequency of a word in a given text
- µ: mean frequency of that word in the whole corpus (collection of texts)
- σ: standard deviation of that word's frequency across the corpus
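A short worked example of the z-score in Python, with invented numbers: one word's relative frequency in four texts, and the z-score of the third text. Using the sample standard deviation is an assumption here.

```python
import statistics

# Hypothetical relative frequencies of one word in four texts of a corpus.
corpus_freqs = [0.04, 0.05, 0.06, 0.05]

x = corpus_freqs[2]                       # frequency in the text of interest
mu = statistics.mean(corpus_freqs)        # mean frequency across the corpus
sigma = statistics.stdev(corpus_freqs)    # sample standard deviation (assumption)

z = (x - mu) / sigma
```

A positive z means the word is used more often in this text than on average in the corpus; a negative z means it is used less often.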
remember: texts "have to be coaxed to yield numbers"
so it's mostly about counting frequencies
Stylometry Yerevan
By danilsko
Stylometry SMTB Yerevan lecture