quantitative authorship attribution
and beyond
Daniil Skorinkin, DH Network Potsdam
is using frequencies of countable units of texts to detect the 'authorial signal'
(that's a narrow definition, but will do for now)
underlying stylometric studies is that authors have an unconscious as well as conscious aspect to their style
Encyclopaedia of Statistical Sciences
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical clustering: phylogenetic-tree-style dendrograms
Weighted graphs
(Weighted networks)
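A minimal sketch of the dendrogram route, assuming an invented frequency table for five texts by two made-up authors. The Manhattan distance and average linkage are choices made for illustration only; base R's `dist()`, `hclust()` and `cutree()` do the work.

```r
# Hypothetical relative frequencies (per 100 words) of four frequent words
# in five texts by two made-up authors, A and B. All values are invented.
freqs <- matrix(
  c(6.1, 3.0, 2.9, 1.2,
    6.3, 2.8, 3.1, 1.1,
    5.9, 3.1, 3.0, 1.3,
    4.2, 3.9, 2.0, 2.1,
    4.0, 4.1, 2.2, 2.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("A_1", "A_2", "A_3", "B_1", "B_2"),
                  c("the", "and", "of", "to"))
)

# Pairwise distances between texts (Manhattan is an assumption here)
d <- dist(freqs, method = "manhattan")

# Agglomerative hierarchical clustering with average linkage
tree <- hclust(d, method = "average")

# Cut the tree into two groups and inspect who ends up with whom
groups <- cutree(tree, k = 2)
print(groups)

# In an interactive session, plot(tree) draws the dendrogram
```

With real data the rows would be the texts of the corpus and the columns its most frequent words.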
Presumably, each national literature has its own famous unsolved attribution case, such as the Shakespearean canon, a collection of Polish erotic poems of the 16th century ascribed to Mikołaj Sęp Szarzyński, the Russian epic poem The Tale of Igor’s Campaign, and many others.
Eder M. (2011) Style-markers in authorship attribution: A cross-language study of the authorial fingerprint.
in all their variety of material and method, have two features in common: the <...> texts they study have to be coaxed to yield numbers, and the numbers themselves have to be processed via statistics.
M. Eder, M. Kestemont, J. Rybicki. ‘Stylo’: a package for stylometric analyses
Less typical but still works sometimes:
the 'and' axis
the 'the' axis
Now we can
measure distances between texts!
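A minimal sketch of this idea, assuming three made-up toy texts and only two axes, 'the' and 'and'. Each text becomes a point whose coordinates are the relative frequencies of those two words, and base R's `dist()` gives the Euclidean distances between the points. The texts and the helper `rel_freq` are illustrative, not part of the talk.

```r
# Toy texts, invented for illustration
texts <- c(
  a = "the cat and the dog and the bird",
  b = "the sun and the moon",
  c = "a cat sat on a mat and slept"
)

# Relative frequency of each target word in one text
rel_freq <- function(text, targets) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  sapply(targets, function(w) sum(tokens == w) / length(tokens))
}

# Rows = texts, columns = the two axes ('the' and 'and')
freqs <- t(sapply(texts, rel_freq, targets = c("the", "and")))
print(round(freqs, 3))

# Euclidean distances between the texts in this 2D space
distances <- dist(freqs)
print(round(distances, 3))
```

Real studies use hundreds of word axes rather than two, but the distance computation stays the same.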
But there are ways to compress such spaces and visualize them in 2D
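A minimal sketch of such a compression, assuming an invented table of five word frequencies for four texts. It uses two base R routines: `cmdscale()` for classical multidimensional scaling and `prcomp()` for principal component analysis. Which method suits a given corpus is left open here.

```r
# Hypothetical frequency table: 4 texts x 5 frequent words (values invented)
freqs <- rbind(
  text1 = c(the = 6.0, and = 3.1, of = 2.9, to = 1.2, a = 2.0),
  text2 = c(the = 6.2, and = 2.9, of = 3.0, to = 1.1, a = 2.1),
  text3 = c(the = 4.1, and = 4.0, of = 2.0, to = 2.2, a = 1.5),
  text4 = c(the = 4.3, and = 3.8, of = 2.1, to = 2.0, a = 1.4)
)

# Classical multidimensional scaling: from 5 dimensions down to 2
coords_mds <- cmdscale(dist(freqs), k = 2)

# Principal component analysis: keep the first two components
coords_pca <- prcomp(freqs, scale. = TRUE)$x[, 1:2]

print(round(coords_mds, 2))
print(round(coords_pca, 2))

# In an interactive session, plot(coords_mds) shows the texts on a 2D map
```

Texts that were close in the full space should stay close on the 2D map, which is what makes the picture readable.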
How can they contain the 'authorial signal'?
Most readers and critics behave as though common prepositions, conjunctions, personal pronouns, and articles — the parts of speech which make up at least a third of fictional works in English — do not really exist. But far from being a largely inert linguistic mass which has a simple but uninteresting function, these words and their frequency of use can tell us a great deal <...>
Preface to Computation into Criticism, 1987
1851: A. De Morgan suggests mean word length as an authorship feature
1873: New Shakespeare Society (Furnivall, Fleay et al.)
1880: W. Dittenberger, Sprachliche Kriterien für die Chronologie der Platonischen Dialoge
1887: T. Mendenhall, The Characteristic Curves of Composition, the first known work on quantitative authorship attribution
1890: W. Lutosławski, Principes de stylométrie
1897: W. Lutosławski, The origin and growth of Plato's logic; with an account of Plato's style and of the chronology of his writings
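The two earliest features in this timeline are easy to compute today. A minimal sketch, assuming one invented toy sentence: the mean word length (De Morgan's suggestion) and the share of words of each length, which is roughly what a Mendenhall-style curve plots. The tokenization rule is a simplification chosen for illustration.

```r
# Toy sentence for illustration; real studies use whole books
text <- "It was the best of times, it was the worst of times"

# Split on anything that is not a letter and drop empty tokens
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]
word_len <- nchar(tokens)

# Mean word length as a single authorship feature
mean_length <- mean(word_len)
print(mean_length)

# Share of words of each length (the basis of a word-length curve)
length_curve <- table(word_len) / length(tokens)
print(round(length_curve, 3))
```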
In summary, the following points are clear:
> library(stylo)
> data(lee)
> stylo(frequencies = lee)
Eder, M. Digital Scholarship in the Humanities 32, 50–64 (2017).
Figure 4 shows a network visualization of this set, and results are quite clear again.
Here, too, Starnone seems to be married to Ferrante rather than to Raja;
Partners in Life, Partners in Crime? (J. Rybicki):
A series of stylometric tests for authorship, based on Burrows’s Delta procedure, which compares usage of most frequent words, was run on a corpus of novels by contemporary Italian writers, supplemented with translations by Anita Raja, recently the main suspect for being Elena Ferrante. Rather than to Raja, the tests point overwhelmingly to her husband, the writer Domenico Starnone.
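A minimal sketch of the Delta idea mentioned above, assuming an invented frequency table with four texts of known authorship and one disputed text. Frequencies of the most frequent words are z-scored across the corpus, and Delta between two texts is the mean absolute difference of their z-scores. The row names and values are hypothetical, and this is the classic formulation rather than the exact setup of the study.

```r
# Hypothetical relative frequencies (per 100 words); rows = texts
freqs <- rbind(
  known_X1 = c(the = 6.0, and = 3.1, of = 2.9, to = 1.2),
  known_X2 = c(the = 6.2, and = 2.9, of = 3.0, to = 1.1),
  known_Y1 = c(the = 4.1, and = 4.0, of = 2.0, to = 2.2),
  known_Y2 = c(the = 4.3, and = 3.8, of = 2.1, to = 2.0),
  disputed = c(the = 6.1, and = 3.0, of = 2.8, to = 1.3)
)

# Step 1: z-score each word column across the whole corpus
z <- scale(freqs)

# Step 2: Delta = mean absolute difference of z-scores between two texts
delta <- function(a, b) mean(abs(z[a, ] - z[b, ]))

# Step 3: compare the disputed text with every text of known authorship
known <- setdiff(rownames(freqs), "disputed")
scores <- sapply(known, delta, b = "disputed")
print(round(sort(scores), 3))

# The smallest Delta marks the stylistically nearest candidate
nearest <- names(which.min(scores))
print(nearest)
```

The lowest score only says which candidate in the corpus is closest; it cannot point to an author who is not in the corpus.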
Blended Authorship Attribution: Unmasking Elena Ferrante Combining Different Author Profiling Methods (G. Mikros):
all profiling results were highly accurate (over 90%) indicating that the person behind Ferrante is a male, aged over 60, from the region Campania and the town Saviano. The combination of these characteristics indicate a single candidate (among the authors of our corpus), Domenico Starnone.
Maciej Eder, Jan Rybicki (2016). Go Set A Watchman while we Kill the Mockingbird in Cold Blood, with Cats and Other People
Mary Westmacott ≈ Robert Galbraith
Skorinkin D., Orekhov B. Hacking stylometry with multiple voices: imaginary writers can override authorial signal in Delta.
In: Digital Scholarship in the Humanities, 2023
Arguably, a clear pattern appears: while the early novels show little similarity with the assumed virtual “Ferrante”, the late works are assigned to this class with more and more confidence of the classifier. Almost all of the segments of L’amore molesto from 1992 (Fig. 4a) are classified as “Starnone”, with an exception of a relatively short passage at the end of the novel. The voice of the virtual “Ferrante” is more noticeable in I Giorni dell’abbandono from 2002 (Fig. 4b), this time at the beginning of the novel. In La figlia oscura (2006) the share of segments by “Ferrante” is roughly equal to those of “Starnone”. In the novel L’amica geniale. Infanzia, adolescenza (2011) the style of “Ferrante” becomes predominant, which is even more visible in Storia del nuovo cognome published 2012 (Fig. 4c). This novel is a triumph of the virtual author
Maciej Eder, Jan Rybicki
Choiński, M., Rybicki, J. (2016). Jonathan Edwards and Thomas Foxcroft: In Pursuit of Stylometric Traces of the Editor. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 147-149.
Consecutive segments of Edwards's Mind (1723); throughout the work, Edwards's signal (red) dominates over the (absent) signal of Foxcroft.
Consecutive segments of Edwards's Humble Inquiry (1749); in many other fragments, dominated by Edwards (red), Foxcroft's impact is still visible. The lower band shows the strongest signal; the upper, the second strongest.
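A minimal sketch of how such segment-by-segment bands can be produced, assuming two invented reference profiles and an invented token stream. The text is cut into consecutive segments and each segment is labelled with the closest profile. The nearest-profile rule with Manhattan distance is a stand-in chosen for illustration, not the classifier used in the studies above.

```r
# Invented reference profiles: relative frequencies of three words
profiles <- rbind(
  author_red  = c(the = 0.50, and = 0.30, of = 0.20),
  author_blue = c(the = 0.20, and = 0.30, of = 0.50)
)

# Invented token stream: first half leans "red", second half "blue"
tokens <- c(rep(c("the", "the", "and", "of", "the"), 4),
            rep(c("of", "of", "and", "the", "of"), 4))

# Cut the stream into consecutive segments of equal size
segment_size <- 10
segment_id <- ceiling(seq_along(tokens) / segment_size)
segments <- split(tokens, segment_id)

# Label one segment with the closest reference profile
classify <- function(seg) {
  prof <- sapply(colnames(profiles), function(w) mean(seg == w))
  dists <- apply(profiles, 1, function(p) sum(abs(p - prof)))
  names(which.min(dists))
}

# One label per segment, in reading order
labels <- sapply(segments, classify)
print(labels)
```

Keeping the second-closest profile for each segment as well would give the kind of two-band display described in the caption.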
defaults write org.R-project.R force.LANG en_US.UTF-8
> system("defaults write org.R-project.R force.LANG en_US.UTF-8")