Can an anonymous text be attributed to an author using statistics?
Oracle Inspiration Day
Daniil Skorinkin, DH-Netzwerk Potsdam
Stylometry studies the frequencies of certain "atomic" text elements in order to detect a text's "authorial signal".
(This is a narrow definition, but it is enough to get started.)
Let's look at a text. How could we split it up?
Words
Character n-grams (here n = 4)
Word n-grams (here n = 3; see the sketch below)
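What this looks like in practice: a minimal R sketch (the example sentence is made up and all variable names are only for illustration) that produces all three kinds of units.

# Minimal sketch: three ways of splitting the same text.
text <- tolower("The cat sat on the mat and the dog sat by the door")

# 1) words
words <- unlist(strsplit(text, "[^a-z']+"))

# 2) character n-grams, n = 4 (spaces included)
chars <- unlist(strsplit(text, ""))
char_4grams <- sapply(seq_len(length(chars) - 3),
                      function(i) paste(chars[i:(i + 3)], collapse = ""))

# 3) word n-grams, n = 3
word_3grams <- sapply(seq_len(length(words) - 2),
                      function(i) paste(words[i:(i + 2)], collapse = " "))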
und die der ich sie das in er nicht zu den es ein sich mit so von war aber ist auf dem wie auch als eine an daß was noch du mir wenn ihr da hatte aus nur mich im nach doch wir des um ja einen man sein ihm wieder immer vor denn für über oder einem sagte dann alles hat nun ihn einer ganz schon haben bei sind seine frau am hier meine mehr nichts jetzt etwas diese habe ihre uns waren durch alle mein dieser werden bis unter wo kann zum wird will seiner sah wurde herr...
the and to of a was i in he said you that it his had on at her with as for not him they she were but be have up all out is from them me been what this about into like back my there would we could one now know if their so or no do down your an did by are when who looked more over then see again time just don't still very think got will off re go eyes than before right here get away thought i'm came too through only long way going face come some can
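Lists like the ones above are nothing more than ranked frequency counts over a large corpus. A minimal R sketch; the tiny token vector is only a toy stand-in for a real corpus.

# Rank tokens by frequency; the top of the ranking is the MFW (most frequent words) list.
tokens <- c("the", "cat", "and", "the", "dog", "and", "the", "bird")   # toy stand-in
word_counts <- sort(table(tokens), decreasing = TRUE)
mfw <- names(word_counts)[seq_len(min(100, length(word_counts)))]
mfw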
Word counts are the variables used for discrimination. Since the topic written about heavily influences the rate with which a word is used, care in selection of words is necessary. The filler words of the language such as an, of, and upon, and, more generally, articles, prepositions, and conjunctions provide fairly stable rates, whereas more meaningful words like war, executive, and legislature do not.
Mosteller, F. & Wallace, D. (1963): Inference in an Authorship Problem.
Most readers and critics behave as though common prepositions, conjunctions, personal pronouns, and articles — the parts of speech which make up at least a third of fictional works in English — do not really exist. But far from being a largely inert linguistic mass which has a simple but uninteresting function, these words and their frequency of use can tell us a great deal <...>
J. F. Burrows, Preface to Computation into Criticism, 1987
Book A by author A (AA)
Book B by author A (BA)
Book C by author A (CA)
Book A by author B (AB)
Book B by author B (BB)
Book C by author B (CB)
Book A by author C (AC)
Book B by author C (BC)
Book C by author C (CC)
Some dubious book (??)
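To compare these books, each one becomes a row of relative frequencies of the shared most frequent words. A minimal base-R sketch; the file names are invented for illustration (in practice the stylo package builds such tables from a corpus folder).

# Hypothetical file names; labels follow the list above ("unknown" = the dubious book).
files <- c(AA = "authorA_bookA.txt", BA = "authorA_bookB.txt", CA = "authorA_bookC.txt",
           AB = "authorB_bookA.txt", BB = "authorB_bookB.txt", CB = "authorB_bookC.txt",
           AC = "authorC_bookA.txt", BC = "authorC_bookB.txt", CC = "authorC_bookC.txt",
           unknown = "dubious_book.txt")

tokenize <- function(path) {
  text <- tolower(paste(readLines(path, encoding = "UTF-8"), collapse = " "))
  tok <- unlist(strsplit(text, "[^\\p{L}']+", perl = TRUE))   # crude word tokenizer
  tok[tok != ""]
}

tokens <- lapply(files, tokenize)

# The 100 most frequent words of the whole corpus...
mfw <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:100]

# ...and their relative frequency in every book: one row per book, one column per word.
freqs <- t(sapply(tokens, function(tok)
  as.numeric(table(factor(tok, levels = mfw))) / length(tok)))
colnames(freqs) <- mfw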
But what happens in between?
the 'and' axis
the 'the' axis
Now we can measure 'distances' between texts!
(Axes of the word-frequency space: 'the', 'and', 'of', 'to')
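'Distances' here are literal distances between rows of such a frequency table: every book is a point whose coordinates are its relative frequencies of 'the', 'and', 'of', 'to', and so on. A minimal sketch with a toy table; the numbers are invented purely for illustration, not real measurements.

# Toy table: rows = books, columns = relative MFW frequencies (made-up values).
toy_freqs <- rbind(
  AA      = c(the = 0.061, and = 0.031, of = 0.028, to = 0.025),
  BA      = c(the = 0.063, and = 0.030, of = 0.027, to = 0.026),
  AB      = c(the = 0.048, and = 0.041, of = 0.020, to = 0.031),
  unknown = c(the = 0.060, and = 0.032, of = 0.027, to = 0.025)
)

# Euclidean distances between the books in this four-dimensional word space;
# the freqs matrix from the earlier sketch can be plugged in the same way.
round(as.matrix(dist(toy_freqs, method = "euclidean")), 4)

The known book whose point lies closest to the 'unknown' row is the attribution candidate.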
There are methods for compressing such multidimensional spaces and visualizing the similarities in 2D
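Principal component analysis (PCA) is one such method (multidimensional scaling and cluster analysis are common alternatives). A minimal sketch, reusing the toy table from the sketch above; the larger freqs matrix works the same way.

# Project the books onto the first two principal components and plot them in 2D.
pca <- prcomp(toy_freqs, center = TRUE, scale. = FALSE)
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2",
     main = "Books in the space of most frequent words")
text(pca$x[, 1], pca$x[, 2], labels = rownames(toy_freqs))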
It becomes quite obvious that samples shorter than 5000 words provide a poor "guessing", because they can be immensely affected by random noise. Below the size of 3000 words, the obtained results are simply disastrous. Other analyzed corpora showed that the critical point of attributive success could be found between 5000 and 10000 words per sample (and there was no significant difference between inflected and non-inflected languages).
Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem.
Digital Scholarship in the Humanities.
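Experiments like this rest on a simple chunking step: split each text into consecutive samples of a fixed number of words and check how often each sample is attributed correctly. A minimal sketch of the chunking only; make_samples is a hypothetical helper, not code from the cited paper.

# Split a token vector into consecutive samples of a fixed size (e.g. 5000 words).
make_samples <- function(tokens, size = 5000) {
  split(tokens, ceiling(seq_along(tokens) / size))
}
# samples <- make_samples(words, size = 5000)   # 'words' as in the tokenization sketch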
"Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated".
John Burrows, ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship, Literary and Linguistic Computing, Volume 17, Issue 3, September 2002, Pages 267–287, https://doi.org/10.1093/llc/17.3.267
Orekhov, B. (2024) ‘Does Burrows’ Delta really confirm that Rowling and Galbraith are the same author?’ https://arxiv.org/abs/2407.10301
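For orientation: Burrows' Delta compares texts via the z-scores of their most-frequent-word frequencies and attributes the questioned text to the candidate with the smallest mean absolute difference. A minimal, simplified sketch (not Burrows' original procedure), written for a books-by-words frequency matrix like the ones built above.

# Burrows' Delta, simplified: mean absolute difference of z-scored MFW frequencies.
burrows_delta <- function(freqs, test_row = "unknown") {
  z <- scale(freqs)                 # z-score each word across all rows (a simplification;
                                    # assumes no word has zero variance in the corpus)
  candidates <- z[rownames(z) != test_row, , drop = FALSE]
  deltas <- apply(abs(sweep(candidates, 2, z[test_row, ])), 1, mean)
  sort(deltas)                      # smallest Delta = most likely author
}
# burrows_delta(toy_freqs)          # e.g. with the toy table from the distance sketch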
Blended Authorship Attribution: Unmasking Elena Ferrante Combining Different Author Profiling Methods (G. Mikros):
all profiling results were highly accurate (over 90%) indicating that the person behind Ferrante is a male, aged over 60, from the region Campania and the town Saviano. The combination of these characteristics indicate a single candidate (among the authors of our corpus), Domenico Starnone.
(Jan Rybicki, at the DH 2019 Conference, Utrecht)
Maciej Eder, Jan Rybicki (2016). Go Set A Watchman while we Kill the Mockingbird in Cold Blood, with Cats and Other People
Maciej Eder, Jan Rybicki
Presumably, each national literature has its own famous unsolved attribution case, such as the Shakespearean canon, a collection of Polish erotic poems of the 16th century ascribed to Mikołaj Sęp Szarzyński, the Russian epic poem The Tale of Igor’s Campaign, and many others.
Eder M. (2011) Style-markers in authorship attribution: A cross-language study of the authorial fingerprint.
in all their variety of material and method, have two features in common: the <...> texts they study have to be coaxed to yield numbers, and the numbers themselves have to be processed via statistics.
M. Eder, M. Kestemont, J. Rybicki. ‘Stylo’: a package for stylometric analyses
Less typical but still works sometimes:
How can they contain the 'authorial signal'?
> library(stylo)               # load the stylo package
> data(lee)                    # bundled frequency table from the Harper Lee case study
> stylo(frequencies = lee)     # run the analysis directly on that precomputed table
1851 — A. De Morgan suggests mean word-length as an authorship feature
1873 — New Shakespeare Society (Furnivall, Fleay et al.)
1887 — T. Mendenhall, The Characteristic Curves of Composition, the first known work on quantitative authorship attribution
# macOS only: force the R GUI to use a UTF-8 locale (restart R afterwards). In Terminal:
defaults write org.R-project.R force.LANG en_US.UTF-8
# or from within R:
> system("defaults write org.R-project.R force.LANG en_US.UTF-8")