Quantitative authorship attribution and beyond
Daniil Skorinkin, DH Network Potsdam
Quantitative authorship attribution uses frequencies of textual units to detect the 'authorial signal' in texts
(that's a narrow definition, but enough for now)
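To make "frequencies of textual units" concrete, here is a minimal base-R sketch that turns a text into relative word frequencies. The sentence is made up for illustration, and the tokenizer is a deliberately naive assumption that only handles unaccented English letters.

```r
# A made-up sentence standing in for a real text
text <- "The cat sat on the mat and the dog sat by the door"

# Lowercase, then split on anything that is not a letter a-z
# (naive assumption: fine for this toy example, not for real corpora)
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Raw counts, sorted with the most frequent words first
counts <- sort(table(tokens), decreasing = TRUE)

# Relative frequencies: counts divided by the text length in tokens
rel_freq <- counts / length(tokens)

print(round(rel_freq, 3))
```

On real texts the same table would be built for every text in the corpus, using a shared list of the most frequent words.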
<...> underlying stylometric studies is that authors have an unconscious as well as conscious aspect to their style
Encyclopaedia of Statistical Sciences
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical clustering dendrograms (phylogenetic-tree style)
Weighted graphs (weighted networks)
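As a rough illustration of the clustering approach listed above, the sketch below builds a hierarchical clustering from a small table of word frequencies using base R only. The frequency values and the text labels (A1, A2, B1, B2) are invented for the example, and Ward linkage is just one possible choice of method.

```r
# Toy relative frequencies (per 100 words) of five frequent words in four texts.
# All values are invented for illustration only.
freqs <- matrix(
  c(6.0, 3.0, 2.5, 2.0, 1.5,
    6.2, 3.1, 2.4, 2.1, 1.4,
    4.0, 5.0, 1.0, 3.0, 2.5,
    4.1, 4.8, 1.2, 2.9, 2.6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A1", "A2", "B1", "B2"),
                  c("the", "and", "of", "to", "a"))
)

# Pairwise Euclidean distances between texts, then Ward hierarchical clustering
hc <- hclust(dist(freqs), method = "ward.D2")

# Cut the tree into two groups and show which text lands where
groups <- cutree(hc, k = 2)
print(groups)
```

Calling plot(hc) on the result draws the dendrogram.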
Presumably, each national literature has its own famous unsolved attribution case, such as the Shakespearean canon, a collection of Polish erotic poems of the 16th century ascribed to Mikołaj Sęp Szarzyński, the Russian epic poem The Tale of Igor’s Campaign, and many others.
Eder M. (2011) Style-markers in authorship attribution: A cross-language study of the authorial fingerprint.
Taking tools built by warmongers, spy agencies & investment bankers and using them to study literature, philosophy, culture and the classics
(Elijah Meeks, Stanford Digital Scholarship)
<...> in all their variety of material and method, have two features in common: the <...> texts they study have to be coaxed to yield numbers, and the numbers themselves have to be processed via statistics.
M. Eder, M. Kestemont, J. Rybicki. ‘Stylo’: a package for stylometric analyses
Less typical but still works sometimes:
the 'and' axis
the 'the' axis
Now we can measure distances between texts!
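A minimal sketch of that idea in base R: each text becomes a point whose coordinates are its frequencies of 'the' and 'and', and the distance between two texts is the distance between their points. The three texts and their frequency values are invented for illustration.

```r
# Toy relative frequencies (per 100 words) of 'the' and 'and' in three texts.
# All values are invented for illustration only.
freqs <- matrix(
  c(6.0, 3.0,
    6.5, 3.2,
    4.0, 5.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("text_A", "text_B", "text_C"), c("the", "and"))
)

# Euclidean (straight-line) distance between every pair of texts
euclid <- dist(freqs, method = "euclidean")

# Manhattan (city-block) distance, another common choice
manhattan <- dist(freqs, method = "manhattan")

print(round(as.matrix(euclid), 3))
print(round(as.matrix(manhattan), 3))
```

With more words the same calls work unchanged: every additional word simply adds one more axis to the space.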
But there are ways to compress such spaces and visualize them in 2D
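One way to sketch this compression in base R is shown below, using classical multidimensional scaling (cmdscale) and principal component analysis (prcomp) on a five-word frequency table. The frequency values and text labels are invented for the example; real studies would use hundreds of words.

```r
# Toy relative frequencies (per 100 words) of five frequent words in four texts.
# All values are invented for illustration only.
freqs <- matrix(
  c(6.0, 3.0, 2.5, 2.0, 1.5,
    6.2, 3.1, 2.4, 2.1, 1.4,
    4.0, 5.0, 1.0, 3.0, 2.5,
    4.1, 4.8, 1.2, 2.9, 2.6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A1", "A2", "B1", "B2"),
                  c("the", "and", "of", "to", "a"))
)

# Classical MDS: place the texts in 2D so that pairwise distances
# are preserved as well as possible
coords_mds <- cmdscale(dist(freqs), k = 2)

# PCA: keep the first two principal components of the scaled frequencies
pca <- prcomp(freqs, center = TRUE, scale. = TRUE)
coords_pca <- pca$x[, 1:2]

print(round(coords_mds, 3))
print(round(coords_pca, 3))
```

Either coordinate table can then be drawn with plot() and labelled with text(), giving a 2D map of the texts.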
How can they contain the 'authorial signal'?
Most readers and critics behave as though common prepositions, conjunctions, personal pronouns, and articles — the parts of speech which make up at least a third of fictional works in English — do not really exist. But far from being a largely inert linguistic mass which has a simple but uninteresting function, these words and their frequency of use can tell us a great deal <...>
John Burrows, preface to Computation into Criticism, 1987
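> library(stylo)  # load the stylo package
> data(lee)  # word-frequency table shipped with stylo
> stylo(frequencies = lee)  # run the analysis on the precomputed frequencies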
Maciej Eder, Jan Rybicki (2016). Go Set A Watchman while we Kill the Mockingbird in Cold Blood, with Cats and Other People
1851: A. De Morgan suggests mean word-length as an authorship feature
1873: New Shakespeare Society (Furnivall, Fleay et al.)
1887: T. Mendenhall, The Characteristic Curves of Composition, the first known work on quantitative authorship attribution
In summary, the following points are clear:
defaults write org.R-project.R force.LANG en_US.UTF-8
> system("defaults write org.R-project.R force.LANG en_US.UTF-8")