Quantitative authorship attribution and beyond
Seminar "Languages of Psychiatry", 4 December 2025
Daniil Skorinkin, DH Network, Universität Potsdam
Stylometry: the use of frequencies of certain atomic elements of texts (most often words or fragments of words) to compare texts with one another, for example for authorship attribution, but not only for that
(this is my working technical definition)
Author A, book A (A_A)
Author A, book B (A_B)
Author A, book C (A_C)
Author B, book A (B_A)
Author B, book B (B_B)
Author B, book C (B_C)
Author C, book A (C_A)
Author C, book B (C_B)
Author C, book C (C_C)
The book we have doubts about (Dubia)
So what happens in between?
Stylometric studies, in all their variety, share two features: texts have to be converted into numbers in some way, and the numbers have to be explored with statistical methods
M. Eder, M. Kestemont, J. Rybicki. ‘Stylo’: a package for stylometric analyses
[Plot: texts as points in a plane with an 'and' axis and a 'the' axis]
And now we can already measure distances between texts
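As a minimal sketch of this idea (not the code behind the slides), the Python snippet below turns two invented toy sentences into points on the 'and' and 'the' axes and measures the Euclidean distance between them. The function names, the tokenizer regex and the sample texts are illustrative assumptions.

```python
import math
import re
from collections import Counter


def relative_frequencies(text, words):
    """Return the share of all tokens in `text` taken up by each word in `words`."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer, an assumption
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0 for _ in words]
    return [counts[w] / total for w in words]


def euclidean(u, v):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


AXES = ["and", "the"]  # one axis per word, as on the slide

# Invented toy texts; real samples would be thousands of words long.
text_1 = "the cat and the dog and the bird sat on the mat"
text_2 = "the king saw the queen and left"

p1 = relative_frequencies(text_1, AXES)  # point for text 1
p2 = relative_frequencies(text_2, AXES)  # point for text 2
distance = euclidean(p1, p2)

print(p1, p2, distance)
```

With more than two words the points simply live in a space with more axes; the distance computation stays the same.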
the and to of a was i in he said you that it his had on at her with as for not him they she were but be have up all out is from them me been what this about into like back my there would we could one now know if their so or no do down your an did by are when who looked more over then see again time just don^t still very think got will off re go eyes than before right here get away thought i^m came too through only long way going face come some can
Can the frequencies of (mostly function) words really tell one author from another?
It becomes quite obvious that samples shorter than 5000 words provide a poor "guessing", because they can be immensely affected by random noise. Below the size of 3000 words, the obtained results are simply disastrous. Other analyzed corpora showed that the critical point of attributive success could be found between 5000 and 10000 words per sample (and there was no significant difference between inflected and non-inflected languages).
Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. Digital Scholarship in the Humanities.
"Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated".
John Burrows, ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship, Literary and Linguistic Computing, Volume 17, Issue 3, September 2002, Pages 267–287, https://doi.org/10.1093/llc/17.3.267
where
can in fact be anything at all
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical clustering dendrograms
'Phylogenetic tree'-like dendrograms
Weighted graphs (weighted networks)
1851: the mathematician A. de Morgan proposes word length as a marker of authorship
1887: Thomas Mendenhall (T. Mendenhall), The Characteristic Curves of Composition, the first known work on quantitative authorship attribution
1880: W. Dittenberger, Sprachliche Kriterien für die Chronologie der Platonischen Dialoge
1890: W. Lutosławski, Principes de stylométrie
1897: W. Lutosławski, The origin and growth of Plato's logic; with an account of Plato's style and of the chronology of his writings
1915: N. A. Morozov, Linguistic Spectra (inspired by Lutosławski)
Word counts are the variables used for discrimination. Since the topic written about heavily influences the rate with which a word is used, care in selection of words is necessary. The filler words of the language such as an, of, and upon, and, more generally, articles, prepositions, and conjunctions provide fairly stable rates, whereas more meaningful words like war, executive, and legislature do not.
Mosteller, F. & Wallace, D. (1963): Inference in an Authorship Problem.
Most readers and critics behave as though common prepositions, conjunctions, personal pronouns, and articles — the parts of speech which make up at least a third of fictional works in English — do not really exist. But far from being a largely inert linguistic mass which has a simple but uninteresting function, these words and their frequency of use can tell us a great deal <...>
J. Burrows, Preface to Computation into Criticism, 1987
2. Normalizing the frequencies with a z-transformation
3. Measuring the Manhattan distance between the normalized vectors
Burrows’s Delta <...> corresponds to the Manhattan distance of the word frequencies' z-scores
Stefan Evert, Thomas Proisl, Fotis Jannidis, Isabella Reger, Steffen Pielström, Christof Schöch, Thorsten Vitt, Understanding and explaining Delta measures for authorship attribution, Digital Scholarship in the Humanities, Volume 32, Issue suppl_2, December 2017, Pages ii4–ii16, https://doi.org/10.1093/llc/fqx023
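The normalization and distance steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the stylo package: the frequency values are invented, the function names are hypothetical, the sample standard deviation is used for the z-scores, and the Manhattan distance is divided by the number of words so that Delta is a mean absolute difference.

```python
import statistics

# Invented relative frequencies of three frequent words, one row per text.
WORDS = ["the", "and", "to"]
FREQS = {
    "A_1": [0.060, 0.030, 0.020],
    "A_2": [0.062, 0.028, 0.021],
    "B_1": [0.045, 0.040, 0.030],
    "B_2": [0.047, 0.041, 0.029],
    "Dubia": [0.061, 0.029, 0.020],
}


def z_scores(freq_table):
    """Standardize every word column: subtract its mean, divide by its std."""
    names = list(freq_table)
    n_words = len(freq_table[names[0]])
    z = {name: [] for name in names}
    for j in range(n_words):
        column = [freq_table[name][j] for name in names]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)  # sample std; an assumption of this sketch
        for name in names:
            value = (freq_table[name][j] - mean) / sd if sd > 0 else 0.0
            z[name].append(value)
    return z


def burrows_delta(z_a, z_b):
    """Manhattan distance between two z-score vectors, averaged over words."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b)) / len(z_a)


z = z_scores(FREQS)
deltas = {
    name: burrows_delta(z["Dubia"], z[name])
    for name in FREQS
    if name != "Dubia"
}
nearest = min(deltas, key=deltas.get)  # text with the smallest Delta to Dubia

for name, value in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.3f}")
print("Nearest text:", nearest)
```

The z-transformation gives every word the same weight regardless of how frequent it is, which is why a rare function word can matter as much as 'the'.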
Who the hell knows...
(Jan Rybicki at one of the stylometry talks during the DH 2019 conference in Utrecht)
Blended Authorship Attribution: Unmasking Elena Ferrante Combining Different Author Profiling Methods (G. Mikros):
all profiling results were highly accurate (over 90%) indicating that the person behind Ferrante is a male, aged over 60, from the region Campania and the town Saviano. The combination of these characteristics indicate a single candidate (among the authors of our corpus), Domenico Starnone.
> library(stylo)  # load the package that provides stylo() and the 'lee' frequency table
> data(lee)
> stylo(frequencies=lee)
Therese "Tay" Hohoff, editor of "To Kill a Mockingbird"
Maciej Eder, Jan Rybicki (2016). Go Set A Watchman while we Kill the Mockingbird in Cold Blood, with Cats and Other People
J. Rybicki, M. Heydel. The stylistics and stylometry of collaborative translation: Woolf’s Night and Day in Polish // Literary and Linguistic Computing 28 (4), 708-717
Choiński, M., Rybicki, J. (2016). Jonathan Edwards and Thomas Foxcroft: In Pursuit of Stylometric Traces of the Editor. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 147-149.
Consecutive segments of Edwards's Mind (1723); throughout the work, Edwards's signal (red) dominates over the (absent) signal of Foxcroft.
Consecutive segments of Edwards's Humble Inquiry (1749); in many other fragments, dominated by Edwards (red), Foxcroft's impact is still visible. The lower band shows the strongest signal; the upper, the second strongest.
Maciej Eder, Jan Rybicki
Mary Westmacott ≈ Robert Galbraith
Skorinkin D., Orekhov B. Hacking stylometry with multiple voices: imaginary writers can override authorial signal in Delta.
In: Digital Scholarship in the Humanities, 2023 (forthcoming)
Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated.
John Burrows, ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship, Literary and Linguistic Computing, Volume 17, Issue 3, September 2002, Pages 267–287, https://doi.org/10.1093/llc/17.3.267
Recording of the talk: dh2025.adho.org/july-16th
Rybicki, J. (2025) ‘Back to Writing after Aphasia: a Stylometric Case Study’, in DH 2025 Book of Abstracts. DH 2025, Lisbon
Comparison with other Polish absurdists who were Mrożek's contemporaries
Arslan, S., Devers, C. and Ferreiro, S.M. (2021) ‘Pronoun processing in post-stroke aphasia: A meta-analytic review of individual data’, Journal of Neurolinguistics, 59, p. 101005.
Barrios Rudloff, J. et al. (2023) ‘Detecting Psychological Disorders with Stylometry’, in Computational Humanities Research. Paris, France. Available at: https://doi.org/10.31234/osf.io/s5cm3.
Trifu, R.N. et al. (2024) ‘Linguistic markers for major depressive disorder: a cross-sectional study using an automated procedure’, Frontiers in Psychology, 15. Available at: https://doi.org/10.3389/fpsyg.2024.1355734.
Ehlen, F. et al. (2023) ‘Linguistic findings in persons with schizophrenia—a review of the current literature’, Frontiers in Psychology, 14. Available at: https://doi.org/10.3389/fpsyg.2023.1287706.
Lancashire, I., & Hirst, G. (2009). “Vocabulary Changes in Agatha Christie’s Mysteries as an Indication of Dementia: A Case Study.” 19th Annual Rotman Research Institute Conference, Cognitive Aging: Research and Practice, 8-10.
Mean sentence length per work is presented in Fig. 1; Mrożek’s post-aphasia numbers do not diverge in any way from those for his ante-aphasia sentence length. The Grade Level for Baltazar is only slightly higher than in his earlier work (Fig. 2). Vocabulary richness of the autobiography is also within the range of the pre-stroke texts (Fig. 3).
[Rybicki, J. (2025) ‘Back to Writing after Aphasia: a Stylometric Case Study’, in DH 2025 Book of Abstracts. DH 2025, Lisbon]
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical clustering dendrograms
'Phylogenetic tree'-like dendrograms
Weighted graphs (weighted networks)
Avoided words in women's speech, against the background of men's speech
Preferred words in women's speech, against the background of men's speech
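The slides do not say how these word lists were computed. One widely used technique for contrasting two groups of texts is Craig's Zeta, and the Python sketch below assumes that measure purely for illustration. Each text is cut into segments; for every word, Zeta adds the share of primary-group segments that contain the word to the share of secondary-group segments that lack it. The segments and words here are invented, and the function name is hypothetical.

```python
def craig_zeta(primary_segments, secondary_segments):
    """Score every word from 0 (avoided by the primary group) to 2 (preferred)."""
    vocabulary = set().union(*primary_segments, *secondary_segments)
    scores = {}
    for word in vocabulary:
        # Share of primary segments in which the word occurs at least once.
        share_primary = sum(word in seg for seg in primary_segments) / len(primary_segments)
        # Share of secondary segments in which the word does not occur.
        share_absent = sum(word not in seg for seg in secondary_segments) / len(secondary_segments)
        scores[word] = share_primary + share_absent
    return scores


# Invented toy segments, each reduced to the set of words it contains.
primary = [{"she", "said", "little"}, {"she", "very", "said"}]
secondary = [{"he", "said", "sir"}, {"he", "the", "sir"}]

scores = craig_zeta(primary, secondary)
ranked = sorted(scores, key=scores.get, reverse=True)

preferred = [w for w in ranked if scores[w] > 1.0]  # over-represented in primary
avoided = [w for w in ranked if scores[w] < 1.0]    # under-represented in primary

print("preferred:", preferred)
print("avoided:", avoided)
```

Because only presence or absence within a segment is counted, the measure tends to highlight mid-frequency content words rather than the very frequent function words that drive Delta.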