Quantitative authorship attribution and beyond
Daniil Skorinkin, DH Network Potsdam
The use of frequencies of certain atomic elements of texts (most often words or their fragments) to compare texts with one another, for example for authorship attribution, but not only for that
(this is my working technical definition)
Stylometric studies, in all their variety, share two features: texts must somehow be converted into numbers, and the numbers must be examined with statistical methods
M. Eder, M. Kestemont, J. Rybicki. ‘Stylo’: a package for stylometric analyses
[Figure: texts plotted as points on two axes, the frequency of 'and' and the frequency of 'the']
And now we can already measure distances between texts
Can the frequencies of (mostly function) words really tell one author from another?
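To make the idea concrete, here is a minimal Python sketch (not part of stylo) that places two texts in the two-dimensional space of 'and' and 'the' frequencies and measures the distance between them. The two toy sentences and the helper name `relative_frequencies` are made up for illustration, and whitespace tokenization is a simplifying assumption; real analyses need thousands of words per sample.

```python
import math
from collections import Counter

def relative_frequencies(text, words):
    """Return, for each word in `words`, its share of all tokens in `text`."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in words]

# Toy "texts", invented for illustration only.
text_a = "the cat and the dog and the bird"
text_b = "the cat sat on the mat and slept"

axes = ["and", "the"]  # one axis per word
point_a = relative_frequencies(text_a, axes)
point_b = relative_frequencies(text_b, axes)

# Two common ways to measure how far apart the points are.
euclidean = math.dist(point_a, point_b)
manhattan = sum(abs(a - b) for a, b in zip(point_a, point_b))

print("text A:", point_a)
print("text B:", point_b)
print("Euclidean:", euclidean, "Manhattan:", manhattan)
```

With more words the same logic applies unchanged: each additional word adds one more axis to the space.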
It becomes quite obvious that samples shorter than 5000 words provide a poor "guessing", because they can be immensely affected by random noise. Below the size of 3000 words, the obtained results are simply disastrous. Other analyzed corpora showed that the critical point of attributive success could be found between 5000 and 10000 words per sample (and there was no significant difference between inflected and non-inflected languages).
Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem.
Digital Scholarship in the Humanities.
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical clustering dendrograms
'Phylogenetic tree'-like dendrograms
Weighted graphs (weighted networks)
1851: the mathematician A. de Morgan proposes word length as a marker of authorship
1887: Thomas Mendenhall, The Characteristic Curves of Composition, the first known work on quantitative authorship attribution
1880: W. Dittenberger, Sprachliche Kriterien für die Chronologie der Platonischen Dialoge
1890: W. Lutosławski, Principes de stylométrie
1897: W. Lutosławski, The origin and growth of Plato's logic; with an account of Plato's style and of the chronology of his writings
1915: N. A. Morozov, Lingvisticheskie spektry (Linguistic Spectra)
(inspired by Lutosławski)
In summary, the following points are clear:
Most readers and critics behave as though common prepositions, conjunctions, personal pronouns, and articles — the parts of speech which make up at least a third of fictional works in English — do not really exist. But far from being a largely inert linguistic mass which has a simple but uninteresting function, these words and their frequency of use can tell us a great deal <...>
Preface to Computation into Criticism, 1987
2. Normalize the frequencies with a z-transformation (z-scores)
3. Measure the Manhattan distance between the normalized vectors
Burrows’s Delta <...> corresponds to the Manhattan distance of the word frequencies' z-scores
Stefan Evert, Thomas Proisl, Fotis Jannidis, Isabella Reger, Steffen Pielström, Christof Schöch, Thorsten Vitt,
Understanding and explaining Delta measures for authorship attribution, Digital Scholarship in the Humanities, Volume 32, Issue suppl_2, December 2017, Pages ii4–ii16, https://doi.org/10.1093/llc/fqx023
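The two steps above (z-scoring, then Manhattan distance) can be sketched in a few lines of Python. This is an illustrative sketch, not the stylo implementation: the frequency values are invented, the function names `z_scores` and `delta` are hypothetical, and two details are assumptions, namely the use of the sample standard deviation and the division of the Manhattan distance by the number of words (which does not change the ranking of distances).

```python
from statistics import mean, stdev

def z_scores(freq_table):
    """Z-score each word (column) across all texts (rows)."""
    columns = list(zip(*freq_table))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample SD: an assumption
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in freq_table
    ]

def delta(z_a, z_b):
    """Manhattan distance between two z-score vectors, averaged over words."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b)) / len(z_a)

# Invented relative frequencies (in %) of three frequent words in three texts.
names = ["A1", "A2", "B1"]
frequencies = [
    [6.0, 3.0, 2.0],  # A1
    [6.2, 3.1, 1.9],  # A2: deliberately close to A1
    [4.0, 4.0, 3.0],  # B1: deliberately different
]

z = dict(zip(names, z_scores(frequencies)))

print("Delta(A1, A2):", delta(z["A1"], z["A2"]))
print("Delta(A1, B1):", delta(z["A1"], z["B1"]))
```

In this toy table the two "A" texts end up much closer to each other than either is to "B1", which is the behaviour an attribution setup relies on. The z-scoring step is what keeps a very frequent word such as 'the' from dominating the distance simply because its raw frequencies are larger.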
Who the hell knows...
(Jan Rybicki, during one of the stylometry talks at the DH 2019 conference in Utrecht)
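> library(stylo)            # load the package; the lee dataset ships with it
> data(lee)                 # load the precomputed word-frequency table
> stylo(frequencies = lee)  # analyze the table directly instead of raw text files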
Therese "Tay" Hohoff, editor of To Kill a Mockingbird
Blended Authorship Attribution: Unmasking Elena Ferrante Combining Different Author Profiling Methods (G. Mikros):
all profiling results were highly accurate (over 90%) indicating that the person behind Ferrante is a male, aged over 60, from the region Campania and the town Saviano. The combination of these characteristics indicate a single candidate (among the authors of our corpus), Domenico Starnone.
Maciej Eder, Jan Rybicki (2016). Go Set A Watchman while we Kill the Mockingbird in Cold Blood, with Cats and Other People
Mary Westmacott ≈ Robert Galbraith
Skorinkin D., Orekhov B. Hacking stylometry with multiple voices: imaginary writers can override authorial signal in Delta.
In: Digital Scholarship in the Humanities, 2023 (forthcoming)
Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated.
John Burrows, ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship, Literary and Linguistic Computing, Volume 17, Issue 3, September 2002, Pages 267–287, https://doi.org/10.1093/llc/17.3.267
J. Rybicki, M. Heydel. The stylistics and stylometry of collaborative translation: Woolf’s Night and Day in Polish // Literary and Linguistic Computing 28 (4), 708-717
Choiński, M., Rybicki, J. (2016). Jonathan Edwards and Thomas Foxcroft: In Pursuit of Stylometric Traces of the Editor. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 147-149.
Consecutive segments of Edwards's Mind (1723); throughout the work, Edwards's signal (red) dominates over the (absent) signal of Foxcroft.
Consecutive segments of Edwards's Humble Inquiry (1749); in many other fragments, dominated by Edwards (red), Foxcroft's impact is still visible. The lower band shows the strongest signal; the upper, the second strongest.
Dimensionality reduction methods (PCA, MDS, t-SNE, etc.)
Hierarchical clustering dendrograms
'Phylogenetic tree'-like dendrograms
Weighted graphs (weighted networks)
the stylo() function
the classify() function
avoided words in female speech compared with male speech
preferred words in female speech compared with male speech