An Ensemble-Rich Multi-Aspect Approach for Robust Style Change Detection
D. Zlatkova, D. Kopev, K. Mitov, A. Atanasov, M. Hardalov, I. Koychev
Sofia University, Bulgaria
P. Nakov
Qatar Computing Research Institute, HBKU, Doha, Qatar
PAN at CLEF-2018
10-14 Sept. 2018, CLEF, Avignon
The Task
Related Work
- General approaches for Style Breach Detection:
- unsupervised methods
- stylometry and TF-IDF features
- Wilcoxon Signed Rank test to check whether two segments are likely to come from the same distribution (Karaś et al.)
- Outlier detection with a cosine-based distance between sentence vectors from pre-trained skip-thought models (Safin and Kuznetsova)
Data Preprocessing
- Special tokens
- http://www.java2s.com -> _URL_
- 66657345299563332126532111111 -> _LONG_NUM_
- /Users/Shared/Client/Blizzard -> _FILE_PATH_
- ================== -> _CHAR_SEQ_
- Taumatawhakatangihangakoauauo -> _LONG_WORD_
- Split hyphenated words
- Pretends-To-Be-Scrum-But-Actually-Is-Not-Even-Agile
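The replacements above can be sketched with a few regular expressions. This is only an illustration: the patterns and the length thresholds (10+ digits, 5+ repeated symbols, 25+ letters) are assumptions, not values from the slides, and the token names are written with a trailing underscore throughout.

```python
import re

# Illustrative rules only; every pattern and threshold here is an assumption.
RULES = [
    (re.compile(r"https?://\S+"), "_URL_"),
    (re.compile(r"(?<!\w)(?:/[\w.\-]+){2,}"), "_FILE_PATH_"),
    (re.compile(r"\d{10,}"), "_LONG_NUM_"),
    (re.compile(r"([^\w\s])\1{4,}"), "_CHAR_SEQ_"),
    (re.compile(r"[A-Za-z]{25,}"), "_LONG_WORD_"),
]


def split_hyphenated(text):
    """Replace hyphens that sit between two letters with spaces."""
    return re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", " ", text)


def preprocess(text):
    """Apply the special-token rules in order, then split hyphenated words."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return split_hyphenated(text)


print(preprocess("Pretends-To-Be-Scrum at http://www.java2s.com"))
```

The URL rule runs before the file-path rule so that the slashes inside a URL are not mistaken for a path.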
Text Segmentation
- Sliding Window
- 1/3 overlap
- Window size: 1/3 of doc length
- Max diff of feature vectors
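A minimal sketch of this segmentation, under two assumptions the slide does not spell out: the overlap is one third of a window, and "max diff" is taken per feature as the largest gap between any two windows. The `featurize` argument is a hypothetical function that maps a window to a numeric feature vector.

```python
def sliding_windows(tokens, size_frac=1 / 3, overlap_frac=1 / 3):
    """Split tokens into windows of size_frac * len(tokens); consecutive
    windows share overlap_frac of a window."""
    size = max(1, round(len(tokens) * size_frac))
    step = max(1, round(size * (1 - overlap_frac)))
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += step
    return windows


def max_feature_diff(windows, featurize):
    """Per feature, the largest gap between any two windows (max - min)."""
    vectors = [featurize(w) for w in windows]
    return [max(col) - min(col) for col in zip(*vectors)]


tokens = "a short text followed by considerably lengthier vocabulary afterwards".split()
windows = sliding_windows(tokens)
# Toy feature vector: average word length in the window.
print(max_feature_diff(windows, lambda w: [sum(map(len, w)) / len(w)]))
```

A large value for a feature means at least two parts of the document differ strongly on it, which is the signal a document-level classifier can use.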
Lexical Features
Characters:
- spaces
- digits
- commas
- (semi)colons
- apostrophes
- quotes
- parentheses
- number of paragraphs
Words:
- POS-tags
- short (< 4 chars)
- long (> 6 chars)
- average length
- all-caps
- capitalized
Sentences:
- question
- period
- exclamation
- short (< 100 chars)
- long (> 200 chars)
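A few of these lexical features can be computed with the standard library alone. The sketch below is an assumption about how such counts might be normalized (by number of characters, words, or sentences); the tokenization and sentence splitting are deliberately naive, and POS-tags are left out because they need a tagger.

```python
import re


def lexical_features(text):
    """Return a small dictionary of normalized lexical counts."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return {
        # Character level
        "spaces": text.count(" ") / n_chars,
        "digits": sum(c.isdigit() for c in text) / n_chars,
        "commas": text.count(",") / n_chars,
        # Word level
        "short_words": sum(len(w) < 4 for w in words) / n_words,
        "long_words": sum(len(w) > 6 for w in words) / n_words,
        "avg_word_len": sum(map(len, words)) / n_words,
        "all_caps": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        "capitalized": sum(w[0].isupper() and not w.isupper() for w in words) / n_words,
        # Sentence level
        "question_sents": sum(s.endswith("?") for s in sentences) / n_sents,
        "period_sents": sum(s.endswith(".") for s in sentences) / n_sents,
        "exclamation_sents": sum(s.endswith("!") for s in sentences) / n_sents,
        "short_sents": sum(len(s) < 100 for s in sentences) / n_sents,
        "long_sents": sum(len(s) > 200 for s in sentences) / n_sents,
    }


print(lexical_features("Hello, WORLD! Is it 42?"))
```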
More Features
- Stop words: you, the, is, of, ...
- Function words: least, well, etc, whether, ...
- Readability, e.g. Flesch reading ease
- Vocabulary richness
- Average word frequency class
- frequency class of 'the' is 1
- frequency class of 'doppelganger' is 19
- Proportion of unknown words (not in corpus)
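For reference, the standard Flesch reading ease score and one common definition of word frequency class are given below. Here $f(w)$ is the corpus frequency of word $w$ and $w^{*}$ is the most frequent word in the corpus. Under this definition the most frequent word falls in class 0, so the class values listed above suggest the authors used a shifted or otherwise different variant; the exact formula they used is not stated here.

```latex
\mathrm{FRE} = 206.835
  - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}}
  - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}

c(w) = \left\lfloor \log_2 \frac{f(w^{*})}{f(w)} \right\rfloor
```

Higher FRE values indicate easier text; a higher frequency class indicates a rarer word.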
Even More Features
- Repetition
- average number of occurrences of unigrams, bigrams, ..., 5-grams
- Grammar Contractions
- I will vs. I'll
- are not vs. aren't
- Quotation variation: ' vs. "
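These three features can be sketched as follows. The contraction list is a tiny illustrative sample, the ratio definitions are assumptions, and the single-quote pattern is a heuristic that tries to skip apostrophes inside words.

```python
import re
from collections import Counter

# Tiny illustrative sample; a real list would be much longer.
CONTRACTION_PAIRS = {"i will": "i'll", "are not": "aren't"}


def avg_ngram_occurrences(tokens, n):
    """Average number of occurrences per distinct n-gram."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(grams.values()) / len(grams) if grams else 0.0


def _count(phrase, text):
    return len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))


def contraction_ratio(text):
    """Share of contracted forms among all contracted and expanded forms."""
    lower = text.lower()
    full = sum(_count(f, lower) for f in CONTRACTION_PAIRS)
    short = sum(_count(s, lower) for s in CONTRACTION_PAIRS.values())
    total = full + short
    return short / total if total else 0.0


def quote_counts(text):
    """Count single and double quotation marks, skipping in-word apostrophes."""
    single = len(re.findall(r"(?<![A-Za-z])'|'(?![A-Za-z])", text))
    return {"single": single, "double": text.count('"')}


tokens = "a b a b a".split()
print([avg_ngram_occurrences(tokens, n) for n in range(1, 6)])
```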
LightGBM + TF-IDF
- Character [2-6]-grams (up to 300k)
- Word [1-2]-grams (up to 300k)
- Logistic Regression for feature selection
- Parameter tuning to avoid overfitting
- Bagging
- Training TF-IDF on test documents
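To make the first bullet concrete, here is a standard-library sketch of character [2-6]-gram TF-IDF with a vocabulary cap. It covers only the text representation; feature selection, bagging, and the LightGBM classifier are not shown. The smoothed IDF formula and the choice of the most frequent n-grams for the cap are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=6):
    """All character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf(docs, max_features=300_000):
    """Sparse TF-IDF vectors (dicts) over the most frequent n-grams."""
    counts = [Counter(char_ngrams(d)) for d in docs]
    totals, doc_freq = Counter(), Counter()
    for c in counts:
        totals.update(c)          # corpus-wide term counts
        doc_freq.update(c.keys()) # number of documents containing each term
    vocab = [g for g, _ in totals.most_common(max_features)]
    n_docs = len(docs)
    # Smoothed inverse document frequency (assumed variant).
    idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1 for g in vocab}
    return [{g: c[g] * idf[g] for g in vocab if g in c} for c in counts]


vectors = tfidf(["abab", "abcd"])
print(vectors[0]["ab"], vectors[1]["ab"])
```

Because TF-IDF fitting needs no labels, the vocabulary and IDF weights can also be computed over unlabeled test documents, which is what the last bullet refers to.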
Stacking
Results
Classifier | Dataset | Accuracy (%) |
---|---|---|
MLP w/ TF-IDF (Baseline) | validation | 70.64 |
LightGBM w/ TF-IDF | validation | 86.53 |
Stacking | validation | 80.47 |
Stacking w/ LightGBM | validation | 87.00 |
Stacking w/ LightGBM | test | 89.35 |
Style Breach Detection
- PAN 2017 dataset
- 134 training examples
- 0 to 8 breaches
- use the developed supervised method
- search for breaches recursively
- outperforms baseline models
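The slides do not say how the recursive search is organized. One possible reading, shown purely as an assumption, is a binary search: if the classifier reports a style change in a segment, split the segment in half and recurse; if neither half shows a change on its own, place the breach at the split point. `has_change` is a hypothetical wrapper around the trained style change classifier.

```python
def find_breaches(sentences, has_change, lo=0, hi=None, min_len=2):
    """Return sentence indices at which a new style is assumed to start.

    has_change(list_of_sentences) -> bool is a hypothetical classifier wrapper.
    """
    if hi is None:
        hi = len(sentences)
    if hi - lo < min_len or not has_change(sentences[lo:hi]):
        return []
    mid = (lo + hi) // 2
    left = find_breaches(sentences, has_change, lo, mid, min_len)
    right = find_breaches(sentences, has_change, mid, hi, min_len)
    if not left and not right:
        # The whole segment shows a change but neither half does,
        # so the breach is placed at the split point.
        return [mid]
    return left + right


# Toy example: each "sentence" is just its author label.
toy = ["A", "A", "A", "B", "B", "B", "B", "B"]
print(find_breaches(toy, lambda seg: len(set(seg)) > 1))
```

This simple scheme can miss a breach that falls exactly on a split point when another breach lies inside one of the halves, which is one reason exact breach detection needs more care than document-level detection.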
Conclusion
- High accuracy for Style Change Detection is achievable.
- Ensembles perform best.
- Using a supervised method to detect exact breaches is promising, but needs further work.
References
- Karaś, D., Śpiewak, M., Sobecki, P.: OPI-JSA at CLEF 2017: Author Clustering and Style Breach Detection—Notebook for PAN at CLEF 2017.
- Safin, K., Kuznetsova, R.: Style breach detection with neural sentence embeddings—notebook for PAN at CLEF 2017.
- Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection.
CLEF
By Dimitrina Zlatkova