An Ensemble-Rich Multi-Aspect Approach for Robust Style Change Detection

D. Zlatkova, D. Kopev, K. Mitov, A. Atanasov, M. Hardalov, I. Koychev

Sofia University, Bulgaria

P. Nakov

Qatar Computing Research Institute, HBKU, Doha, Qatar

PAN at CLEF-2018

10-14 Sept. 2018, CLEF, Avignon

The Task

Related Work

  • General approaches for Style Breach Detection:
    • unsupervised methods
    • stylometry and TF-IDF features
  • Wilcoxon signed-rank test to check whether two segments are likely to come from the same distribution (Karaś et al.)
  • Outlier detection with a cosine-based distance between sentence vectors from pre-trained skip-thought models (Safin and Kuznetsova)

Data Preprocessing

  • Special tokens
    • http://www.java2s.com -> _URL_
    • 66657345299563332126532111111 -> _LONG_NUM_
    • /Users/Shared/Client/Blizzard -> _FILE_PATH
    • ================== -> _CHAR_SEQ
    • Taumatawhakatangihangakoauauo -> _LONG_WORD_
  • Split hyphenated words
    • Pretends-To-Be-Scrum-But-Actually-Is-Not-Even-Agile
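
The replacements above can be sketched with a few regular expressions. This is only an illustration: the slides give one example per token, so every pattern and threshold below (10+ digits, 5+ repeated symbols, 25+ letters) is an assumption, as are the names `RULES` and `preprocess`. The token strings are copied from the slide as written.

```python
import re

# Assumed patterns and thresholds; the slides do not specify them.
# Order matters: URLs are replaced first so their slashes are not read as paths.
RULES = [
    (re.compile(r"https?://\S+"), "_URL_"),
    (re.compile(r"(?<![\w/])(?:/[\w.\-]+){2,}"), "_FILE_PATH"),
    (re.compile(r"\b\d{10,}\b"), "_LONG_NUM_"),
    (re.compile(r"([^\w\s])\1{4,}"), "_CHAR_SEQ"),
    (re.compile(r"\b[^\W\d_]{25,}\b"), "_LONG_WORD_"),
]

# A hyphen with a letter on both sides is treated as a word joiner.
HYPHEN = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")


def preprocess(text: str) -> str:
    """Replace noisy spans with special tokens, then split hyphenated words."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return HYPHEN.sub(" ", text)
```

For example, `preprocess("Pretends-To-Be-Scrum")` yields `"Pretends To Be Scrum"`, so each part is counted as a separate word by later features.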

Text Segmentation

  • Sliding Window
  • 1/3 overlap
  • Window size: 1/3 of doc length
  • Max diff of feature vectors
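
A minimal sketch of this segmentation follows. It assumes the overlap is one third of the window, and it reads "max diff of feature vectors" as the largest per-feature gap between any two windows; both readings are assumptions, and `sliding_windows`, `max_feature_diff`, and the `featurize` callback are hypothetical names.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str],
                    size_frac: float = 1 / 3,
                    overlap_frac: float = 1 / 3) -> List[Sequence[str]]:
    """Cut a document into overlapping windows of about size_frac of its length."""
    if not tokens:
        return []
    size = max(1, round(len(tokens) * size_frac))
    step = max(1, size - round(size * overlap_frac))
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += step
    return windows


def max_feature_diff(windows: List[Sequence[str]],
                     featurize: Callable[[Sequence[str]], List[float]]) -> List[float]:
    """Per feature, the largest absolute difference between any two windows."""
    vectors = [featurize(w) for w in windows]
    # max - min over windows equals the maximum pairwise absolute difference.
    return [max(col) - min(col) for col in zip(*vectors)]
```

A document written by one author should give small differences across windows, while a style change should inflate at least some of them.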

Lexical Features

  • Characters:
    • spaces
    • digits
    • commas
    • (semi)colons
    • apostrophes
    • quotes
    • parentheses
  • Number of paragraphs
  • POS-tags
  • Words:
    • short (< 4 chars)
    • long (> 6 chars)
    • average length
    • all-caps
    • capitalized
  • Sentences:
    • question
    • period
    • exclamation
    • short (< 100 chars)
    • long (> 200 chars)
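
Several of these features can be computed with the standard library alone, as in the sketch below. Normalizing by the number of characters, words, or sentences is an assumption (the slides do not say whether raw counts or ratios were used), the tokenization is deliberately naive, POS-tags are left out because they need a tagger, and the function name `lexical_features` is hypothetical.

```python
import re


def lexical_features(text: str) -> dict:
    """Character-, word-, and sentence-level ratios for one text segment."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)

    def chars(symbols: str) -> float:
        return sum(text.count(c) for c in symbols) / n_chars

    def share(items, predicate, total) -> float:
        return sum(1 for item in items if predicate(item)) / total

    return {
        "char_spaces": chars(" "),
        "char_digits": chars("0123456789"),
        "char_commas": chars(","),
        "char_colons": chars(":;"),
        "char_apostrophes": chars("'"),
        "char_quotes": chars('"'),
        "char_parentheses": chars("()"),
        "paragraphs": len(paragraphs),
        "word_short": share(words, lambda w: len(w) < 4, n_words),
        "word_long": share(words, lambda w: len(w) > 6, n_words),
        "word_avg_len": sum(map(len, words)) / n_words,
        "word_all_caps": share(words, lambda w: len(w) > 1 and w.isupper(), n_words),
        "word_capitalized": share(words, lambda w: w[:1].isupper() and w[1:].islower(), n_words),
        "sent_question": share(sentences, lambda s: s.endswith("?"), n_sents),
        "sent_period": share(sentences, lambda s: s.endswith("."), n_sents),
        "sent_exclamation": share(sentences, lambda s: s.endswith("!"), n_sents),
        "sent_short": share(sentences, lambda s: len(s) < 100, n_sents),
        "sent_long": share(sentences, lambda s: len(s) > 200, n_sents),
    }
```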

More Features

  • Stop words: you, the, is, of, ...
  • Function words: least, well, etc, whether, ...
  • Readability, e.g., Flesch reading ease
  • Vocabulary richness
    • Average word frequency class
      • frequency class of 'the' is 1
      • frequency class of 'doppelganger' is 19
    • Proportion of unknown words (not in corpus)
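
As an illustration of the readability feature, the standard Flesch reading ease score is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies that formula; the syllable counter is a rough vowel-group heuristic chosen for brevity, not something the slides describe, and both function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch reading ease; higher scores mean easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

A production system would likely use a pronunciation dictionary for syllables; the heuristic miscounts words with silent vowels.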

Even More Features

  • Repetition
    • average number of occurrences of unigrams, bigrams, ..., 5-grams
  • Grammar Contractions
    • I will vs. I'll
    • are not vs. aren't
  • Quotation variation: ' vs. "
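
The repetition feature can be sketched as follows, reading "average number of occurrences" as the total number of n-grams divided by the number of distinct n-grams. That reading, and the names `avg_ngram_occurrences` and `repetition_features`, are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def avg_ngram_occurrences(tokens: Sequence[str], n: int) -> float:
    """Mean number of times each distinct n-gram occurs in the token list."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not grams:
        return 0.0
    return sum(grams.values()) / len(grams)


def repetition_features(tokens: Sequence[str], max_n: int = 5) -> List[float]:
    """One value per n-gram order, from unigrams up to max_n-grams."""
    return [avg_ngram_occurrences(tokens, n) for n in range(1, max_n + 1)]
```

A value of 1.0 means no n-gram of that order repeats; larger values indicate a more repetitive writer.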

LightGBM + TF-IDF

  • Character [2-6]-grams (up to 300k)
  • Word [1-2]-grams (up to 300k)
  • Logistic Regression for feature selection
  • Parameter tuning to avoid overfitting
  • Bagging
  • Training TF-IDF on test documents
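
The feature side of this model can be sketched with scikit-learn. This is an assumption-heavy illustration: the slides do not say how logistic regression was used for selection, so the median-importance threshold is a placeholder, and `build_tfidf_selector` is a hypothetical name. The LightGBM classifier itself is left out; it would be trained on the matrix this pipeline produces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_tfidf_selector(max_features: int = 300_000) -> Pipeline:
    """TF-IDF over character 2-6-grams and word 1-2-grams, then feature selection."""
    tfidf = FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 6),
                                 max_features=max_features)),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 max_features=max_features)),
    ])
    # Assumed selection rule: keep features whose absolute logistic-regression
    # coefficient is at least the median over all features.
    selector = SelectFromModel(LogisticRegression(max_iter=1000), threshold="median")
    return Pipeline([("tfidf", tfidf), ("select", selector)])
```

Calling `build_tfidf_selector().fit_transform(docs, labels)` returns a sparse matrix with one row per document, which can then be passed to a gradient-boosted classifier.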

Stacking

Results

Classifier                  Dataset      Accuracy (%)
MLP w/ TF-IDF (Baseline)    validation   70.64
LightGBM w/ TF-IDF          validation   86.53
Stacking                    validation   80.47
Stacking w/ LightGBM        validation   87.00
Stacking w/ LightGBM        test         89.35

Results

Style Breach Detection

  • PAN 2017 dataset
    • 134 training examples
    • 0 to 8 breaches
  • use the developed supervised method
  • search for breaches recursively
  • outperforms baseline models
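
The slides do not spell out how the recursive search works. One hypothetical reading is a bisection: if the supervised classifier flags a segment as containing a style change, split it in half and examine each half; when neither half is flagged, place the breach at the split point. In the sketch below, `has_change` stands in for the trained classifier, and `find_breaches` and `min_len` are illustrative names, not the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence


def find_breaches(sentences: Sequence[str],
                  has_change: Callable[[Sequence[str]], bool],
                  lo: int = 0,
                  hi: Optional[int] = None,
                  min_len: int = 2) -> List[int]:
    """Return sentence indices at which a new style is assumed to begin."""
    if hi is None:
        hi = len(sentences)
    # Stop when the segment is too short or the classifier sees no change.
    if hi - lo < min_len or not has_change(sentences[lo:hi]):
        return []
    mid = (lo + hi) // 2
    left = find_breaches(sentences, has_change, lo, mid, min_len)
    right = find_breaches(sentences, has_change, mid, hi, min_len)
    if not left and not right:
        # The change is visible only across the split, so it sits at mid.
        return [mid]
    return left + right
```

This simple version misses a breach that falls exactly on a split point whenever one of the halves also contains a breach, which is one reason a sketch like this would need further work before real use.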

Conclusion

  • High accuracy for Style Change Detection is achievable.
  • Ensembles perform best.
  • Using a supervised method to detect exact breaches is promising, but needs further work.

References

  1. Karaś, D., Śpiewak, M., Sobecki, P.: OPI-JSA at CLEF 2017: Author Clustering and Style Breach Detection. Notebook for PAN at CLEF 2017.
  2. Safin, K., Kuznetsova, R.: Style Breach Detection with Neural Sentence Embeddings. Notebook for PAN at CLEF 2017.
  3. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection.

CLEF

By Dimitrina Zlatkova