PAN 2018

Style Change Detection

Атанас, 25858         Димитрина, 25620
Даниел, 25830         Кристиян, 25625

The Task

Today's topics:

  • External Data
  • Features
  • Gradient Boosting Machines
  • LightGBM
  • Stacking
  • Style Breach Detection
  • Results

Data

  • PAN
    • StackExchange forum
    • 3k train, 1.5k validation, 1.5k test
    • 0 to 3 breaches
  • External Movies
    • Movie reviews from Amazon
    • 270k
    • 0 or 1 breach
  • External StackExchange
    • 35 StackExchange sites
    • 50k
    • 0 to 2 breaches

Text Segmentation

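  • Sliding window over the words of the text
  • Segment size: number of words / 3
  • Overlap between neighbouring segments: 1/3
  • Feature value: max difference between the segments' feature vectors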
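The segmentation above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the function names, the reading of "1/3 overlap" as "neighbouring windows share a third of a window", and the per-feature max-minus-min reading of "max diff" are all assumptions.

```python
from typing import Callable, List, Sequence


def sliding_segments(words: Sequence[str]) -> List[List[str]]:
    """Cut `words` into windows of len(words) // 3 words with 1/3 overlap."""
    size = max(1, len(words) // 3)
    # Assumption: neighbouring windows share one third of a window.
    step = max(1, size - size // 3)
    segments = []
    for start in range(0, len(words), step):
        segments.append(list(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return segments


def max_feature_diff(
    segments: Sequence[Sequence[str]],
    featurize: Callable[[Sequence[str]], List[float]],
) -> List[float]:
    """Per feature, the largest difference between any two segments."""
    vectors = [featurize(segment) for segment in segments]
    return [max(column) - min(column) for column in zip(*vectors)]


# Hypothetical toy featurizer: segment length and longest word length.
def toy_features(segment: Sequence[str]) -> List[float]:
    return [len(segment), max(len(word) for word in segment)]


segments = sliding_segments("a b c d e f g h i".split())
diffs = max_feature_diff(segments, toy_features)
```

A large value in `diffs` suggests that at least two parts of the document differ strongly on that feature, which is the signal a style change classifier can pick up.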

Features (1)

Characters:

  • spaces
  • digits
  • commas
  • (semi)colons
  • apostrophes
  • quotes
  • parentheses

Words:

  • POS-tags
  • short (< 4 chars)
  • long (> 6 chars)
  • average length
  • all-caps
  • capitalized

Sentences:

  • question
  • period
  • exclamation
  • short (< 100 chars)
  • long (> 200 chars)

  • number of paragraphs
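A rough sketch of how the character, word and sentence features could be computed. The normalisation (counts divided by the number of characters, words or sentences), the regular expressions and all function names are assumptions. POS-tags are left out because they need an external tagger.

```python
import re


def ratio(count: float, total: int) -> float:
    """Safe division so that empty texts give 0.0 instead of an error."""
    return count / total if total else 0.0


def character_features(text: str) -> dict:
    n = len(text)
    return {
        "spaces": ratio(text.count(" "), n),
        "digits": ratio(sum(ch.isdigit() for ch in text), n),
        "commas": ratio(text.count(","), n),
        "colons": ratio(text.count(":") + text.count(";"), n),
        "apostrophes": ratio(text.count("'"), n),
        "quotes": ratio(text.count('"'), n),
        "parentheses": ratio(text.count("(") + text.count(")"), n),
    }


def word_features(text: str) -> dict:
    words = re.findall(r"[A-Za-z]+", text)  # assumption: ASCII letters only
    n = len(words)
    return {
        "short": ratio(sum(len(w) < 4 for w in words), n),
        "long": ratio(sum(len(w) > 6 for w in words), n),
        "avg_length": ratio(sum(len(w) for w in words), n),
        "all_caps": ratio(sum(w.isupper() and len(w) > 1 for w in words), n),
        "capitalized": ratio(sum(w.istitle() for w in words), n),
    }


def sentence_features(text: str) -> dict:
    # Assumption: a sentence ends at the first '.', '!' or '?'.
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    n = len(sentences)
    return {
        "question": ratio(sum(s.endswith("?") for s in sentences), n),
        "period": ratio(sum(s.endswith(".") for s in sentences), n),
        "exclamation": ratio(sum(s.endswith("!") for s in sentences), n),
        "short": ratio(sum(len(s) < 100 for s in sentences), n),
        "long": ratio(sum(len(s) > 200 for s in sentences), n),
    }
```

Each dictionary can be flattened into one feature vector per segment or per document.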

Features (2)

  • Stop words: you, the, is, of
  • Function words: least, well, etc, whether
  • Readability, e.g. Flesch reading ease
  • Vocabulary richness
  • British/American English spelling

(formula not preserved), where f is a frequency function
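For reference, the standard Flesch reading ease score is usually defined as below. Whether the features used exactly this variant is an assumption.

```latex
\[
\mathrm{FRE} = 206.835
  - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}}
  - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}
\]
```

Higher scores indicate text that is easier to read. Long sentences and words with many syllables lower the score.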

Gradient Boosting Machines

LightGBM

  • Leaf-wise (Best-first) Tree Growth
  • Gradient-based One-Side Sampling
  • Exclusive Feature Bundling
  • Optimizations in Speed and Memory Usage
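As a rough illustration of how these ideas surface as LightGBM training parameters, here is a sketch of a parameter dictionary. The values are placeholders chosen for illustration, not the tuned settings behind the reported results.

```python
# Illustrative LightGBM parameters; every value here is an assumption.
params = {
    "objective": "binary",     # style change present: yes / no
    "boosting_type": "gbdt",   # gradient boosted decision trees
    "num_leaves": 31,          # leaf-wise growth is limited by the leaf count ...
    "max_depth": 8,            # ... and can be capped by depth to curb overfitting
    "learning_rate": 0.05,
    "enable_bundle": True,     # Exclusive Feature Bundling (enabled by default)
    "bagging_fraction": 0.8,   # row subsampling
    "bagging_freq": 1,         # resample the rows at every iteration
    "feature_fraction": 0.8,   # column subsampling
}
# Gradient-based One-Side Sampling is switched on through a boosting or
# sampling option whose name differs between LightGBM releases, so it is
# left out of this sketch.
```

With leaf-wise growth, `num_leaves` is the main complexity control, so it is usually kept well below `2 ** max_depth`.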

LightGBM + TF-IDF

  • Character [2-6]-grams (up to 300k)
  • Word [1-2]-grams (up to 300k)
  • LogisticRegression for feature selection
  • Parameter tuning
  • Bagging
  • Train TF-IDF on test set???
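The character n-gram TF-IDF step can be sketched with the standard library alone. This is a simplified illustration: a real pipeline would more likely use a library vectorizer, and the smoothed IDF formula, the L2 normalisation and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def char_ngrams(text: str, n_min: int = 2, n_max: int = 6) -> List[str]:
    """All character n-grams of `text` for n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def fit_idf(docs: Sequence[str], max_features: int = 300_000) -> Dict[str, float]:
    """Keep the most frequent n-grams and compute a smoothed IDF for each."""
    totals: Counter = Counter()
    doc_freq: Counter = Counter()
    for doc in docs:
        grams = char_ngrams(doc)
        totals.update(grams)
        doc_freq.update(set(grams))  # each document counts once per n-gram
    vocabulary = [gram for gram, _ in totals.most_common(max_features)]
    n_docs = len(docs)
    return {
        gram: math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0
        for gram in vocabulary
    }


def transform(doc: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse, L2-normalised TF-IDF vector of one document."""
    counts = Counter(gram for gram in char_ngrams(doc) if gram in idf)
    weights = {gram: count * idf[gram] for gram, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {gram: w / norm for gram, w in weights.items()}


idf = fit_idf(["abab", "abcd"])
vector = transform("abab", idf)
```

Word [1-2]-grams follow the same pattern with a list of tokens in place of the character string.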

Top features

Stacking

Validation Results

Test Results

Style Breach Detection

  • PAN 2017
  • 134 training examples
  • 0 to 8 breaches
  • unbalanced classes (80:20)
  • recursive search for breaches
  • threshold > 50% ?
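One possible reading of the recursive search is sketched below. It assumes a hypothetical scorer `has_change_prob` that returns the probability that a span of sentences contains a style change (for example, the style change detector applied to that span). The halving strategy and all names are assumptions, not the original method.

```python
from typing import Callable, List, Sequence


def find_breaches(
    sentences: Sequence[str],
    has_change_prob: Callable[[Sequence[str]], float],
    threshold: float = 0.5,
    offset: int = 0,
) -> List[int]:
    """Return sentence indices at which a new style is assumed to start."""
    # Stop when the span cannot be split or the scorer sees no change in it.
    if len(sentences) < 2 or has_change_prob(sentences) <= threshold:
        return []
    mid = len(sentences) // 2
    left = find_breaches(sentences[:mid], has_change_prob, threshold, offset)
    right = find_breaches(sentences[mid:], has_change_prob, threshold, offset + mid)
    if not left and not right:
        # The span changes style but neither half does: the breach is the split.
        return [offset + mid]
    return left + right


# Toy scorer for illustration: each "sentence" is just its author label.
def toy_scorer(span: Sequence[str]) -> float:
    return 1.0 if len(set(span)) > 1 else 0.0


breaches = find_breaches(list("AABBAA"), toy_scorer)
```

A limitation of this sketch: a breach that falls exactly on a split point is missed when one of the halves also contains a breach.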

Results

Our cross-validation results on the training set:

0.77265566993 0.2403935597545 0.852499252757 0.32950197859

Last year's PAN test results:

PAN Data Mining

By Dimitrina Zlatkova