PAN 2018
Style Change Detection
Atanas, 25858 Dimitrina, 25620
Daniel, 25830 Kristiyan, 25625
The Task
Today's topics:
- External Data
- Features
- Gradient Boosting Machines
- LightGBM
- Stacking
- Style Breach Detection
- Results
Data
- PAN
- StackExchange forum
- 3k train, 1.5k validation, 1.5k test
- 0 to 3 breaches
- External Movies
- Movie reviews from Amazon
- 270k
- 0 or 1 breach
- External StackExchange
- 35 StackExchange sites
- 50k
- 0 to 2 breaches
Text Segmentation
- Sliding Window
- 1/3 overlap
- Segment size: number of words / 3
- Max difference between segment feature vectors
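The segmentation above can be sketched roughly as follows. This is a minimal illustration, not the team's code: it assumes the overlap is one third of a segment, and `toy_features` is a hypothetical stand-in for the real feature extractor.

```python
from itertools import combinations


def sliding_segments(words, n_segments=3):
    """Split a word list into overlapping windows.

    Assumption: segment size is len(words) // 3 and consecutive
    windows share one third of a segment.
    """
    size = max(1, len(words) // n_segments)
    step = max(1, size - size // 3)
    segments = []
    for start in range(0, len(words), step):
        segments.append(words[start:start + size])
        if start + size >= len(words):
            break
    return segments


def toy_features(segment):
    """Hypothetical two-feature vector: mean word length, all-caps share."""
    mean_len = sum(len(w) for w in segment) / len(segment)
    caps = sum(w.isupper() for w in segment) / len(segment)
    return [mean_len, caps]


def max_diff(vectors):
    """Per-feature maximum absolute difference over all segment pairs."""
    if not vectors:
        return []
    if len(vectors) < 2:
        return [0.0] * len(vectors[0])
    return [
        max(abs(a[i] - b[i]) for a, b in combinations(vectors, 2))
        for i in range(len(vectors[0]))
    ]


words = "a bb ccc dddd eeeee ffffff ggggggg hhhhhhhh iiiiiiiii".split()
segments = sliding_segments(words)
document_vector = max_diff([toy_features(s) for s in segments])
```

A large value in `document_vector` means at least two segments differ strongly on that feature, which is the signal a style-change classifier could pick up.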
Features (1)
Characters:
- spaces
- digits
- commas
- (semi)colons
- apostrophes
- quotes
- parentheses
Words:
- POS-tags
- short (< 4 chars)
- long (> 6 chars)
- average length
- all-caps
- capitalized
Sentences:
- question
- period
- exclamation
- short (< 100 chars)
- long (> 200 chars)
- number of paragraphs
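A few of the character, word, and sentence features from this slide could be computed along these lines. This is a sketch under assumptions: the normalization (dividing by the number of characters, words, or sentences) and the regex tokenization are illustrative choices, and POS tags are left out because they need an external tagger.

```python
import re


def char_features(text):
    """Character-class frequencies, normalized by text length."""
    n = max(1, len(text))
    return {
        "spaces": text.count(" ") / n,
        "digits": sum(c.isdigit() for c in text) / n,
        "commas": text.count(",") / n,
        "colons": sum(text.count(c) for c in ":;") / n,
        "apostrophes": text.count("'") / n,
        "quotes": text.count('"') / n,
        "parentheses": sum(text.count(c) for c in "()") / n,
    }


def word_features(text):
    """Word-length and capitalization shares over regex-tokenized words."""
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    return {
        "short": sum(len(w) < 4 for w in words) / n,
        "long": sum(len(w) > 6 for w in words) / n,
        "avg_len": sum(len(w) for w in words) / n,
        "all_caps": sum(len(w) > 1 and w.isupper() for w in words) / n,
        "capitalized": sum(w[:1].isupper() and w[1:].islower()
                           for w in words) / n,
    }


def sentence_features(text):
    """Shares of sentence types, split naively on ., ! and ?."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]+", text)]
    n = max(1, len(sentences))
    return {
        "question": sum(s.endswith("?") for s in sentences) / n,
        "period": sum(s.endswith(".") for s in sentences) / n,
        "exclamation": sum(s.endswith("!") for s in sentences) / n,
        "short": sum(len(s) < 100 for s in sentences) / n,
        "long": sum(len(s) > 200 for s in sentences) / n,
    }


sample = 'Hello, WORLD (2018)! Is it "ok"? Yes.'
cf, wf, sf = char_features(sample), word_features(sample), sentence_features(sample)
```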
Features (2)
- Stop words: you, the, is, of
- Function words: least, well, etc, whether
- Readability, e.g. Flesch reading ease
- Vocabulary richness
- British/American English spelling
(f is a frequency function)
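As an illustration of the readability feature, the standard Flesch reading ease score is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below implements that formula; the vowel-group syllable counter is a crude assumption for demonstration, not necessarily what was used in this work.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch reading ease; higher scores indicate easier text."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))


score = flesch_reading_ease("The cat sat. The dog ran.")
```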
Gradient Boosting Machines
LightGBM
- Leaf-wise (Best-first) Tree Growth
- Gradient-based One-Side Sampling
- Exclusive Feature Bundling
- Optimizations in Speed and Memory Usage
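To make Gradient-based One-Side Sampling more concrete, here is a dependency-free sketch of the sampling step as described in the LightGBM paper: keep the instances with the largest gradient magnitudes, randomly sample from the rest, and up-weight the sampled ones. The function name and default rates are illustrative assumptions, and this is not LightGBM's internal code.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {index: weight} for the instances kept by GOSS.

    The top_rate share with the largest |gradient| is kept with weight 1.
    A further other_rate share (of the full set) is drawn at random from
    the remainder and weighted by (1 - top_rate) / other_rate, which
    compensates for the small-gradient instances that were dropped.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


weights = goss_sample(list(range(100)))
```

With 100 instances and the default rates, 30 instances survive: the 20 with the largest gradients at weight 1 and 10 random others at weight 8.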
LightGBM + TF-IDF
- Character [2-6]-grams (up to 300k)
- Word [1-2]-grams (up to 300k)
- LogisticRegression for feature selection
- Parameter tuning
- Bagging
- Train TF-IDF on test set???
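The character n-gram part of this pipeline might look like the following dependency-free sketch. It assumes a smoothed IDF, L2 normalization, and a vocabulary capped at the most frequent n-grams; in practice a library vectorizer would likely be used, and the word n-grams, LogisticRegression-based selection, and LightGBM training are not shown.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=6):
    """All character n-grams with length between n_min and n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf(docs, max_features=300_000):
    """Return (vocabulary, list of sparse {ngram: weight} vectors)."""
    counts = [Counter(char_ngrams(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    total = Counter()
    for c in counts:
        total.update(c)
    # Cap the vocabulary at the most frequent n-grams in the corpus.
    vocab = [t for t, _ in total.most_common(max_features)]
    keep = set(vocab)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
        vec = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, tf in c.items() if t in keep}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vocab, vectors


vocab, vectors = tfidf(["abcabc", "abcxyz"])
```

N-grams that occur in fewer documents receive a higher weight, so `"xy"` outweighs `"ab"` in the second document.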
Top features
Stacking
Validation Results
Test Results
Style Breach Detection
- PAN 2017
- 134 training examples
- 0 to 8 breaches
- unbalanced classes (80:20)
- recursive search for breaches
- threshold > 50% ?
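One possible reading of the recursive search with a 50% threshold is a bisection: if a span is flagged as containing a style change, split it in half and recurse. The sketch below is an assumption about how such a search could work, not the team's implementation; `has_change` is a hypothetical scorer returning the probability that a span contains a change.

```python
def find_breaches(units, has_change, threshold=0.5, offset=0, min_units=2):
    """Recursively bisect `units` and return breach boundary indices.

    `units` is a list of text units (for example, paragraphs).
    A returned index i means a breach between units i - 1 and i.
    """
    if len(units) < min_units or has_change(units) <= threshold:
        return []
    mid = len(units) // 2
    left = find_breaches(units[:mid], has_change, threshold,
                         offset, min_units)
    right = find_breaches(units[mid:], has_change, threshold,
                          offset + mid, min_units)
    if not left and not right:
        # Neither half is flagged on its own: place the breach at the split.
        return [offset + mid]
    return left + right


def toy_scorer(units):
    """Hypothetical scorer: units are author labels; 1.0 if they differ."""
    return 1.0 if len(set(units)) > 1 else 0.0


breaches = find_breaches(["A", "A", "A", "B", "B", "B", "B", "B"], toy_scorer)
```

A known limitation of this sketch: when a breach sits exactly at a split point and one of the halves contains another breach, the one at the split is missed.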
Results
Our cross-validation results on the training set:
0.77265566993 | 0.2403935597545 | 0.852499252757 | 0.32950197859 |
Last year's PAN test results:
PAN Data Mining
By Dimitrina Zlatkova