PAN 2018
Style Change Detection
Атанас, 25858 Димитрина, 25620
Даниел, 25830 Кристиян, 25625
The Task
Today's topics:
- Data Exploration
- Data Preprocessing
- Features
- Things that didn't work out
- Word-based CNN + SE
- Style Breach Detection
- Results
Data Exploration
Data Preprocessing
- On raw text
- http://www.java2s.com -> _URL_
- 66657345299563332126532111111 -> _LONG_NUM_
- After segmentation
- /Users/Shared/Client/Blizzard -> _FILE_PATH_
- ================== -> _CHAR_SEQ_
- Taumatawhakatangihangakoauauo-> _LONG_WORD_
- Split hyphenated words
- Pretends-To-Be-Scrum-But-Actually-Is-Not-Even-Agile
Lexical Features
- spaces
- digits
- commas
- (semi)colons
- apostrophes
- quotes
- parenthesis
- POS-tags
- short (< 4 chars)
- long (> 6 chars)
- average length
- all-caps
- capitalized
- question
- period
- exclamation
- short (<100chars)
- long (>200 chars)
- number of paragraphs
Characters:
Words:
Sentences:
More Features
- Tautology
- Grammar Contractions
- I will vs. I'll
- are not vs. aren't
- Quotation variation: ' vs. "
- Named entity spelling
- New York City vs. NYC
Begin and end of author segments
Feature Importance
Things that didn't work out
- Dual input model
- LSTM with self-trained embeddings
- LSTM with pre-trained embeddings
- LSTM on feature vectors
- SVM with tf-idf
- Ridge Regression
Word-based CNN
- Window size: 11
- Dropout rate: 0.1
- Optimizer: adam
- Filter depth: 64
- Batch size: 256
- Max words: 1024
- Vocab size: 60k
- Embedding size: 300
- Bagging
Squeeze & Excitation
Results
Classifier | Dataset | Accuracy |
---|---|---|
CNN | validation | 85.92 |
Stacking w/ CNN | validation | 78.02 |
Stacking w/ CNN and LightGBM | validation | 86.93 |
CNN (train + validation) | test | 88.39 |
Stacking w/ LightGBM | test | 89.35 |
Style Breach Detection
- PAN 2017
- 134 training examples
- 0 to 8 breaches
- unbalanced classes (80:20)
- recursive search for breaches
- stop at 20 sentences
- confidence threshold: 75%
Results
PAN Text Mining
By Dimitrina Zlatkova
PAN Text Mining
- 487