PAN 2018
Style Change Detection
Атанас, 25858 Димитрина, 25620
Даниел, 25830 Кристиян, 25625
The Task
Timeline
March 30, 2018 | Early bird submission
April 15, 2018 | Evaluation phase opens
May 11, 2018 | Evaluation phase ends
May 31, 2018 | Participant paper submission
Our Plan
- Related work research
- Data Retrieval
- Exploratory Data Analysis
- Features
- Data Vectorization
- Classification
- Parameter Tuning
Data
- Given
  - StackExchange forum
  - 3k train, 1.5k validation, 1.5k test
  - Balanced classes (50/50)
  - 300-1000 words per doc
  - Different topics
    - Scrum
    - Pokemon Go
    - Bike chains
- External
  - Movie reviews
  - 270k
  - Balanced classes (50/50)
Some Stats
Features
- Stop words as tfidf/tf
- Function words as tfidf/tf
- POS tag ratio
- POS tags as n-grams
- Vocabulary richness
- Readability
- Mean word/sentence lengths
- Punctuation ratio
- Character ratios
- Number of words that appear only once/twice
- Character to space ratio
- Misspelled words ratio
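A few of the simpler features above can be sketched with the standard library alone. This is a minimal illustration, not the team's actual extractor: the function name `style_features`, the regex-based word and sentence splitting, and the exact definition of each ratio are assumptions made for the example.

```python
import re
import string
from collections import Counter


def style_features(text: str) -> dict:
    """Compute a handful of simple stylometric features for one text.

    Assumption: words are runs of letters/apostrophes and sentences end
    at '.', '!' or '?'. A real tokenizer would be more careful.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)

    n_chars = max(len(text), 1)      # guard against empty input
    n_words = max(len(words), 1)
    n_spaces = text.count(" ")

    return {
        # Mean word/sentence lengths
        "mean_word_len": sum(len(w) for w in words) / n_words,
        "mean_sent_len": len(words) / max(len(sentences), 1),
        # Punctuation ratio: punctuation characters over all characters
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # Vocabulary richness as a type/token ratio
        "type_token_ratio": len(counts) / n_words,
        # Number of words that appear only once / twice
        "words_once": sum(1 for c in counts.values() if c == 1),
        "words_twice": sum(1 for c in counts.values() if c == 2),
        # Character to space ratio
        "char_space_ratio": (len(text) - n_spaces) / max(n_spaces, 1),
    }
```

Calling `style_features(segment)` on each segment of a document yields one small feature dictionary per segment, which can then be aggregated at the document level.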
Data Vectorization
- Max Diff (global)
- Max Diff (neighbor)
- Standard deviation
- Gaussian mixture model
- Whole doc
- Sentence tokens
- Sentence segments
- Sliding Window
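One possible reading of the vectorization options above: split the document into windows, compute a feature value per window, then collapse the per-window values into document-level numbers. The sketch below assumes that "Max Diff (global)" means the range across all windows and "Max Diff (neighbor)" means the largest jump between adjacent windows; these interpretations, and the helper names `sliding_windows` and `aggregate`, are assumptions for illustration. The Gaussian mixture option is not covered here.

```python
from statistics import pstdev


def sliding_windows(items, size, step=1):
    """Return overlapping windows of `size` items, moving by `step`."""
    if len(items) <= size:
        return [list(items)]
    return [list(items[i:i + size])
            for i in range(0, len(items) - size + 1, step)]


def aggregate(values):
    """Collapse per-window feature values into document-level numbers.

    `values` must contain at least one number.
    """
    # Largest absolute change between two adjacent windows
    neighbor = max(
        (abs(b - a) for a, b in zip(values, values[1:])),
        default=0.0,
    )
    return {
        "max_diff_global": max(values) - min(values),
        "max_diff_neighbor": neighbor,
        "std": pstdev(values),  # population standard deviation
    }
```

For example, with sentences as items, `sliding_windows(sentences, size=3)` gives the windows, and `aggregate` is applied to one feature (such as mean word length) measured on each window. A large neighbor difference hints at an abrupt change between adjacent parts of the text.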
Problems
- Small training set (we should rely on external data)
- Short texts
- Messy text: URLs, OS paths, code (filter unknown tokens?)
- Not enough punctuation
- Unknown author attribution for split segments
- How to generate better text to obfuscate style change?
- How to use split information? Should we use it at all?
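One way to handle the messy tokens mentioned above is to replace them with placeholder tokens before feature extraction, rather than dropping them. The sketch below is an assumption about how such a filter could look, not a description of what was implemented: the placeholders `<URL>` and `<PATH>`, the function name `mask_tokens`, and both regular expressions are hypothetical, and the path pattern deliberately matches only Windows drive paths and Unix paths with at least one directory component.

```python
import re

# http:// or https:// followed by any run of non-whitespace characters
URL_RE = re.compile(r"https?://\S+")

# Windows drive paths (C:\...) or Unix paths like /usr/local/bin.
# The lookbehind avoids matching slashes inside words such as "and/or".
PATH_RE = re.compile(r"(?:[A-Za-z]:\\[^\s]+|(?<!\w)/(?:[\w.-]+/)+[\w.-]*)")


def mask_tokens(text: str) -> str:
    """Replace URLs and file-system paths with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)    # URLs first, so their slashes
    text = PATH_RE.sub("<PATH>", text)  # are not mistaken for paths
    return text
```

Keeping a placeholder instead of deleting the token preserves the information that the author used a link or a path, which could itself serve as a feature.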
Sentence tokenization
- When a mod converts a post to a comment, if the post contains a link of this form:
- If I delete normally it works fine;
- it's only converting to comment (and probably edit, although I didn't try) more than once that leaves behind undelete votes Screenshot of Unix private beta start mouseover http://mrozekma.com/so-a51-unix-beta-private-hover.png
- Screenshot of the host list http://so.mrozekma.com/chat-modify-room-host.png
- Screenshot of Unix public beta start mouseover http://mrozekma.com/so-a51-unix-beta-public-hover.png
- It's going to take users a minimum of two weeks to hit 2000, and past betas have shown that a massive number of questions get asked when the private beta starts at 0 days and when the public beta starts at 7 days.
- There are hundreds of new questions just on those two days, plus the early questions tend to define the style of the site, and nobody can edit them until weeks after they've been posted
- However, when editing a room, the option doesn't appear on the host list:
- Screenshot of the bug http://so.mrozekma.com/chat-title-bug.png
Next steps
- Features!!!
- Handle unknown tokens (features?)
- Data preprocessing/cleaning
- "Same author" model with 2 inputs
- Other ways to use split information
- Try unsupervised approaches
- (No) more research?
Questions
PAN Idea
By Dimitrina Zlatkova
PAN Idea
- 506