PAN 2018

Style Change Detection

    Atanas, 25858        Dimitrina, 25620

    Daniel, 25830        Kristiyan, 25625

The Task

Timeline

March 30, 2018 Early bird submission
April 15, 2018 Evaluation phase opens
May 11, 2018 Evaluation phase ends
May 31, 2018 Participant paper submission

Our Plan

  • Related work research
  • Data Retrieval
  • Exploratory Data Analysis
  • Features
  • Data Vectorization
  • Classification
  • Parameter Tuning

Data

  • Given
    • StackExchange forum
    • 3k train, 1.5k validation, 1.5k test
    • Balanced classes (50/50)
    • 300-1000 words per doc
    • Different topics
      • Scrum
      • Pokemon Go
      • Bike chains
  • External
    • Movie reviews
    • 270k
    • Balanced classes (50/50)

Some Stats

Features

  • Stop words as TF-IDF/TF
  • Function words as TF-IDF/TF
  • POS tag ratio
  • POS tags as n-grams
  • Vocabulary richness
  • Readability
  • Mean word/sentence lengths
  • Punctuation ratio
  • Character ratios
  • Number of words that appear only once/twice
  • Character to space ratio
  • Misspelled words ratio
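
A minimal sketch of how a few of the features above could be computed for one text, using only the standard library. The function name `stylometric_features`, the regex-based word and sentence splitting, and the `PUNCTUATION` set are assumptions made for illustration, not the team's actual implementation; POS tags, readability and misspellings would need an external tagger or dictionary and are not shown.

```python
import re
from collections import Counter

# Assumed punctuation inventory; extend as needed.
PUNCTUATION = set(".,;:!?\"'()-")


def stylometric_features(text):
    """Return a dict of simple style features for one text."""
    # Naive tokenization: lowercase alphabetic runs (apostrophes allowed).
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n_chars = len(text)
    n_spaces = text.count(" ")
    return {
        "mean_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "mean_sentence_length": len(words) / max(len(sentences), 1),
        "punctuation_ratio": sum(c in PUNCTUATION for c in text) / max(n_chars, 1),
        # Vocabulary richness as a type-token ratio.
        "type_token_ratio": len(counts) / max(len(words), 1),
        # Words that appear exactly once / exactly twice.
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
        "char_to_space_ratio": (n_chars - n_spaces) / max(n_spaces, 1),
    }


features = stylometric_features("The cat sat. The dog sat too!")
```

Each value is a plain number, so the dict can be flattened into a feature vector in a fixed key order.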

Data Vectorization

  • Max Diff (global)
  • Max Diff (neighbor)
  • Standard deviation
  • Gaussian mixture model
  • Whole doc
  • Sentence tokens
  • Sentence segments
  • Sliding Window
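
One possible reading of the items above, sketched under stated assumptions: the document is cut into sliding windows, each window gets a feature vector, and the per-window vectors are collapsed into one document vector via the global max difference, the max difference between neighboring windows, and the standard deviation. The helper names (`sliding_windows`, `window_features`, `aggregate`, `vectorize`), the window size and step, and the two toy features are hypothetical; the Gaussian mixture model variant is not shown.

```python
from statistics import pstdev


def sliding_windows(tokens, size, step):
    """Return overlapping token windows (trailing tokens may be dropped)."""
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + 1, step)]


def window_features(tokens):
    """Toy per-window features: mean word length and type-token ratio."""
    n = max(len(tokens), 1)
    return [sum(len(t) for t in tokens) / n, len(set(tokens)) / n]


def aggregate(window_vectors):
    """Collapse per-window feature vectors into one document vector."""
    columns = list(zip(*window_vectors))
    # Max Diff (global): spread between the extreme windows, per feature.
    max_diff_global = [max(c) - min(c) for c in columns]
    # Max Diff (neighbor): largest jump between adjacent windows, per feature.
    max_diff_neighbor = [
        max((abs(b - a) for a, b in zip(c, c[1:])), default=0.0)
        for c in columns
    ]
    # Standard deviation across windows, per feature.
    std = [pstdev(c) for c in columns]
    return max_diff_global + max_diff_neighbor + std


def vectorize(text, size=50, step=25):
    """Whitespace-tokenize, window, featurize, and aggregate one document."""
    tokens = text.lower().split()
    vectors = [window_features(w) for w in sliding_windows(tokens, size, step)]
    return aggregate(vectors)
```

The intuition is that a document with a style change should show a larger spread between windows than a single-author document; sentence tokens or sentence segments could replace the fixed-size windows without changing `aggregate`.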

Problems

  • Small training set (may need to rely on external data)
  • Short texts
  • Messy text: URLs, OS paths, code (filter unknown tokens?)
  • Not enough punctuation
  • Unknown which author wrote each split segment
  • How to generate better text to obfuscate style change?
  • How to use split information? Should we use it at all?
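
For the messy-text problem, one option is to mask URLs and file paths with placeholder tokens before feature extraction, so they stop polluting word and punctuation statistics. This is a rough sketch: the function name `mask_noise`, the `<URL>` and `<PATH>` placeholders, and both regular expressions are assumptions, and the path pattern is a heuristic that will miss some paths and does not handle code snippets.

```python
import re

# Anything starting with http:// or https:// up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")

# Heuristic for Unix-style (/a/b/c) and Windows-style (C:\a\b) paths.
# The lookbehind avoids matching inside constructs like "his/her/their".
PATH_RE = re.compile(r"(?<![\w/])(?:[A-Za-z]:\\|/)(?:[\w.\-]+[/\\])+[\w.\-]*")


def mask_noise(text):
    """Replace URLs and file paths with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)  # URLs first, so their slashes are gone
    text = PATH_RE.sub("<PATH>", text)
    return text


cleaned = mask_noise("Screenshot of the bug http://so.mrozekma.com/chat-title-bug.png")
```

Keeping a placeholder instead of deleting the token preserves the information that a link or path was present, which could itself be used as a feature.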

Sentence tokenization

When a mod converts a post to a comment, if the post contains a link of this form: |
If I delete normally it works fine;|
it's only converting to comment (and probably edit, although I didn't try) more than once that leaves behind undelete votes Screenshot of Unix private beta start mouseover http://mrozekma.com/so-a51-unix-beta-private-hover.png |
Screenshot of the host list http://so.mrozekma.com/chat-modify-room-host.png |
Screenshot of Unix public beta start mouseover http://mrozekma.com/so-a51-unix-beta-public-hover.png |
It's going to take users a minimum of two weeks to hit 2000, and past betas have shown that a massive number of questions get asked when the private beta starts at 0 days and when the public beta starts at 7 days. |
There are hundreds of new questions just on those two days, plus the early questions tend to define the style of the site, and nobody can edit them until weeks after they've been posted |
However, when editing a room, the option doesn't appear on the host list: |
Screenshot of the bug http://so.mrozekma.com/chat-title-bug.png

Next steps

  • Features!!!
  • Handle unknown tokens (features?)
  • Data preprocessing/cleaning
  • "Same author" model with 2 inputs
  • Other ways to use split information
  • Try unsupervised approaches
  • (No) more research?

Questions

PAN Idea

By Dimitrina Zlatkova