Topic modeling hands-on with MALLET

Installing mallet 

Already done if you're using the local PC; otherwise:

  1. unzip mallet-2.0.8.zip to D:\
  2. rename the folder to 'mallet'

(Windows) setting environment variable 

  • open cmd (Start -> type cmd)
  • setx MALLET_HOME "D:\mallet"
  • close cmd (important: setx only takes effect in newly opened cmd windows)

or

Setting environment variable

Testing the installation

  1. open cmd (Start -> type cmd)
  2. type D: and press Enter
  3. type cd mallet and press Enter
  4. type bin\mallet and press Enter (MALLET should print its list of commands)

Testing on Mac/Linux

  1. launch Terminal
  2. cd to your mallet folder
  3. ./bin/mallet
  4. NB: in all commands that follow, use / instead of \
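
The launcher path differs only in the separator between Windows and Mac/Linux. A minimal Python sketch of that difference, assuming MALLET was unpacked as described above; the function name `mallet_launcher` is ours and not part of MALLET:

```python
from pathlib import PurePosixPath, PureWindowsPath


def mallet_launcher(mallet_home, windows):
    """Return the path to the MALLET launcher inside mallet_home.

    mallet_home is the folder MALLET was unpacked to (the value of
    MALLET_HOME); windows selects backslash or forward-slash separators.
    """
    if windows:
        # Windows style: D:\mallet -> D:\mallet\bin\mallet
        return str(PureWindowsPath(mallet_home) / "bin" / "mallet")
    # Mac/Linux style: /opt/mallet -> /opt/mallet/bin/mallet
    return str(PurePosixPath(mallet_home) / "bin" / "mallet")


# Hypothetical install locations, for illustration only.
print(mallet_launcher(r"D:\mallet", windows=True))
print(mallet_launcher("/opt/mallet", windows=False))
```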

If you get an error about 'Java'

Download and install JDK:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

JDK for Windows x64:

Data import

  • read help when in trouble:
    • bin\mallet import-dir --help
  • simplest import (from sample data):
    • bin\mallet import-dir --input sample-data\web\en
  • now let's create output in .mallet format:
    • bin\mallet import-dir --input sample-data\web\en --output tutorial.mallet

Some import parameters

  • --input
  • --output

  • --keep-sequence
  • --skip-html 
  • --token-regex
  • --remove-stopwords
  • --stoplist-file

  • Important: add --keep-sequence to use train-topics:
    • bin\mallet import-dir --input sample-data\web\en --output tutorial.mallet  --keep-sequence
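
If you run the import many times, it can help to assemble the command in a script. The sketch below only builds the argument list from the flags shown above; `import_dir_command` and its parameter names are our own, and the default launcher `bin\mallet` assumes you run it from the mallet folder on Windows:

```python
def import_dir_command(input_dir, output_file, keep_sequence=True,
                       stoplist_file=None, launcher=r"bin\mallet"):
    """Build the argument list for a 'mallet import-dir' call."""
    args = [launcher, "import-dir", "--input", input_dir, "--output", output_file]
    if keep_sequence:
        # Needed if the .mallet file will be passed to train-topics.
        args.append("--keep-sequence")
    if stoplist_file is not None:
        # Optional custom stoplist.
        args += ["--stoplist-file", stoplist_file]
    return args


cmd = import_dir_command(r"sample-data\web\en", "tutorial.mallet")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`; on Windows you may need to spell the launcher as `bin\mallet.bat` or pass `shell=True`, since the launcher is a batch script.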

Training a topic model

  • bin\mallet train-topics

train-topics parameters

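  • --input (.mallet format!)
  • --output-topic-keys, --output-doc-topics (result files)
  • --num-topics N
  • --optimize-interval N
    • (re-estimates the hyperparameters every N iterations, so some topics can be more prominent than others; this usually improves the model)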

Your first topic model

from sample data

  • simplest launch:
    • bin\mallet train-topics --input tutorial.mallet
  • adding topics output and other parameters:
    • bin\mallet train-topics --input tutorial.mallet --num-topics 20 --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_compostion.txt
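
The training call has more moving parts, so here is a sketch that assembles it from named settings. It uses only the flags shown above; `train_topics_command` and its parameter names are illustrative, not part of MALLET:

```python
def train_topics_command(input_file, num_topics, topic_keys_file,
                         doc_topics_file, optimize_interval=None,
                         launcher=r"bin\mallet"):
    """Build the argument list for a 'mallet train-topics' call."""
    args = [
        launcher, "train-topics",
        "--input", input_file,                  # must be a .mallet file
        "--num-topics", str(num_topics),
        "--output-topic-keys", topic_keys_file,
        "--output-doc-topics", doc_topics_file,
    ]
    if optimize_interval is not None:
        # Turn on hyperparameter optimization every N iterations.
        args += ["--optimize-interval", str(optimize_interval)]
    return args


cmd = train_topics_command("tutorial.mallet", 20,
                           "tutorial_keys.txt", "tutorial_compostion.txt")
print(" ".join(cmd))
```

Keeping the settings as function arguments makes it easy to rerun the same model with a different number of topics.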

Looking at the results

  • Open D:\mallet (or wherever your mallet is)
  • Right click on tutorial_keys.txt -> Edit with Notepad++
  • What did we get?
  • Let's look at the document-topic distributions in tutorial_compostion.txt:
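
Both result files are tab-separated text, so they can also be read with a few lines of Python. The sketch below assumes each topic-keys line is "topic id, weight, space-separated words" and each doc-topics line is "doc number, file name, one proportion per topic" (the layout we expect from MALLET 2.0.8; check your own files first). The sample strings are invented for illustration and are not real MALLET output:

```python
def parse_topic_keys(text):
    """Map topic id -> (weight, list of top words)."""
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics


def parse_doc_topics(text):
    """Map document name -> list of topic proportions.

    Assumes the dense layout: doc number, name, then one value per topic.
    """
    docs = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and a possible header line
        fields = line.split("\t")
        docs[fields[1]] = [float(value) for value in fields[2:]]
    return docs


# Invented sample content, for illustration only.
keys_sample = "0\t2.5\tcricket test match\n1\t2.5\triver valley park\n"
docs_sample = "0\tfile:/a.txt\t0.75\t0.25\n1\tfile:/b.txt\t0.1\t0.9\n"

topics = parse_topic_keys(keys_sample)
for name, proportions in parse_doc_topics(docs_sample).items():
    best = proportions.index(max(proportions))
    print(name, "-> topic", best, topics[best][1])
```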

Default data

  • --input sample-data\web\en
    • 12 English Wikipedia articles
  • --input sample-data\web\de
    • 12 German Wikipedia articles

Now let's import our data

  • download our-data.zip from email
  • unzip it and put the 'our-data' folder into 'D:\mallet'
  • bin\mallet import-dir --input our-data\tolstoy --output tolstoy.mallet --keep-sequence

...and train the model

  • bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics 
  • What did we get? 

  • SPOILER: lots of pronouns and function words like 'что' (that/what), 'он' (he), 'её' (her)...
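
Why function words take over is easy to see from raw frequencies: in almost any text they are the most common tokens. A small sketch with an invented Russian sentence (not taken from our data); `top_words` is a hypothetical helper:

```python
from collections import Counter


def top_words(text, n):
    """Return the n most frequent whitespace-separated tokens with counts."""
    return Counter(text.lower().split()).most_common(n)


# Invented example: "he said that he did not know that she had left".
sample = "он сказал что он не знал что она ушла"
print(top_words(sample, 2))
```

The content words each occur once, while 'он' and 'что' repeat, so a model trained on such text groups documents by function words rather than by subject.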

Intermission: preprocessing

Removing Russian stopwords

  • download stop_ru.txt from email
  • put it into 'D:\mallet\stoplists'
  • add the parameter
    • --stoplist-file stoplists\stop_ru.txt
  • bin\mallet import-dir --input our-data\writers --output writers.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt

lifehack:

  • ready-made stopword lists for many languages:
    http://www.ranks.nl/stopwords/

lifehack 2:

  • ...and then you can just add more stopwords to stop_ru.txt manually! 

  • just don't forget to re-import data after editing the stoplist
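
Editing the stoplist by hand works, but a short script avoids duplicates. This sketch assumes the stoplist is plain text with whitespace-separated words (one per line in stop_ru.txt); `merge_stopwords` is a hypothetical helper, not part of MALLET:

```python
def merge_stopwords(stoplist_text, extra_words):
    """Return stoplist text with extra_words appended, one word per line.

    Words already in the list are not added again; new words are lowercased.
    """
    words = stoplist_text.split()
    seen = set(words)
    for word in extra_words:
        word = word.strip().lower()
        if word and word not in seen:
            words.append(word)
            seen.add(word)
    return "\n".join(words) + "\n"


# Tiny invented stoplist, for illustration only.
print(merge_stopwords("и\nв\n", ["он", "в", "Что"]))
```

To apply it to the real file, read and write it with an explicit encoding, for example `Path("stoplists/stop_ru.txt").read_text(encoding="utf-8")`, and then re-run import-dir.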

Training on cleaner data

  • bin\mallet import-dir --input our-data\tolstoy --output tolstoy.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
  • bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics 

What about morphology and inflection?

Training on lemmatized data

  • bin\mallet import-dir --input our-data\tolstoy_lemm --output tolstoy.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
  • bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics 
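
Lemmatization matters because MALLET treats every inflected form as a separate word. The toy sketch below shows the effect with a hand-written lookup table; the table and the sentence are invented for illustration, and real lemmatization uses a morphological analyzer rather than a dictionary like this:

```python
from collections import Counter

# Hypothetical mini-dictionary: inflected form -> lemma.
TOY_LEMMAS = {
    "книги": "книга", "книгу": "книга",
    "читал": "читать", "читала": "читать",
}


def lemmatize(tokens, lemmas):
    """Replace each token by its lemma when the table knows it."""
    return [lemmas.get(token, token) for token in tokens]


# Invented example: "he read a book she read books".
tokens = "он читал книгу она читала книги".split()
print(Counter(tokens))                          # six different word forms
print(Counter(lemmatize(tokens, TOY_LEMMAS)))   # forms merged into lemmas
```

After lemmatization the counts for 'книга' and 'читать' are pooled, which gives the topic model more evidence per word.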

  • for your own project, you can order lemmatization from me (skorinkin.danil@gmail.com)
  • the lemmatization is done automatically, so errors are possible (but overall it should be fine)