Topic modeling hands-on with MALLET
Installing mallet
already done if you're using the local PC; otherwise:
- unzip mallet-2.0.8.zip to D:\
- rename the folder to 'mallet'
(Windows) Setting the environment variable
- open cmd (Start -> type cmd)
- setx MALLET_HOME "D:\mallet"
- close cmd (important: setx only takes effect in newly opened cmd windows)
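If you are unsure whether the variable was picked up, a small Python check can help. This is only a sketch: the function name check_mallet_home is made up for this example, and it assumes nothing more than that a MALLET folder contains a bin subfolder.

```python
import os
from pathlib import Path


def check_mallet_home(env=None):
    """Return (ok, message) saying whether MALLET_HOME looks usable.

    check_mallet_home is a hypothetical helper, not part of MALLET.
    """
    env = os.environ if env is None else env
    home = env.get("MALLET_HOME")
    if not home:
        return False, "MALLET_HOME is not set"
    # Assumption: a valid MALLET folder has a 'bin' subfolder with the launcher.
    if not (Path(home) / "bin").is_dir():
        return False, f"no bin folder inside {home}"
    return True, f"MALLET_HOME points to {home}"


if __name__ == "__main__":
    ok, message = check_mallet_home()
    print(message)
```

Run it from a newly opened cmd window, since the old window will not see the variable yet.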
or
Setting environment variable
Testing the installation
- open cmd (Start -> type cmd)
- type D: and press Enter
- type cd mallet and press Enter
- type bin\mallet and press Enter
Testing on Mac/Linux
- Launch terminal
- cd <your path to mallet>
- ./bin/mallet
- NB: you'll need / instead of \
If you get an error about 'Java'
Download and install JDK:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
JDK for Windows x64:
Data import
- read help when in trouble:
- bin\mallet import-dir --help
- simplest import (from sample data):
- bin\mallet import-dir --input sample-data\web\en
- now let's create output in .mallet format:
- bin\mallet import-dir --input sample-data\web\en --output tutorial.mallet
Some import parameters
- --input
- --output
- --keep-sequence
- --skip-html
- --token-regex
- --remove-stopwords
- --stoplist-file
- Important: add --keep-sequence to use train-topics:
- bin\mallet import-dir --input sample-data\web\en --output tutorial.mallet --keep-sequence
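If you re-import often with different options, it may be convenient to assemble the command in Python. The sketch below only uses the flags listed above; the function name build_import_command and its arguments are invented for this example.

```python
def build_import_command(input_dir, output_file, keep_sequence=True,
                         stoplist_file=None, mallet="bin\\mallet"):
    """Assemble an import-dir command as a list of arguments.

    build_import_command is a hypothetical helper; only the flags
    themselves (--input, --output, --keep-sequence, --stoplist-file)
    come from MALLET.
    """
    command = [mallet, "import-dir", "--input", input_dir, "--output", output_file]
    if keep_sequence:
        # Needed if the output will be fed to train-topics.
        command.append("--keep-sequence")
    if stoplist_file is not None:
        command += ["--stoplist-file", stoplist_file]
    return command


if __name__ == "__main__":
    args = build_import_command("sample-data\\web\\en", "tutorial.mallet")
    # Print a line that can be pasted into cmd from the MALLET folder.
    print(" ".join(args))
```

The list can also be handed to subprocess.run; on Windows the launcher may then have to be given as bin\mallet.bat, so treat that part as an assumption to verify on your machine.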
training a topic model
- bin\mallet train-topics
train-topics parameters
- --input (.mallet format!)
- --output
- --num-topics N
- --optimize-interval N
- (turns on hyperparameter optimization every N iterations, which lets some topics be more prominent than others)
Your first topic model (from sample data)
- simplest launch:
- bin\mallet train-topics --input tutorial.mallet
- adding topics output and other parameters:
- bin\mallet train-topics --input tutorial.mallet --num-topics 20 --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
Looking at the results
- Open D:\mallet (or wherever your mallet is)
- Right click on tutorial_keys.txt -> Edit with Notepad++
- What did we get?
- Let's look at the document distributions:
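The document distributions file can also be read with a few lines of Python. This sketch assumes the dense layout that MALLET 2.0.8 appears to write (document index, document name, then one proportion per topic, tab-separated); older releases used topic/proportion pairs, so check your file first. The sample rows and the names doc_a.txt and doc_b.txt are made up.

```python
def parse_doc_topics(text):
    """Parse dense doc-topics output into {document name: [proportions]}.

    Assumed line layout: index <TAB> name <TAB> p0 <TAB> p1 ...
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a possible header line
        fields = line.split("\t")
        result[fields[1]] = [float(x) for x in fields[2:]]
    return result


def dominant_topic(proportions):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(proportions)), key=proportions.__getitem__)


# Hypothetical sample, not real MALLET output.
SAMPLE = "0\tdoc_a.txt\t0.7\t0.2\t0.1\n1\tdoc_b.txt\t0.1\t0.3\t0.6\n"

if __name__ == "__main__":
    for name, props in parse_doc_topics(SAMPLE).items():
        print(name, "-> topic", dominant_topic(props))
```

To use it on real output, read the file written by --output-doc-topics and pass its contents to parse_doc_topics.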
Default data
- --input sample-data\web\en
- 12 English Wikipedia articles
- --input sample-data\web\de
- 12 German Wikipedia articles
Now let's import our data
- download our-data.zip from email
- unzip it and put the 'our-data' folder into 'D:\mallet'
- bin\mallet import-dir --input our-data\tolstoy --output tolstoy.mallet --keep-sequence
...and train the model
- bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics
- What did we get?
- SPOILER: lots of pronouns and function words like 'что', 'он', 'её'...
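To put a number on "lots of function words", one could measure what share of each topic's top words is on a stoplist. The sketch assumes the topic keys file has one topic per line in the form topic number, weight, then space-separated words, with tabs between the three parts. The sample topics and the tiny stoplist are invented for illustration.

```python
def parse_topic_keys(text):
    """Parse topic keys into a list of (topic id, weight, [words]).

    Assumed line layout: id <TAB> weight <TAB> word word word ...
    """
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


def stopword_share(words, stopwords):
    """Return the fraction of words that appear in the stopword set."""
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


# Hypothetical sample, not real output from the Tolstoy data.
SAMPLE_KEYS = "0\t0.5\tчто он её князь\n1\t0.5\tвойна мир он армия\n"
SAMPLE_STOPWORDS = {"что", "он", "её"}

if __name__ == "__main__":
    for topic_id, _, words in parse_topic_keys(SAMPLE_KEYS):
        print(topic_id, stopword_share(words, SAMPLE_STOPWORDS))
```

A high share for most topics is a hint that the data needs a stoplist before training.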
Intermission: preprocessing
Removing Russian stopwords
- download stop_ru.txt from email
- put it in 'D:\mallet\stoplists'
- add parameter
- --stoplist-file stoplists\stop_ru.txt
- bin\mallet import-dir --input our-data\writers --output writers.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
lifehack:
- for stopwords go to http://www.ranks.nl/stopwords/
lifehack 2:
- ...and then you can just add more stopwords to stop_ru.txt manually!
- just don't forget to re-import data after editing the stoplist
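Adding stopwords by hand works fine; if the list of additions is long, a short script could do it instead. This is a sketch under two assumptions: the stoplist is a UTF-8 text file with whitespace-separated words, and words are lowercased because MALLET lowercases tokens on import by default. The function name add_stopwords is made up.

```python
from pathlib import Path


def add_stopwords(stoplist_path, new_words):
    """Append words that are not yet in the stoplist; return the words added.

    add_stopwords is a hypothetical helper. Assumes a UTF-8 stoplist;
    change the encoding if your file uses another one.
    """
    path = Path(stoplist_path)
    existing = path.read_text(encoding="utf-8").split() if path.exists() else []
    seen = set(existing)
    added = []
    for word in new_words:
        word = word.strip().lower()
        if word and word not in seen:
            seen.add(word)
            added.append(word)
    # Rewrite the file with one word per line.
    path.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added
```

For example, add_stopwords("stoplists/stop_ru.txt", ["это", "который"]) would append the two words if they are missing. The data still has to be re-imported afterwards.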
training on cleaner data
- bin\mallet import-dir --input our-data\tolstoy --output tolstoy.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
- bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics
What about morphology and inflection?
training on lemmatized data
- bin\mallet import-dir --input our-data\tolstoy_lemm --output tolstoy.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
- bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics
- for your own project, you can request lemmatization from me (skorinkin.danil@gmail.com)
- the lemmatization is automatic, so errors are possible (but overall it should be fine)
Mallet
By danilsko