Topic modeling hands-on with MALLET
Installing mallet
already done if you're using a local PC; otherwise:
unzip mallet-2.0.8.zip to D:\
rename the folder to 'mallet'
(Windows) setting environment variable
open cmd (Start -> type cmd)
setx MALLET_HOME "D:\mallet"
close cmd (important!)
or
Setting environment variable
Testing the installation
open cmd (Start -> type cmd)
D: -> enter
cd mallet -> enter
bin\mallet -> enter
Testing on Mac/Linux
Launch terminal
cd <your path to mallet>
./bin/mallet
NB: you'll need / instead of \
If you get an error about 'Java'
Download and install JDK:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
JDK for Windows x64:
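Before reinstalling anything, it can help to check what the system actually sees. The sketch below is only an illustration: the function name `check_mallet_prerequisites` is made up here, and it assumes that "Java is installed" means a `java` launcher is reachable through `PATH`.

```python
import os
import shutil


def check_mallet_prerequisites(env=None):
    """Report whether a java launcher is on PATH and whether MALLET_HOME is set.

    `env` defaults to the current process environment; passing a dict makes
    the check easy to try out with other values.
    """
    env = os.environ if env is None else env
    # shutil.which returns the full path of the executable, or None.
    java_path = shutil.which("java", path=env.get("PATH"))
    mallet_home = env.get("MALLET_HOME")
    return {
        "java_found": java_path is not None,
        "java_path": java_path,
        "mallet_home": mallet_home,
        # True only if the variable is set and points to an existing folder.
        "mallet_home_exists": bool(mallet_home) and os.path.isdir(mallet_home),
    }
```

Calling `check_mallet_prerequisites()` in a fresh Python session (opened after `setx`, since the variable is only visible to new processes) returns a small dictionary you can print and inspect.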
Data import
read help when in trouble:
bin\mallet import-dir --help
simplest import (from sample data):
bin\mallet import-dir --input sample-data\web\en
now let's create output in .mallet format:
bin\mallet import-dir --input sample-data\web\en --output tutorial.mallet
Some import parameters
--input
--output
--keep-sequence
--skip-html
--token-regex
--remove-stopwords
--stoplist-file
Important: add --keep-sequence to use train-topics:
bin\mallet import-dir --input sample-data\web\en --output tutorial.mallet --keep-sequence
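If you re-import often, the command line can be assembled in a script so that --keep-sequence is never forgotten. This is a minimal sketch, not part of MALLET: the helper `build_import_command` and its parameter names are invented for illustration, and it only uses flags that appear on these slides.

```python
def build_import_command(input_dir, output_file, stoplist_file=None,
                         mallet_bin="bin\\mallet"):
    """Build the argument list for `mallet import-dir`.

    --keep-sequence is always included because train-topics needs it.
    `mallet_bin` is the launcher path relative to the mallet folder
    (use "./bin/mallet" on Mac/Linux).
    """
    command = [
        mallet_bin, "import-dir",
        "--input", input_dir,
        "--output", output_file,
        "--keep-sequence",
    ]
    if stoplist_file is not None:
        # Optional custom stoplist, e.g. stoplists\stop_ru.txt
        command += ["--stoplist-file", stoplist_file]
    return command
```

The resulting list can be passed to `subprocess.run(command, cwd=..., check=True)` with `cwd` set to your mallet folder. On Windows you may need to give the launcher as `bin\mallet.bat` when running it this way.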
training a topic model
bin\mallet train-topics
train-topics parameters
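--input (.mallet format!)
--output-... (for example --output-topic-keys, --output-doc-topics)
--num-topics N
--optimize-interval N (re-estimates the hyperparameters every N iterations; some maths that helps)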
Your first topic model
from sample data
simplest launch:
bin\mallet train-topics --input tutorial.mallet
adding topics output and other parameters:
bin\mallet train-topics --input tutorial.mallet --num-topics 20 --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
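When you try several values of --num-topics, it is convenient to generate the training command rather than retype it. A possible sketch follows; `build_train_command` is a hypothetical helper, and it only uses the train-topics flags shown on these slides.

```python
def build_train_command(input_file, num_topics, keys_file, doc_topics_file,
                        optimize_interval=None, mallet_bin="bin\\mallet"):
    """Build the argument list for `mallet train-topics`.

    `input_file` must be a .mallet file produced by import-dir
    with --keep-sequence.
    """
    command = [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", keys_file,
        "--output-doc-topics", doc_topics_file,
    ]
    if optimize_interval is not None:
        # Turns on hyperparameter optimization every N iterations.
        command += ["--optimize-interval", str(optimize_interval)]
    return command
```

For example, looping over `for n in (10, 20, 40)` and writing to `f"keys_{n}.txt"` gives one key file per model, so the runs do not overwrite each other.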
Looking at the results
Open D:\mallet (or wherever your mallet is)
Right-click on tutorial_keys.txt -> Edit with Notepad++
What did we get?
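Each line of the topic keys file holds a topic number, a weight for that topic, and the topic's top words, separated by tabs. If you prefer to inspect it programmatically, a small reader might look like this. `parse_topic_keys` is an illustrative name, and the decimal-comma handling is an assumption for systems whose locale writes `0,25` instead of `0.25`.

```python
def parse_topic_keys(lines):
    """Parse lines of a topic keys file into a list of dictionaries.

    Expected line layout: topic number <TAB> weight <TAB> space-separated words.
    """
    topics = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append({
            "topic": int(topic_id),
            # Assumption: some locales write a decimal comma.
            "weight": float(weight.replace(",", ".")),
            "words": words.split(),
        })
    return topics
```

Usage: `with open("tutorial_keys.txt", encoding="utf-8") as f: topics = parse_topic_keys(f)`, then print the first few words of each topic.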
Let's look at the document distributions:
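The document-topics file can also be read with a few lines of code. The sketch below assumes the layout written by MALLET 2.0.8: document index, document name, then one proportion per topic in topic order, all tab-separated. Other versions may lay the file out differently, so check a line by eye first. `parse_doc_topics` is a made-up helper name.

```python
def parse_doc_topics(lines):
    """Parse a doc-topics file (assumed 2.0.8 layout) into a list of dicts.

    Assumed line layout: doc index <TAB> doc name <TAB> p(topic 0) <TAB> p(topic 1) ...
    """
    docs = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        # Assumption: some locales write a decimal comma.
        proportions = [float(v.replace(",", ".")) for v in fields[2:] if v]
        top_topic = None
        if proportions:
            # Index of the topic with the largest share in this document.
            top_topic = max(range(len(proportions)), key=proportions.__getitem__)
        docs.append({
            "name": fields[1],
            "proportions": proportions,
            "top_topic": top_topic,
        })
    return docs
```

Sorting the result by `top_topic` is a quick way to see which documents cluster around the same topic.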
Default data
--input sample-data\web\en
12 English Wikipedia articles
--input sample-data\web\de
12 German Wikipedia articles
Now let's import our data
download our-data.zip from email
unzip it and put the 'our-data' folder in 'D:\mallet'
bin\mallet import-dir --input our-data\tolstoy --output tolstoy.mallet --keep-sequence
..and train the model
bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics
What did we get?
SPOILER: lots of pronouns and function words like 'что', 'он', 'её'...
Intermission: preprocessing
Removing Russian stopwords
download stop_ru.txt from email
put it in 'D:\mallet\stoplists'
add parameter
--stoplist-file stoplists\stop_ru.txt
bin\mallet import-dir --input our-data\writers --output writers.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
lifehack: for stopwords go to http://www.ranks.nl/stopwords/
lifehack 2: ...and then you can just add more stopwords to stop_ru.txt manually!
just don't forget to re-import data after editing the stoplist
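Adding stopwords by hand works fine; if the list grows, a short script can append new words without creating duplicates. This is only a sketch: `add_stopwords` is a hypothetical helper, it assumes the stoplist is a UTF-8 text file with whitespace-separated words, and it lowercases new words on the assumption that the import was run without preserving case.

```python
from pathlib import Path


def add_stopwords(stoplist_path, new_words):
    """Append words to a stoplist file, skipping ones already present.

    Returns the list of words that were actually added.
    """
    path = Path(stoplist_path)
    existing = path.read_text(encoding="utf-8").split() if path.exists() else []
    seen = set(existing)
    added = []
    for word in new_words:
        word = word.strip().lower()
        if word and word not in seen:
            seen.add(word)
            added.append(word)
    # Rewrite the file with one word per line.
    path.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added
```

For example, `add_stopwords("stoplists/stop_ru.txt", ["сказал", "очень"])` returns only the words that were new. After that, re-run import-dir so the updated list takes effect.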
training on cleaner data
bin\mallet import-dir --input our-data\tolstoy --output tolstoy.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics
What about morphology and inflection?
training on lemmatized data
bin\mallet import-dir --input our-data\tolstoy_lemm --output tolstoy.mallet --keep-sequence --stoplist-file stoplists\stop_ru.txt
bin\mallet train-topics --input tolstoy.mallet --num-topics 4 --optimize-interval 4 --output-topic-keys tolstoy_keys.txt --output-doc-topics
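Why lemmatize at all? Without it, every inflected form counts as a separate word, so the evidence for one concept is split across many tokens. The toy example below illustrates the effect. The lemma table is hand-made for this illustration only; it is not how the lemmatized course data was produced, and a real project would use an automatic lemmatizer.

```python
from collections import Counter

# Hypothetical hand-made lemma table, for illustration only.
TOY_LEMMAS = {
    "книги": "книга",
    "книгу": "книга",
    "читал": "читать",
    "читала": "читать",
}


def count_tokens(tokens, lemmas=None):
    """Count tokens, optionally mapping each one to its lemma first."""
    lemmas = lemmas or {}
    return Counter(lemmas.get(token, token) for token in tokens)


tokens = ["книга", "книги", "книгу", "читал", "читала"]
raw_counts = count_tokens(tokens)                # five distinct word forms
lemma_counts = count_tokens(tokens, TOY_LEMMAS)  # two lemmas
```

Here `raw_counts` has five entries with a count of 1 each, while `lemma_counts` has two entries: `книга` with 3 and `читать` with 2. The topic model sees the same kind of consolidation when it is trained on lemmatized text.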
for your own project, you can order lemmatization from me (skorinkin.danil@gmail.com)
the lemmatization will be done automatically, so errors are possible (but overall everything should be OK)