PurePos

an adaptable morphological disambiguator

Orosz György

Pázmány Péter Catholic University

Part-of-speech tagging

Morphological disambiguation



Simple

  • uses a morphological analyzer

  • built on TnT / HunPos

  • modular architecture

Customizable

  • works with any analyzer

  • fast to train

  • extensively configurable

Efficient

words      //   sentences

97-98%   //   60-80%

~200 sentences/s
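
A back-of-the-envelope way to relate the word-level and sentence-level figures, offered as a sketch rather than a result from the slides: if each word is tagged correctly with probability p and errors are assumed to be independent (a simplification), a sentence of n words is fully correct with probability p^n. The function name and the sentence lengths below are illustrative assumptions.

```python
def sentence_accuracy(word_accuracy: float, sentence_length: int) -> float:
    """Probability that every token of a sentence is tagged correctly,
    assuming independent per-token errors (a simplifying assumption)."""
    return word_accuracy ** sentence_length


for p in (0.97, 0.98):
    for n in (10, 15, 20):  # hypothetical sentence lengths
        print(f"p={p:.2f}  n={n:2d}  ->  {sentence_accuracy(p, n):.2f}")
```

Under these assumptions, per-word accuracy in the high 90s already implies that a sizeable share of sentences contains at least one error, which is in line with the gap between the two figures.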

 

Available

https://github.com/ppke-nlpg/purepos

Command line

$ java -jar ./purepos.jar train -m test.model -i train.txt

$ java -jar ./purepos.jar tag -m test.model -i input.txt -o output.txt

$ java -jar ./purepos.jar -h

Usage: java -jar <purepos.jar> [options...] arguments...
 tag|train                        : Mode selection: train for training the
                                    tagger, tag for tagging a text with the
                                    given model.
 -a (--analyzer) <analyzer>       : Set the morphological analyzer. <analyzer>
                                    can be 'none', 'integrated' or a file :
                                    <morphologicalTableFile>. The default is to
                                    use the integrated one. Tagging only option.
 -b (--beam-theta) <theta>        : Set the beam-search limit. The default is
                                    1000. Tagging only option.
 -c (--encoding) <encoding>       : Encoding used to read the training set, or
                                    write the results. The default is your OS
                                    default.
 -d (--beam-decoder)              : Use Beam Search decoder. The default is to
                                    employ the Viterbi algorithm. Tagging only
                                    option.
 -e (--emission-order) <number>   : Order of emission. First order means that
                                    the given word depends only on its tag. The
                                    default is 2.  Training only option.
 -f (--config-file) <file>        : Configuration file containing tag mappings.
                                    By default, no tags are mapped.
 -g (--max-guessed) <number>      : Limit the max guessed tags for each token.
                                    The default is 10. Tagging only option.
 -h (--help)                      : Print this message.
 -i (--input-file) <file>         : File containing the training set (for
                                    training) or the text to be tagged (for
                                    tagging). The default is the standard input.
 -m (--model) <modelfile>         : Specifies a path to a model file. If an
                                    existing model is given for training, the
                                    tool performs incremental training.
 -n (--max-results) <number>      : Set the expected maximum number of tag
                                    sequences (with its score). The default is
                                    1. Tagging only option.
 -o (--output-file) <file>        : File where the tagging output is redirected.
                                    Tagging only option.
 -r (--rare-frequency) <threshold> : Add only words to the suffix trie with
                                    frequency less than the given threshold. The
                                    default is 10.  Training only option.
 -s (--suffix-length) <length>    : Use a suffix trie for guessing unknown
                                    words tags with the given maximum suffix
                                    length. The default is 10.  Training only
                                    option.
 -t (--tag-order) <number>        : Order of tag transition. Second order means
                                    trigram tagging. The default is 2. Training
                                    only option.
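
The tagging invocation can also be assembled from a script. The sketch below builds the same command line as shown above, using only flags listed in the help text (`-m`, `-i`, `-o`, `-d`, `-g`). The helper names `purepos_tag_command` and `run_purepos` are hypothetical, and the jar path is assumed to be `./purepos.jar` as in the examples.

```python
import subprocess
from typing import List, Optional


def purepos_tag_command(model: str, input_file: str, output_file: str,
                        jar: str = "./purepos.jar",
                        beam: bool = False,
                        max_guessed: Optional[int] = None) -> List[str]:
    """Build the argument list for tagging (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "tag",
           "-m", model, "-i", input_file, "-o", output_file]
    if beam:
        cmd.append("-d")                  # beam search instead of Viterbi
    if max_guessed is not None:
        cmd += ["-g", str(max_guessed)]   # cap on guessed tags per token
    return cmd


def run_purepos(cmd: List[str]) -> None:
    """Run the command; requires Java and purepos.jar to be available."""
    subprocess.run(cmd, check=True)


print(" ".join(purepos_tag_command("test.model", "input.txt", "output.txt")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.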

Python

>>> ptrainer = PurePosTrainer("./simple.model", verbose=True)
>>> ptrainer.train(parse_text(train_text), finalize=False)
...

>>> ptagger = PurePosTagger("./simple.model")

>>> ptagger.tag("A mítápon jó a hagulat .".split())
...

>>> ptagger.tag([("Józsi", [("Józsi", "[FN][NOM]")]),
                 ("ütött", [("üt", "[IGE][Me3]", -0.001),
                            ("üt", "[IGE][_MIB][NOM]", -99)]),
                 "."])
...
https://github.com/ppke-nlpg/purepos.py
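
The last `tag` call above mixes plain tokens with tokens that carry their own analyses. Judging from that example alone, a pre-analysed token appears to be a (word, analyses) pair in which each analysis is (lemma, tag) with an optional third number. The sketch below assumes that number is a log-probability-like score, which the slides do not state. `Analysis` and `normalize_token` are hypothetical names for illustration and are not part of purepos.py.

```python
from typing import List, NamedTuple, Optional, Tuple, Union


class Analysis(NamedTuple):
    lemma: str
    tag: str
    score: Optional[float]  # assumed: log-probability-like weight, if given


Token = Union[str, Tuple[str, list]]


def normalize_token(token: Token) -> Tuple[str, List[Analysis]]:
    """Return (word, analyses); a plain string token has no analyses."""
    if isinstance(token, str):
        return token, []
    word, analyses = token
    normalized = []
    for item in analyses:
        score = item[2] if len(item) > 2 else None
        normalized.append(Analysis(item[0], item[1], score))
    return word, normalized


sentence = [("Józsi", [("Józsi", "[FN][NOM]")]),
            ("ütött", [("üt", "[IGE][Me3]", -0.001),
                       ("üt", "[IGE][_MIB][NOM]", -99)]),
            "."]
for token in sentence:
    print(normalize_token(token))
```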

oroszgy@itk.ppke.hu