wordseg

text-based word segmentation

 

Mathieu Bernard - CoML

 

https://github.com/bootphon/wordseg

A basic task in first language acquisition likely involves discovering the boundaries between words or morphemes in input
where these basic units are not overtly segmented.

Started in 2017... Just published!

Why present it?

  • a mini case study of development issues in the team
  • started as a set of heterogeneous tools from various sources
  • needed a unified pipeline: easy to use, reliable

What's inside?

  • a Python library and collection of command-line tools
  • several algorithms (pure Python or wrapped C++)
  • standardized interfaces and evaluations
  • good coding standards

Full pipeline for word segmentation

  • text phonemization
  • text preparation
  • text segmentation
  • segmentation evaluation

A "hello world" example

$ echo "hello world" | phonemize -p ' ' -w ';eword ' -b festival
hh ax l ow ;eword w er l d ;eword
$ echo "hh ax l ow ;eword w er l d ;eword " | wordseg-prep -p' ' -w';eword' -g gold.txt
hh ax l ow w er l d
$ cat gold.txt
hhaxlow werld
$ echo "hh ax l ow w er l d" | wordseg-baseline -P 0.2
hhaxloww erld
$ echo "hhaxloww erld" | wordseg-eval gold.txt
token_fscore	        0
type_fscore	        0
boundary_all_fscore	0.6667
boundary_noedge_fscore	0
... more skipped ...
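The `wordseg-baseline -P 0.2` call above places a word boundary after each phone with probability 0.2. A self-contained sketch of that idea in pure Python (`baseline_segment` is an illustrative name, not the library's API):

```python
import random

def baseline_segment(utterance, probability=0.2, seed=None):
    """Insert a word boundary after each phone with the given probability.

    `utterance` is a space-separated sequence of phones, e.g.
    'hh ax l ow w er l d'. Returns the segmented utterance, with spaces
    marking the hypothesized word boundaries.
    """
    rng = random.Random(seed)
    words, current = [], []
    for phone in utterance.split():
        current.append(phone)
        # draw a boundary after this phone with the given probability
        if rng.random() < probability:
            words.append(''.join(current))
            current = []
    if current:  # flush the last, possibly unfinished, word
        words.append(''.join(current))
    return ' '.join(words)
```

With `probability=0` the whole utterance comes out as one word; with `probability=1` every phone becomes its own word.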

Segmentation algorithms

  algorithm   language   origin
  ---------   --------   ------
  ??          Python     own
  ??          Python     external
  ??          Python     own
  ??          Python     internal
  ??          C++        external
  dpseg       C++        external (bug)

heterogeneous implementations but a homogeneous interface:

  • Python: segmented_text = segment(input_text, **args)
  • bash:   cat input_text | wordseg-XXX [--args] > segmented_text
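Every wordseg-XXX command is essentially the same thin wrapper: read utterances from stdin, call the algorithm's segmentation function, print to stdout. A hypothetical sketch of that pattern (names and flags here are illustrative, not the actual wordseg code):

```python
import sys
import argparse

def segment(text, probability=0.2):
    """Stand-in segmentation function; a real tool plugs its algorithm here.

    For this sketch it just glues each utterance into a single word.
    """
    return (line.replace(' ', '') for line in text)

def main(argv=None):
    # every command shares this stdin -> segment -> stdout shape
    parser = argparse.ArgumentParser(description='segment a text from stdin')
    parser.add_argument('-P', '--probability', type=float, default=0.2)
    args = parser.parse_args(argv)

    utterances = (line.strip() for line in sys.stdin if line.strip())
    for segmented in segment(utterances, probability=args.probability):
        print(segmented)

if __name__ == '__main__':
    main()
```

Keeping the `segment()` generator separate from the I/O wrapper is what lets the same algorithm serve both the Python API and the command-line tool.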

Coding standards

  • coding style

            https://pep8.org

  • unit tests

            https://docs.pytest.org

  • docstrings

           https://www.sphinx-doc.org

  • hosting

           hosted on private gitlab => deploy to oberon

           mirrored on public github => public access

  • continuous integration

Docstrings using sphinx

def gold(text, separator=Separator()):
    """Returns a gold text from a phonologized one

    The returned gold text is the ground-truth segmentation. It has
    phone and syllable separators removed and word separators replaced
    by a single space ' '. It is used to evaluate the output of
    segmentation algorithms.

    Parameters
    ----------
    text : sequence
        The input text to be prepared for segmentation. Each element
        of the sequence is assumed to be a single and complete
        utterance in valid phonological form.
    separator : Separator, optional
        Token separation in the `text`

    Returns
    -------
    gold_text : generator
        Gold utterances with separators removed and words separated by
        spaces. The returned text is the gold version, against which
        the algorithms are evaluated.

    """
    # delete phone and syllable separators. Replace word boundaries by
    # a single space.
    gold = (line.replace(separator.syllable, '')
            .replace(separator.phone or '', '')
            .replace(separator.word, ' ') for line in text)

    # delete any duplicate, begin or end spaces. As for prepare, we
    # ignore empty lines.
    return (line for line in (utils.strip(line) for line in gold) if line)
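To make the transformation concrete, here is a self-contained replica of what `gold` does to a single utterance, assuming the default separators (' ' for phones, ';esyll' for syllables, ';eword' for words); `mini_gold` is an illustrative name:

```python
def mini_gold(line, phone=' ', syllable=';esyll', word=';eword'):
    """Replicate the gold() transformation on a single utterance."""
    # remove syllable then phone separators, turn word separators into spaces
    line = line.replace(syllable, '').replace(phone, '').replace(word, ' ')
    # collapse duplicate spaces and strip the edges, as utils.strip does
    return ' '.join(line.split())

# the phonologized "hello world" becomes its gold segmentation
mini_gold('hh ax l ow ;eword w er l d ;eword')  # -> 'hhaxlow werld'
```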

Unit tests using pytest


def test_pairwise():
    assert list(_pairwise([])) == []
    assert list(_pairwise([1])) == []
    assert list(_pairwise([1, 2])) == [(1, 2)]
    assert list(_pairwise([1, 2, 3])) == [(1, 2), (2, 3)]


@pytest.mark.parametrize('utt', BAD_UTTERANCES)
def test_bad_utterances(utt):
    with pytest.raises(ValueError):
        check_utterance(utt, separator=Separator())


@pytest.mark.parametrize('utt', GOOD_UTTERANCES)
def test_good_utterances(utt):
    assert check_utterance(utt, separator=Separator())


@pytest.mark.parametrize(
    'utt', ['n i2 ;eword s. w o5 ;eword s. əɜ n ;eword m o-ɜ ;eword',
            'ɑ5 ;eword j iɜ ;eword (en) aɜ m (zh) ;eword',
            't u b l i ;eword p o i i s^ s^ ;eword'])
def test_punctuation(utt):
    with pytest.raises(ValueError):
        list(prepare([utt], check_punctuation=True))

    list(prepare([utt], check_punctuation=False))


def test_empty_lines():
    text = ['', '']
    assert len(list(prepare(text))) == 0
    assert len(list(gold(text))) == 0

    text = [
        'hh ax l ;esyll ow ;esyll ;eword',
        '',
        'hh ax l ;esyll ow ;esyll ;eword']
    assert len(list(
        prepare(text, separator=Separator(), unit='phone'))) == 2
    assert len(list(gold(text, separator=Separator()))) == 2
(wordseg) mathieu@deaftone:~/dev/wordseg$ pytest 
============================= test session starts ==============================
platform linux -- Python 3.7.2, pytest-4.3.1, py-1.8.0, pluggy-0.9.0 
cachedir: .pytest_cache
rootdir: /home/mathieu/dev/wordseg, inifile: setup.cfg
plugins: cov-2.6.1
collected 290 items                                                            

test/test_algo_ag.py::test_grammar_files PASSED                          [  0%]
test/test_algo_ag.py::test_check_grammar PASSED                          [  0%]
test/test_algo_ag.py::test_segment_single PASSED                         [  1%]
test/test_algo_ag.py::test_counter PASSED                                [  1%]
test/test_algo_ag.py::test_yield_parses PASSED                           [  1%]
test/test_algo_ag.py::test_setup_seed PASSED                             [  2%]
test/test_algo_ag.py::test_ignore_first_parses[-10] PASSED               [  2%]
test/test_algo_ag.py::test_ignore_first_parses[-5] PASSED                [  2%]
test/test_algo_ag.py::test_ignore_first_parses[-1] PASSED                [  3%]
test/test_algo_ag.py::test_ignore_first_parses[0] PASSED                 [  3%]
test/test_algo_ag.py::test_ignore_first_parses[5] PASSED                 [  3%]
test/test_algo_ag.py::test_ignore_first_parses[6] PASSED                 [  4%]
test/test_algo_ag.py::test_parse_counter[1] PASSED                       [  4%]
test/test_algo_ag.py::test_parse_counter[2] PASSED                       [  5%]
test/test_algo_ag.py::test_parse_counter[4] PASSED                       [  5%]
test/test_algo_ag.py::test_parse_counter[10] PASSED                      [  5%]
test/test_algo_ag.py::test_default_grammar PASSED                        [  6%]
test/test_algo_ag.py::test_mark_jonhson PASSED                           [  7%]
test/test_algo_baseline.py::test_proba_bad[1.01] PASSED                  [  7%]
test/test_algo_baseline.py::test_proba_bad[-1] PASSED                    [  7%]
test/test_algo_baseline.py::test_proba_bad[a] PASSED                     [  8%]
test/test_algo_baseline.py::test_proba_bad[True] PASSED                  [  8%]
test/test_algo_baseline.py::test_proba_bad[1] PASSED                     [  8%]
test/test_algo_baseline.py::test_hello PASSED                            [  9%]
test/test_algo_dibs.py::test_basic[gold-0-None] PASSED                   [  9%]
test/test_algo_dibs.py::test_basic[gold-0-0.2] PASSED                    [ 10%]
test/test_algo_dibs.py::test_basic[gold-0.5-None] PASSED                 [ 10%]

...
test/test_wordseg_commands.py::test_command_help[baseline] PASSED        [ 98%]
test/test_wordseg_commands.py::test_command_help[ag] PASSED              [ 98%]
test/test_wordseg_commands.py::test_command_help[dibs] PASSED            [ 98%]
test/test_wordseg_commands.py::test_command_help[dpseg] PASSED           [ 99%]
test/test_wordseg_commands.py::test_command_help[tp] PASSED              [ 99%]
test/test_wordseg_commands.py::test_command_help[puddle] PASSED          [100%]
=================== 280 passed, 10 skipped in 45.54 seconds ====================

Build system

  • mix of Python and C++
  • use CMake instead of setup.py
  • not really portable on Windows
  • provide a Docker image

git clone https://github.com/bootphon/wordseg.git
mkdir -p wordseg/build && cd wordseg/build
cmake ..
make
make test
make install

Continuous integration

[workflow diagram: a push on master is built, tested and installed on oberon (private GitLab); the repository is mirrored to public GitHub, which runs build/test and publishes test coverage, documentation and citation information]

Usage on the cluster

./wordseg-slurm.sh <jobs-file> <output-directory> [<options>]
  • <jobs-file> defines the jobs, one per line
  • simply schedules a list of wordseg jobs

<job-name>  <input-file>  <unit>  <separator>         <wordseg-command>
example     ./input.txt   phone   -p ' ' -w ';eword'  wordseg-baseline -P 0.2
  • after execution, <output-directory>/<job-name> contains:
input.txt
gold.txt
stats.json
output.txt
eval.txt
eval_summary.json
log.txt
job.sh
  • each job runs the full processing pipeline:
input | stats | prepare | segment | evaluate
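The evaluate step compares the segmented output with the gold text in terms of boundary placement, with and without the utterance edges. A minimal sketch of such a boundary f-score, reproducing the 0.6667 / 0 pair from the hello-world example (illustrative, not wordseg-eval's actual code):

```python
def boundary_positions(utterance, edges=False):
    """Boundary indices, counted in characters with spaces excluded."""
    positions, count = set(), 0
    for char in utterance:
        if char == ' ':
            positions.add(count)
        else:
            count += 1
    if edges:
        # also count the utterance start and end as boundaries
        positions |= {0, count}
    return positions

def boundary_fscore(segmented, gold, edges=False):
    """Harmonic mean of boundary precision and recall."""
    seg = boundary_positions(segmented, edges=edges)
    ref = boundary_positions(gold, edges=edges)
    hits = len(seg & ref)
    if not hits:
        return 0.0
    precision, recall = hits / len(seg), hits / len(ref)
    return 2 * precision * recall / (precision + recall)
```

On 'hhaxloww erld' against the gold 'hhaxlow werld', the single internal boundary is misplaced (score 0), but the two utterance edges still match (score 2/3), as in the wordseg-eval output shown earlier.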
