Word Alignment

Paul Charousset & Yannick Péroux

29 April 2014 - CA446

Dublin City University

The Goal

Can we correct the output of a word aligner using a set of supervised alignments?

Testing Data

Data from the Hansard
447 English-French sentence pairs
Reference word alignments made by linguists
Unsupervised word alignments produced with Giza++

Tools

Stanford POS-tagger
NLTK
Weka

Methodology

We generate an ARFF file from the data
We add some attributes (obtained with the POS tagger)
2 classes:

S - an alignment from the supervised reference
N - an alignment proposed by Giza++ but absent from the reference

We train a classifier in Weka on the ARFF file and evaluate it with cross-validation
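
The labelling and ARFF generation steps above can be sketched as follows. This is a minimal illustration, not the code used for the project: the function names, the toy sentence pair, the attribute names (src_pos, tgt_pos, offset) and the choice of features are all assumptions, since the slides only say that attributes come from the POS tagger.

```python
# Hypothetical sketch: names, features and toy data are assumptions,
# not taken from the original project.

def label_links(reference, giza):
    """Label each link S (in the reference) or N (proposed by Giza++ only)."""
    labelled = [(link, "S") for link in sorted(reference)]
    labelled += [(link, "N") for link in sorted(giza - reference)]
    return labelled


def build_rows(src_tags, tgt_tags, reference, giza):
    """One row per link: source POS, target POS, position offset, class."""
    rows = []
    for (i, j), label in label_links(reference, giza):
        rows.append((src_tags[i], tgt_tags[j], abs(i - j), label))
    return rows


def to_arff(rows, relation="alignments"):
    """Serialise rows as ARFF text, declaring POS tags as nominal attributes."""
    src_values = sorted({r[0] for r in rows})
    tgt_values = sorted({r[1] for r in rows})
    lines = [
        f"@RELATION {relation}",
        "",
        "@ATTRIBUTE src_pos {" + ",".join(src_values) + "}",
        "@ATTRIBUTE tgt_pos {" + ",".join(tgt_values) + "}",
        "@ATTRIBUTE offset NUMERIC",
        "@ATTRIBUTE class {S,N}",
        "",
        "@DATA",
    ]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Toy example: "the house" / "la maison", with one spurious Giza++ link.
src_tags = ["DT", "NN"]
tgt_tags = ["DET", "NC"]
reference = {(0, 0), (1, 1)}
giza = {(0, 0), (1, 1), (0, 1)}

rows = build_rows(src_tags, tgt_tags, reference, giza)
arff_text = to_arff(rows)
print(arff_text)
```

The sketch does not quote nominal values, so POS tags containing commas, quotes or spaces would need escaping before the file is loaded into Weka.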

Results

Starting point:

80% of the alignments are good (S)
20% are bad (N)

We improved on this by a very small margin (< 0.1%)
Our corpus isn't large enough
We can't use complex structures with Weka
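
One possible reading of the starting point, which the slides do not spell out, is that it corresponds to a majority-class baseline: a classifier that always answers S is right on 80% of the instances. The sketch below illustrates that reading on synthetic labels with the same 80/20 split; the function name and the label list are assumptions, not project data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Synthetic labels reproducing the 80% S / 20% N split from the slides.
labels = ["S"] * 80 + ["N"] * 20
label, accuracy = majority_baseline(labels)
print(label, accuracy)
```

Under this reading, a gain of less than 0.1% means the trained classifier is barely distinguishable from always predicting S.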

Conclusion

We tried to extract information from a known set of word alignments
We didn't have enough data
A naive machine-learning algorithm isn't sufficient

Questions?
