Word Alignment
Paul Charousset & Yannick Péroux
29 April 2014 - CA446
Dublin City University
The Goal
Can we correct the output of a word aligner using a set of supervised alignments?
Testing Data
- Data from the Hansard
- 447 English-French sentence pairs
- Reference word alignments made by linguists
- Unsupervised word alignments produced with GIZA++
Tools
- Stanford POS-tagger
- NLTK
- Weka
Methodology
- We generate an ARFF file from the data
- We add some attributes derived from the POS tagger
- 2 classes:
  - S - a supervised alignment
  - N - an alignment detected by GIZA++ but not in the reference
- We train a Weka classifier on the ARFF file and evaluate it with cross-validation
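
The labelling and ARFF steps above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project. It assumes an alignment is a set of (source index, target index) pairs, that every GIZA++ link is one instance (S if it also appears in the reference, N otherwise), and that the only attributes are the POS tags of the two linked words. The names `label_links`, `to_arff`, `src_pos` and `tgt_pos` are hypothetical, and the tags are written by hand instead of coming from the Stanford POS tagger.

```python
def label_links(giza_links, reference_links):
    """Label each GIZA++ link S (also in the reference) or N (not in it)."""
    reference = set(reference_links)
    return [(link, "S" if link in reference else "N")
            for link in sorted(giza_links)]


def to_arff(rows, pos_tags):
    """Render (src_pos, tgt_pos, label) rows as ARFF text (nominal attributes)."""
    tags = ",".join(sorted(pos_tags))
    lines = [
        "@relation word_alignment",
        "@attribute src_pos {%s}" % tags,   # hypothetical attribute name
        "@attribute tgt_pos {%s}" % tags,   # hypothetical attribute name
        "@attribute class {S,N}",
        "@data",
    ]
    lines += ["%s,%s,%s" % row for row in rows]
    return "\n".join(lines) + "\n"


# Toy sentence pair with hand-written (assumed) POS tags.
src_pos = ["DET", "NOUN", "VERB"]
tgt_pos = ["DET", "NOUN", "VERB"]
giza = {(0, 0), (1, 1), (2, 1)}        # links proposed by the aligner
reference = {(0, 0), (1, 1), (2, 2)}   # links annotated by hand

labelled = label_links(giza, reference)
rows = [(src_pos[i], tgt_pos[j], label) for (i, j), label in labelled]
arff_text = to_arff(rows, set(src_pos) | set(tgt_pos))
print(arff_text)
```

Real tagger output may contain characters such as commas or quotes, which would need quoting before being written as ARFF nominal values.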
Results
- Starting point:
  - 80% of the alignments are good (S)
  - 20% are bad (N)
- We improved on the starting point only marginally (< 0.1%)
- Our corpus isn't large enough
- We can't use complex structures with Weka
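
To put the starting point in perspective: with an 80/20 class split, a classifier that always answers S is already right 80% of the time, so any useful model has to beat that figure. The sketch below computes this majority-class baseline (the same idea as Weka's ZeroR classifier). The label list is made up to match the 80/20 proportion, it is not the project data, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Illustrative labels only: 80 S and 20 N, mirroring the 80% / 20% split.
labels = ["S"] * 80 + ["N"] * 20
baseline_label, baseline_accuracy = majority_baseline(labels)
print(baseline_label, baseline_accuracy)  # S 0.8
```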
Conclusion
- We tried to extract information from a known set of word-alignments
- We didn't have enough data
- A naive machine-learning algorithm isn't sufficient