Word Alignment


Paul Charousset & Yannick Péroux
29 April 2014 - CA446
Dublin City University

The Goal






Can we correct the output of a word aligner using a set of supervised alignments?

Testing Data


  • Data from the Hansard
  • 447 English-French sentence pairs
  • Reference word alignments made by linguists
  • Unsupervised word alignments from GIZA++

Tools


  • Stanford POS-tagger
  • NLTK
  • Weka

Methodology


  • We generate an ARFF file from the data
  • We add some attributes (from the POS tagger)
  • 2 classes
    • S - a supervised alignment
    • N - an alignment detected by GIZA++ but not in the reference
  • We train a classifier in Weka on the ARFF file and evaluate it with cross-validation
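
The methodology above can be sketched roughly as follows. This is an illustration only, not the code used in the project: the function names (`label_links`, `build_rows`, `to_arff`), the attribute set (source POS, target POS, index distance), and the toy sentence pair are all assumptions. Alignment links are assumed to be (source index, target index) pairs, and the POS tags are assumed to have been computed beforehand.

```python
def label_links(reference, giza):
    """Label each link: 'S' if it is in the reference, 'N' if only GIZA++ proposed it."""
    labelled = {link: "S" for link in reference}
    for link in giza:
        labelled.setdefault(link, "N")
    return sorted(labelled.items())


def build_rows(labelled, src_pos, tgt_pos):
    """Turn labelled links into (src_pos, tgt_pos, distance, class) rows.

    The three attributes are hypothetical; the slides do not list the real ones.
    """
    return [(src_pos[i], tgt_pos[j], abs(i - j), label)
            for (i, j), label in labelled]


def to_arff(rows):
    """Render the rows as ARFF text with nominal POS attributes and a class attribute."""
    src_tags = ",".join(sorted({row[0] for row in rows}))
    tgt_tags = ",".join(sorted({row[1] for row in rows}))
    header = [
        "@RELATION alignments",
        "@ATTRIBUTE src_pos {%s}" % src_tags,
        "@ATTRIBUTE tgt_pos {%s}" % tgt_tags,
        "@ATTRIBUTE distance NUMERIC",
        "@ATTRIBUTE class {S,N}",
        "@DATA",
    ]
    data = ["%s,%s,%d,%s" % row for row in rows]
    return "\n".join(header + data) + "\n"


# Toy example: "the house" / "la maison", with one spurious GIZA++ link (1, 0).
reference = {(0, 0), (1, 1)}
giza = {(0, 0), (1, 1), (1, 0)}
labelled = label_links(reference, giza)
rows = build_rows(labelled, ["DT", "NN"], ["DET", "NC"])
arff_text = to_arff(rows)
print(arff_text)
```

The sketch does not quote nominal values, so real POS tags containing characters such as commas or spaces would need quoting before being written to an ARFF file.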

Results


  • Starting point:
    • 80% of the alignments are good (S)
    • 20% are bad (N)
  • We improved on this only very slightly (< 0.1%)
  • Our corpus isn't large enough
  • We can't use complex structures with Weka
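
One way to read the starting point: with an 80/20 split between S and N, a classifier that always answers S is already right 80% of the time, so any trained model has to beat that figure. The sketch below computes this majority-class baseline; the function name `majority_baseline` and the toy label list are assumptions that only mirror the proportions above, not the project's data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels mirroring the 80% S / 20% N split (not the real corpus).
labels = ["S"] * 80 + ["N"] * 20
baseline_label, baseline_acc = majority_baseline(labels)
print(baseline_label, baseline_acc)
```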

Conclusion


  • We tried to extract information from a known set of word-alignments
  • We didn't have enough data
  • A naive machine-learning algorithm isn't sufficient




Questions?

SMT

By Yannick Péroux