Improving Word Alignment by Exploiting Adapted Word Similarity

Publication at Faculty of Mathematics and Physics

2012

Abstract

This paper presents a method for improving the word alignment model of a phrase-based Statistical Machine Translation system for a low-resource language, using a string similarity approach. Our method captures similar words that can be seen as semi-monolingual across languages, such as numbers, named entities, and adapted/loan words.

We measure the monolinguality of word pairs with several string similarity metrics: Longest Common Subsequence Ratio (LCSR), Minimum Edit Distance Ratio (MEDR), and a modified BLEU score (modBLEU). Our approach adds intersecting alignment points for word pairs that are orthographically similar before a word alignment heuristic is applied, so that the heuristic produces a better word alignment.
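As a rough illustration of the idea, the sketch below computes LCSR and MEDR for a word pair and adds alignment points for pairs whose similarity passes a threshold. It is not the implementation from the paper. It assumes the commonly used definitions (LCSR as the longest common subsequence length divided by the longer word's length, MEDR as one minus the edit distance divided by the longer word's length), and the function names, the lowercasing step, and the threshold of 0.8 are hypothetical choices made for this example. modBLEU is omitted because the abstract does not define it.

```python
from typing import List, Set, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Assumed definition: LCS length over the longer word's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def medr(a: str, b: str) -> float:
    """Assumed definition: 1 - edit distance over the longer word's length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def add_similar_points(
    src_tokens: List[str],
    tgt_tokens: List[str],
    intersection: Set[Tuple[int, int]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> Set[Tuple[int, int]]:
    """Return the intersection alignment plus points for similar word pairs."""
    points = set(intersection)
    for i, src in enumerate(src_tokens):
        for j, tgt in enumerate(tgt_tokens):
            if lcsr(src.lower(), tgt.lower()) >= threshold:
                points.add((i, j))
    return points
```

With these definitions, the Indonesian loan word "komputer" and English "computer" score 0.875 on both LCSR and MEDR, so the pair would be added as an alignment point before a symmetrization heuristic runs. Under the same assumptions, a number such as "2012" matches itself with a score of 1.0.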

We demonstrate this approach on an Indonesian-to-English translation task, where the languages share many similar words that are poorly aligned given limited training data. The approach yields a statistically significant improvement of up to 0.66 BLEU points.

Keywords

improving word alignment exploiting adapted word similarity