Selecting Data for English-to-Czech Machine Translation

Publication at Faculty of Mathematics and Physics | 2012

Abstract

We provide a few insights on data selection for machine translation. We evaluate the quality of the new CzEng 1.0, a parallel data source used in WMT12.

We describe a simple technique for reducing the out-of-vocabulary rate after phrase extraction. We discuss the benefits of tuning towards multiple reference translations for the English-Czech language pair.
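
For readers unfamiliar with the metric, the out-of-vocabulary rate mentioned above is commonly computed as the share of test-set tokens that never occur in the training data. The following is a minimal sketch of that definition only; it is not the reduction technique from the paper, and the function name `oov_rate` and the lowercased whitespace tokenization are assumptions made for illustration.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that do not appear in the training vocabulary."""
    # Assumption: lowercased whitespace tokenization stands in for a real tokenizer.
    vocab = {tok for s in train_sentences for tok in s.lower().split()}
    test_tokens = [tok for s in test_sentences for tok in s.lower().split()]
    if not test_tokens:
        return 0.0
    unknown = sum(1 for tok in test_tokens if tok not in vocab)
    return unknown / len(test_tokens)


train = ["we translate english", "we select data"]
test = ["we translate czech data"]
rate = oov_rate(train, test)
```

Here only "czech" is unseen among the four test tokens, so `rate` is 0.25.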

We introduce a novel approach to data selection by full-text indexing and search: we select sentences similar to the test set from a large monolingual corpus and explore several options for incorporating them into a machine translation system. We show that this method can improve translation quality.
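
The selection step described above can be illustrated with a small inverted index. This is only a sketch under stated assumptions: the abstract does not name the search engine or the similarity score, so the IDF-weighted token overlap, the `top_k` cutoff, and all function names below are hypothetical stand-ins for a real full-text index.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    # Assumption: lowercased whitespace tokenization.
    return sentence.lower().split()


def build_index(corpus):
    """Map each token to the set of corpus sentence ids that contain it."""
    index = defaultdict(set)
    for sent_id, sentence in enumerate(corpus):
        for token in set(tokenize(sentence)):
            index[token].add(sent_id)
    return index


def select_similar(corpus, test_set, top_k=1):
    """Return sorted ids of corpus sentences retrieved for any test sentence."""
    index = build_index(corpus)
    n = len(corpus)
    selected = set()
    for query in test_set:
        scores = Counter()
        for token in set(tokenize(query)):
            postings = index.get(token, ())
            if not postings:
                continue  # token never occurs in the corpus
            # Rarer tokens contribute more to the match score.
            idf = math.log(1 + n / len(postings))
            for sent_id in postings:
                scores[sent_id] += idf
        # Highest score first; ties broken by lower sentence id.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.update(sent_id for sent_id, _ in ranked[:top_k])
    return sorted(selected)


corpus = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "the government approved the budget",
    "markets rallied after the budget vote",
]
test_set = ["the budget vote moved markets"]
chosen = select_similar(corpus, test_set, top_k=2)
```

With this toy corpus, `chosen` holds the ids of the two budget-related sentences. How the selected sentences are then incorporated into the translation system is left open here, as the abstract only says that several options were explored.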

Finally, we describe our submitted system, CU-TAMCH-BOJ.