The Development of a Comprehensive Data Set for Systematic Studies of Machine Translation

Publication

Abstract

This paper presents our ongoing efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500 GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants.

We present the structure of the data set and demonstrate its use for systematic studies through baseline experiments with multilingual neural machine translation between Uralic languages and other language groups. Our initial results show that effective multilingual translation models can be trained on skewed training data, but they also highlight the shortcomings in low-resource settings and the difficulty of obtaining sufficient information through straightforward transfer from related languages.

Keywords

machine translation; low-resource languages; multilingual NLP