Surprise Language Challenge: Developing a Neural Machine Translation System between Pashto and English in Two Months

Publication at Faculty of Mathematics and Physics | 2021

Abstract

In the media industry, the focus of global reporting can shift overnight. There is a compelling need to develop new machine translation systems quickly, so that fast-developing stories can be covered more efficiently.

As part of the low-resource machine translation project GoURMET, we selected a surprise language for which a system had to be built and evaluated in two months (February and March 2021). The language selected was Pashto, an Indo-Iranian language spoken in Afghanistan, Pakistan and India.

In this period we completed the full pipeline of development of a neural machine translation system: data crawling, cleaning, aligning, creating test sets, developing and testing models, and delivering them to the user partners. In this paper we describe the rapid data creation process, and experiments with transfer learning and pretraining for Pashto-English.

We find that starting from an existing large model pre-trained on 50 languages leads to far better BLEU scores than pre