Low-Resource Machine Translation Systems for Indic Languages

Publication at the Faculty of Mathematics and Physics

Abstract

We present the submission of the CUNI team to the WMT23 shared task in translation between English and Assamese, Khasi, Mizo, and Manipuri. All our systems were pretrained on the task of multilingual masked language modelling and denoising auto-encoding.

Our primary systems for translation into English were further pretrained for multilingual MT in all four language directions and fine-tuned on the limited parallel data available for each language pair separately. We used online back-translation for data augmentation.

The same systems were submitted as contrastive for translation out of English, where the multilingual MT pretraining step seemed to harm translation performance. Other contrastive systems used additional pseudo-parallel data mined from monolingual corpora.