
Statistical Machine Translation

Class at Faculty of Mathematics and Physics |
NPFL087

Syllabus

Evaluating machine translation quality (manually and automatically). Empirical confidence bounds and reliability of MT metrics in general.
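To make "empirical confidence bounds" concrete, here is a minimal sketch of bootstrap resampling over per-sentence metric scores, the standard way to attach a confidence interval to a corpus-level MT score. The function name and parameters are illustrative, not from any particular toolkit.

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Empirical confidence interval for the mean of per-sentence metric scores.

    Resamples the test set with replacement many times and reads the
    interval off the sorted distribution of resampled means.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

The same resampling scheme, applied to the score *difference* between two systems, yields the paired bootstrap significance test commonly used when comparing MT outputs.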

Machine translation as a problem in information theory. Translation model, language model, general log-linear model. The space of partial hypotheses and search in this space ("decoding"); phrase-based translation. Open-source toolkit Moses.
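The log-linear model mentioned above scores a candidate translation as a weighted sum of feature functions, log P(e|f) ∝ Σᵢ λᵢ hᵢ(e, f); the decoder picks the highest-scoring hypothesis. A toy sketch, with feature names and numbers invented purely for illustration:

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: the log-linear model's hypothesis score."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for two candidate translations:
# "tm" and "lm" are log-probabilities, "length" is a word count.
weights = {"tm": 1.0, "lm": 0.5, "length": -0.2}
hyps = {
    "hyp A": {"tm": -1.2, "lm": -2.0, "length": 5},
    "hyp B": {"tm": -0.8, "lm": -3.5, "length": 4},
}
best = max(hyps, key=lambda h: loglinear_score(hyps[h], weights))
```

The weights λᵢ are exactly what tuning methods such as MERT (discussed later in the course) optimize against a quality metric on held-out data.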

Neural MT overview: a direct model of translation probability, subword units, embeddings, sequence-to-sequence model. Open-source toolkits such as Neural Monkey, Nematus, OpenNMT, Marian.
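Subword units are typically obtained with byte-pair encoding (BPE): repeatedly merge the most frequent adjacent symbol pair in a word-frequency list. A minimal sketch (the toy vocabulary is invented for illustration):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
```

Frequent words end up as single symbols while rare words decompose into smaller pieces, which keeps the NMT vocabulary fixed-size and open.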

Parallel texts, alignment (sentence and word alignment, IBM models 1 to 3). Open-source tools for corpus preparation and alignment (hunalign, GIZA++).
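IBM Model 1 is small enough to sketch in full: EM iterations over a parallel corpus estimate the lexical translation table t(f|e). This simplified version omits the NULL word; the three-sentence toy corpus is the classic textbook example, not real data.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """EM training of IBM Model 1 translation probabilities t(f|e).

    `pairs` is a list of (foreign_tokens, english_tokens) sentence pairs.
    (NULL alignment omitted for brevity.)
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    init = 1.0 / len(f_vocab)          # uniform initialization
    t = defaultdict(lambda: init)
    for _ in range(iterations):
        count = defaultdict(float)     # expected counts c(f, e)
        total = defaultdict(float)     # normalizer per English word
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm   # E-step: fractional alignment
                    count[(f, e)] += frac
                    total[e] += frac
        t = defaultdict(lambda: init,         # M-step: renormalize
                        {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]
t = ibm_model1(corpus)
```

After a few iterations, co-occurrence statistics alone pull t(das|the) and t(buch|book) toward high values, illustrating how word alignments emerge without any supervision.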

Neural MT details: attention in sequence-to-sequence models, self-attentive models.
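At the core of both attention in sequence-to-sequence models and self-attentive models is the same operation: scaled dot-product attention. A dependency-free sketch over plain lists (toy vectors, no learned projections):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention: softmax(q·k / sqrt(d)) weighting of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: attention-weighted average of the value vectors.
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(dim)]
    return context, weights

context, weights = attention([1.0, 0.0],
                             [[1.0, 0.0], [0.0, 1.0]],
                             [[10.0, 0.0], [0.0, 10.0]])
```

In self-attention (as in the Transformer), queries, keys, and values are all linear projections of the same sequence, so every position attends to every other.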

Optimization: Tuning parameters of log-linear model (Minimum Error Rate Training, MERT). Specifics of training of neural MT.

Advanced NMT models: multi-task training, multi-lingual translation, multi-modal translation.

Morphological pre-processing, utilizing morphological information in phrase-based and neural MT.

Phrase-structure syntax in MT, translation based on (context-free) parsing. Generic hypergraph search.

Shallow and deep dependency syntax in MT, including tectogrammatical layer and TectoMT.

Presentation of students' contributions.

Students' contribution and grading:

Individuals or groups of two to three students choose a topic early in the term, set up experiments, implement a modification of an existing MT system, or run baseline experiments with an available prototype of an alternative MT method. Each project concludes with a written report and a presentation of the results in the lectures.

The tutorials ("cvičení") of the course are devoted to practical application of the algorithms and toolkits described in the lectures, as well as to consultations on students' projects.

The final grade reflects knowledge of the discussed topics, the project report, and the project presentation.

Annotation

Participants of the seminar will get closely acquainted with methods of machine translation (MT) that rely on automatic processing of (large) training data as well as with open-source implementations of these methods.

We will cover a wide range of approaches organized along two axes: the level of linguistic analysis (uninformed, utilizing morphology, surface and deep syntax) and the depth of machine learning methods used (classical statistical MT that decomposes input into pieces and neural MT that models the task end to end).