At the seminar, we will improve machine translation systems (especially translation into Czech) and take part in annual translation competitions such as http://www.statmt.org/wmt18/. Our systems have repeatedly achieved relatively good results, and we won in three consecutive years, 2013-2015, beating Google Translate among others.
Statistical machine translation is a challenging task, especially in terms of the volume of data processed. It is quite common to work in parallel on dozens of computers, and a single experiment can easily need 100 GB of disk space and 100 GB of RAM. Neural machine translation, in turn, requires GPUs with at least 8 GB of memory and training runs that last days or weeks.
We will rely on existing tools that are implemented in a mixture of languages such as Python, C/C++, Perl, Bash, and others. Very often, we will parallelize computations on the department's computing cluster or on MetaCentrum, including powerful graphics cards (GPUs).
During the semester, we will collectively improve open-source machine translation systems. People interested in natural language processing or deep learning will focus on analyzing or designing tricks and modifying models for better translation quality; general software engineers can focus on the infrastructure of the experimentation environment or the optimization of existing tools.
The seminar assumes only high school knowledge of the formal description of natural languages.
The seminar will take place in the Unix laboratory.
The seminar can serve as a supplement to Unix classes or as a very practical introduction to some aspects of computational linguistics. We will collectively improve existing tools and systems for statistical machine translation, including neural machine translation, and take part in competitions such as http://www.statmt.org/wmt18/. Our primary focus will be on Czech and English, but other languages will be considered based on the interest of participants.
Practically speaking, the seminar consists of scripting and operating a diverse collection of research tools and tackling a wide range of techn