
Results of the WMT20 Metrics Shared Task

Publication at Faculty of Mathematics and Physics | 2020

Abstract

This paper presents the results of the WMT20 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics.

Ten research groups submitted 27 metrics, four of which are reference-less "metrics". In addition, we computed five baseline metrics, including sentBLEU, BLEU, TER and chrF, using the SacreBLEU scorer.
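As an informal illustration of the baseline setup, the sketch below shows how corpus-level BLEU, chrF and TER, plus per-segment sentBLEU, can be computed with the SacreBLEU Python package. The hypothesis and reference sentences are invented placeholders, not data from the shared task.

# Minimal sketch: scoring MT output with the SacreBLEU scorer.
# The hypothesis/reference strings below are illustrative placeholders.
import sacrebleu

hypotheses = ["The cat sat on the mat .", "It is raining heavily today ."]
references = [["The cat sat on the mat .", "It rains a lot today ."]]  # one reference stream

# Corpus-level baseline metrics (higher is better for BLEU/chrF, lower for TER).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU={bleu.score:.2f}  chrF={chrf.score:.2f}  TER={ter.score:.2f}")

# Segment-level sentBLEU: one smoothed BLEU score per output segment.
for hyp, ref in zip(hypotheses, references[0]):
    seg = sacrebleu.sentence_bleu(hyp, [ref])
    print(f"sentBLEU={seg.score:.2f}  ||  {hyp}")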

All metrics were evaluated on how well they correlate at the system-, document- and segment-level with the WMT20 official human scores. We present an extensive analysis of the influence of reference translations on metric reliability and of how well automatic metrics score human translations, and we flag major discrepancies between metric and human scores when evaluating MT systems.
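The correlation analysis can be reproduced in outline with standard statistical tools. The sketch below, a simplification of the official evaluation rather than its exact procedure, computes a system-level Pearson correlation and a segment-level Kendall's tau between a metric's scores and human scores; all score arrays are invented placeholders, not WMT20 data.

# Minimal sketch of metric-human correlation, with made-up numbers
# standing in for real metric and human scores.
from scipy.stats import pearsonr, kendalltau

# System-level: one metric score and one human score per MT system (placeholders).
metric_sys = [27.1, 31.4, 29.8, 25.0]   # e.g. a metric score per system
human_sys = [0.05, 0.21, 0.13, -0.08]   # e.g. human assessment scores per system
r, _ = pearsonr(metric_sys, human_sys)
print(f"system-level Pearson r = {r:.3f}")

# Segment-level: one score per translated segment; WMT-style evaluation
# typically reports a Kendall-tau-like statistic over relative rankings.
metric_seg = [0.61, 0.42, 0.77, 0.55, 0.30]
human_seg = [0.50, 0.35, 0.80, 0.60, 0.20]
tau, _ = kendalltau(metric_seg, human_seg)
print(f"segment-level Kendall tau = {tau:.3f}")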

Finally, we investigate whether we can use automatic metrics to flag incorrect human ratings.