Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System

Publication at Faculty of Mathematics and Physics

2022

Abstract

In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020), in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, as a test of the repeatability of the original evaluation.

In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data and labelling task.

Our resources are publicly available as open source.

Keywords

reproductions; human-assessed comparative evaluation; semantic error detection system