CsFEVER and CTKFacts: acquiring Czech data for fact verification

Publication at the Faculty of Mathematics and Physics, Faculty of Social Sciences

2023

Abstract

In this paper, we examine several methods of acquiring Czech data for automated fact-checking, a task commonly modeled as classifying the veracity of a textual claim with respect to a corpus of trusted ground truths. We aim to collect data points that each consist of a factual claim, its evidence within the ground truth corpus, and a veracity label (supported, refuted, or not enough info).
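As a minimal sketch of the data point described above (claim, evidence, veracity label), the following Python snippet shows one way such a record could be represented. The names `Veracity`, `ClaimRecord`, and `label_counts`, as well as the evidence identifier format, are illustrative assumptions and are not taken from the published datasets.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Veracity(Enum):
    """The three veracity labels named in the abstract."""
    SUPPORTED = "supported"
    REFUTED = "refuted"
    NOT_ENOUGH_INFO = "not enough info"


@dataclass
class ClaimRecord:
    """One data point: a claim, its evidence, and a veracity label.

    The field names and the use of plain string identifiers for
    evidence documents are assumptions made for illustration.
    """
    claim: str
    evidence: List[str] = field(default_factory=list)
    label: Veracity = Veracity.NOT_ENOUGH_INFO


def label_counts(records: List[ClaimRecord]) -> Dict[Veracity, int]:
    """Count how many records carry each veracity label."""
    counts = {label: 0 for label in Veracity}
    for record in records:
        counts[record.label] += 1
    return counts


# Hypothetical example records; the claims and IDs are made up.
records = [
    ClaimRecord("Praha je hlavní město Česka.", ["doc-1"], Veracity.SUPPORTED),
    ClaimRecord("Brno leží u moře.", ["doc-2"], Veracity.REFUTED),
    ClaimRecord("Zítra bude pršet."),  # no evidence found
]
```

Keeping the label as an enumeration rather than a free-form string makes it harder to introduce inconsistent label spellings during annotation or preprocessing.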

As a first attempt, we generate a Czech version of the large-scale FEVER dataset, which is built on top of the Wikipedia corpus. We take a hybrid approach that combines machine translation with document alignment; the approach and the tools we provide can be easily applied to other languages.

We discuss its weaknesses, propose a strategy for mitigating them, and publish the 127k resulting translations, as well as a version of the dataset that is reliably applicable to the Natural Language Inference task, CsFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, annotated using a corpus of 2.2M articles from the Czech News Agency.

We present an extended dataset annotation methodology based on the FEVER approach. As the underlying corpus is proprietary, we also publish a standalone version of the dataset for the Natural Language Inference task, which we call CTKFactsNLI. We analyze both acquired datasets for spurious cues, that is, annotation patterns that lead to model overfitting.

CTKFacts is further examined for inter-annotator agreement and thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.

Keywords

Automated fact-checking; Czech; Document retrieval; Natural language inference; FEVER