
SumeCzech: Large Czech News-Based Summarization Dataset

Publication at Faculty of Mathematics and Physics | 2018

Abstract

Document summarization is a well-studied NLP task. With the emergence of artificial neural network models, summarization performance has been improving, as have the requirements on training data.

However, only a few datasets are available for Czech, none of them particularly large. Additionally, summarization has been evaluated predominantly on English, with the commonly used ROUGE metric being English-specific.

In this paper, we try to address both issues. We present SumeCzech, a Czech news-based summarization dataset.

It contains more than a million documents, each consisting of a headline, a multi-sentence abstract, and a full text. The dataset can be downloaded using the scripts provided at http://hdl.handle.net/11234/1-2615.

We evaluate several summarization baselines on the dataset, including a strong abstractive approach based on the Transformer neural network architecture. The evaluation is performed using a language-agnostic variant of ROUGE.
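
To illustrate what a language-agnostic ROUGE computation might look like, the following is a minimal sketch of ROUGE-N over raw tokens. It assumes that "language-agnostic" means skipping English-specific preprocessing such as stemming and stopword removal; that reading, along with the function names `tokenize`, `ngrams` and `rouge_n` and the example sentences, is an assumption of this sketch and not necessarily the exact variant used in the paper.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and extract runs of Unicode word characters.
    # No stemming and no stopword list, so nothing here is English-specific.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    # Multiset of n-grams, so repeated n-grams are counted with multiplicity.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) of n-gram overlap between two texts."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical Czech example, not taken from the dataset.
    system_summary = "Praha má nový most"
    reference_summary = "Praha otevřela nový most"
    print(rouge_n(system_summary, reference_summary, n=1))
    print(rouge_n(system_summary, reference_summary, n=2))
```

Because the tokenizer relies only on Unicode word characters, the same code applies to Czech text with diacritics without any language-specific resources. The trade-off is that inflected forms of the same word are treated as different tokens.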