SYN v9: corpus of contemporary written Czech

Publication

Abstract

A corpus of contemporary written Czech comprising 4.7 billion running words (i.e. 5.7 billion tokens). It mostly covers the period 1990-2019 and features rich metadata, including detailed bibliographical information and text-type classification.

Although it contains a wide range of text types (fiction, non-fiction, newspapers), newspapers clearly predominate. The corpus is lemmatized and morphologically annotated using a combination of stochastic and rule-based methods.

The main differences from its predecessor, SYN v8, are the updated newspaper part (publication year 2019 added) and the fact that all processing (structural markup, lemmatization, morphological tagging) has been updated to correspond to the SYN2020 corpus.

Keywords

language corpus, Czech language