SYN v12: corpus of contemporary written Czech

Publication

Abstract

SYN v12 is a corpus of contemporary written Czech containing over 5 billion running words (i.e. more than 6 billion tokens). It mostly covers the period 1990-2022.

SYN v12 features rich metadata, including detailed bibliographical information and a revised text-type classification. Although it contains a wide range of text types (fiction, non-fiction, newspapers), newspapers clearly predominate.

The corpus is lemmatized and morphologically annotated using a combination of stochastic and rule-based methods. Compared with its predecessor, SYN v11, the main differences are an updated newspaper part (texts from 2022 were added, amounting to ca. 150 million running words) and improved lemmatization and morphological tagging.