SYN v6: corpus of contemporary written Czech

Publication

Abstract

A corpus of contemporary written Czech comprising 4 billion running words (4.8 billion tokens). It mostly covers the period 1990-2016 and features rich metadata, including detailed bibliographical information and a revised text-type classification.

Although it contains a wide range of text types (fiction, non-fiction, newspapers), newspapers clearly predominate. The corpus is lemmatized and morphologically annotated using a combination of stochastic and rule-based methods.

The main difference from its predecessor, SYN v5, is the updated newspaper part, to which texts published in 2016 have been added.