ONLINE: monitor corpus of online Czech

Publication

Abstract

The ONLINE corpus (internally subdivided into two sources, ONLINE_NOW and ONLINE_ARCHIVE) is a monitor corpus of the dynamic content of the Czech web, i.e. internet journalism, discussions, forums and social networks. The corpus spans the period from 2017 to the present.

It was created at the Czech National Corpus (CNC) with the help of data kindly provided by the Dataweps company. The two parts of the corpus differ in their extent and in how often they are updated: ONLINE_NOW covers the current month plus the 6 preceding months and is updated daily; ONLINE_ARCHIVE covers the period from February 2017 up to the date on which ONLINE_NOW begins and is updated every month.
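The split between the two parts can be illustrated with a short Python sketch that decides which part covers a given publication date. It is only an illustration under stated assumptions: that ONLINE_NOW begins on the first day of the month six months before the current one, and that ONLINE_ARCHIVE begins on 1 February 2017. The function names `now_window_start` and `corpus_for` are hypothetical and are not part of any CNC tool.

```python
from datetime import date

# "since Feb 2017" in the abstract; the exact day is an assumption.
ARCHIVE_START = date(2017, 2, 1)


def now_window_start(today: date, months_back: int = 6) -> date:
    """Return the first day of the month `months_back` months before today's month.

    Assumption: ONLINE_NOW starts exactly on this month boundary.
    """
    # Count months linearly so the subtraction can cross year boundaries.
    total = today.year * 12 + (today.month - 1) - months_back
    return date(total // 12, total % 12 + 1, 1)


def corpus_for(day: date, today: date) -> str:
    """Name the part of the ONLINE corpus assumed to cover `day`."""
    if day < ARCHIVE_START or day > today:
        raise ValueError(f"{day} lies outside the span of the corpus")
    if day >= now_window_start(today):
        return "ONLINE_NOW"
    return "ONLINE_ARCHIVE"


if __name__ == "__main__":
    today = date(2024, 3, 15)
    print(now_window_start(today))
    print(corpus_for(date(2023, 8, 31), today))
    print(corpus_for(date(2023, 9, 1), today))
```

Because ONLINE_ARCHIVE is updated only once a month, the actual boundary at any given moment may differ from this idealised calculation.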

The ONLINE_NOW and ONLINE_ARCHIVE corpora are disjoint, i.e. they do not overlap. Therefore, to search the whole period since 2017, the results of queries on both corpora can simply be combined; no manual corrections are needed.
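Because the two corpora do not overlap, combining results amounts to plain addition of hit counts. The following Python sketch shows one way this could be done for frequency lists exported from each corpus. The function names and all numbers are hypothetical and purely illustrative; they are not real corpus data or part of any CNC interface. One point worth noting is that relative frequencies (instances per million tokens) should be recomputed from the summed hits and summed corpus sizes rather than added together.

```python
from collections import Counter
from typing import Mapping


def merge_frequencies(archive: Mapping[str, int], now: Mapping[str, int]) -> Counter:
    """Sum per-item hit counts from the two disjoint corpora."""
    merged = Counter(archive)
    merged.update(now)  # adds counts; no deduplication is needed
    return merged


def combined_ipm(hits_archive: int, size_archive: int,
                 hits_now: int, size_now: int) -> float:
    """Instances per million tokens over both corpora taken together."""
    total_hits = hits_archive + hits_now
    total_size = size_archive + size_now
    return total_hits * 1_000_000 / total_size


if __name__ == "__main__":
    # Made-up counts for illustration only.
    archive_hits = {"lockdown": 120, "rouška": 45}
    now_hits = {"lockdown": 30, "inflace": 12}
    print(merge_frequencies(archive_hits, now_hits))
    print(combined_ipm(120, 2_000_000, 30, 1_000_000))
```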

As both corpora are identical in their structure and annotation, the following description does not distinguish between them.