CzeSL-man v1 searchable - a corpus of non-native Czech with manual error annotation in a simplified tiered scheme

Publication

Abstract

CzeSL-man v1 searchable contains transcripts of texts written by non-native speakers of Czech. It is a manually annotated subset of the texts in the automatically annotated corpus CzeSL-SGT.

The manual error annotation follows a simplified version of the two-stage annotation scheme designed for the CzeSL project. It comprises corrections of the source text (the target hypothesis), error types, morphosyntactic categories and lemmas for the corrected text, and the dependency syntactic structure and functions of the corrected text.

The morphological and syntactic annotation is assigned automatically. Each text comes with metadata about its author and about the text itself.

The corpus can be searched online using the KonText search engine of the Czech National Corpus. It can also be obtained as a dataset in the PML/feat format (see CzeSL-man v1 downloadable at http://utkl.ff.cuni.cz/learncorp/).

Apart from the format, the searchable version differs from the downloadable version in two respects: (i) it contains no texts with alternative error annotation, i.e. each text appears with the annotation of a single annotator (only one version of each doubly annotated text is included), and (ii) the two-tier annotation scheme is radically modified to fit the token-based setup of the search tool.