Skript 2015: Acquisition corpus of native speakers' Czech - transcripts of essays by students of primary and secondary schools

Publication

Abstract

The corpus contains short essays written by pupils who are native speakers of Czech, at various language levels and including speakers of the Romani ethnolect. It comprises 2582 texts and 380 thousand tokens in total. The pupils, aged 10-15, attend primary and secondary schools.

The texts can be searched and viewed using the corpus tools TEITOK or KonText and are accompanied by metadata and facsimiles. The texts were manually transcribed and manually anonymized, with markup recording the authors' own corrections.

The texts were manually and automatically annotated and manually revised. The results include: a) manual normalization on several levels: spelling and morphemics, morphosyntax, lexicon; b) automatic morphological analysis of the original text and all corrections; c) automatic identification of the type of spelling and morphemic errors.