Introducing a corpus of non-native Czech with automatic annotation

Publication at Faculty of Arts | 2017

Abstract

A learner corpus can be annotated with linguistic categories, target hypotheses, and error labels. We show that useful results can be achieved even for non-native Czech by applying methods and tools developed for the standard language.

The corpus includes more than 8,600 short essays, totaling nearly one million words. First, the texts are processed by a tagger and lemmatizer.

Then, a stochastic spelling and grammar checker is used to propose correct forms for non-words and some incorrect 'real words'. The precision of this step is above 80%.
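As a brief illustration of how a precision figure of this kind is usually computed, the sketch below divides the number of proposed corrections judged correct by the total number of corrections proposed. The function name and the counts are hypothetical, chosen only for the example; they are not taken from the corpus evaluation.

```python
def precision(correct_proposals: int, total_proposals: int) -> float:
    """Share of proposed corrections that were judged correct."""
    if total_proposals <= 0:
        raise ValueError("total_proposals must be positive")
    return correct_proposals / total_proposals


# Invented counts for illustration only (not the paper's data).
example = precision(correct_proposals=85, total_proposals=100)
print(f"precision = {example:.2f}")
```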

The corrected texts are tagged again. Original and corrected forms are then compared, and error labels are assigned according to formally specifiable criteria.
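A minimal sketch of what such a formal comparison might look like is given below. It assumes a deliberately small, hypothetical label set ("none", "case", "diacritics", "other") and compares only surface forms; the corpus's actual error taxonomy and criteria are not specified here and may well differ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'č' -> 'c'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def label_error(original: str, corrected: str) -> str:
    """Assign a hypothetical error label by comparing two word forms."""
    if original == corrected:
        return "none"
    if original.lower() == corrected.lower():
        return "case"  # forms differ only in capitalization
    if strip_diacritics(original.lower()) == strip_diacritics(corrected.lower()):
        return "diacritics"  # forms differ only in diacritic marks
    return "other"  # any remaining difference


# (original form, corrected form) pairs, invented for illustration.
pairs = [
    ("Praha", "Praha"),
    ("praha", "Praha"),
    ("cesky", "česky"),
    ("bydlim", "bydlím"),
    ("studentu", "studenta"),
]
labels = [label_error(o, c) for o, c in pairs]
print(labels)
```

Because each rule is a deterministic test on the pair of forms, labels of this kind can be assigned automatically once the corrected form is known.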

The metadata include, among other things, the author's sex, age, first language, and CEFR level of proficiency in Czech, as well as the task's time limit and topic. The corpus is available online via a search interface or for download.