Improvements to Korektor: A case study with native and non-native Czech

Publication at Faculty of Mathematics and Physics, Faculty of Arts

2015

Abstract

We present recent developments of Korektor, a statistical spell checking system. In addition to a lexicon, Korektor uses language models to find real-word errors, which are detectable only in context.

The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted from the in-house corpus WebColl.

We show two recent improvements. First, we built new language models from freely available (shuffled) versions of the Czech National Corpus and show that these perform consistently better on texts produced both by native speakers and by non-native learners of Czech.

Second, we trained new error models on a manually annotated learner corpus and show that they perform better than the standard error model (in error detection) not only for the learners' texts, but also for our standard evaluation data of native Czech. For error correction, the standard error model outperformed non-native mode

Keywords

improvements, Korektor, case study, native Czech, non-native Czech