The paper describes a learner corpus of Czech, compiled from short essays written by students of Czech as a second or foreign language. We discuss the project’s background assumptions and the process of text acquisition, transcription and mark-up, and then focus on the annotation scheme, which consists of multiple interlinked levels designed to cope with the wide range of error types present in the input.
Manual annotation is complemented by automatic error identification wherever possible and by morphosyntactic tags for all word forms in both the emended and the original text. The annotation scheme is tested on a doubly-annotated sample of approximately 10,000 words, with fair inter-annotator agreement results.