The goal of this study is to investigate whether learners’ written data in highly inflectional Czech can suggest a consistent set of clues for automatic identification of the learners’ L1 background. For our experiments, we use texts written by learners of Czech, which have been automatically and manually annotated for errors.
We define two classes of learners: speakers of Indo-European languages and speakers of non-Indo-European languages. We use an SVM classifier to perform the binary classification.
We show that non-content based features perform well on highly inflectional data. In particular, features reflecting errors in orthography are the most useful, yielding about 89% precision and the same recall.
A detailed discussion of the best performing features is provided.