Feature Extraction for Native Language Identification Using Language Modeling

Publication at Faculty of Mathematics and Physics, 2015

Abstract

This paper reports on the task of Native Language Identification (NLI). We developed a machine learning system to identify the native language of authors of English texts written by non-native English speakers.

Our system is based on the language modeling approach and employs cross-entropy scores as features for supervised learning, which leads to a significantly reduced feature space. Our method uses an SVM learner and achieves an accuracy of 82.4% with only 55 features.
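To illustrate how cross-entropy scores can serve as a compact feature vector, the following is a minimal sketch and not the system described in the paper. It assumes one character trigram model per native language with add-one smoothing; the abstract does not specify the model type, order, or smoothing, so these are illustrative choices. The class `CharNgramLM`, the function `cross_entropy_features`, the labels `L1_A` and `L1_B`, and the toy training strings are all hypothetical.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram language model with add-one smoothing.

    Assumption: model type, order, and smoothing are illustrative only.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of (context + next character)
        self.contexts = Counter()  # counts of contexts alone
        self.vocab = set()

    def _pad(self, text):
        # "^" marks the start of a text and "$" marks its end.
        return "^" * (self.n - 1) + text + "$"

    def fit(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1
        return self

    def cross_entropy(self, text):
        """Average negative log2 probability per character of `text`."""
        padded = self._pad(text)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        count = 0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            prob = (self.ngrams[context + padded[i]] + 1) / (
                self.contexts[context] + vocab_size
            )
            total -= math.log2(prob)
            count += 1
        return total / count


def cross_entropy_features(text, models):
    """One cross-entropy score per language model, in sorted label order."""
    return [models[label].cross_entropy(text) for label in sorted(models)]


# Toy training data (invented for illustration; not from TOEFL11).
training = {
    "L1_A": ["the the the the", "the cat the hat"],
    "L1_B": ["zzz qqq zzz qqq", "qqq zzz"],
}
models = {label: CharNgramLM(n=3).fit(texts) for label, texts in training.items()}

features = cross_entropy_features("the the", models)
```

Each text is reduced to one score per language model, so the feature space grows with the number of models and not with the vocabulary. In a full pipeline, vectors like `features` would be passed to a supervised learner such as an SVM.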

We compare our results with similar previous work by Tetreault et al. (2012) and analyze the use of language modeling for NLI in more detail. We experiment with the TOEFL11 corpus (Blanchard et al., 2013) and provide an exact comparison with the results achieved in the First Shared Task in NLI (Tetreault et al., 2013).