Coping with unruly language: non-standard usage in a corpus

Publication at Faculty of Arts

2018

Abstract

A language as used in real situations may differ substantially from its standard form. Before the entire range of NLP methods and tools can be applied to non-canonical variants of a language, appropriate categories for the analysis of deviant forms and constructions are needed, together with texts annotated by these categories.

A discussion of non-standard language is followed by two case studies. The first study proposes a taxonomy of morphosyntactic categories as an attempt to analyze non-standard forms in non-native learners' Czech.

The second study focuses on the role of a rule-based grammar and lexicon as tools for the detection and diagnostics of non-standard words and constructions in the process of building and using a parsebank.

Keywords

Non-standard language, Czech, learner corpus, parsebank, treebank, constraint-based grammar, valency, HPSG