Building NLP Pipeline for Russian with a Handful of Linguistic Knowledge

Publication at Faculty of Mathematics and Physics | 2016

Abstract

This work addresses the problem of building a free NLP pipeline that takes Russian plain text and produces morphologically and syntactically annotated structures in CoNLL format. The pipeline is written in Python 3.
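To make the target format concrete, the following is a minimal sketch of how one annotated sentence could be serialized. It assumes the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), which the abstract does not specify. The `Token` class and `to_conll` function are hypothetical names introduced for illustration and are not taken from the pipeline itself.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One token with the annotations available so far ("_" means missing)."""
    form: str
    lemma: str = "_"
    cpostag: str = "_"
    postag: str = "_"
    feats: str = "_"
    head: str = "_"
    deprel: str = "_"


def to_conll(sentence):
    """Serialize a list of Token objects as one CoNLL-X sentence block.

    Assumption: ten tab-separated columns per token, with PHEAD and PDEPREL
    left as "_", and a blank line terminating the sentence.
    """
    lines = []
    for idx, tok in enumerate(sentence, start=1):
        columns = [
            str(idx), tok.form, tok.lemma, tok.cpostag, tok.postag,
            tok.feats, tok.head, tok.deprel, "_", "_",
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Illustrative tags only; they are not output of the actual pipeline.
    sentence = [
        Token("Мама", "мама", "S", "S", "жен|од|им|ед"),
        Token("спит", "спать", "V", "V"),
    ]
    print(to_conll(sentence), end="")
```

Before parsing, the HEAD and DEPREL columns would stay as "_"; a parser would then fill them in.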

Segmentation is provided by our own module. Lemmatization and morphological tagging are performed by Mystem, with numerous post-processing fixes.
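The abstract does not say what the post-processing fixes are. As an illustration of the kind of step that could sit between Mystem and the CoNLL output, the sketch below splits a Mystem-style grammar string into a part of speech, lexical features, and inflectional features. It assumes the string has the shape `POS,lexical=inflectional`, with ambiguous inflection written as `(alt1|alt2)`. The function name `split_gr` and the choice to keep only the first alternative are simplifications made here, not the authors' method.

```python
def split_gr(gr):
    """Split a Mystem-style grammar string into (pos, lexical, inflectional).

    Assumed input shape: "S,жен,од=им,ед" or "S,муж,неод=(вин,ед|им,ед)".
    When several inflectional alternatives are listed, only the first one
    is kept; a real pipeline would need a disambiguation strategy here.
    """
    lexical_part, _, inflectional_part = gr.partition("=")
    lexical = [feat for feat in lexical_part.split(",") if feat]
    if not lexical:
        # Nothing usable before "=": return placeholders instead of failing.
        return "_", [], []
    pos, lexical_feats = lexical[0], lexical[1:]

    # Drop the parentheses around alternatives and keep the first alternative.
    first_alternative = inflectional_part.strip("()").split("|")[0]
    inflectional = [feat for feat in first_alternative.split(",") if feat]
    return pos, lexical_feats, inflectional


if __name__ == "__main__":
    print(split_gr("S,жен,од=им,ед"))
    print(split_gr("S,муж,неод=(вин,ед|им,ед)"))
```

The three returned parts map naturally onto the POSTAG and FEATS columns of a CoNLL line.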

Finally, syntactic annotation is obtained with MaltParser, using our own model trained on SynTagRus. For this purpose, SynTagRus was converted into CoNLL format and its morphological tagset was mapped to the Mystem/Russian National Corpus tagset.
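The parsing step can be sketched as a call to the MaltParser command line from Python. The sketch below only assumes MaltParser's standard options `-c` (model configuration), `-i` (input), `-o` (output), and `-m parse` (parsing mode). The jar path, model name, and file names are placeholders, and the helper names `malt_parse_command` and `run_malt` are hypothetical rather than part of the described pipeline.

```python
import subprocess


def malt_parse_command(jar_path, model_name, input_path, output_path):
    """Build the argument list for running MaltParser in parsing mode.

    All four arguments are placeholders supplied by the caller; the model
    is expected to be a MaltParser configuration trained beforehand.
    """
    return [
        "java", "-jar", jar_path,
        "-c", model_name,   # name of the trained model configuration
        "-i", input_path,   # CoNLL file with morphology, no syntax yet
        "-o", output_path,  # CoNLL file with HEAD and DEPREL filled in
        "-m", "parse",
    ]


def run_malt(jar_path, model_name, input_path, output_path):
    """Run MaltParser and raise CalledProcessError if it exits with an error."""
    command = malt_parse_command(jar_path, model_name, input_path, output_path)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Prints the command only; nothing is executed here.
    print(" ".join(malt_parse_command(
        "maltparser.jar", "russian_model", "tagged.conll", "parsed.conll")))
```

Keeping command construction separate from execution makes the invocation easy to inspect or log without Java or a trained model being present.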