Morphological tagging and lemmatization in the Czech National Corpus

Publication at Faculty of Mathematics and Physics

2007

Abstract

This paper presents the methods by which three large textual corpora of the Czech National Corpus (SYN2000, SYN2005 and SYN2006PUB) have been tagged and lemmatized. The process had several phases: tokenization and segmentation, morphological analysis, and disambiguation.
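The phases named above can be pictured with a minimal Python sketch. It is an illustration only, not the tools used for the corpora: the lexicon, the coarse tags and every function name are hypothetical, and the disambiguation step is a placeholder that simply keeps the first reading.

```python
import re

# Hypothetical toy lexicon: word form -> list of (lemma, tag) readings.
# The tags are illustrative placeholders, not the tagset used in the corpora.
LEXICON = {
    "stát": [("stát", "NOUN"), ("stát", "VERB")],
    "je": [("být", "VERB"), ("on", "PRON")],
    "velký": [("velký", "ADJ")],
}


def tokenize(text):
    """Split raw text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment(tokens):
    """Group tokens into sentences, closing a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence without final punctuation
        sentences.append(current)
    return sentences


def analyze(token):
    """Return every (lemma, tag) reading; unknown tokens get a fallback."""
    return LEXICON.get(token.lower(), [(token, "X")])


def disambiguate(readings):
    """Placeholder: keep the first reading (a real tagger uses context)."""
    return readings[0]


def tag_text(text):
    """Run all phases; return sentences of (form, lemma, tag) triples."""
    return [
        [(tok, *disambiguate(analyze(tok))) for tok in sentence]
        for sentence in segment(tokenize(text))
    ]
```

Calling `tag_text("Stát je velký.")` yields one sentence whose tokens each carry a single lemma and tag; the ambiguity of a form such as "stát" is exactly what the disambiguation phase has to resolve.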

Both statistical taggers and a rule-based method of disambiguation have been used in the process. SYN2000 was tagged by a single feature-based tagger, whereas SYN2005 and SYN2006PUB were tagged by two different combinations of statistical and rule-based methods.
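One way to picture a combination of rule-based and statistical disambiguation is sketched below. The abstract does not say how the methods were combined, so the ordering (rules prune readings first, a statistical choice picks among the rest) is an assumption, and the rule, the tag counts and all names are hypothetical.

```python
from collections import Counter

# Hypothetical tag frequencies standing in for a trained statistical model.
TAG_COUNTS = Counter({"NOUN": 50, "VERB": 30, "PRON": 10, "PREP": 10})


def rule_no_verb_after_preposition(prev_tag, candidates):
    """Illustrative rule: drop VERB readings directly after a preposition."""
    if prev_tag == "PREP":
        return [tag for tag in candidates if tag != "VERB"]
    return candidates


RULES = [rule_no_verb_after_preposition]


def apply_rules(prev_tag, candidates):
    """Prune readings with each rule, never removing the last reading."""
    for rule in RULES:
        pruned = rule(prev_tag, candidates)
        if pruned:  # ignore a rule that would delete every reading
            candidates = pruned
    return candidates


def statistical_choice(candidates):
    """Pick the most frequent remaining tag (a stand-in for a real model)."""
    return max(candidates, key=lambda tag: TAG_COUNTS[tag])


def disambiguate(sentence_candidates):
    """Choose one tag per token: rules first, then the statistical choice."""
    tags, prev = [], None
    for candidates in sentence_candidates:
        remaining = apply_rules(prev, candidates)
        prev = statistical_choice(remaining)
        tags.append(prev)
    return tags
```

In this sketch the rules only remove readings they can exclude with certainty, so the statistical step decides among fewer candidates; other arrangements, such as applying rules after a statistical tagger, are equally conceivable.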

In SYN2006PUB, the number of errors has been further reduced with some simple replacement algorithms. The paper closes with an evaluation of the different methods: the method used for SYN2006PUB produces approximately half as many tagging errors as the older tagging of SYN2000.

Keywords

Morphological tagging, lemmatization, Czech National Corpus