Towards the lemmatization of Old Czech texts: data, software, applications

Publication at Central Library of Charles University, Faculty of Arts | 2018

Abstract

The paper introduces a description of the declension of Old Czech common nouns (published in print in 2017), which is used, among other things, for tagging and lemmatization of transcribed digital editions of Old Czech texts. The original description consists of four parts: a comprehensive set of declension patterns, an analysis of alternations in the morphological base of word forms, an outline of formal changes, mostly related to the historical development of the language, and an extensive list of lemmas extracted from modern dictionaries of Old Czech.

Further, the paper gives an overview of the software tools used to prepare the description, both pre-existing (OpenRefine) and newly created (the "Tokens analyzer" and a tool for the automatic assignment of a declension pattern to a lemma). Finally, the paper presents applications based on the description: a web presentation of Old Czech common noun declension patterns, linked to the dictionaries of Vokabulář webový and to the "Old Czech Text Bank", and a word form generator used for tagging and lemmatization.
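
As a rough illustration of how a word form generator can support lemmatization, the following minimal Python sketch generates forms from a lemma list and a declension pattern, then inverts them into a lookup table. It is not the authors' implementation: the pattern table, the lexicon, the tag names, and the functions `generate_forms`, `build_index`, and `lemmatize` are all hypothetical, and the pattern is deliberately reduced to a few endings that need no stem alternation.

```python
from collections import defaultdict

# Hypothetical, heavily simplified declension pattern: tag -> ending.
# Illustrative only; not taken from the published description.
PATTERNS = {
    "žena": {
        "nom.sg": "a",
        "gen.sg": "y",
        "acc.sg": "u",
        "voc.sg": "o",
    },
}

# Hypothetical lexicon: lemma -> (stem, name of the assigned pattern).
LEXICON = {
    "žena": ("žen", "žena"),
    "ryba": ("ryb", "žena"),
}


def generate_forms(lemma):
    """Return a dict tag -> word form for one lemma."""
    stem, pattern = LEXICON[lemma]
    return {tag: stem + ending for tag, ending in PATTERNS[pattern].items()}


def build_index(lexicon):
    """Invert the generator: word form -> set of (lemma, tag) analyses."""
    index = defaultdict(set)
    for lemma in lexicon:
        for tag, form in generate_forms(lemma).items():
            index[form].add((lemma, tag))
    return index


INDEX = build_index(LEXICON)


def lemmatize(token):
    """Look up all (lemma, tag) candidates for a token; empty list if unknown."""
    return sorted(INDEX.get(token.lower(), set()))
```

In this sketch, `lemmatize("rybu")` returns the candidate analyses for the token, and an unknown token yields an empty list. A realistic generator would also have to model the stem alternations and historical form changes that the description covers, which this toy example leaves out.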