Morphological Tagging and Lemmatization of Spoken Corpora of Czech

Publication at Faculty of Arts | 2023

Abstract

We describe the annotation of corpora of spoken Czech according to a new annotation standard that has been in effect since the publication of the SYN2020 corpus of written Czech. The standard distinguishes lemmas from sublemmas, assigns a new attribute to verb forms, and handles multi-word tokens appropriately.

To annotate the corpora of spoken Czech according to the same standard, new training data for the annotation of spoken text were created, and experiments were performed on training a neural tagger with both written and spoken data.