Morphological Tagging and Lemmatization of Spoken Corpora of Czech

Publication at Filozofická fakulta (Faculty of Arts) | 2023

Abstract

We describe the annotation of corpora of spoken Czech according to a new annotation standard that has been in use since the publication of the SYN2020 corpus of written Czech. The standard distinguishes lemmas from sublemmas, assigns a new attribute to verb forms, and handles multi-word tokens in an appropriate way.

To annotate the corpora of spoken Czech according to the same standard, new training data for the annotation of spoken text was created, and experiments were performed in which a neural tagger was trained on both written and spoken data.