Charles Explorer logo
🇨🇿

Experimental tagging of the ORAL series corpora : Insights on using a stochastic tagger

Publikace na Filozofická fakulta |
2015

Tento text není v aktuálním jazyce dostupný. Zobrazuje se verze "en".Abstrakt

The ORAL series corpora of spontaneous spoken Czech currently contain neither lemmatization nor part of speech tagging. The main reason for this is that readily available NLP tools, designed primarily with written texts in mind, underperform when applied directly to speech transcripts, due to various morpohological and syntactic specificities of informal spoken language and the ways these are captured in transcription.

Recently, the highly optimized open-source MorphoDiTa toolchain for training and applying stochastic tagging models was released; MorphoDiTa makes it easy and fast to experiment with incremental changes in the training procedure. The article discusses modifications to the morphological dictionary and training data used by the models which are necessary in order to improve their performance on the ORAL series corpora, as well as challenges which remain to be solved.