
ORTOFON: Corpus of informal spoken Czech with multi-tier transcription

Publication

Abstract

The ORTOFON corpus continues, in its method of data collection, the ORAL series of corpora of informal spoken Czech. Together with the DIALEKT corpus, it is one of the first two spoken corpora of the Czech language that have a multi-tier transcription.

Like the corpora of the ORAL series, ORTOFON collects spontaneous spoken language used in informal situations between speakers who know each other. As in the ORAL2013 corpus, the speakers come from all over the Czech Republic, and selected sociological data are collected about them.

ORTOFON is also the first corpus to be fully balanced with regard to all the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The corpus is lemmatized and morphologically tagged in the same manner as the ORAL corpus, and the transcription is linked to the corresponding audio track.

The ORTOFON corpus allows us to explore various aspects of spoken language, such as lexis, morphology, syntax, pragmatics and dialogue construction. The corpus is not primarily intended for dialectological or phonetic research, even though its simplified phonetic transcription makes it possible to verify the existence of pronunciation or regional variants and of other pronunciation-related phenomena.

The publication of ORTOFON alongside the ORAL corpus gives users the chance to explore informal spoken Czech in the most extensive data set to date, covering a period of fifteen years (2002-2017).