DIALEKT v2: A corpus of Czech dialects

Publication

Abstract

The DIALEKT corpus documents traditional regional dialects from across the entire Czech Republic. The dialect material was obtained by transcribing sound recordings from all dialect regions of the country.

The transcripts have two transcription tiers: dialectological and orthographic. The corpus is composed of two layers.

The older dialectal layer contains recordings made between the end of the 1950s and the 1980s. The newer layer contains probes covering the period from the 1990s to the present.

Both layers contain language data that capture archaic dialectal elements which generally no longer occur in present-day usage. The second version of the dialect corpus contains more than 220,000 words and will be gradually expanded.

The corpus is supplemented by Mapka, an interactive map-based application. We expect the corpus to serve not only specialists (dialectologists, other linguists, and researchers from related fields) but also, for example, as a practical learning aid in high schools and universities.

Keywords

language corpus, Czech language, dialect