Designing a National Corpus in a Minoritised Language

Publication

Abstract

Building a corpus entails the principled collection of a dataset. Corpora designed for general purposes commonly also require that the data be annotated, with each item 'tagged' according to its part of speech (POS). In some cases a ready-made tag-set is applied to the data; in others a bespoke tag-set is required.

Corpora require a host infrastructure; building or sourcing this is the other essential element of corpus design. The creation of these components, along with a semantic tagger (to mark up the data for meaning rather than part of speech) and its own tag-set, plus the bespoke pedagogic toolkit (Y Tiwtiadur), constituted the CorCenCC construction plan.

Decisions relating to the user-driven infrastructure and the collection and processing of data, in particular, present some specific challenges in the context of minoritised languages. In this chapter we outline how these challenges were addressed in the CorCenCC project.