
Quantitative delimitation of a core lexicon

Publication at Faculty of Arts | 2014

Abstract

Hapax legomena, i.e. word or lemma types which occur in a corpus only once, are usually overlooked in language description. These types cannot be used systematically for the vast majority of analyses, as they do not provide a basis for any kind of generalization.

On the other hand, the overall number of hapaxes can be used as an indicator of the lexical periphery of the language system. This paper suggests that the ratio between the number of hapaxes and the number of all types, tracked as the corpus grows (hapax-type ratio, HTR), can be used to delimit the lexical core of a language.
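
As a minimal illustration of the ratio defined above, the following Python sketch computes HTR for a list of tokens. The function name `hapax_type_ratio`, the whitespace tokenization and the toy sentence are assumptions made for this example only; they are not taken from the paper, which may count word forms or lemmas differently.

```python
from collections import Counter


def hapax_type_ratio(tokens):
    """Return (number of hapaxes) / (number of types) for a token list.

    Hypothetical helper for illustration; a real study would first
    decide on tokenization and on word forms versus lemmas.
    """
    counts = Counter(tokens)
    if not counts:
        raise ValueError("cannot compute HTR for an empty corpus")
    # A hapax is a type whose frequency in the corpus is exactly one.
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(counts)


# Toy input (assumption): naive whitespace tokenization of one sentence.
sample = "the cat sat on the mat while the dog slept".split()
print(hapax_type_ratio(sample))  # 7 hapaxes out of 8 types -> 0.875
```

In this toy sentence only "the" occurs more than once, so seven of the eight types are hapaxes.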

Previous research (Fengxiang 2010) has shown that the HTR curve in English has the shape of a pipe or chibouque, which means that the rates at which new hapaxes and new types emerge while a corpus is being built differ before and after the corpus reaches a certain size. In a hypothetical very small corpus (a few sentences), the hapax-type ratio will be equal to one (each word type is also a hapax).

As we add texts to the corpus (up to a few million words), the hapax-type ratio decreases from its maximal value (1) to a local minimum: the number of new words, including hapaxes, keeps increasing, but the majority of added tokens are new instances of words already present in the corpus. After this turning point, extending the corpus increases the ratio, because the number of hapaxes grows at a faster pace than the number of non-hapaxes (i.e. types with frequency higher than one).
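
The procedure described above, recomputing the ratio as the corpus grows and looking for its turning point, could be sketched roughly as follows. The names `htr_curve` and `turning_point`, the `step` parameter and the toy token sequence are hypothetical; the sketch takes the lowest sampled point as a stand-in for the local minimum, which is an assumption rather than the method used in the paper, and real corpus data would probably call for smoothing.

```python
from collections import Counter


def htr_curve(tokens, step):
    """Sample the hapax-type ratio every `step` tokens of a growing corpus.

    Returns a list of (corpus_size, htr) pairs. Hypothetical helper.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    counts = Counter()
    hapaxes = 0
    curve = []
    for size, token in enumerate(tokens, start=1):
        counts[token] += 1
        if counts[token] == 1:
            hapaxes += 1   # a new type starts out as a hapax
        elif counts[token] == 2:
            hapaxes -= 1   # a second occurrence removes it from the hapaxes
        if size % step == 0:
            curve.append((size, hapaxes / len(counts)))
    return curve


def turning_point(curve):
    """Return the sampled (corpus_size, htr) pair with the lowest ratio."""
    return min(curve, key=lambda point: point[1])


# Artificial toy sequence (assumption), chosen only to show the mechanics.
toy_tokens = "a b c a b d e f g h".split()
print(turning_point(htr_curve(toy_tokens, step=1)))
```

The toy sequence is constructed by hand and says nothing about Czech or English; it only shows the ratio starting at one, dropping as types repeat, and rising again once new hapaxes dominate.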

This empirical finding, tested on corpora of Czech and English, brings us closer to an exact determination of the range of the core lexicon. From it, we can deduce the approximate size of a corpus sufficient for compiling a dictionary that covers the core lexicon.