Charles Explorer logo
🇨🇿

Unpacking lexical intertextuality: Vocabulary shared among texts

Publikace na Filozofická fakulta |
2022

Tento text není v aktuálním jazyce dostupný. Zobrazuje se verze "en".Abstrakt

This paper focuses on lexical intertextuality, namely the three following intertextual properties: 1) the number of word-types shared by two texts; 2) the number of word-types shared by all texts in a collection; 3) the number of wordtypes shared by equal-sized segments of a collection. We have observed that the relation between the number of texts and the number of shared types follows a power law; similar behavior can be seen if text borders are disregarded and the corpus is artificially divided into equal-sized segments.

The number of shared types is proportional to the size of these sequences. We developed baseline models for the number of shared types, i.e. models predicting the number of types shared by texts if all tokens were randomly shuffled and evenly spread among texts.

The comparison between the empirical data and the baseline model can be used for contrastive purposes, to compare the number of shared types in corpora of different languages.