Expressing TIME in English and Czech children's literature : a contrastive n-gram-based study of typologically distant languages

Publication at Faculty of Arts

2018

Abstract

The study addresses two issues raised by previous studies dealing with children's literature and phraseology. First, we explore how TIME is expressed in English and Czech children's fiction (cf. Hunt, 2005; Thompson & Sealey, 2007). Our approach relies on the neo-Firthian phraseological tradition, "where meaning... is said to reside in multi-word units rather than single words" (Ebeling & Ebeling, 2013: 65).

Second, the study is data-driven, based on n-gram extraction, which raises the question of "the potential contribution" of n-gram-based approaches to language comparison (Granger, 2014).

N-grams appear to be a useful starting point when comparing typologically related languages, but rather "challenging" when dealing with distant ones, e.g. predominantly analytical English and inflectional Czech (Čermáková & Chlumská, 2017; Hasselgård, 2017; Ebeling & Ebeling, 2013). The study uses comparable English and Czech corpora of children's fiction: two small ones (650,000 words each) and two large ones (2,700,000 words each, sub-corpora of the Czech National Corpus (SYN) and the British National Corpus).

For technical reasons, queries are restricted to 250,000 hits in the large corpora. The small corpora enabled detailed examination; the large ones served to verify our small-corpus findings and to supplement them with lemma and POS queries.

We extracted 2-5-grams (i.e. continuous sequences of 2-5 words excluding punctuation) from the smaller corpora. Numbers of n-grams above the threshold are consistently higher in English.

The ratios suggest a larger extent of recurrent patterning in analytical English than in Czech, characterized by high morphological variability and free word-order (cf. Czech 4-grams: se nedá nic dělat, nedá se nic dělat, nedalo se nic dělat).
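The effect of morphological and word-order variability on n-gram counts can be illustrated with a minimal Python sketch. It is not the extraction tool used in the study: the tokenizer, the function names and the decision to count each string separately are assumptions made for illustration only. The input strings are the three Czech 4-grams quoted above.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lower-case the text and keep only word-like runs,
    # so punctuation never becomes part of an n-gram.
    return re.findall(r"\w+(?:'\w+)*", text.lower())


def extract_ngrams(tokens, n_min=2, n_max=5):
    # Count every continuous sequence of n_min..n_max tokens.
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# The three variants quoted in the abstract, each treated as its own text.
czech_variants = [
    "se nedá nic dělat",
    "nedá se nic dělat",
    "nedalo se nic dělat",
]

counts = Counter()
for variant in czech_variants:
    counts.update(extract_ngrams(tokenize(variant)))

# Each variant yields a different 4-gram type, so none of them recurs.
four_grams = {gram: c for gram, c in counts.items() if len(gram) == 4}
```

In this sketch the three variants produce three distinct 4-gram types with a frequency of one each, whereas the shared 2-gram "nic dělat" is counted three times. Any frequency threshold applied to 4-grams would therefore miss a pattern that a lemmatized or shorter query would catch.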

Higher type/token ratios in Czech again point to a higher variability of Czech. Another difference is the higher representation of verbs within the most frequent n-grams in Czech (e.g. se vydal na cestu), and prepositional phrases in English (e.g. for a long time).

This is again in accord with the typological expectations, Czech generally preferring (finite) verbal expression and English being more 'nominal'. The POS observations highlighted the importance of verbs for Czech but also their high morphological variability as a potential hindrance to the use of the n-gram approach.
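The type/token ratio mentioned above can be sketched as follows, assuming it is computed over n-grams (distinct n-gram types divided by all n-gram occurrences); the abstract does not specify the exact calculation. The function name and the two toy token lists are hypothetical and built from phrases quoted in the abstract, not from the study's corpora.

```python
def ngram_type_token_ratio(tokens, n):
    # Assumption: ratio = distinct n-grams / all n-gram occurrences.
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


# Toy inputs only: a repeated English phrase and two Czech word-order variants.
english_toy = "for a long time for a long time".split()
czech_toy = "se nedá nic dělat nedá se nic dělat".split()

english_ratio = ngram_type_token_ratio(english_toy, 2)
czech_ratio = ngram_type_token_ratio(czech_toy, 2)
```

On these toy inputs the Czech list gives the higher ratio, because reordering the same words creates new 2-gram types instead of repeating existing ones. The values say nothing about the actual corpora.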

Frequent 3-5-grams in the small corpora were classified semantically. We then focused on TIME n-grams.

The expression of TIME tends to rely on n-grams comprising temporal nouns in English (e.g. end, time, moment), while in Czech adverbs and conjunctions were salient (pak, hned, když), pointing to the 'nominal' vs. 'verbal' character of English and Czech, respectively. The recurrent lexemes can then be used to identify (partly lemmatized) patterns expressing TIME in both languages (e.g. a pak SE, by the time) (Ebeling & Ebeling, 2013; Gries, 2008).

The n-gram method proved a useful starting point in corpus-driven cross-linguistic genre analysis, highlighting typological characteristics of the languages compared. Owing to the limitations of the n-gram method in Czech, a combination of approaches seems beneficial, including semantic analysis, partial lemmatization and n-gram-based patterns.

Keywords

n-grams, patterns, contrastive linguistics, children's literature