Comparing the incomparable? Rethinking n-grams for free word order languages

Publication at Faculty of Arts | 2018

Abstract

Ever since the boom of corpus-based resources, linguists have explored and employed various methods to identify and extract recurring language patterns from texts, as these patterns can reveal much about the syntagmatic nature of language and its grammatical, lexical and syntactic tendencies. One such method is the n-gram method, based on extracting frequent sequences of n consecutive words.

N-grams seem computationally and linguistically trivial, yet they have proved successful in identifying candidate sequences of words worthy of further analytical attention. As in many other areas of linguistics, the majority of studies have been carried out on English.

N-grams were first used extensively in corpus linguistics by Biber et al. (1999), who identified a number of recurrent 4- to 6-grams commonly occurring in different register types. Not much later, n-grams came into use in contrastive and translation-oriented corpus-based analysis.

Baker (2004) tried various lengths of n-grams to compare translated and non-translated language, while in cross-linguistic contrastive studies, Forchini and Murphy (2008) analyzed 4-grams in Italian and English; Cortes (2008) analyzed 4-grams in English and Spanish; Ebeling and Oksefjell Ebeling (2013), in their book-length study, analyzed n-grams in English and Norwegian; Granger (2014) and Granger and Lefer (2013) used n-gram methodology to compare English and French; and finally, Čermáková and Chlumská (2017) applied it in a study of English and Czech place expressions. As the growing number of studies shows, the n-gram approach has become rather popular; however, it raises a number of serious methodological issues when applied cross-linguistically.

One of the biggest challenges is choosing a suitable n-gram length, as pointed out by Ebeling and Oksefjell Ebeling (2013), Granger (2014) and Čermáková and Chlumská (2017). The length of an n-gram may carry over in cross-linguistic analysis, but it may also differ substantially (e.g. 4:4 as in EN: from side to side - CZ: ze strany na stranu, but also 4:1 as in EN: for the first time - CZ: poprvé).

A major point raised by Granger (2014) concerns the contrastive study of typologically different languages: in inflectional languages, an n-gram may show morphological variation (EN: I am sure - CZ: jsem si jistý/jistá), which, as Granger suggests, could possibly be resolved by lemmatization. However, the rich variety of word forms belonging to one lexeme is not the only problem in such languages; free word order seems even more challenging, as it strongly influences the very extraction of n-grams.

As Čermáková and Chlumská (2017) report, a comparison of analytical English and inflectional Czech yields approximately ten times more n-gram tokens above the same frequency threshold in English than in Czech. It is quite clear that analytical English relies to a much greater extent on patterning based on rigid sequences of words than does Czech, with its free word order and high morphological variability.

Patterning in an inflectional language is less regular, and the patterns themselves allow for more extensive variability. To provide an example, the four words of a structure such as CZ: myslel jsem si že (EN: "I thought that") can appear in several different orders due to the free word order, and need not even be immediately adjacent: jsem si myslel že (EN: "I thought that"), myslel jsem si původně že (EN: "First I thought that"), etc.

Such differences are impossible to abstract over using n-grams, because n-grams are always ordered and contiguous (although some positions within an n-gram may be underspecified, cf. skip-grams). Identifying such constructions is thus challenging, because no single variant on its own may make it above the given frequency cut-off point.

The adverse effects of free word order on n-gram frequency counts can be mitigated by breaking the rigid linear structure of the n-gram. One possible way of achieving this is the following procedure: 1. slide a window of size n over the target corpus; 2. tally the counts of all subsets of k < n elements taken from each window.

We call such subsets n-choose-k-grams, because they arise as the different k-element combinations over a given n-gram window. Obviously, the choice of n and k determines how computationally onerous the process will be.
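
To make the procedure concrete, the following minimal Python sketch tallies n-choose-k-grams as described above; the function name and the use of sorted tuples as unordered keys are illustrative choices of this sketch, not part of the original proposal.

    from collections import Counter
    from itertools import combinations

    def n_choose_k_grams(tokens, n, k):
        """Tally every k-element combination drawn from each sliding
        window of n consecutive tokens (a sketch of the procedure
        described above, not the authors' actual implementation)."""
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            for combo in combinations(window, k):
                # Sorting the combination turns it into an unordered key,
                # neutralizing word order differences within the window.
                counts[tuple(sorted(combo))] += 1
        return counts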

Compare this with n-grams: whereas with e.g. 4-grams, each 4-gram window in the corpus is tallied exactly once, with 8-choose-4-grams, each 8-gram window yields 70 4-combinations. These n-choose-k-grams have two desirable properties with respect to the task at hand: 1. being sets, k-combinations are unordered, i.e. word order differences are neutralized; 2. being subsets, k-combinations can abstract over extraneous words being inserted at any position within the original n-gram.

It can easily be seen how combinatorial explosion can make the task computationally intractable for improperly selected n and k (in particular, when n is large and k is close to n/2). In practice, n and k should be selected in accordance with the locality principle: words that work towards a common goal may not always occur in the same order and may be interspersed with other words, but they will tend to occur in one another's close neighborhood.
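
As a rough back-of-the-envelope check of this explosion (our own illustration, using only the Python standard library), the per-window work grows with the binomial coefficient C(n, k), which peaks when k is about n/2:

    from math import comb

    # The number of k-combinations per window peaks at k close to n/2.
    for n in (8, 12, 16, 20):
        print(n, comb(n, n // 2))
    # 8 70
    # 12 924
    # 16 12870
    # 20 184756

At n = 8 and k = 4, the setting used in the experiment below, the 70 combinations per window remain manageable.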

Sticking to this maxim should help yield combinations that span words which actually form a functional grouping, rather than unrelated co-occurrences within a long n-gram window, and it also restricts n and k to manageable values. Ironing out the details of the procedure is work in progress, but very early results are encouraging.

In trying to identify the aforementioned CZ: myslel jsem si že structure, a 4-gram scan of our test corpus yielded four different word order variants with frequencies 18, 17, 2 and 1 (one of which is spurious, an instance of another construction). By contrast, an 8-choose-4-gram scan yielded 177 matches within 8-gram windows, i.e. over three times as many.
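
As an illustration only, such a scan could be run with the sketch above on toy data (this mini-corpus is made up for demonstration and is not the authors' test corpus):

    # Three word-order variants of "myslel jsem si že", one with an
    # extra word inserted (hypothetical toy data).
    toy = ("myslel jsem si že to půjde . "
           "jsem si myslel že ne . "
           "původně jsem si myslel že ano .").split()

    counts = n_choose_k_grams(toy, n=8, k=4)
    key = tuple(sorted(["myslel", "jsem", "si", "že"]))
    print(counts[key])  # all variants pool into a single unordered key

Note that, at least in the sketch above, each covering window is tallied separately: a contiguous 4-word grouping falls inside up to n - k + 1 = 5 different 8-gram windows, so such tallies are not directly comparable to plain 4-gram frequencies.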

It remains to be determined whether this gain in recall comes at too great a cost in precision.