Engrammer: Introducing a new tool for the identification of phraseological patterning. Demo and case study on Czech, English and Arabic

Publication at Faculty of Arts

2019

Abstract

The present paper introduces a new open-source tool for n-gram extraction and analysis, and a case study to illustrate the potential applications of the tool in contrastive studies. N-grams present an attractive starting point for contrastive linguistic analyses of phraseology, allowing for an efficient identification of recurrent patterning in language data.

However, in recent contrastive studies, their application to typologically dissimilar languages has proven challenging (cf. Čermáková & Chlumská 2017; Hasselgård 2017). Some research has suggested that introducing (partial) lemmatisation (Čermáková & Chlumská 2017) or positional variation (Cheng et al. 2006) into the n-gram method might be an option in addressing these problems.

Engrammer is a single-purpose tool allowing for the extraction of n-grams containing a specific word or lemma, with the remaining slots left open. It was developed with the following questions in mind: 1) what lexical patterns is a given word involved in, i.e. which n-grams are disproportionately collocated with a given word form/lemma? 2) what are the contexts of these n-grams? 3) what other words/lemmata collocate with these specific n-grams and what are their contexts?
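The core idea of extracting n-grams that contain a given word in any slot can be illustrated with a minimal sketch. This is not Engrammer's actual implementation or interface: the function name `ngrams_containing`, its parameters, and the toy sentence are all assumptions made for illustration, and matching on lemmata would additionally require lemmatised input.

```python
from collections import Counter


def ngrams_containing(tokens, target, n):
    """Count all n-grams of length n that contain `target` in any slot.

    Hypothetical helper for illustration only; not Engrammer's API.
    """
    counts = Counter()
    # Slide a window of length n over the token sequence.
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i:i + n])
        # Keep the n-gram only if the target word fills one of its slots.
        if target in window:
            counts[window] += 1
    return counts


# Toy input (invented example, not corpus data).
tokens = "at the end of the day at the end of the road".split()
result = ngrams_containing(tokens, "end", 3)

for ngram, freq in result.most_common():
    print(" ".join(ngram), freq)
```

The sketch only counts raw frequencies. Deciding which n-grams are disproportionately associated with the word would require an association measure computed against the rest of the corpus, which is left out here.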

Keywords

n-grams, phraseological patterning, Czech, English, multi-word expressions