The importance of phonotactic probability in the processing of Czech

Publication at Faculty of Arts | 2022

Abstract

Phonotactic probability refers to the frequency with which phonological segments and sequences of phonological segments occur in the words of a given language (Vitevitch & Luce, 2004). The phonotactic probability of words has been shown to play an important role in language processing and language acquisition (Jusczyk, Luce & Charles-Luce, 1994; Mattys & Jusczyk, 2001; Pitt & McQueen, 1998).

For example, native speakers recognize words with high phonotactic probability faster in lexical decision tasks (Luce & Large, 2001), and adults judge pseudowords with high phonotactic probability as more word-like (Vitevitch, Luce, Charles-Luce & Kemmerer, 1997). The first widely available phonotactic probability calculator for English was developed by Vitevitch and Luce (2004). It rapidly became a reference tool in the field, and the probabilities it provides have been used as a factor in hundreds of studies with English speakers.

No comparable reference tool, however, is available for a Slavic language. In this paper we present a soon-to-be-published script implementing a phonotactic probability calculator for Czech, as well as an ongoing experiment examining the importance of phonotactic probability in the processing of Czech pseudowords.

We created a Python script that estimates phonotactic probability from the frequency of word tokens in either a synchronic reference corpus of contemporary written Czech (Křen & Cvrček et al., 2020) or a spoken corpus of informal conversations (Kopřivová & Lukeš et al., 2017). The words are automatically transcribed into IPA using the CorPy library (Lukeš, 2016).

Phonotactic probability is estimated using two measures, positional segment frequency and position-specific bisegment frequency, in line with Vitevitch & Luce (2004). The script takes as input any existing or non-existing word, or an entire list of (non-)words, and outputs the phonotactic probability estimates for each item.
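The two measures can be illustrated with a minimal sketch. It is not the script presented here: the function names, the toy lexicon and its token counts are invented for illustration, segments are represented as plain characters rather than IPA transcriptions, and raw token frequencies are used as weights (log-transformed frequencies would be a one-line change). Summing the positional probabilities over the word is one common convention and is likewise an assumption.

```python
from collections import defaultdict


def build_tables(lexicon):
    """Collect frequency-weighted positional counts.

    lexicon maps a word (tuple of segments) to its token frequency.
    """
    seg = defaultdict(float)       # (position, segment) -> weighted count
    seg_total = defaultdict(float)  # position -> weighted count of any segment
    bi = defaultdict(float)        # (position, seg1, seg2) -> weighted count
    bi_total = defaultdict(float)   # position -> weighted count of any bisegment
    for word, freq in lexicon.items():
        for i, s in enumerate(word):
            seg[(i, s)] += freq
            seg_total[i] += freq
        for i in range(len(word) - 1):
            bi[(i, word[i], word[i + 1])] += freq
            bi_total[i] += freq
    return seg, seg_total, bi, bi_total


def phonotactic_probability(word, tables):
    """Return (summed positional segment prob., summed bisegment prob.)."""
    seg, seg_total, bi, bi_total = tables
    seg_sum = 0.0
    for i, s in enumerate(word):
        if seg_total.get(i):
            seg_sum += seg.get((i, s), 0.0) / seg_total[i]
    bi_sum = 0.0
    for i in range(len(word) - 1):
        if bi_total.get(i):
            bi_sum += bi.get((i, word[i], word[i + 1]), 0.0) / bi_total[i]
    return seg_sum, bi_sum


# Toy lexicon with made-up token counts (not corpus data).
toy_lexicon = {tuple("pes"): 6, tuple("les"): 3, tuple("pan"): 1}
tables = build_tables(toy_lexicon)

for item in ("pes", "pen"):
    seg_p, bi_p = phonotactic_probability(tuple(item), tables)
    print(item, round(seg_p, 3), round(bi_p, 3))
```

In this toy setting the attested word "pes" scores higher on both measures than the unattested "pen", because the final segment and the second bisegment of "pen" are rare or absent in those positions.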

For the experiment, we created a list of 40 pseudowords that follow the phonotactic rules of Czech and assessed their phonotactic probability. The next step is to create an online experiment in the PCIbex environment (Zehr & Schwarz, 2018) with a seven-point Likert scale for every pseudoword. Thirty native speakers of Czech will be asked to rate the pseudowords for word-likeness.

The data will be analyzed using a linear regression model with phonotactic probability serving as a predictor for the word-likeness rating. We will control for neighborhood density effects.
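A model of this kind can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated rather than collected, ratings are assumed to be averaged per pseudoword, neighborhood density is entered as a simple covariate, and the helper name `fit_ols` is hypothetical. The planned analysis may differ in all of these respects.

```python
import numpy as np


def fit_ols(y, *predictors):
    """Ordinary least squares; returns [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Simulated values for 40 items (placeholders, not experimental results).
rng = np.random.default_rng(0)
n_items = 40
phon_prob = rng.uniform(0.0, 1.0, n_items)          # phonotactic probability
density = rng.poisson(3, n_items).astype(float)     # neighborhood density
mean_rating = (2.0 + 3.0 * phon_prob + 0.2 * density
               + rng.normal(0.0, 0.5, n_items))     # mean word-likeness rating

intercept, b_prob, b_density = fit_ols(mean_rating, phon_prob, density)
print(f"intercept={intercept:.2f}, phon_prob={b_prob:.2f}, density={b_density:.2f}")
```

Including neighborhood density in the same model means the coefficient for phonotactic probability reflects its contribution with density held constant.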

The experiment will show whether phonotactic probability influences the processing of pseudowords in Czech to the same extent as in English. The script can serve other researchers working on Czech as a tool for calculating phonotactic probability, which has been shown to be an important factor in many psycholinguistic experiments in English.