When identifying a speaker, forensic phoneticians encounter cases that differ in complexity and typicality. Non-contemporary (e.g., Hollien and Schwartz, 2001), disguised (e.g., Eriksson, 2010; Künzel, 2000; Masthoff, 1996), or language- or accent-mismatched recordings are very likely among the more complex and less frequent ones.
The aim of this study is to present a rather under-researched area in forensic speaker comparison (FSC), namely, the comparison of a speaker's native-language recordings with contemporary or non-contemporary recordings of the same speaker's speech in a foreign language. This setting simulates two hypothetical situations.
While committing a crime, the unknown speaker uses their L2; however, the suspect recording, obtained either immediately afterwards or considerably later, is in the speaker’s mother tongue. If it is the police who record the suspect, they are likely to ask the speaker to produce speech textually identical to that of the unknown speaker (e.g., Rogers, 1998).
Several studies deal with foreign accent in FSC; however, they report on foreign accent imitation (Torstensson et al., 2004), listeners’ ability to identify authentic foreign accents (Neuhauser and Simpson, 2007; Sullivan and Schlichting, 2000), or the degree to which a non-native language background helps witness experts identify a speaker of another language (Schiller et al., 1997). We, by contrast, focus on situations in which the suspect recording is, for instance, wiretapped, so that textual identity cannot be ensured, and in which the speaker does not attempt to disguise their voice.
Our database comprises a hundred Czech speakers (78 females and 22 males) aged 20–25. All speakers were studying English at university at the time of recording, which took place in a sound-treated recording studio at the Institute of Phonetics, Charles University.
We investigate three recording sessions. In October (henceforward referred to as O), the speakers read a phonetically rich Czech text (L1) and a piece of BBC news in English (L2).
Four months later, in January, the same students were recorded reading the same English text again (referred to as J). On average, each participant produced ca. 1 minute of speech in Czech and ca. 3 minutes of speech in English in each of the two sessions.
We performed an analysis of long-term formant distributions (LTF; Nolan and Grigoras, 2005). A secondary aim was to compare the performance of LTF data originating from vowels only with that of LTF data originating from all voiced segments: these two types of “vocalic stream” were obtained automatically using a script in the Praat Vocal Toolkit (Corretge, 2020).
We then used a Praat script (Boersma and Weenink, 2020) to extract the first three formants (using the “robust” settings for male and female speakers) and plotted histograms in 25-Hz bins. Since all the speakers are known, but were recorded under three conditions, for each speaker we plan to compare the following condition pairs as if they were the unknown and suspect recordings: O-L1+O-L2 (contemporary pair, language mismatch); O-L2+J-L2 (non-contemporary pair, language match); O-L1+J-L2 (non-contemporary pair, language mismatch).
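The binning step and the three comparison pairs can be sketched as follows. This is only a minimal illustration, not the Praat script used in the study: the function name `ltf_histogram`, the demo values, and the choice to normalise counts to relative frequencies are assumptions made here.

```python
from collections import Counter

BIN_WIDTH_HZ = 25  # bin width stated in the text


def ltf_histogram(formant_hz, bin_width=BIN_WIDTH_HZ):
    """Return {bin_lower_edge_hz: relative_frequency} for one formant track.

    Frames without a measurement (None or non-positive values) are skipped.
    Normalising to relative frequencies is an assumption of this sketch.
    """
    values = [f for f in formant_hz if f is not None and f > 0]
    if not values:
        return {}
    counts = Counter(int(f // bin_width) * bin_width for f in values)
    total = len(values)
    return {edge: counts[edge] / total for edge in sorted(counts)}


# The three condition pairs named in the text, as (unknown, suspect) labels.
CONDITION_PAIRS = [
    ("O-L1", "O-L2"),  # contemporary pair, language mismatch
    ("O-L2", "J-L2"),  # non-contemporary pair, language match
    ("O-L1", "J-L2"),  # non-contemporary pair, language mismatch
]

# Hypothetical F1 values in Hz, invented purely for illustration.
f1_demo = [512.0, 520.5, 530.0, 549.9, 610.0, None]
hist = ltf_histogram(f1_demo)
```

In practice, one such histogram would be built per speaker, per condition, and per formant (F1 to F3) before the pairs are compared.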
LTF distributions will then be compared across the different conditions. As mentioned above, we are also interested in finding out whether “vocalic streams” are more speaker-specific when extracted from vowels only or from all voiced frames.
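The text does not specify how the distributions will be compared. Purely as an illustration of what a comparison of two binned distributions can look like, the sketch below computes a histogram overlap (the sum of bin-wise minima); this measure, the function name `histogram_overlap`, and the demo values are assumptions of this example, not the method of the study.

```python
def histogram_overlap(hist_a, hist_b):
    """Sum of bin-wise minima of two normalised histograms.

    Each histogram maps a bin's lower edge (Hz) to a relative frequency.
    The result is 1.0 for identical distributions and 0.0 for disjoint ones.
    """
    bins = set(hist_a) | set(hist_b)
    return sum(min(hist_a.get(b, 0.0), hist_b.get(b, 0.0)) for b in bins)


# Hypothetical normalised F1 histograms for one speaker (invented values).
o_l1 = {500: 0.5, 525: 0.5}
o_l2 = {500: 0.25, 525: 0.5, 550: 0.25}
overlap = histogram_overlap(o_l1, o_l2)
```

Under this assumed measure, a higher same-speaker overlap relative to different-speaker overlap would indicate a more speaker-specific vocalic stream.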
In addition, we have conducted speaker comparisons in VOCALISE (Oxford Wave Research, 2019a) using the i-vector PLDA framework, with the i-vector PLDA scores calibrated using cross-validation in the Bio-Metrics software (Oxford Wave Research, 2019b). Preliminary analyses of these automatic speaker recognition (ASR) results suggest very good performance in single-mismatch conditions (i.e., only temporal or only language mismatch), with equal error rates (EERs) of around 1%, but a much higher EER of 7.5% in double-mismatch conditions.
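For readers unfamiliar with the metric, the EER is the operating point at which the rate of missed same-speaker pairs equals the rate of falsely accepted different-speaker pairs. The sketch below approximates it by sweeping a threshold over the observed scores; it is not how Bio-Metrics computes the value, and the function name and the scores are invented for illustration.

```python
def equal_error_rate(same_scores, diff_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    same_scores: scores of same-speaker comparisons (higher = more similar).
    diff_scores: scores of different-speaker comparisons.
    Returns the mean of the miss rate and the false-alarm rate at the
    threshold where the two rates are closest.
    """
    thresholds = sorted(set(same_scores) | set(diff_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a same-speaker pair scoring below the threshold.
        miss = sum(s < t for s in same_scores) / len(same_scores)
        # False alarm: a different-speaker pair scoring at or above it.
        false_alarm = sum(s >= t for s in diff_scores) / len(diff_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Hypothetical calibrated scores, invented for illustration only.
same = [2.1, 1.8, 0.4, 3.0]
diff = [-1.0, -0.5, 0.6, -2.0]
eer = equal_error_rate(same, diff)
```

With so few scores the estimate is coarse; with realistic numbers of comparisons the two error rates cross much more smoothly.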