In the field of forensic speaker comparison (FSC), the phenomenon of voice disguise has been known and examined for decades. Depending on the type of crime being committed, the offender is more or less likely to attempt to conceal their voice identity; however, estimates of how many actually do so vary a great deal (see, e.g., Künzel, 2000, or Braun, 2006, for more information). Moreover, perpetrators have come up with a range of ways of altering their voice, from changing the fundamental frequency, imitating a regional or foreign accent, and placing an object in front of or into the mouth, to modifying their voice electronically. Fortunately for the justice system, criminals tend to opt for the less sophisticated methods, which may be because it is difficult to combine verbal planning with a complex disguise strategy at the same time (Masthoff, 1996).
This study was carried out on the forensic database of Common Czech (Skarnitzl & Vaňková, 2017), which comprises 100 male speakers of a supraregional variety of Czech (aged 19-50, mean 25.6 years), and follows up on the research of Skarnitzl et al. (2019). The speakers, recorded in quiet environments with a high-quality portable device, were asked to produce three distinct speech styles: spontaneous speech (everyday topics; approx. 2-minute samples were excerpted from 25-45-minute recordings), read speech (a phonetically rich text, approx. 60 seconds long), and disguised speech (a phonetically rich text similar to the previous one, containing some identical words to facilitate comparison, approx. 60 seconds long). For the last style, the speakers were instructed to imagine reporting to their kingpin and were given time to choose their own technique for concealing their voice identity. The voice modifications they employed ranged from almost no audible changes at all to complex disguise strategies (see Růžičková & Skarnitzl, 2017).
As automatic speaker recognition (ASR) has been gaining prominence in FSC in recent years (Gold & French, 2019), the goal of this paper was to compare the three datasets described above using VOCALISE's i-vectors and x-vectors (Oxford Wave Research, 2019a; Oxford Wave Research, 2019b). Since VOCALISE allows its users to tune performance through condition adaptation and reference normalisation, we were interested in whether, and to what extent, such tweaking of the system's settings would also help with disguised voices. First, we divided each dataset into two random subsets (N=50) and compared them as such; second, we examined the subsets after adapting and normalising the system with the remaining 50 spontaneous and disguised recordings in each case (the disguised recordings always served as the so-called comparison files).
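To make the experimental design more concrete, the following Python sketch illustrates one possible way of partitioning the 100 speakers into two random, non-overlapping subsets of 50, with one subset reserved for condition adaptation and reference normalisation and the other entering the actual comparisons. The speaker IDs and file names are hypothetical, and the scoring itself is performed inside VOCALISE rather than in this script; this is only a sketch of the data split described above.

```python
import random

# Hypothetical speaker IDs for the 100 speakers in the database
speakers = [f"spk{i:03d}" for i in range(1, 101)]

# Split the speakers into two random, non-overlapping subsets of 50
random.seed(42)  # fixed seed so the split is reproducible
shuffled = random.sample(speakers, k=len(speakers))
test_subset = sorted(shuffled[:50])    # speakers entering the comparison
tuning_subset = sorted(shuffled[50:])  # speakers used for adaptation/normalisation

# In the tuned conditions, the spontaneous and disguised recordings of the
# tuning subset would be loaded into VOCALISE as adaptation and reference
# normalisation material, while the disguised recordings of the test subset
# would serve as the comparison files (file-naming scheme is invented here).
for style in ("spontaneous", "read", "disguised"):
    comparison_files = [f"{spk}_{style}.wav" for spk in test_subset]
    print(style, len(comparison_files), "files, e.g.", comparison_files[0])
```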
In the end, regardless of which tuning procedure had been applied, when the disguised set was subjected to comparison, the performance of the systems fell short of the typically reported results, with EERs ranging from 14.9% to 30%. It is worth noting that the i-vector system outperformed the newer x-vectors in several individual comparisons. As for the contribution of the various tuning procedures, our results do not suggest any greater benefit from tuning with a subset that featured disguised voices.
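For readers less familiar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The minimal sketch below shows one common way of estimating it from same-speaker and different-speaker comparison scores; the score values are invented purely for illustration and do not come from this study.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER: the point where the false acceptance rate (FAR)
    equals the false rejection rate (FRR)."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every score observed in either set
    thresholds = np.unique(np.concatenate([target, nontarget]))
    best_eer, best_diff = 1.0, np.inf
    for thr in thresholds:
        frr = np.mean(target < thr)      # same-speaker pairs rejected
        far = np.mean(nontarget >= thr)  # different-speaker pairs accepted
        diff = abs(far - frr)
        if diff < best_diff:
            best_diff, best_eer = diff, (far + frr) / 2
    return best_eer

# Invented similarity scores (higher = voices judged more similar)
same_speaker = [2.1, 1.8, 0.4, 2.7, 1.2, 0.9]
diff_speaker = [-1.5, 0.6, -0.2, -2.3, 1.0, -0.8]
print(f"EER = {equal_error_rate(same_speaker, diff_speaker):.1%}")
```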