Tweaking the settings: testing the performance of iVocalise on Czech and Persian

Publication at the Faculty of Arts | 2019

Abstract

Over roughly the last decade, the forensic sciences in general, and speaker identification in particular, have been undergoing what has come to be known as a paradigm shift. Apart from requirements such as scientific validity, reliability and testability, this seems to involve, above all, presenting the strength of evidence within the likelihood ratio (LR) framework.

However, this is difficult in the acoustic-phonetic approach to speaker identification, since the LR requires a statement not only about the similarity of the compared voice samples but also about the typicality of the observed differences. Automatic speaker recognition (ASR) seems to hold considerable potential for the future, since many of the above-mentioned requirements are built in by design, and the latest algorithms are becoming more robust to mismatched conditions.
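In its usual formulation (the notation here is illustrative and not taken from the study), the LR compares the probability of the evidence E under the same-speaker hypothesis with its probability under the different-speaker hypothesis. The numerator reflects similarity and the denominator reflects typicality:

```latex
\[
\mathrm{LR} = \frac{p(E \mid H_{\text{same}})}{p(E \mid H_{\text{diff}})}
\]
```

An LR above 1 supports the same-speaker hypothesis, and an LR below 1 supports the different-speaker hypothesis.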

Nevertheless, the forensic phonetic community has understandably been concerned about the "black-box" architecture of ASR systems. This study contributes to testing the performance of one ASR system, iVocalise, which allows the user to "open" the black box to some extent.

The latest released version, which relies on the i-vector PLDA framework, allows the user to change the settings: the system can be optimized for new conditions (e.g., a specific recording device) by adapting the PLDA, and the comparison scores may be normalized using a reference dataset that is comparable to the test data. Specifically, we test these "tuning" options of iVocalise on a database of 100 speakers of Czech and Persian (Farsi), performing a reading task and a spontaneous speech task in quiet (but not laboratory) settings.
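The abstract does not say which normalization iVocalise applies. As a rough illustration of the general idea only, the sketch below uses a generic cohort-based z-normalization, where a raw comparison score is rescaled by the mean and standard deviation of scores obtained against a reference set. The function name `znorm` and its parameters are hypothetical and are not part of iVocalise.

```python
from statistics import mean, pstdev


def znorm(raw_score, reference_scores):
    """Rescale a raw score using statistics of reference (cohort) scores.

    Assumption: a generic z-normalization, not necessarily what iVocalise does.
    """
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)
    if sigma == 0:
        raise ValueError("reference scores must not all be identical")
    return (raw_score - mu) / sigma
```

The closer the reference set is to the test conditions, the more meaningful the rescaled score becomes, which is why the reference data should be comparable to the test data.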

The read passage lasts 1 minute on average, and we extracted passages of approximately 2 minutes from the spontaneous recordings. Forty randomly selected speakers served as the comparison dataset, 30 were used for PLDA adaptation, and 30 for reference normalization. For all comparisons, we used the 2017B-Adaptable-15F-1024 session in iVocalise.
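The 40/30/30 partition described above could be reproduced along the following lines. This is a minimal sketch: the function name, the dictionary keys, the speaker ID format and the fixed seed are all assumptions made for illustration, and the study does not describe how the random selection was actually carried out.

```python
import random


def split_speakers(speaker_ids, sizes=(40, 30, 30), seed=0):
    """Randomly partition speakers into comparison, adaptation and reference sets."""
    ids = list(speaker_ids)
    if len(ids) != sum(sizes):
        raise ValueError("sizes must add up to the number of speakers")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(ids)
    n_cmp, n_plda, _ = sizes
    return {
        "comparison": ids[:n_cmp],
        "plda_adaptation": ids[n_cmp:n_cmp + n_plda],
        "reference_normalization": ids[n_cmp + n_plda:],
    }
```

Keeping the three subsets disjoint ensures that no speaker used for tuning also appears in the comparisons on which performance is measured.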

Preliminary results of analyses performed on the Czech dataset indicate that using comparable datasets (i.e., different subsets of the same corpus) for PLDA adaptation and for reference normalization has similar effects on the performance of the system: the equal error rate (EER) was halved in both cases, from 5.0% to 2.5%. Practically identical results were obtained when the two were combined, with one subset used for PLDA adaptation and the other for reference normalization.
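The EER is the operating point at which the miss rate (same-speaker pairs rejected) equals the false-alarm rate (different-speaker pairs accepted). The sketch below estimates it from two lists of scores by scanning candidate thresholds. It assumes that higher scores indicate greater support for the same-speaker hypothesis; the function name is hypothetical and this is not the procedure used by iVocalise or by the authors.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER: the point where miss and false-alarm rates meet.

    Assumption: higher scores mean "more likely the same speaker".
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a same-speaker trial scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a different-speaker trial scored at or above it.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer
```

With a small number of trials the two error rates rarely coincide exactly, so the sketch returns their average at the threshold where they are closest.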

However, reference normalization proved superior in terms of system accuracy, as expressed by log-likelihood-ratio cost (Cllr) values. Overall, the results show that "tweaking" the settings in iVocalise benefits system performance: equal error rates were halved relative to the base condition.
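Unlike the EER, Cllr penalizes LRs according to how strongly they point in the wrong direction, so it reflects the quality of the LR values themselves and not only how well the two classes are separated. The sketch below follows the standard definition of Cllr and assumes the inputs are already likelihood ratios (not raw scores); the function and parameter names are hypothetical.

```python
import math


def cllr(same_speaker_lrs, diff_speaker_lrs):
    """Log-likelihood-ratio cost; lower is better, 1.0 means uninformative."""
    # Same-speaker trials are penalized when the LR is small.
    ss_cost = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(same_speaker_lrs)
    # Different-speaker trials are penalized when the LR is large.
    ds_cost = sum(math.log2(1 + lr) for lr in diff_speaker_lrs) / len(diff_speaker_lrs)
    return 0.5 * (ss_cost + ds_cost)
```

A system that always outputs LR = 1 gets a Cllr of exactly 1, which serves as a convenient reference point when comparing conditions.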

The presentation will also include the Persian data. In follow-up research, we plan to examine the effect of various types of disguise on the performance of iVocalise.