Designing a Multi-Factor Quality Evaluation Protocol for Speaker Verification Systems

Ali Aliyev

Abstract


This work addresses the challenge of testing speaker verification models and datasets under real-world conditions. The proposed methodology automatically extracts missing metadata for each utterance (codec, language, age, gender, emotion, noise level, and duration) and systematically stresses models by simulating bandwidth limits, lossy codecs, noise, volume changes, and spectro-temporal masking. Using the Equal Error Rate (EER) as the key metric, we evaluate the protocol on the VoxCeleb1 dataset with a ResNet-34 model, which reveals accuracy drops at 8 kHz sampling, in low-SNR conditions, and in cross-age trials, while showing robustness to moderate compression and tempo shifts. The protocol offers an automated, standardized, and reproducible way to discover a speaker verification model's strengths and weaknesses and can be extended to other speech tasks.
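To make the protocol's building blocks concrete, the Python sketch below illustrates two of the stressors named above (an 8 kHz bandwidth limit and additive noise at a target SNR) together with a simple EER computation over trial scores. The library choices (torchaudio, numpy) and all function names are illustrative assumptions, not the released implementation.

import numpy as np
import torch
import torchaudio

def bandlimit_8khz(wave: torch.Tensor, sr: int) -> torch.Tensor:
    """Simulate narrowband telephony: resample to 8 kHz and back to the original rate."""
    down = torchaudio.functional.resample(wave, sr, 8000)
    return torchaudio.functional.resample(down, 8000, sr)

def add_noise_at_snr(wave: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix noise into the utterance at a chosen signal-to-noise ratio (dB).
    Assumes the noise tensor is at least as long as the utterance."""
    noise = noise[..., : wave.shape[-1]]
    sig_pow = wave.pow(2).mean()
    noise_pow = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wave + scale * noise

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-acceptance and false-rejection rates cross."""
    order = np.argsort(scores)[::-1]            # sweep thresholds from high to low score
    labels = labels[order].astype(bool)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    fa = np.cumsum(~labels) / n_nontarget       # false acceptances among non-target trials
    fr = 1.0 - np.cumsum(labels) / n_target     # false rejections among target trials
    idx = np.argmin(np.abs(fa - fr))
    return float((fa[idx] + fr[idx]) / 2)

if __name__ == "__main__":
    sr = 16000
    clean = torch.randn(1, sr * 3) * 0.1        # stand-in for a 3 s utterance
    noisy = add_noise_at_snr(bandlimit_8khz(clean, sr), torch.randn(1, sr * 3), snr_db=5.0)
    # Toy trial scores: higher means more likely same speaker.
    scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
    labels = np.array([1, 1, 0, 1, 0, 0])
    print("EER:", compute_eer(scores, labels))

In the full protocol, such perturbations would be applied to the test audio before embedding extraction and trial scoring; the linked GitHub repository in the references contains the authors' actual implementation.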


References


A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proc. Interspeech 2017, Stockholm, Sweden, Aug. 2017, pp. 2616–2620.

G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds, “The NIST speaker recognition evaluation—Overview, methodology, systems, results, perspective,” Speech Communication, vol. 31, nos. 2–3, pp. 225–254, 2000.

M. Lavechin et al., “Brouhaha: Multi-Task Training for Voice Activity Detection, Speech-to-Noise Ratio, and C50 Room Acoustics Estimation,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, Dec. 2023, pp. 1–7.

H. Zeinali, K. A. Lee, J. Alam, and L. Burget, “SdSV Challenge 2020: Large-scale evaluation of short-duration speaker verification,” in Proc. Interspeech 2020, Shanghai, China, Oct. 2020, pp. 731–735.

H. Yamamoto, K. A. Lee, K. Okabe, and T. Koshinaka, “Speaker augmentation and bandwidth extension for deep speaker embedding,” in Proc. Interspeech 2019, Graz, Austria, Sept. 2019, pp. 406–410.

ITU-T Recommendation G.711, Pulse Code Modulation (PCM) of Voice Frequencies, Int. Telecommun. Union, Geneva, Switzerland, Nov. 1988.

ITU-T Recommendation G.722.2, Wideband Coding of Speech at Around 16 kbit/s Using Adaptive Multi-Rate Wideband (AMR-WB), Int. Telecommun. Union, Geneva, Switzerland, Jul. 2003.

T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. Interspeech 2015, Dresden, Germany, Sept. 2015, pp. 3586–3589.

K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778.

J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech 2018, Hyderabad, India, Sept. 2018, pp. 1086–1090.

D. Povey, A. Ghoshal, G. Boulianne et al., “The Kaldi speech recognition toolkit,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Waikoloa, HI, USA, Dec. 2011, pp. 1–4.

D. S. Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech 2019, Graz, Austria, Sept. 2019, pp. 2613–2617.

A. Aliyev, “Passive and Active Speaker Verification Testing Protocol – Implementation,” GitHub repository, 2025. [Online]. Available: https://github.com/Spectra456/passive-active-sv-testing-protocl/tree/master




