Developing and analyzing an algorithm for separate speech recording of multiple speakers

Salma Mhammad, Sergey Molodyakov

Abstract


Recognizing the speech of several simultaneous speakers has become an important task in artificial intelligence. This paper develops and analyzes an algorithm that records the speech of multiple speakers as separate texts. The algorithm consists of four stages: removal of extraneous sounds, removal of silence, clustering of speech segments so that each segment receives a cluster label, and application of a neural network to record the text of each speaker separately. A distinctive feature of the algorithm is the use of a convolutional neural network at the stage of cleaning the voice from extraneous sounds. The Whisper model is used for text recording; because it does not handle the case of multiple speakers, the additional stages are applied before it. For each stage of the algorithm, candidate methods and metrics are analyzed. Metrics are defined both for the individual stages and for the system as a whole, and the methods that give the best results according to these metrics are identified. At the voice cleaning stage, the convolutional neural network gives the best result. For silence removal, a method based on voice activity detection is proposed. For clustering speech segments, either an LSTM model or a Siamese network can be used. A software application has been developed that recognizes and separately records the speech of speakers in Russian, English, and Arabic.
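The silence removal stage can be illustrated with a minimal sketch. The abstract does not specify the voice activity detection method used, so the example below assumes a simple short-time energy threshold; the function name `remove_silence` and the parameters `frame_ms` and `threshold_ratio` are hypothetical and are not taken from the paper.

```python
import numpy as np


def remove_silence(signal, sample_rate, frame_ms=25, threshold_ratio=0.1):
    """Drop low-energy frames from a mono signal.

    Assumption: a frame counts as speech when its mean energy exceeds
    threshold_ratio times the energy of the loudest frame. This is a
    simple energy-based detector, not the method from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal[:0]

    # Split the signal into non-overlapping frames (a trailing partial frame is dropped).
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-time energy of each frame.
    energy = np.mean(frames ** 2, axis=1)

    # Keep frames that are loud relative to the loudest frame.
    voiced = energy > threshold_ratio * energy.max()
    return frames[voiced].reshape(-1)


# Usage: one second of silence, one second of a 440 Hz tone, one second of silence.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
recording = np.concatenate([np.zeros(sample_rate), tone, np.zeros(sample_rate)])
cleaned = remove_silence(recording, sample_rate)
print(len(recording), len(cleaned))
```

A fixed relative threshold is easy to reason about but is sensitive to background noise, which is one reason to run this stage after the noise removal stage, as the algorithm described above does.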


Full Text:

PDF (Russian)

ISSN: 2307-8162