Application of sinusoidal speech modeling to the sound diarization problem

Bulat Nutfullin, Eugene Ilyushin

Abstract


Speech is a distinctive human capability and an evolutionary advantage over other species. Speaker diarization is the process of partitioning an audio stream according to which speaker is talking. Before the advent of deep learning and of the necessary computing resources, the quality of algorithms that identify a speaker by voice left much to be desired. Diarization has numerous applications: smart speakers, mobile phones, automatic speech translation systems. It should be noted, however, that existing diarization algorithms have drawbacks, such as difficulty handling overlapping speech from several speakers, or diarization results of insufficient quality for automatic application in some areas. This explains the relevance of research in this field. The sinusoidal model is an algorithm for tracking sequences of points in time-amplitude-frequency space. In prior research it has been applied to modeling echolocation, human speech, and speech synthesis. At the time of this study, no applications of the sinusoidal model to the diarization problem were found in the literature. The paper considers the diarization problem and the main quality metrics used in assessing solutions to it. The main intermediate representations of sound used in existing solutions are surveyed, and a diarization algorithm based on sinusoidal speech modeling is proposed. The advantage of the proposed algorithm is the ability to use the sinusoidal representation as a voice activity detector (VAD), which made the overall diarization pipeline more efficient.
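The sinusoidal model mentioned in the abstract can be illustrated with a minimal sketch in the spirit of the McAulay-Quatieri analysis cited below: pick spectral peaks per frame, link them across frames into tracks, and treat the presence of tracked peaks as a crude voice activity cue. This is our own illustrative code, not the authors' implementation; the function names, dB threshold, and 3% frequency-matching tolerance are assumptions chosen for the sketch.

```python
import numpy as np

def frame_peaks(frame, sr, floor_db=-30.0):
    """Pick prominent local maxima of a windowed frame's magnitude spectrum.

    Returns a list of (frequency_hz, magnitude_db) pairs. Magnitudes are
    normalized so a full-scale Hann-windowed sine peaks near 0 dB."""
    n = len(frame)
    spectrum = np.abs(np.fft.rfft(frame))
    mag_db = 20 * np.log10(spectrum / (n / 4) + 1e-12)
    peaks = []
    for k in range(1, len(mag_db) - 1):
        if mag_db[k] > mag_db[k - 1] and mag_db[k] > mag_db[k + 1] \
                and mag_db[k] > floor_db:
            peaks.append((k * sr / n, mag_db[k]))
    return peaks

def track_and_flag(signal, sr, frame_len=1024, hop=512, floor_db=-30.0):
    """Sinusoidal-model sketch: per-frame peak picking, greedy track
    continuation across frames, and a per-frame voice-activity flag based
    on whether any tracked peak is present."""
    window = np.hanning(frame_len)
    vad, tracks = [], []  # tracks: lists of (frame_index, freq_hz, mag_db)
    for i, start in enumerate(range(0, len(signal) - frame_len + 1, hop)):
        peaks = frame_peaks(signal[start:start + frame_len] * window, sr, floor_db)
        vad.append(len(peaks) > 0)
        for freq, mag in peaks:
            # Continue a track that ended in the previous frame at a nearby
            # frequency (3% tolerance); otherwise "birth" a new track.
            for track in tracks:
                last_frame, last_freq, _ = track[-1]
                if last_frame == i - 1 and abs(freq - last_freq) / last_freq < 0.03:
                    track.append((i, freq, mag))
                    break
            else:
                tracks.append([(i, freq, mag)])
    return vad, tracks
```

On a signal consisting of a pure tone followed by silence, the tone frames are flagged as active and form one long frequency track, while the silent frames produce no peaks, illustrating how the sinusoidal representation can double as a simple VAD front-end for a diarization pipeline.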


Full Text:

PDF (Russian)

References


Joint Speech Recognition and Speaker Diarization via Sequence Transduction https://ai.googleblog.com/2019/08/jointspeechrecognitionandspeaker.html

R. McAulay, T. Quatieri, 'Speech Analysis/Synthesis Based on a Sinusoidal Representation'

Patrice Guyot, Alice Eldridge, Ying Chen Eyre-Walker, Alison Johnston, Thomas Pellegrini, et al., 'Sinusoidal Modelling for Ecoacoustics', Annual conference Interspeech (INTERSPEECH 2016), Sep 2016, San Francisco, United States, pp. 2602-2606. hal-01474894

Toru Taniguchi, Mikio Tohyama, Katsuhiko Shirai, 'Detection of Speech and Music Based on Spectral Tracking'

Spectral Modeling Synthesis Tools https://www.upf.edu/web/mtg/smstools

Spectral Modeling Synthesis Tools code https://github.com/MTG/smstools

Jean Laroche, Yannis Stylianou, Eric Moulines, 'HNM: A Simple, Efficient Harmonic + Noise Model for Speech', Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

S. Davis and P. Mermelstein, 'Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences', IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, 1980.

Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 'Spoken Language Processing: A Guide to Theory, Algorithm, and System Development', Prentice Hall, 2001, ISBN: 0130226165

Stephen H. Shum, Najim Dehak, Réda Dehak, James R. Glass ’Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach’

Giovanni Soldi, Massimiliano Todisco, Hector Delgado, Christophe Beaugeant, Nicholas Evans, 'Semi-supervised On-line Speaker Diarization for Meeting Data with Incremental Maximum A-posteriori Adaptation'

Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, Chong Wang, 'Fully Supervised Speaker Diarization'

Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, 'Speaker Diarization with LSTM', in International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5239-5243.

Yanick Lukic, Carlo Vogt, Oliver Dürr, Thilo Stadelmann, 'Speaker Identification and Clustering Using Convolutional Neural Networks'

Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe, 'End-to-End Neural Speaker Diarization with Self-Attention'

Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, 'End-to-End Neural Speaker Diarization with Permutation-Free Objectives', in Proc. Interspeech, 2019.

E. Vincent, T. Virtanen, S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.

D. Wang, J. Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018) 1702–1726.

R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. L. Seltzer, H. Zen, M. Souden, Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, IEEE Signal Processing Magazine 36 (2019) 111-124.

G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, et al., Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2808-2812.

N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, The second DIHARD diarization challenge: Dataset, task, and baselines, Proceedings of the Annual Conference of the International Speech Communication Association (2019) 978–982.

M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Žmolíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, et al., BUT system for DIHARD speech diarization challenge 2018, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2798-2802.

T. Gao, J. Du, L.-R. Dai, C.-H. Lee, Densely connected progressive learning for LSTM-based speech enhancement, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2018.

H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 708-712.

P. C. Loizou, Speech enhancement: theory and practice, CRC press, 2013.

T. Drugman, Y. Stylianou, Y. Kida, M. Akamine, Voice activity detection: Merging source and filter-based information, IEEE Signal Processing Letters 23 (2015) 252-256.

T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Veselý, P. Matějka, Developing a speech activity detection system for the DARPA RATS program, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 1969-1972.

R. Sarikaya, J. H. Hansen, Robust detection of speech activity in the presence of noise, in: Proceedings of the International Conference on Spoken Language Processing, volume 4, Citeseer, 1998, pp. 1455-1458.

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: Robust DNN embeddings for speaker recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329-5333.

D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 999-1003.

K. J. Han, S. S. Narayanan, A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2007.

S. Novoselov, A. Gusev, A. Ivanov, T. Pekhovsky, A. Shulipa, A. Avdeeva, A. Gorlanov, A. Kozlov, Speaker diarization with deep speaker embeddings for DIHARD Challenge II, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1003-1007.

A. Ng, M. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 14 (2001) 849–856.

J. Luque, J. Hernando, On the use of agglomerative and spectral clustering in speaker diarization of meetings, in: Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2012, pp. 130–137.

T. J. Park, K. J. Han, M. Kumar, S. Narayanan, Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Processing Letters 27 (2019) 381-385.

D. Dimitriadis, Enhancements for Audio-only Diarization Systems, arXiv preprint arXiv:1909.00082 (2019).

J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: Proceedings of International Conference on Machine Learning, 2016, pp. 478-487.

S. Horiguchi, P. Garcia, Y. Fujita, S. Watanabe, K. Nagamatsu, End-to-end speaker diarization as post-processing, arXiv preprint arXiv:2012.10055 (2020).

S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, K. Nagamatsu, End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 269-273.

The Third DIHARD Speech Diarization Challenge https://dihardchallenge.github.io/dihard3/





ISSN: 2307-8162