Review of existing text-to-speech algorithms

Nikita Kireev, Eugene Ilyushin

Abstract


Scientists have long worked on algorithms for translating text written in natural language into speech. However, the quality of these algorithms left much to be desired until deep learning methods became practical. With the advent of the necessary computing resources and the accumulation of sufficient training data, these methods have become widely used in machine learning in general and in speech synthesis in particular. A significant improvement in the quality of text-to-speech algorithms has led to their widespread use in mobile devices, smart speakers, voice assistants, and similar products. It is worth noting, however, that current algorithms of this class do not always cope with the task correctly. For example, they cannot always place stress correctly or voice the required parts of the text with the required intonation. The study of methods and tools for speech synthesis has therefore become even more relevant.

There are many different approaches to synthesizing speech from text, such as parametric synthesis, concatenative (compilation) synthesis, domain-specific synthesis, and full rule-based speech synthesis. The purpose of this work is to review existing text-to-speech algorithms and to carry out a comparative analysis of them. The main algorithms considered are WaveNet, Deep Voice, Tacotron, Deep Voice 2, Deep Voice 3, and Tacotron 2. The comparison determined that the best at present are Deep Voice 3 and Tacotron 2, since their quality ratings are the closest to those of professionally recorded speech.







ISSN: 2307-8162