Sound augmentation methods

Yulia Romanovskaya, Eugene Ilyushin


Sound recognition is becoming more relevant and in greater demand every year. The task of recognizing voice commands makes it clear that a large amount of training data is required, since models must account for differences in timbre, speaking rate, diction, and many other factors. Collecting such data manually is extremely time-consuming and, in practice, often impossible. As a result, algorithms for automatically creating synthetic training datasets are being actively sought. Augmentation is a method of creating additional data from existing data. There are two fundamentally different approaches. The first takes existing data as input and returns the same data with changed characteristics (e.g., accelerated or louder samples). The second uses the original data only to train a model, which then generates new data independently. This article provides an overview of the full spectrum of existing augmentation methods. We try several of these methods in our experiments and draw conclusions about the applicability and usage of the presented approaches, as well as their impact on the quality of sound recognition, using a voice recognition task as an example.
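
As a rough illustration of the first approach (transforming existing samples), the following minimal sketch changes the loudness and the speed of a waveform. It is not the implementation used in the experiments: the function names `change_gain` and `change_speed`, the representation of audio as a list of floats in [-1.0, 1.0], and the synthetic test tone are all assumptions made for this example. The speed change uses naive linear-interpolation resampling, which also shifts pitch.

```python
import math


def change_gain(samples, gain_db):
    """Scale the amplitude by gain_db decibels, clipping to [-1.0, 1.0]."""
    factor = 10.0 ** (gain_db / 20.0)
    return [max(-1.0, min(1.0, s * factor)) for s in samples]


def change_speed(samples, rate):
    """Resample by linear interpolation; rate > 1 makes the clip shorter.

    Assumption: naive resampling is acceptable here, so pitch shifts
    together with speed.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, int(round(len(samples) / rate)))
    out = []
    for i in range(n_out):
        pos = i * rate                  # fractional position in the input
        lo = min(int(math.floor(pos)), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# Hypothetical input: a 0.1 s, 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1600)]

louder = change_gain(tone, 6.0)     # about twice the amplitude
faster = change_speed(tone, 1.25)   # 25% faster, hence fewer samples
```

Each original recording can yield several such variants, which is how this family of methods enlarges a training set without collecting new audio.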

