Methods of processing Uzbek language corpus texts

B. B. Elov, Sh. M. Khamroeva, R. H. Alayev, Z. Yu. Khusainova, U. S. Yodgorov


Computers are designed to process digital, numerical data, but data is not always numerical. How can data in the form of symbols, words, and text be processed? How can computers be taught to process natural language? How do Alexa, Google Home, and many other "smart" assistants understand and respond to speech? This article presents text processing methods from natural language processing (NLP), a field of artificial intelligence, as applied to the texts of the Uzbek language corpus: Bag-of-Words (BoW), CountVectorizer, TF-IDF, the co-occurrence matrix, Word2Vec (CBOW and Skip-gram), GloVe, ELMo, and BERT. The article discusses several advantages and disadvantages of these methods. Methods that produce discrete numerical representations of text are easy to understand, implement, and interpret. Weighting schemes such as TF-IDF can be used to filter out common words that carry little meaning. More complex NLP tasks can be solved using distributed text representations, which can also be used to analyze and learn from a language corpus. These methods are used in the development of modern NLP applications based on CNNs and LSTMs.
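
As a rough illustration of the discrete representations mentioned above, the following minimal sketch builds a bag-of-words count matrix and TF-IDF weights in plain Python. It assumes whitespace tokenization and the unsmoothed variant tf × log(N / df); library implementations such as scikit-learn's use different smoothing and normalization. The toy sentences and the function names (`tokenize`, `bag_of_words`, `tf_idf`) are illustrative and are not taken from the article.

```python
import math
from collections import Counter

# Hypothetical toy corpus (illustrative only, not from the article).
corpus = [
    "men kitob o'qidim",
    "men maktabga bordim",
    "u kitob oldi",
]


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and a document-term count matrix."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tf_idf(docs):
    """Return the vocabulary and TF-IDF weights, using tf * log(N / df)."""
    vocab, matrix = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = []
    for row in matrix:
        total = sum(row)  # document length in tokens
        weights.append(
            [(row[j] / total) * idf[j] if total else 0.0
             for j in range(len(vocab))]
        )
    return vocab, weights


vocab, counts = bag_of_words(corpus)
_, weights = tf_idf(corpus)

for doc, row in zip(corpus, weights):
    # Show the highest-weighted term of each document.
    best = max(range(len(vocab)), key=lambda j: row[j])
    print(f"{doc!r}: top term = {vocab[best]!r}")
```

In this sketch a word that occurs in every document receives a weight of zero, which is the sense in which TF-IDF can filter out uninformative words; words confined to a single document receive the highest weights.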

