Two methods for identifying Russian words in Yakut texts

Nicolas Cortegoso Vissio, Victor Zakharov

Abstract


The article discusses two methods for extracting foreign words from Yakut texts. Foreign words here are non-integrated lexical units: words that have not been adapted to Yakut orthography and are therefore written as in the source language. Because most foreign words in Yakut texts come from Russian, it is assumed that they have a characteristic form that distinguishes them from Yakut word forms. The first method is rule-based: it implements an algorithm that detects letter combinations foreign to the Yakut language. The second method takes a statistical approach, modeling Yakut and Russian letter combinations in order to tell them apart. The effectiveness of both methods in extracting Russian foreign words is compared against manual annotation performed by Russian speakers on 6 Yakut texts. This work continues the article “Identification of Russian borrowings in Yakut texts”, published in “Computer Linguistics and Computational Ontologies”, Number 5 (Proceedings of the XXIV Joint International Conference “Internet and Modern Society”, IMS-2022).
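To make the contrast between the two approaches concrete, here is a minimal sketch in Python. It is not the authors' implementation. The rule is simplified to single letters, on the assumption that certain Cyrillic letters appear in Yakut texts only in unadapted Russian words, whereas the article works with letter combinations. The statistical part is a plain character-bigram model with add-one smoothing. The word lists are toy samples chosen for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter

# Assumption: these letters occur in Yakut texts only in unadapted Russian words.
# The article's rules target letter combinations; single letters keep the sketch short.
RUSSIAN_ONLY_LETTERS = set("веёжзфцшщъюя")

def flag_by_rule(word):
    """Rule-based check: True if the word contains a letter assumed foreign to Yakut."""
    return any(ch in RUSSIAN_ONLY_LETTERS for ch in word.lower())

def bigrams(word):
    """Character bigrams of a word padded with start (^) and end ($) markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(words):
    """Count the character bigrams in a list of words."""
    return Counter(bg for w in words for bg in bigrams(w))

def score(word, counts, vocab_size):
    """Sum of log relative bigram frequencies, with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[bg] + 1) / (total + vocab_size))
               for bg in bigrams(word))

def classify(word, yakut_counts, russian_counts):
    """Statistical check: label the word by whichever model scores it higher."""
    # Smoothing vocabulary: the bigrams seen in either training sample.
    vocab_size = len(set(yakut_counts) | set(russian_counts))
    yakut_score = score(word, yakut_counts, vocab_size)
    russian_score = score(word, russian_counts, vocab_size)
    return "russian" if russian_score > yakut_score else "yakut"

# Toy training samples (illustrative only, far too small for real use).
YAKUT_WORDS = ["саха", "тыл", "күн", "оҕо", "үлэ", "дьиэ", "киһи", "бүгүн"]
RUSSIAN_WORDS = ["государство", "журнал", "федерация", "общество",
                 "язык", "выборы", "закон"]

yakut_model = train(YAKUT_WORDS)
russian_model = train(RUSSIAN_WORDS)

for w in ["саха", "журнал"]:
    print(w, flag_by_rule(w), classify(w, yakut_model, russian_model))
```

In this sketch the rule-based check needs no training data but depends entirely on the hand-written letter set, while the bigram model learns its preferences from word samples and so depends on how representative those samples are.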

Full Text:

PDF (Russian)

References


DOI: 10.25559/INJOIT.2307-8162.10.202211.26-34

N. Cortegoso-Vissio and V.P. Zakharov, “Vydeleniye russkikh zaimstvovaniy v yakutskikh tekstakh”, in Komp'yuternaya lingvistika i vychislitel'nyye ontologii. Vypusk 5 (Trudy XXIV Mezhdunarodnoy ob"yedinennoy konferentsii «Internet i sovremennoye obshchestvo», IMS-2022), SPb: Universitet ITMO, 2022 (in press).

P.A. Sleptsov, Russkiye leksicheskiye zaimstvovaniya v yakutskom yazyke. Izdatel'stvo Nauka, 1975.

L.N. Kharitonov, Sovremennyy yakutskiy yazyk. Chast' pervaya: fonetika i morfologiya. Nauchno-Issledovatel'skiy Institut yazyka, literatury i istorii YAASSR, Yakutsk: Gosizdat YAASSR, 1947.

N.M. Vasil'yeva, “K voprosu o pravopisanii zaimstvovannykh slov v sovremennom yakutskom yazyke”, Izvestiya Rossiyskogo gosudarstvennogo pedagogicheskogo universiteta im. A.I. Gertsena, no. 131, pp. 166-169, 2011.

W.B. Cavnar and J.M. Trenkle, “N-Gram-Based Text Categorization”, in Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 161–175.

D. Goldhahn, T. Eckart and U. Quasthoff, “Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages”, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012, pp. 759-765 [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf.

Sakhamediya. Setevogo izdaniya «sakhamedia.ru» [Online]. Available: https://sakhamedia.ru/gazeta-saha-sire/.

A language identification classifier to extract Russian imports from Yakut texts [Online]. Available: https://github.com/nicolascortegoso/sakha_loanwords.





ISSN: 2307-8162