Data Mining in the Text Corpus on Corpus and Computational Linguistics

O. A. Mitrofanova; M. A. Adamova; L. A. Bukreeva; R. V. Golubev; P. A. Gusyatskaya; A. K. Zernova; K. V. Makeev; A. A. Litvinova; V. S. Pavlikova; E. P. Plyusnina; P. Ju. Sologub; D. D. Sukhan; A. V. Troshina; A. A. Utkina

Abstract

The article addresses the challenges of building a corpus of articles on corpus and computational linguistics, which is under development at the Department of Mathematical Linguistics of St. Petersburg State University (SPbU). The corpus is compiled under the supervision of V. P. Zakharov and includes papers from the "Corpus Linguistics" conference (2002 to 2021) and the "Computational Linguistics and Computational Ontologies" seminar (2011 to 2023), as well as some other materials. During corpus development, the text presentation format was standardized and the structure of the articles was studied. For texts that originally lacked keywords or abstracts, experiments on generating them were carried out. The types of named entities found in the corpus were examined, and an algorithm for annotating them was implemented. The distribution of conference reports across the thematic blocks of the conferences was analyzed according to an expert annotation scheme. The paper presents the results of experiments on training a family of topic models (NMF, LSA, LDA, Biterm) on the corpus. Topic labels that generalize each topic were produced from three sources: search engine output, static predictive Word2Vec models trained on the corpus, and the ChatGPT large language model. The labeled topics are compared with the distribution of reports across conference thematic blocks under the expert annotation scheme.
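The final comparison step, topic model output against expert thematic blocks, can be illustrated with a small sketch. The abstract does not say which agreement measure the authors use, so the contingency table and the purity score below are assumptions chosen for illustration. The topic identifiers, block names, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def contingency(topic_labels, expert_blocks):
    """Count how many reports fall into each (topic, expert block) pair."""
    if len(topic_labels) != len(expert_blocks):
        raise ValueError("both label lists must cover the same reports")
    table = defaultdict(Counter)
    for topic, block in zip(topic_labels, expert_blocks):
        table[topic][block] += 1
    return table


def purity(table):
    """Share of reports that belong to the majority expert block of their topic."""
    total = sum(sum(row.values()) for row in table.values())
    if total == 0:
        raise ValueError("no reports to compare")
    matched = sum(max(row.values()) for row in table.values())
    return matched / total


# Hypothetical data: the dominant topic of each report and the thematic
# block an expert assigned to it. Not taken from the corpus.
topics = ["T1", "T1", "T1", "T2", "T2", "T3"]
blocks = ["corpora", "corpora", "ontologies",
          "ontologies", "ontologies", "corpora"]

table = contingency(topics, blocks)
score = purity(table)

for topic in sorted(table):
    print(topic, dict(table[topic]))
print(f"purity = {score:.3f}")
```

A purity close to 1 would suggest that each topic maps mostly onto a single expert block. Purity tends to rise as the number of topics grows, so it is best read alongside the contingency table itself.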

Full Text:

PDF (Russian)

ISSN: 2307-8162

International Journal of Open Information Technologies