Comparative analysis of applied natural language processing technologies for improving the quality of digital document classification

A.K. Markov, D.O. Semenochkin, A.G. Kravets, T.A. Yanovskiy

Abstract


Tagging digital documents is the process of assigning metadata or labels (tags) to documents in order to simplify their organization, retrieval, and management. It is essential for effective information management and for keeping documents accessible in a digital environment. The choice of tagging method and technology depends on the specific needs of the organization or user, and in practice a combination of methods is often used to achieve the best results. This paper presents a comparative analysis of natural language processing (NLP) techniques for improving the quality of digital document classification, using technical educational documents as a case study. It reviews the methods used in document preprocessing and the application of NLP, discusses ways to improve preprocessing, and reports a computational experiment that measures the resulting gains in classification recall and precision.
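To make the pipeline described above concrete, the following minimal Python sketch (illustrative only, not the code from the paper; the toy documents, class labels, and stop-word list are invented for this example) chains the stages the abstract names: preprocessing of Russian text (lowercasing, tokenization, stop-word removal, stemming), TF-IDF vectorization, a linear classifier, and a per-class precision/recall report.

# Illustrative sketch (not the paper's code): a typical document-tagging
# pipeline — preprocess Russian text, vectorize with TF-IDF, train a
# linear classifier, and report per-class precision and recall.
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

stemmer = SnowballStemmer("russian")
# Placeholder stop-word list; a real pipeline would load a full list
# such as stopwords-ru.txt from the stopwords-iso project.
stop_words = {"и", "в", "на", "по", "для"}

def preprocess(text):
    # Lowercase, keep only Cyrillic/Latin word tokens, drop stop words, stem.
    tokens = re.findall(r"[а-яёa-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

# Toy "technical educational documents" and labels, invented for illustration.
docs = [
    "Методические указания по программированию на языке Python",
    "Лабораторная работа по схемотехнике цифровых устройств",
    "Учебное пособие по объектно-ориентированному программированию",
    "Конспект лекций по проектированию цифровых схем",
]
labels = ["programming", "electronics", "programming", "electronics"]

X = TfidfVectorizer().fit_transform([preprocess(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(classification_report(labels, clf.predict(X)))  # precision/recall per class

Scoring predictions on the training documents here only keeps the sketch short; a real evaluation, like the computational experiment in the paper, would measure recall and precision on documents not seen during training.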

Full Text:

PDF (Russian)

References


What is Natural Language Processing (NLP) // Amazon. URL: https://aws.amazon.com/ru/what-is/nlp/ (date of access: 10.2023).

Lane H., Hapke H., Howard C. Natural Language Processing in Action. SPb.: Piter, 2020. pp. 68-140.

Ganegedara T. Natural Language Processing with TensorFlow / trans. V. S. Yatsenkov. Moscow: DMK Press, 2020. pp. 74-102.

Hickman L. et al. Text preprocessing for text mining in organizational research: Review and recommendations // Organizational Research Methods. 2022. Vol. 25, no. 1, pp. 114-146.

Kadhim A. I. An evaluation of preprocessing techniques for text classification // International Journal of Computer Science and Information Security (IJCSIS). 2018. Vol. 16, no. 6, pp. 22-32.

Denny M. J., Spirling A. Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it // Political Analysis. 2018. Vol. 26, no. 2, pp. 168-189.

Tabassum A., Patil R. R. A survey on text pre-processing & feature extraction techniques in natural language processing // International Research Journal of Engineering and Technology (IRJET). 2020. Vol. 7, no. 6, pp. 4864-4867.

Etaiwi W., Naymat G. The impact of applying different preprocessing steps on review spam detection // Procedia Computer Science. 2017. Vol. 113, pp. 273-279.

Kashina M., Lenivtceva I. D., Kopanitsa G. D. Preprocessing of unstructured medical data: the impact of each preprocessing stage on classification // Procedia Computer Science. 2020. Vol. 178, pp. 284-290.

Pak M. Y., Gunal S. The impact of text representation and preprocessing on author identification // Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering. 2017. Vol. 18, no. 1, pp. 218-224.

Ideal preprocessing pipelines for NLP models // Temofeev.ru. URL: https://temofeev.ru/info/articles/idealnyy-preprotsessingovyy-payplayn-dlya-nlp-modeley/ (date of access: 23.10.2023).

A Gentle Introduction to the Bag-of-Words Model // Machine Learning Mastery. URL: https://machinelearningmastery.com/gentle-introduction-bag-words-model/ (date of access: 28.10.2023).

Gensim Word2Vec Tutorial // Kaggle. URL: https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial (date of access: 28.10.2023).

Pennington J., Socher R., Manning C. D. GloVe: Global Vectors for Word Representation. URL: https://www-nlp.stanford.edu/projects/glove/ (date of access: 02.11.2023).

Grapheme // Wikipedia. URL: https://ru.wikipedia.org/wiki/Графема (date of access: 03.11.2023).

Gunjal S. Tokenization in NLP // Kaggle. URL: https://www.kaggle.com/code/satishgunjal/tokenization-in-nlp (date of access: 04.11.2023).

How to Prepare Text Data for Deep Learning with Keras // Machine Learning Mastery. URL: https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/ (date of access: 04.11.2023).

McMahan B., Rao D. Getting to know PyTorch. SPb.: Piter, 2020. pp. 88-101.

Porter stemmer // Wikipedia. URL: https://ru.wikipedia.org/wiki/Стеммер_Портера (date of access: 04.11.2023).

Porter Stemmer // Snowball: A language for stemming algorithms. URL: https://snowballstem.org/algorithms/porter/stemmer.html (date of access: 04.11.2023).

Stemming vs Lemmatization // Baeldung. URL: https://www.baeldung.com/cs/stemming-vs-lemmatization (date of access: 08.11.2023).

Stopwords-iso repository // GitHub. URL: https://github.com/stopwords-iso (date of access: 08.11.2023).

List of stop words for the Russian language // Stopwords-iso. URL: https://github.com/stopwords-iso/stopwords-ru/blob/master/stopwords-ru.txt (date of access: 08.11.2023).

NLP Preprocessing // Kaggle. URL: https://www.kaggle.com/code/abdallahwagih/nlp-preprocessing (date of access: 08.11.2023).

McMahan B., Rao D. Deep learning in natural language processing. SPb.: Piter, 2020. pp. 46-92.

Open access to scientific publications // Nauka i Zhizn. URL: https://www.nkj.ru/open/36052/ (date of access: 08.11.2023).

Lample G., Ballesteros M., Subramanian S., Kawakami K., Dyer C. Neural architectures for named entity recognition // arXiv preprint arXiv:1603.01360. 2016.

NLP Embeddings // blog.bayrell.org. URL: https://blog.bayrell.org/ru/iskusstvennyj-intellekt/495-nlp-embeddingi.html (date of access: 08.11.2023).

Soyalp G. et al. Improving Text Classification with Transformer // 2021 6th International Conference on Computer Science and Engineering (UBMK). IEEE, 2021. pp. 707-712.

Wang C., Banko M. Practical transformer-based multilingual text classification // Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021. pp. 121-129.

Shaheen Z., Wohlgenannt G., Filtz E. Large scale legal text classification using transformer models // arXiv preprint arXiv:2010.12871. 2020.

Tezgider M., Yildiz B., Aydin G. Text classification using improved bidirectional transformer // Concurrency and Computation: Practice and Experience. 2022. Vol. 34, no. 9, p. e6486.

Vaswani A. et al. Attention is all you need // Advances in Neural Information Processing Systems. 2017. Vol. 30.

Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. 2018.

Sun C. et al. How to fine-tune BERT for text classification? // Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings 18. Springer International Publishing, 2019. pp. 194-206.

Beltagy I., Peters M. E., Cohan A. Longformer: The long-document transformer // arXiv preprint arXiv:2004.05150. 2020.

Longformer model for the Russian language // Hugging Face. URL: https://huggingface.co/kazzand/ru-longformer-base-4096 (date of access: 08.11.2023).

Hossin M., Sulaiman M. N. A review on evaluation metrics for data classification evaluations // International Journal of Data Mining & Knowledge Management Process. 2015. Vol. 5, no. 2, p. 1.

Li Y. et al. A comparative study of pretrained language models for long clinical text // Journal of the American Medical Informatics Association. 2023. Vol. 30, no. 2.

Wei F. et al. An Empirical Comparison of DistilBERT, Longformer and Logistic Regression for Predictive Coding // Proceedings of the 2022 IEEE International Conference on Big Data (Big Data 2022). 2022.

Mamakas D. et al. Processing Long Legal Documents with Pre-trained Transformers: Modding LegalBERT and Longformer // Proceedings of the Natural Legal Language Processing Workshop (NLLP 2022). 2022.

Khandelwal A. Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity // ACM International Conference Proceeding Series. 2020.





ISSN: 2307-8162