Identifying data labeling errors using classification models for small datasets

Fedor Krasnov


Labeling data for classification tasks is a complex process that inevitably introduces errors. Manual or automatic labeling of texts for classification contains systematic errors, which can be identified with systematic approaches based on statistics and machine learning models. The study focuses on small datasets, because the consequences of labeling errors are most noticeable there. With small datasets, however, the lack of samples leads to sparse distributions, which prevents the training of high-complexity models. The author exploits the overfitting effect to reduce the limitations imposed by insufficient data. Several experiments were conducted as part of the study. An experiment on a large public dataset showed that, when classifying short texts, an overfitted model is able to detect data labeling errors. An experiment on forming facets from user descriptions of goods established a relationship between class definition errors and the work of assessors who labeled text data according to different rules. Owing to its overfitting, the classification model can identify significant errors that dramatically affect the engineering application of machine learning in high-load Internet systems. As a result of the research, the author provides methods and criteria for bringing the model to a state of "productive overfitting". The best result on the weighted F1-score metric (98%) was achieved by the EmbeddingBag-based classification model.
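One possible reading of the idea in the abstract can be sketched in a few lines of Python. This is an illustrative assumption, not the author's implementation: the toy samples, the function names, and the training settings are all hypothetical, and a pure-Python model that averages per-token class scores stands in for a real EmbeddingBag-based classifier. The model is deliberately trained and evaluated on the same small dataset with no regularization, so it memorizes every consistently labeled text; the labels it still disagrees with are the ones that contradict the rest of the data.

```python
import math
from collections import defaultdict

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bag_logits(weights, tokens, n_classes):
    # Average the per-token class scores (loosely analogous to a "mean" bag of embeddings).
    out = [0.0] * n_classes
    for tok in tokens:
        for c in range(n_classes):
            out[c] += weights[tok][c] / len(tokens)
    return out

def fit_until_memorized(samples, n_classes, epochs=500, lr=1.0):
    # Full-batch gradient descent on cross-entropy, with no regularization
    # and no held-out split: the goal here is to overfit on purpose.
    weights = defaultdict(lambda: [0.0] * n_classes)
    for _ in range(epochs):
        grads = defaultdict(lambda: [0.0] * n_classes)
        for tokens, label in samples:
            probs = softmax(bag_logits(weights, tokens, n_classes))
            for tok in tokens:
                for c in range(n_classes):
                    target = 1.0 if c == label else 0.0
                    grads[tok][c] += (probs[c] - target) / len(tokens)
        for tok, g in grads.items():
            for c in range(n_classes):
                weights[tok][c] -= lr * g[c]
    return weights

def flag_disagreements(samples, weights, n_classes):
    # A training sample the overfitted model still contradicts is a
    # candidate labeling error: (index, predicted class, confidence).
    flagged = []
    for i, (tokens, label) in enumerate(samples):
        probs = softmax(bag_logits(weights, tokens, n_classes))
        predicted = max(range(n_classes), key=probs.__getitem__)
        if predicted != label:
            flagged.append((i, predicted, round(probs[predicted], 2)))
    return flagged

# Hypothetical toy data: 0 = clothes, 1 = kitchen.
raw = [
    ("red cotton shirt", 0),
    ("red cotton shirt", 0),
    ("blue wool sweater", 0),
    ("steel kitchen knife", 1),
    ("ceramic dinner plate", 1),
    ("steel frying pan", 1),
    ("red cotton shirt", 1),  # contradicts the two identical texts above
]
samples = [(text.split(), label) for text, label in raw]

weights = fit_until_memorized(samples, n_classes=2)
flagged = flag_disagreements(samples, weights, n_classes=2)
for index, predicted, confidence in flagged:
    print(index, raw[index][0], "label:", raw[index][1],
          "model:", predicted, "confidence:", confidence)
```

In this sketch the only flagged sample is the contradictory one, and the model's confidence on it stays visibly lower than on the memorized samples. On real data the flagged items would be candidates for review by an assessor rather than confirmed errors.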







ISSN: 2307-8162