Overview of data cleaning methods for machine learning

Artem Makarov, Dmitry Namiot

Abstract


In the last few years, machine learning models and neural networks have been actively introduced into everyday life. The main criteria in training them are accuracy and efficiency, and one of the main steps toward improving both is preparing the data set. Before any method is applied, the data must be cleaned, since otherwise the results may be inaccurate or incorrect. Although even novice researchers prepare data sets, the cleaning is often performed incorrectly or inefficiently, with many errors. This article provides an overview of the main data cleaning methods, considers their advantages and disadvantages, and gives general recommendations for improving the data cleaning process. Special attention is paid to the importance of being able to use a variety of data cleaning tools. The article considers the main libraries (Pandas, scikit-learn, and NumPy), specialized programs such as OpenRefine, various features of the R language, and methods for normalization, standardization, and processing of text data. Correct use of data cleaning tools significantly affects the quality of analysis and modeling, contributing to more accurate and reliable results.
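As a rough illustration of the kind of workflow the abstract refers to, the following minimal sketch combines Pandas, NumPy, and scikit-learn to remove duplicates, fill a missing value, and then standardize and normalize the result. The toy data, the column names, and the choice of median imputation are assumptions made for this example only and are not taken from the article.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: one duplicated row and one missing value.
raw = pd.DataFrame({
    "height_cm": [170.0, 170.0, 182.0, np.nan, 158.0],
    "weight_kg": [65.0, 65.0, 90.0, 72.0, 51.0],
})

# 1. Remove exact duplicate rows.
deduped = raw.drop_duplicates().reset_index(drop=True)

# 2. Fill missing values with the column median
#    (an assumed strategy; mean or model-based imputation are alternatives).
imputed = SimpleImputer(strategy="median").fit_transform(deduped)

# 3a. Standardization: rescale each column to zero mean and unit variance.
standardized = StandardScaler().fit_transform(imputed)

# 3b. Normalization: rescale each column to the [0, 1] range.
normalized = MinMaxScaler().fit_transform(imputed)

print(pd.DataFrame(standardized, columns=raw.columns).round(2))
print(pd.DataFrame(normalized, columns=raw.columns).round(2))
```

In a real project the imputer and scalers would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the test data into the preprocessing step.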





ISSN: 2307-8162