Experimental evaluation of the temporal efficiency of big data processing for specified storage formats

V.A. Belov, E.V. Nikulchev


Choosing a data storage format is one of the most important tasks in designing a modern big data processing platform. The choice rests on performance criteria that depend on the class of objects being stored and on the requirements placed on the system. One of the most important criteria is the time spent on the various big data processing operations. This paper studies the five most popular formats for storing big data (Avro, CSV, JSON, ORC, and Parquet), proposes an experimental test bench for assessing time efficiency, and presents a comparative analysis of the experimentally estimated characteristics of these formats. The experiment covers the basic data processing operations, executed with the Apache Spark framework. The format selection algorithm is based on the hierarchy analysis method (the analytic hierarchy process). The result is a methodology for choosing a format from a set of alternatives, which combines experimental estimates of the parameters with hierarchy analysis and addresses the task of selecting a storage format whose basic operations are time-efficient for big data in the Apache Hadoop system using Apache Spark.
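The abstract does not give the details of the selection algorithm, so the following is only a minimal sketch of how a hierarchy analysis step could turn measured operation times into a ranking of formats. Everything specific in it is an assumption made for illustration: the three criteria (read, filter, write), the timings, the criteria comparison matrix, and the helper names `principal_eigenvector` and `pairwise_from_times` are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of an analytic hierarchy process (AHP) ranking.
# All numbers below are made-up placeholders, NOT measurements from the paper.

def principal_eigenvector(matrix, iterations=100):
    """Approximate the principal eigenvector by power iteration, normalized to sum to 1."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        weights = [x / total for x in nxt]
    return weights

def pairwise_from_times(times):
    """Build a pairwise comparison matrix where a[i][j] = t_j / t_i,
    so a format is preferred in proportion to how much faster it is."""
    return [[t_j / t_i for t_j in times] for t_i in times]

FORMATS = ["avro", "csv", "json", "orc", "parquet"]

# Placeholder timings in seconds for each assumed criterion (operation).
TIMES = {
    "read":   [4.0, 8.0, 8.0, 2.0, 2.0],
    "filter": [4.0, 8.0, 8.0, 2.0, 1.0],
    "write":  [2.0, 1.0, 2.0, 4.0, 4.0],
}

# Assumed pairwise importance of the criteria (read vs. filter vs. write).
CRITERIA = list(TIMES)
CRITERIA_MATRIX = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

criteria_weights = principal_eigenvector(CRITERIA_MATRIX)

# Global score of a format = sum over criteria of (criterion weight * local priority).
scores = {name: 0.0 for name in FORMATS}
for weight, criterion in zip(criteria_weights, CRITERIA):
    local = principal_eigenvector(pairwise_from_times(TIMES[criterion]))
    for name, priority in zip(FORMATS, local):
        scores[name] += weight * priority

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {score:.3f}")
```

Deriving the comparison matrices from ratios of timings keeps them perfectly consistent, which is a simplification; matrices filled in by expert judgment would normally also need a consistency check.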






ISSN: 2307-8162