Big Data. Roughly speaking, this term covers a group of technologies used for collecting, storing, managing and processing data when traditional methods fail. Rapid technological development has led to an explosion in the amount of data we generate in our daily activities through smart devices, social networks, etc. The current estimate is that around 8 zettabytes of data exist in digital format, and that every two days we globally generate as much data as was generated in total up to the year 2000. It is therefore no surprise that the Big Data concept is still unclear and demands continuous evolution and redefinition.

The main task of Big Data is to answer the question of what should be done with all that data: how should it be stored and processed in the fastest and most efficient manner so as to extract the most significant conclusions? To keep this post concise, we will deal with data processing, focusing on the most important experiences we have had in this area.

Together with Machine Learning and AI, Big Data is one of the most popular terms in the IT world. Applications of these areas can be found in almost every industry, and the challenges are numerous. There is strong competition on the market among solutions responding to Big Data challenges, and what was considered a standard yesterday often becomes outdated overnight. MapReduce is frequently mentioned in this context in IT circles: it sits at the very heart of Hadoop and is responsible for processing data stored within HDFS. As Big Data technologies developed, new processing solutions arose, and Apache Spark emerged as one of the main competitors in this group. This post was written as an attempt to answer the title question, relying on the Big Data environments we use in our use cases.
Before comparing the two, let’s first get to know the fundamental concepts regarding both technologies, which are crucial for any further considerations.
MapReduce is a programming model which enables parallel and distributed processing of data stored in HDFS. As suggested by its name, this programming model consists of two logical units, Map and Reduce, whereby the Map phase is executed first, and its output is the input of the Reduce phase.
The Map phase is the first processing phase, and it is advised to carry out the most complex part of the processing precisely there. The Map phase transforms the input data set into (key, value) pairs based on the user's code. This is followed by an intermediate shuffle-and-sort step, in which all values are grouped by key, and this output is passed on to the Reduce phase. Since the Reduce phase receives all keys and their associated values as input, it is recommended to carry out substantially simpler processing in this phase, such as sums and aggregations.
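The flow described above can be sketched on a single machine in plain Python. This is a toy illustration of the model with made-up data, not the Hadoop API: the list comprehension plays the role of the Map phase, the grouping loop stands in for shuffle and sort, and the final aggregation is the Reduce phase.

```python
from collections import defaultdict

# Toy input: each string stands for one line of a file stored in HDFS.
lines = ["spark hadoop spark", "hadoop mapreduce", "spark"]

# Map phase: emit (key, value) pairs -- here (word, 1) for a word count.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & sort: group all emitted values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce phase: a deliberately simple per-key aggregation.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'hadoop': 2, 'mapreduce': 1, 'spark': 3}
```

In a real Hadoop job, the mapper and reducer run on different datanodes and the shuffle moves data across the network; only the shape of the computation is the same.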
The main advantage of MapReduce is parallel data processing: the entire processing activity is divided among the datanodes of the Big Data environment, which simultaneously process the portion of the data assigned to them. This significantly reduces processing time, prevents overload of individual machines in the environment and reduces costs. On the other hand, since MapReduce relies on data stored in HDFS, it must read data from and write data to disk, which significantly increases processing time and effectively rules out real-time data processing. Also, given the small number of supported libraries and functionalities, MapReduce is practically predestined for not-so-complex operations, e.g. sums and aggregations.
Apache Spark is an open-source framework for processing large amounts of data. In accordance with its name and functionalities, in Big Data circles it is also known as lightning-fast cluster computing, and it is the most popular and most active Apache data processing project. It is written in the Scala programming language, while providing APIs for Python, Scala, Java, R and SQL. Spark executes work in a distributed manner owing to its core process, which divides the application into several tasks and distributes them among executor processes, whose resources can easily be scaled to the application's needs. The term RDD (Resilient Distributed Dataset) lies at the core of data processing with Spark: it represents an immutable collection of objects on which various operations can be applied in parallel. These operations can be distributed across the cluster and executed as parallel batch processes, which leads to fast and efficient parallel processing. Also, from the very start, Spark was optimized for in-memory processing, which justifies its nickname. It is an extremely flexible and simple tool which enables stream processing, machine learning tools, SQL queries, graph algorithm development, as well as the MapReduce programming model.
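To give a feel for the RDD idea without assuming a Spark installation, here is a tiny single-machine stand-in written in plain Python. The class name `MiniRDD` and the data are invented for illustration; it only mimics the shape of a few RDD operators (`map`, `filter`, `reduceByKey`, `collect`) and none of the partitioning or parallelism that real RDDs provide.

```python
from collections import defaultdict
from functools import reduce

class MiniRDD:
    """Toy, single-machine stand-in for a Spark RDD: an immutable
    collection plus a few high-level operators. A real RDD partitions
    the data and runs these operations in parallel across executors."""
    def __init__(self, data):
        self._data = tuple(data)  # immutable snapshot of the collection

    def map(self, fn):
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    def reduceByKey(self, fn):
        groups = defaultdict(list)
        for key, value in self._data:
            groups[key].append(value)
        return MiniRDD((k, reduce(fn, vs)) for k, vs in groups.items())

    def collect(self):
        return list(self._data)

events = MiniRDD([("a", 4), ("b", 1), ("a", 6), ("b", 3)])
totals = (events
          .filter(lambda kv: kv[1] > 1)
          .reduceByKey(lambda x, y: x + y)
          .collect())
print(totals)  # [('a', 10), ('b', 3)]
```

Each operator returns a new collection rather than mutating the old one, which is the immutability property the RDD model relies on.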
MapReduce vs Spark
In order to find the answer to the title question, I decided to carry out simple but effective testing. A Big Data environment of 13 machines was used for this purpose, 11 of which act as datanodes, each with 64GB of RAM and a 16-core processor, with total storage of 30TB. MapReduce jobs had already been implemented in this environment, mostly performing basic statistical operations such as avg, min, max, etc., and data aggregation. Therefore, the first step of testing was to reimplement all existing MapReduce jobs as Spark applications. The MapReduce jobs were written in Python, which stood out as a logical choice at the time of their implementation, so for a fair comparison it was decided to write the new applications in pyspark, the Python API within Apache Spark.
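The statistics the existing jobs compute can be sketched in plain Python. This is a hypothetical stand-in with made-up (key, value) records, not the actual production jobs; it only shows the kind of per-key avg/min/max aggregation being ported.

```python
from collections import defaultdict

# Hypothetical sample of (key, value) records like those the jobs aggregate.
records = [("2023-01", 4.0), ("2023-01", 10.0), ("2023-02", 7.0)]

# Group values by key, then compute the basic statistics per group.
by_key = defaultdict(list)
for key, value in records:
    by_key[key].append(value)

stats = {key: {"min": min(vs), "max": max(vs), "avg": sum(vs) / len(vs)}
         for key, vs in by_key.items()}
print(stats["2023-01"])  # {'min': 4.0, 'max': 10.0, 'avg': 7.0}
```

Both MapReduce and Spark express exactly this pattern, grouping by key and aggregating, just distributed over the datanodes instead of a single process.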
The first thing one could notice during testing was Spark's greater flexibility in the choice of programming language. Although other options exist in theory, in practice Hadoop MapReduce processing most often comes down to Java or Python; Spark is much more flexible here, offering a choice between Python, Java, Scala, R and SQL depending on the developers' needs and skills. Also, during implementation it was noticed that Spark code is much more concise and efficient than its Hadoop MapReduce counterpart. The reason is that Spark's RDDs enable high-level operators, while in MapReduce every demanding operation needs to be coded by hand, further complicating the work. How much this matters is illustrated by one instance during testing in which 50 lines of MapReduce code were reduced to 8 lines of Spark code. Also, for scheduling jobs, Hadoop MapReduce relies on the additional component Oozie, while Spark acts as its own scheduler thanks to its in-memory processing.
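The verbosity gap can be sketched in plain Python on a hypothetical job (invented data and function names, shown only for shape): the MapReduce-style version spells out the map, shuffle and reduce stages by hand, while the Spark-style version collapses the same per-key aggregation into one chained expression over high-level operators.

```python
from itertools import groupby
from operator import itemgetter

sales = [("eu", 5), ("us", 3), ("eu", 2), ("us", 4)]

# MapReduce style: each stage written out explicitly.
def mapper(records):
    for region, amount in records:       # emit (key, value) pairs
        yield region, amount

def shuffle(pairs):
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reducer(groups):
    return {region: sum(v for _, v in values) for region, values in groups}

totals_mr = reducer(shuffle(mapper(sales)))

# Spark style: the same job as a single chained expression.
totals_spark = {region: sum(v for _, v in group)
                for region, group in groupby(sorted(sales), key=itemgetter(0))}

assert totals_mr == totals_spark  # {'eu': 7, 'us': 7}
```

The effect is modest on a toy job; on the real jobs mentioned above, where each stage carries much more logic, the difference grew to 50 lines versus 8.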
After these jobs were implemented in both technologies, we moved on to the next testing stage: comparing data processing performance. The first comparisons were made on small files of up to 5GB. In these instances, Spark was up to 15 times faster than MapReduce. Spark did require more machine resources from the Big Data cluster than MapReduce, but this difference had no decisive effect on the operation of the environment. The testing then continued with larger files of 10GB, 15GB, etc. The results were similar to the previous ones, whereby, as expected, Spark demanded more and more resources from the Big Data environment; the environment, however, still operated as usual under a somewhat larger load. The next step was to use substantially larger files, starting from 100GB, and here a shift occurred. As expected, Spark demanded ever more resources, to the point where these demands started to affect the operation of the environment, even causing failures of certain machines due to overload. Because of this overload, Spark could not even complete some jobs, while MapReduce successfully completed all test iterations, although, of course, at the cost of processing time. During the final iterations with large files, the cluster did not operate optimally under MapReduce either, but these problems were significantly smaller, did not lead to failures of any datanodes, and all jobs were completed.
The general recommendation is to favor Hadoop MapReduce for linear processing of large amounts of data, primarily due to its energy efficiency. Apache Spark is the better choice when near real-time processing is required, or for machine learning, advanced analytics, etc. If cost is not a limiting factor in the development of the Big Data environment, Apache Spark can fully replace the Hadoop MapReduce processing model and bring many additional functionalities.
There is no single answer to the question of which processing framework should be chosen. Each use case has its own needs and limitations, and the most suitable solution must be chosen accordingly. Likewise, there is no single answer to the title question: in some environments Spark would be the ideal solution, while our test demonstrated that, although Spark is a substantially more powerful tool, for linear processing of large amounts of data MapReduce is definitely the better choice.