Apache Spark and MapReduce

Hadoop and Spark are both popular Apache projects. Apache Spark is an improvement on the original Hadoop MapReduce component of the Hadoop ecosystem. There is great excitement around Apache Spark among developers because it offers a real advantage for interactive queries over in-memory data sets and for multi-pass iterative machine learning algorithms. However, there is a heated debate over whether Spark can replace Apache Hadoop and become the top big data analytics tool. What follows is a detailed comparison of Spark and Hadoop that helps explain why Spark is faster than Hadoop MapReduce.

Hadoop MapReduce

MapReduce is Hadoop's programming framework for distributed, parallel processing of large data sets across a cluster. As the name suggests, a MapReduce job contains two important tasks: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples of an output key and an output value. Reduce then takes the output of a Map as its input and combines those tuples into a smaller set of tuples. The reducer output is written to HDFS (the Hadoop Distributed File System).
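
To make this key/value flow concrete, here is a minimal word-count sketch written in Scala against the Hadoop MapReduce Java API. It is an illustration only: the class names TokenMapper and SumReducer and the whitespace tokenization are assumptions for this example, not taken from any particular Hadoop job.

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Mapper, Reducer}

    // Map: break each input line into (word, 1) tuples -- the output key
    // and output value described above.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()

      override def map(key: LongWritable, value: Text,
                       context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
        for (token <- value.toString.split("\\s+") if token.nonEmpty) {
          word.set(token)
          context.write(word, one) // emit (word, 1)
        }
      }
    }

    // Reduce: combine all tuples sharing a key into a smaller set of
    // tuples -- here, one (word, totalCount) pair per word.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        values.forEach(v => sum += v.get)
        context.write(key, new IntWritable(sum)) // this output lands in HDFS
      }
    }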

Hadoop is essentially a distributed data infrastructure: it distributes massive data collections across multiple nodes within a cluster of commodity servers, so there is no need to purchase and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big data processing and analytics far more effectively than was previously possible.

However, a drawback is MapReduce's speed. Between the Map and Reduce tasks there is a synchronization barrier, and the intermediate data must be persisted to disk. The framework was designed this way so that jobs can be recovered in case of failure, but the drawback is that it does not make maximal use of the Hadoop cluster's memory.

Apache Spark

As an emerging platform and a designated Apache Top-Level Project, Spark was developed to overcome MapReduce's main shortcoming: it is not optimized for iterative algorithms and interactive data analysis, which perform cyclic operations on the same set of data. For this, Spark relies on Resilient Distributed Datasets (RDDs) as its base unit.
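
As a minimal sketch of the RDD abstraction, the Scala snippet below builds an RDD, applies a transformation, and runs an action; the application name and the local master setting are placeholders for this example.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // An RDD is a fault-tolerant, partitioned collection that Spark
        // can keep in memory across operations.
        val numbers = sc.parallelize(1 to 1000000)

        // Transformations only record a lineage graph; nothing runs yet.
        val squares = numbers.map(n => n.toLong * n)

        // An action triggers the distributed computation.
        println(squares.reduce(_ + _))
        sc.stop()
      }
    }

The lineage recorded in an RDD is also what lets Spark recompute lost partitions after a failure, instead of checkpointing every intermediate result to disk the way MapReduce does.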

The difference is most clearly seen in Spark's speed. Spark can execute batch processing jobs roughly 10 to 100 times faster than the Hadoop MapReduce framework by processing data in memory rather than relying on the persistent storage Hadoop uses. In-memory processing is faster because no time is spent moving data and intermediate results in and out of the disk, whereas MapReduce spends a lot of time on these input/output operations, which increases latency.
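
The snippet below is a hedged sketch of that difference: caching a data set keeps repeated passes in memory. It assumes a SparkContext named sc like the one in the previous example, and the HDFS path is purely illustrative.

    // Assumes an existing SparkContext `sc`; the path is a placeholder.
    val logs = sc.textFile("hdfs:///data/events.log")
      .filter(_.contains("ERROR"))
      .cache() // keep the filtered partitions in memory

    // Each pass below reuses the cached data instead of re-reading from
    // disk -- the repeated I/O that a chain of MapReduce jobs would pay.
    val totalErrors    = logs.count()
    val distinctErrors = logs.distinct().count()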

With Spark, we can use the built-in libraries to perform batch processing, streaming, machine learning, and interactive SQL queries on a single cluster, unlike Hadoop, which provides only batch processing at its core.
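
For instance, the same Spark application can serve an interactive SQL query through Spark SQL. This is a sketch under assumed names: the session settings and the sales.json path are illustrative, not from the original article.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("unified-stack-sketch")
      .master("local[*]")
      .getOrCreate()

    // One session, one cluster: the same application can also use the
    // streaming and machine learning libraries mentioned above.
    val sales = spark.read.json("hdfs:///data/sales.json") // placeholder path
    sales.createOrReplaceTempView("sales")

    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
         .show()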

Spark's advantage is most pronounced in multi-pass iterative computations, such as machine learning, where a computation must be performed repeatedly on the same set of data. To support this, Spark ships with a built-in scalable machine learning library, MLlib, which contains high-quality algorithms that leverage iteration and can yield better results than the one-pass approximations sometimes used on MapReduce.
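
As one hedged illustration of such an iterative algorithm, the sketch below trains k-means clustering with MLlib's RDD-based API; the input path, the choice of k = 3, and the 20 iterations are assumptions made for the example.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumes the SparkContext `sc` from earlier; the path is a placeholder.
    val points = sc.textFile("hdfs:///data/points.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache() // k-means makes many passes, so keep the data in memory

    // Every one of the 20 iterations reuses the cached RDD -- exactly the
    // multi-pass pattern where Spark outperforms MapReduce.
    val model = KMeans.train(points, 3, 20) // k = 3 clusters, 20 iterations
    println("Cluster centers: " + model.clusterCenters.mkString(", "))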