Apache Spark vs Hadoop
Big data processing can be done by scaling up computing resources (adding more resources to a single system) or scaling out (adding more computer nodes). Traditionally, increased demand for computing resources in data processing has led to scaled-up computing, but it couldn’t keep up with data growth, so scaling out or a distributed processing approach has taken over in the last decade. It has evolved from processing data at rest (batch processing) to stream processing and complex event processing (CEP).
Hadoop was one of the pioneers in leveraging the distributed system for data processing. Later, Apache Spark overcame the shortcomings of Hadoop, providing features such as real-time processing, reduced latency, and better abstraction of distributed systems. Both are widely used in the industry.
In this article, we’ll compare Hadoop and Apache Spark across different vital metrics.
Table 1. Comparison of Apache Hadoop and Apache Spark
Parallel vs. distributed processing
Comparing Apache Spark and Hadoop requires understanding the difference between parallel and distributed computing. The data processing community adapted different architectures as the trend changed from vertical scaling to horizontal scaling. There are three different taxonomies for parallel architectures.
The shared-nothing approach doesn’t share any resources—each node is independent. Shared disk taxonomy has a shared disk, while shared memory has both shared disk and memory.
Distributed computing is the simultaneous use of more than one computer connected over the network to solve a problem, similar to a shared-nothing architecture. Parallel computing, in comparison, is a shared memory architecture involving the simultaneous use of more than one process to solve a problem.
Hadoop uses distributed computing concepts while Apache Spark leverages both distributed and parallel computing. Parallelization and distributed come with their own challenges, so it’s vital to choose the right technology for your use case. Some of the challenges include the following:
- How to ensure fault tolerance in a distributed cluster
- How to make it easy for developers to write distributed programs
- How to distribute computation tasks to different nodes
Hadoop is an ecosystem that provides different tools for parallel data processing. It introduced the concept of leveraging scaled-out commodity hardware compared to traditional scaled-up data processing systems. Furthermore, it overcame the challenge of handling unstructured data such as text, audio, videos, and logs.
Hadoop is comprised of the following modules:
- Hadoop Distributed File System (HDFS): A highly fault-tolerant file system designed to run on commodity hardware.
- Yet Another Resource Negotiator (YARN): A resource manager and job scheduling platform that sits between HDFS and MapReduce. It has two main components, a scheduler, and an application manager.
- MapReduce: A parallel data processing framework for large clusters.
MapReduce is a shared-nothing architecture for big data processing using distributed algorithms while leveraging a commodity hardware cluster. It takes care of data distribution, parallelization, and fault tolerance transparently while providing abstractions for the developer. MapReduce supports batch processing on big datasets and executes algorithms in parallel. Since storage and computing are on the same node, it also provides good performance.
MapReduce processes the data in three main stages—Map, Shuffle, and Reduce. First, the input file is distributed, and inputs are sent to each parallel instance. Second, the Map task operates on a single HDFS block and generates an intermediate key/value pair after the process. Third, the Shuffle stage consolidates all the map task outputs and sorts them based on a key. Finally, the Reduce task combines and reduces all the intermediate values associated with the same key.
Apache Spark is an open source data processing framework that was developed at UC Berkeley and later adapted by Apache. It was designed for faster computation and overcomes the high-latency challenges of Hadoop. However, Spark can be costly because it stores all the intermediate calculations in memory.
Apache Spark supports batch, stream, interactive, and iterative processing. Spark streaming also lets you stream and process the data in real-time. It also has features that Hadoop doesn’t offer, such as running SQL queries and performing complex analytics using machine learning and graph algorithms. Spark is backward compatible with the existing Hadoop ecosystem components, such as HDFS and HBase, which makes it relatively straightforward to integrate older data systems.
An Apache Spark application consists of a driver process and a set of executor processes. The driver process is the heart of the application and orchestrates code execution on the executors. The driver process also coordinates user input and maintains information about the application. Apache Spark jobs are directed acyclic graphs (DAGs) composed of input data sources, operators, and output data sinks.
Apache Spark uses the resilient distributed dataset (RDD)—a distributed memory abstraction. RDDs are fault-tolerant, immutable data structures that support the parallel processing of events across the cluster. An RDD is divided into several partitions stored on different nodes of a cluster.
Comparison of Apache Hadoop and Apache Spark
Now we will discuss the different vital metrics in detail and compare them for Hadoop and Apache Spark.
One of the key differentiators between Apache Spark and Hadoop is performance. It’s unfair to compare their performance directly because they use different types of storage due to their architecture. Hadoop stores the results of each intermediate step on disk, but Apache Spark can process in-memory. Although Hadoop reads data from local HDFS, it cannot match Spark’s memory-based performance. Apache Spark’s iterations are 100 times faster than Hadoop for in-memory operations and ten times quicker for on-disk operations.
Hadoop offers batch processing, while Apache Spark offers much more. In addition, both frameworks handle data in different ways: Hadoop uses MapReduce to split large datasets across a cluster for parallel data processing, while Apache Spark provides real-time streaming processing as well as graph processing. However, Hadoop needs to be combined with other tools to achieve such functionality.
Hadoop doesn’t offer real-time processing—it uses MapReduce to execute the operations designed for batch processing. Apache Spark provides low-latency processing and delivers near-real-time results via Spark Streaming. It can read real-time event streams with millions of events per second, such as the stock market, Twitter, or Facebook streams.
Spark and Apache Hadoop are both open source, so there is no licensing cost involved in their use. However, development and infrastructure costs need to be taken into consideration. Hadoop relies on disk storage and can be used with commodity hardware, making it a low-cost option. In contrast, Apache Spark operations are memory-intensive and require a lot of RAM, which increases the cost of the infrastructure.
Scheduling and resource management
Apache Spark has Spark Scheduler, which is responsible for distributing the DAG into stages. Each stage has multiple tasks scheduled to computing units and executed by the Spark execution engine. Spark Scheduler with Block Manager (a key-value store for blocks of data in Spark) takes care of job scheduling, monitoring, and distribution in a cluster.
Hadoop doesn’t have a native scheduler and needs to use an external one. It uses YARN for resource management and Oozie for workflow scheduling. However, YARN only distributes the processing power, so the application state needs to be managed by the developers.
Both frameworks provide a fault-tolerant mechanism and take care of failures transparently, so users don’t have to restart applications in the case of failure. However, they have different techniques to handle failures.
Hadoop has fault tolerance as the basis of its operation. The master node keeps track of all the nodes in the system; if a node dies, its tasks are delegated to the other nodes. Also, each data block is replicated on multiple nodes. When failure happens, it resumes the operation by replicating the missing or faulty block from other locations.
Apache Spark manages fault-tolerance on the basis of each RDD block. It has the lineage of the RDD in the DAG, including its source and the different operations performed. In case of a failure, Spark can regenerate an RDD all the way up to its inception. Additionally, it supports checkpointing as well, similar to Hadoop, to reduce the dependency on RDDs.
Hadoop is more secure than Apache Spark. Hadoop supports multiple authentication mechanisms, including Kerberos, Apache Ranger, Lightweight Directory Access Protocol (LDAP), and access control lists (ACLs). It also offers the standard file permissions on the HDFS. In contrast, Apache Spark has no security by default, making it more vulnerable to attacks unless configured adequately. In addition, Spark only supports authentication via shared secret passwords.
Programming language and ease of use
Apache Hadoop is developed in Java and supports Python as well. Apache Spark is developed using Scala and supports multiple languages, including Java, Python, R, and Spark SQL. Thus, it helps developers by choosing the programming language of their choice to avoid a learning curve.
One significant advantage of Apache Spark is that it offers an interactive mode for development, allowing developers to analyze data interactively using the programming language of their choice: Scala or Python. Furthermore, it is easy to program since it offers many high-level operators for RDD. On the flip side, Hadoop development is complex as developers need to understand the APIs and hand-code most operations.
Hadoop doesn’t have native machine learning libraries. Apache Spark provides machine learning support via MLlib. MLlib is easier to use and get started with for development on Spark for machine learning use cases due to excellent community support. In addition, it’s faster due to in-memory iterations.
Apache Mahout is used for machine learning development for Hadoop as Mahout uses MapReduce. However, performance is slow, and data fragments can be too large and become bottlenecks. Furthermore, Apache Mahout’s recent releases use Spark instead of MapReduce.
Hadoop offers basic data processing capabilities, while Apache Spark is a complete analytics engine. Apache Spark provides low latency, supports more programming languages, and is easier to use. However, it’s also more expensive to operate and less secure than Hadoop. In addition, both platforms require trained technical resources for development. Besides these two frameworks, developers can use different frameworks for big data processing like Apache Flink and Apache Beam.
Macrometa offers a hosted service with streaming and CEP (complex event processing) capabilities, a pay-as-per-usage model, and robust documentation. In addition, Macrometa’s global data network (GDN) is a fully managed real-time and low-latency materialized view engine that lets you develop a pipeline with minimal code and no significant learning curve due to its easy and straightforward interface. You can try it for free here.