Doug Cutting and Mike Cafarella officially introduced Apache Hadoop in April 2006, and it has been evolving continuously ever since. Apache Hadoop is a collection of open source software utilities that run on clusters of commodity hardware to solve large-scale data storage and computation problems. In this article, we'll discuss a Hadoop utility called Hadoop Streaming, explain how it works, and compare it to other technologies.
The Hadoop framework consists of a storage layer known as the Hadoop Distributed File System (HDFS) and a processing framework based on the MapReduce programming model. Hadoop splits large datasets into chunks, distributes them across the cluster, and processes them with the MapReduce framework.
People often use "Hadoop" to refer to the ecosystem of applications built on it, including Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm, and more. Some tools, however, do not need Hadoop to process data: Kafka and Spark, for example, can run in both Hadoop and non-Hadoop environments.
Introduction to Hadoop Streaming
Hadoop Streaming is a utility in the Hadoop ecosystem that lets users run a MapReduce job with any executable or script as the Mapper and Reducer. Hadoop Streaming is often confused with real-time streaming, but it is simply a utility that runs executable scripts inside the MapReduce framework. The executable script itself may, of course, contain code that ingests data in real time.
The basic function of the Hadoop Streaming utility is to take the user-supplied executables as the Mapper and Reducer, create a MapReduce job from them, submit the job to the cluster, and monitor the job until it completes.
Hadoop vs Spark vs Kafka
Other open source tools and utilities on the market also perform real-time and batch analysis of big data. Choosing among them requires learning what each does well, so in this article we compare some popular tools to help you pick the one best suited to your requirements. Let's take a brief look at each.
Apache Spark is a fast, open source unified analytics engine for big data and machine learning. Spark uses in-memory cluster computing and handles large volumes of data by parallelizing work across its executors and cores, making it up to 100 times faster than conventional Hadoop MapReduce for in-memory workloads. Spark supports several programming languages, including Python, Java, Scala, and R, and it can run on top of Hadoop or Mesos, in the cloud, or standalone.
Apache Kafka is an open source distributed event streaming tool built for real-time stream processing. Event streaming means capturing data in real time from event sources like IoT sensors, mobile devices, and databases to create a series of events. The real-time data in Kafka can then be processed by event-processing tools such as Apache Spark, ksqlDB, and Apache NiFi.
Kafka has three main components: Producers, Topics, and Subscribers (Consumers). The Producer is the source that creates the data/event, Topics store the data/events, and Subscribers (Consumers) later consume them. Producers and Consumers can be programs (Java, Python, etc.), IoT devices, databases, or anything else that generates or reacts to events.
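To make the Producer/Topic/Consumer flow concrete, here is a toy in-memory sketch of the model in Python. This is purely illustrative and is not the Kafka API; every name below is invented for the sketch:

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for a Kafka broker: an append-only log per topic,
    with a separate read offset tracked per (consumer, topic) pair."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> list of events
        self.offsets = defaultdict(int)   # (consumer, topic) -> next offset

    def produce(self, topic, event):
        # A Producer appends an event to the topic's log.
        self.topics[topic].append(event)

    def consume(self, consumer, topic):
        # A Consumer reads only the events it has not seen yet,
        # then advances its offset to the end of the log.
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = Broker()
broker.produce("sensor-readings", {"device": "iot-1", "temp": 21.5})
broker.produce("sensor-readings", {"device": "iot-2", "temp": 19.0})
print(broker.consume("analytics", "sensor-readings"))  # both events
print(broker.consume("analytics", "sensor-readings"))  # [] (already consumed)
```

Note how the topic retains events independently of any consumer; each consumer tracks its own position in the log, which is the essence of Kafka's decoupled design.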
Refer to the table below for other comparison criteria.
Hadoop Streaming architecture
In the diagram above, the Mapper reads the input data from the Input Reader/Format as key-value pairs, transforms them according to the logic in the code, and passes them on to the Reducer, which aggregates the data and writes the output.
We will discuss the operation of Hadoop Streaming in detail in the next section.
Features of Hadoop Streaming
Streaming provides several important features:
- Users can run MapReduce jobs written in languages other than Java on Hadoop clusters. Supported languages include Python, Perl, and C++.
- Hadoop Streaming monitors the progress of jobs and provides logs of a job’s entire execution for analysis.
- Hadoop Streaming works on the MapReduce paradigm, so it supports scalability, flexibility, and security/authentication.
- Hadoop Streaming jobs are quick to develop and don't require much programming (except for executables).
Getting Started with Hadoop Streaming and its API
Shown below is the syntax to submit a MapReduce job using Hadoop Streaming and to use a Mapper and Reducer defined in Python.
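Assuming a typical installation, the submission command takes the form below; the streaming jar's exact path varies by Hadoop version and distribution, so treat it as a placeholder:

```shell
hadoop jar /path/to/hadoop-streaming.jar \
  -input  <HDFS input path> \
  -output <HDFS output path> \
  -mapper  <Mapper executable> \
  -reducer <Reducer executable>
```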
- input = Input location for Mapper from where it can read input
- output = Output location for the Reducer to store the output
- mapper = The executable file of the Mapper
- reducer = The executable file of the Reducer
Now let’s discuss the working steps of this Hadoop Streaming example:
- In the above scenario, we use Python to write the Mapper (which reads its input line by line from stdin) and the Reducer (which writes its output to the specified output location).
- The Hadoop Streaming utility creates a MapReduce job, submits the job to the cluster, and monitors the job until completion.
- Depending on the input file size, Hadoop Streaming launches a number of Mapper tasks (one per input split), each as a separate process.
- Each Mapper task reads its input (as key-value pairs produced by the input reader/format), applies the mapping logic from the code, and passes the results to the Reducers.
- Once all Mapper tasks succeed, the MapReduce job launches the Reducer tasks (one by default, unless otherwise specified) as separate processes.
- In the Reducer phase, data aggregation occurs, and the Reducer emits the output as key-value pairs.
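The map, shuffle, and reduce phases described in the steps above can be sketched in plain Python. This is a toy simulation of the data flow, not Hadoop itself:

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # Map phase: emit a (word, 1) pair for each word,
    # mirroring what a streaming Mapper prints to stdout.
    for word in line.split():
        yield (word, 1)


def reducer(key, values):
    # Reduce phase: aggregate all values for one key.
    return (key, sum(values))


lines = ["big data on hadoop", "hadoop streaming on hadoop"]

# Map phase: run the mapper over every input line.
pairs = [kv for line in lines for kv in mapper(line)]

# Shuffle/sort phase: Hadoop sorts by key so identical keys are adjacent.
pairs.sort(key=itemgetter(0))

# Reduce phase: one reducer call per distinct key.
result = dict(reducer(k, (v for _, v in g)) for k, g in groupby(pairs, key=itemgetter(0)))
print(result)  # {'big': 1, 'data': 1, 'hadoop': 3, 'on': 2, 'streaming': 1}
```

The sort step is the crucial contract: the Reducer only works because identical keys arrive together, which is exactly what Hadoop's shuffle guarantees between the two phases.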
Popular Hadoop Streaming parameters
Hadoop Streaming use case
A popular example is the WordCount program developed in Python and run on Hadoop Streaming. Let's look at this example step by step.
- Create a new file and name it mapper.py. Paste the code below into the file and save it.
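The mapper below is one possible implementation: it reads lines from stdin and emits one tab-separated `word 1` pair per word, the convention Hadoop Streaming uses to separate keys from values:

```python
#!/usr/bin/env python
import sys


def map_line(line):
    """Emit one 'word\t1' string per word in the line."""
    return [f"{word}\t1" for word in line.strip().split()]


if __name__ == "__main__":
    # Hadoop Streaming feeds the input split to this script via stdin.
    for line in sys.stdin:
        for pair in map_line(line):
            print(pair)
```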
- Create another new file named reducer.py, containing the code below.
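A matching reducer might look like the following. It relies on the fact that Hadoop sorts the Mapper output by key before the Reducer sees it, so identical words arrive on consecutive lines:

```python
#!/usr/bin/env python
import sys


def reduce_lines(lines):
    """Aggregate sorted 'word\t<count>' lines into 'word\t<total>' lines."""
    results = []
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            # Key changed: the previous word's total is complete.
            if current_word is not None:
                results.append(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        results.append(f"{current_word}\t{current_count}")
    return results


if __name__ == "__main__":
    # Hadoop Streaming pipes the sorted Mapper output to this script via stdin.
    for out in reduce_lines(sys.stdin):
        print(out)
```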
- Create an input file with some lines in it, such as some contents from Wikipedia, and paste them into the file. Save the file in the Hadoop location /user/input/input.txt
The file may look like this -
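For example (a hypothetical sample; any text will do):

```
Hadoop is an open source framework
Hadoop Streaming runs MapReduce jobs
Streaming jobs can use any executable
```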
- Submit the MapReduce job with the following command:
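A concrete submission for this WordCount example might look like this; the path to the streaming jar depends on your Hadoop installation, so adjust it as needed:

```shell
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input  /user/input/input.txt \
  -output /user/output \
  -mapper  mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py
```

The `-file` options ship the two Python scripts to the cluster nodes so every Mapper and Reducer process can execute them locally.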
- Once the job succeeds, you will see some part files in the /user/output location.
- To view the contents of a part file, open it in any editor. It will look something like this. (The exact content depends on the input file.)
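For instance, a part file might contain tab-separated word counts such as (values purely illustrative):

```
Hadoop	2
MapReduce	1
Streaming	2
executable	1
```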
This article reviewed Hadoop Streaming, its features, its architecture, and a use case. Hadoop Streaming executes non-Java code in the MapReduce paradigm, but newer open source projects like Spark and Kafka are gradually replacing Hadoop. One reason is that Spark is faster than Hadoop thanks to its in-memory computation. Spark also supports multiple languages, accommodating a broader spectrum of development projects. Kafka is known for its scalability, resilience, and performance. Together, Spark and Kafka can process data in real time and offer real-time analytics, something Hadoop does not.
Macrometa offers a modern alternative designed to simplify and shorten the development of streaming applications. Macrometa combines a NoSQL database, pub/sub, and event processing services into a single global data network (GDN) that lets you build a pipeline with minimal code. Try it for free.