Chapter 10 of Event Stream Processing
Apache Storm is an open source distributed stream processing engine. Mainly written in the Java and Clojure languages, it became popular after being acquired by Twitter in 2011. Similar to Hadoop for general-purpose distributed processing, Storm can be considered a pioneer in the domain of real-time distributed stream processing.
Over the years, other comparable frameworks have been developed in the domain of distributed stream processing. Some of the most popular of these are Apache Spark and Apache Flink. All these frameworks aim for low-latency, real-time stream processing as their fundamental capability. The purpose of this article is not solely to compare these frameworks. However, due to their popularity, it is worth noting that Apache Storm provides an API that is more low-level and has finer granularity than Apache Flink and Apache Spark Streaming.
In this article, we will study Apache Storm with respect to stream processing concepts we have covered in previous articles
Apache Storm building blocks
Apache Storm shares many low-level details with its batch processing counterpart, Apache Hadoop. For instance, the key-value pair is the fundamental data structure that Hadoop uses. The basic data unit in Apache Storm is a tuple, which can be defined as a list of ordered elements. All interactions and data flow among system components occur in the form of tuples.
From a technical perspective, Storm tuples are dynamically typed: The values of the elements in a tuple can be of any type and can be determined at runtime. This is particularly useful in stream processing environments where incoming data can be structured, semi-structured, or unstructured. Since all data is represented in the form of tuples, every tuple needs to have a proper serialization mechanism to convert it to/from bytes and to send/receive the data across the network to the system nodes. This can prove to be an extra step for developers while working on non-trivial systems.
All sources of data in Apache Storm are called spouts. Every ingress interaction with an external source is done via a spout. Spouts read data from connected external sources and emit them into the system as tuples for further processing. Some examples of spouts are Twitter streaming API, Kafka, Google PubSub, etc. A single spout is capable of emitting multiple streams from a single external source.
A bolt is the basic processing unit in Apache Storm. Every bolt consists of logical steps that need to be performed on every tuple in the incoming stream. A bolt ingests one or more streams of tuples and emits one or more streams.
A bolt is also the basic unit of parallelism in Apache Storm. Bolts are scheduled and executed in parallel on processing nodes in the underlying distributed cluster. These are comparable to map functions in Apache Hadoop. Bolts contain code that can be executed independently on a subset or partition of the input data stream in parallel on distributed resources.
To summarize, Apache Storm is built on basic data units of tuples and processing units of bolts. Tuples and an array of bolts are connected together to form a topology. Like other big data processing frameworks, a topology is typically arranged in the form of a directed acyclic graph (DAG). Apache Spark follows a similar DAG-style workflow. However, while Spark DAGs start with sources and end at sinks, every record in the incoming stream goes through the DAG-like topologies and keeps running indefinitely until preempted explicitly by the user.
Figure 1: Typical Apache Storm DAG architecture. Every record in the input stream goes through the same DAG.
See Macrometa in action
Request a demo with one of our expert solutions architects.
- Extend your existing applications
- Built for hyper distributed use cases
- Up to 100x faster than AWS or GCP
- Serve billions of users in real-time
Stream processing characteristics
Messaging semantics relate to the nature of distributed processes, which can fail and restart independently of each other. This can lead to message loss or duplication, which can directly affect the results. Apache Storm guarantees at-least-once processing using record-level acknowledgements. While this ensures that a message is always accounted for, it can result in data duplication and may need to be manually handled by the developer.
Stream processing engines employ various strategies to handle data, the most popular being micro-batching and operator-based computational models. While micro-batched strategies provide near-real-time processing, operator-based models tend to be more suitable for true streaming capabilities. The basic Apache Storm framework is built on a micro-batched mechanism and cannot be categorized as true streaming like Apache Flink.
State management is an important characteristic of any distributed system. Apache Storm is stateless in nature, which allows it to process data as fast as possible, making it suitable for real-time data processing.
A considerable disadvantage of statelessness in distributed systems is its effect on the framework’s fault tolerance. Apache Storm tries to mitigate fault tolerance issues by employing record-level acknowledgements. This strategy, while necessary for fault tolerance, is not as lightweight as other existing distributed stream processing frameworks such as Apache Flink. To further address fault tolerance, Apache Storm uses Apache Zookeeper to store and keep track of its state, which it can use to restart in case of failure.
In real-world distributed stream processing, analysis is usually performed across multiple data streams. These streams can originate from one or more external sources or can also be internal to the system after a single source stream gets divided into many based on user logic.
Any processing performed on more than one stream is susceptible to back-pressure build-up. This happens when different streams operate at different speeds while being ingested into a shared processing task. The shared processing task has direct dependency on both the streams and cannot proceed until data from both streams is available. This is especially an issue in distributed stream processing environments, where the processing of different operator tasks can be affected by all sorts of factors.
This is usually handled at the framework level, where multiple network handling strategies are employed to mitigate the back-pressure build-up, the details of which are beyond the scope of this article. Initial versions of Apache Storm did not have support for these strategies. However, starting with version 1.0, Apache Storm gracefully handles back-pressure build-up.
Data distribution and partitioning
Data stream partitioning and distribution is done through stream groupings that basically define how the incoming stream of tuples should be distributed among bolt tasks running in parallel on distributed resources. By default, Apache Storm comes with several built-in stream groupings:
- Shuffle Grouping: Random balanced distribution
- Fields Grouping: Distribution based on a user-specified field
- Partial Key Grouping: User-specified field distribution with load balancing
- All Grouping: All data replicated across all distributions
- Global Grouping: All data forwarded to a single bolt task with the lowest ID
- Direct Grouping: Predefined grouping for special direct streams only
- Local or Shuffle Grouping: Normal shuffle grouping with load balancing among local tasks
In addition to the above-mentioned groupings, Storm also allows developers to come up with custom grouping strategies to meet their requirements.
Apache Storm limitations
Despite being a pioneer in the distributed stream processing domain, Apache Storm is less popular than other frameworks. This can mostly be attributed to the following fundamental limitations.
Time management, including event time and processing time, are fundamental concepts in advanced stream processing. Unfortunately, Apache Storm does not explicitly support event time out of the box.
Another crucial capability of stream processing is windowing and support for window-based aggregations. This requires clear definition and support of timing semantics from the underlying framework. As mentioned earlier, Apache Storm has limited to no support for event-based time management, which directly translates to an absence of windowing mechanisms in Apache Storm.
Most of the competitive environments provide support for SQL-like APIs to develop and work with their underlying frameworks, e.g., Spark SQL & Dataframe API and Flink SQL & Table API. Such API integrations and support have a direct impact on development speed and framework adoption. Despite being around for quite some time, Storm SQL is still in the experimental stage as of version 2.2.0.
Apache Storm high-level API
In order to address the above-mentioned limitations, Trident was developed on top of Apache Storm starting in version 0.8. Trident is a high-level abstraction that employs a mini-batch processing model with stateful processing while ensuring exactly-once delivery guarantees. At the same time, compared to vanilla Storm, Trident topologies are relatively complex and have lower overall system throughput due to state generation. As a rule of thumb, if achieving the lowest possible latency is your primary goal, and you can manage some message duplication due to at-least-once semantics, you should opt for vanilla Storm instead of Trident.
Edge computing needs a next generation database technology
- Ultra fast distributed writes with Conflict-free Replicated Data Types (CRDTs)
- Solve scaling constraints due to geo-distributed time-stamping with Version Vectors
- A unified query language for KV, Docs, Graphs and Search with C8QL
Apache Storm vs Spark vs Flink
|True Streaming||Languages||Message Semantics||Complex Event Processing||Processing Model||Window Support||State Management|
|Apache Spark Streaming||No||Java Scala||Exactly once||Manual||Micro batching||Time-based only||Stateful|
|Apache Flink||Yes||Java Scala||Exactly once||Built-in||Streaming||Multiple||Built-in and custom|
|Apache Storm||No||Java Closure||At least once||Third-party||Micro-batching||Limited||Stateless|
Apache Storm is one of the first open source stream processing frameworks developed to operate in a distributed environment. It provides a breadth of low-level semantics and operators to efficiently perform low-latency, real-time processing on input data streams. This low-level control has proven to be a double-edged sword for the framework, however: On the one hand, it provides many fundamental stream processing operations with maximum control, but on the other, it has lower adoption than other frameworks with better out-of-the-box abstractions. Despite its limitations, I would strongly argue against Apache Storm being totally deprecated and replaced by other popular stream processing engines such as Apache Spark Streaming and Apache Flink.