The guide to stream processing
Stream processing is a concept that has been in the data management community for several decades. With the advent of big data technology, the term has taken on a new meaning. Interestingly, stream processing relates to all three Vs of big data: Volume, Velocity and Variety.
We’ll discuss the common uses, describe the challenges of stream processing, and outline stream processing best practices that optimize scalability.
Why stream processing?
Historically, the term stream processing was used synonymously with real-time processing. The process generally entails defining and taking action on a continuous series of data as soon as it arrives or is created in the system. From a practitioner's or developer’s point of view, the most important task in real-time stream processing is to define and implement meaningful actions on continuously incoming data.
The proliferation of smart devices, the Internet of Things (IoT), collaborative data collection projects, and multi-sensor technology has contributed to the generation of data at an unprecedented rate and scale. Real-time solutions are essential to many applications, ranging from logistics, to anomaly and intrusion detection, to health sciences and clinical diagnosis. Analyzing this data presents major challenges for large-scale, time-sensitive processing due to its extreme scale, the demand for fast responses, and its multidimensional nature.
Over the years, much work has gone into building systems and frameworks for defining semantics for real-time stream processing. Prominent data stream engines such as Aurora and Borealis use a concept called the Continuous Query: a query is issued once and executed continuously on the input data stream. Aurora provides a rule-based stream query language, whereas Borealis integrates it with Medusa for a distributed environment. Similarly, STREAM provides the Continuous Query Language (CQL), with constructs to handle data streams in a relational environment, and TelegraphCQ extends PostgreSQL with fault-tolerant adaptive query operators.
Recent systems tried to take relatively newer approaches to the same problem of handling data streams. PipelineDB is specialized for time series data aggregation by incrementally storing results of continuous SQL queries on data streams. RethinkDB follows document oriented NoSQL principles to handle data streams and VoltDB provides an ACID compliant in-memory database using a shared-nothing architecture.
Despite their usability, these traditional data handling approaches lack the ability to scale to the needs of high-speed, real-time processing due to their I/O- and storage-bound, centralized architectures, in which data must pass through several pre-processing stages before it is ready to generate results.
The challenges of stream processing
Contemporary stream processing sounds straightforward, but vast amounts of real-time data add complexity at multiple levels.
Big data (volume, velocity and variety)
All data management and handling solutions are bound by the rules of big data. In general, the problem of “volume” in big data is a difficult problem to tackle. Over the past decade, Hadoop-based distributed systems have served as the industry standard for large scale big data processing. Big data complexity then combines with the high volume and high velocity of stream processing. Adding data variety only increases the overall complexity of the problem.
An important characteristic of stream processing is its time sensitivity. Data coming in real-time requires results to be generated in real-time as well. This is especially important for applications where outdated results are useless, harmful, or potentially dangerous. For instance, delay in response to security alerts can lead to hazardous situations for workers in manufacturing environments. Similarly, late data accumulation in business intelligence reports can cause monetary losses. In order to minimize latency and maximize data freshness, contemporary stream processing systems utilize strategies such as result estimation, change data capture, and file tailing.
To avoid these situations, the notion of time needs to be clearly defined while developing stream processing solutions. While the actual definition of time can vary by use case, it has a direct effect on the quality and completeness of the results generated by the system. One important consideration is the distinction between event time and processing time in data streams. As a general rule, processing time is easier to define and handle, since it is not “part” of the data and depends only on the processing system, but results based on processing time are non-deterministic. Event time, on the other hand, is extracted from the data itself, allowing for consistent results despite out-of-order or late data through the use of watermarks. Various stream processing semantics, such as triggers, windowing, and timers, depend directly on the time attribute defined in the system.
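To make the event-time/watermark distinction concrete, here is a minimal in-memory sketch (not tied to any particular framework) of a tumbling-window counter keyed by event time. The watermark trails the largest event time seen by an allowed-lateness margin, so out-of-order events within that margin still land in the correct window; the window sizes and lateness values are illustrative assumptions.

```python
from collections import defaultdict

class EventTimeWindower:
    """Toy event-time tumbling-window counter with a watermark.

    Events may arrive out of order; a window is emitted only once the
    watermark (max event time seen minus allowed lateness) passes its end.
    """
    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.counts = defaultdict(int)   # window start -> event count
        self.max_event_time = 0

    def on_event(self, event_time):
        """Assign the event to its window, advance the watermark,
        and return any windows the watermark has now closed."""
        window_start = (event_time // self.window_size) * self.window_size
        self.counts[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        closed = [w for w in self.counts if w + self.window_size <= watermark]
        return {w: self.counts.pop(w) for w in sorted(closed)}
```

With windows of size 10 and an allowed lateness of 5, the out-of-order event at time 7 arriving after time 12 is still counted in window [0, 10), because the watermark has not yet passed that window's end.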
In batch processing, data from multiple entities is combined into a single entity to generate overall results. In the stream processing world, data arrives as potentially unbounded continuous streams; retaining everything would require unlimited storage, which is not practical in real-world scenarios. Furthermore, raw data is seldom used as-is for analysis and generally goes through various enrichment procedures, which further increases storage requirements. To tackle this, stream processing systems usually provide rich operations such as data filtration, aggregation, deduplication, redundancy removal, and change detection.
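Two of these stream operations, deduplication and change detection, can be sketched as simple Python generators; these are illustrative stand-ins, not any particular framework's operators, and a production deduplicator would bound its state (e.g. with a TTL) rather than keep an ever-growing set.

```python
def deduplicate(stream, key=lambda x: x):
    """Drop records whose key has already been seen.

    Note: the `seen` set grows without bound on an unbounded stream;
    real systems expire this state after a window or TTL.
    """
    seen = set()
    for record in stream:
        k = key(record)
        if k not in seen:
            seen.add(k)
            yield record

def detect_changes(stream):
    """Emit a record only when it differs from the previous one."""
    previous = object()  # sentinel that equals nothing
    for record in stream:
        if record != previous:
            yield record
            previous = record

readings = [1, 1, 2, 2, 2, 3, 1]
changes = list(detect_changes(readings))  # [1, 2, 3, 1]
unique = list(deduplicate(readings))      # [1, 2, 3]
```

Because both are generators, they compose lazily: each consumes one record at a time and never materializes the full (potentially unbounded) stream.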
Parallel & distributed stream processing (scalability)
To handle the complexity of large scale stream processing, one popular option is distributed computing. As mentioned earlier, distributed computing is dominated by the Hadoop ecosystem. Hadoop-based systems excel at providing scalability, performance, and efficiency. However, Hadoop-based systems were originally developed for large scale batch processing and are IO-bound, making them highly inefficient for real-time stream processing. In general, any storage-based IO-bound system is not suitable for handling high velocity data processing because disk I/O is slower than in-memory processing.
The fundamental concepts of stream processing
Keeping in mind the challenges of large-scale stream processing, we present the basic components of a system capable of processing continuous streams of high frequency data in real-time.
The traditional programming model is an active model in which operations are performed on data. Event-driven architecture is a passive model in which operations are triggered by predefined events. Unlike traditional models, this fits the stream processing world far better.
The user defines and registers events with the system. At the same time, the user also registers event consumers and reaction procedures. A logical system layer, the Event Processing Engine, ingests and monitors streams of data for registered events. Once an event is detected, it is forwarded to its consumers and the appropriate reaction procedures are triggered.
Stream processing driven by event processing can generally be categorized into simple event processing and complex event processing. Simple event processing relates to particular and measurable changes in data. An event generally directly associates with a consumer or reaction. Complex event processing, on the other hand, deals with detecting and managing patterns, combinations and correlations of simple events. These can be causal, temporal, or spatial, and generally require sophisticated semantics and techniques.
An important aggregation semantic in a stream processing environment is window management. Windowing allows periodic data aggregation and processing. The actual processing or aggregation may implement batching or running-function variants depending on the underlying framework, and window semantics are tightly coupled with that framework’s time handling. Different frameworks support different windowing notions, such as tumbling, sliding, and session windows.
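As a small illustration of the running-function variant, here is a count-based sliding-window average (a framework-independent sketch; real engines typically window by time rather than by record count). It maintains a running total so each update is O(1) instead of re-summing the window.

```python
from collections import deque

class SlidingAverage:
    """Running average over the last `size` records
    (a count-based sliding window)."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, value):
        # Subtract the record about to slide out of the window, if any.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)
```

A tumbling window would instead reset its state entirely each period, and a session window would close after a gap of inactivity; all three differ only in when state is evicted.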
Messaging semantics relate to the nature of distributed workers, which can fail and restart independently of each other. This can lead to message loss or duplication, which directly affects results. In general, stream processing distributed systems provide three types of message guarantees: at least once, at most once, and exactly once.
“At least once” is the most basic message delivery semantic. If a worker fails, upon restart it always re-issues or replays the same message. While this ensures that a message is always accounted for, it can result in data duplication and may need to be manually handled by the developer.
“At most once” avoids duplication at all costs but it can lead to data loss because workers only issue messages once, regardless of whether they were successfully delivered or not.
“Exactly once” is the hardest message delivery guarantee to ensure in a distributed system. Usually it is achieved using periodic checkpointing which can potentially affect overall system performance. However, one possible strategy adopted by systems is asynchronous checkpointing. Adding checkpoints synchronously could create bottlenecks and disrupt the streaming process whereas asynchronous checkpointing would avoid this risk.
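A common practical pattern is to combine at-least-once delivery with consumer-side deduplication by message id, which yields effectively-once processing. The sketch below is illustrative; in a real system the processed-id state would have to be persisted atomically with the results (e.g. in a checkpoint), or the deduplication guarantee is lost on restart.

```python
class IdempotentConsumer:
    """Turns at-least-once delivery into effectively-once processing
    by discarding messages whose id has already been processed.

    Assumption: in production, `processed_ids` is persisted atomically
    with the processing results; an in-memory set is for illustration.
    """
    def __init__(self, process):
        self.process = process
        self.processed_ids = set()

    def receive(self, msg_id, payload):
        if msg_id in self.processed_ids:
            return False  # duplicate replayed after a worker restart
        self.process(payload)
        self.processed_ids.add(msg_id)
        return True

results = []
consumer = IdempotentConsumer(results.append)
# Message id 1 is replayed, as an at-least-once broker may do after a failure.
for msg_id, payload in [(1, "a"), (2, "b"), (1, "a")]:
    consumer.receive(msg_id, payload)
# results == ["a", "b"]
```

This shifts the hard part of exactly-once from delivery to state management, which is precisely what checkpointing addresses.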
Stream processing engines employ various strategies to handle data, the most popular being micro-batching and operator-based computational models. While micro-batching provides near-real-time processing, operator-based models tend to be more suitable for true streaming capabilities since they process each record as it arrives.
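The essence of micro-batching can be shown in a few lines: the stream is chopped into small fixed-size batches, and each batch is handed to ordinary batch logic. This generic sketch is not any specific engine's API; real systems usually cut batches by time interval as well as by count.

```python
import itertools

def micro_batches(stream, batch_size):
    """Group a (potentially unbounded) stream into small fixed-size
    batches; each batch is then processed with ordinary batch logic."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

# e.g. list(micro_batches(range(7), 3)) == [[0, 1, 2], [3, 4, 5], [6]]
```

The latency floor of this model is the batch interval itself, which is why operator-based engines, which forward each record through the operator graph immediately, are considered closer to true streaming.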
Stream Processing Best Practices
Minimize IO as much as possible
Since processing time is critical for stream handling, it is best to keep data as near to the processor as possible at all times and avoid disk-based IO. With the affordability of main memory, modern techniques usually follow in-memory processing models that maximize performance while minimizing latency.
Optimize data distribution/partitioning/sharding
Partitioning and data distribution are essential components of distributed processing. A distributed system is only as fast as its slowest worker. Uneven data distribution or partitioning can result in more data ending up on some workers than others, increasing their processing load and creating stragglers. Therefore, considerable effort should be dedicated to tuning data partitioning.
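A common baseline is hash partitioning: route each record to a worker by a stable hash of its key, so that keys spread roughly evenly and the same key always lands on the same worker. The sketch below uses SHA-256 for stability across processes (Python's built-in `hash()` is salted per process); it does not address hot keys, which skew load regardless of hash quality.

```python
import hashlib

def partition(key, num_workers):
    """Assign a record key to a worker via a stable hash.

    A uniform hash spreads distinct keys evenly across workers,
    avoiding stragglers caused by uneven partitioning; a single
    very hot key still needs separate handling (e.g. key salting).
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Because the mapping is deterministic, stateful operators (such as per-key windows) can rely on all records for a key arriving at the same worker.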
Scalability is of utmost importance
Modern big data processing heavily relies on the capability to scale on demand. Modern systems provide both scale-up and scale-out capabilities. While scaling out is generally more advisable, the decision ultimately depends on the use case.
Data stream technologies
There are many implementations of stream processing technologies available to developers. It’s often cumbersome to switch technologies once you have invested time learning and embedding one, so we recommend spending more evaluation time upfront before finalizing your selection. In the table below we compare the main attributes of a few chosen implementations.
Stream processing by itself is not a new concept, but stream processing at scale definitely poses modern challenges. A popular strategy for tackling scalability is to take advantage of distributed processing. Event-driven architecture, coupled with time and window management, simplifies stream processing from the developer’s perspective. Existing popular stream processing frameworks provide different fine-grained capabilities suitable for a wide range of applications.