Introduction:

Distributed Data

October 13, 2022
5 min read

Distributed Data

Near real-time data analysis is essential to delivering value in applications such as self-driving cars, IoT, systems monitoring, clickstreams, and fraud detection. These real-time applications expect sub-100 millisecond response times even as they stretch across the globe producing massive amounts of write operations into databases. 

These new applications have redefined distributed data from data stored on database cluster nodes to data streaming through a global analytics pipeline. They also force software developers to look beyond centralized public cloud service models for low-latency distributed data solutions.

For most applications, migrating from private data centers to public cloud services delivers numerous benefits, such as a lower capital outlay, agility, and on-demand scaling. However, geographic restrictions, prohibitive costs, and high latency associated with public cloud services pose significant challenges for real-time distributed data applications. 

This article will explain some of the difficulties facing low-latency, high-throughput applications with globally-distributed data sources and consumers and explore tools and techniques designed to solve these challenges.

Real-time Distributed Data Application Challenges
Challenge Public cloud solutions Low-latency solutions
Expensive Writes Centralized public cloud databases Append-only logs
Network Transit Latency Data traveling to a centralized public cloud data center Edge computing
Storage Complexity Polyglot architecture Multi-model databases
Replication Delay Synchronous replication Conflict-Free Replicated Data Types (CRDTs)

The Challenge of Distributed Data

Distributed data applications come with a set of problems smaller-scale applications do not. The sections below detail four of the biggest challenges facing apps handle data distributed across geographic regions. 

Expensive Writes

Public cloud databases like DynamoDB offer high availability, fully managed, and scalable services but come at a steep cost. The cost is measured in "Capacity Units", with Read Capacity Units (RCUs) being much cheaper than the Write Capacity Units (WCUs). 

WCUs are at least 16 times more expensive than RCUs, and can end up being a whopping 64 times more expensive! How much businesses actually pay depends on multiple factors including the consistency level chosen, the volume of data written per second, and the amount of WCUs used for data replication data in global tables. 

Network Latency

Network latency is less of an issue when distributed data and its consumers reside in the same or contiguous regions. It becomes a significant consideration if your data is located in one region and needs sub-100ms access from multiple regions.

For example, many NoSQL databases can only be provisioned by the public provider in a handful of regional data centers. This paradigm causes distributed applications to incur a high round-trip network transit latency to reach those data centers. 

A quick peek at cross-region ping statistics can demonstrate how distance increases network lag. Geographic distance can play a major role when your application uses a centralized public cloud. Perhaps you chose to centralize all your data in Washington. If you have users in Auckland consuming this data, the network latency of 250ms alone will render a real-time application unusable. 

Storage Complexity 

Developers may opt to use specialized storage technologies to serve individual components of their applications. Commonly referred to as “Polyglot”, this architecture offers developers “the right database for the right job”. 

However, mixing and matching multiple database technologies can come at a cost. Complexity, licensing, storage costs, and compliance risks increase over time. Polyglot systems don’t always scale well and are expensive to change. 

Replication Delay 

Suppose you need to build a globally accessible database. You would create the database in a primary region with replicas across the globe. 

Replicas rely on replication techniques to stay synchronized, which run into data consistency problems due to the same transit delays we discussed earlier. If a user in Washington and a user in Auckland send the same query to the database, they may get different results depending on recent data updates.

Globally-distributed data requires specialized technology to overcome network delays.

New approaches to processing distributed data

The challenges above demand a departure from the typical centralized cloud model. Fortunately, recent advances in the field of distributed data provide new tools and techniques to help alleviate some of these challenges.

Conflict Free Replicated Data Types (CRDTs) are being used successfully in distributed databases, online chat platforms, and even online gambling systems. Companies like Redis, SoundCloud, Atom, and Paypal use CRDTs to solve some of the toughest distributed data problems. 

Multi-model databases are slowly replacing the Polyglot architectures in low-latency applications. The black or white choice between consistency or performance is being replaced with tools offering tunable consistency levels. 

The sections below explore some of these essential distributed data processing features.

Append-only Logs

Sometimes referred to as “Write ahead logs”, append-only logs are at the core of modern databases. Their purpose is to log every change in data (in case there is a disaster and the database needs to be recovered). Any request for changes to data must first be logged in an append-only log before being honored by the database management system. 

Append-only logs can also be leveraged to reduce the cost of writes by generating in-memory materialized views in real time for changing data, transforming the changes into CRDT operations, and shipping them to the peers with causal ordering.

Conflict-Free Replicated Data Types (CRDTs)

One common disadvantage of contemporary software is the coordination overhead required for data updates. When several users update the same data simultaneously, each update must be coordinated across all nodes. This coordination slows down the system and sometimes causes data corruption. 

A Conflict-Free Replicated Data Type (CRDT) is a data structure similar to a linked list (and can also include maps or sets) that is replicated across multiple nodes. Applications using CRDTs can independently update any replica. Replica updates occur concurrently without coordination with the other replicas. 

Although individual replicas may have different states at any time, they are guaranteed to converge eventually. CRDTs contain an algorithm to automatically resolve any inconsistencies that might occur due to network faults and failures. Systems designed to use CRDTs can take advantage of decreased latency (no more stalled execution), increased throughput, and higher availability.

Multi-model databases

As the name suggests, a multi-model database supports multiple data models on a single backend. These databases act as a single access and storage point for any application. They incorporate features from different database models, eradicating data redundancy and eliminating the need for any associated ETL. The result is reduced computing, lower operational costs, and enhanced application scalability. 

Multi-model databases have many advantages. They eliminate the need to install, configure, and maintain different databases, reducing licensing and maintenance costs. They also provide the flexibility to use the best model for the job.

Tunable consistency levels 

When it comes to distributed data, consistency and performance fall at opposite ends of the spectrum. Until now, developers have had to sacrifice one for the other. Tunable consistency levels enable developers to choose the best balance of consistency and performance. No longer confined to an "all or nothing" choice between consistency and performance, developers can now choose from several options that fall along the spectrum, like causal, read-your-write, session, and monotonic read.

Processing distributed data with the new application requirements for speed and global span requires developers, site reliability engineers, and DevOps managers to understand many aspects of data storage and retrieval. 

We explain the fundamental concepts that every engineer involved in distributed data processing should know in the following chapters.

Chapter 1: Vertical scaling vs. horizontal scaling. As the user’s base grows, scaling the system becomes more and more pertinent. Explore the two types of scaling approaches you can use for your application.

Chapter 2: Advantages and disadvantages of NoSQL. Learn about pros and cons of NoSQL databases, differences between SQL and NoSQL, and the most common use cases in theory and with examples.

Chapter 3: High availability vs. fault tolerance. Learn about what is high availability and fault tolerance in modern, distributed applications. Provided examples will help you understand the difference between them.

Chapter 4: Sharding vs. partitioning. Learn the similarities and differences between sharding and partitioning, understand the use cases for each, and follow the best practices.

Chapter 5: Database Indexing and Partitioning: Tutorial & Examples. Learn about the use cases and best practices of database indexing and partitioning techniques by following explanations and examples.

Subscribe to our LinkedIn Newsletter to receive more educational content
Subscribe now
Subscribe to our Linkedin Newsletter to receive more educational content
Subscribe now