Why Global Data Network? Or Why We Built A Truly Coordination-Free Distributed Database For Edge Computing
Life happens in the HERE and the NOW. The HERE and the NOW are defined by where we are and by the interactions and events going on around us and with us. These events and interactions occur in the real world, not in some far-off data center a thousand miles away. Similarly, every interaction we have typically involves “state”, data about the interaction that combines with other state to create “context”, which is essential for creating and understanding meaning from all our interactions. And yet we store this state (data) thousands of miles away, in remote cloud data centers.
This is like interacting with someone who doesn’t remember anything from one sentence to the next (stateless), or someone who has to travel thousands of miles to respond to us each time (centralized state stores). Put this way, it certainly sounds silly, doesn’t it?
But this is what is presented as a normal way of doing things at the edge with stateless apps and centralized databases. There has to be a better way!
What we need are geo-distributed databases and streams that provide data at the edge, can scale across tens or hundreds of locations worldwide, and yet act in concert as a single coherent multi-master database or stream.
It means we need to design systems that work on the internet with its unpredictable network topology, maintain causality over lossy networks, and avoid centralized forms of consensus, while still arriving at a shared version of truth in real time. Can this be done? This is the question that piqued our interest and started a new journey at Macrometa.
There are many good distributed databases (SQL, NoSQL, …) and streaming systems out there. So why can’t these systems give us state locally at the edge?
Note: There are some NoSQL/NewSQL databases that can span 3, 4, or 5 regions, but they are not truly geo-distributed or edge-native databases.
In my opinion, an edge-native database is one that allows reads and writes locally, in all locations in parallel, and does not require the user to magically know which data should be placed in which location, or to redesign the schema every time they want to add or remove a location.
Coming back to the question: why can’t the current generation of databases and streaming systems handle multiple locations?
One problem with most current distributed databases and streaming systems is that they rely on consensus protocols like Zab (ZooKeeper), Raft, or Paxos as their foundation. These are certainly beautiful protocols, but they are meant for running inside data centers, where networks are ultra-reliable and low latency.
These protocols (and thereby the systems built on them) are not a good fit for WAN environments, where the network is lossy, jittery, and high latency. What we need are database and streaming systems that are coordination-free and offline-first.
Convergence & Adaptive Consistency...
Let’s say you go to a nice restaurant with your partner. You order wine from the bartender, and your partner orders dessert from the waiter. How would you feel if the restaurant served you nice folks only one of the two, the wine or the dessert, but not both?
It certainly does not make sense. But that is how most current-generation NoSQL databases work with their last-writer-wins approach. In other words, when two edge locations receive writes in parallel (i.e., writes that are not causally related), only one change is kept and the other is lost.
What makes more sense is for both of your orders to converge, so the restaurant serves everyone both wine and dessert. In other words, what we need are databases that can intelligently converge changes in real time at each location without coordination, and still arrive at the same definite version of truth.
If you’re interested, the CRDT research literature covers this in depth. CRDTs are objects that can be updated without expensive synchronization or consensus, and they are guaranteed to converge eventually, provided all concurrent updates commute and every update is eventually executed by each replica.
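To make convergence without coordination concrete, here is a minimal sketch (in Python, purely illustrative and not Macrometa’s implementation) of the simplest state-based CRDT, a grow-only counter. Because merge is an element-wise max, it is commutative, associative, and idempotent, so replicas reach the same state no matter when or in what order they exchange updates:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: merging in any order yields the same state.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two edge locations accept writes in parallel (not causally related)...
a, b = GCounter("us-east"), GCounter("eu-west")
a.increment(3)
b.increment(2)

# ...and converge to the same value regardless of merge order.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Note that unlike last-writer-wins, neither location’s writes are discarded: both increments survive the merge, the “wine and dessert” outcome from the restaurant analogy above.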
Similarly, some use cases, like the double-spend problem, demand strong consistency. But that does not mean every other use case should be penalized to support this subset. What we need are databases that support multiple levels of data consistency, so that developers can choose based on the use case.
Imagine you went to a coffee shop with your colleague. Your colleague orders one tall `Cafe Latte`. You order something like this:
- 6 tablespoons coffee/espresso
- 8 ounces water
- 13 ounces milk
- 1/4 cup coconut milk
- 4 tablespoons sugar
- 1/4 teaspoon vanilla extract - mix in after cycle is complete
- 2 tablespoons Frangelico
You can imagine the look the barista gives you. You might get a similar look from your colleague as well. In reality, both you and your colleague ordered the same thing, a `Cafe Latte`; what differs is the abstraction level of the order.
In recent years there has been significant progress on CRDTs and related technologies. But much of the research is about data types like counters, OR-Sets, etc. That probably makes sense from an academic perspective. But what developers need is the `Cafe Latte`, not the ingredients.
What is needed are higher-level abstractions that developers can work with to become productive quickly: databases with various consistency semantics, query languages, and so on.
Layered Multi-Model Database…
When you choose a database today, you’re not choosing one piece of technology. You’re choosing three: a storage engine, a data model, and an API/query language.
For example, if you choose Postgres, what you are actually choosing is the Postgres storage engine, a relational data model, and the SQL query language. Similarly, if you choose MongoDB, you are choosing the MongoDB distributed storage engine, a document data model, and the MongoDB API.
A fundamental problem with many current-generation databases, including the ones mentioned above, is that they are big monolithic systems with features interwoven across all of the layers. Monolithic databases continue to be useful pieces of technology, but they are evolutionary dead ends. They are the products of a smaller, less-connected era, and their shortcomings risk becoming liabilities for your business as it evolves.
Document databases, column-oriented, row-oriented, JSON, key-value, etc. all make sense in the right context, and often different parts of an application call for different choices. This creates a tough decision: Use a whole new database (and licenses, etc.) to support a new data model, or try to shoehorn data into your existing database.
What we need is a layered model that decouples its various layers: data storage, data organization, data manipulation, and the data model.
Document, graph, relational, and key-value are examples of data models a multi-model database may support. Such a layered design allows a single database system to
- Support multiple data models against a single, integrated backend,
- Employ a unified query language to retrieve correlated multi-model data (graph, JSON, key-value, etc.) in a single platform, and
- Support multi-model ACID 2.0 transactions.
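The layering idea can be sketched in a few lines of Python. This is a deliberately naive, hypothetical illustration (class names and key scheme are invented, not Macrometa’s design): one key-value storage layer, with a document model and a graph model implemented on top of the same backend:

```python
class KVStore:
    """Storage layer: an in-memory key-value map, standing in for a real
    storage engine (LSM tree, B-tree, etc.)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class DocumentModel:
    """Data-model layer: JSON-like documents keyed by collection and id."""

    def __init__(self, kv):
        self.kv = kv

    def insert(self, collection, doc_id, doc):
        self.kv.put(f"doc/{collection}/{doc_id}", doc)

    def find(self, collection, doc_id):
        return self.kv.get(f"doc/{collection}/{doc_id}")


class GraphModel:
    """Data-model layer: edges encoded as keys over the same backend."""

    def __init__(self, kv):
        self.kv = kv

    def add_edge(self, src, label, dst):
        self.kv.put(f"edge/{src}/{label}/{dst}", True)

    def neighbors(self, src, label):
        prefix = f"edge/{src}/{label}/"
        return [k[len(prefix):] for k in self.kv.data if k.startswith(prefix)]


kv = KVStore()  # the single, integrated backend
docs, graph = DocumentModel(kv), GraphModel(kv)
docs.insert("users", "alice", {"name": "Alice"})
graph.add_edge("alice", "follows", "bob")
assert docs.find("users", "alice") == {"name": "Alice"}
assert graph.neighbors("alice", "follows") == ["bob"]
```

The point of the sketch is the decoupling: both data models are thin mappings onto one storage layer, so adding a new model does not mean adopting a whole new database.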
Data privacy is a big issue, and rightfully so. GDPR is one big step in that direction, and I would imagine more regions and countries will follow this path in the future. Database and streaming systems, being holders of data, fall right into the middle of this.
Most databases and streaming systems currently do not address this. The databases that attempt to typically use techniques like geo-partitioning: breaking the dataset into multiple subsets (partitions) and keeping each subset in a different region.
While on the surface this looks right, the approach creates a few problems. For example, it requires customers to know in advance which subset of the dataset will be used in which region, and to partition the dataset so that each partition lands in that region.
Similarly, if a business is doing well and its customer base expands to other regions, this approach requires deep schema changes to support proper placement of data.
What would be better is a database or streaming system that can geofence data natively, with strong isolation, and that does not require schema changes when a business expands to more regions.
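To show the difference in spirit, here is a hypothetical sketch of native geofencing (all names are illustrative, not a real API): a per-collection residency policy decides, at replication time, which regions may hold a record, so expanding to a new region is a policy change rather than a schema change:

```python
# Regions the network currently runs in (illustrative).
ALL_REGIONS = {"us-east", "eu-west", "eu-central", "ap-south"}

# Residency policy per collection: a set of allowed regions,
# or None for "no fence, replicate everywhere".
RESIDENCY = {
    "eu_customers": {"eu-west", "eu-central"},  # e.g. GDPR: EU regions only
    "telemetry": None,
}


def replica_targets(collection, regions=ALL_REGIONS):
    """Return the regions allowed to store records of this collection."""
    fence = RESIDENCY.get(collection)
    return set(regions) if fence is None else set(regions) & fence


assert replica_targets("eu_customers") == {"eu-west", "eu-central"}
assert replica_targets("telemetry") == ALL_REGIONS

# The business expands: adding a region is a policy update, and the
# collection's schema is untouched.
ALL_REGIONS.add("eu-north")
RESIDENCY["eu_customers"].add("eu-north")
assert "eu-north" in replica_targets("eu_customers")
```

Contrast this with geo-partitioning, where the region boundary is baked into the schema (a partition column and per-region partitions), so the same expansion forces a schema migration.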
The above are some of the fundamental issues with current database and streaming systems. We cannot address these by doing more of the same. We need to build systems with architectures that address these problems natively. Is it possible?
Certainly yes! We have built the Global Data Network, which addresses the above and other issues natively. It is currently running in 25+ locations worldwide, with expansion beyond 100 scheduled by the end of the year.
Clearly, more databases and streaming systems will emerge to address these fundamental problems. The important point is that the technology is available today, provided by Macrometa’s Global Data Network.