
Databricks Vs Snowflake

Chapter 12 of Event Stream Processing


Traditionally, corporations used data warehouses to store data of various types generated from various sources. Data warehouses are designed to support decision-making through intelligence extracted from the data. However, data needs have changed as technology has evolved and data velocity, volume, and variety have increased. Several systems on the market address these needs; this article discusses two of them, Databricks and Snowflake.

Databricks aims to combine data warehouses and data lakes under a single platform, while Snowflake promotes a data warehouse delivered as software-as-a-service (SaaS) that promises less maintenance and easier scaling.

In this article, we will start by looking at the background of data warehouses and data lakes and then discuss the critical differences between Databricks and Snowflake. We'll also provide alternatives to these technologies.

Data Warehouses vs Data Lakes

A data warehouse stores historical data about a business to allow the analysis and extraction of insights. It does not store current information, nor is it updated in real time. It is mainly used with relational databases, where data is stored in relations (tables) with a schema optimized for fast querying and analytics.

A data warehouse stores data in a structured format that is accessible via SQL queries. Structured data is easier to manipulate than unstructured or semi-structured data because the schema for all the data is known beforehand. However, storage and processing are centralized, and the reliance on purpose-built hardware makes a data warehouse expensive.
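To make the "schema known beforehand" idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table name, columns, and sample rows are hypothetical and chosen only for illustration. A real warehouse differs in scale and storage layout, but the schema-on-write principle is the same: the table definition is fixed first, rows that violate it are rejected, and analysis is plain SQL.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")

# The schema is declared up front; every row must conform to it.
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

# Hypothetical sample data.
rows = [(1, "EMEA", 120.0), (2, "APAC", 80.0), (3, "EMEA", 50.5)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def revenue_by_region(connection):
    """Aggregate revenue per region with a plain SQL query."""
    cursor = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


def try_insert_invalid(connection):
    """Return True if the schema rejects a row with a missing amount."""
    try:
        connection.execute(
            "INSERT INTO sales VALUES (?, ?, ?)", (4, "AMER", None)
        )
    except sqlite3.IntegrityError:
        return True
    return False


print(revenue_by_region(conn))
print(try_insert_invalid(conn))
```

Because the schema is enforced at write time, queries never need to guess at the shape of the data, which is part of why structured storage is easier to query.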


Fig 1. Data warehouse architecture (source)

Unlike data warehouses, data lakes can hold unstructured data. Data lakes allow data to be held in various formats on cloud object storage (S3, ADLS, or Google Cloud Storage) with a separate processing layer, which addresses some of the limitations of data warehouses. Storage and processing are decentralized: instead of using big tables, data is split into smaller files and distributed across multiple nodes. Decoupled storage allows the data lake to scale independently. It also uses commodity hardware, so it costs less.
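The separation between raw files and a processing layer can be sketched in a few lines. In this illustrative example, a temporary local directory stands in for cloud object storage, and the file names and the (user, bytes) fields are assumptions made up for the sketch. The point is that files of different formats sit side by side, and a schema is applied only when the data is read.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_lake(root):
    """Write raw events as small files in two formats, as a lake might hold them."""
    events_dir = root / "events"
    events_dir.mkdir()
    json_events = [
        {"user": "a", "bytes": 100},
        {"user": "b", "bytes": 250, "agent": "mobile"},  # extra field is allowed
    ]
    (events_dir / "part-0.jsonl").write_text(
        "\n".join(json.dumps(event) for event in json_events)
    )
    with open(events_dir / "part-1.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "bytes"])
        writer.writerow(["a", "40"])


def read_events(root):
    """Processing layer: apply the (user, bytes) schema only at read time."""
    for path in sorted((root / "events").iterdir()):
        if path.suffix == ".jsonl":
            records = [json.loads(line) for line in path.read_text().splitlines()]
        elif path.suffix == ".csv":
            with open(path, newline="") as f:
                records = list(csv.DictReader(f))
        else:
            continue  # ignore formats this sketch does not understand
        for record in records:
            yield record["user"], int(record["bytes"])


def total_bytes_per_user(root):
    """Aggregate across all files regardless of their original format."""
    totals = {}
    for user, size in read_events(root):
        totals[user] = totals.get(user, 0) + size
    return totals


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    write_lake(lake)
    result = total_bytes_per_user(lake)

print(result)
```

Storage here knows nothing about the schema, so new file formats or extra fields can be added without changing what is already stored. The cost is that every reader has to interpret the raw data itself.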


Fig 2. Data lake architecture (source)

A data lake is cost-efficient whereas a data warehouse is performance-efficient. A new data architecture, the data lakehouse, offers features of both the data warehouse and a data lake. A data lakehouse is usually a combination of a federated query engine and storage that offers a data warehouse, data lake, and analytics features. Data lakehouses are still new, haven’t matured, and have challenges.

There are various data warehouse and data lake tools and frameworks for data processing and analytics in the cloud ecosystem. Although Databricks and Snowflake have similar designs and architectures and both support analytics, they aren’t quite the same. Snowflake replaces legacy data warehouses and supports ELT, while Databricks offers a data processing engine powered by Spark that is used alongside a data warehouse.


Databricks

Databricks is a cloud-based platform specializing in analyzing data at scale regardless of its location. It is a data and analytics platform that helps enterprises extract business intelligence from the data. It also provides a complete data science workspace with its machine learning runtime, Managed MLflow, and collaborative notebooks.

Databricks is well known for its ability to process a large amount of data. In addition, it has support for multiple languages, which makes it more potent because you can integrate libraries from any programming language ecosystem. As a result, large enterprises often use Databricks for production operations in industries like healthcare, fintech, entertainment, and others.

Snowflake

Snowflake is a cloud-based data warehouse that provides all the data warehouse functions in a single tool, without requiring the integration of separate systems. It’s relatively easy to get started, fairly cost effective, and quick to scale compared to a legacy data warehouse. Decoupled storage and computing enable data sharing and scaling, and Snowflake abstracts cloud complexities and lets customers load, integrate, process, analyze, and share their data.

Databricks and Snowflake Comparison Table

The following table summarizes the high-level differences between Databricks and Snowflake.

Feature | Databricks | Snowflake
Service Model | PaaS | SaaS
Major Cloud Platform Support | Azure, AWS, Google | Azure, AWS, Google
Migration to Platform | Complex because it is a data lake | Easy because it is designed based on a data warehouse
Scalability | Auto-scaling | Auto-scaling up to 128 nodes
Vendor Lock-in | No | Yes
User-Friendliness | Learning curve | Easy to adopt
Data Structures | All data types (raw, audio, video, logs, text, etc.) | Semi-structured or structured data
Services | Big data, data science, data analytics, and machine learning | Database management and data warehouse
Data Science and Machine Learning | Built-in and unified tool for any type of development | Only available via third-party integrations
Cost | Pay by usage | Pay by usage
Query Interface | SQL, Spark DataFrame, Koalas | SQL
Query Optimization | Vectorization and cost-based optimization | Vectorization and cost-based optimization
Provisioning of Different Types of Nodes | Yes | No
IPO | No | 2020
Valuation | $38 billion | $33 billion

Table 1. Comparison between Databricks and Snowflake

Databricks and Snowflake Architectures

What separates the architecture of data lakes from that of conventional data warehouses is the decoupling of storage and computing. Databricks has separate layers for storage and computation.

Although Snowflake is a managed service whose architecture is hidden from end users, it also has separate storage and processing layers.


Data Ownership

Snowflake is inspired by legacy warehouse architecture but modernized. Under the hood, its storage and processing layers are decoupled and can be scaled independently, but Snowflake still owns both layers.

In contrast, Databricks has fully decoupled storage and processing layers. It lets you store data anywhere in any format or shape. It focuses on the processing layer and offers freedom to choose the processing engine while integrating third-party solutions.

Data Structure

As mentioned above, Snowflake also supports semi-structured data. Data can be loaded directly into Snowflake without going through an ETL process.

Databricks, however, permits storing data of any type and format because its storage layer is independent of the processing layer. Databricks can also work as the ETL tool that adds structure to unstructured data.
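As a rough illustration of what "adding structure to unstructured data" means in an ETL step, the sketch below parses raw log lines into records with named fields and sets aside lines that do not fit. The log format, field names, and sample lines are hypothetical. On Databricks this kind of step would typically be written as a Spark job, but the logic is shown here in plain Python to keep it self-contained.

```python
import re
from typing import Optional

# Hypothetical log format: "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one raw line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def etl(raw_lines):
    """Extract and transform raw lines; return (structured rows, rejected lines)."""
    rows, rejected = [], []
    for line in raw_lines:
        parsed = parse_line(line)
        if parsed is None:
            rejected.append(line)  # keep bad input for later inspection
        else:
            rows.append(parsed)
    return rows, rejected


raw = [
    "2023-01-05 10:15:00 INFO user login ok",
    "garbage line without structure",
    "2023-01-05 10:16:30 ERROR payment failed",
]
rows, rejected = etl(raw)

for row in rows:
    print(row)
print("rejected:", rejected)
```

Once the rows have a fixed set of fields, they can be loaded into a warehouse table or queried with SQL like any other structured data.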

Scalability

Both platforms leverage cloud computing to scale without significant overhead. Databricks can scale as far as you are willing to invest in infrastructure, whereas Snowflake is limited to 128 nodes. Snowflake offers fixed-size warehouse options: the end user cannot resize nodes but can resize clusters. Snowflake also offers auto-scaling and auto-suspend, which start and stop clusters during busy or idle times. In contrast, although Databricks lets you provision different types of nodes at various scale levels, scaling a Databricks cluster is more complicated and requires technical expertise.
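The interplay of auto-scaling and auto-suspend can be pictured with a toy model. The sketch below is not Snowflake's or Databricks' actual scaling algorithm; the class name, the capacity per cluster, the cluster limit, and the idle threshold are all assumptions invented for illustration. It only shows the general shape of the policy: add clusters as load grows up to a fixed cap, and shut everything down after a period of inactivity.

```python
from dataclasses import dataclass


@dataclass
class ToyWarehouse:
    """Toy model of auto-scaling with auto-suspend; not any vendor's algorithm."""

    max_clusters: int = 4          # hypothetical upper bound on clusters
    queries_per_cluster: int = 8   # hypothetical capacity of one cluster
    suspend_after_idle: int = 2    # idle ticks before suspending
    clusters: int = 0              # 0 means the warehouse is suspended
    idle_ticks: int = 0

    def tick(self, running_queries: int) -> int:
        """Adjust the cluster count for the current load and return it."""
        if running_queries == 0:
            self.idle_ticks += 1
            if self.idle_ticks >= self.suspend_after_idle:
                self.clusters = 0  # auto-suspend after sustained idleness
            return self.clusters
        self.idle_ticks = 0
        # Ceiling division: number of clusters needed to serve the load.
        needed = -(-running_queries // self.queries_per_cluster)
        self.clusters = min(self.max_clusters, max(1, needed))
        return self.clusters


wh = ToyWarehouse()
loads = [0, 5, 20, 100, 0, 0, 3]
history = [wh.tick(load) for load in loads]
print(history)
```

In this model the load of 100 queries is capped at the cluster limit, which mirrors the idea of a fixed upper bound on scaling, and the warehouse resumes automatically when a new query arrives after a suspend.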


Performance

Snowflake is suited for high-performance queries because it already has structured data suitable to the business use case. Both Snowflake and Databricks implement cost-based optimizations and vectorization. Additionally, Databricks offers hash integrations to accelerate the aggregation of queries.

Snowflake has slow performance on semi-structured data as it may need to load all the data into RAM and perform a complete scan. In contrast, Databricks lets you optimize data processing jobs to run high-performance queries.

Finally, Snowflake is batch-based and needs the entire dataset for results computation, while Databricks is a continuous data processing (streaming) system that also offers batch processing.
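The difference between the two processing models can be sketched with a simple average. This is a conceptual illustration only, with made-up names and data, and it does not reflect how either product is implemented: the batch function needs the whole dataset before it can answer, while the streaming version keeps a small running state and produces an updated result after every event.

```python
def batch_average(dataset):
    """Batch: needs the complete dataset before producing a result."""
    return sum(dataset) / len(dataset)


class StreamingAverage:
    """Streaming: updates the result incrementally as each event arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one new event into the running state and return the average."""
        self.count += 1
        self.total += value
        return self.total / self.count


events = [10, 20, 30, 40]

# Streaming produces an intermediate answer after every event.
stream = StreamingAverage()
running = [stream.update(value) for value in events]
print(running)

# Batch produces one answer once all the data is available.
print(batch_average(events))
```

Both approaches agree once all events have been seen; the streaming version simply makes partial results available earlier.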

It’s worth noting that there has been a blog war between Databricks and Snowflake over performance metrics published by Databricks, which claim that Databricks is 2.5x faster than Snowflake.

Machine Learning

Databricks has a machine learning ecosystem to develop different models. In addition, it supports development in various programming languages. 

Snowflake has no ML libraries, but it offers connectors to integrate various ML tools. In addition, it gives access to its storage layer and lets you export query results, which can be used for training and testing models.

You can develop machine learning and analytics using Snowflake, but it requires integration with other solutions and doesn’t offer such services out of the box. Instead, it provides drivers to integrate with other platform libraries or modules and access the data.

Databricks and Snowflake Use Cases

Snowflake is suited for SQL-based business intelligence use cases due to its design and architecture. Conversely, Databricks allows SQL-based business intelligence and also offers support for a wider variety of use cases, such as recommendation engines or intrusion detection. Both products support dashboards for reporting and analytics.

Databricks can scale up to meet the high throughput demands of any high-volume system, but its query performance for analytics will be slow. On the other hand, Snowflake has limited support for continuous writes and concurrency but can beat the performance of Databricks.

Databricks and Snowflake Differences

Snowflake is a fully managed service, so deploying and scaling up or down is easier. Most of the operations are hidden from the end-user, so there are few options for fine-tuning. On the other hand, Databricks needs much more administration and deployment; it requires expertise to optimize the queries executing against the data lake engine.

Both platforms are dependent on cloud platforms because of their storage layers. However, they are threatened by cloud providers that offer similar platforms and have better integration within their ecosystems.

Conclusion

While Databricks and Snowflake have some things in common, one is geared toward data lakes while the other brings the data warehouse perspective. Snowflake is best suited for SQL-like business intelligence applications and provides better performance. On the other hand, Databricks offers support for multiple programming languages. Snowflake is also a little easier for developers to use compared to Databricks, which has a steep learning curve. Because of these trade-offs, many companies use them together: Databricks for ETL and Snowflake for the data warehouse.

It is crucial to have a deep comprehension of the technology landscape, whether you are creating applications or assessing industry solutions offered by Macrometa. By attaining a comprehensive understanding of these technological choices, you can make informed decisions and effectively utilize the appropriate solution to address your specific requirements for SaaS, data lakes or data warehouses.
