Chapter 12:

Databricks vs Snowflake

March 15, 2022

Traditionally, corporations used data warehouses to store data of various types generated from various sources. Data warehouses are designed to support decision-making through intelligence extracted from the data. However, data needs have changed as data velocity, volume, and variety have increased. Several systems on the market address these needs, but two of them, Databricks and Snowflake, are competing fiercely to lead the industry.

Databricks is making waves for combining data warehouses and data lakes under a single platform, while Snowflake is revolutionizing the data warehouse with a software-as-a-service (SaaS) offering that requires near-zero maintenance and provides quick scalability. 

In this article, we will start by looking at the background of data warehouses and data lakes and then discuss the critical differences between Databricks and Snowflake.

Data Warehouses vs Data Lakes

A data warehouse stores historical data about a business to support analysis and the extraction of insights. It does not store current information, nor is it updated in real time. It is mainly used with relational databases, where data is stored in relations (tables) with a schema optimized for fast querying and analytics.

A data warehouse stores data in a structured format that is accessible via SQL queries. Structured data is easier to manipulate than unstructured or semi-structured data because the schema for all the data is known beforehand. However, storage and processing are centralized, and the use of purpose-built hardware makes a data warehouse expensive. Moreover, with advancements in infrastructure and better open source support for developing new tools, most legacy data warehouses are becoming outdated. They can neither scale far enough nor deliver the performance required to process data in near real time.
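To make the "schema known beforehand" point concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The `sales` table, its columns, and the sample rows are hypothetical and chosen only for illustration; a real warehouse would differ in scale and SQL dialect.

```python
import sqlite3

# In-memory database standing in for a warehouse; the table and
# column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "region TEXT NOT NULL, sale_date TEXT NOT NULL, amount REAL NOT NULL)"
)
rows = [
    ("EMEA", "2022-01-05", 120.0),
    ("EMEA", "2022-01-06", 80.0),
    ("APAC", "2022-01-05", 200.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def revenue_by_region(connection):
    """Return total revenue per region, ordered by region name."""
    cursor = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


totals = revenue_by_region(conn)
print(totals)
```

Because every row must match the declared columns before it is stored, the aggregate query can rely on the structure without inspecting individual records.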

Fig 1. Data warehouse architecture (source)

Unlike data warehouses, data lakes can hold unstructured data. Data lakes allow data to be kept in various formats on cloud object storage (S3, ADLS, or Google Cloud Storage) with a separate processing layer. They address the limitations of data warehouses. Storage and processing are decentralized, and instead of using big tables, data is split into smaller files and distributed across multiple nodes. Decoupled storage allows the data lake to scale independently of compute. It also runs on commodity hardware, so it costs less.
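The idea of splitting data into smaller files can be sketched as follows. This is an illustration only: a local temporary directory stands in for object storage, and the `key=value` directory layout, the file name `part-0000.jsonl`, and the helper names are assumptions rather than any vendor's format.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(records, root, partition_key):
    """Group records by partition_key and write each group to its own file."""
    groups = defaultdict(list)
    for record in records:
        groups[record[partition_key]].append(record)
    paths = []
    for value, group in sorted(groups.items()):
        directory = Path(root) / f"{partition_key}={value}"
        directory.mkdir(parents=True, exist_ok=True)
        path = directory / "part-0000.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for record in group:
                handle.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths


def read_partition(root, partition_key, value):
    """Read only the files for one partition value, skipping all others."""
    directory = Path(root) / f"{partition_key}={value}"
    records = []
    for path in sorted(directory.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle)
    return records


events = [
    {"day": "2022-03-14", "user": "a", "action": "login"},
    {"day": "2022-03-15", "user": "b", "action": "click"},
    {"day": "2022-03-15", "user": "a", "action": "logout"},
]
lake_root = tempfile.mkdtemp()  # local directory standing in for object storage
written = write_partitioned(events, lake_root, "day")
march_15 = read_partition(lake_root, "day", "2022-03-15")
print(len(written), march_15)
```

Since each partition lives in its own file, a processing engine can read only the files a query needs and can hand different files to different workers.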

Fig 2. Data lake architecture (source)

A data lake is cost-efficient, whereas a data warehouse is performance-efficient. A newer data architecture, the data lakehouse, aims to offer the best features of both. It reduces complexity by moving analytics from the data warehouse to the data lake. A data lakehouse is usually a combination of a federated query engine and storage that together provide data warehouse, data lake, and analytics features. Data lakehouses are still new and maturing, and they come with their own challenges.

There are various data warehouse and data lake tools and frameworks for data processing and analytics in the cloud ecosystem. However, two major rivals, Databricks and Snowflake, are gaining traction and competing head to head. Although they have a similar design and architecture and both support analytics, they aren’t quite the same. Snowflake replaces legacy data warehouses and supports ELT, while Databricks offers a data processing engine powered by Spark that is used alongside the data warehouse.

Databricks

Databricks is a cloud-based platform specializing in analyzing data at scale regardless of its location. It is a data and analytics platform that helps enterprises extract business intelligence from the data. It also provides a complete data science workspace with its machine learning runtime, Managed MLflow, and collaborative notebooks. 

Databricks is well known in the industry for its ability to process large amounts of data. In addition, it supports multiple programming languages, which makes it more powerful because you can use libraries from each of those language ecosystems. As a result, large enterprises use Databricks for production operations in different industries, including healthcare, fintech, entertainment, and many more. Databricks also offers Delta Lake, an open-source storage layer that enables building a lakehouse architecture.

Snowflake

Snowflake is a cloud-based data warehouse that provides all the data warehouse functions in a single tool, without requiring integration across separate systems. The Snowflake team took the legacy data warehouse concept and developed a modern and fully managed cloud data warehouse.

It’s easy to get started with, inexpensive, and quick to scale compared to a legacy data warehouse. Decoupled storage and compute make data sharing and scaling easier and keep overhead low. Snowflake abstracts cloud complexities and lets customers load, integrate, process, analyze, and share their data with little effort. Also, it is easier to migrate to Snowflake because it has been designed and developed from a data warehouse perspective.

Databricks and Snowflake Comparison Table

The following table summarizes the differences between Databricks and Snowflake. 

Feature | Databricks | Snowflake
Service Model | PaaS | SaaS
Major Cloud Platform Support | Azure, AWS, Google | Azure, AWS, Google
Migration to Platform | Complex because it is a data lake | Easy because it is designed based on a data warehouse
Scalability | Auto-scaling | Auto-scaling up to 128 nodes
Vendor Lock-in | No | Yes
User-Friendliness | Learning curve | Easy to adopt
Data Structures | All data types (raw, audio, video, logs, text, etc.) | Semi-structured or structured data
Services | Big data, data science, data analytics, and machine learning | Database management and data warehouse
Data Science and Machine Learning | Built-in and unified tool for any type of development | Only available via third-party integrations
Cost | Pay by usage | Pay by usage
Query Interface | SQL, Spark DataFrame, Koalas | SQL
Query Optimization | Vectorization and cost-based optimization | Vectorization and cost-based optimization
Provisioning of Different Types of Nodes | Yes | No
IPO | No | 2020
Valuation | $38 billion | $33 billion

Table 1. Comparison between Databricks and Snowflake

Architecture

The architecture of data lakes separates them from conventional data warehouses because of the decoupling of storage and computing. Databricks has a separate layer for storage and computation, which makes it more flexible to scale and leverage the different types of processing engines suited to each use case. 

Although Snowflake is a managed service whose architecture is hidden from end users, it also has separate storage and processing layers. Snowflake does not expose its node types, whereas Databricks gives you the freedom to choose the node type that fits your workload.

Data Ownership

Snowflake takes its inspiration from legacy warehouse architecture and modernizes it. Under the hood, its storage and processing layers are decoupled and can be scaled independently, but Snowflake still owns both layers.

In contrast, Databricks has fully decoupled storage and processing layers. It lets you store data anywhere in any format or shape. It focuses on the processing layer and offers freedom to choose the processing engine while seamlessly integrating third-party solutions.

Data Structure

As mentioned above, Snowflake also supports semi-structured data. Data can be loaded directly into Snowflake without going through an ETL (Extract, Transform, and Load) process.  

However, Databricks permits storing all types of data in any format and type since its storage layer is independent of the processing layer. Databricks can work as the ETL tool to add structure to the unstructured data. 
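The load-first approach described for Snowflake above (load raw data, then apply structure at query time) can be sketched with Python's built-in sqlite3 module. This is not Snowflake's API: the `raw_events` table, the sample documents, and the `json_get` helper registered as a SQL function are all hypothetical and exist only to illustrate the pattern.

```python
import json
import sqlite3


def json_get(document, key):
    """Return one top-level field from a JSON document, or None if absent."""
    return json.loads(document).get(key)


conn = sqlite3.connect(":memory:")
# Register the hypothetical helper so it can be called from SQL.
conn.create_function("json_get", 2, json_get)

# Load step: raw documents go in untouched, with no upfront schema.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
raw = [
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": 5, "coupon": "SPRING"}',
    '{"user": "a", "amount": 7}',
]
conn.executemany("INSERT INTO raw_events VALUES (?)", [(doc,) for doc in raw])

# Transform step: structure is applied later, at query time.
spend = conn.execute(
    "SELECT json_get(payload, 'user') AS user_id, "
    "SUM(json_get(payload, 'amount')) "
    "FROM raw_events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(spend)
```

Note that the second document carries an extra `coupon` field and still loads without any schema change, which is the practical benefit of deferring the transform step.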

Scalability

Both platforms leverage cloud computing to scale quickly without significant overhead. Databricks can scale as far as your infrastructure budget allows, but Snowflake is limited to 128 nodes. Also, Snowflake offers fixed-size warehouse options: the end user cannot resize individual nodes but can resize clusters with a single click. Additionally, Snowflake offers auto-scaling and auto-suspend, which start and stop clusters during busy or idle times. In contrast, although Databricks lets you provision different types of nodes at various levels of scale, doing so is more complicated than a single click. You need the necessary technical expertise to scale a Databricks cluster.
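The combination of auto-scaling, a hard upper limit, and auto-suspend amounts to a simple policy, sketched below. The function name, the `queries_per_cluster` parameter, and the cap of 10 are hypothetical values for illustration; neither platform's actual scaling logic is documented in this article.

```python
import math

MAX_CLUSTERS = 10  # hypothetical cap for illustration, not a vendor limit


def desired_clusters(queued_queries, queries_per_cluster, max_clusters=MAX_CLUSTERS):
    """Return how many clusters to run for the current queue; 0 means suspend."""
    if queued_queries <= 0:
        return 0  # auto-suspend: nothing to run, so stop paying for compute
    needed = math.ceil(queued_queries / queries_per_cluster)
    return min(needed, max_clusters)  # auto-scale, but never past the cap


for queue in (0, 1, 17, 1000):
    print(queue, desired_clusters(queue, queries_per_cluster=8))
```

A capped policy like this is easy to operate, while an uncapped one trades that simplicity for the ability to keep adding capacity as long as you are willing to pay for it.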

Performance

Snowflake is the best option if you need high-performance queries because its data is already structured to suit the business use case. Both Snowflake and Databricks implement cost-based optimization and vectorization. Additionally, Databricks offers hash aggregation to speed up aggregation queries.
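As an illustration of how hashing can speed up aggregation, the sketch below groups and sums rows in a single pass using a hash table (a Python dictionary). It shows the general technique only and is not the implementation used by either platform; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def hash_aggregate(rows, key_index, value_index):
    """Sum the value column per key in one pass, using a hash table."""
    totals = defaultdict(float)
    for row in rows:
        # Hash lookup locates the running total for this key directly,
        # so the input never has to be sorted first.
        totals[row[key_index]] += row[value_index]
    return dict(totals)


sales = [
    ("EMEA", 120.0),
    ("APAC", 200.0),
    ("EMEA", 80.0),
]
result = hash_aggregate(sales, key_index=0, value_index=1)
print(result)
```

The alternative, sorting rows by key and then summing adjacent runs, needs a full sort before any total can be produced.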

Snowflake can be slow on semi-structured data because it may need to load all the data into RAM and perform a complete scan. In contrast, Databricks lets you optimize data processing jobs to run high-performance queries.

Finally, Snowflake is batch-based and needs the entire dataset for results computation, while Databricks is a continuous data processing (streaming) system that also offers batch processing. 
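The difference between the two processing models can be sketched with a running average. The batch version needs the whole dataset before it can answer, while the streaming version emits an updated result after every record. Both function names and the sample readings are hypothetical.

```python
def batch_mean(values):
    """Compute the mean once the entire dataset is available."""
    values = list(values)
    return sum(values) / len(values)


def streaming_mean(values):
    """Yield an updated mean after each record arrives."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


readings = [10, 20, 60]
print(batch_mean(readings))
print(list(streaming_mean(readings)))
```

The final streaming value matches the batch result, but the streaming version keeps only a count and a running total in memory and has a usable answer at every step.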

It’s worth noting that there has been a blog war between Databricks and Snowflake over performance benchmarks published by Databricks, which claimed that it is 2.5x faster than Snowflake.

Machine Learning

Databricks has a robust machine learning ecosystem for developing different models. In addition, it supports development in various programming languages, which makes it easier to use their libraries and modules.

Snowflake doesn’t have any ML libraries, but it offers connectors to integrate various ML tools. In addition, it lets you access its storage layer or export query results, which can then be used for training and testing models.

You can develop machine learning and analytics using Snowflake, but it requires integration with other solutions and doesn’t offer such services out of the box. Instead, it provides drivers to integrate with other platform libraries or modules and access the data.
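The workflow of exporting query results and training a model elsewhere can be sketched as follows. Python's built-in sqlite3 module stands in for the warehouse connection, and a hand-written one-variable least-squares fit stands in for an external ML library. The `ad_results` table, its columns, and the sample values are hypothetical.

```python
import sqlite3

# Stand-in for a warehouse connection; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_results (spend REAL, signups REAL)")
conn.executemany(
    "INSERT INTO ad_results VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
)

# Step 1: export query results out of the warehouse.
rows = conn.execute(
    "SELECT spend, signups FROM ad_results ORDER BY spend"
).fetchall()


def fit_line(pairs):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    variance = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Step 2: train outside the warehouse on the exported rows.
slope, intercept = fit_line(rows)
print(slope, intercept)
```

The warehouse only serves the data here; everything model-related happens in the client process, which is the integration pattern the section describes.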

Use Cases

Snowflake is best suited for SQL-based business intelligence use cases due to its design and architecture. Conversely, Databricks allows SQL-based business intelligence and also offers support for a wider variety of use cases, such as recommendation engines or intrusion detection. In addition, both products support dashboards for reporting and analytics.

Databricks can scale up to meet the throughput demands of high-volume systems, but its query performance for analytics tends to be slower. On the other hand, Snowflake has limited support for continuous writes and concurrency but can beat Databricks on analytical query performance.

Databricks and Snowflake Differences

Snowflake is a fully managed service, so it is easy to deploy and scale up or down. Most of the operations are hidden from the end user, so there are few options for fine-tuning. On the other hand, Databricks needs a lot more administration and deployment effort, and it requires expertise to optimize the queries executing against the data lake engine.

Snowflake has a fixed pricing model for the managed compute and storage, but Databricks has an open-source option where you can use the storage from any cloud vendor of your choice. So, Snowflake locks you to a specific vendor, and you can only get the services from them. On the other hand, Databricks has the flexibility to integrate and use any service or third party.

Both platforms are dependent on cloud platforms because of their storage layers. However, they are threatened by cloud providers that offer similar platforms and have better integration within their ecosystems.

Conclusion

While Databricks and Snowflake have some things in common, one is geared toward data lakes while the other brings the data warehouse perspective. Snowflake is best suited for SQL-like business intelligence applications and provides optimal performance. On the other hand, Databricks offers Delta Lake and support for multiple programming languages, allowing you to develop your data science use cases with any tools and frameworks. Snowflake is also a little easier to use compared to Databricks, which has a steep learning curve. Due to their various trade-offs, many companies are using them together: Databricks for ETL and Snowflake for data warehouses. 
