Database Indexing and Partitioning: Tutorial & Examples
Chapter 5 of Distributed Data
Today’s applications need fast data access. The proliferation of sensors and the explosion in the number of devices online require a lot of data to be processed quickly. Databases form the backbone of this modern digital world and must be designed with efficiency and ease of access in mind. This article will explore how Indexing and Partitioning can enable efficient data storage and retrieval, how each strategy can apply to different scenarios, best practices and anti-patterns for using them, and how the two methods can complement each other.
In a nutshell, a database index is an internal data structure that a database maintains to look up records quickly. It improves performance by narrowing the search space and reducing the amount of data a query has to read. Indices are generally used to improve SELECT query performance, but they add work to INSERT, UPDATE, and DELETE operations, so they should be used sparingly on tables with frequently changing data.
Partitioning refers to splitting the data in a database into smaller pieces so that it can be accessed more efficiently. Partitioning makes it easier to add new data. It can also speed up queries by reducing the amount of data they have to scan. Generally used to separate archive records from the more recent data in the same table, partitions add some management overhead and should be avoided on small tables. Partitioning can also happen at the database level. For example, if a database is getting huge because of old data (which most queries do not need), we can create a copy of the database and move old records to the new database. This results in two databases: one is used to access old information, and the other houses only recent data. Queries accessing current data become faster because they have less data to read.
Before we dive into a real-world scenario for using Indices and Partitions, here’s a summary of the two strategies:
Summary of key concepts
Let’s consider the example of a hypothetical database used by a high-end connected vehicle to highlight the characteristics, commonalities, and differences between indexing and partitioning. To improve the driving experience, a connected vehicle uses its myriad sensors to collect information constantly. What’s the temperature outside? Where is the driver headed? What’s on their schedule? What are the traffic conditions, the current speed, and the current location? What has happened on the same day and time of the week for the last few weeks? Is this an ice-cream kind of day or a hot-coffee one? All this information helps the car predict what the driver might need next and react accordingly. The speed of access to the database can determine whether a given feature is even possible. Split-second decisions can only be made when data is accessible in the blink of an eye. This is where Indices and Partitions come in; let’s take a closer look at each.
Indexing basics & Best Practices
Database indices play a key role in speeding up data retrieval because they minimize the volume of data accessed and so reduce retrieval time. There are two basic types of indices, Primary and Secondary. A Primary Index is generally set on a column with only unique values. In SQL Server, a primary key is backed by a Clustered Index by default, which is why the two terms are often used interchangeably, although a clustered index does not have to be the primary key. A Secondary Index, on the other hand, can be created on columns with repeating values (duplicate data). To understand how indices work, let’s consider a table a connected vehicle uses to log its location every second. The vehicle inserts a new record into a table called “Location”. There are no deletes or updates to this table, just inserts.
CREATE TABLE [dbo].[Location]( [location_id] [bigint] NOT NULL, [latitude] [int] NOT NULL, [longitude] [int] NOT NULL, [log_time] [datetime] NOT NULL ) ON [PRIMARY]
Notice it’s a simple table with no indices defined. Records are not ordered in the table by any field, meaning reading this table would be like reading a book without page numbers. There will be no order to the topics either. To find any information in this table, we will have to scan every page looking for the information we need. It’s the worst way to read a book (or table) because it takes up valuable time and resources.
This is where indices come to the rescue. Just as a book needs ordered chapters and page numbers before we can find anything in it quickly, a table needs an order. An index that adds order to a table is called a clustered index. It sorts the data in the table and can be defined on one column (as shown below) or on a set of columns that together form a unique value (in which case the key is called a Composite Key).
ALTER TABLE dbo.Location ADD CONSTRAINT PK_Location PRIMARY KEY CLUSTERED ( location_id ) ON [PRIMARY]
Here, we are ordering all records in the table by the location_id column. But how exactly does this index live inside the RDBMS? Indices are implemented using a variety of data structures, such as B-trees and hash tables, with the B-tree being the most common. In this implementation, an index is a tree of pages (the basic internal unit for organizing data in database files). The pages are organized in levels: a single “Root” page at the top, one or more “Intermediate” levels beneath it, and the “Leaf” level at the bottom. Root and intermediate pages hold key values that point to pages on the next level down; for a clustered index, the leaf pages hold the table rows themselves. The structure of a B-Tree index provides multiple advantages. It supports fast random searches (finding one specific record among many) as well as sequential access (finding records with IDs within a certain range), and it keeps insertions and deletions cheap by leaving pages partially full.
B-Tree structure of a clustered index
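To make the page structure concrete, here is a minimal Python sketch of a two-level index. It is a simplification under stated assumptions, not how any particular RDBMS stores its pages: there is no intermediate level, `PAGE_SIZE` is an arbitrarily small number, and the names `build_index` and `seek` are made up for this illustration.

```python
from bisect import bisect_right

PAGE_SIZE = 4  # rows per leaf page; real pages hold far more (illustrative value)


def build_index(keys):
    """Return (root, leaves): leaves are sorted pages, root holds each page's first key."""
    ordered = sorted(keys)
    leaves = [ordered[i:i + PAGE_SIZE] for i in range(0, len(ordered), PAGE_SIZE)]
    root = [page[0] for page in leaves]
    return root, leaves


def seek(root, leaves, key):
    """Find a key by reading the root page and then exactly one leaf page."""
    page_no = bisect_right(root, key) - 1  # last page whose first key is <= key
    if page_no < 0:
        return None  # key is smaller than everything in the index
    page = leaves[page_no]
    return (page_no, page.index(key)) if key in page else None


root, leaves = build_index(range(1, 41))  # location_id 1..40 spread over 10 leaf pages
print(seek(root, leaves, 23))  # (5, 2): leaf page 5, slot 2
```

The lookup touches two pages (the root and one leaf) regardless of which key is requested, whereas a table without an index would have to read leaf pages one by one until the key turned up.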
But how has our clustered index helped? Let’s consider a range type query where we are looking for location_id between 5 million and 5.2 million. Having an index on the column allows the RDBMS to quickly find the location of the five millionth record, read the consecutive rows to 5.2 million, and then stop without needing to scan the rest of the table. Imagine the time we saved in a table with potentially billions of rows!
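The range lookup described above can be tried on a small scale with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for the database engine, its `INTEGER PRIMARY KEY` plays the role of the clustered primary key, the table holds 10,000 synthetic rows instead of billions, and the range is scaled down to 5,000 through 5,200.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite, an INTEGER PRIMARY KEY is the key of the table's own B-tree, which
# makes it a rough stand-in for the clustered primary key on location_id.
conn.execute(
    "CREATE TABLE location ("
    " location_id INTEGER PRIMARY KEY,"
    " latitude INTEGER NOT NULL,"
    " longitude INTEGER NOT NULL,"
    " log_time TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO location VALUES (?, 0, 0, '2022-05-30 00:00:00')",
    ((i,) for i in range(1, 10_001)),
)

range_query = "SELECT location_id FROM location WHERE location_id BETWEEN 5000 AND 5200"
# The fourth column of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + range_query))
rows = conn.execute(range_query).fetchall()

print(plan)       # a SEARCH on the primary key rather than a full SCAN
print(len(rows))  # 201 ids, 5000 through 5200 inclusive
```

The exact wording of the plan differs between SQLite versions, but it should report a search bounded on both ends of the key range rather than a scan of the whole table.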
Indices don’t always need to order the table’s data. They can be “Non-Clustered”, in which case they are stored separately from the table and can be used to search for data in any column. For instance, what if we needed to search for location_ids in the table for a specific log_time value?
SELECT location_id FROM [Location] WHERE log_time = '2022-05-30 00:16:35.993'
If we look at the execution plan for this query, we can see that the RDBMS decides to do a scan operation to find the data. Because nothing is ordered by log_time, it has to read every row of the clustered index, which is slow and expensive:
Query execution plan - “Clustered” scan
We can create another index to allow queries to search by log_time like so:
CREATE NONCLUSTERED INDEX [IX_location_log_time] ON [dbo].[Location] ( [log_time] ASC )
This creates a new index that the query optimizer can use for queries whose WHERE conditions filter on the log_time column. If we rerun our query, we will see the data is now read with a “Seek” operation on the index, which is fast and efficient:
Query execution plan using “Non-Clustered” Seek
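The same before-and-after comparison can be reproduced with Python's built-in `sqlite3` module. This is only an approximation of the plans shown above: SQLite reports SCAN and SEARCH steps rather than SQL Server's Scan and Seek operators, the rows are synthetic (one per second from midnight), and the timestamp is truncated to whole seconds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE location ("
    " location_id INTEGER PRIMARY KEY,"
    " latitude INTEGER NOT NULL,"
    " longitude INTEGER NOT NULL,"
    " log_time TEXT NOT NULL)"
)
# Synthetic data: row i is logged i seconds after midnight.
conn.executemany(
    "INSERT INTO location VALUES (?, 0, 0, datetime('2022-05-30 00:00:00', ?))",
    ((i, f"+{i} seconds") for i in range(1, 5_001)),
)

query = "SELECT location_id FROM location WHERE log_time = '2022-05-30 00:16:35'"


def plan_for(sql):
    """Join the readable steps of SQLite's query plan into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan_for(query)  # no index on log_time yet: full table scan
conn.execute("CREATE INDEX ix_location_log_time ON location (log_time ASC)")
plan_after = plan_for(query)   # now a search through ix_location_log_time
match = conn.execute(query).fetchall()

print(plan_before)
print(plan_after)
print(match)  # [(995,)] because 00:16:35 is 995 seconds after midnight
```

The query text never changes between the two runs. Only the available index does, and the planner picks it up on its own.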
Great! So if indices are so helpful and speed up our queries, why not add them for every column? It’s because they come at a price.
Every index takes time to create and maintain. When creating an index, we are basically duplicating data in the specific column(s) on which the index gets defined. We are creating our B-Tree structure, and adding new pages to the database. This takes up space, increases the database size, and most importantly, these pages get added to, re-arranged, or deleted when we INSERT, UPDATE or DELETE data from the table. This is why Indices can slow down these operations and must be created with care on tables with frequently changing data.
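The storage side of that price is easy to observe. The sketch below, again using SQLite through Python's `sqlite3` module as a stand-in engine, counts the pages in a database file before and after creating an index on `log_time`. The row count and the helper name `database_pages` are arbitrary choices for illustration, and the actual page numbers depend on the SQLite build and page size.

```python
import os
import sqlite3
import tempfile


def database_pages(conn):
    """Number of pages currently allocated in the database file."""
    return conn.execute("PRAGMA page_count").fetchone()[0]


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "location.db"))
    conn.execute(
        "CREATE TABLE location (location_id INTEGER PRIMARY KEY, log_time TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO location VALUES (?, datetime('2022-05-30 00:00:00', ?))",
        ((i, f"+{i} seconds") for i in range(1, 20_001)),
    )
    conn.commit()
    pages_before = database_pages(conn)

    # The index stores its own copy of every log_time value plus a row reference.
    conn.execute("CREATE INDEX ix_location_log_time ON location (log_time)")
    conn.commit()
    pages_after = database_pages(conn)
    conn.close()

print(pages_before, pages_after)  # the second number is larger
```

Every one of those extra pages also has to be kept up to date when rows are inserted, changed, or removed, which is where the write-time cost comes from.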
OK, so which columns are best suited for indices? Ideally, we should identify the columns in large tables that are frequently used in JOIN conditions and index those columns, which makes queries using those joins more efficient. We should also identify columns used frequently in GROUP BY clauses and index them, which makes aggregation queries run faster. Any time we need to sort data, locate rows by column values, or correlate data across tables, indices can help with query performance.
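As a rough illustration of the GROUP BY case, the following sketch uses SQLite through Python's `sqlite3` module. The `trip_id` column is hypothetical and not part of the Location table defined earlier; it is added here only to have something to aggregate on. Other engines plan aggregations differently, so treat this as one example rather than a general rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# trip_id is a hypothetical column introduced only for this illustration.
conn.execute(
    "CREATE TABLE location (location_id INTEGER PRIMARY KEY, trip_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO location VALUES (?, ?)",
    ((i, i % 50) for i in range(1, 5_001)),
)

query = "SELECT trip_id, COUNT(*) FROM location GROUP BY trip_id"


def plan_for(sql):
    """Join the readable steps of SQLite's query plan into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan_for(query)  # rows must first be sorted in a temporary B-tree
conn.execute("CREATE INDEX ix_location_trip_id ON location (trip_id)")
plan_after = plan_for(query)   # the index already delivers rows grouped by trip_id
counts = dict(conn.execute(query).fetchall())

print(plan_before)
print(plan_after)
print(len(counts), counts[0])  # 50 groups of 100 rows each
```

Without the index, the engine has to build a temporary sorted structure on every run; with it, rows arrive already ordered by the grouping column, so that step disappears from the plan.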
Partitioning basics & Best Practices
Partitioning is the act of splitting large datasets into smaller units to improve query efficiency, secure data, and reduce disk contention, among other things. Database partitioning and table partitioning are two different ways to manage data in a database. Both methods split a large body of data into smaller, more manageable databases or tables, but they differ in how they accomplish this. Which method you use depends on the specific needs of your application and the architecture of the database environment in which it runs. Support also varies by engine and version: PostgreSQL, MySQL, and SQL Server, for example, all offer some form of built-in table partitioning, so check your database's documentation for the methods it supports. A key benefit of partitioning is that it enables parallelism, which is the ability to perform multiple tasks or operations at the same time. This improves performance in high-throughput applications that store large amounts of data, such as OLTP and large data analytics systems.
Database partitioning is a method for dividing a database into separate sections called partitions. Each partition contains a subset of the data and functions as a separate database in its own right. This separation enables you to scale your application across multiple servers without affecting your application’s functionality. It also improves the performance of queries that touch only one partition, because each database is smaller than the original; queries that need data from several partitions have to combine results across them.
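A minimal sketch of database-level partitioning follows, using two in-memory SQLite databases through Python's `sqlite3` module to play the roles of the "current" and "archive" databases. The cutoff date, the sample rows, and the helper `make_db` are all invented for illustration; a production migration would also need to handle failures between the copy and the delete.

```python
import sqlite3

CUTOFF = "2022-05-01 00:00:00"  # hypothetical boundary between archive and current data


def make_db():
    """Create an empty database with a simplified Location table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE location (location_id INTEGER PRIMARY KEY, log_time TEXT NOT NULL)"
    )
    return db


current_db, archive_db = make_db(), make_db()
current_db.executemany(
    "INSERT INTO location VALUES (?, ?)",
    [(1, "2021-11-02 08:15:00"), (2, "2022-04-30 23:59:59"),
     (3, "2022-05-01 00:00:00"), (4, "2022-05-30 00:16:35")],
)

# Copy rows older than the cutoff into the archive database, then remove them
# from the current one. ISO-style timestamps compare correctly as text.
old_rows = current_db.execute(
    "SELECT location_id, log_time FROM location WHERE log_time < ?", (CUTOFF,)
).fetchall()
archive_db.executemany("INSERT INTO location VALUES (?, ?)", old_rows)
archive_db.commit()
current_db.execute("DELETE FROM location WHERE log_time < ?", (CUTOFF,))
current_db.commit()

id_query = "SELECT location_id FROM location ORDER BY location_id"
current_ids = [r[0] for r in current_db.execute(id_query)]
archive_ids = [r[0] for r in archive_db.execute(id_query)]
print(current_ids, archive_ids)  # [3, 4] [1, 2]
```

After the move, queries for recent data run against a database holding two rows instead of four; on a real system the difference would be between a month of data and many years of it.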
Table partitioning is the process of splitting a single table into multiple smaller pieces. The split can be vertical (so each piece has fewer columns), horizontal (so each piece has fewer rows), or functional (rows are mapped to one partition or another depending on their values). Partitioning is most useful when dealing with huge tables with billions of rows of data. Going back to our example of the Location table, imagine the connected vehicle has been writing to this table for many years. There are billions of rows in the table now, and queries accessing it have gotten slower over time, even with indexing in place. We know the queries are slow because of the volume of data, so how can we reduce this volume without losing the historical data? This is where table partitioning can help!
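The difference between a horizontal and a vertical split can be shown with plain Python data structures. The two sample rows and the choice to split by year are assumptions made for this sketch; the column names follow the Location table defined earlier.

```python
rows = [
    {"location_id": 1, "latitude": 47, "longitude": -122, "log_time": "2021-11-02 08:15:00"},
    {"location_id": 2, "latitude": 48, "longitude": -121, "log_time": "2022-05-30 00:16:35"},
]

# Horizontal: every partition keeps all columns but holds fewer rows
# (split here by the year of log_time, an arbitrary choice).
horizontal = {}
for row in rows:
    horizontal.setdefault(row["log_time"][:4], []).append(row)

# Vertical: every partition keeps all rows but holds fewer columns. The key
# column is repeated in each piece so the pieces can be joined back together.
coordinates = [{k: r[k] for k in ("location_id", "latitude", "longitude")} for r in rows]
timestamps = [{k: r[k] for k in ("location_id", "log_time")} for r in rows]

print(sorted(horizontal))  # ['2021', '2022']
print(coordinates[0])      # {'location_id': 1, 'latitude': 47, 'longitude': -122}
print(timestamps[0])       # {'location_id': 1, 'log_time': '2021-11-02 08:15:00'}
```

A horizontal split helps queries that need only some of the rows, while a vertical split helps queries that need only some of the columns.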
Suppose the Location table we discussed earlier has been in use for 10+ years and has billions of rows of data. Also, suppose there are two types of queries executing against this data:
- Current queries: These queries are being used by the software installed in the connected vehicle to make quick decisions and only need to access Location data for the current month.
- Historical queries: These queries are used for reporting purposes. They access data older than the current month and don’t have to be lightning fast.
Armed with the insight into the kinds of queries that run against our table, we can now split our table into two partitions, a “Current” and a “Historical” partition. The non-clustered indices on the original table will now exist on both Current and Historical partitions.
The immediate performance benefit would be to the Current queries, which will only need to access the Current partition with just a month’s worth of data. Historical queries will only run against the Historical partition.
Another benefit of this sort of partitioning shows up as new data gets added to the Current partition. INSERTs will run faster because the non-clustered index on the Current partition will be much smaller than the index on the original table, and the same goes for UPDATE and DELETE statements. Keeping the Current partition small therefore speeds up both the reads and the writes that the vehicle depends on.
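The Current and Historical split can be sketched as a toy in-memory structure. The class name `PartitionedLocation`, its methods, and the sample rows are all hypothetical; a real RDBMS does this routing internally based on the partitioning column, and this sketch only shows why a query for recent data never needs to look at historical rows.

```python
from datetime import datetime


class PartitionedLocation:
    """Toy two-partition table keyed on log_time (illustrative only)."""

    def __init__(self, current_month_start):
        self.boundary = current_month_start
        self.partitions = {"current": [], "historical": []}

    def partition_for(self, log_time):
        """Pick the partition a row belongs to from its log_time."""
        return "current" if log_time >= self.boundary else "historical"

    def insert(self, location_id, log_time):
        self.partitions[self.partition_for(log_time)].append((location_id, log_time))

    def query_since(self, start):
        """Scan only the partitions that can contain rows at or after start."""
        names = ["current"] if start >= self.boundary else ["historical", "current"]
        scanned = sum(len(self.partitions[n]) for n in names)
        hits = [r for n in names for r in self.partitions[n] if r[1] >= start]
        return hits, scanned


table = PartitionedLocation(datetime(2022, 5, 1))
table.insert(1, datetime(2019, 3, 14, 9, 0))
table.insert(2, datetime(2021, 11, 2, 8, 15))
table.insert(3, datetime(2022, 5, 30, 0, 16, 35))

hits, scanned = table.query_since(datetime(2022, 5, 15))
print([row[0] for row in hits], scanned)  # [3] 1
```

A Current query touches one row here instead of three. A Historical query such as `table.query_since(datetime(2020, 1, 1))` still works, but it has to scan both partitions.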
When it comes to tuning database performance, Indexing and Partitioning are excellent tools in a DBA’s toolbelt. It is important to consider the pros and cons when selecting either one. Both techniques reduce the amount of data used by queries to allow them to run faster. Indices work best on tables with less data churn (INSERTs/UPDATEs/DELETEs), whereas Partitioning speeds up the same operations on huge tables. Partitions are also a great approach for archiving old records.