What is Change Data Capture (CDC)?
Data becomes more valuable to businesses and organizations as modern methods and techniques make it easier to extract insights from it. At the same time, the volume of data keeps growing, which brings storage challenges. Experts therefore need architectures that can store and update data efficiently, since keeping data up to date is essential to using it effectively.
Change Data Capture (CDC) is an architectural pattern that lets database systems communicate and keep their data records synchronized. It transfers changes in real time or near real time to every storage point, including data warehouses and data lakes.
CDC techniques - Push and Pull
There are two main approaches to implementing CDC. In the pull approach, a target system regularly checks the source system for changes in the database. In the push approach, the source sends each change to the target data storage as soon as it is made.
The Advantages and Disadvantages of Push and Pull
The push and pull methodologies have their trade-offs.
Push
Advantages: Low latency. A change is pushed to the target system immediately and does not have to wait for the target to ask whether anything has changed.
Disadvantages:
- Push is more computationally intensive for the source system, which has to both detect each change and communicate it to the target storage.
- If the target system is unavailable, push may lose updates. The source pushes them but the target never receives them, so the change is never applied.
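The lost-update risk described above can be illustrated with a minimal sketch. The `Source` and `Target` classes here are hypothetical stand-ins, not any real product's API, and the sketch assumes the source makes a single delivery attempt with no retry or buffering.

```python
class Target:
    """Hypothetical target store that can be taken offline."""

    def __init__(self):
        self.rows = {}
        self.online = True

    def apply(self, key, value):
        if not self.online:
            raise ConnectionError("target unavailable")
        self.rows[key] = value


class Source:
    """Hypothetical source that pushes every write straight to the target."""

    def __init__(self, target):
        self.rows = {}
        self.target = target
        self.dropped = []  # changes that never reached the target

    def write(self, key, value):
        self.rows[key] = value
        try:
            self.target.apply(key, value)
        except ConnectionError:
            # Assumption: no retry and no buffering, so the change is lost.
            self.dropped.append((key, value))


target = Target()
source = Source(target)

source.write("a", 1)      # delivered
target.online = False
source.write("b", 2)      # pushed while the target is down
target.online = True
source.write("c", 3)      # delivered, but "b" is still missing
```

After these three writes the source holds all three rows while the target holds only `a` and `c`. A real system would typically add retries or a durable buffer between the two sides.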
Pull
Advantages: Less computationally expensive for the source system. The target system does the work of regularly checking for changes in state and applying them.
Disadvantages: Higher latency. An update has to wait until the target system next scans for changes before it is applied.
Polling on a version or timestamp column is a pull-based approach. Each row carries an attribute such as a version number or a timestamp, and the target system uses it to check for the latest version of the row. The target then applies any newer rows from the source to the corresponding rows in the target database.
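As a rough sketch of this kind of polling, the example below uses two in-memory SQLite databases. The `items` table, its `version` column, and the `pull_changes` helper are assumptions made for illustration, and the sketch assumes the source bumps `version` to a new, higher number on every write.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a versioned `items` table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, name TEXT, version INTEGER NOT NULL)"
    )
    return db


def pull_changes(source, target, last_seen):
    """Copy rows newer than `last_seen`; return the new high-water mark."""
    rows = source.execute(
        "SELECT id, name, version FROM items WHERE version > ? ORDER BY version",
        (last_seen,),
    ).fetchall()
    for row_id, name, version in rows:
        # Insert the row, or overwrite the target's older copy of it.
        target.execute(
            "INSERT OR REPLACE INTO items (id, name, version) VALUES (?, ?, ?)",
            (row_id, name, version),
        )
        last_seen = max(last_seen, version)
    target.commit()
    return last_seen


source, target = make_db(), make_db()
source.execute("INSERT INTO items VALUES (1, 'widget', 1)")
source.execute("INSERT INTO items VALUES (2, 'gadget', 2)")
mark = pull_changes(source, target, last_seen=0)   # first poll copies both rows

source.execute("UPDATE items SET name = 'widget v2', version = 3 WHERE id = 1")
mark = pull_changes(source, target, mark)          # second poll copies only row 1
```

The target remembers the highest version it has seen, so each poll only reads rows changed since the previous one. One limitation of this sketch is that rows deleted from the source are never noticed, because a deleted row has no version left to compare.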
A publish-subscribe queue is a push-based approach. The source (publisher) pushes each change directly into a data queue, and the target (subscriber) then acts on the updates it receives.
The benefit of this approach is that it decouples the source and the target. If the target system is down, the updates wait in the queue and can be applied as soon as the system is available again.
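The decoupling can be sketched in a few lines. `ChangeQueue` and `Subscriber` are hypothetical names, and the in-memory `deque` only stands in for a real, durable message broker.

```python
from collections import deque


class ChangeQueue:
    """Hypothetical in-memory queue standing in for a message broker."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        # The publisher only talks to the queue, never to the subscriber.
        self._events.append(event)

    def drain(self):
        # Hand out pending events in the order they were published.
        while self._events:
            yield self._events.popleft()


class Subscriber:
    """Hypothetical target that applies queued changes when it is running."""

    def __init__(self):
        self.rows = {}

    def consume(self, queue):
        for key, value in queue.drain():
            self.rows[key] = value


queue = ChangeQueue()
subscriber = Subscriber()

queue.publish(("a", 1))
subscriber.consume(queue)      # target is up: "a" is applied right away

queue.publish(("b", 2))        # target is down: nothing consumes,
queue.publish(("c", 3))        # so the changes wait in the queue

subscriber.consume(queue)      # target is back: "b" and "c" are applied
```

Unlike a direct push, nothing is lost while the subscriber is away, as long as the queue itself stays available.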
Macrometa’s stream processing engine allows you to perform ETL (Extract, Transform, Load), which is an effective method for transferring data. The last step involves data transfer, which can be done with a Pub-Sub based model that provides a real-time, low-latency, globally distributed data transfer service.
Data transfer is a challenging task because the volume, velocity, and accuracy of the transfer all matter. Change data capture methods carry out this transfer from a source to a target destination so that a global database stays up to date. These methods can use a push or a pull approach, depending on the requirements of the process.