Everything you need to know about Change Data Capture

Learn how to turn data stores into streams of change events.

...
By Stefan Sprenger

Many data architectures employ different data stores for different use cases. They may use traditional database systems for organizing raw operational data, data warehouse systems for preparing data for downstream analytics or ML applications, and search engines for providing powerful search interfaces to data. In addition, many enterprises do not only maintain internal data but also integrate external data services into their architecture, which further increases the number of data stores in use.

Data stores are rarely operated in isolation but often need to exchange data with each other. Exemplary use cases for transferring data from a data source to a data sink are:

  • Loading operational data from a database (data source) into a data warehouse system (data sink),

  • indexing a database system (data source) with a search engine (data sink), or

  • integrating external data services (data source) into internal database systems (data sink).

The common approach to transferring data between data stores, a problem often described as Extract Transform Load (ETL), is the recurrent execution of data pipelines that extract all data from a data source, perform optional transformations on the data, and eventually write the transformed data to a data sink.

While this approach is fairly easy to implement, it is very inefficient when applied in practice. For each run, data pipelines have to consider all data from the data source, even if there have not been any changes since the last run, effectively wasting computing resources and putting a significant load on all involved systems. To avoid impacting the performance of the consumed data source and interfering with other workloads, data pipelines are typically executed with a very low frequency, e.g., each night at 2 am. As a consequence, data sinks are seldomly in sync with data sources.

What if we could do better?

Change Data Capture (CDC)

Change data capture (CDC) is a technique used to detect and capture record-level change events that occurred in data stores, e.g., insertions of new records, updates of existing records, or deletions of records.

CDC offers two main advantages compared to the traditional full copy of data sets. First, data pipelines need to consider only the data that have changed since the last run, which makes the consumption of computing resources more efficient. Second, given that information on change events become available immediately after their occurrence, data pipelines could turn into streaming applications, which process data change events in real time and always keep data sinks in sync with data sources.

The following sections discuss different approaches to implementing CDC.

Diff-based CDC

The most straightforward approach to capturing data changes is comparing the current state of the data source to the state of the data source when the data pipeline has been last executed.

The diff-based CDC implementation has two main disadvantages. First, for computing the differences the implementation would still need to access all data regardless of whether they have changed. Second, space usage would strongly increase because the implementation would need to always keep a snapshot of the data from the time of the last run.

In practice, diff-based CDC implementations rarely offer advantages compared to full copies of the data unless the execution of data transformations or the loading of data into the data sink are performance bottlenecks.

  • Detection of all data change events (insertions, updates, and deletions)

  • Uses the query layer for extracting data, which puts additional load on the data source

  • Requires recurrent polling of the table

  • Determining the difference between two data sets is a compute-heavy operation that makes frequent executions impossible

  • Additional space requirements for caching the state of the data set

Timestamp-based CDC

To ease the detection of changes one could introduce a column to the schema of the data source that stores when a record has been last modified. In some data sets, such a column may already exist.

Data pipelines could query the timestamp column to retrieve those records, which have been changed since the last run.

Timestamp-based CDC cannot capture deletions and is only suitable for applications that perform soft deletions or do not delete records at all.

  • Uses the query layer for extracting data, which puts additional load on the data source

  • Requires recurring polling of the table

  • Requires the presence of a column that tracks the time when the record has been last modified

  • Cannot capture deletions unless the application uses soft deletions

Trigger-based CDC

Many database systems provide trigger functions as a means to performing user-defined actions once events, like insertions of data, occur (e.g., Triggers in PostgreSQL, Triggers in SQL Server, or Triggers in MySQL). Trigger functions could be used to capture all changes that occur in a particular table and track them, for instance, in a separate table used as an event queue.

While trigger functions could be used to allow data pipelines to only process the data that have changed since the last run, they still require recurring polling of the event table.

  • Event-driven capturing of changes

  • Detection of all data change events (insertions, updates, and deletions)

  • Support for tracking custom information, such as the name of the database user performing the operation

  • Change events trigger writes to the event table, often strongly increasing the execution time of the original operation

  • Changes to the schema of the data table must be manually propagated to the event table

  • Requires recurring polling of the event table

  • Vendor-specific code for implementing trigger functions, which impedes future migrations to other database systems

Log-based CDC

Most database systems maintain transaction logs (or operation logs), which keep track of the operations applied to a database table, for replication or recovery reasons. The log-based approach to implementing CDC leverages these logs for capturing and extracting change events.

Transaction logs are typically updated in real time and do not create any computational overhead, like trigger functions, which enables an implementation of CDC that does not only allow the event-driven processing of changes but also keeps the performance of the data source unaffected.

Although most of today’s database systems provide transaction logs (see, for instance, MySQL’s Binary Log, logical replication in PostgreSQL, or MongoDB’s Oplog), log-based CDC cannot be used with any database system. Especially very old releases of database systems do not support such functionality.

Transaction logs do not keep the full history of operations performed on a database table, but only store the operations applied within a particular retention period (for instance, within the last seven days). At DataCater, we read a full snapshot of a database table before starting to consume the transaction log to be able to offer a full replication of a data source to a data sink.

  • Event-driven capturing of changes

  • Detection of all data change events (insertions, updates, and deletions)

  • No impact on the performance of the data source

  • The implementations of transaction logs are specific to the vendor, which impedes future migrations to other database systems

  • Transaction logs typically do not store the full history of a database table but only the operations performed within a particular retention period

Change Data Capture with DataCater

At DataCater, we believe that change data capture must enable an event-driven processing of changes such that data sinks are always in sync with data sources. Additionally, CDC must be able to extract change events without impacting any other workloads. As a consequence, we strongly prefer log-based CDC whenever possible.

DataCater uses log-based CDC connectors for all major database systems, including IBM DB2, MySQL, Microsoft SQL Server, Oracle, and PostgreSQL. DataCater hides vendor-specific implementations of transaction logs and provides a unique interface to the event-driven capturing and processing of data changes.

DataCater provides interactive means to building streaming data pipelines. When using log-based CDC connectors, DataCater offers possibly the most efficient way to stream change events from data sources to data sinks and transform them on the way.

Summary

This article compared different approaches to implementing change data capture, a technique used for extracting change events from data stores. Although all approaches have their own strengths and weaknesses, from our experience, log-based CDC is the most efficient way for implementing CDC in production use cases, because it enables an event-driven capturing of changes and does not impact the performance of any other workloads performed on the data store.