Everything you need to know to get started using change data capture with MySQL.
MySQL is one of the most popular database management systems in the world, successfully powering many applications since its first introduction in 1995. MySQL is typically used for managing the core (or transactional) data of applications, such as products or sales in an e-commerce shop, and is often complemented with other data systems, e.g., a data warehouse for analytics, a search engine for search, etc.
The traditional approach to syncing MySQL with complementary data stores is batch-based. From time to time, data pipelines extract all data from the MySQL database system and send it to downstream data stores. Change data capture (CDC) is a modern alternative to inefficient bulk imports. CDC extracts change events (INSERTs, UPDATEs, and DELETEs) from data stores, such as MySQL, and provides them to a data pipeline. The main advantages of CDC are:
CDC typically captures changes in real-time, keeping downstream systems, such as data warehouses, always up-to-date and enabling event-driven data pipelines.
Using CDC decreases the load on all involved systems since only relevant information, i.e., data change events, are processed.
CDC enables straightforward implementations of use cases requiring access to data changes, such as an audit or changelog, without changing the code of applications.
In this comprehensive article, you will get a full introduction to using change data capture with MySQL. We cover three common approaches to implementing change data capture: triggers, queries, and MySQL’s Binlog. While each approach has its own advantages and disadvantages, at DataCater our clear favorite is Binlog-based CDC.
Using MySQL’s support for TRIGGER functions we could listen for all INSERT, UPDATE, and DELETE events occurring in the table of interest and, for each event, insert one row into another table, effectively building a changelog.
Let’s assume a MySQL database features the following table users:
CREATE TABLE users ( user_id INT, user_name TEXT, user_email TEXT );
We could create a trigger, which captures all INSERT events from the table users and appends a new row for each INSERT event to the table users_change_events:
CREATE TABLE users_change_events ( log_id BIGINT AUTO_INCREMENT, event_type TEXT, event_timestamp TIMESTAMP user_id INT, user_name TEXT, user_email TEXT PRIMARY KEY ('log_id') ); CREATE TRIGGER user_insert_capture AFTER INSERT ON users FOR EACH ROW BEGIN INSERT INTO users_change_events ( event_type, event_timestamp, user_id, user_name, user_email ) VALUES ( 'INSERT', now(), NEW.user_id, NEW.user_name, NEW.user_email );
UPDATE and DELETE events can be captured accordingly.
By default, this approach stores captured events only inside MySQL. If you want to sync change events to other data systems, you would have to recurringly query the MySQL table holding the events, which increases the complexity.
Let us have a look at the pros and cons of implementing change data capture with triggers in MySQL:
Changes are captured instantly, enabling the real-time processing of change events.
Triggers can capture all event types: INSERTs, UPDATEs, and DELETEs.
Using triggers, it's easy to add custom metadata to the change event, such as the name of the database user performing the operation.
Triggers increase the execution time of the original operation and thus hurt the performance of the database.
Triggers require changes to the MySQL database.
Changes to the schema of the monitored table (here users) must be manually propagated to the table filled by the trigger function (here users_change_events).
If change events shall be synced to a data store other than the same MySQL database, we would need to set up a data pipeline, which polls the table filled by the trigger function (here users_change_events).
Creating and managing triggers induces additional operational complexity.
The second approach to implementing change data capture with MySQL is query-based.
If the schema of the monitored database table features a timestamp column indicating when a row has been changed the last time, we could recurringly query MySQL using that column and ask for all rows, which have been modified since the last time we queried MySQL. Assuming a table named users and a timestamp column named updated_at, such a query could be implemented as follows:
SELECT * FROM users WHERE updated_at > 'TIMESTAMP_LAST_QUERY';
Note that query-based CDC cannot capture DELETEs - unless using soft deletions - but is limited to INSERT and UPDATE events.
Query-based CDC can be implemented without introducing any changes to the database, if the schema holds a timestamp column indicating the modification time of rows.
Query-based CDC implementations use the query layer for extracting data, which puts additional load on the MySQL database.
Query-based CDC requires the recurring polling of the monitored table, which wastes resources if no data change between two accesses.
Query-based CDC requires the presence of a column that tracks the time when the record has been last modified.
Query-based CDC cannot capture DELETE events unless the application uses soft deletions.
MySQL offers the Binlog for efficiently and safely replicating data between different database instances on possibly different physical machines. Technically, the Binlog is a binary file on disk, which holds all events that change the content or structure of the MySQL database, e.g., INSERTs, UPDATEs, DELETEs, schema changes, etc.
For the purpose of implementing change data capture, we might attach to the Binlog of the MySQL database and consume all change events occurring in the monitored table.
While many database systems already use the Binlog for replication purposes it is not enabled by default. If your MySQL instance does not yet use the Binlog, you can enable it by introducing the following changes to the MySQL configuration:
server-id = 42 log_bin = mysql-bin binlog_format = ROW expire_logs_days = 10 # define `binlog_row_image` for MySQL 5.6 or higher, # leave it out for earlier releases binlog_row_image = FULL
Managed MySQL instances typically do not directly expose access to the configuration. However, most managed MySQL services, which we are aware of, offer support for the MySQL Binlog, e.g., AWS RDS, Google Cloud SQL, or Azure Database.
For the technical implementation of Binlog-based change data capture, we highly recommend using one of the existing open-source projects, such as Debezium or Maxwell’s daemon. DataCater’s MySQL source connector is based on Debezium.
The following list shows the advantages and disadvantages of using MySQL’s Binlog for implementing CDC:
Binlog-based CDC enables the event-driven capturing of data changes in real-time. Downstream applications have always access to the latest data from MySQL.
Binlog-based CDC can detect all change event types: INSERTs, UPDATEs, DELETES, and even schema changes.
Since reading events from the Binlog boils down to directly accessing the file system, this CDC approach does not impact the performance of the MySQL database.
The Binlog is not available in very old versions of MySQL (at DataCater we support MySQL 5.5 or newer).
The Binlog does not store the entire history of change events performed on the database table but only the operations performed within a particular retention period (defined by the configuration option expire_logs_days), which is why we typically combine it with an initial full snapshot of the monitored table using a SELECT * FROM table_name; query.
When comparing the three approaches to implementing change data capture with MySQL, using the MySQL Binlog is the clear winner. It is not only highly efficient, capturing all event types in real-time without harming the performance of the MySQL database, but is also widely available, whether you’re using a self-managed or a managed MySQL installation, and applicable without introducing changes to the database schema.
CDC connectors are typically more complicated to manage than the traditional SELECT * FROM table; query. In DataCater, we offer plug & play connectors for change data capture, which enable you to set up Binlog-based or query-based CDC with MySQL in a few minutes and take the operational complexity away from you. Give it a try!
Did you like this article? Did we miss anything? Feel free to reach out to us. We would be more than happy to chat CDC, show you DataCater in action, or discuss any other request with you.