This guide helps you to get started using CDC with the PostgreSQL database system.
PostgreSQL is a famous open-source database management system, which is in production at a plethora of enterprises. In the typical setup, PostgreSQL manages the transactional data of applications, such as products in an e-commerce shop, and integrates third-party data systems for other purposes, e.g., a data warehouse for analytics, a BI tool for reporting, etc.
The traditional approach to connecting PostgreSQL with other data stores is batch-based. From time to time, data pipelines extract all data from PostgreSQL and send them to downstream data stores, which is not only inefficient but also prone to errors.
Change data capture (CDC) is a modern alternative that can extract record-level change events (INSERTs, UPDATEs, and DELETEs) from PostgreSQL in real-time. The main benefits of change data capture are:
CDC captures change events in real-time, keeping downstream systems, such as data warehouses, always in sync with PostgreSQL and enabling fully event-driven data architectures.
Using CDC reduces the load on PostgreSQL since only relevant information, i.e., changes, are processed.
CDC enables the efficient implementation of use cases requiring access to change events of PostgreSQL, such as audit or changelogs, without modifying the application code.
In this article, we provide a complete introduction to using change data capture with PostgreSQL. We cover three common approaches to implementing change data capture: triggers, queries, and logical replication. While each approach has its own advantages and disadvantages, at DataCater our clear favorite is log-based CDC using logical replication.
With PostgreSQL’s Trigger feature, we can listen for all INSERT, UPDATE, and DELETE events occurring in the table of interest and, for each event, insert one row into a second table, building a changelog.
The PostgreSQL community provides a generic Trigger function (Code), which supports PostgreSQL version 9.1 and newer and stores all change events in the table audit.logged_actions. Given that we want to enable the Trigger-based change data capture for the table public.users, we can run the following SQL statement:
This approach to CDC stores captured events only inside PostgreSQL. If you want to sync change events to other data systems, such as a data warehouse, you would have to recurringly query the PostgreSQL table holding the change events (here audit.logged_actions), which increases the complexity of the implementation.
Let’s compare the pros and cons of implementing change data capture with triggers in PostgreSQL:
Changes are captured instantly, enabling the real-time processing of change events.
Triggers can capture all event types: INSERTs, UPDATEs, and DELETEs.
By default, the PostgreSQL Trigger function used here adds helpful metadata to the events, e.g., the statement that caused the change, the transaction ID, or the session user name.
Triggers increase the execution time of the original statement and thus hurt the performance of PostgreSQL.
Triggers require changes to the PostgreSQL database.
If change events shall be synced to a data store other than the same PostgreSQL database, we would need to set up a separate data pipeline, which polls the table filled by the trigger function (here audit.logged_actions).
Creating and managing triggers induces additional operational complexity.
The second way to implement change data capture with PostgreSQL is query-based.
If the schema of the monitored database table features a timestamp column indicating when a row has been changed the last time, we could recurringly query PostgreSQL using that column and ask for all records, which have been modified since the last time we queried PostgreSQL. Assuming a table named public.users and a timestamp column named updated_at, such a query could be implemented as follows:
SELECT * FROM public.users WHERE updated_at > 'TIMESTAMP_LAST_QUERY';
Note that query-based CDC cannot capture DELETEs - unless using soft deletions - but is limited to only INSERT and UPDATE events.
Query-based CDC can be implemented without introducing any changes to PostgreSQL, if the schema holds a timestamp column indicating the modification time of rows.
Query-based CDC implementations use the query layer for extracting data, which puts additional load on PostgreSQL.
Query-based CDC requires the recurring polling of the monitored table (here public.users), which wastes resources if data rarely change.
Query-based CDC requires the presence of a column (here updated_at) that tracks the time when the record has been last modified.
Query-based CDC cannot capture DELETE events (unless the application uses soft deletions).
Since version 9.4, PostgreSQL offers logical replication for efficiently and safely replicating data between different PostgreSQL instances on possibly different physical machines. Technically, it’s a write ahead log on disk, which holds all events that change the data of the PostgreSQL database, e.g., INSERTs, UPDATEs, and DELETEs.
PostgreSQL uses a subscription model with publishers and subscribers for the implementation of logical replication. For the purpose of implementing change data capture, we might employ the database of interest as a publisher and subscribe ourselves to its log.
While many database systems might already use replication it is not enabled by default. You can enable logical replication by introducing the following change to the configuration file postgresql.conf.
wal_level = logical
In the next step, you need to modify the configuration file pga_hba.conf to allow replication (please refer to the PostgreSQL documentation for individual configuration):
host all repuser 0.0.0.0/0 md5
Let’s assume that you want to capture changes from the table public.users. You can enable CDC for this table by creating a new publication as follows:
CREATE PUBLICATION newpub FOR TABLE public.users;
In the next step, you can start subscribing to this publication. A subscription starts with an initial snapshot and then replicates all incremental changes. If you would like to consume the events from another PostgreSQL instance, you might create the subscription as follows:
CREATE SUBSCRIPTION newsub CONNECTION 'dbname=foo host=bar user=repuser' PUBLICATION newpub;
Technically, logical replication is implemented by a logical decoding plugin. If you are using a PostgreSQL version older than 10, you need to manually install a plugin in your PostgreSQL database, e.g., wal2json or decoderbufs. Since version 10, PostgreSQL ships the plugin pgoutput by default.
For the technical implementation of log-based change data capture, we highly recommend using one of the existing open-source projects, such as Debezium. DataCater’s PostgreSQL source connector is based on Debezium.
Most managed PostgreSQL services, which we are aware of, offer support for logical replication, e.g., AWS RDS, Google Cloud SQL, or Azure Database.
The following list shows the advantages and disadvantages of using PostgreSQL’s logical replication for implementing CDC:
Log-based CDC enables the event-driven capturing of data changes in real-time. Downstream applications have always access to the latest data from PostgreSQL.
Log-based CDC can detect all change event types in PostgreSQL: INSERTs, UPDATEs, and DELETES.
Consuming events via logical replication boils down to directly accessing the file system, which does not impact the performance of the PostgreSQL database.
Logical replication is not supported by very old versions of PostgreSQL (older than 9.4).
When comparing the three approaches to implementing change data capture with PostgreSQL, using logical replication is the clear winner. It is not only highly efficient, capturing all event types in real-time without harming the performance of the PostgreSQL database, but is also widely available, whether you’re using a self-managed or a managed PostgreSQL installation, and applicable without introducing changes to the database schema.
CDC connectors are typically more complicated to manage than the traditional SELECT * FROM table; query. In DataCater, we offer plug & play connectors for change data capture, which enable you to set up log-based CDC with PostgreSQL in a few minutes and take the operational complexity away from you. Give it a try!