Improve the efficiency and freshness of your data processing by extracting change events from Web APIs instead of performing bulk loads.
Change data capture (CDC) is a technique to detect and extract data change events from data systems. CDC enables data pipelines to process data much more efficiently. As opposed to recurring bulk loads, which waste compute resources, put a high load on all involved systems, and prevent frequent data extractions, CDC allows data pipelines to process only the relevant data changes.
Today, CDC is best known as an efficient means of extracting data from database systems. Database systems typically provide a replication log, which can serve as the source for change events and is consumed by CDC connectors, like the excellent Debezium. However, CDC is not restricted to database systems; it can also be applied to sources that don’t give us access to an event log, such as web APIs.
In this article, we show how to deploy change data capture to extract change events from JSON-based web APIs.
As an example, we use Shopify’s Products endpoint, which returns a list of products:
{
  "products": [
    {
      "id": 632910392,
      "title": "IPod Nano - 8GB",
      "vendor": "Apple",
      "product_type": "Cult Products",
      "handle": "ipod-nano",
      "created_at": "2021-07-01T13:58:02-04:00",
      "updated_at": "2021-07-01T13:58:02-04:00",
      [...]
    }
  ]
}
Our goal is to turn this static API response, which provides the state of the products at query time, into a real-time changelog that can be processed with a (streaming) data pipeline.
Get started with DataCater, for free
The real-time ETL platform for data and dev teams. Benefit from the power of event streaming without having to handle its complexity.
The status quo for extracting data from a web API, such as Shopify’s Products endpoint, is a recurring bulk load. For instance, each night at 2 am, one might access the API, extract all products, and pass them to a data pipeline for further processing - regardless of whether the products have been updated since the last access.
Depending on the amount of data, bulk loads cannot be performed too often because they might degrade the performance of downstream systems. While consuming a web API usually does not cause any performance issues, processing the consumed data and writing them into downstream data systems, such as other web APIs, might take (much) more time and become a bottleneck.
A first step to reduce the load on downstream systems is to enable the data extraction to detect which data have changed since the last access.
To this end, we might use timestamps provided by the API that indicate the last time a record was changed. In the case of Shopify’s Products endpoint, this attribute is called updated_at. When consuming the data from the API, the connector would consider only those records whose updated_at value is later than the time of the last query.
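The timestamp comparison can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular connector: the function name changed_since and the second sample record are assumptions made for the example, and only the updated_at attribute comes from the API response shown earlier.

```python
from datetime import datetime


def changed_since(records, last_query_time):
    """Return the records whose updated_at is later than last_query_time."""
    return [
        record
        for record in records
        # fromisoformat parses offsets such as "-04:00" into aware datetimes,
        # so the comparison is correct across time zones.
        if datetime.fromisoformat(record["updated_at"]) > last_query_time
    ]


products = [
    {"id": 632910392, "updated_at": "2021-07-01T13:58:02-04:00"},
    # Hypothetical record that was updated after the last query.
    {"id": 1, "updated_at": "2021-08-05T07:59:30-04:00"},
]
last_query_time = datetime.fromisoformat("2021-08-05T07:59:00-04:00")
changed = changed_since(products, last_query_time)
```

Note that this filter runs on the client side, so the full list of products still has to be fetched from the API on every run.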
If the API does not provide such timestamps, we would need to use other, less preferable, means of detecting changes. For instance, we could maintain a set of the hashes of processed records and periodically compare the hashes of the records from the API with this internal set - if we discover an unknown hash, we know that we have not yet processed this version of the record.
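A hash-based comparison might look like the following sketch. The helper names record_hash and detect_changes are assumptions for illustration, and SHA-256 over a canonical JSON serialization is just one reasonable choice of hash.

```python
import hashlib
import json


def record_hash(record):
    """Hash a record; sorted keys make the hash independent of key order."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(records, seen_hashes):
    """Return records with an unknown hash and remember their hashes."""
    changed = []
    for record in records:
        digest = record_hash(record)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            changed.append(record)
    return changed


seen_hashes = set()
first_run = detect_changes(
    [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}], seen_hashes
)
# On the second run, only the record with the changed title is reported.
second_run = detect_changes(
    [{"id": 1, "title": "A"}, {"id": 2, "title": "B (new)"}], seen_hashes
)
```

In this simple form the set of hashes only grows, and a record that reverts to an earlier state would not be reported again, which is part of why timestamps are the preferable option when the API offers them.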
At this point, we have already made the data processing more intelligent. However, although we can detect which data have changed, we still need to periodically extract all data from the API. What if we could improve on that?
Most APIs provide means to filter the data while querying them, typically by defining a parameter in the query string. For instance, Shopify allows us to define the query parameter updated_at_min to retrieve only products that have been updated after a given timestamp.
Let’s assume it’s 2021-08-05 08:00 and we last queried the Shopify API one minute ago. We could access the following URI to retrieve all products that have changed since our last access:
/admin/api/2021-07/products.json?updated_at_min=2021-08-05T07:59:00-04:00
Using such filters allows us to consume only relevant data and query the API at a relatively high frequency.
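As a small sketch, the URI for the next query could be derived from the time of the last access as follows. The function name products_uri is an assumption; the path and the updated_at_min parameter are taken from the example above. The shop's host name and authentication are left out, since they depend on the individual store.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

PRODUCTS_PATH = "/admin/api/2021-07/products.json"


def products_uri(last_access):
    """Build the request path for products updated since last_access."""
    # urlencode percent-encodes the colons in the timestamp.
    query = urlencode({"updated_at_min": last_access.isoformat()})
    return f"{PRODUCTS_PATH}?{query}"


# 2021-08-05 07:59 at UTC-04:00, one minute before the assumed current time.
last_access = datetime(2021, 8, 5, 7, 59, tzinfo=timezone(timedelta(hours=-4)))
uri = products_uri(last_access)
```

The resulting URI is the one shown above, except that the colons in the timestamp are percent-encoded as %3A.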
In the context of web APIs, there is an alternative approach for exchanging change events: webhooks. In this case, you, the subscriber, have to provide an API endpoint, which the application whose change events you want to consume calls whenever a change happens. While webhooks are widely supported in web applications, they have one main disadvantage compared to CDC: webhooks don’t support replaying events. If the subscribing API is offline when the event occurs (that is, when the webhook is fired), the data are lost. In contrast, a CDC-based approach can recover from the downtime by continuing at the last read timestamp, without losing any data.
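The recovery behavior of the CDC-based approach can be sketched by persisting the last read timestamp between polls. Everything here is an illustrative assumption rather than the way a specific connector works: fetch_changes stands for a function that queries the API with the given timestamp as filter, handle stands for the downstream processing, and the cursor is kept in a small JSON file.

```python
import json
from datetime import datetime
from pathlib import Path


def load_cursor(path, default):
    """Return the persisted last read timestamp, or default on the first run."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())["updated_at_min"]
    return default


def save_cursor(path, timestamp):
    """Persist the last read timestamp so that a restart can resume from it."""
    Path(path).write_text(json.dumps({"updated_at_min": timestamp}))


def poll_once(fetch_changes, handle, cursor_path, default_cursor):
    """Fetch changes since the stored cursor, handle them, advance the cursor."""
    cursor = load_cursor(cursor_path, default_cursor)
    records = fetch_changes(cursor)  # hypothetical API call
    # Process the oldest change first so that the cursor only moves forward.
    ordered = sorted(records, key=lambda r: datetime.fromisoformat(r["updated_at"]))
    for record in ordered:
        handle(record)  # hypothetical downstream processing
        # Advance the cursor only after the record has been handled.
        save_cursor(cursor_path, record["updated_at"])
    return len(records)
```

If the process crashes or is offline for a while, the next call to poll_once simply continues at the stored timestamp. Depending on whether the API treats the filter as inclusive, a record may be delivered more than once, so the downstream processing should tolerate duplicates.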
When it comes to selecting specific technologies for the implementation, we’re of course a bit biased. At DataCater, we’re big fans of Apache Kafka for storing event data. The Kafka community provides an open-source connector for applying change data capture in the context of web APIs.
Of course, you may also implement your own connector or use something other than Apache Kafka.
If you want to save time and headaches, you might also consider using DataCater for implementing change data capture with web APIs. DataCater offers a plug-and-play CDC connector for web APIs, which takes only a few minutes to configure. In the case of the Shopify API, all you need to do is to fill out seven text fields in a web form.
There are plenty of use cases that would strongly benefit from a CDC-based approach to data extraction. For instance, our partner Xanevo uses DataCater to stream product changes from Shopify to the NLG application AX Semantics to automate the generation of SEO-optimized product descriptions.
Using change data capture gives Xanevo full control over this process. Once a change occurs in Shopify, DataCater instantly streams the change event to AX Semantics, thus keeping the product descriptions always up-to-date. A further advantage of CDC for the automation of content is that only changes trigger the generation of new texts, which is preferred in SEO. A periodic bulk load, in contrast, leads to new texts even if the underlying data have not changed.