Streaming ETL

Your data are always in motion. So are our pipelines.

Process your data as continuous flows

Traditionally, ETL pipelines run at fixed intervals and extract data from source systems using bulk loads. This approach is not only inefficient but also ignores the fact that most data sets change continuously.

DataCater makes streaming ETL pipelines accessible to data and dev teams. Streaming ETL pipelines sync change events from data sources to data sinks in real-time and can transform them on the way. They are much more efficient than traditional bulk loads because they process only the relevant data: the changes. Because they are typically paired with change data capture connectors, streaming ETL pipelines extract and process data changes the moment they occur, keeping downstream applications and systems always up-to-date.
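For illustration, a change event emitted by a change data capture connector might look like the following. This is a Debezium-style structure shown as a Python dictionary; the exact layout depends on the connector in use.

    # Hypothetical change event for an UPDATE on a "products" table,
    # modeled after the Debezium format; field names vary by connector.
    change_event = {
        "op": "u",                       # c = create, u = update, d = delete
        "ts_ms": 1667301122000,          # time the change was captured
        "before": {"id": 42, "name": "Espresso Cup", "price": 4.90},
        "after":  {"id": 42, "name": "Espresso Cup", "price": 5.20},
        "source": {"table": "products", "db": "shop"},
    }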

Transform data in real-time

The T in ETL stands for Transform. Streaming ETL pipelines can not only sync data between two data systems in real-time but also transform them on the way. In streaming ETL pipelines, transformations are used to:

Filter data

By default, streaming ETL pipelines process all data from a data source. Filters are useful if only a subset of the source data, for instance, products of a specific category, is of interest to the downstream applications.
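As a rough sketch, such a filter can be expressed as a small Python predicate applied to each record; the field name ("category") and the value ("coffee") are made up for this example:

    # Keep only records from one product category; everything else is dropped.
    def keep_record(record: dict) -> bool:
        return record.get("category") == "coffee"

    records = [
        {"id": 1, "name": "Espresso Cup", "category": "coffee"},
        {"id": 2, "name": "Desk Lamp", "category": "furniture"},
    ]
    filtered = [r for r in records if keep_record(r)]  # -> only record 1 remains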

Clean data

Typically, streaming ETL pipelines consume raw data from a data source and must clean the data on the way in preparation for downstream usage. Cleaning data might include tasks such as replacing missing values, normalizing attribute values, or fixing typos in text values.
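A minimal Python sketch of such a cleaning step could look as follows; the field names and the default value are assumptions for the example:

    # Clean a single record: fill a missing value and normalize an attribute.
    def clean_record(record: dict) -> dict:
        cleaned = dict(record)
        # Replace a missing country with a default value.
        if not cleaned.get("country"):
            cleaned["country"] = "unknown"
        # Normalize email addresses to lowercase and strip whitespace.
        if cleaned.get("email"):
            cleaned["email"] = cleaned["email"].strip().lower()
        return cleaned

    clean_record({"email": "  Jane.Doe@Example.COM ", "country": None})
    # -> {"email": "jane.doe@example.com", "country": "unknown"}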

Change the schema of data

Most data sinks do not use the exact same schema as data sources. ETL pipelines must handle differences in data types, attribute names, or the set of available attributes. To this end, they apply transformations that manipulate the schema of the data, such as casting data types, renaming attributes, introducing new attributes, or removing attributes, while streaming data from data sources to data sinks.
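In Python, a schema-mapping step along these lines could be sketched like this; the source and target attribute names are invented for the example:

    # Map a source record onto the sink schema: cast types, rename attributes,
    # add a derived attribute, and drop attributes the sink does not need.
    def map_schema(record: dict) -> dict:
        return {
            "product_id": int(record["id"]),                       # cast string -> int
            "product_name": record["name"],                        # rename "name"
            "price_cents": round(float(record["price"]) * 100),    # derived attribute
            # "internal_notes" from the source is intentionally dropped
        }

    map_schema({"id": "42", "name": "Espresso Cup", "price": "5.20",
                "internal_notes": "restock soon"})
    # -> {"product_id": 42, "product_name": "Espresso Cup", "price_cents": 520}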

Enrich data

Oftentimes, raw data in data sources lack information that needs to be added before loading them into the data sink. For instance, one might want to automatically enrich a data set containing phone numbers with the cities corresponding to their area codes. In such cases, streaming ETL pipelines can enrich data with additional information while streaming them.
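Sticking with the phone-number example, an enrichment step could be sketched in Python like this; the area-code table is a tiny stand-in for a real reference data set:

    # Enrich records with the city belonging to the phone number's area code.
    AREA_CODE_TO_CITY = {"030": "Berlin", "089": "Munich", "040": "Hamburg"}

    def enrich_record(record: dict) -> dict:
        enriched = dict(record)
        area_code = record.get("phone", "")[:3]
        enriched["city"] = AREA_CODE_TO_CITY.get(area_code, "unknown")
        return enriched

    enrich_record({"name": "Jane Doe", "phone": "030 1234567"})
    # -> {"name": "Jane Doe", "phone": "030 1234567", "city": "Berlin"}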

How DataCater makes streaming ETL accessible to data teams

Data and dev teams can use DataCater to benefit from the power of streaming ETL pipelines without handling their complexity. DataCater's Pipeline Designer allows teams to collaborate on streaming data pipelines and implement most business requirements without coding. DataCater offers plug & play connectors for many data systems and allows users to join, filter, and transform data using more than 50 no-code functions. Technical users can implement custom requirements using Python® transformations and share them with their team.
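To give an impression of what such a custom transformation can look like, the following Python function masks an email address before it reaches the sink. The function name and signature are purely illustrative and not necessarily DataCater's exact interface:

    # Illustrative custom transformation: mask an email address.
    # The signature is a sketch, not DataCater's actual API.
    def transform(value: str) -> str:
        if value is None or "@" not in value:
            return value
        local_part, domain = value.split("@", 1)
        return local_part[0] + "***@" + domain

    transform("jane.doe@example.com")  # -> "j***@example.com"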

[Image: DataCater's Pipeline Designer]

Optimized resource consumption

Streaming data pipelines use resources much more efficiently than their batch-based counterparts for two main reasons.

First, streaming data pipelines typically employ change data capture for extracting data from data sources. As a consequence, they process only the relevant data, the changes, whereas batch data pipelines perform recurring bulk loads that extract all data from data sources regardless of whether they have changed since the last run.

Second, compute clusters for running batch jobs are provisioned for load peaks and cannot dynamically scale up or down with the current load. They must be kept at full capacity all the time, which leaves them mostly idle between runs. Streaming data pipelines have much more predictable workloads than batch pipelines and can benefit from the elastic scalability of cloud technologies, such as Kubernetes, which allows them to make the best use of compute resources.