A comparison between event-based streaming data pipelines and their batch-based counterparts.
Data pipelines extract data from a data source, apply operations, such as transformations, filters, joins, or aggregations to the data, and publish the processed data to a data sink.
While most people are familiar with the traditional batch-based approach to processing data, streaming data pipelines are still the new kid on the block and rarely used in practice, despite having matured considerably over the last couple of years and offering promising advantages over their batch-based counterparts.
In this article, we want to shed light on streaming data pipelines and discuss how they compare to batch-based approaches. After introducing both kinds of data pipelines, this article discusses the most prevalent characteristics of each approach and their implications on the usage in practice.
Batch data pipelines are executed on demand or on a recurring schedule. In each run, they extract all data from the data source, apply operations to the data, and publish the processed data to the data sink. A run finishes once all data have been processed.
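A single run of such a pipeline can be sketched in a few lines. This is a minimal illustration, not any specific product's API; the data and field names are made up for the example:

```python
# Minimal batch pipeline: each run reads the ENTIRE source,
# applies operations, and overwrites the sink.

def run_batch_pipeline(source, sink):
    # 1. Extract: read all rows from the data source at once.
    rows = list(source)

    # 2. Apply operations: here, a filter and a transformation.
    processed = [
        {**row, "amount_eur": row["amount_cents"] / 100}
        for row in rows
        if row["status"] == "paid"
    ]

    # 3. Publish: replace the sink's content with the new result.
    sink.clear()
    sink.extend(processed)
    return len(processed)  # the run finishes once all data are processed


orders = [
    {"id": 1, "status": "paid", "amount_cents": 1250},
    {"id": 2, "status": "open", "amount_cents": 900},
    {"id": 3, "status": "paid", "amount_cents": 499},
]
warehouse = []
run_batch_pipeline(orders, warehouse)
```

Note that the pipeline has no memory between runs: the next execution would again read all three orders, whether or not they changed.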
The execution time of a batch data pipeline depends on the size of the consumed data source and typically ranges from multiple minutes to a few hours, even when applying techniques like parallelization. Given that batch data pipelines increase the load on the data source, they are often executed during times of low user activity, for instance, each night at 2 am, to avoid impacting other workloads.
Typical use cases for batch data pipelines have complex requirements on the data processing, such as joining dozens of different data sources (or tables), and are not time-sensitive. Examples include payroll, billing, or low-frequency reports based on historical data.
Batch data pipelines allow for multiple observations:
Batch data pipelines always know the entire data set at the start of their execution, which eases the implementation of operations, such as joins and aggregations, that must access the data set as a whole.
Batch data pipelines can often directly connect to data sources or data sinks, e.g., using JDBC drivers when integrating database systems, without any intermediate layer in between them.
Typically, batch data pipelines do not track data changes. They must process all data in every run, regardless of whether the data have changed since the last run, which may waste computing resources.
Data sinks always reflect the state of the data source at the start time of the latest run of the batch data pipeline. Depending on how frequently the data source changes, the data sink may therefore often be outdated.
Executing batch data pipelines impacts the performance of the consumed data source, especially when dealing with very large data sets, because they need to extract all data at once.
As opposed to batch data pipelines, streaming data pipelines are executed continuously. They consume streams of messages, apply operations, such as transformations, filters, aggregations, or joins, to the messages, and publish the processed messages to another stream.
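The contrast to the batch model is easiest to see in code. In this minimal sketch (again with made-up data and field names), messages are handled one at a time as they arrive, and the pipeline itself never "finishes":

```python
# Minimal streaming pipeline: messages are processed one by one,
# as they arrive; the pipeline runs for as long as the stream does.

def streaming_pipeline(input_stream):
    """Consume a (potentially unbounded) stream of messages,
    apply a filter and a transformation, and yield the results."""
    for message in input_stream:
        if message["status"] != "paid":  # filter
            continue
        # transformation
        yield {**message, "amount_eur": message["amount_cents"] / 100}


# In production the input would be an unbounded stream (e.g. a topic
# in an event streaming platform); a finite list stands in for it here.
events = [
    {"id": 1, "status": "paid", "amount_cents": 1250},
    {"id": 2, "status": "open", "amount_cents": 900},
]
output = list(streaming_pipeline(iter(events)))
```

Because the pipeline is a generator over the input stream, each processed message can be published to the output stream immediately, without waiting for the rest of the data.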
Typically, they are deployed together with connectors, which integrate external data sources and data sinks with the streaming platform.
In such a scenario, the stream produced by the data source connector resembles the changelog of the data source, containing one message for each performed data change event, i.e., insertion, update, or deletion.
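Such changelog messages typically carry the state of the affected row before and after the change. The following layout is illustrative only (real change data capture formats, such as Debezium's, differ in detail), but it shows how replaying the changelog in order reconstructs the source's state:

```python
# Illustrative change data capture (CDC) messages: one message per
# insertion, update, or deletion performed on the source table.
change_events = [
    {"op": "insert", "key": 42, "before": None,
     "after": {"id": 42, "email": "a@example.com"}},
    {"op": "update", "key": 42,
     "before": {"id": 42, "email": "a@example.com"},
     "after":  {"id": 42, "email": "b@example.com"}},
    {"op": "delete", "key": 42,
     "before": {"id": 42, "email": "b@example.com"},
     "after": None},
]

# Replaying the changelog in order reconstructs the table's state.
state = {}
for event in change_events:
    if event["after"] is None:
        state.pop(event["key"], None)  # a deletion removes the row
    else:
        state[event["key"]] = event["after"]
```

After replaying all three events, the row with key 42 has been inserted, updated, and finally deleted, so the reconstructed state is empty again.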
Traditional message queues, such as RabbitMQ, delete messages once they have been consumed. Modern event stores, such as Apache Kafka, can retain messages for as long as needed: for a certain time period, up to a certain amount of storage space, or even forever. This not only makes it possible to reconstruct the state of a data set as of a certain message (or event) but also enables reprocessing (or replaying) messages.
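The difference boils down to consuming being a non-destructive read. The following toy log (a simplified stand-in for a partition in an event store, not Kafka's actual implementation) illustrates how retained messages allow any consumer to replay from an earlier offset:

```python
# Sketch of an append-only log with offsets: consuming does NOT
# delete messages, so consumers can re-read ("replay") the log
# from any earlier offset.

class Log:
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read_from(self, offset):
        # Messages stay in the log after being read.
        return self._messages[offset:]


log = Log()
for value in ["created", "updated", "deleted"]:
    log.append(value)

first_pass = log.read_from(0)  # a consumer reads everything
replayed = log.read_from(1)    # later, it replays from offset 1
```

In a queue that deletes on consumption, the second read would return nothing; here, both reads succeed because the log retains its messages.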
Common use cases for streaming data pipelines are time-sensitive and require insight into the most recent changes in a certain data store. Examples include fraud detection, critical reports supporting important operational decisions, monitoring of customer behavior, and cybersecurity.
Streaming data pipelines reveal multiple strengths and weaknesses:
Streaming data pipelines need to process only the data that have changed, leaving all other data untouched and improving the overall usage of computing resources.
Streaming data pipelines keep data sinks always in sync with data sources. They minimize the gap between the time a change event occurs in the data source and the time the processed event arrives at the data sink.
When employing log-based change data capture connectors, streaming data pipelines reduce the load on the data source because they do not need to execute full queries but can extract data from the log file of the database system.
Streaming data pipelines must reconstruct data sets when performing operations, such as joins and aggregations, that require knowledge of the whole data set. Log compaction often strongly improves the performance of such operations.
Streaming data pipelines must be employed in combination with additional connectors when consuming data from external data sources or publishing processed data to external data sinks, which increases operational overhead.
While there are many mature products for working with batch data pipelines, tooling for streaming data pipelines is still evolving and certainly something to keep an eye on in the upcoming years.
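The log compaction mentioned above keeps only the latest message per key, which shrinks the amount of data a streaming pipeline must re-read when reconstructing a data set. A minimal sketch of the idea (the function and keys are illustrative, not a real event store's compaction implementation):

```python
# Log compaction: retain only the latest message per key, so that
# reconstructing a data set requires reading one message per key
# instead of the full change history.

def compact(log):
    latest = {}
    for key, value in log:  # later messages overwrite earlier ones
        latest[key] = value
    # Compacted log: one (key, value) pair per key, latest value only.
    return list(latest.items())


changelog = [
    ("user-1", {"email": "a@example.com"}),
    ("user-2", {"email": "c@example.com"}),
    ("user-1", {"email": "b@example.com"}),  # supersedes the first message
]
compacted = compact(changelog)
```

The compacted log has two messages instead of three; a pipeline that needs the current state of all users reads it in a single pass without replaying superseded changes.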
In theory, data architectures could employ only one of the two approaches to data pipelining.
When executing batch data pipelines with a very high frequency, the replication delay between data sinks and data sources would shrink and come close to that of streaming data pipelines. However, as the amount of data in the data source grows, this approach quickly becomes infeasible.
Similarly, streaming data pipelines could be used to implement very complex data integration requirements, e.g., joining dozens of tables. However, especially for large data sets, this would severely impact the performance due to the necessary reconstructions of data sets and lack of optimization techniques, like database indexes.
Based on our experience, most data architectures benefit from employing both batch and streaming data pipelines, which allows data experts to choose the best approach depending on the use case.
While streaming data pipelines excel in transferring data very fast and can apply simple to moderate transformations, batch data pipelines show their strengths in performing very complex processing steps.
Streaming data pipelines may be employed, for instance, for extracting data from an operational database or an external web service and ingesting the data into a data warehouse or data lake. In contrast, batch data pipelines may be used for joining dozens of different database tables in preparation for complex, low-frequency reports.
This article introduced batch and streaming data pipelines, presented their key characteristics, and discussed both their strengths and weaknesses.
Neither batch nor streaming data pipelines are one-size-fits-all solutions but must be employed in combination to provide the most benefits to downstream use cases.
Compared to their batch-based counterparts, streaming data pipelines are still at a very early stage. While technologies like Apache Kafka are already in use at many companies, they still require users to execute many repetitive tasks, wasting the time of data experts.
DataCater aims to advance the state of streaming data pipelines by providing a complete platform for building, managing, and deploying them. DataCater offers efficient tools for managing data connectors and can be used by data teams to work together on continuous data preparation.
Please feel free to reach out to us if you want to learn more about DataCater, see our product in action, or just chat about data preparation.