What is a Data Pipeline?

An introduction to the concepts, use cases and most important characteristics of data pipelines.

By Stefan Sprenger

The Romans were known for constructing large numbers of aqueducts to efficiently transport water from faraway springs to the places of consumption, mainly cities or towns. The concept of a data pipeline is quite similar to that of an aqueduct: Data pipelines move data from the place of creation, the data source, to the place of usage, the data sink.

This article provides an introduction to the concepts behind data pipelines, describes their most important use cases, and presents the key characteristics used to differentiate between data pipelines.

The Data Pipeline

Data pipelines move data from data sources to data sinks. Data pipelines may even add value to the data while transporting them: They may remove errors from the data, enrich the data with additional information, combine the data with other data sets, or transform the data into a format suitable for downstream usage.

In general, data pipelines perform the following three main tasks in sequential order:

  • Data extraction: Data pipelines extract raw data from the consumed data source and convert the extracted data into an internal representation.

  • Data transformation: Data pipelines apply a series of different transformations to the data.

  • Data ingestion: Data pipelines ingest the transformed data into the data sink, which persists the data for downstream usage.

Figure: The different phases of data pipelining: data extraction, data transformation, and data ingestion.
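
To make the three phases concrete, here is a minimal Python sketch that wires them together for a hypothetical CSV source and a JSON sink; the file names and field names are illustrative assumptions, not part of any specific product.

    import csv
    import json

    def extract(path):
        # Data extraction: read raw records from the data source (a CSV file here)
        # and convert them into an internal representation (a list of dicts).
        with open(path, newline="") as source:
            return list(csv.DictReader(source))

    def transform(records):
        # Data transformation: prepare the records for downstream usage, e.g.,
        # drop records with missing values and normalize a field.
        cleaned = []
        for record in records:
            if not record.get("email"):
                continue  # remove records without an email address
            record["email"] = record["email"].strip().lower()
            cleaned.append(record)
        return cleaned

    def ingest(records, path):
        # Data ingestion: persist the transformed records in the sink's format (JSON here).
        with open(path, "w") as sink:
            json.dump(records, sink, indent=2)

    ingest(transform(extract("customers.csv")), "customers.json")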

Data Extraction

Data pipelines directly connect to the consumed data source, e.g., a database system, a REST API, or a file system, to collect the data of interest. Since each data store typically brings its own data format and type system, the first step of the data extraction phase is converting the extracted data into an internal representation that allows further processing.
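
As a sketch of this first step, the snippet below reads rows from a relational data source (SQLite stands in for an arbitrary database system) and converts them into a source-agnostic internal representation; the table and column names are made up for illustration.

    import sqlite3

    def extract_orders(db_path):
        # Connect to the consumed data source (here: a SQLite database as a stand-in).
        connection = sqlite3.connect(db_path)
        connection.row_factory = sqlite3.Row
        try:
            rows = connection.execute("SELECT id, amount, created_at FROM orders").fetchall()
            # Convert the source-specific rows into an internal representation
            # (plain dicts) that the subsequent pipeline steps can process uniformly.
            return [dict(row) for row in rows]
        finally:
            connection.close()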

Most data pipelines are executed either on a recurring schedule (batch) or continuously (streaming). As a consequence, data pipelines must track the schema of the consumed data source and be able to detect changes, e.g., deletions of attributes used in transformations, which would break their execution.

Furthermore, as laid out in a previous article, being able to detect which data have changed between runs of a data pipeline strongly improves the efficiency of the data extraction.
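
Building on the SQLite stand-in above, one common approach, sketched here under the assumption of an updated_at column, is to remember a cursor, such as the highest modification timestamp seen in the previous run, and to extract only records that changed since then:

    import sqlite3

    def extract_changes(db_path, last_seen_at):
        # Extract only records created or updated since the previous run,
        # instead of re-reading the entire data source.
        connection = sqlite3.connect(db_path)
        connection.row_factory = sqlite3.Row
        try:
            rows = connection.execute(
                "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
                (last_seen_at,),
            ).fetchall()
            records = [dict(row) for row in rows]
            # The largest updated_at value becomes the cursor for the next run.
            next_cursor = max((r["updated_at"] for r in records), default=last_seen_at)
            return records, next_cursor
        finally:
            connection.close()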

Data Transformation

After extracting the data, the data pipeline takes care of preparing the data for downstream usage. Data transformation involves operations such as replacing missing values, manipulating the schema of the data, unifying different formats, removing duplicates, etc.

Data pipelines may even enrich the data with external information, e.g., convert currencies according to the current exchange rate retrieved from an external service, or combine the data with other data sets, e.g., join sales data with customer data.
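
The following sketch combines a few of these transformations on the internal representation introduced above: it removes duplicates, replaces missing values, enriches amounts with a currency conversion, and joins in customer data; the exchange rate and all field names are assumptions made for illustration.

    def transform(orders, customers, usd_per_eur=1.08):
        # customers is assumed to be a dict mapping customer_id -> customer record.
        seen_ids = set()
        transformed = []
        for order in orders:
            # Remove duplicates based on the order id.
            if order["id"] in seen_ids:
                continue
            seen_ids.add(order["id"])
            # Replace missing values with a sensible default.
            amount_eur = order.get("amount_eur") or 0.0
            # Enrich the data: convert the amount using an (assumed) exchange rate.
            order["amount_usd"] = round(amount_eur * usd_per_eur, 2)
            # Combine with another data set: join the order with its customer.
            order["customer"] = customers.get(order["customer_id"], {})
            transformed.append(order)
        return transformed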

Data Ingestion

Once all transformations have been applied, the data pipeline takes care of ingesting the prepared data into the data sink. Similar to the extraction of data, this step may involve converting the data from the internal representation into the format used by the data sink, e.g., CSV, Parquet, or JSON.

Depending on the data sink, the data pipeline may need to implement some kind of buffering to avoid ingesting data at a higher rate than the data sink can sustain.
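
A minimal sketch of such buffering: records are collected into batches and handed to the sink in chunks, so the sink never receives more records at once than it can handle. The newline-delimited JSON file used as a sink here is only a stand-in for a real data store.

    import json

    def ingest(records, sink_path, batch_size=500):
        # Buffer records and write them to the data sink in batches, so that the
        # ingestion rate never exceeds what the sink can sustain.
        buffer = []
        with open(sink_path, "a") as sink:
            for record in records:
                buffer.append(record)
                if len(buffer) >= batch_size:
                    write_batch(sink, buffer)
                    buffer = []
            if buffer:
                write_batch(sink, buffer)

    def write_batch(sink, batch):
        # Convert the internal representation into the sink's format
        # (newline-delimited JSON in this illustrative example).
        for record in batch:
            sink.write(json.dumps(record) + "\n")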

Use Cases

Many different applications require the exchange of data between data stores and, to this end, make use of data pipelines. In the following, we list the most popular use cases.

Machine Learning

The performance of machine learning models strongly depends on the quality of the training data: “If Your Data Is Bad, Your Machine Learning Tools Are Useless”. For this reason, data experts rarely train machine learning models directly on raw data but first prepare the raw data for training. Data pipelines are commonly used to implement this data preparation and handle tasks such as replacing invalid values, dropping attributes, or changing date formats.
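
A small, hypothetical data preparation step in Pandas, covering exactly these tasks; the column names are assumptions.

    import pandas as pd

    def prepare_training_data(raw: pd.DataFrame) -> pd.DataFrame:
        prepared = raw.copy()
        # Replace invalid values: negative ages are treated as missing and imputed.
        prepared["age"] = prepared["age"].where(prepared["age"] >= 0)
        prepared["age"] = prepared["age"].fillna(prepared["age"].median())
        # Drop attributes that must not be used for training.
        prepared = prepared.drop(columns=["customer_name", "email"])
        # Change date formats: parse strings into datetimes and derive features.
        prepared["signup_date"] = pd.to_datetime(prepared["signup_date"], errors="coerce")
        prepared["signup_year"] = prepared["signup_date"].dt.year
        return prepared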

Data Warehousing

Data warehouses integrate data sources for analytical applications. Data pipelines can be used to implement the import of data into a data warehouse. Data pipelines extract data from operational data sources, transform and integrate the data in preparation for the analytical use cases, and eventually load the prepared data into the data warehouse.

Search Indexing

Search engines like Apache Solr or Elasticsearch provide powerful search interfaces to data. Data pipelines can be used to load data from data stores into one of these search engines and to perform preparatory transformations on the way, e.g., removing stopwords, replacing synonyms, or stemming words.
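
A pure-Python sketch of such preparatory transformations; the stopword list, the synonym map, and the naive suffix-stripping "stemmer" are simplified placeholders, and the actual indexing would afterwards happen through the search engine's own client, e.g., its bulk API.

    STOPWORDS = {"the", "a", "an", "and", "or", "of"}
    SYNONYMS = {"laptop": "notebook", "tv": "television"}

    def prepare_for_indexing(text):
        tokens = []
        for word in text.lower().split():
            if word in STOPWORDS:
                continue  # remove stopwords
            if word.endswith("s"):
                word = word[:-1]  # extremely naive stemming, for illustration only
            word = SYNONYMS.get(word, word)  # replace synonyms with a canonical term
            tokens.append(word)
        return " ".join(tokens)

    # Example: prepare_for_indexing("The best laptops of 2021") returns "best notebook 2021".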

Data Migration

When migrating from one data store to another, for instance from an on-premises MySQL database to an AWS RDS for MySQL instance, data pipelines can be used to load data from the old into the new data store, enabling a smooth transition. Often, the new data store imposes a different schema than the old one; in such cases, data pipelines can perform schema manipulations to enable the migration.
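
A hypothetical schema manipulation during such a migration might map the old store's column names and value formats onto the new store's schema; the mapping below is invented for illustration.

    # Maps column names of the old data store onto the schema of the new data store.
    COLUMN_MAPPING = {
        "cust_name": "customer_name",
        "created": "created_at",
        "amount_cents": "amount",
    }

    def migrate_record(old_record):
        new_record = {}
        for old_name, value in old_record.items():
            new_name = COLUMN_MAPPING.get(old_name, old_name)
            if old_name == "amount_cents":
                value = value / 100.0  # the new schema stores amounts as decimals
            new_record[new_name] = value
        return new_record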

Key Characteristics of Data Pipelines

The diversity of the presented use cases shows that no two data pipelines are alike. In the following, we present the key characteristics of data pipelines with the aim of helping you choose the data pipeline implementation that is most beneficial for your use case.

Batch vs. Streaming

Batch and streaming are two fundamentally different execution modes for data pipelines.

For each run, batch data pipelines extract all data from the data source, transform them according to the pipeline definition, and publish the transformed data to the data sink, overwriting the output of previous runs. Typically, batch data pipelines have no knowledge of which data have changed between runs and must always process the entire data source.

As opposed to batch data pipelines, which pause processing until the next scheduled run, streaming data pipelines are executed continuously. They monitor the consumed data source for updates (insertions, updates, or deletions of records), which are extracted and processed in real time. In contrast to batch data pipelines, streaming data pipelines consider only the changed data.

While batch data pipelines may be beneficial for data sets that never or rarely change, streaming data pipelines are the implementation of choice for data sets that change frequently.
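
The difference can be sketched as follows: the batch variant re-processes the full data source on every scheduled run, while the streaming variant reacts to individual change events as they arrive. The source, sink, change stream, and transform used below are placeholders, not a concrete API.

    def run_batch_pipeline(source, sink):
        # Batch: on each run, extract the entire data source, transform it,
        # and overwrite the output of the previous run in the sink.
        records = source.extract_all()
        sink.overwrite(transform(records))

    def run_streaming_pipeline(change_stream, sink):
        # Streaming: continuously consume change events (insertions, updates,
        # deletions) and process each change as soon as it arrives.
        for change in change_stream:  # e.g., a change data capture stream
            if change["op"] == "delete":
                sink.delete(change["key"])
            else:
                sink.upsert(transform([change["record"]])[0])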

ETL vs. ELT

Extract Transform Load (ETL) is the classic heavy-lifting task performed to prepare and integrate data before loading them into a data warehouse. In this case, the data warehouse is mainly responsible for serving the data to analytical applications.

In Extract Load Transform (ELT), the data are first loaded into the data warehouse; data preparation and integration are then performed inside the warehouse. In this scenario, the data warehouse is not only responsible for serving the data but also takes care of applying transformations to them. In recent years, ELT has become increasingly popular and is very common with cloud-based data warehouse systems, like Google Cloud BigQuery or Snowflake.

While ETL pipelines typically perform complex operations, such as multi-table joins, ELT pipelines only apply preparatory transformations, if at all, to data before loading them into the data warehouse.
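
The contrast is mainly about where the transformation runs. In the sketch below, which uses SQLite as a stand-in for a data warehouse, the ETL variant transforms the records in the pipeline before loading them, while the ELT variant loads the raw records first and then lets the warehouse execute the transformation as SQL; the tables and the currency conversion are illustrative.

    import sqlite3

    def etl(records, warehouse: sqlite3.Connection):
        # ETL: transform in the pipeline, then load the prepared data.
        prepared = [(r["id"], r["amount_eur"] * 1.08) for r in records]
        warehouse.executemany("INSERT INTO sales (id, amount_usd) VALUES (?, ?)", prepared)

    def elt(records, warehouse: sqlite3.Connection):
        # ELT: load the raw data first ...
        raw = [(r["id"], r["amount_eur"]) for r in records]
        warehouse.executemany("INSERT INTO raw_sales (id, amount_eur) VALUES (?, ?)", raw)
        # ... then transform inside the warehouse, typically expressed in SQL.
        warehouse.execute(
            "INSERT INTO sales (id, amount_usd) SELECT id, amount_eur * 1.08 FROM raw_sales"
        )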

Code vs. No Code

Technically, data pipelines are software programs that need to be implemented in code. Popular tools include programming languages such as Python, Scala, Java, or even SQL, and libraries like Spark, Kafka Streams, Flink, or Pandas.

Implementing data pipelines in software code involves lots of repetitive work, e.g., the management of data schemas, the conversion of data between different formats, the management of data connectors, and last but not least the implementation of transformation functions.
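
As an illustration, even a tiny hand-coded pipeline in Pandas, such as the sketch below with assumed file and column names, leaves connectors, schema management, scheduling, and monitoring to be built and maintained around it.

    import pandas as pd

    # Extract
    orders = pd.read_csv("orders.csv")
    # Transform
    orders = orders.dropna(subset=["customer_id"]).drop_duplicates(subset=["order_id"])
    orders["amount_usd"] = orders["amount_eur"] * 1.08
    # Load
    orders.to_csv("orders_prepared.csv", index=False)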

No-code data pipeline platforms, like DataCater, automate these repetitive tasks and thereby save users a lot of time when working with data pipelines. They typically provide interactive user interfaces that can be used to efficiently build and manage data pipelines.

About DataCater

DataCater is a no-code data preparation platform that provides interactive means for building, managing, and deploying streaming data pipelines. While DataCater allows users to build high-quality data pipelines without writing a single line of code, it still supports user-defined transformation functions, implemented in Python, to cope with special requirements.

We would be happy to show you the capabilities of DataCater in a short live demo. Please reach out to us to set up an appointment.