Cloud-Native Data Pipelines

Enhance your data pipelines by applying cloud-native principles.

Declarative data pipelines

Declarative data pipelines allow for more reliable, resilient, and reproducible deployments, as well as faster iterations in development. DataCater offers a YAML-based representation of data pipelines, heavily inspired by Kubernetes' custom resource definition files, which can be exported, imported, and edited through the Pipeline Designer's UI.

The following code listing shows an example pipeline in DataCater's YAML format:

apiVersion: "datacater.io/v1"
kind: "Pipeline"
metadata:
  id: "42"
spec:
  filters: []
  transformationSteps:
  - name: "Transform name to lowercase notation"
    transformations:
    - attributeName: "name"
      transformation: "lowercase"
  - name: "Enrich phone numbers"
    transformations:
    - attributeName: "phone_number"
      transformation: "user-defined-transformation"
      transformationConfig:
        code: |-
          import requests
          import json

          def transform(value, row):
            # Enrich the phone number by calling an external API
            api_response = requests.post("https://...", data={ "phone_number": value })

            # Return the API response as a JSON-encoded string
            return json.dumps(api_response.json())

Managing immutable revisions of your data pipelines

DataCater compiles streaming data pipelines to Apache Kafka Streams applications for execution. For each pipeline revision, users can create an immutable container image representing the state of the pipeline at the time, applying the Image Immutability Principle to data pipelines.

Immutable container images offer reproducible pipeline deployments, allow users to go back and forth between different pipeline revisions at any time, and form the basis for deploying pipelines as self-contained containers. They also allow users to trace processed data back to the pipeline revision that transformed it, offering a high degree of transparency into data processing.

DataCater Self-Managed allows you to plug any container registry into DataCater and outsource the management of pipeline revisions to the tool of your choice.
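
As an illustration, each pipeline revision can be thought of as mapping to its own immutable image reference in the configured registry. The registry host, repository layout, and tag scheme in the following sketch are assumptions made for the example, not DataCater's actual naming convention:

# Hypothetical mapping of pipeline revisions to immutable container images;
# the registry host and tag scheme are illustrative only.
pipeline: "42"
revisions:
- revision: 1
  image: "registry.example.com/datacater/pipeline-42:rev-1"
- revision: 2
  image: "registry.example.com/datacater/pipeline-42:rev-2"

Because each tag is written once and never overwritten, redeploying revision 1 at a later point in time yields exactly the same pipeline behavior as its original deployment.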

Running data pipelines as containers

DataCater deploys immutable revisions of streaming data pipelines as non-privileged containers, using Kubernetes (or Docker). DataCater relies on containers for pipeline execution for multiple reasons. First, containers allow us to apply the Self-Containment Principle and isolate data pipelines from other components, services, and the rest of our platform, which is an important trait when running a data pipeline platform at scale. Second, containers allow DataCater to define resource requests and limits at the pipeline level and enable the elastic scaling of streaming data pipelines depending on the current load.
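
For illustration, a minimal Kubernetes Deployment for such a pipeline container could look like the following sketch; the image reference, labels, and resource values are assumptions for the example, not manifests generated by DataCater:

apiVersion: "apps/v1"
kind: "Deployment"
metadata:
  name: "pipeline-42"
spec:
  replicas: 1                          # can be scaled elastically with the current load
  selector:
    matchLabels:
      app: "pipeline-42"
  template:
    metadata:
      labels:
        app: "pipeline-42"
    spec:
      containers:
      - name: "pipeline"
        # Immutable image of one pipeline revision (illustrative reference)
        image: "registry.example.com/datacater/pipeline-42:rev-2"
        securityContext:
          runAsNonRoot: true           # run as a non-privileged container
          allowPrivilegeEscalation: false
        resources:
          requests:                    # resource requests defined per pipeline
            cpu: "250m"
            memory: "512Mi"
          limits:                      # resource limits defined per pipeline
            cpu: "1"
            memory: "1Gi"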

By running data pipelines as containers, DataCater Self-Managed can easily integrate with existing tools for monitoring and logging and make pipeline logs available for investigation outside of the DataCater platform.
