Why we run Data Pipelines as Containers

Five reasons why we deploy data pipelines as containers: Ease of integration, Security, Scalability, Immutability, and Robustness.

By Stefan Sprenger

DataCater is the self-service platform for streaming data pipelines. Our goal is to enable data teams to benefit from the power of Apache Kafka®-based data pipelines, without handling the operational complexity.

Our no-code pipeline designer - with first-class support for Python-based data transformations - allows users to fully automate data preparation without the inaccessibility and complexity of manually programmed streaming applications.
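
To make this concrete, here is a minimal sketch of what a Python-based transformation step could look like. The function name and signature are hypothetical and only illustrate the idea of transforming one change event at a time:

```python
# Hypothetical transformation step operating on a single record (as a dict).
# The exact signature expected by the pipeline designer may differ.
def transform(record: dict) -> dict:
    """Normalize the email address field of an incoming change event."""
    email = record.get("email")
    if email is not None:
        record["email"] = email.strip().lower()
    return record
```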

DataCater extensively uses the ecosystem of Apache Kafka. In addition to storing data change events in Apache Kafka and using Apache Kafka Connect for data connectors, it compiles data pipelines to Apache Kafka Streams applications that are packaged as images and deployed as containers.

In this article, we would like to share the five main reasons why we decided to use containers for the deployment of data pipelines.

Ease of integration

We not only ship DataCater as a SaaS product - called DataCater Cloud - but also distribute it as a self-managed installation.

Containers are the de facto standard for running applications in cloud environments. All major cloud providers offer managed platforms, such as Kubernetes, for deploying containers.

Choosing containers as the runtime for data pipelines makes it straightforward to deploy DataCater Self-Managed in private or public clouds. Since most cloud platforms offer container-aware services for logging, monitoring, alerting, and more, DataCater can integrate natively with existing tooling instead of requiring custom implementations.

Security

Because we support Python-based transformations, we do not have full control over the code that runs inside a data pipeline. While we do not believe that our users will implement malicious code on purpose, there is always the chance that custom code has accidental side effects.

There are multiple approaches to safely running custom user code. First, one may perform static code analysis, scan the code for malicious patterns, and block execution if anything harmful is detected. Second, one might disable language features that give users access to shared resources, e.g., the file system, and could allow them to interfere with other parts of the system.

While static code analysis might miss edge cases and thus pose a security risk, limiting language features leads to a poor user experience because users cannot work with the tools they are used to. We decided against both approaches and instead run data pipelines in a fully isolated environment - containers - which prevents user code from breaking out (assuming non-privileged containers and the use of user namespaces). Custom code can neither access other data pipelines nor impact the rest of the platform, allowing us to achieve a very high level of security while providing the best possible user experience.
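
To sketch what "non-privileged containers" can mean in practice, the snippet below uses the official Kubernetes Python client to define a locked-down container for a pipeline. The image and all resource names are hypothetical; this is an illustrative hardening example, not DataCater's actual deployment code.

```python
from kubernetes import client

# Hypothetical, locked-down container spec for a pipeline (illustrative only).
security_context = client.V1SecurityContext(
    privileged=False,                  # no access to host devices
    allow_privilege_escalation=False,  # block privilege escalation at runtime
    run_as_non_root=True,              # refuse to start as the root user
    read_only_root_filesystem=True,    # user code cannot modify the image
    capabilities=client.V1Capabilities(drop=["ALL"]),  # drop all Linux capabilities
)

container = client.V1Container(
    name="pipeline",
    image="registry.example.com/pipelines/orders:1",  # hypothetical image
    security_context=security_context,
)

pod_spec = client.V1PodSpec(containers=[container], restart_policy="Always")
```

Combined with user namespaces at the container runtime level, such a configuration keeps custom Python code contained even if it misbehaves.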

Scalability

Different use cases place different performance requirements on data pipelines. While some pipelines might need to process ten events per minute, others may need to process thousands of events per second. For efficiency reasons, we do not want to provision the same hardware resources for every pipeline but rather scale each one according to its individual needs, which might even fluctuate over time.

By running data pipelines as containers, we can leverage existing approaches to the elastic scaling of applications. Kubernetes’ Horizontal Pod Autoscaler allows us to dynamically start (or stop) instances of a data pipeline depending on the current load and gives us fine-grained control over, for instance, the minimum and maximum number of instances.
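
As an illustration, the following snippet creates such an autoscaling policy with the Kubernetes Python client, scaling a hypothetical pipeline Deployment between one and ten replicas based on CPU utilization. It is a sketch under those assumptions, not DataCater's actual implementation.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes local kubeconfig access to the cluster

# Hypothetical autoscaling policy for a pipeline Deployment named "orders-pipeline".
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="orders-pipeline"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="orders-pipeline"
        ),
        min_replicas=1,                        # never scale below one instance
        max_replicas=10,                       # upper bound on instances
        target_cpu_utilization_percentage=75,  # add replicas above 75% CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="pipelines", body=hpa
)
```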

Immutability

At any point in time, users of our pipeline designer can create a deployment, which compiles the data pipeline to an Apache Kafka Streams application and packages the code as a container image. Users can manage multiple deployments per pipeline in parallel and roll back to previous deployments if needed.

Running data pipelines as containers - and packaging them as images - gives us immutability for free. Once a deployment has been created, it can no longer be changed. The immutability of deployments gives users tight control over the executed data pipelines and reduces the possibility of unexpected failures or accidental side effects.
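
To illustrate why immutable images make rollbacks cheap, here is a hedged sketch (Kubernetes Python client, hypothetical names and image tags) that rolls a pipeline back by pointing its Deployment at a previously built image:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes local kubeconfig access to the cluster

# Hypothetical rollback: re-point the pipeline Deployment at an earlier,
# immutable image tag that was produced by a previous deployment.
client.AppsV1Api().patch_namespaced_deployment(
    name="orders-pipeline",
    namespace="pipelines",
    body={
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "pipeline",
                         "image": "registry.example.com/pipelines/orders:1"}
                    ]
                }
            }
        }
    },
)
```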

Robustness

In contrast to batch data pipelines, which use one-off or recurring deployments, streaming data pipelines are always on. When it comes to uptime, we strive for 100% availability of the data pipelines to ensure that data changes are processed the instant they become available in the data source - at any time.

Data pipelines (or rather their containers) are not executed within the DataCater platform itself but on the underlying container platform, i.e., Kubernetes. By building on top of Kubernetes' failure tolerance, we achieve both high robustness towards failures and high availability of the data pipelines. Even if our core platform became unavailable for a short period of time, the data pipelines would keep running.


Did you like this article? Feel free to reach out to us. We would be happy to show you DataCater in action and share more insights.
