Data Pipeline Runtime Consistency with Containers

This article applies the principles of container-based application design to building and deploying data pipelines in the cloud era.

By Hakan Lofcali

In cloud-native environments, applications are deployed as containers. In Principles of Container-Based Application Design, Red Hat defined a set of principles that are now treated as industry standards for designing containerized applications. This section carries Red Hat's foundational work over to building and deploying data pipelines in the cloud era:

Image Immutability Principle (IIP)

“…, and once [containers are] built are not expected to change between different environments”.

For data pipelines, this should hold even within a single environment and across revisions of a pipeline: every revision is a distinct image that does not change once it is built. Data is at the heart of every organization, and its processing needs to be auditable both at runtime and in hindsight. IIP gives data teams and their organizations the ability to pinpoint the exact code that was used to transform and gain value from their data.
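One way to make this auditable is to record, for every pipeline run, which image executed it, and to accept only images referenced by content digest rather than by a mutable tag. The following is a minimal sketch under that assumption; `RunRecord`, `record_run`, and the registry path used in the usage note are hypothetical names chosen for illustration, not part of any specific platform.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Matches image references pinned by content digest,
# e.g. "<registry>/<name>@sha256:<64 lowercase hex characters>".
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned(image_ref: str) -> bool:
    """Return True if the reference names an image by digest, not by tag."""
    return DIGEST_REF.match(image_ref) is not None


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical audit entry linking one pipeline run to one image."""
    pipeline: str
    revision: str
    image_ref: str
    started_at: str


def record_run(pipeline: str, revision: str, image_ref: str) -> RunRecord:
    """Create an audit entry, rejecting images that could change under a tag."""
    if not is_pinned(image_ref):
        raise ValueError(f"image is not pinned by digest: {image_ref}")
    started_at = datetime.now(timezone.utc).isoformat()
    return RunRecord(pipeline, revision, image_ref, started_at)
```

Calling `record_run("orders", "rev-42", "registry.example.com/pipelines/orders:latest")` raises a `ValueError`, because a tag such as `latest` can later point to a different image, while a digest-pinned reference is accepted and stored with the run.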

Self-containment Principle (S-CP)

“This principle dictates that a container should contain everything it requires at build time.”

Because a self-contained image relies only on the availability of the Linux kernel, any library or even language can be used when a data pipeline is defined. Today, the majority of data transformation and data science work is done in Python, with an abundance of choice when it comes to feature engineering, data transformation, and ML/statistical model training. S-CP allows data teams to use their favorite tools and libraries to build a runtime environment tailored to their needs.
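Self-containment can be checked when the image is built rather than discovered when a pipeline fails in production. The sketch below assumes a Python-based pipeline and a hypothetical list of required modules; it uses the standard library's `importlib.util.find_spec` to report which modules are absent, so that a build step could fail early.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(required: Iterable[str]) -> List[str]:
    """Return the names in `required` that cannot be found in this environment."""
    missing = []
    for name in required:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

A build step could call `missing_modules` with the modules the pipeline imports and abort the build if the returned list is not empty.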

Runtime Confinement Principle (RCP)

“This RCP principle suggests that every container declare its resource requirements [CPU, Memory, Storage] and pass that information to the platform.”

Data pipelines can and should declare their resource requests. Being able to change these requests at runtime makes pipelines more adaptable and leads to fewer crashes and less downtime. This principle also makes it more transparent which resources are used by different data teams across an organization. RCP isolates resource consumption, and therefore also the failures caused by rogue data pipelines that consume more resources than expected.
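As an illustration, a pipeline's resource declaration can be kept as a small typed object and rendered into the `requests`/`limits` structure that Kubernetes expects for a container. This is a sketch that assumes Kubernetes as the platform; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineResources:
    """Hypothetical resource declaration for one data pipeline container."""
    cpu_request_millicores: int
    cpu_limit_millicores: int
    memory_request_mib: int
    memory_limit_mib: int

    def to_k8s(self) -> dict:
        """Render the declaration as a Kubernetes-style `resources` mapping."""
        if self.cpu_request_millicores <= 0 or self.memory_request_mib <= 0:
            raise ValueError("requests must be positive")
        if self.cpu_limit_millicores < self.cpu_request_millicores:
            raise ValueError("CPU limit must not be below the CPU request")
        if self.memory_limit_mib < self.memory_request_mib:
            raise ValueError("memory limit must not be below the memory request")
        return {
            "requests": {
                "cpu": f"{self.cpu_request_millicores}m",
                "memory": f"{self.memory_request_mib}Mi",
            },
            "limits": {
                "cpu": f"{self.cpu_limit_millicores}m",
                "memory": f"{self.memory_limit_mib}Mi",
            },
        }
```

Keeping the declaration next to the pipeline definition means that changing a request is an explicit, reviewable change, and the limits bound how much a misbehaving pipeline can consume.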

Summary

“Image Immutability”, “Self-containment”, and “Runtime Confinement” are three of Red Hat’s seven principles. These three have a high impact on developing, deploying, and operating data pipelines. Applying them to data pipeline runtimes results in more consistent testing, higher transparency of deployed code, and better overall observability.
