Let's make event streaming a commodity

Build the rock-solid foundation for your next-generation real-time business intelligence.

By Stefan Sprenger (CEO, DataCater), Alexander Gallego (CEO, Vectorized)

After a decade of software eating the world, we are witnessing the emergence of a new data layer in modern enterprises. Enterprises today produce more data each day than ever before. They employ analytical applications, machine learning models, dashboards, and reporting solutions to turn that data into value across the whole organization. Yet while enterprises produce data at the highest velocity ever, data consumers (data scientists, data analysts, etc.) often still have to wait days until the data is ready for use.

The traditional approach to building data pipelines between data sources and downstream applications is batch-based: pipelines periodically extract all data from the source, transform it, and load the prepared data into the destination. Because each run reprocesses the entire dataset, batch pipelines are highly inefficient and are therefore executed at a low frequency.

What if we could do better? Streaming data pipelines process data in real time instead of transferring it in batches: they employ change data capture (CDC) to extract changes from data sources as they happen, can transform the change events in real time, and ensure that downstream applications always work with current data.
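
To make this concrete, here is a sketch of what a single change event might look like, written as a Python dict. The envelope (before/after images plus an operation type) follows the style popularized by CDC tools such as Debezium; the field names are illustrative, not tied to any specific connector:

```python
# A change event captured from a relational database, sketched as a
# Python dict. CDC tools emit a similar envelope for every INSERT,
# UPDATE, or DELETE; the field names here are illustrative.
change_event = {
    "op": "u",               # operation: c(reate), u(pdate), or d(elete)
    "ts_ms": 1621324800000,  # when the change was captured
    "before": {"id": 42, "email": "ada@example.com", "plan": "free"},
    "after":  {"id": 42, "email": "ada@example.com", "plan": "pro"},
}

# A streaming pipeline reacts to each event as it arrives:
if change_event["op"] == "u" and change_event["after"]["plan"] == "pro":
    print(f"Customer {change_event['after']['id']} upgraded to pro")
```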

If event streaming is not news to you, you have probably also heard of Apache Kafka. More than ten years after the birth of the open-source project, we have to acknowledge that the Kafka API is the de-facto industry standard for event streaming. It is so dominant that other streaming platforms, such as Apache Pulsar or Azure Event Hubs, offer Kafka-compatible APIs.

However, despite being available for more than ten years, event streaming is neither easy to operate at enterprise scale nor accessible to data teams, which has prevented streaming data pipelines from replacing their batch-based counterparts. Let's have a look at two technologies that can change that!

Redpanda: Fixing the operational complexity of event streaming


Vectorized Redpanda is the Kafka-API compatible streaming engine for mission-critical workloads. Engineered from the ground up for today’s hardware, Redpanda brings inline WebAssembly (WASM) transforms and geo-replicated hierarchical storage, all in a single binary, with no ZooKeeper and no JVM.

Redpanda not only offers outstanding performance by better leveraging the underlying hardware, with flat and predictable tail latencies, but also achieves operational simplicity by removing the dependency on ZooKeeper. You no longer have to operate two distributed systems in parallel if you want to benefit from event streaming. In addition to saving your ops team headaches and overtime, the elimination of ZooKeeper helps to reduce the size of your streaming cluster, typically by a factor of 5!

For developers, one of the most interesting features of Redpanda is its support for inline WASM transforms. Developers can implement data transformations and filters in the language of their choice, e.g., JavaScript, and deploy the WASM binary to Redpanda with a single CLI command. WASM transforms are executed inside the Redpanda cluster, reducing the need for additional compute clusters, such as Apache Flink, and making GDPR-compliant implementations more feasible.
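
To illustrate the kind of logic an inline transform might contain, here is a minimal sketch in Python. The `transform` signature is hypothetical (Redpanda's WASM transform SDKs define their own entry points); the point is that a GDPR-style masking step can run inside the cluster, before data ever reaches a consumer:

```python
import hashlib

# Hypothetical per-record handler; the signature is illustrative, not
# the actual Redpanda WASM SDK. It pseudonymizes PII so that raw
# addresses never leave the streaming cluster (GDPR-style masking).
def transform(record: dict) -> dict:
    masked = dict(record)
    if "email" in masked:
        # A stable hash keeps downstream joins working without
        # exposing the raw address.
        masked["email"] = hashlib.sha256(masked["email"].encode()).hexdigest()
    return masked

print(transform({"id": 7, "email": "ada@example.com"}))
```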

Are you already using Apache Kafka and wondering how you might make the switch to Redpanda? The great thing about Redpanda is that it's 100% Kafka-API compatible. You can replace your existing Kafka deployment with Redpanda and keep using all the great tools and products from the Kafka ecosystem. Are you, for instance, running Kafka Connect to integrate your Kafka cluster with external data systems? Just point it to a Redpanda cluster and start benefiting from the operational simplicity.
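
Because the wire protocol is identical, existing Kafka client code does not change at all; you only point it at Redpanda's brokers. A minimal sketch using the kafka-python client (the broker address is an assumption):

```python
from kafka import KafkaConsumer, KafkaProducer

# The only Redpanda-specific detail is the broker address; everything
# else is unchanged Kafka client code.
BOOTSTRAP = "localhost:9092"  # assumed address of a Redpanda broker

producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
producer.send("orders", key=b"order-1", value=b'{"amount": 42}')
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating once the topic is drained
)
for message in consumer:
    print(message.key, message.value)
```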

DataCater: Fixing the accessibility of event streaming

DataCater is the self-service platform for streaming data pipelines, allowing data teams to benefit from event streaming without the complexity of manually programming streaming applications. DataCater is based on Apache Kafka and can be deployed on any Kafka-API compatible streaming platform.

DataCater’s no-code pipeline designer is possibly the most efficient approach to transforming, filtering, and enriching data in real time. More than 50 no-code functions help users automate the most common tasks in data preparation, while DataCater’s first-class support for Python-based transforms brings the development speed of Python notebooks to the world of event streaming. DataCater targets data scientists, data engineers, and data analysts, who can transition to event streaming without sacrificing their tech stack.
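
As a sketch of what such a Python-based transform could look like, consider a step that normalizes free-text country names as events stream through the pipeline. The function signature is hypothetical, not DataCater's actual transform API:

```python
# Hypothetical per-field transform in the spirit of notebook-style
# Python functions; the signature is illustrative, not DataCater's API.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "deutschland": "Germany",
}

def transform(value: str) -> str:
    """Normalize a free-text country name to a canonical form."""
    cleaned = value.strip().lower()
    return COUNTRY_ALIASES.get(cleaned, cleaned.title())

assert transform("  USA ") == "United States"
assert transform("deutschland") == "Germany"
assert transform("france") == "France"
```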

Technically, DataCater compiles pipelines to production-grade Apache Kafka Streams applications and deploys them on a container platform, such as Kubernetes. This approach not only offers native integration into existing cloud-based data architectures but also allows for the elastic scaling of streaming data pipelines.
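
Conceptually, each deployed pipeline behaves like the consume-transform-produce loop below. This Python sketch (kafka-python client, with assumed topic names and broker address) only illustrates the shape of the compiled application; the actual artifact is a Java Kafka Streams app:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Conceptual shape of a compiled pipeline: read from an input topic,
# apply the pipeline's steps to each event, write to an output topic.
# Topic names and broker address are assumptions for illustration.
consumer = KafkaConsumer(
    "customers-raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

for message in consumer:
    event = message.value
    event["email"] = event.get("email", "").lower()  # one pipeline step
    producer.send("customers-prepared", event)
```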

Modern data teams consist of both non-technical domain experts and engineers. DataCater’s project feature was built with this in mind and allows the different personas in data projects to collaborate on streaming data pipelines. Imagine engineers implementing custom business rules as Python functions, while non-coding domain experts interactively validate them without having to read code or manually inspect CSV files.
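
Such a business rule might be as small as the following sketch, a hypothetical filter function; in the pipeline designer, domain experts would validate it against live sample records instead of reading the code:

```python
# Hypothetical business rule written by an engineer as a filter
# function: keep only orders that are plausible enough to bill.
def is_billable(order: dict) -> bool:
    return (
        order.get("amount", 0) > 0
        and order.get("currency") in {"EUR", "USD"}
        and order.get("customer_id") is not None
    )

assert is_billable({"amount": 19.9, "currency": "EUR", "customer_id": 7})
assert not is_billable({"amount": -5, "currency": "EUR", "customer_id": 7})
```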

Beyond making event streaming accessible to data teams, DataCater’s pipeline designer excels at automating much of the repetitive work otherwise required when working with data pipelines. Customers report that DataCater saves them around 40% of their time, which is tremendous given that most data experts spend around 80% of their time on data preparation.

Transformations in the Pipeline Designer

Summary

Redpanda and DataCater are a rock-solid foundation for building your next-generation real-time business intelligence. Both products share the same philosophy: they aim to solve the Achilles’ heel of the enterprise, the complexity of data handling, by offering a developer-first, self-service approach. While Redpanda provides the easy-to-operate, high-performance event streaming platform, DataCater resides one layer above Redpanda and makes streaming data pipelines accessible to anyone working with data.

Given that DataCater is based on Apache Kafka and Redpanda is fully Kafka-API compatible, both technologies integrate seamlessly to turn the dream we share into reality: making event streaming a commodity.

Looking ahead, there are even more opportunities for a deep integration of the two products. What if DataCater could compile pipelines to WASM scripts (instead of Kafka Streams applications) and deploy them to Redpanda (instead of Kubernetes)?

Streaming will be the new default for data pipelines. Once data consumers have experienced the power and efficiency of event streaming, they will never go back to wondering when the data they consume was last synced. By making event streaming a commodity for data teams, DataCater and Redpanda enable enterprises to fully leverage their data. With access to current, high-quality data at any time, enterprises can serve their customers in the best way possible.


We would be more than happy to hear your thoughts on the future of event streaming. Feel free to reach out to Alexander Gallego (Vectorized) or Stefan Sprenger (DataCater) on Twitter or LinkedIn.
