Introducing DataCater

It is our pleasure to introduce DataCater, the complete platform for continuous data preparation.

By Stefan Sprenger

Over the next decade, we will see broad adoption of Artificial Intelligence (AI) in companies across all industries. Many organizations will move on from experimenting with AI in proofs of concept to fully integrating AI into their processes. Although tooling for AI has advanced considerably over the last couple of years, data experts still spend an enormous amount of time preparing raw data for AI applications. It’s time to change that.

With more enterprises putting AI into production, we will also see a strong increase in demand for streaming data architectures, which enable AI applications to always work with current data. We see a clear need for new technologies that make stream processing more accessible to data experts and allow them to keep prepared data in sync with raw data.

Today’s tools for preparing data are not sufficient for tomorrow’s needs in AI. That is why we are building DataCater, the complete platform for continuous data preparation. We have defined two main objectives that guide our mission:

  • Save time for data experts by providing efficient means of preparing data.

  • Enable AI applications to always work with the most recent data by fully automating the execution of data preparation.

DataCater

DataCater is the complete platform for continuous data preparation and provides efficient tools for the following tasks:

  • Design of data pipelines, which prepare raw data for use in AI applications

  • Interactive and exploratory analysis of data sets

  • Management of data connectors

  • Execution of data pipelines

  • Monitoring of data quality

Pipeline Designer

The main component of DataCater is the Pipeline Designer, which allows users to interactively build streaming data pipelines.

The Pipeline Designer integrates an extensive repository of predefined filter and transformation functions, which cover most needs in data preparation for AI and can be freely combined into data pipelines. By using predefined functions for the basic tasks in data preparation, users can save a lot of time otherwise spent on implementing and testing code. Don’t worry, you won’t have to stop coding entirely: the Pipeline Designer also lets you write user-defined functions in Python to handle special requirements.
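The exact interface for user-defined functions is product-specific, but as a rough sketch, a Python UDF attached to a single attribute might look like the following. The transform signature and the example use case are illustrative assumptions, not DataCater’s documented API:

    # Hypothetical sketch of a Python user-defined function (UDF) for a
    # pipeline step; the actual signature is product-specific.
    import re

    def transform(value, row):
        """Normalize a free-text 'city' attribute.

        value: the current value of the attribute this UDF is attached to.
        row:   the full record, in case the transformation needs context.
        """
        if value is None:
            return None
        # Drop everything except letters and spaces, collapse whitespace,
        # and normalize the capitalization.
        cleaned = re.sub(r"[^A-Za-z ]", "", value)
        return " ".join(cleaned.split()).title()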

When applying transformations, the Pipeline Designer instantly previews their impact on real data sampled from the connected data source. Instant previews not only offer full transparency into the applied transformations, but also allow users to iterate much faster when developing data pipelines, further reducing the time spent on data preparation.

Using DataCater's Pipeline Designer to tokenize a text value.

The Pipeline Designer provides essential profile information for the sample data, such as the most frequent values, the number of distinct or missing values, and the minimum or maximum values of numeric attributes. Previews of the impact of transformations are not limited to individual records but are also available for this profile information.
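To illustrate the kind of profile information involved, the following sketch computes comparable statistics over a small sample with pandas. Pandas is our choice for the illustration only; it says nothing about DataCater’s internals:

    # Illustrative only: profile statistics similar to those shown by the
    # Pipeline Designer, computed over a small sample with pandas.
    import pandas as pd

    sample = pd.DataFrame({
        "age":  [31, 45, None, 27, 31],
        "city": ["Berlin", "Munich", "Berlin", None, "Hamburg"],
    })

    for column in sample.columns:
        col = sample[column]
        print(f"--- {column} ---")
        print("missing values: ", int(col.isna().sum()))
        print("distinct values:", col.nunique())
        print("most frequent:  ", col.mode().iloc[0])
        if pd.api.types.is_numeric_dtype(col):
            print("min / max:      ", col.min(), "/", col.max())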

Data Connectors

DataCater integrates many different source and sink connectors for database systems, data warehouses, flat files, search engines, and more, making it straightforward to connect data pipelines with external data stores.

Many source connectors implement Change Data Capture (CDC), which extracts change events (insertions, updates, or deletions) from data sources in real time and enables the fully automated execution of data pipelines. DataCater uses CDC as the key technology for always keeping prepared data in sync with raw data sources.
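To make this concrete, a CDC change event is, roughly, a structured record describing a single change to a row. The sketch below shows a simplified, Debezium-style insertion event and how a pipeline might branch on the operation type; the exact payload layout varies by connector and configuration:

    # Simplified, Debezium-style change event; the exact structure depends
    # on the connector and its configuration.
    change_event = {
        "op": "c",                              # c = create, u = update, d = delete
        "before": None,                         # row state before the change
        "after": {"id": 42, "name": "Ada"},     # row state after the change
        "source": {"table": "customers"},
    }

    def handle(event):
        if event["op"] in ("c", "u"):
            return event["after"]    # prepare the new or updated row
        if event["op"] == "d":
            return event["before"]   # e.g., propagate the deletion downstream
        return None

    print(handle(change_event))  # {'id': 42, 'name': 'Ada'}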

At the moment, DataCater offers CDC source connectors for IBM DB2, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL.

Deployments

Once all data preparation requirements have been defined in the Pipeline Designer, it’s time for execution. DataCater takes care of turning pipeline definitions into production-grade software, packaging the generated code as Docker images, and deploying the images as containers onto the platform of your choice, whether on-premise or in the cloud.

All pipelines built with the Pipeline Designer run as streaming data pipelines, which stream change events in real time from data sources to data sinks. DataCater provides tools for monitoring the execution of data pipelines.

Architecture

We designed DataCater to integrate straightforwardly into existing data architectures. Given one of our main objectives, the fully automated execution of data preparation, it was an obvious choice to base DataCater on the leading technology for storing and processing streams of data, Apache Kafka, which is already in use in many enterprises.

Technically, DataCater’s data connectors build on Kafka Connect, which not only enables the use of many existing connectors published by the open-source community, but also allows users of DataCater to implement their own custom data connectors. For most CDC source connectors, DataCater uses the Debezium project.
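Under the hood, such a source connector is driven by a plain Kafka Connect configuration. As a sketch, the following registers a Debezium PostgreSQL source connector through Kafka Connect’s REST API; all host names, credentials, and table names are placeholders, and DataCater’s connector management performs the equivalent of this step for you:

    # Sketch: registering a Debezium PostgreSQL source connector via the
    # Kafka Connect REST API. All host names and credentials are placeholders.
    import requests

    connector = {
        "name": "customers-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "postgres.example.com",
            "database.port": "5432",
            "database.user": "cdc_user",
            "database.password": "secret",
            "database.dbname": "shop",
            "database.server.name": "shop",
            "table.include.list": "public.customers",
        },
    }

    response = requests.post(
        "http://kafka-connect.example.com:8083/connectors",
        json=connector,
    )
    response.raise_for_status()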

DataCater stores all change events, which are extracted from data sources and processed by data pipelines, in Kafka topics. DataCater does not impose any custom requirements on Kafka and supports deployment on top of existing Kafka clusters.

DataCater compiles data pipelines built with the Pipeline Designer into highly-optimized Kafka Streams applications, which are executed as containers.

The following figure sketches an example streaming data pipeline built with DataCater. The pipeline consumes data from a PostgreSQL database, applies filters and transformations to the raw data, and eventually publishes the prepared data to an Elasticsearch index, which can be consumed by downstream AI applications.

Streaming data from PostgreSQL to Elasticsearch and transforming them on the way.
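Conceptually, the compiled pipeline behaves like the loop below. Note that the generated applications are Kafka Streams (Java) programs; this Python sketch, based on the kafka-python and elasticsearch client libraries, only illustrates the consume-transform-index flow, and all topic, index, and host names are placeholders:

    # Conceptual illustration only: DataCater compiles pipelines to Kafka
    # Streams (Java) applications. This loop merely sketches the flow of
    # consuming change events, transforming them, and indexing the results.
    import json
    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch

    consumer = KafkaConsumer(
        "shop.public.customers",                    # CDC topic fed from PostgreSQL
        bootstrap_servers="kafka.example.com:9092",
        value_deserializer=lambda v: json.loads(v) if v else None,
    )
    es = Elasticsearch("http://elasticsearch.example.com:9200")

    for message in consumer:
        event = message.value
        if event is None or event.get("op") == "d":  # filter: skip deletions
            continue
        row = event["after"]
        row["name"] = row["name"].strip().title()    # transformation step
        es.index(index="customers", id=row["id"], document=row)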

Availability

We offer DataCater both as a managed service and as an on-premise installation. For the on-premise installation, we deliver DataCater as a Docker image, which can be installed either on your own machines or on the cloud platform of your choice.

After focusing entirely on product development for the last couple of months, we are currently deploying the first DataCater installations in pilot projects.

Summary

DataCater is the complete platform for continuous data preparation. It offers the Pipeline Designer as an interactive and efficient means of performing data preparation, which can save data experts a lot of time in practice. Technically, DataCater compiles data pipelines into production-grade streaming applications, which stream change events from data sources to data sinks in real time and apply transformations along the way. DataCater fully automates the execution of data preparation and allows downstream AI applications to always work with the most recent data, benefiting production deployments of AI.

We are always happy to chat about data preparation and learn more about potential use cases. Feel free to say hi to us!

Apache, Apache Kafka, Kafka, and the Kafka logo are trademarks of the Apache Software Foundation. PostgreSQL and the PostgreSQL logo are trademarks of the PostgreSQL Community Association of Canada. Elasticsearch and the Elasticsearch logo are trademarks of Elasticsearch B.V.