This article introduces three use cases for getting started with Apache Kafka: log analytics, change data capture, and data validation.
Many companies employ a plethora of different data and infrastructure systems, e.g., (a) configuration management for microservices with technologies like Apache ZooKeeper® or Istio®, (b) log analytics with products such as Elasticsearch®, or (c) transactional database systems like PostgreSQL® or MySQL®.
Data streaming has become an increasingly popular solution for integrating these systems. While it was historically used only by certain industries, such as stock exchanges or media streaming providers, data streaming nowadays empowers businesses of all sizes and is used to implement a wide variety of use cases.
In this article, I want to help beginners get started with the ecosystem of Apache Kafka® by discussing three popular use cases.
Let us assume that we employ a log analytics solution like Splunk® or the ELK stack. We might use Apache Kafka for the intelligent ingestion of data into the log analytics solution.
Apache Kafka would help us implement backpressure, ensuring that we do not ingest data faster than the log analytics solution can handle.
Thanks to Apache Kafka, we could also decide which data to ingest. Since not all logs are of immediate interest, we might stream only a subset of the data to the log analytics solution and pass the rest to cheaper cold data storage, like Amazon S3.
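Such routing often boils down to a small, per-record decision function applied in the streaming layer. A sketch, assuming (hypothetically) that only high-severity logs are worth indexing and everything else goes to cold storage:

```python
HOT_LEVELS = {"ERROR", "WARN"}   # assumption: only high-severity logs get indexed

def route(record: dict) -> str:
    """Decide, per record, whether to send it to the (expensive)
    log analytics index or to cheap cold storage such as S3."""
    if record.get("level") in HOT_LEVELS:
        return "analytics"
    return "cold-storage"

records = [
    {"level": "DEBUG", "msg": "cache miss"},
    {"level": "ERROR", "msg": "db connection refused"},
]
destinations = [route(r) for r in records]
```

In practice this predicate would live in a stream processor or a sink connector's filter, but the decision logic itself stays this simple.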
Furthermore, we can use Apache Kafka to integrate our log data with real-time threat detection engines, such as Falco.
Traditionally, most data integrations extract data from database systems using a SELECT * query. This not only puts significant load on all involved systems but is also quite inefficient, since many unchanged records are replicated over and over despite the lack of updates.
Change Data Capture (CDC) allows you to identify and extract changes from databases in real time. By continuously intercepting inserts, updates, and deletes from the database's replication log, CDC provides a continuous stream of data changes, a real-time changelog.
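Conceptually, a changelog is an ordered sequence of events, each describing one row-level operation, and replaying it reconstructs the current table state. A sketch of this idea, loosely modeled on a Debezium-style event envelope (the exact field names vary by connector and are an assumption here):

```python
# Each change event loosely follows a Debezium-style envelope:
# "op" is c(reate), u(pdate), or d(elete); "before"/"after" hold row images.
changelog = [
    {"op": "c", "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
                "after":  {"id": 1, "email": "a@new.example.com"}},
    {"op": "d", "before": {"id": 1, "email": "a@new.example.com"}},
]

def apply_changes(replica: dict, events: list[dict]) -> dict:
    """Replay the changelog to materialize the current table state
    in a downstream replica (here, just a dict keyed by primary key)."""
    for event in events:
        if event["op"] in ("c", "u"):
            row = event["after"]
            replica[row["id"]] = row
        elif event["op"] == "d":
            replica.pop(event["before"]["id"], None)
    return replica

state = apply_changes({}, changelog)
```

This is exactly what a downstream sink does when it consumes a CDC topic: each event is idempotently applied by primary key, so the replica converges to the source table's state.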
Debezium is a popular open-source project from the Apache Kafka community. It provides a set of Kafka Connect source connectors that implement log-based change data capture for database systems such as MySQL, PostgreSQL, or Oracle, and can be used for streaming data changes to downstream data sinks in real time.
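To give a feel for what deploying such a connector involves, here is a sketch of a Debezium MySQL source connector configuration as it might be submitted to the Kafka Connect REST API. Host names, credentials, and table names are placeholders, and the exact set of required properties depends on your Debezium version, so treat this as illustrative rather than copy-paste ready:

```json
{
  "name": "inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "change-me",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.customers"
  }
}
```

Once registered, the connector snapshots the included tables and then tails the MySQL binlog, publishing one change event per row-level operation to Kafka topics under the configured prefix.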
Data-driven organizations use data from their various data systems to, for instance, drive business growth or prevent future security breaches. But there is no free lunch: relying on data comes at a certain risk. It's always possible to make an inaccurate decision due to unexpected schema changes, host downtime, etc.
Fortunately, there are techniques and tools available for validating the correctness of data. We can employ Apache Kafka in combination with Kafka Connect and a stream processing framework like Kafka Streams to continuously monitor data sets of interest and validate their correctness using checks such as:
Detect changes in the schema of data systems, like database systems or APIs
Validate custom data formats, e.g., check the validity of credit card numbers by employing regular expressions or more advanced data validators written in programming languages such as Python
Monitor the freshness of data sets and detect whether a data set has been updated within a specific time period, e.g., validate that the application logs have been written in the last 60 seconds
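Two of the checks above can be sketched in a few lines of plain Python: a credit card validator that combines a regular expression with the Luhn checksum, and a freshness check against a last-write timestamp. In a real deployment these predicates would run inside a stream processor over each record; the function names and thresholds here are illustrative:

```python
import re
import time
from typing import Optional

def is_valid_card_number(number: str) -> bool:
    """Format check via regex, then the Luhn checksum."""
    if not re.fullmatch(r"\d{13,19}", number):
        return False
    digits = [int(d) for d in number][::-1]
    # Double every second digit from the right; digits of doubled
    # values >= 10 are summed (divmod(2*d, 10) yields tens and ones).
    total = sum(digits[0::2])
    total += sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def is_fresh(last_write_ts: float, max_age_seconds: float = 60.0,
             now: Optional[float] = None) -> bool:
    """Flag a data set as stale if it was not updated recently enough."""
    now = time.time() if now is None else now
    return now - last_write_ts <= max_age_seconds
```

A freshness check like this would typically be driven by a scheduled punctuation in Kafka Streams, alerting whenever the latest record timestamp on a topic falls too far behind the wall clock.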
Over the last decade, Apache Kafka has matured from a pure message broker into a complete data streaming platform, offering connectors and stream processors for implementing use cases like real-time ETL pipelines. This makes it straightforward for companies of all sizes to implement streaming use cases, such as replicating data from their application databases to downstream consumers. This article discussed three popular use cases from the Apache Kafka community. Feel free to try them out!