This guide explores the most important consumer and producer properties of Apache Kafka for achieving high throughput.
Apache Kafka is a very powerful technology for storing data in a scalable, fault-tolerant, and robust manner and serves a plethora of data streaming use cases, like streaming data pipelines. It persists data in a distributed log and allows applications to either produce data to the log or consume data from it. Kafka calls events records and organizes them in topics. It does not make any assumptions about their format: users are free to bring their own data format and need to provide a serializer or deserializer when producing to or consuming from Kafka topics.
Throughput is an important performance indicator of an application working with Apache Kafka. It measures the amount of data that a consumer or producer can process in a specified time interval. Typically, throughput is defined in terms of records per second or megabytes (MB) per second.
Latency is another important performance metric; it measures the time needed to process a single event. Latency indicates how close an application gets to real-time processing of records and is typically measured in milliseconds.
When tuning the performance of an application, you typically want to maximize throughput and minimize latency. In practice, there is often a trade-off between the two: you sacrifice a bit of latency to gain some throughput, and vice versa.
In this guide, we describe the most important configuration properties of the consumer and producer APIs of Apache Kafka that impact the throughput when reading from or writing to Kafka. If you are building streaming data pipelines with DataCater, you can adjust the consumer and producer properties in the configuration of Streams.
In general, if you want to maximize the amount of data consumed per second, it is very helpful to consume records in batches instead of processing them one by one. This comes at the cost of increased latency because you need to wait until a batch fills up before you start processing it.
In the following, we describe the properties of the Apache Kafka consumer API that are essential to optimizing consumers for high throughput.
The consumer property max.poll.records defines the maximum number of records that are returned by one call of the poll() function (default: 500). Depending on how you process the records retrieved from Kafka, increasing this property can lead to an increase in the overall throughput. For instance, Kafka Streams, which is typically used for rather complex use cases that require the management of state, sets this property to 1000 by default.
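As a minimal sketch, this is how the property could be set with the Java consumer client; the broker address, group id, and serialization format are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class HighThroughputConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address and group id; adjust to your setup.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Return up to 1000 records per poll() call instead of the default 500.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll (see the poll loop sketch below) ...
    }
}
```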
The consumer property fetch.min.bytes defines the minimum number of bytes that a consumer intends to fetch from an Apache Kafka broker (default: 1). If the broker has less than fetch.min.bytes of records available, it waits for enough data to accumulate before responding to the fetch request. The waiting time can be constrained by setting the consumer property fetch.max.wait.ms.
If you want to reduce the number of network requests and increase the throughput of your consumer, you might want to try increasing this consumer property.
The consumer property fetch.max.wait.ms defines the number of milliseconds that an Apache Kafka broker waits for data before responding to a fetch request if less than fetch.min.bytes of data are available (default: 500). A higher value will improve the throughput but sacrifice latency, and vice versa.
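Extending the consumer sketch above, the two properties could be combined as follows; the values are illustrative, not recommendations:

```java
// Set both properties before creating the KafkaConsumer instance.
// Ask the broker to respond only once at least 64 KB of data are available...
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65536);
// ...but have it wait no longer than one second for the data to accumulate.
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 1000);
```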
Apache Kafka consumers retrieve records from the broker by calling the poll() function provided by the Apache Kafka client. While the application is processing the records and before it calls poll() another time, the Kafka client can already fetch data from the broker internally, which minimizes future waiting time. As opposed to the poll() function, which is available through a public interface, your application cannot actively invoke the fetching of data from the broker.
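Assuming the consumer from the first sketch, a minimal poll loop could look like this; example-topic is a placeholder topic name:

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

consumer.subscribe(Collections.singletonList("example-topic"));
while (true) {
    // Returns up to max.poll.records records; the client may have already
    // prefetched them in the background while the previous batch was processed.
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // Process each record; here, we simply print its value.
        System.out.println(record.value());
    }
}
```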
Similar to consumers, optimizing Apache Kafka producers for high throughput often involves transferring more records with fewer network requests.
The producer property batch.size defines the maximum size in bytes of the batches of records that the Apache Kafka client sends to the broker with one request (default: 16384). Note that this property limits the number of bytes per batch, not the number of records. Depending on the size of your records, you might want to consider setting this property to a few hundred thousand bytes, typically to something between 100000 and 200000. Please note that increasing the batch.size might have a negative impact on the memory consumption of your application.
The producer property linger.ms defines the number of milliseconds that the Kafka client waits for a batch to fill up to batch.size bytes before sending it to the broker (default: 0). By increasing this property to, for instance, 100 milliseconds, you can reduce the number of network requests and increase the throughput, but negatively impact the latency of your producer.
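A minimal sketch of a producer configured with both properties; the broker address, topic name, and the chosen values are placeholders for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HighThroughputProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; adjust to your setup.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Allow batches of up to ~128 KB instead of the default 16384 bytes.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 131072);
        // Wait up to 100 ms for a batch to fill up before sending it.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Hypothetical topic name for illustration.
        producer.send(new ProducerRecord<>("example-topic", "key", "value"));
        producer.close();
    }
}
```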
The producer property compression.type allows you to define a compression algorithm that is applied to the records before sending them to the Kafka broker (default: none). Applying compression reduces the amount of data that needs to be transferred from your application to the broker and can increase the throughput of your producer, especially when producing records in batches. Apache Kafka clients support various compression algorithms, e.g., gzip, lz4, and zstd.
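Extending the producer sketch above, enabling zstd compression, for instance, is a single additional configuration line:

```java
// Compress record batches with zstd before sending them to the broker.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");
```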
The producer property acks defines how many in-sync replicas must acknowledge a write request before the producer can consider it successful (default: 1 in Apache Kafka clients before version 3.0; since then, the default is all). By requiring all in-sync replicas to acknowledge a request, i.e., acks=all, you get very strong guarantees about the durability of your records but spend a lot of time waiting for the responses of the replicas, which negatively impacts the throughput. By not waiting for any acknowledgment, i.e., acks=0, you can increase the performance of your producer but risk losing records. For most use cases, we recommend sticking to acks=1.
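In the producer sketch above, this corresponds to one line; note that acks expects a string value:

```java
// Wait only for the partition leader to acknowledge each write.
props.put(ProducerConfig.ACKS_CONFIG, "1");
// Alternatives: "all" for maximum durability, "0" for maximum throughput.
```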
This guide discussed various configuration options for consumers and producers that, when adjusted, can have a positive impact on the throughput of your application.
In general, there is a trade-off between high throughput and low latency: configurations that improve the throughput typically impact the latency negatively, and vice versa.
The default values of the Apache Kafka configuration often favor latency over throughput, e.g., by default, producers do not wait until batches fill up but immediately start sending data to the broker (linger.ms=0). As a consequence, you need to experiment with different settings for the configuration options discussed above if you have high-throughput requirements.