Apache Kafka

Sangeevan Siventhirarajah
3 min read · Jan 28, 2022

Apache Kafka is an open-source distributed event streaming platform. It aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Kafka combines three key capabilities, so that use cases for event streaming can be implemented end to end with a single solution:

  1. To publish (write) and subscribe to (read) streams of events, including continuous import/export of data from other systems.
  2. To store streams of events durably and reliably.
  3. To process streams of events as they occur or retrospectively.

An event records the fact that “something happened”; it is also called a record or a message in the documentation. When you read or write data to Kafka, you do so in the form of events. Conceptually, an event has a key, value, timestamp, and optional metadata headers.
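
As a minimal sketch using the Java client library, an event with a key, a value, an explicit timestamp, and a metadata header might look like this (the topic name, key, value, and header are made up for illustration):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventExample {
    public static void main(String[] args) {
        // An event: key, value, timestamp (epoch millis), plus optional headers.
        // "payments", "alice", and the payload are illustrative, not from any real system.
        ProducerRecord<String, String> event = new ProducerRecord<>(
                "payments",                 // topic
                null,                       // partition (null lets Kafka decide)
                System.currentTimeMillis(), // timestamp
                "alice",                    // key
                "paid $42 to bob"           // value
        );
        event.headers().add("source", "web-checkout".getBytes()); // optional metadata header

        System.out.println(event);
    }
}
```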

Producers are those client applications that publish (write) events to Kafka, and consumers are those that subscribe to (read and process) these events. In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers.
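
As a rough illustration of that decoupling (assuming a broker at localhost:9092 and a topic named "payments", both of which are assumptions for the example), a producer writes events without knowing anything about who will read them:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only knows the topic; it never waits for or talks to consumers.
            producer.send(new ProducerRecord<>("payments", "alice", "paid $42 to bob"));
            producer.flush();
        }
    }
}
```

The producer needs only the broker address and the topic name; whether zero, one, or many consumer groups later read the topic has no effect on this code.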

Events are organized and durably stored in topics. Topics in Kafka are always multi-producer and multi-subscriber. Events in a topic can be read as often as needed, and events are not deleted after consumption. Kafka’s performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
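
For instance, a consumer with its own group ID can read a topic from the beginning without affecting anyone else, because consuming does not delete events (the group ID, topic name, and broker address below are assumptions for the sketch):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-reader");         // assumed group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // a new group starts at the oldest retained event
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Reading an event does not remove it; another group can read the same data again.
                    System.out.printf("key=%s value=%s offset=%d%n",
                            record.key(), record.value(), record.offset());
                }
            }
        }
    }
}
```

Pointing a second application with a different group.id at the same topic replays the same retained events independently.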

Topics are partitioned, meaning a topic is spread over a number of “buckets” located on different Kafka brokers. This distributed placement of the data is very important for scalability because it allows client applications to both read and write the data from/to many brokers at the same time. When a new event is published to a topic, it is actually appended to one of the topic’s partitions. Events with the same event key are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition’s events in exactly the same order as they were written.
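
A small sketch of the keying behavior (topic name, key, and broker address are assumptions): sending several events with the same key and printing the partition reported back by the broker should show the same partition number each time.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SameKeySamePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // Same key ("alice") on every send, so every event maps to the same partition,
                // which is what preserves per-key ordering for readers of that partition.
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("payments", "alice", "event-" + i))
                        .get(); // block for the acknowledgement just to read the metadata
                System.out.printf("event-%d -> partition %d, offset %d%n",
                        i, meta.partition(), meta.offset());
            }
        }
    }
}
```

Events with different keys may land on different partitions, so ordering is guaranteed per partition rather than across the whole topic.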

In addition to command line tooling for management and administration tasks, Kafka has five core APIs:

  1. The Admin API to manage and inspect topics, brokers, and other Kafka objects.
  2. The Producer API to publish (write) a stream of events to one or more Kafka topics.
  3. The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
  4. The Kafka Streams API to implement stream processing applications and microservices (a small example is sketched after this list).
  5. The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka.
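
As a minimal sketch of the Kafka Streams API (the application ID and the input and output topic names are assumptions), the following application reads events from one topic, upper-cases their values, and writes the results to another topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Topology: read "payments", transform each value, write to "payments-uppercased".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("payments");
        input.mapValues(value -> value.toUpperCase()).to("payments-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```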

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export job can deliver data from Kafka topics into secondary storage and query systems or into batch systems for offline analysis.
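
As a rough sketch only (it assumes a Connect worker already running with its REST interface on localhost:8083, that the file connector plugin is available on that worker, and made-up file and topic names), a source connector that tails a file into a topic could be registered through Connect's REST API like this:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterFileSourceConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition: a name plus key/value config understood by the Connect worker.
        String connectorJson = """
                {
                  "name": "local-file-source",
                  "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/tmp/app.log",
                    "topic": "app-logs"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```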
