Apache Kafka

Distributed
Event Streaming

Apache Kafka is a distributed, fault-tolerant event streaming platform capable of handling trillions of events per day — built for real-time data pipelines and stream processing at scale.

1M+ msgs/sec per broker
80%+ Fortune 100 users
2011: Open sourced at LinkedIn
Retention: configurable
Pub/Sub · Log Compaction · Fault Tolerant · Horizontally Scalable · High Throughput · Low Latency
Core Concepts

Building Blocks

Everything in Kafka revolves around a handful of powerful primitives.

📨

Producer

Publishes (writes) records to one or more Kafka topics. Producers decide which partition a message lands in, using a key hash or round-robin strategy.

📬

Consumer

Reads records from topics. Consumers track their position (offset) independently, allowing replay and parallel consumption without coordination.

🗂️

Topic

A named, ordered, persistent log of records. Topics are split into partitions for parallelism and replicated for fault tolerance.

🧩

Partition

An ordered, immutable sequence of records within a topic. Partitions enable horizontal scaling — each can live on a different broker.

🔢

Offset

A unique, sequential integer assigned to each record within a partition. Consumers commit offsets to track exactly where they left off.

🖥️

Broker

A Kafka server that stores and serves records. A Kafka cluster consists of multiple brokers for redundancy and throughput distribution.

👥

Consumer Group

A set of consumers that together consume a topic. Each partition is assigned to exactly one consumer in the group, enabling load balancing.

🔑

Record Key

An optional key attached to each message. With the default partitioner, records with the same key go to the same partition as long as the topic's partition count does not change, preserving ordering for that key.

🔁

Replication

Each partition is replicated across N brokers (replication factor). One replica is the leader; others are followers that sync from it.
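
To make the key-to-partition rule concrete, here is a minimal sketch that hashes a record key modulo the partition count and falls back to round-robin for records without a key. It is a simplified model rather than Kafka's own partitioner: the real client hashes the serialized key bytes with murmur2, while this sketch uses a stand-in hash, and `PartitionSketch` and `partitionFor` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of key-based partition selection (hypothetical, not Kafka's partitioner). */
final class PartitionSketch {
    private final int numPartitions;
    private final AtomicInteger roundRobin = new AtomicInteger();

    PartitionSketch(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /** Same key, same partition, as long as the partition count stays the same. */
    int partitionFor(String key) {
        if (key == null) {
            // No key: spread records evenly across partitions.
            return Math.floorMod(roundRobin.getAndIncrement(), numPartitions);
        }
        // Stand-in hash over the key bytes; Kafka itself uses murmur2 here.
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        PartitionSketch partitioner = new PartitionSketch(6);
        System.out.println("order-123 -> partition " + partitioner.partitionFor("order-123"));
        System.out.println("order-123 -> partition " + partitioner.partitionFor("order-123")); // same partition again
        System.out.println("no key    -> partition " + partitioner.partitionFor(null));
        System.out.println("no key    -> partition " + partitioner.partitionFor(null));
    }
}
```

Because the partition is derived from the hash modulo the partition count, adding partitions to a topic can move existing keys to a different partition.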

Architecture

How It All Fits Together

Kafka's architecture decouples producers from consumers through a fault-tolerant, distributed commit log.

[Architecture diagram: producers (App Service, DB CDC, IoT Device) write to the Kafka cluster (Broker 1 with leader partitions, Broker 2 with replica partitions, ZooKeeper / KRaft for coordination); consumers (Analytics, Microservice, Data Lake) poll from it.]

Producer Code Example

Java
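import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "...StringSerializer");    // "..." abbreviates the serializer's package name
props.put("value.serializer", "...StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>(
    "my-topic",            // topic
    "order-123",           // key → determines partition
    "{\"price\": 99.9}"    // value
));
producer.close();          // waits for buffered records to be sent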

Consumer Code Example

Java
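import java.time.Duration;
import java.util.Arrays;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// props must hold consumer settings here: bootstrap.servers, group.id,
// and key/value deserializers (not the producer's serializers).
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("my-topic"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> r : records) {
        process(r.key(), r.value(), r.offset());   // process(): application-defined handler
    }
}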
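
When several instances of a consumer like this one share a group, each partition is owned by exactly one of them. The sketch below illustrates that rule with a plain round-robin assignment. It is an assumption-laden model, not the assignor the Kafka client actually runs, and `GroupAssignmentSketch` and `assign` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Round-robin partition assignment within one consumer group (hypothetical sketch). */
final class GroupAssignmentSketch {

    /** Gives every partition exactly one owner; surplus consumers get an empty list. */
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment; // nobody to assign partitions to
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers share six partitions.
        System.out.println(assign(List.of("c1", "c2"), 6));             // {c1=[0, 2, 4], c2=[1, 3, 5]}
        // More consumers than partitions: the fourth consumer stays idle.
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3)); // {c1=[0], c2=[1], c3=[2], c4=[]}
    }
}
```

The second call shows why the partition count caps the useful parallelism of a group: a consumer beyond that count receives nothing to read.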
Topics & Partitions

The Log Abstraction

Each partition of a topic is an append-only, immutable log. Records are written to the end and identified by their offset. Ordering is guaranteed within a partition, so records that share a key stay in order, and replication across brokers provides durability.
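
The log abstraction can be sketched in a few lines: appends return the next offset, and reads start from whatever offset the reader supplies without changing the log. This is an in-memory illustration only, with no persistence or replication, and `PartitionLogSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory model of a single partition: an append-only list addressed by offset. */
final class PartitionLogSketch {
    private final List<String> records = new ArrayList<>();

    /** Appends to the end of the log and returns the new record's offset. */
    long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    /** Returns up to maxRecords starting at fromOffset; reading never modifies the log. */
    List<String> read(long fromOffset, int maxRecords) {
        if (maxRecords < 0) {
            throw new IllegalArgumentException("maxRecords must not be negative");
        }
        int start = (int) Math.min(Math.max(fromOffset, 0), records.size());
        int end = (int) Math.min((long) start + maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("created");
        log.append("paid");
        log.append("shipped");

        // Each reader keeps its own position, so one can replay while another is caught up.
        long analyticsPosition = 0;
        long billingPosition = 2;
        System.out.println(log.read(analyticsPosition, 10)); // [created, paid, shipped]
        System.out.println(log.read(billingPosition, 10));   // [shipped]
    }
}
```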

Topic Partition Map

orders: 6 partitions, RF=3
events: 4 partitions, RF=2
metrics: 8 partitions, RF=3

Key Concepts

Retention Policy

Topics retain data for a configurable time (e.g. 7 days) or size. Once a limit is reached, the oldest log segments are deleted; topics configured for compaction instead keep the latest record per key.

Log Compaction

For changelog topics, Kafka keeps only the latest value per key, compacting old records — perfect for state stores and caches.

Leader Election

Each partition has one leader broker, which by default handles all reads and writes, and N-1 followers, where N is the replication factor. If the leader fails, an in-sync follower is elected as the new leader automatically.
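
The log compaction idea above, keeping only the latest value per key, can be illustrated with a short sketch. It assumes an in-memory list of entries and ignores segments, timing, and delete markers, so it shows the end result rather than how the broker's cleaner works. `CompactionSketch`, `Entry`, and `compact` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Keeps only the most recent entry per key while preserving the surviving offsets. */
final class CompactionSketch {
    record Entry(long offset, String key, String value) {}

    static List<Entry> compact(List<Entry> log) {
        Map<String, Entry> latest = new LinkedHashMap<>();
        for (Entry entry : log) {
            // Remove then re-insert so the survivors stay in offset order.
            latest.remove(entry.key());
            latest.put(entry.key(), entry);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Entry> log = List.of(
            new Entry(0, "user-1", "email=a@example.com"),
            new Entry(1, "user-2", "email=b@example.com"),
            new Entry(2, "user-1", "email=c@example.com")); // supersedes offset 0

        // Offsets 1 and 2 survive; offset 0 is dropped because user-1 has a newer value.
        compact(log).forEach(System.out::println);
    }
}
```

Surviving records keep their original offsets, so a consumer's committed position remains meaningful after compaction.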

Kafka APIs

Five Powerful APIs

📤

Producer API

Allows apps to publish streams of records to topics. Supports batching, compression (gzip, snappy, lz4, zstd), and async/sync delivery.

Write
📥

Consumer API

Subscribe to topics and process record streams. Consumers form groups for parallel, load-balanced consumption with offset management.

Read
🔀

Streams API

A lightweight stream-processing library. Build stateful, real-time apps that transform or aggregate data as it flows through Kafka topics.

Transform
🔌

Connect API

Scalable, fault-tolerant connectors to import/export data from databases, S3, Elasticsearch, JDBC sources, and hundreds of others.

Integrate
🛡️

Admin API

Manage topics, brokers, ACLs, consumer groups, and configs programmatically. Essential for infrastructure-as-code workflows.

Manage

Kafka Streams Topology

Kotlin
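import java.time.Duration
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.kstream.TimeWindows

val builder = StreamsBuilder()
builder
    .stream<String, String>("raw-events")
    .filter { _, v -> v.contains("purchase") }
    .mapValues { v -> enrich(v) }   // enrich(): application-defined String -> String function
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count()
    .toStream()
    // The count is keyed by Windowed<String>; flatten it to "key@windowStart"
    // so the output can be written with plain String and Long serdes.
    .selectKey { windowedKey, _ -> "${windowedKey.key()}@${windowedKey.window().start()}" }
    .to("purchase-counts", Produced.with(Serdes.String(), Serdes.Long()))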
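
The windowed count in this topology boils down to bucketing each event by key and by the 5-minute window its timestamp falls into. The plain-Java sketch below shows that arithmetic without the Streams library, assuming tumbling windows aligned to the epoch. `TumblingWindowSketch`, `Event`, and the `key@windowStart` bucket format are hypothetical choices for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts events per key and tumbling window, using only timestamp arithmetic. */
final class TumblingWindowSketch {
    record Event(String key, long timestampMs) {}

    /** Start (epoch millis) of the window that contains the timestamp. */
    static long windowStart(long timestampMs, Duration size) {
        return timestampMs - Math.floorMod(timestampMs, size.toMillis());
    }

    /** Returns counts keyed by "key@windowStart", sorted for stable output. */
    static Map<String, Long> countPerWindow(List<Event> events, Duration size) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event event : events) {
            String bucket = event.key() + "@" + windowStart(event.timestampMs(), size);
            counts.merge(bucket, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Duration fiveMinutes = Duration.ofMinutes(5);
        List<Event> events = List.of(
            new Event("user-1", 0L),
            new Event("user-1", 299_999L),   // last millisecond of the first window
            new Event("user-1", 300_000L),   // first millisecond of the second window
            new Event("user-2", 10L));
        System.out.println(countPerWindow(events, fiveMinutes));
        // {user-1@0=2, user-1@300000=1, user-2@0=1}
    }
}
```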
Use Cases

Where Kafka Shines

01

Messaging & Event Bus

Replace traditional message queues (RabbitMQ, ActiveMQ) with a high-throughput, durable, replayable event bus connecting microservices without tight coupling.

02

Real-Time Analytics

Stream clickstream, IoT sensor data, or log events into analytics pipelines (Flink, Spark, ksqlDB) for dashboards with sub-second latency.

03

Change Data Capture (CDC)

Capture every row-level change in databases (PostgreSQL, MySQL) via Debezium connectors and stream them to data warehouses or caches in real time.

04

Log Aggregation

Collect logs from hundreds of services into a central Kafka cluster, then fan out to Elasticsearch, S3, or Splunk — decoupling collection from storage.

05

Event Sourcing

Use Kafka as the system of record for domain events. Services rebuild state by replaying the event log, enabling time-travel debugging and audit trails.

06

Stream ETL / Data Pipelines

Replace nightly batch ETL with continuous streams: transform, filter, join, and route data between dozens of sources and sinks in motion.

07

Microservice Choreography

Implement saga patterns and event-driven choreography between services with guaranteed delivery and independent scaling of each consumer.

08

ML Feature Pipelines

Stream features to a feature store in real time, ensuring ML models always have fresh data for inference without batch lag.

Delivery Guarantees

Reliability Promises

Kafka gives you fine-grained control over delivery semantics via producer acknowledgment and idempotency settings.

📮
At-Most-Once
Messages may be lost, never duplicated. Producer fires and forgets (acks=0). Fastest but least safe.
📬
At-Least-Once
Messages never lost, may be duplicated. Producer retries on failure (acks=all). Most common default.
Exactly-Once
No loss, no duplicates. Requires idempotent producer + transactions. Best for financial & critical data.
🔒
Idempotent Producer
The producer tags each batch with its producer ID and a per-partition sequence number, so the broker can discard retried duplicates automatically (enable.idempotence=true).
🔗
Transactions
Atomically write to multiple partitions/topics. Consumers set isolation.level=read_committed to see only committed data.
🏢
Replication Factor
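With RF=3, each partition has three copies. A record acknowledged by all three in-sync replicas (ISR) survives the loss of any 2 brokers; setting min.insync.replicas=2 together with acks=all ensures every acknowledged record exists on at least 2 brokers.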
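
The deduplication performed for an idempotent producer can be pictured as the broker remembering the last sequence number it appended for each producer. The sketch below models that for a single partition. It is a deliberate simplification (the real broker tracks sequences per batch and per partition), and `IdempotentBrokerSketch` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-partition model of sequence-number deduplication. */
final class IdempotentBrokerSketch {
    enum Result { APPENDED, DUPLICATE, OUT_OF_ORDER }

    // producerId -> last sequence number appended for that producer
    private final Map<Long, Integer> lastSequence = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    Result append(long producerId, int sequence, String value) {
        int last = lastSequence.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return Result.DUPLICATE;      // a retry of a record that was already written
        }
        if (sequence != last + 1) {
            return Result.OUT_OF_ORDER;   // a gap: an earlier record has not arrived yet
        }
        log.add(value);
        lastSequence.put(producerId, sequence);
        return Result.APPENDED;
    }

    int logSize() {
        return log.size();
    }

    public static void main(String[] args) {
        IdempotentBrokerSketch broker = new IdempotentBrokerSketch();
        System.out.println(broker.append(7L, 0, "debit $10")); // APPENDED
        System.out.println(broker.append(7L, 0, "debit $10")); // DUPLICATE (retry after a lost ack)
        System.out.println(broker.append(7L, 1, "debit $5"));  // APPENDED
        System.out.println("records in log: " + broker.logSize()); // records in log: 2
    }
}
```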

Acknowledgment Settings

Config
// acks=0        → fire-and-forget, highest throughput
// acks=1        → leader acknowledges; records can be lost if the leader crashes before followers copy them
// acks=all (-1) → all in-sync replicas acknowledge, safest
props.put("acks", "all");
props.put("enable.idempotence", "true");
props.put("retries", Integer.MAX_VALUE);
props.put("max.in.flight.requests.per.connection", 5);   // must be 5 or lower when idempotence is enabled
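
How the acks setting interacts with the number of in-sync replicas can be summarized as a small decision function. The sketch assumes the broker-side `min.insync.replicas` rule (a write with acks=all is rejected when too few replicas are in sync) and ignores unclean leader election. `AckSketch` and `writeAccepted` are hypothetical names, not part of the Kafka client.

```java
/** Decides whether a produce request would be accepted, given the ISR size. */
final class AckSketch {

    /**
     * @param acks              producer setting: "0", "1", "all" or "-1"
     * @param inSyncReplicas    current ISR size, including the leader
     * @param minInsyncReplicas the topic's min.insync.replicas value
     */
    static boolean writeAccepted(String acks, int inSyncReplicas, int minInsyncReplicas) {
        if (inSyncReplicas < 1) {
            return false; // no leader is available for the partition
        }
        switch (acks) {
            case "0":
            case "1":
                return true; // only the leader is involved in the acknowledgment
            case "all":
            case "-1":
                return inSyncReplicas >= minInsyncReplicas;
            default:
                throw new IllegalArgumentException("unknown acks value: " + acks);
        }
    }

    public static void main(String[] args) {
        // RF=3 with min.insync.replicas=2, as brokers drop out of the ISR one by one.
        System.out.println(writeAccepted("all", 3, 2)); // true
        System.out.println(writeAccepted("all", 2, 2)); // true
        System.out.println(writeAccepted("all", 1, 2)); // false: too few in-sync replicas
        System.out.println(writeAccepted("1", 1, 2));   // true: accepted, but only one copy exists
    }
}
```

The last two lines show the trade-off: acks=all gives up availability to protect durability, while acks=1 keeps accepting writes that live on a single broker.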
Comparison

Kafka vs The World

How does Kafka stack up against other messaging and streaming technologies?

Feature                | Kafka             | RabbitMQ      | AWS SQS        | Pulsar
Message Persistence    | ✔ Log-based       | ~ Queue       | ~ Configurable | ✔ Ledger
Replay Messages        | ✔ By offset       |               |                | ✔ Cursor reset
High Throughput        | ✔ 1M+ msg/s       | ~ Moderate    | ~ Managed      | ✔ High
Exactly-Once           | ✔ Transactions    | ~ With effort |                | ✔ Native
Stream Processing      | ✔ Kafka Streams   | ✗ External    | ✗ External     | ~ Pulsar Functions
Multi-consumer         | ✔ Consumer groups | ~ Exchanges   | ✗ Single       | ✔ Subscriptions
Operational Complexity | ~ Medium          | ✔ Low         | ✔ Managed      | ~ Medium-High
History

The Kafka Timeline

2010
Born at LinkedIn
Jay Kreps, Neha Narkhede, and Jun Rao create Kafka internally at LinkedIn to handle activity stream data and operational metrics.
2011
Open Sourced
Kafka is donated to the Apache Software Foundation and open-sourced on GitHub, attracting rapid community adoption.
2012
Apache Top-Level Project
Kafka graduates to an Apache top-level project. Consumer groups and offset management are introduced.
2014
Confluent Founded
The original Kafka creators leave LinkedIn to found Confluent, building an enterprise platform around Kafka.
2016
Kafka Streams & Connect
Kafka 0.10 introduces the Streams API, joining the Connect framework added in 0.9 and transforming Kafka from a message queue into a full streaming platform.
2017
Exactly-Once Semantics
Kafka 0.11 delivers transactional producers and exactly-once semantics — a landmark for stream processing reliability.
2021
KRaft Mode (KIP-500)
Kafka begins removing the ZooKeeper dependency with KRaft (Kafka Raft metadata mode), simplifying deployment significantly.
2023+
ZooKeeper-Free GA
KRaft mode becomes production-ready. Kafka 3.x+ supports fully ZooKeeper-free clusters. Cloud-native managed offerings mature.