Kafka — Confusions, Labs, Gotchas & Mock Interview
🧠 Memory Map: TOPIC-PARTITION-OFFSET
Kafka is just 3 ideas stacked. Remember TPO:
| Letter | Pillar | What it controls |
|---|---|---|
| T | Topic | Named stream of records (like a table) |
| P | Partition | Unit of parallelism + ordering (like a shard) |
| O | Offset | Position within a partition (like a row number) |
Draw these 3 on a whiteboard, add producers on the left and consumers on the right, and you've explained Kafka.
SECTION 1 — TOP 8 CONFUSIONS CLEARED
Confusion #1 — Topic vs Partition vs Replica
| Concept | What it is | Example |
|---|---|---|
| Topic | Logical stream name | orders |
| Partition | Physical log file (shard) | orders-0, orders-1, orders-2 |
| Replica | Copy of a partition on a different broker | orders-0 on brokers 1, 2, 3 |
Key rule: ordering is guaranteed within a partition, not across the topic. A total order across the whole topic would require a single partition, which kills parallelism.
Interview one-liner: "Topic = logical stream. Partition = physical log + unit of parallelism. Replica = fault-tolerance copy."
Confusion #2 — Partition Key vs Producer Partitioner vs Sticky Partitioner
How does a producer decide WHICH partition a record goes to?
| Scenario | Rule |
|---|---|
| key != null | partition = hash(key) % num_partitions — same key → same partition (ordered!) |
| key == null (old clients) | Round-robin across partitions |
| key == null (Kafka 2.4+) | Sticky partitioner — sticks to one partition per batch, rotates on batch full |
Why sticky: fewer but bigger batches → less network overhead → higher throughput.
Example:
// All orders for customer 42 land on same partition → ordered processing
producer.send(new ProducerRecord<>("orders", "customer-42", orderJson));
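The key-to-partition mechanics can be sketched in a few lines. This is an illustration only: Kafka's real default partitioner hashes the serialized key bytes with murmur2, whereas this sketch uses `String.hashCode` as a stand-in.

```java
// Sketch of key-based partition selection. Kafka's default partitioner
// actually applies murmur2 to the serialized key bytes; String.hashCode
// is a stand-in to show the mechanics.
public class PartitionerSketch {
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Same key -> same partition, every time
        System.out.println(partitionFor("customer-42", 3));
        System.out.println(partitionFor("customer-42", 3));
        System.out.println(partitionFor("customer-7", 3));
    }
}
```

Whatever the hash function, the property that matters is determinism: two sends with the same key always resolve to the same partition, which is what makes per-key ordering possible.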
Confusion #3 — Consumer Group vs Consumer
One of the most-asked Kafka questions.
Rules:
- Each partition is assigned to exactly ONE consumer within a group.
- Multiple groups can read the same topic INDEPENDENTLY.
- If #consumers > #partitions → extra consumers sit idle.
- If #consumers < #partitions → some consumers get multiple partitions.
Interview one-liner: "Partitions split work WITHIN a group. Groups replay the same topic INDEPENDENTLY."
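The rules above can be simulated in a few lines. This is a round-robin sketch of the assignment (Kafka's actual default assignors, range and cooperative-sticky, distribute differently, but the "one partition, one consumer per group" invariant is the same):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Round-robin sketch of spreading partitions across a consumer group.
public class GroupAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String c : consumers) out.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            // Each partition goes to exactly ONE consumer in the group
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // 4 consumers, 3 partitions: one consumer sits idle
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3));
        // 2 consumers, 3 partitions: one consumer owns two partitions
        System.out.println(assign(List.of("c1", "c2"), 3));
    }
}
```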
Confusion #4 — At-most-once vs At-least-once vs Exactly-once
| Delivery semantic | What happens | How to achieve |
|---|---|---|
| At-most-once | Commit offset BEFORE processing. Crash = message lost. | enable.auto.commit=true, fast commit |
| At-least-once (default) | Process THEN commit. Crash mid-commit = redelivery. | Manual commit after processing |
| Exactly-once | No duplicates, no loss. | Idempotent producer + transactions (EOS) |
EOS config:
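A minimal producer-side sketch, using the standard Kafka client property names (the transactional.id value here is a placeholder you would choose yourself):

```java
import java.util.Properties;

// Producer-side exactly-once settings. The property keys are the standard
// Kafka client config names; "orders-tx-1" is an illustrative placeholder.
public class EosConfigSketch {
    static Properties eosProducerProps() {
        Properties p = new Properties();
        p.setProperty("enable.idempotence", "true");      // broker dedupes by PID + sequence
        p.setProperty("acks", "all");                     // required with idempotence
        p.setProperty("transactional.id", "orders-tx-1"); // enables transactions
        return p;
    }

    public static void main(String[] args) {
        System.out.println(eosProducerProps());
    }
}
```

On the consumer side this pairs with isolation.level=read_committed, so uncommitted transactional writes are never seen.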
Catch: EOS only holds end-to-end when the producer uses transactions AND the consumer reads with isolation.level=read_committed (Kafka Streams wires both up for you).
Confusion #5 — Leader vs Follower vs ISR (In-Sync Replica)
  Broker 1            Broker 2            Broker 3
┌───────────┐       ┌───────────┐       ┌───────────┐
│ P0 LEADER │       │ P0 replica│       │ P0 replica│
│ (reads +  │       │ (follower)│       │ (follower)│
│  writes)  │       │           │       │           │
└───────────┘       └───────────┘       └───────────┘
      ▲                   │                   │
      │                   │ fetch             │ fetch
      └───────────────────┴───────────────────┘
| Term | Role |
|---|---|
| Leader | Handles all reads/writes for a partition |
| Follower | Pulls data from the leader; an in-sync follower is elected leader if the current leader dies |
| ISR | Replicas that are FULLY caught up (within replica.lag.time.max.ms) |
min.insync.replicas=2 + acks=all = producer waits until at least 2 in-sync replicas ACK. The flip side: if the ISR shrinks below min.insync.replicas, produce requests fail with NotEnoughReplicasException; durability is protected at the cost of availability.
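The broker-side check behind that behavior is essentially a predicate, sketched here:

```java
// Sketch of the broker-side check for acks=all: the write is accepted only
// when the current ISR size is at least min.insync.replicas.
public class IsrCheckSketch {
    static boolean acceptWrite(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(acceptWrite(3, 2)); // healthy ISR: write accepted
        System.out.println(acceptWrite(1, 2)); // ISR shrank: producer gets an error
    }
}
```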
Confusion #6 — Retention: time-based vs size-based vs compaction
3 ways Kafka trims old data:
| Policy | Config | Use case |
|---|---|---|
| Time | retention.ms=604800000 (7 days) | Event logs |
| Size | retention.bytes=1073741824 (1 GB) | Bounded disk |
| Compaction | cleanup.policy=compact | Keep LATEST value per key (like upsert) |
Compaction example:
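A sketch of what the compaction cleaner does: replay the log in order and keep only the latest value per key, like an upsert into a table. The keys and values below are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Log compaction sketch: later records for the same key supersede earlier ones.
public class CompactionSketch {
    record Record(String key, String value) {}

    static Map<String, String> compact(List<Record> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record r : log) latest.put(r.key(), r.value()); // later value wins
        return latest;
    }

    public static void main(String[] args) {
        List<Record> log = List.of(
            new Record("user-1", "email=a@x.com"),
            new Record("user-2", "email=b@x.com"),
            new Record("user-1", "email=new@x.com")); // supersedes the first user-1
        System.out.println(compact(log)); // one entry per key, latest value kept
    }
}
```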
Use compaction for: user profiles, config tables, CDC — anywhere you want "current state."
Confusion #7 — Zookeeper vs KRaft
Kafka used to need Zookeeper for cluster metadata. Kafka 3.3+ ships with KRaft (Kafka Raft), eliminating Zookeeper.
| | Zookeeper | KRaft |
|---|---|---|
| Metadata store | Separate ensemble | Embedded in Kafka brokers |
| Ops complexity | 2 clusters to manage | 1 cluster |
| Status | Deprecated (removal in 4.0) | Production-ready (3.3+) |
Interview note: "modern Kafka deployments use KRaft; Zookeeper is being retired" is the right answer in 2026.
Confusion #8 — Kafka vs RabbitMQ vs Kinesis vs Pulsar
Classic system-design question.
| | Kafka | RabbitMQ | Kinesis | Pulsar |
|---|---|---|---|---|
| Model | Log (replay-able) | Queue (delete-on-ack) | Log (AWS-managed) | Log + queue hybrid |
| Retention | Days-weeks | Until consumed (delete-on-ack) | 24h default (up to 365d) | Unlimited (tiered) |
| Throughput | Very high | Medium | High | Very high |
| Ordering | Per-partition | Per-queue | Per-shard | Per-partition |
| Use case | Event streaming | Task queues | AWS-native streaming | Multi-tenant streaming |
Rule: log semantics + replay + high throughput → Kafka. Simple task queue with priorities/routing → RabbitMQ.
SECTION 2 — PRACTICE LABS
Lab 1: Single-node Kafka in Docker (15 mins)
# docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3'
services:
  kafka:
    image: apache/kafka:3.7.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@localhost:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
EOF
docker compose up -d
# Create topic with 3 partitions
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-topics.sh --create \
--topic orders --partitions 3 --replication-factor 1 \
--bootstrap-server localhost:9092
# Output: Created topic orders.
# Produce
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-producer.sh \
--topic orders --bootstrap-server localhost:9092
> {"id":1,"amount":100}
> {"id":2,"amount":200}
> ^C
# Consume from beginning
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --from-beginning \
--bootstrap-server localhost:9092
# Output:
# {"id":1,"amount":100}
# {"id":2,"amount":200}
What you proved: you can create topics, produce, and consume without writing any Java code.
Lab 2: Observe partitioning by key (10 mins)
# Produce with keys
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-producer.sh \
--topic orders --bootstrap-server localhost:9092 \
--property "parse.key=true" --property "key.separator=:"
> alice:order-1
> bob:order-2
> alice:order-3
> carol:order-4
> alice:order-5
> ^C
# Consume with partition info
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --from-beginning \
--bootstrap-server localhost:9092 \
--property "print.partition=true" \
--property "print.key=true"
# Sample output:
# Partition:1 alice order-1
# Partition:1 alice order-3
# Partition:1 alice order-5
# Partition:2 bob order-2
# Partition:0 carol order-4
What you proved: all alice records went to the SAME partition → ordered processing for that key is guaranteed.
Lab 3: Consumer group rebalancing (10 mins)
# Terminal 1: consumer-1 in group "billing"
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --from-beginning \
--group billing \
--bootstrap-server localhost:9092
# Terminal 2: describe the group — see partition assignment
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-consumer-groups.sh \
--describe --group billing \
--bootstrap-server localhost:9092
# Output shows consumer-1 owns ALL 3 partitions
# Terminal 3: start consumer-2 in SAME group
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --group billing \
--bootstrap-server localhost:9092
# Re-run describe: now partitions are SPLIT (rebalance happened)
What you proved: adding/removing consumers triggers rebalance — Kafka redistributes partitions automatically.
SECTION 3 — LIVE VISUAL ANIMATIONS
Animation 1: Producer → Broker → Consumer flow
PRODUCER                                              CONSUMER GROUP
    │                                                       │
    │ 1. send(topic=orders, key=alice, val=...)             │
    ▼                                                       │
┌──────────────────────────────────────────┐                │
│          KAFKA BROKER CLUSTER            │                │
│                                          │                │
│  Topic: orders                           │                │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐   │                │
│  │   P0    │  │   P1    │  │   P2    │   │                │
│  │offset 0 │  │offset 0 │  │offset 0 │   │◀───────────────┤
│  │offset 1 │  │offset 1 │  │offset 1 │   │     poll()     │
│  │offset 2 │  │offset 2 │  │  ...    │   │                │
│  └─────────┘  └─────────┘  └─────────┘   │                │
│  Leader:B1    Leader:B2    Leader:B3     │                │
└──────────────────────────────────────────┘                │
                                                            │
                                             After processing:
                                             commit offset 2 for P0
Animation 2: ISR shrink & re-election
Animation 3: Log compaction cleaner
SECTION 4 — GOTCHAS (REAL PRODUCTION FAILURES)
Gotcha 1: Increasing partitions rehashes keys
If you go from 4 to 8 partitions, hash(key) % 8 ≠ hash(key) % 4. Consumers that relied on per-key ordering will break.
Fix: plan partitions ahead; or drain before rescaling.
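This gotcha can be checked directly in code, again using `String.hashCode` as a stand-in for Kafka's murmur2:

```java
// Demonstrates why growing the partition count breaks per-key ordering:
// hash % 4 and hash % 8 generally disagree for the same key.
public class RehashSketch {
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[]{"alice", "bob", "carol", "dave"}) {
            System.out.println(key + ": 4 partitions -> " + partitionFor(key, 4)
                    + ", 8 partitions -> " + partitionFor(key, 8));
        }
    }
}
```

Records for a key that move from one partition to another end up interleaved across two logs, so any consumer relying on per-key order sees old and new records out of sequence.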
Gotcha 2: acks=1 = silent data loss
Producer waits only for leader ACK. If leader crashes before replicating → message lost.
Fix: acks=all + min.insync.replicas=2 + replication.factor=3.
Gotcha 3: Consumer lag growing = starved partition
kafka-consumer-groups.sh --describe shows LAG. If lag keeps climbing → consumers too slow OR stuck partition.
Fix: add consumers (up to # partitions), parallelize processing, or scale partitions if bottleneck.
Gotcha 4: auto.offset.reset=latest loses historical data on new consumer
A new consumer starts at the END of the log by default. Set to earliest to replay.
Gotcha 5: Long processing blocks heartbeat → rebalance storm
max.poll.interval.ms (default 5 min). If processing one batch takes longer, consumer is kicked → rebalance.
Fix: reduce max.poll.records OR pause-resume pattern OR increase timeout.
Gotcha 6: Messages bigger than max.message.bytes = producer error
Default 1 MB. Large payloads → RecordTooLargeException.
Fix: increase broker's message.max.bytes + topic's max.message.bytes + consumer's fetch.max.bytes (all 3 must align).
SECTION 5 — TIMED MOCK INTERVIEW (45 MIN)
Q1 (8 min) — "Design an event-driven order system: user places order → inventory reserves → payment charges → shipping dispatches"
Answer structure:
- Topic orders partitioned by customer_id (ordering per customer)
- Services as independent consumer groups: inventory-svc, payment-svc, shipping-svc — each reads the orders topic independently
- Each service writes its outcome to its own topic: inventory-events, payment-events
- Shipping waits on payment-events + inventory-events (join)
- Use idempotent producers + manual offset commits (at-least-once)
- Dead-letter topic for failed processing
Q2 (6 min) — "How do you guarantee exactly-once from producer to downstream system?"
- Producer: enable.idempotence=true (deduplicates on broker side by PID + sequence)
- Producer: wrap multiple sends in beginTransaction() / commitTransaction()
- Consumer: isolation.level=read_committed
- End consumer to external store: needs idempotent upsert (e.g., primary key in DB) OR Kafka Streams EOS
- If downstream is NOT transactional (plain HTTP/DB), EOS is bounded by the consumer's idempotency
Q3 (5 min) — "What's the tradeoff between many small partitions vs few large ones?"
- Many partitions: more parallelism and more concurrent consumers possible, but more open file handles, more metadata, longer leader elections, longer rebalances.
- Few partitions: simpler and faster to rebalance, but lower max throughput and fewer concurrent consumers possible.
- Rule of thumb: target ~25-50 MB/s of throughput per partition, so total partitions ≈ target throughput / 25 MB/s.
Q4 (4 min) — "How does Kafka achieve high throughput?"
- Sequential disk writes (append-only log) — nearly as fast as RAM
- Zero-copy transfer (sendfile syscall) from page cache → network
- Batch compression (producer side)
- Partition parallelism
- OS page cache (Kafka doesn't maintain its own cache)
Q5 (4 min) — "Consumer rebalance is causing 30-second pauses. How to fix?"
- Use the cooperative sticky assignor (partition.assignment.strategy=CooperativeStickyAssignor) — incremental rebalance, doesn't stop all consumers
- Tune session.timeout.ms + heartbeat.interval.ms
- Static membership (group.instance.id=xxx) — temporary disconnects don't trigger a full rebalance
- Avoid frequent scaling up/down
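Those fixes translate into consumer properties like the following sketch. The keys are the standard Kafka consumer config names; the values (instance id, timeout) are illustrative starting points, not recommendations:

```java
import java.util.Properties;

// Consumer settings that soften rebalances. Keys are the standard Kafka
// consumer config names; the values here are illustrative placeholders.
public class RebalanceConfigSketch {
    static Properties props() {
        Properties p = new Properties();
        // Incremental rebalance: only reassigned partitions are revoked
        p.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Static membership: a quick restart is not treated as a member change
        p.setProperty("group.instance.id", "billing-consumer-1");
        p.setProperty("session.timeout.ms", "45000");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(props());
    }
}
```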
SECTION 6 — FINAL READINESS CHECKLIST
- Can I draw topic → partition → replica → leader/follower?
- Do I know how the partition is chosen for a record (key hash / sticky)?
- Can I explain consumer groups + the "1 partition per consumer per group" rule?
- Do I know at-most/at-least/exactly-once and the config for each?
- Can I explain ISR, acks=all, min.insync.replicas?
- Do I know the 3 retention policies (time, size, compaction)?
- Can I compare Kafka vs Kinesis vs RabbitMQ?
- Can I diagnose "consumer lag growing" + 3 fixes?
- Do I know Zookeeper vs KRaft status in 2026?
- Can I explain why Kafka is fast (sequential I/O + zero-copy + batching)?
If all 10 = YES, you're Kafka interview-ready.
Remember TOPIC-PARTITION-OFFSET. Everything else is details.