Kafka — Confusions, Labs, Gotchas & Mock Interview
🧠 Memory Map: TOPIC-PARTITION-OFFSET
Kafka is just 3 ideas stacked. Remember TPO:
| Letter | Pillar | What it controls |
|---|---|---|
| T | Topic | Named stream of records (like a table) |
| P | Partition | Unit of parallelism + ordering (like a shard) |
| O | Offset | Position within a partition (like a row number) |
Draw these 3 on a whiteboard, add producers on the left and consumers on the right, and you've explained Kafka.
SECTION 1 — TOP 8 CONFUSIONS CLEARED
Confusion #1 — Topic vs Partition vs Replica
| Concept | What it is | Example |
|---|---|---|
| Topic | Logical stream name | orders |
| Partition | Physical log file (shard) | orders-0, orders-1, orders-2 |
| Replica | Copy of a partition on a different broker | orders-0 on brokers 1, 2, 3 |
Key rule: ordering is guaranteed within a partition, not across the topic. A total order across the whole topic would require a single partition, which kills parallelism.
Interview one-liner: "Topic = logical stream. Partition = physical log + unit of parallelism. Replica = fault-tolerance copy."
Confusion #2 — Partition Key vs Producer Partitioner vs Sticky Partitioner
How does a producer decide WHICH partition a record goes to?
| Scenario | Rule |
|---|---|
| key != null | partition = hash(key) % num_partitions — same key → same partition (ordered!) |
| key == null (old clients) | Round-robin across partitions |
| key == null (Kafka 2.4+) | Sticky partitioner — sticks to one partition per batch, rotates on batch full |
Why sticky: fewer but bigger batches → less network overhead → higher throughput.
Example:
// All orders for customer 42 land on same partition → ordered processing
producer.send(new ProducerRecord<>("orders", "customer-42", orderJson));
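The key-to-partition mechanics can be sketched in a few lines. This is an illustration only: Kafka's real default partitioner hashes the serialized key bytes with murmur2, whereas this sketch uses `String.hashCode` as a stand-in.

```java
// Sketch of key-based partition selection. Kafka's default partitioner
// actually applies murmur2 to the serialized key bytes; String.hashCode
// is a stand-in to show the mechanics.
public class PartitionerSketch {
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Same key -> same partition, every time
        System.out.println(partitionFor("customer-42", 3));
        System.out.println(partitionFor("customer-42", 3));
        System.out.println(partitionFor("customer-7", 3));
    }
}
```

Whatever the hash function, the property that matters is determinism: two sends with the same key always resolve to the same partition, which is what makes per-key ordering possible.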
Confusion #3 — Consumer Group vs Consumer
One of the most-asked Kafka questions.
Rules:
- Each partition is assigned to exactly ONE consumer within a group.
- Multiple groups can read the same topic INDEPENDENTLY.
- If #consumers > #partitions → extra consumers sit idle.
- If #consumers < #partitions → some consumers get multiple partitions.
Interview one-liner: "Partitions split work WITHIN a group. Groups replay the same topic INDEPENDENTLY."
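The rules above can be simulated in a few lines. This is a round-robin sketch of the assignment (Kafka's actual default assignors, range and cooperative-sticky, distribute differently, but the "one partition, one consumer per group" invariant is the same):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Round-robin sketch of spreading partitions across a consumer group.
public class GroupAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String c : consumers) out.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            // Each partition goes to exactly ONE consumer in the group
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // 4 consumers, 3 partitions: one consumer sits idle
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3));
        // 2 consumers, 3 partitions: one consumer owns two partitions
        System.out.println(assign(List.of("c1", "c2"), 3));
    }
}
```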
Confusion #4 — At-most-once vs At-least-once vs Exactly-once
| Delivery semantic | What happens | How to achieve |
|---|---|---|
| At-most-once | Commit offset BEFORE processing. Crash = message lost. | enable.auto.commit=true, fast commit |
| At-least-once (default) | Process THEN commit. Crash mid-commit = redelivery. | Manual commit after processing |
| Exactly-once | No duplicates, no loss. | Idempotent producer + transactions (EOS) |
EOS config:
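A minimal producer-side sketch, using the standard Kafka client property names (the transactional.id value here is a placeholder you would choose yourself):

```java
import java.util.Properties;

// Producer-side exactly-once settings. The property keys are the standard
// Kafka client config names; "orders-tx-1" is an illustrative placeholder.
public class EosConfigSketch {
    static Properties eosProducerProps() {
        Properties p = new Properties();
        p.setProperty("enable.idempotence", "true");      // broker dedupes by PID + sequence
        p.setProperty("acks", "all");                     // required with idempotence
        p.setProperty("transactional.id", "orders-tx-1"); // enables transactions
        return p;
    }

    public static void main(String[] args) {
        System.out.println(eosProducerProps());
    }
}
```

On the consumer side this pairs with isolation.level=read_committed, so uncommitted transactional writes are never seen.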
Catch: EOS only holds end-to-end when the producer uses transactions AND the consumer reads with isolation.level=read_committed (Kafka Streams wires both up for you).
Confusion #5 — Leader vs Follower vs ISR (In-Sync Replica)
  Broker 1            Broker 2            Broker 3
┌───────────┐       ┌───────────┐       ┌───────────┐
│ P0 LEADER │       │ P0 replica│       │ P0 replica│
│ (reads +  │       │ (follower)│       │ (follower)│
│  writes)  │       │           │       │           │
└───────────┘       └───────────┘       └───────────┘
      ▲                   │                   │
      │                   │ fetch             │ fetch
      └───────────────────┴───────────────────┘
| Term | Role |
|---|---|
| Leader | Handles all reads/writes for a partition |
| Follower | Pulls data from the leader; an in-sync follower is elected leader if the current leader dies |
| ISR | Replicas that are FULLY caught up (within replica.lag.time.max.ms) |
min.insync.replicas=2 + acks=all = producer waits until at least 2 in-sync replicas ACK. The flip side: if the ISR shrinks below min.insync.replicas, produce requests fail with NotEnoughReplicasException; durability is protected at the cost of availability.
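The broker-side check behind that behavior is essentially a predicate, sketched here:

```java
// Sketch of the broker-side check for acks=all: the write is accepted only
// when the current ISR size is at least min.insync.replicas.
public class IsrCheckSketch {
    static boolean acceptWrite(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(acceptWrite(3, 2)); // healthy ISR: write accepted
        System.out.println(acceptWrite(1, 2)); // ISR shrank: producer gets an error
    }
}
```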
Confusion #6 — Retention: time-based vs size-based vs compaction
3 ways Kafka trims old data:
| Policy | Config | Use case |
|---|---|---|
| Time | retention.ms=604800000 (7 days) | Event logs |
| Size | retention.bytes=1073741824 (1 GB) | Bounded disk |
| Compaction | cleanup.policy=compact | Keep LATEST value per key (like upsert) |
Compaction example:
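A sketch of what the compaction cleaner does: replay the log in order and keep only the latest value per key, like an upsert into a table. The keys and values below are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Log compaction sketch: later records for the same key supersede earlier ones.
public class CompactionSketch {
    record Record(String key, String value) {}

    static Map<String, String> compact(List<Record> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record r : log) latest.put(r.key(), r.value()); // later value wins
        return latest;
    }

    public static void main(String[] args) {
        List<Record> log = List.of(
            new Record("user-1", "email=a@x.com"),
            new Record("user-2", "email=b@x.com"),
            new Record("user-1", "email=new@x.com")); // supersedes the first user-1
        System.out.println(compact(log)); // one entry per key, latest value kept
    }
}
```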
Use compaction for: user profiles, config tables, CDC — anywhere you want "current state."
Confusion #7 — Zookeeper vs KRaft
Kafka used to need Zookeeper for cluster metadata. Kafka 3.3+ ships with KRaft (Kafka Raft), eliminating Zookeeper.
| | Zookeeper | KRaft |
|---|---|---|
| Metadata store | Separate ensemble | Embedded in Kafka brokers |
| Ops complexity | 2 clusters to manage | 1 cluster |
| Status | Deprecated (removal in 4.0) | Production-ready (3.3+) |
Interview note: "modern Kafka deployments use KRaft; Zookeeper is being retired" is the right answer in 2026.
Confusion #8 — Kafka vs RabbitMQ vs Kinesis vs Pulsar
Classic system-design question.
| | Kafka | RabbitMQ | Kinesis | Pulsar |
|---|---|---|---|---|
| Model | Log (replay-able) | Queue (delete-on-ack) | Log (AWS-managed) | Log + queue hybrid |
| Retention | Days-weeks | Until consumed (delete-on-ack) | 24h default (up to 365d) | Unlimited (tiered) |
| Throughput | Very high | Medium | High | Very high |
| Ordering | Per-partition | Per-queue | Per-shard | Per-partition |
| Use case | Event streaming | Task queues | AWS-native streaming | Multi-tenant streaming |
Rule: log semantics + replay + high throughput → Kafka. Simple task queue with priorities/routing → RabbitMQ.
SECTION 2 — PRACTICE LABS
Lab 1: Single-node Kafka in Docker (15 mins)
# docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3'
services:
  kafka:
    image: apache/kafka:3.7.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@localhost:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
EOF
docker compose up -d
# Create topic with 3 partitions
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-topics.sh --create \
--topic orders --partitions 3 --replication-factor 1 \
--bootstrap-server localhost:9092
# Output: Created topic orders.
# Produce
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-producer.sh \
--topic orders --bootstrap-server localhost:9092
> {"id":1,"amount":100}
> {"id":2,"amount":200}
> ^C
# Consume from beginning
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --from-beginning \
--bootstrap-server localhost:9092
# Output:
# {"id":1,"amount":100}
# {"id":2,"amount":200}
What you proved: you can create topics, produce, and consume without writing any Java code.
Lab 2: Observe partitioning by key (10 mins)
# Produce with keys
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-producer.sh \
--topic orders --bootstrap-server localhost:9092 \
--property "parse.key=true" --property "key.separator=:"
> alice:order-1
> bob:order-2
> alice:order-3
> carol:order-4
> alice:order-5
> ^C
# Consume with partition info
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --from-beginning \
--bootstrap-server localhost:9092 \
--property "print.partition=true" \
--property "print.key=true"
# Sample output:
# Partition:1 alice order-1
# Partition:1 alice order-3
# Partition:1 alice order-5
# Partition:2 bob order-2
# Partition:0 carol order-4
What you proved: all alice records went to the SAME partition → ordered processing for that key is guaranteed.
Lab 3: Consumer group rebalancing (10 mins)
# Terminal 1: consumer-1 in group "billing"
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --from-beginning \
--group billing \
--bootstrap-server localhost:9092
# Terminal 2: describe the group — see partition assignment
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-consumer-groups.sh \
--describe --group billing \
--bootstrap-server localhost:9092
# Output shows consumer-1 owns ALL 3 partitions
# Terminal 3: start consumer-2 in SAME group
docker exec -it $(docker ps -qf name=kafka) \
/opt/kafka/bin/kafka-console-consumer.sh \
--topic orders --group billing \
--bootstrap-server localhost:9092
# Re-run describe: now partitions are SPLIT (rebalance happened)
What you proved: adding/removing consumers triggers rebalance — Kafka redistributes partitions automatically.
SECTION 3 — LIVE VISUAL ANIMATIONS
Animation 1: Producer → Broker → Consumer flow
PRODUCER                                              CONSUMER GROUP
    │                                                       │
    │ 1. send(topic=orders, key=alice, val=...)             │
    ▼                                                       │
┌──────────────────────────────────────────┐                │
│          KAFKA BROKER CLUSTER            │                │
│                                          │                │
│  Topic: orders                           │                │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐   │                │
│  │   P0    │  │   P1    │  │   P2    │   │                │
│  │offset 0 │  │offset 0 │  │offset 0 │   │◀───────────────┤
│  │offset 1 │  │offset 1 │  │offset 1 │   │     poll()     │
│  │offset 2 │  │offset 2 │  │  ...    │   │                │
│  └─────────┘  └─────────┘  └─────────┘   │                │
│  Leader:B1    Leader:B2    Leader:B3     │                │
└──────────────────────────────────────────┘                │
                                                            │
                                             After processing:
                                             commit offset 2 for P0
Animation 2: ISR shrink & re-election
Animation 3: Log compaction cleaner
SECTION 4 — GOTCHAS (REAL PRODUCTION FAILURES)
Gotcha 1: Increasing partitions rehashes keys
If you go from 4 to 8 partitions, hash(key) % 8 ≠ hash(key) % 4. Consumers that relied on per-key ordering will break.
Fix: plan partitions ahead; or drain before rescaling.
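This gotcha can be checked directly in code, again using `String.hashCode` as a stand-in for Kafka's murmur2:

```java
// Demonstrates why growing the partition count breaks per-key ordering:
// hash % 4 and hash % 8 generally disagree for the same key.
public class RehashSketch {
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[]{"alice", "bob", "carol", "dave"}) {
            System.out.println(key + ": 4 partitions -> " + partitionFor(key, 4)
                    + ", 8 partitions -> " + partitionFor(key, 8));
        }
    }
}
```

Records for a key that move from one partition to another end up interleaved across two logs, so any consumer relying on per-key order sees old and new records out of sequence.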
Gotcha 2: acks=1 = silent data loss
Producer waits only for leader ACK. If leader crashes before replicating → message lost.
Fix: acks=all + min.insync.replicas=2 + replication.factor=3.
Gotcha 3: Consumer lag growing = starved partition
kafka-consumer-groups.sh --describe shows LAG. If lag keeps climbing → consumers too slow OR stuck partition.
Fix: add consumers (up to # partitions), parallelize processing, or scale partitions if bottleneck.
Gotcha 4: auto.offset.reset=latest loses historical data on new consumer
A new consumer starts at the END of the log by default. Set to earliest to replay.
Gotcha 5: Long processing blocks heartbeat → rebalance storm
max.poll.interval.ms (default 5 min). If processing one batch takes longer, consumer is kicked → rebalance.
Fix: reduce max.poll.records OR pause-resume pattern OR increase timeout.
Gotcha 6: Messages bigger than max.message.bytes = producer error
Default 1 MB. Large payloads → RecordTooLargeException.
Fix: increase broker's message.max.bytes + topic's max.message.bytes + consumer's fetch.max.bytes (all 3 must align).
SECTION 5 — TIMED MOCK INTERVIEW (45 MIN)
Q1 (8 min) — "Design an event-driven order system: user places order → inventory reserves → payment charges → shipping dispatches"
Answer structure:
- Topic orders partitioned by customer_id (ordering per customer)
- Services as independent consumer groups: inventory-svc, payment-svc, shipping-svc — each reads the orders topic independently
- Each service writes its outcome to its own topic: inventory-events, payment-events
- Shipping waits on payment-events + inventory-events (join)
- Use idempotent producers + manual offset commits (at-least-once)
- Dead-letter topic for failed processing
Q2 (6 min) — "How do you guarantee exactly-once from producer to downstream system?"
- Producer: enable.idempotence=true (deduplicates on broker side by PID + sequence)
- Producer: wrap multiple sends in beginTransaction() / commitTransaction()
- Consumer: isolation.level=read_committed
- End consumer to external store: needs idempotent upsert (e.g., primary key in DB) OR Kafka Streams EOS
- If downstream is NOT transactional (plain HTTP/DB), EOS is bounded by the consumer's idempotency
Q3 (5 min) — "What's the tradeoff between many small partitions vs few large ones?"
- Many partitions: more parallelism and more concurrent consumers possible, but more open file handles, more metadata, longer leader elections, longer rebalances.
- Few partitions: simpler and faster to rebalance, but lower max throughput and fewer concurrent consumers possible.
- Rule of thumb: target ~25-50 MB/s of throughput per partition, so total partitions ≈ target throughput / 25 MB/s.
Q4 (4 min) — "How does Kafka achieve high throughput?"
- Sequential disk writes (append-only log) — nearly as fast as RAM
- Zero-copy transfer (sendfile syscall) from page cache → network
- Batch compression (producer side)
- Partition parallelism
- OS page cache (Kafka doesn't maintain its own cache)
Q5 (4 min) — "Consumer rebalance is causing 30-second pauses. How to fix?"
- Use the cooperative sticky assignor (partition.assignment.strategy=CooperativeStickyAssignor) — incremental rebalance, doesn't stop all consumers
- Tune session.timeout.ms + heartbeat.interval.ms
- Static membership (group.instance.id=xxx) — temporary disconnects don't trigger a full rebalance
- Avoid frequent scaling up/down
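Those fixes translate into consumer properties like the following sketch. The keys are the standard Kafka consumer config names; the values (instance id, timeout) are illustrative starting points, not recommendations:

```java
import java.util.Properties;

// Consumer settings that soften rebalances. Keys are the standard Kafka
// consumer config names; the values here are illustrative placeholders.
public class RebalanceConfigSketch {
    static Properties props() {
        Properties p = new Properties();
        // Incremental rebalance: only reassigned partitions are revoked
        p.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Static membership: a quick restart is not treated as a member change
        p.setProperty("group.instance.id", "billing-consumer-1");
        p.setProperty("session.timeout.ms", "45000");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(props());
    }
}
```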
SECTION 6 — FINAL READINESS CHECKLIST
- Can I draw topic → partition → replica → leader/follower?
- Do I know how the partition is chosen for a record (key hash / sticky)?
- Can I explain consumer groups + the "1 partition per consumer per group" rule?
- Do I know at-most/at-least/exactly-once and the config for each?
- Can I explain ISR, acks=all, min.insync.replicas?
- Do I know the 3 retention policies (time, size, compaction)?
- Can I compare Kafka vs Kinesis vs RabbitMQ?
- Can I diagnose "consumer lag growing" + 3 fixes?
- Do I know Zookeeper vs KRaft status in 2026?
- Can I explain why Kafka is fast (sequential I/O + zero-copy + batching)?
If all 10 = YES, you're Kafka interview-ready.
Remember TOPIC-PARTITION-OFFSET. Everything else is details.