📨 Kafka — Confusions, Labs, Gotchas & Mock Interview

💡 Interview Tip
The video-free pack. Read this end-to-end and you can walk into any Kafka interview without opening YouTube.

🧠 Memory Map: TOPIC-PARTITION-OFFSET

Kafka is just 3 ideas stacked. Remember TPO:

| Letter | Pillar | What it controls |
| --- | --- | --- |
| T | Topic | Named stream of records (like a table) |
| P | Partition | Unit of parallelism + ordering (like a shard) |
| O | Offset | Position within a partition (like a row number) |

Draw these 3 on a whiteboard, add producers on the left and consumers on the right, and you've explained Kafka.

SECTION 1 — TOP 8 CONFUSIONS CLEARED

Confusion #1 — Topic vs Partition vs Replica

| Concept | What it is | Example |
| --- | --- | --- |
| Topic | Logical stream name | orders |
| Partition | Physical log file (shard) | orders-0, orders-1, orders-2 |
| Replica | Copy of a partition on a different broker | orders-0 on brokers 1, 2, 3 |

Key rule: ordering is guaranteed within a partition, not across the topic. Total order across topic = impossible at scale.

Interview one-liner: "Topic = logical stream. Partition = physical log + unit of parallelism. Replica = fault-tolerance copy."

Confusion #2 — Partition Key vs Producer Partitioner vs Sticky Partitioner

How does a producer decide WHICH partition a record goes to?

| Scenario | Rule |
| --- | --- |
| key != null | partition = hash(key) % num_partitions — same key → same partition (ordered!) |
| key == null (old clients) | Round-robin across partitions |
| key == null (Kafka 2.4+) | Sticky partitioner — sticks to one partition per batch, rotates when a batch fills |

Why sticky: fewer but bigger batches → less network overhead → higher throughput.

Example:

java
// All orders for customer 42 land on same partition → ordered processing
producer.send(new ProducerRecord<>("orders", "customer-42", orderJson));

Confusion #3 — Consumer Group vs Consumer

One of the most-asked Kafka questions.

🧠 Memory Map
Topic: orders (4 partitions: P0, P1, P2, P3)
Consumer Group "billing":
  consumer-A reads P0, P1
  consumer-B reads P2, P3
Consumer Group "analytics": ← independent, reads everything again
  consumer-X reads P0, P1, P2, P3

Rules:

  1. Each partition assigned to exactly ONE consumer within a group.
  2. Multiple groups can read the same topic INDEPENDENTLY.
  3. If #consumers > #partitions → extra consumers sit idle.
  4. If #consumers < #partitions → some consumers get multiple partitions.

Interview one-liner: "Partitions split work WITHIN a group. Groups replay the same topic INDEPENDENTLY."
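The four rules above can be sketched as a toy range-style assignor. The real logic lives in the clients library (RangeAssignor and friends); this is illustrative only, assuming partitions and consumers arrive as simple lists:

```java
import java.util.*;

// Toy range-style assignment: consumers get contiguous blocks of
// partitions; if #consumers > #partitions, the extras get nothing
// and sit idle, exactly as rule 3 says.
public class GroupAssignment {
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        int per = numPartitions / consumers.size();
        int extra = numPartitions % consumers.size();
        int next = 0;
        for (int i = 0; i < consumers.size(); i++) {
            int count = per + (i < extra ? 1 : 0); // first `extra` consumers get one more
            List<Integer> mine = new ArrayList<>();
            for (int j = 0; j < count; j++) mine.add(next++);
            out.put(consumers.get(i), mine);
        }
        return out;
    }

    public static void main(String[] args) {
        // 4 partitions over 2 consumers -> 2 each (matches the memory map)
        System.out.println(assign(4, List.of("consumer-A", "consumer-B")));
        // 2 partitions over 3 consumers -> one consumer idle
        System.out.println(assign(2, List.of("X", "Y", "Z")));
    }
}
```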

Confusion #4 — At-most-once vs At-least-once vs Exactly-once

| Delivery semantic | What happens | How to achieve |
| --- | --- | --- |
| At-most-once | Commit offset BEFORE processing. Crash = message lost. | enable.auto.commit=true, fast commit |
| At-least-once (default) | Process THEN commit. Crash mid-commit = redelivery. | Manual commit after processing |
| Exactly-once | No duplicates, no loss. | Idempotent producer + transactions (EOS) |

EOS config:

# Producer
enable.idempotence=true
transactional.id=orders-producer-1
# Consumer
isolation.level=read_committed

Catch: EOS only works end-to-end when both producer AND consumer use transactions (e.g., Kafka Streams).
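A simplified model of what enable.idempotence=true buys you broker-side: the broker remembers the last sequence number per producer ID and drops retried duplicates. Real Kafka tracks this per partition with producer epochs; the sketch below is illustrative only:

```java
import java.util.*;

// Illustrative broker-side dedup: accept a record only if its sequence
// number advances past the last one seen for that producer ID. A network
// retry of an already-appended batch carries an old sequence -> dropped.
public class IdempotenceSketch {
    private final Map<Long, Integer> lastSeq = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    public boolean append(long producerId, int seq, String value) {
        int last = lastSeq.getOrDefault(producerId, -1);
        if (seq <= last) return false;   // duplicate retry -> silently dropped
        lastSeq.put(producerId, seq);
        log.add(value);
        return true;
    }

    public List<String> log() { return log; }

    public static void main(String[] args) {
        IdempotenceSketch broker = new IdempotenceSketch();
        broker.append(42L, 0, "order-1");
        broker.append(42L, 1, "order-2");
        broker.append(42L, 1, "order-2"); // retry of the same batch
        System.out.println(broker.log()); // [order-1, order-2] — no duplicate
    }
}
```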

Confusion #5 — Leader vs Follower vs ISR (In-Sync Replica)

📐 Architecture Diagram
Broker 1        Broker 2        Broker 3
┌──────────┐   ┌──────────┐   ┌──────────┐
│ P0 LEADER│   │P0 replica│   │P0 replica│
│  (reads  │   │(follower)│   │(follower)│
│   writes)│   │          │   │          │
└──────────┘   └──────────┘   └──────────┘
     ▲              │               │
     │              │ fetch         │ fetch
     └──────────────┴───────────────┘
| Term | Role |
| --- | --- |
| Leader | Handles all reads/writes for a partition |
| Follower | Pulls data from leader; becomes leader if current leader dies |
| ISR | Replicas that are FULLY caught up (within replica.lag.time.max.ms) |

min.insync.replicas=2 + acks=all = producer waits until at least 2 in-sync replicas ACK. The catch: if the ISR shrinks below 2, produces fail with NotEnoughReplicasException — durability bought at the cost of availability.
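The two checks involved can be sketched in a few lines: ISR membership by replica lag, then the acks=all gate. Names and numbers below are illustrative, not broker internals:

```java
import java.util.*;

// Sketch: recompute the ISR from follower lag, then decide whether an
// acks=all produce can be accepted under min.insync.replicas.
public class IsrSketch {
    // A replica stays in the ISR while its lag is within replica.lag.time.max.ms.
    static List<String> isr(Map<String, Long> lagMsByReplica, long replicaLagTimeMaxMs) {
        List<String> in = new ArrayList<>();
        for (var e : lagMsByReplica.entrySet())
            if (e.getValue() <= replicaLagTimeMaxMs) in.add(e.getKey());
        return in;
    }

    // acks=all succeeds only while |ISR| >= min.insync.replicas.
    static boolean canProduceAcksAll(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        Map<String, Long> lags = new LinkedHashMap<>();
        lags.put("broker-1", 0L);      // leader, always caught up
        lags.put("broker-2", 45_000L); // stalled follower
        lags.put("broker-3", 120L);
        List<String> in = isr(lags, 30_000L);                // replica.lag.time.max.ms
        System.out.println(in);                              // [broker-1, broker-3]
        System.out.println(canProduceAcksAll(in.size(), 2)); // true (2 >= 2)
        System.out.println(canProduceAcksAll(1, 2));         // false -> produce fails
    }
}
```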

Confusion #6 — Retention: time-based vs size-based vs compaction

3 ways Kafka trims old data:

| Policy | Config | Use case |
| --- | --- | --- |
| Time | retention.ms=604800000 (7 days) | Event logs |
| Size | retention.bytes=1073741824 (1 GB) | Bounded disk |
| Compaction | cleanup.policy=compact | Keep LATEST value per key (like upsert) |

Compaction example:

Before: (k1,v1) (k2,v2) (k1,v3) (k3,v4) (k1,v5)
After: (k2,v2) (k3,v4) (k1,v5) ← only newest per key kept

Use compaction for: user profiles, config tables, CDC — anywhere you want "current state."
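The "latest value per key" rule fits in a few lines. A toy compactor, assuming records are (key, value) pairs and a null value is a tombstone — real Kafka compacts segment-by-segment and retains tombstones for delete.retention.ms before dropping them:

```java
import java.util.*;

// Toy log compaction: keep only the newest value per key;
// a null value (tombstone) removes the key entirely.
public class CompactionSketch {
    record Record(String key, String value) {}

    static List<Record> compact(List<Record> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record r : log) {
            latest.remove(r.key());                        // forget older value
            if (r.value() != null) latest.put(r.key(), r.value());
        }
        List<Record> out = new ArrayList<>();
        for (var e : latest.entrySet()) out.add(new Record(e.getKey(), e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        List<Record> log = List.of(
            new Record("k1", "v1"), new Record("k2", "v2"),
            new Record("k1", "v3"), new Record("k3", "v4"),
            new Record("k1", "v5"));
        // keeps (k2,v2) (k3,v4) (k1,v5) — matching the example above
        System.out.println(compact(log));
    }
}
```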

Confusion #7 — Zookeeper vs KRaft

Kafka used to need Zookeeper for cluster metadata. Kafka 3.3+ ships with KRaft (Kafka Raft), eliminating Zookeeper.

| | Zookeeper | KRaft |
| --- | --- | --- |
| Metadata store | Separate ensemble | Embedded in Kafka brokers |
| Ops complexity | 2 clusters to manage | 1 cluster |
| Status | Deprecated (removed in 4.0) | Production-ready (3.3+) |

Interview note: "modern Kafka deployments use KRaft, Zookeeper is being retired" is the right answer in 2026.

Confusion #8 — Kafka vs RabbitMQ vs Kinesis vs Pulsar

Classic system-design question.

| | Kafka | RabbitMQ | Kinesis | Pulsar |
| --- | --- | --- | --- | --- |
| Model | Log (replayable) | Queue (delete-on-ack) | Log (AWS-managed) | Log + queue hybrid |
| Retention | Days-weeks | Minutes-hours | 24 h default (extendable to 365 d) | Unlimited (tiered) |
| Throughput | Very high | Medium | High | Very high |
| Ordering | Per-partition | Per-queue | Per-shard | Per-partition |
| Use case | Event streaming | Task queues | AWS-native streaming | Multi-tenant streaming |

Rule: log semantics + replay + high throughput → Kafka. Simple task queue with priorities/routing → RabbitMQ.

SECTION 2 — PRACTICE LABS

Lab 1: Single-node Kafka in Docker (15 mins)

bash
# docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3'
services:
  kafka:
    image: apache/kafka:3.7.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@localhost:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
EOF

docker compose up -d

# Create topic with 3 partitions
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-topics.sh --create \
  --topic orders --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092
# Output: Created topic orders.

# Produce
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-console-producer.sh \
  --topic orders --bootstrap-server localhost:9092
> {"id":1,"amount":100}
> {"id":2,"amount":200}
> ^C

# Consume from beginning
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-console-consumer.sh \
  --topic orders --from-beginning \
  --bootstrap-server localhost:9092
# Output:
# {"id":1,"amount":100}
# {"id":2,"amount":200}

What you proved: you can create topics, produce, and consume without writing any Java code.

Lab 2: Observe partitioning by key (10 mins)

bash
# Produce with keys
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-console-producer.sh \
  --topic orders --bootstrap-server localhost:9092 \
  --property "parse.key=true" --property "key.separator=:"
> alice:order-1
> bob:order-2
> alice:order-3
> carol:order-4
> alice:order-5
> ^C

# Consume with partition info
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-console-consumer.sh \
  --topic orders --from-beginning \
  --bootstrap-server localhost:9092 \
  --property "print.partition=true" \
  --property "print.key=true"
# Sample output:
# Partition:1  alice  order-1
# Partition:1  alice  order-3
# Partition:1  alice  order-5
# Partition:2  bob    order-2
# Partition:0  carol  order-4

What you proved: all alice records went to the SAME partition → ordered processing for that key is guaranteed.

Lab 3: Consumer group rebalancing (10 mins)

bash
# Terminal 1: consumer-1 in group "billing"
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-console-consumer.sh \
  --topic orders --from-beginning \
  --group billing \
  --bootstrap-server localhost:9092

# Terminal 2: describe the group — see partition assignment
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-consumer-groups.sh \
  --describe --group billing \
  --bootstrap-server localhost:9092
# Output shows consumer-1 owns ALL 3 partitions

# Terminal 3: start consumer-2 in SAME group
docker exec -it $(docker ps -qf name=kafka) \
  /opt/kafka/bin/kafka-console-consumer.sh \
  --topic orders --group billing \
  --bootstrap-server localhost:9092

# Re-run describe: now partitions are SPLIT (rebalance happened)

What you proved: adding/removing consumers triggers rebalance — Kafka redistributes partitions automatically.

SECTION 3 — LIVE VISUAL ANIMATIONS

Animation 1: Producer → Broker → Consumer flow

📐 Architecture Diagram
PRODUCER                                         CONSUMER GROUP
   │                                                  │
   │ 1. send(topic=orders, key=alice, val=...)        │
   ▼                                                  │
┌──────────────────────────────────────────┐          │
│        KAFKA BROKER CLUSTER              │          │
│                                          │          │
│  Topic: orders                           │          │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐    │          │
│  │   P0    │ │   P1    │ │   P2    │    │          │
│  │offset 0 │ │offset 0 │ │offset 0 │    │◀─────────┤
│  │offset 1 │ │offset 1 │ │offset 1 │    │   poll() │
│  │offset 2 │ │offset 2 │ │  ...    │    │          │
│  └─────────┘ └─────────┘ └─────────┘    │          │
│   Leader:B1  Leader:B2  Leader:B3        │          │
└──────────────────────────────────────────┘          │
                                                      │
                                          After processing:
                                          commit offset 2 for P0

Animation 2: ISR shrink & re-election

Initial state:
P0 leader = B1, followers = [B2, B3], ISR = {B1, B2, B3}

B2 becomes slow (falls behind for longer than replica.lag.time.max.ms):
Kafka shrinks the ISR → ISR = {B1, B3}
Producer with acks=all + min.insync.replicas=2 still succeeds (|ISR| = 2 ≥ 2)

B1 crashes:
Controller elects a new leader from the ISR → B3 becomes leader for P0
ISR → {B3}
With min.insync.replicas=2, producer writes now FAIL with NotEnoughReplicasException
→ tradeoff: durability vs availability

Animation 3: Log compaction cleaner

Before compaction (segment file):
offset 0: (k=user1, v=alice)
offset 1: (k=user2, v=bob)
offset 2: (k=user1, v=ALICE-UPDATED)
offset 3: (k=user3, v=carol)
offset 4: (k=user1, v=ALICE-FINAL)
offset 5: (k=user2, v=null) ← tombstone (delete)
After compaction:
offset 3: (k=user3, v=carol)
offset 4: (k=user1, v=ALICE-FINAL)
(user2 removed due to tombstone)
Compaction preserves: LATEST value per key (across all segments).
Offsets stay monotonically increasing but may have gaps.

SECTION 4 — GOTCHAS (REAL PRODUCTION FAILURES)

Gotcha 1: Increasing partitions rehashes keys

If you go from 4 to 8 partitions, routing changes from hash(key) % 4 to hash(key) % 8 — the same key can now land on a different partition, so consumers that relied on per-key ordering will break. Fix: plan partition counts ahead, or drain the topic before rescaling.
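A quick demo of the breakage, using String.hashCode() as a stand-in for Kafka's murmur2 key hash (the masking mirrors the client's sign-stripping trick):

```java
// Same key, different partition count -> potentially different partition.
public class RehashDemo {
    static int partitionFor(String key, int numPartitions) {
        // & 0x7fffffff strips the sign bit so the modulo is never negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[]{"alice", "bob", "carol"}) {
            System.out.println(key + ": 4 partitions -> P" + partitionFor(key, 4)
                             + ", 8 partitions -> P" + partitionFor(key, 8));
        }
    }
}
```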

Gotcha 2: acks=1 = silent data loss

Producer waits only for leader ACK. If leader crashes before replicating → message lost. Fix: acks=all + min.insync.replicas=2 + replication.factor=3.

Gotcha 3: Consumer lag growing = starved partition

kafka-consumer-groups.sh --describe shows LAG. If lag keeps climbing → consumers too slow OR stuck partition. Fix: add consumers (up to # partitions), parallelize processing, or scale partitions if bottleneck.
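Lag itself is simple arithmetic over two offsets per partition (in practice you read both from kafka-consumer-groups.sh or the AdminClient; this helper is hypothetical):

```java
// LAG = log-end-offset (latest produced) - committed offset (consumer position).
public class LagSketch {
    static long lag(long logEndOffset, long committedOffset) {
        return logEndOffset - committedOffset;
    }

    public static void main(String[] args) {
        // 1.5M produced, 1.2M committed -> consumer is 300000 records behind
        System.out.println(lag(1_500_000L, 1_200_000L)); // 300000
    }
}
```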

Gotcha 4: auto.offset.reset=latest loses historical data on new consumer

A new consumer starts at the END of the log by default. Set to earliest to replay.

Gotcha 5: Long processing blocks heartbeat → rebalance storm

max.poll.interval.ms (default 5 min). If processing one batch takes longer, the consumer is kicked out of the group → rebalance. Fix: reduce max.poll.records, use a pause/resume pattern, or increase the timeout.

Gotcha 6: Messages bigger than max.message.bytes = producer error

Default 1 MB. Large payloads → RecordTooLargeException. Fix: increase broker's message.max.bytes + topic's max.message.bytes + consumer's fetch.max.bytes (all 3 must align).

SECTION 5 — TIMED MOCK INTERVIEW (45 MIN)

Q1 (8 min) — "Design an event-driven order system: user places order → inventory reserves → payment charges → shipping dispatches"

Answer structure:

  1. Topic orders partitioned by customer_id (ordering per customer)
  2. Services as independent consumer groups: inventory-svc, payment-svc, shipping-svc — each reads orders topic independently
  3. Each service writes its outcome to its own topic: inventory-events, payment-events
  4. Shipping waits on payment-events + inventory-events (join)
  5. Use idempotent producers + manual offset commits (at-least-once)
  6. Dead-letter topic for failed processing

Q2 (6 min) — "How do you guarantee exactly-once from producer to downstream system?"

  • Producer: enable.idempotence=true (deduplicates on broker side by PID+seq)
  • Producer: wrap multiple sends in beginTransaction()/commitTransaction()
  • Consumer: isolation.level=read_committed
  • End consumer to external store: needs idempotent upsert (e.g., primary key in DB) OR Kafka Streams EOS
  • If downstream is NOT transactional (plain HTTP/DB), EOS is bounded by consumer's idempotency

Q3 (5 min) — "What's the tradeoff between many small partitions vs few large ones?"

Many partitions: more parallelism, more consumers possible, but: more open file handles, more metadata, longer leader election, longer rebalance. Few partitions: simpler, faster rebalance, but lower max throughput and fewer concurrent consumers possible. Rule of thumb: target ~25-50 MB/s throughput per partition. Total partitions ≈ target throughput / 25 MB/s.
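The rule of thumb as arithmetic (the 25 MB/s per-partition figure is a rough planning number, not a guarantee — benchmark your own workload):

```java
// Partition count from target throughput, assuming a per-partition budget.
public class PartitionSizing {
    static int partitionsFor(double targetMBps, double perPartitionMBps) {
        return (int) Math.ceil(targetMBps / perPartitionMBps);
    }

    public static void main(String[] args) {
        // 500 MB/s target at ~25 MB/s per partition -> 20 partitions
        System.out.println(partitionsFor(500, 25)); // 20
    }
}
```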

Q4 (4 min) — "How does Kafka achieve high throughput?"

  1. Sequential disk writes (append-only log) — nearly as fast as RAM
  2. Zero-copy transfer (sendfile syscall) from page cache → network
  3. Batch compression (producer side)
  4. Partition parallelism
  5. OS page cache (Kafka doesn't maintain its own cache)

Q5 (4 min) — "Consumer rebalance is causing 30-second pauses. How to fix?"

  • Use cooperative sticky assignor (partition.assignment.strategy=CooperativeStickyAssignor) — incremental rebalance, doesn't stop all consumers
  • Tune session.timeout.ms + heartbeat.interval.ms
  • Static membership (group.instance.id=xxx) — temporary disconnects don't trigger full rebalance
  • Avoid frequent scaling up/down
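The fixes above map to a handful of consumer settings. Values here are illustrative starting points, not universal defaults:

```properties
# Incremental (cooperative) rebalancing — consumers keep unaffected partitions
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
# Static membership — a restart/reconnect does not trigger a full rebalance
group.instance.id=billing-consumer-1
# Heartbeat well inside the session timeout
session.timeout.ms=45000
heartbeat.interval.ms=3000
```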

SECTION 6 — FINAL READINESS CHECKLIST

  • Can I draw topic → partition → replica → leader/follower?
  • Do I know how the partition is chosen for a record (key hash / sticky)?
  • Can I explain consumer groups + the "1 partition per consumer per group" rule?
  • Do I know at-most/at-least/exactly-once and the config for each?
  • Can I explain ISR, acks=all, min.insync.replicas?
  • Do I know the 3 retention policies (time, size, compaction)?
  • Can I compare Kafka vs Kinesis vs RabbitMQ?
  • Can I diagnose "consumer lag growing" + 3 fixes?
  • Do I know Zookeeper vs KRaft status in 2026?
  • Can I explain why Kafka is fast (sequential I/O + zero-copy + batching)?

If all 10 = YES, you're Kafka interview-ready.

Remember TOPIC-PARTITION-OFFSET. Everything else is details.