🏗️ System Design for Data Engineers — Patterns, Tradeoffs & Mock Interview
💡 Interview Tip
The video-free pack. Read this end-to-end and you can walk into any data-platform system-design interview without opening YouTube.

🧠 Memory Map: INGEST-STORE-SERVE

Every data system reduces to 3 layers. Remember ISS:

| Letter | Layer | Question to answer |
|---|---|---|
| I | Ingest | How does data arrive? (batch vs stream, push vs pull) |
| S | Store | Where does it live? (raw lake, warehouse, serving layer) |
| S | Serve | How do consumers read it? (dashboards, API, ML, notebooks) |

Draw these 3 boxes, fill them with concrete tools, and you've answered the question.

SECTION 1 — 8 DESIGN TRADEOFFS YOU MUST KNOW

Tradeoff #1 — Batch vs Streaming vs Micro-batch

| | Batch | Streaming | Micro-batch |
|---|---|---|---|
| Latency | Hours-days | Milliseconds-seconds | Seconds-minutes |
| Cost | Lowest (compute on demand) | Highest (always on) | Middle |
| Complexity | Low | High | Medium |
| Examples | Nightly warehouse loads | Fraud detection | Structured Streaming default |
| Failure recovery | Rerun job | Checkpointing, DLQ | Checkpointing |

Rule: if the business can wait 1 hour → batch. If it cannot wait 1 minute → streaming. In between → micro-batch.

Interview one-liner: "Choose streaming only when the business KPI degrades with delay. Otherwise batch is cheaper, simpler, easier to test."
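The rule of thumb above can be sketched as a tiny decision function (names and thresholds are illustrative, not from any library):

```python
def choose_processing_mode(max_acceptable_delay_s: float) -> str:
    """Map a business latency SLA to a processing mode.

    Thresholds follow the rule of thumb: batch if the business can
    wait an hour, streaming if it cannot wait a minute.
    """
    if max_acceptable_delay_s >= 3600:   # can wait 1 hour or more
        return "batch"
    if max_acceptable_delay_s < 60:      # cannot wait even 1 minute
        return "streaming"
    return "micro-batch"                 # in between

print(choose_processing_mode(24 * 3600))  # nightly report -> batch
print(choose_processing_mode(0.5))        # fraud scoring -> streaming
print(choose_processing_mode(300))        # 5-min freshness -> micro-batch
```

The point is that the input is a business SLA, not a technology preference.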

Tradeoff #2 — Data Lake vs Data Warehouse vs Lakehouse

| | Lake (S3/HDFS) | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---|---|---|---|
| Schema | On read | On write | On read + on write |
| Cost/TB | Cheapest | Expensive | Cheap (object store) |
| ACID | No | Yes | Yes |
| ML friendly | Yes (raw files) | Awkward | Yes |
| BI friendly | No (need engine) | Best | Good (via SQL engines) |

Lakehouse = lake storage + warehouse features (ACID, time travel, schema enforcement) via Delta/Iceberg/Hudi.

Tradeoff #3 — OLTP vs OLAP vs HTAP

| | OLTP | OLAP | HTAP |
|---|---|---|---|
| Storage | Row-oriented | Columnar | Hybrid |
| Workload | Many small txns | Few large scans | Both |
| Examples | Postgres, MySQL | Snowflake, BQ, Redshift | TiDB, SingleStore |
| Latency | ms | sec-min | ms-sec |

Interview answer: "OLTP = operational, row-based. OLAP = analytical, columnar. Never run analytics on OLTP → kills production."

Tradeoff #4 — Lambda vs Kappa Architecture

Lambda: 2 pipelines — batch (accurate, slow) + streaming (approximate, fast). Merge at query time.

📐 Architecture Diagram
Source ──┬──> Batch layer  ──> Serving (daily)
         └──> Speed layer  ──> Serving (real-time)
                                      ▲
                                      └─ merge at query time

Kappa: 1 pipeline — streaming only. Replay from Kafka for historical backfill.

Source ──> Kafka ──> Stream processor ──> Serving layer
             └─ replay from Kafka offset 0 for backfill

Tradeoff: Lambda = 2 codebases, consistency issues. Kappa = single codebase but needs durable log (Kafka).

Modern default (2026): Kappa with Delta/Iceberg CDF for historical reprocessing.

Tradeoff #5 — Push vs Pull ingestion

| | Push (source → us) | Pull (us → source) |
|---|---|---|
| Control | Source decides rate | Consumer decides rate |
| Failure | We may drop data | We may miss new data |
| Examples | Webhooks, Kafka producers | Polling REST APIs, CDC reads |
| Backpressure | Hard | Natural |

Rule: high-volume real-time → push (Kafka). Scheduled / bounded → pull (Airflow jobs).

Tradeoff #6 — CAP Theorem for distributed stores

Of {Consistency, Availability, Partition tolerance}, you can only guarantee 2 during a partition.

| System | Picks | Example |
|---|---|---|
| CP | Consistency + Partition tolerance | HBase, MongoDB (strict), ZooKeeper |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB (default) |
| CA | Not possible in distributed systems | (single-node only) |

Interview rule: if they ask "eventual consistency" → you're in AP. If they ask "strong consistency" → you're in CP.

Tradeoff #7 — Synchronous vs Asynchronous + Queue

Sync: Service A calls Service B and waits for the response. Fast but fragile.
Async via queue: Service A publishes an event → Queue → Service B consumes.

When async wins:

  • B is slow or flaky → A doesn't block
  • Retry/DLQ needed
  • Multiple consumers of same event (pub/sub)

When sync wins:

  • Need immediate answer (booking confirmation)
  • Simple, low latency
  • Fewer moving parts
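The async-with-retries pattern can be sketched with an in-memory queue standing in for Kafka/SQS (all names here are illustrative, not a specific library's API):

```python
from collections import deque

def consume(queue: deque, handler, max_retries: int = 3) -> list:
    """Drain a queue; retry each message up to max_retries, then dead-letter it."""
    dlq = []
    while queue:
        msg = queue.popleft()
        attempts = msg.setdefault("attempts", 0)
        try:
            handler(msg)
        except Exception:
            msg["attempts"] = attempts + 1
            if msg["attempts"] >= max_retries:
                dlq.append(msg)      # give up: park for manual inspection
            else:
                queue.append(msg)    # requeue for a later retry
    return dlq

# A handler that fails permanently on one poison message.
def handler(msg):
    if msg["id"] == "bad":
        raise ValueError("unparseable payload")

q = deque([{"id": "a"}, {"id": "bad"}, {"id": "b"}])
dead = consume(q, handler)
print([m["id"] for m in dead])  # only the poison message lands in the DLQ
```

Note that the consumer never blocks the producer: failures are absorbed by the retry loop and the DLQ, which is exactly why async wins when B is flaky.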

Tradeoff #8 — Columnar vs Row storage

ROW-ORIENTED:
[id=1, name=alice, amt=100, date=2026-04-01]
[id=2, name=bob, amt=200, date=2026-04-02]
[id=3, name=carol, amt=300, date=2026-04-03]
COLUMN-ORIENTED:
id: [1, 2, 3]
name: [alice, bob, carol]
amt: [100, 200, 300]
date: [2026-04-01, 2026-04-02, 2026-04-03]

Row wins: "give me order 42" (single-row reads, OLTP).
Column wins: "sum amount by month" (scans one column, compressible).

Columnar also compresses way better because same-type values cluster.
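The access-pattern difference shows up even in plain Python, modeling the two layouts as a list of records vs a dict of arrays (a toy illustration, not a storage engine):

```python
# Row-oriented: one record per row — natural for point lookups.
rows = [
    {"id": 1, "name": "alice", "amt": 100},
    {"id": 2, "name": "bob",   "amt": 200},
    {"id": 3, "name": "carol", "amt": 300},
]

# Column-oriented: one array per column — natural for scans/aggregates.
cols = {
    "id":   [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amt":  [100, 200, 300],
}

# Point lookup ("give me order 2"): read one record from the row layout.
order = next(r for r in rows if r["id"] == 2)

# Aggregate ("sum amount"): touch only one column in the columnar layout,
# instead of deserializing every full record.
total = sum(cols["amt"])
print(order["name"], total)
```

The columnar aggregate never looks at `name` or `id`, and because `amt` is a homogeneous array it compresses well, which is the point made above.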

SECTION 2 — 3 CLASSIC DATA PLATFORM DESIGNS

Design 1: Clickstream analytics platform (real-time dashboard)

Requirements: track every page view, show live MAU / top pages / funnel metrics.

Architecture:

📐 Architecture Diagram
Web app ──(HTTP)──▶ Gateway/Collector
                         │
                         ▼
                       Kafka (topic=page_views, partitioned by user_id)
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
      Flink/Spark   S3 (raw logs)  ClickHouse
      (aggregations)  (long-term)   (serving OLAP)
            │
            ▼
    Redis/Materialized  ──▶ Dashboard API ──▶ Grafana/App
    view (1-min rollups)

Key decisions:

  • Kafka as buffer → handles traffic spikes + replay for reprocessing
  • Flink for stateful aggregations (tumbling windows)
  • S3 = cold storage, ClickHouse = hot OLAP serving
  • Redis holds latest minute rollups for <1s dashboards

Scale math: 10 M events/day = ~115 events/sec. Kafka single partition handles this easily; 3-6 partitions for room.
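The scale math is worth doing explicitly; a quick back-of-envelope (the 5x peak factor is an assumption, not from the original requirements):

```python
events_per_day = 10_000_000
seconds_per_day = 24 * 60 * 60           # 86_400

avg_events_per_sec = events_per_day / seconds_per_day
print(int(avg_events_per_sec))           # ~115 events/sec on average

# Real traffic is bursty — budget for, say, a 5x peak over the average.
peak_events_per_sec = avg_events_per_sec * 5
print(int(peak_events_per_sec))          # still trivial for one Kafka partition
```

Even the assumed peak is far below what a single partition sustains, which is why 3-6 partitions is about headroom and consumer parallelism, not throughput.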

Design 2: Batch data warehouse (ELT)

Requirements: load 50 source systems (CRM, ERP, app DB), serve finance + product BI.

Architecture:

Sources (50)
     │  Fivetran / Airbyte / CDC
     ▼
Raw zone (S3 / Snowflake bronze schema)
     │  dbt models (SQL transformations)
     ▼
Silver (cleaned, conformed)
     │  dbt
     ▼
Gold (business marts)
     │
     ▼
BI (Tableau/Looker) + Reverse ETL (back to Salesforce)

  • Orchestration: Airflow DAGs → dbt run nightly at 2am UTC
  • Data quality: dbt tests (not_null, unique, relationships) + Great Expectations on critical tables
  • Cost control: Snowflake auto_suspend=60s on BI warehouse, separate warehouse for ETL

Design 3: ML feature store + training + serving

Requirements: serve user features to recommendation model online + train weekly offline.

Architecture:

Event sources ──> Kafka ──> Flink (feature computation)
                              │
                              ├──> Offline store (Delta table, historical)
                              │        │
                              │        ▼
                              │    Training pipeline (Spark ML / SageMaker)
                              │
                              └──> Online store (Redis/DynamoDB)
                                       │
                                       ▼
                                  Serving API (low-latency reads)
                                       │
                                       ▼
                                  Recommendation model (inference)

Feast or Tecton handles the dual write to the offline and online stores and guarantees feature parity between training and serving (avoiding training-serving skew).

SECTION 3 — SCALE MATH (MEMORIZE THESE)

Ballpark numbers every data engineer should know:

| Resource | Latency |
|---|---|
| L1 cache | 1 ns |
| RAM access | 100 ns |
| SSD read | 150 μs |
| HDD seek | 10 ms |
| Network, same datacenter | 500 μs |
| Network, cross-region | 100-200 ms |

Bandwidth:

  • 1 Gbps network → 125 MB/s
  • 10 Gbps → 1.25 GB/s
  • SSD throughput → 500 MB/s - 3 GB/s (NVMe)

Data volumes:

  • 1M rows of 1 KB each = 1 GB
  • 1B rows of 100 B each = 100 GB
  • Daily 10M events × 500 B = 5 GB/day = 1.8 TB/year
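The data-volume figures above are straightforward to verify (decimal units, 1 GB = 10^9 B):

```python
KB = 1_000
GB = 1_000_000_000

# 1M rows × 1 KB each
print(1_000_000 * 1 * KB / GB)       # 1.0 GB

# 1B rows × 100 B each
print(1_000_000_000 * 100 / GB)      # 100.0 GB

# 10M events/day × 500 B each
daily_gb = 10_000_000 * 500 / GB     # 5.0 GB/day
yearly_tb = daily_gb * 365 / 1_000   # ≈ 1.8 TB/year
print(daily_gb)
```

Being able to do this conversion (rows × bytes → GB/day → TB/year) on the spot is the whole skill being tested.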

Throughput estimates:

  • Kafka single partition: 10-30 MB/s
  • Postgres writes: ~10K/s on good hardware
  • Redis: 100K ops/sec single node
  • Cassandra: 10K writes/sec/node

Use these to sanity-check interview claims. If someone says "a single Postgres instance handles 1M writes/sec," that's off by roughly two orders of magnitude.

SECTION 4 — COMMON GOTCHAS (WHAT FAILS IN PROD)

Gotcha 1: "We'll just add caching"

Caching solves reads, not writes. Adds invalidation complexity. Stale data bugs. Cache stampede on expiry.

Gotcha 2: "We'll shard later"

Resharding a production DB is painful (downtime, dual-write, rehash). Plan partitioning key from day 1.

Gotcha 3: "Kafka solves everything"

Kafka is a log, not a DB. Cannot query by key. Cannot update. Use Kafka for event transport + replay, NOT as source of truth for lookups.

Gotcha 4: "Exactly-once"

True exactly-once requires idempotent consumers + transactional producers + coordinated commits. Most pipelines get "effectively once" via idempotent writes (upserts).
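"Effectively once" via idempotent writes can be sketched with an upsert keyed by a unique event id — replaying the same batch after a crash does not change the result (illustrative, not a specific framework's API):

```python
def apply_events(store: dict, events: list) -> dict:
    """Idempotent upsert: each event carries a unique id, so replays
    overwrite with identical values instead of double-counting."""
    for e in events:
        store[e["event_id"]] = e["amount"]   # upsert by key
    return store

events = [
    {"event_id": "e1", "amount": 100},
    {"event_id": "e2", "amount": 200},
]

store = {}
apply_events(store, events)
apply_events(store, events)       # replay after a crash: absorbed
total = sum(store.values())
print(total)  # 300, not 600
```

Contrast with an append-only counter, where the same replay would double the total — that is the at-least-once failure mode the upsert avoids.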

Gotcha 5: "Join everything in Spark"

Shuffle-heavy. Prefer: broadcast small tables, pre-join at ingestion, denormalize in gold layer.

Gotcha 6: "Store everything in JSON"

Schema-on-read hides bugs. Migrations become archaeology. Use Parquet/Avro with explicit schemas.
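The schema-on-write discipline can be approximated even without Parquet tooling: reject records that fail an explicit schema at ingestion time. A toy validator (real pipelines would rely on Parquet/Avro schemas or a schema registry):

```python
SCHEMA = {"id": int, "name": str, "amt": int}

def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of schema violations (empty list = valid record)."""
    errors = []
    for field, typ in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors

print(validate({"id": 1, "name": "alice", "amt": 100}))  # [] — valid
print(validate({"id": "1", "name": "alice"}))            # type + missing-field errors
```

Failing loudly at the bronze layer is what turns "migrations become archaeology" into an immediate, attributable error.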

SECTION 5 — TIMED MOCK INTERVIEW (60 MIN)

Q1 (15 min) — "Design a data platform for a fintech processing 100K transactions/sec"

Answer structure:

  1. Clarify: read/write ratio? regulatory retention? fraud detection latency SLA?
  2. Ingest: transactions → Kafka (partition by account_id for per-account ordering, 50+ partitions)
  3. Processing:
    • Stream: Flink for real-time fraud scoring (<100 ms latency)
    • Batch: daily reconciliation in Spark on Delta
  4. Store:
    • Hot: Cassandra for account balance (AP for availability)
    • Warm: Delta Lake on S3 (7-year retention for compliance)
    • Analytics: BigQuery/Snowflake for finance team
  5. Serve: gRPC API reading Cassandra + feature store (Redis) for fraud model
  6. Scale math: 100K tx/sec × 500 B = 50 MB/s. Kafka cluster: 6 brokers, 50 partitions.
  7. Failure modes: Kafka replication=3 + min.insync=2, multi-AZ, DLQ for failed fraud scoring

Q2 (10 min) — "A dashboard is slow. It queries a 10 TB Delta table. Fix it."

Checklist:

  1. Partition pruning — is query filtering on partition column?
  2. Z-ORDER / Liquid Clustering on frequent filter columns
  3. OPTIMIZE for small-file compaction
  4. Materialized aggregates (gold table, pre-aggregated daily rollups)
  5. Serve dashboard from materialized view, not raw table
  6. Warehouse sizing + result cache
  7. Consider moving to ClickHouse / Druid if latency < 1s needed

Q3 (8 min) — "How do you handle schema evolution in a lakehouse?"

  1. Use Parquet/Delta — schema metadata built-in
  2. Additive changes safe (add nullable column)
  3. Breaking changes (rename, drop, type change) → version the table OR use column mapping (Delta)
  4. Contract tests in CI — validate schema before deploy
  5. Consumer code: SELECT col1, col2 not SELECT * to survive new columns
  6. Data contracts (Protobuf/Avro registry) for producer-consumer agreements

Q4 (8 min) — "Design ingestion from 200 legacy Oracle databases into lakehouse"

  1. CDC: log-based CDC (Debezium/Goldengate/DMS) → Kafka topics per table
  2. Staging: raw CDC events land as Delta append-only table (bronze)
  3. Merge: hourly MERGE INTO silver_table USING bronze_stream — applies upserts/deletes
  4. Schema registry: Avro in Confluent registry, enforce contracts
  5. Bootstrap: initial snapshot via Spark JDBC read, then switch to CDC
  6. Monitoring: lag per table, DLQ for unparseable events, watermark delay alerts
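The merge in step 3 boils down to applying CDC operations in order, keyed by primary key — a minimal in-memory version of what Delta's MERGE INTO does at scale (event shape here is illustrative, not Debezium's actual envelope):

```python
def apply_cdc(table: dict, events: list) -> dict:
    """Apply insert/update/delete CDC events keyed by primary key."""
    for e in events:
        if e["op"] in ("insert", "update"):
            table[e["pk"]] = e["row"]     # upsert
        elif e["op"] == "delete":
            table.pop(e["pk"], None)      # tolerate replayed deletes

    return table

events = [
    {"op": "insert", "pk": 1, "row": {"name": "alice"}},
    {"op": "insert", "pk": 2, "row": {"name": "bob"}},
    {"op": "update", "pk": 1, "row": {"name": "alicia"}},
    {"op": "delete", "pk": 2, "row": None},
]

silver = apply_cdc({}, events)
print(silver)  # {1: {'name': 'alicia'}}
```

Order matters: applying events out of sequence (e.g. delete before insert) yields a different table, which is why CDC pipelines preserve per-key ordering via the Kafka partition key.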

Q5 (10 min) — "Dashboard shows wrong number. How do you debug data quality?"

  1. Reproduce: what query, what time, what cell shows wrong value?
  2. Walk backwards: dashboard → gold table → silver → bronze → source
  3. Check at each layer: row counts, aggregates, joins
  4. Find the break: "silver has 1M rows, gold has 900K" → loss in aggregation
  5. Common culprits:
    • INNER vs LEFT JOIN dropping rows
    • Timezone mismatch in date filters
    • Dedup logic keeping wrong row
    • Late-arriving data + non-idempotent pipeline
  6. Preventive: dbt tests for uniqueness / not-null / row counts / referential integrity
  7. Communicate: file bug, timeline of when-broken, impact scope
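Step 4's "find the break" is mechanical once you have per-layer row counts; a sketch of the comparison (layer names and counts are made up for illustration):

```python
def find_break(counts: dict, tolerance: float = 0.0):
    """Walk layers in pipeline order; return the first transition whose
    row count drops by more than `tolerance` (a fraction) vs the previous
    layer, or None if counts are consistent."""
    layers = list(counts)
    for prev, cur in zip(layers, layers[1:]):
        if counts[cur] < counts[prev] * (1 - tolerance):
            return f"drop between {prev} and {cur}"
    return None

counts = {"bronze": 1_000_000, "silver": 1_000_000, "gold": 900_000}
print(find_break(counts))  # drop between silver and gold
```

In practice you would also compare key aggregates (sums, distinct keys) per layer, since joins and dedup bugs can corrupt values without changing row counts.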

Q6 (9 min) — "Build reporting pipeline for team of 3 data engineers, 5 analysts, 100 stakeholders"

  • Storage: Snowflake (SMB starter tier) OR BigQuery (pay-per-query)
  • Transform: dbt for SQL models, git-versioned, code-reviewed
  • Orchestrate: Airflow (or Dagster) for non-dbt tasks; dbt Cloud scheduler for dbt
  • Ingest: Fivetran for SaaS sources (Salesforce, HubSpot, Stripe); custom Python for long-tail
  • BI: Looker/Metabase for analysts + self-serve for stakeholders
  • Quality: dbt tests on every PR + monitoring dashboard in Metabase
  • Governance: RBAC via Snowflake roles, PII tagged in dbt metadata

SECTION 6 — FRAMEWORKS TO ANSWER ANY DESIGN QUESTION

When you hear a design prompt, follow this 7-step framework:

  1. Clarify — scale (QPS, volume), latency SLA, consistency needs, retention
  2. Capacity estimates — rows/sec, GB/day, storage over 3 years
  3. API design — what does the consumer call? REST? gRPC? SQL?
  4. High-level architecture — draw 5-8 boxes: ingest, store, process, serve
  5. Deep-dive a tricky component — partitioning, replication, failure mode
  6. Tradeoffs — call out 2-3 decisions and alternatives
  7. Monitoring + failure — what breaks first? how do you detect?

Timing for a 45-min interview:

  • 5 min clarify
  • 10 min HLD
  • 15 min deep-dive
  • 10 min tradeoffs + scaling
  • 5 min ops/monitoring

SECTION 7 — FINAL READINESS CHECKLIST

  • Can I draw INGEST-STORE-SERVE and place real tools in each box?
  • Do I know 5 tradeoffs with numbers (batch vs stream, row vs col, sync vs async)?
  • Can I estimate throughput from event size × rate?
  • Do I know when to pick CP vs AP?
  • Can I design clickstream / warehouse / feature store from memory?
  • Do I know Lambda vs Kappa and 2026 default?
  • Can I debug a slow 10 TB query (5+ actions)?
  • Do I know schema evolution strategies for lakehouse?
  • Can I run the 7-step framework on any new prompt?
  • Do I have ballpark latency + bandwidth numbers memorized?

If all 10 = YES, you're data-system-design interview-ready.

Remember INGEST-STORE-SERVE. Everything else is details.