System Design for Data Engineers — Patterns, Tradeoffs & Mock Interview
🧠 Memory Map: INGEST-STORE-SERVE
Every data system in the world is 3 layers. Remember ISS:
| Letter | Layer | Question to answer |
|---|---|---|
| I | Ingest | How does data arrive? (batch vs stream, push vs pull) |
| S | Store | Where does it live? (raw lake, warehouse, serving layer) |
| S | Serve | How do consumers read it? (dashboards, API, ML, notebooks) |
Draw these 3 boxes, fill them with concrete tools, and you've answered the question.
SECTION 1 — 8 DESIGN TRADEOFFS YOU MUST KNOW
Tradeoff #1 — Batch vs Streaming vs Micro-batch
| | Batch | Streaming | Micro-batch |
|---|---|---|---|
| Latency | Hours-days | Milliseconds-seconds | Seconds-minutes |
| Cost | Lowest (compute on demand) | Highest (always on) | Middle |
| Complexity | Low | High | Medium |
| Examples | Nightly warehouse loads | Fraud detection | Structured Streaming default |
| Failure recovery | Rerun job | Checkpointing, DLQ | Checkpointing |
Rule: if the business can wait 1 hour → batch. If it cannot wait 1 minute → streaming. In between → micro-batch.
Interview one-liner: "Choose streaming only when the business KPI degrades with delay. Otherwise batch is cheaper, simpler, easier to test."
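The rule of thumb above can be encoded as a tiny helper. This is an illustrative sketch only — the function name and thresholds (1 hour / 1 minute, straight from the rule) are not from any library.

```python
# Hedged sketch: map a business latency tolerance to a processing style,
# using the 1-hour / 1-minute thresholds from the rule of thumb above.

def choose_processing_mode(max_acceptable_delay_s: float) -> str:
    """Return 'batch', 'streaming', or 'micro-batch' for a delay budget."""
    if max_acceptable_delay_s >= 3600:   # business can wait an hour
        return "batch"
    if max_acceptable_delay_s < 60:      # business cannot wait a minute
        return "streaming"
    return "micro-batch"                 # everything in between

print(choose_processing_mode(6 * 3600))  # nightly report -> batch
print(choose_processing_mode(0.5))       # fraud check    -> streaming
print(choose_processing_mode(300))       # ops dashboard  -> micro-batch
```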
Tradeoff #2 — Data Lake vs Data Warehouse vs Lakehouse
| | Lake (S3/HDFS) | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---|---|---|---|
| Schema | On read | On write | On read + on write |
| Cost/TB | Cheapest | Expensive | Cheap (object store) |
| ACID | No | Yes | Yes |
| ML friendly | Yes (raw files) | Awkward | Yes |
| BI friendly | No (need engine) | Best | Good (via SQL engines) |
Lakehouse = lake storage + warehouse features (ACID, time travel, schema enforcement) via Delta/Iceberg/Hudi.
Tradeoff #3 — OLTP vs OLAP vs HTAP
| | OLTP | OLAP | HTAP |
|---|---|---|---|
| Storage | Row-oriented | Columnar | Hybrid |
| Workload | Many small txns | Few large scans | Both |
| Examples | Postgres, MySQL | Snowflake, BQ, Redshift | TiDB, SingleStore |
| Latency | ms | sec-min | ms-sec |
Interview answer: "OLTP = operational, row-based. OLAP = analytical, columnar. Never run analytics on OLTP → kills production."
Tradeoff #4 — Lambda vs Kappa Architecture
Lambda: 2 pipelines — batch (accurate, slow) + streaming (approximate, fast). Merge at query time.
```
Source ──┬──> Batch layer ──> Serving (daily)
         └──> Speed layer ──> Serving (real-time)
                                  ▲
                                  └─ merge at query time
```
Kappa: 1 pipeline — streaming only. Replay from Kafka for historical backfill.
Tradeoff: Lambda = 2 codebases, consistency issues. Kappa = single codebase but needs durable log (Kafka).
Modern default (2026): Kappa with Delta/Iceberg CDF for historical reprocessing.
Tradeoff #5 — Push vs Pull ingestion
| | Push (source → us) | Pull (us → source) |
|---|---|---|
| Control | Source decides rate | Consumer decides rate |
| Failure | We may drop data | We may miss new data |
| Examples | Webhooks, Kafka producers | Polling REST APIs, CDC reads |
| Backpressure | Hard | Natural |
Rule: high-volume real-time → push (Kafka). Scheduled / bounded → pull (Airflow jobs).
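The pull side can be sketched as a cursor-based polling loop. `fetch_page` is a hypothetical stand-in for a paginated REST API; real connectors persist the cursor (watermark) so a crashed run resumes instead of restarting.

```python
# Illustrative pull-based ingestion with a cursor (watermark).
# The consumer controls the rate — this is the "natural backpressure"
# from the table above. fetch_page is a fake API, not a real client.

def fetch_page(source, cursor, limit=2):
    """Fake paginated API: records with id > cursor, plus new cursor."""
    page = [r for r in source if r["id"] > cursor][:limit]
    new_cursor = page[-1]["id"] if page else cursor
    return page, new_cursor

def pull_all(source, cursor=0):
    """Keep pulling pages until the source has nothing new."""
    out = []
    while True:
        page, cursor = fetch_page(source, cursor)
        if not page:
            return out, cursor   # persist cursor for the next scheduled run
        out.extend(page)

rows = [{"id": i, "v": i * 10} for i in range(1, 6)]
records, cursor = pull_all(rows)
print(len(records), cursor)  # 5 5
```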
Tradeoff #6 — CAP Theorem for distributed stores
Of {Consistency, Availability, Partition tolerance}: partition tolerance is non-negotiable in a distributed system, so during a network partition you choose between consistency and availability.
| System | Picks | Example |
|---|---|---|
| CP | Consistency + Partition tolerance | HBase, MongoDB (strict), ZooKeeper |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB (default) |
| CA | Not possible in distributed | (single-node only) |
Interview rule: if they ask "eventual consistency" → you're in AP. If they ask "strong consistency" → you're in CP.
Tradeoff #7 — Synchronous vs Asynchronous + Queue
Sync: Service A calls Service B → waits for response. Fast but fragile.
Async via queue: Service A publishes event → Queue → Service B consumes.
When async wins:
- B is slow or flaky → A doesn't block
- Retry/DLQ needed
- Multiple consumers of same event (pub/sub)
When sync wins:
- Need immediate answer (booking confirmation)
- Simple, low latency
- Fewer moving parts
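The retry/DLQ pattern above can be sketched in-process with the stdlib. This is a toy model (real systems use Kafka/SQS with redelivery); the handler and retry limit are illustrative.

```python
import queue

# Sketch of async decoupling with retry + dead-letter queue (DLQ).
# Failed events are retried up to max_retries, then parked in the DLQ
# so one poison message never blocks the pipeline.

def consume(events, handler, max_retries=2):
    main_q, dlq, processed = queue.Queue(), [], []
    for e in events:
        main_q.put((e, 0))
    while not main_q.empty():
        event, attempts = main_q.get()
        try:
            processed.append(handler(event))
        except Exception:
            if attempts + 1 > max_retries:
                dlq.append(event)                  # park poison message
            else:
                main_q.put((event, attempts + 1))  # retry later
    return processed, dlq

def handler(e):
    if e == "bad":
        raise ValueError("unparseable")
    return e.upper()

ok, dead = consume(["a", "bad", "b"], handler)
print(ok, dead)  # ['A', 'B'] ['bad']
```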
Tradeoff #8 — Columnar vs Row storage
- Row wins: "give me order 42" (single-row reads, OLTP)
- Column wins: "sum amount by month" (scans one column, compressible)
Columnar also compresses way better because same-type values cluster.
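The compression claim is easy to demonstrate: lay the same records out row-wise and column-wise, compress both, and compare. A toy sketch with zlib; the sizes are illustrative, not benchmarks.

```python
import random
import zlib

# Same records, two layouts. Same-type values cluster in the columnar
# layout (all categories together), so zlib compresses it better.

random.seed(7)
rows = [(i, random.choice(["electronics", "books", "toys"]),
         round(random.uniform(1, 500), 2)) for i in range(10_000)]

row_wise = "\n".join(f"{i},{cat},{amt}" for i, cat, amt in rows).encode()
col_wise = ("\n".join(str(i) for i, _, _ in rows) + "|" +
            "\n".join(cat for _, cat, _ in rows) + "|" +
            "\n".join(str(a) for _, _, a in rows)).encode()

row_size = len(zlib.compress(row_wise))
col_size = len(zlib.compress(col_wise))
print(row_size, col_size)  # columnar comes out smaller
```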
SECTION 2 — 3 CLASSIC DATA PLATFORM DESIGNS
Design 1: Clickstream analytics platform (real-time dashboard)
Requirements: track every page view, show live MAU / top pages / funnel metrics.
Architecture:
```
Web app ──(HTTP)──▶ Gateway/Collector
                          │
                          ▼
        Kafka (topic=page_views, partitioned by user_id)
                          │
             ┌────────────┼────────────┐
             ▼            ▼            ▼
        Flink/Spark   S3 (raw logs)  ClickHouse
       (aggregations) (long-term)   (serving OLAP)
             │
             ▼
 Redis/Materialized ──▶ Dashboard API ──▶ Grafana/App
 view (1-min rollups)
```
Key decisions:
- Kafka as buffer → handles traffic spikes + replay for reprocessing
- Flink for stateful aggregations (tumbling windows)
- S3 = cold storage, ClickHouse = hot OLAP serving
- Redis holds latest minute rollups for <1s dashboards
Scale math: 10M events/day ≈ 115 events/sec. A single Kafka partition handles this easily; provision 3-6 partitions for headroom.
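The scale math generalizes to a back-of-envelope helper. The per-partition constant (10 MB/s, the conservative end of the Section 3 range) is an assumption; the formulas are just arithmetic.

```python
# Back-of-envelope capacity math: events/day -> events/sec, MB/sec,
# and a rough Kafka partition count (assuming 10 MB/s per partition,
# the low end of the ballpark range in Section 3).

def scale_math(events_per_day, bytes_per_event, mb_per_partition=10):
    eps = events_per_day / 86_400                  # seconds per day
    mbps = eps * bytes_per_event / 1_000_000       # MB per second
    partitions = max(1, round(mbps / mb_per_partition))
    return round(eps), round(mbps, 3), partitions

print(scale_math(10_000_000, 500))  # (116, 0.058, 1)
```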
Design 2: Batch data warehouse (ELT)
Requirements: load 50 source systems (CRM, ERP, app DB), serve finance + product BI.
Architecture:
- Orchestration: Airflow DAGs → dbt run nightly at 2am UTC
- Data quality: dbt tests (not_null, unique, relationships) + Great Expectations on critical tables
- Cost control: Snowflake auto_suspend=60s on BI warehouse, separate warehouse for ETL
Design 3: ML feature store + training + serving
Requirements: serve user features to recommendation model online + train weekly offline.
Architecture:
Feast or Tecton handles the dual-write offline+online, guarantees feature parity between training and serving (avoids training-serving skew).
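The core property behind training-serving parity is point-in-time correctness: training joins must use the feature value *as of* the event time, never a later one (that leaks the future). A pure-Python sketch of the lookup; Feast/Tecton implement this at scale.

```python
import bisect

# Point-in-time feature lookup: given a sorted feature history,
# return the value that was known at a given timestamp.

def as_of(feature_history, ts):
    """feature_history: sorted [(timestamp, value)]; value valid at ts."""
    times = [t for t, _ in feature_history]
    i = bisect.bisect_right(times, ts) - 1
    return feature_history[i][1] if i >= 0 else None

history = [(100, 0.2), (200, 0.5), (300, 0.9)]  # e.g. user CTR over time
print(as_of(history, 250))  # 0.5 — the value known at event time
print(as_of(history, 50))   # None — feature did not exist yet
```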
SECTION 3 — SCALE MATH (MEMORIZE THESE)
Ballpark numbers every data engineer should know:
| Resource | Number |
|---|---|
| L1 cache | 1 ns |
| RAM access | 100 ns |
| SSD read | 150 μs |
| HDD seek | 10 ms |
| Network same datacenter | 500 μs |
| Network cross-region | 100-200 ms |
Bandwidth:
- 1 Gbps network → 125 MB/s
- 10 Gbps → 1.25 GB/s
- SSD throughput → 500 MB/s - 3 GB/s (NVMe)
Data volumes:
- 1M rows of 1 KB each = 1 GB
- 1B rows of 100 B each = 100 GB
- Daily 10M events × 500 B = 5 GB/day = 1.8 TB/year
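The volume arithmetic above checks out directly (decimal units, 1 GB = 10⁹ B):

```python
# Verifying the data-volume arithmetic from the table above.

assert 1_000_000 * 1_000 == 10**9          # 1M rows x 1 KB = 1 GB
assert 1_000_000_000 * 100 == 100 * 10**9  # 1B rows x 100 B = 100 GB

daily_gb = 10_000_000 * 500 / 10**9        # 10M events x 500 B
yearly_tb = daily_gb * 365 / 1_000
print(daily_gb, round(yearly_tb, 1))       # 5.0 GB/day, ~1.8 TB/year
```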
Throughput estimates:
- Kafka single partition: 10-30 MB/s
- Postgres writes: ~10K/s on good hardware
- Redis: 100K ops/sec single node
- Cassandra: 10K writes/sec/node
Use these to sanity check interview claims. If someone says "1 Postgres handles 1M writes/sec" — wrong order of magnitude.
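That sanity check can be made mechanical: measure how many orders of magnitude a claim deviates from the ballpark. The reference values are the table's own; the helper is an illustrative sketch.

```python
import math

# Flag throughput claims that are off by an order of magnitude
# from the ballpark numbers in the table above.

BALLPARK = {
    "postgres_writes_per_s": 10_000,
    "redis_ops_per_s": 100_000,
    "cassandra_writes_per_s_node": 10_000,
}

def orders_off(claim, resource):
    """Orders of magnitude between a claim and the ballpark value."""
    return abs(math.log10(claim / BALLPARK[resource]))

print(round(orders_off(1_000_000, "postgres_writes_per_s")))  # 2 — reject
print(round(orders_off(8_000, "postgres_writes_per_s")))      # 0 — plausible
```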
SECTION 4 — COMMON GOTCHAS (WHAT FAILS IN PROD)
Gotcha 1: "We'll just add caching"
Caching solves reads, not writes. Adds invalidation complexity. Stale data bugs. Cache stampede on expiry.
Gotcha 2: "We'll shard later"
Resharding a production DB is painful (downtime, dual-write, rehash). Plan partitioning key from day 1.
Gotcha 3: "Kafka solves everything"
Kafka is a log, not a DB. Cannot query by key. Cannot update. Use Kafka for event transport + replay, NOT as source of truth for lookups.
Gotcha 4: "Exactly-once"
True exactly-once requires idempotent consumers + transactional producers + coordinated commits. Most pipelines get "effectively once" via idempotent writes (upserts).
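"Effectively once" via idempotent writes can be shown in a few lines: upsert by event id, and redelivered duplicates (the norm under at-least-once transport) become harmless. A dict stands in for a keyed store like a Delta MERGE target.

```python
# Idempotent consumer sketch: upserting by event id makes replay safe.
# Processing the same batch twice yields the same state as once.

def apply_events(store, events):
    for e in events:
        store[e["id"]] = e["amount"]   # upsert, not append
    return store

events = [{"id": "tx1", "amount": 10}, {"id": "tx2", "amount": 25}]
once = apply_events({}, events)
twice = apply_events(apply_events({}, events), events)  # duplicate delivery
print(once == twice, sum(twice.values()))  # True 35
```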
Gotcha 5: "Join everything in Spark"
Shuffle-heavy. Prefer: broadcast small tables, pre-join at ingestion, denormalize in gold layer.
Gotcha 6: "Store everything in JSON"
Schema-on-read hides bugs. Migrations become archaeology. Use Parquet/Avro with explicit schemas.
SECTION 5 — TIMED MOCK INTERVIEW (60 MIN)
Q1 (15 min) — "Design a data platform for a fintech processing 100K transactions/sec"
Answer structure:
- Clarify: read/write ratio? regulatory retention? fraud detection latency SLA?
- Ingest: transactions → Kafka (partition by account_id for per-account ordering, 50+ partitions)
- Processing:
- Stream: Flink for real-time fraud scoring (<100 ms latency)
- Batch: daily reconciliation in Spark on Delta
- Store:
- Hot: Cassandra for account balance (AP for availability)
- Warm: Delta Lake on S3 (7-year retention for compliance)
- Analytics: BigQuery/Snowflake for finance team
- Serve: gRPC API reading Cassandra + feature store (Redis) for fraud model
- Scale math: 100K tx/sec × 500 B = 50 MB/s. Kafka cluster: 6 brokers, 50 partitions.
- Failure modes: Kafka replication=3 + min.insync=2, multi-AZ, DLQ for failed fraud scoring
Q2 (10 min) — "A dashboard is slow. It queries a 10 TB Delta table. Fix it."
Checklist:
- Partition pruning — is query filtering on partition column?
- Z-ORDER / Liquid Clustering on frequent filter columns
- OPTIMIZE for small-file compaction
- Materialized aggregates (gold table, pre-aggregated daily rollups)
- Serve dashboard from materialized view, not raw table
- Warehouse sizing + result cache
- Consider moving to ClickHouse / Druid if latency < 1s needed
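The "materialized aggregates" item on the checklist is the usual biggest win: pre-aggregate once, then let the dashboard hit a tiny rollup table instead of scanning 10 TB. A toy sketch of the idea:

```python
from collections import defaultdict

# Build a daily rollup in one pass over raw events; the dashboard
# then queries the small gold table, never the raw data.

def build_daily_rollup(events):
    rollup = defaultdict(float)
    for day, amount in events:
        rollup[day] += amount
    return dict(rollup)

raw = [("2026-01-01", 10.0), ("2026-01-01", 5.0), ("2026-01-02", 7.0)]
gold = build_daily_rollup(raw)
print(gold["2026-01-01"])  # 15.0 — dashboard reads this, not raw events
```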
Q3 (8 min) — "How do you handle schema evolution in a lakehouse?"
- Use Parquet/Delta — schema metadata built-in
- Additive changes safe (add nullable column)
- Breaking changes (rename, drop, type change) → version the table OR use column mapping (Delta)
- Contract tests in CI — validate schema before deploy
- Consumer code: `SELECT col1, col2`, not `SELECT *`, to survive new columns
- Data contracts (Protobuf/Avro registry) for producer-consumer agreements
Q4 (8 min) — "Design ingestion from 200 legacy Oracle databases into lakehouse"
- CDC: log-based CDC (Debezium/Goldengate/DMS) → Kafka topics per table
- Staging: raw CDC events land as Delta append-only table (bronze)
- Merge: hourly `MERGE INTO silver_table USING bronze_stream` applies upserts/deletes
- Schema registry: Avro in Confluent registry, enforce contracts
- Bootstrap: initial snapshot via Spark JDBC read, then switch to CDC
- Monitoring: lag per table, DLQ for unparseable events, watermark delay alerts
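The bronze → silver merge step can be modeled in a few lines: apply CDC events (insert/update/delete, keyed by primary key) to the current state. A dict stands in for the silver table; Delta's `MERGE INTO` does this at scale. The `op` codes (`c`/`u`/`d`) follow Debezium's convention.

```python
# Sketch of applying a CDC batch to a keyed silver table.
# "c" = create, "u" = update, "d" = delete (Debezium-style op codes).

def merge_cdc(silver, cdc_events):
    for e in cdc_events:
        if e["op"] == "d":
            silver.pop(e["pk"], None)    # tombstone: delete the row
        else:
            silver[e["pk"]] = e["row"]   # upsert the latest row image
    return silver

batch = [{"op": "c", "pk": 1, "row": {"name": "Ada"}},
         {"op": "u", "pk": 1, "row": {"name": "Ada L."}},
         {"op": "c", "pk": 2, "row": {"name": "Alan"}},
         {"op": "d", "pk": 2}]
print(merge_cdc({}, batch))  # {1: {'name': 'Ada L.'}}
```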
Q5 (10 min) — "Dashboard shows wrong number. How do you debug data quality?"
- Reproduce: what query, what time, what cell shows wrong value?
- Walk backwards: dashboard → gold table → silver → bronze → source
- Check at each layer: row counts, aggregates, joins
- Find the break: "silver has 1M rows, gold has 900K" → loss in aggregation
- Common culprits:
- INNER vs LEFT JOIN dropping rows
- Timezone mismatch in date filters
- Dedup logic keeping wrong row
- Late-arriving data + non-idempotent pipeline
- Preventive: dbt tests for uniqueness / not-null / row counts / referential integrity
- Communicate: file bug, timeline of when-broken, impact scope
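The walk-backwards step lends itself to a mechanical check: compare row counts hop by hop and report the first layer where rows disappear beyond an expected-loss tolerance (dedup and filtering legitimately drop some). The tolerance value is an illustrative assumption.

```python
# Find the first suspicious hop in a source -> dashboard lineage
# by comparing row counts between adjacent layers.

def find_break(counts, tolerance=0.05):
    """counts: ordered [(layer, rows)]; return first hop losing > tolerance."""
    for (up, n_up), (down, n_down) in zip(counts, counts[1:]):
        if n_up and (n_up - n_down) / n_up > tolerance:
            return f"{up} -> {down}"
    return None

layers = [("source", 1_000_000), ("bronze", 1_000_000),
          ("silver", 1_000_000), ("gold", 900_000)]
print(find_break(layers))  # silver -> gold
```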
Q6 (9 min) — "Build reporting pipeline for team of 3 data engineers, 5 analysts, 100 stakeholders"
- Storage: Snowflake (SMB starter tier) OR BigQuery (pay-per-query)
- Transform: dbt for SQL models, git-versioned, code-reviewed
- Orchestrate: Airflow (or Dagster) for non-dbt tasks; dbt Cloud scheduler for dbt
- Ingest: Fivetran for SaaS sources (Salesforce, HubSpot, Stripe); custom Python for long-tail
- BI: Looker/Metabase for analysts + self-serve for stakeholders
- Quality: dbt tests on every PR + monitoring dashboard in Metabase
- Governance: RBAC via Snowflake roles, PII tagged in dbt metadata
SECTION 6 — FRAMEWORKS TO ANSWER ANY DESIGN QUESTION
When you hear a design prompt, follow this 7-step framework:
- Clarify — scale (QPS, volume), latency SLA, consistency needs, retention
- Capacity estimates — rows/sec, GB/day, storage over 3 years
- API design — what does the consumer call? REST? gRPC? SQL?
- High-level architecture — draw 5-8 boxes: ingest, store, process, serve
- Deep-dive a tricky component — partitioning, replication, failure mode
- Tradeoffs — call out 2-3 decisions and alternatives
- Monitoring + failure — what breaks first? how do you detect?
Timing: 45-min interview:
- 5 min clarify
- 10 min HLD
- 15 min deep-dive
- 10 min tradeoffs + scaling
- 5 min ops/monitoring
SECTION 7 — FINAL READINESS CHECKLIST
- Can I draw INGEST-STORE-SERVE and place real tools in each box?
- Do I know 5 tradeoffs with numbers (batch vs stream, row vs col, sync vs async)?
- Can I estimate throughput from event size × rate?
- Do I know when to pick CP vs AP?
- Can I design clickstream / warehouse / feature store from memory?
- Do I know Lambda vs Kappa and 2026 default?
- Can I debug a slow 10 TB query (5+ actions)?
- Do I know schema evolution strategies for lakehouse?
- Can I run the 7-step framework on any new prompt?
- Do I have ballpark latency + bandwidth numbers memorized?
If all 10 = YES, you're data-system-design interview-ready.
Remember INGEST-STORE-SERVE. Everything else is details.