🏗️ System Design for Data Engineers — Patterns, Tradeoffs & Mock Interview
💡 Interview Tip
The video-free pack. Read this end-to-end and you can walk into any data-platform system-design interview without opening YouTube.

🧠 Memory Map: INGEST-STORE-SERVE

Every data system reduces to 3 layers. Remember ISS:

| Letter | Layer | Question to answer |
|---|---|---|
| I | Ingest | How does data arrive? (batch vs stream, push vs pull) |
| S | Store | Where does it live? (raw lake, warehouse, serving layer) |
| S | Serve | How do consumers read it? (dashboards, API, ML, notebooks) |

Draw these 3 boxes, fill them with concrete tools, and you've answered the question.

SECTION 1 — 8 DESIGN TRADEOFFS YOU MUST KNOW

Tradeoff #1 — Batch vs Streaming vs Micro-batch

| | Batch | Streaming | Micro-batch |
|---|---|---|---|
| Latency | Hours-days | Milliseconds-seconds | Seconds-minutes |
| Cost | Lowest (compute on demand) | Highest (always on) | Middle |
| Complexity | Low | High | Medium |
| Examples | Nightly warehouse loads | Fraud detection | Structured Streaming default |
| Failure recovery | Rerun job | Checkpointing, DLQ | Checkpointing |

Rule: if the business can wait 1 hour → batch. If it cannot wait 1 minute → streaming. In between → micro-batch.

Interview one-liner: "Choose streaming only when the business KPI degrades with delay. Otherwise batch is cheaper, simpler, easier to test."
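The rule of thumb above can be sketched as a tiny decision function (names and thresholds are illustrative, not from any library):

```python
def choose_processing_mode(max_acceptable_delay_s: float) -> str:
    """Map a business latency SLA to a processing mode.

    Thresholds follow the rule of thumb: batch if the business can
    wait an hour, streaming if it cannot wait a minute.
    """
    if max_acceptable_delay_s >= 3600:   # can wait 1 hour or more
        return "batch"
    if max_acceptable_delay_s < 60:      # cannot wait even 1 minute
        return "streaming"
    return "micro-batch"                 # in between

print(choose_processing_mode(24 * 3600))  # nightly report -> batch
print(choose_processing_mode(0.5))        # fraud scoring -> streaming
print(choose_processing_mode(300))        # 5-min freshness -> micro-batch
```

The point is that the input is a business SLA, not a technology preference.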

Tradeoff #2 — Data Lake vs Data Warehouse vs Lakehouse

| | Lake (S3/HDFS) | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---|---|---|---|
| Schema | On read | On write | On read + on write |
| Cost/TB | Cheapest | Expensive | Cheap (object store) |
| ACID | No | Yes | Yes |
| ML friendly | Yes (raw files) | Awkward | Yes |
| BI friendly | No (need engine) | Best | Good (via SQL engines) |

Lakehouse = lake storage + warehouse features (ACID, time travel, schema enforcement) via Delta/Iceberg/Hudi.

Tradeoff #3 — OLTP vs OLAP vs HTAP

| | OLTP | OLAP | HTAP |
|---|---|---|---|
| Storage | Row-oriented | Columnar | Hybrid |
| Workload | Many small txns | Few large scans | Both |
| Examples | Postgres, MySQL | Snowflake, BQ, Redshift | TiDB, SingleStore |
| Latency | ms | sec-min | ms-sec |

Interview answer: "OLTP = operational, row-based. OLAP = analytical, columnar. Never run analytics on OLTP → kills production."

Tradeoff #4 — Lambda vs Kappa Architecture

Lambda: 2 pipelines — batch (accurate, slow) + streaming (approximate, fast). Merge at query time.

📐 Architecture Diagram
Source ──┬──> Batch layer  ──> Serving (daily)
         └──> Speed layer  ──> Serving (real-time)
                                      ▲
                                      └─ merge at query time

Kappa: 1 pipeline — streaming only. Replay from Kafka for historical backfill.

Source ──> Kafka ──> Stream processor ──> Serving layer
             └─ replay from Kafka offset 0 for backfill

Tradeoff: Lambda = 2 codebases, consistency issues. Kappa = single codebase but needs durable log (Kafka).

Modern default (2026): Kappa with Delta/Iceberg CDF for historical reprocessing.

Tradeoff #5 — Push vs Pull ingestion

| | Push (source → us) | Pull (us → source) |
|---|---|---|
| Control | Source decides rate | Consumer decides rate |
| Failure | We may drop data | We may miss new data |
| Examples | Webhooks, Kafka producers | Polling REST APIs, CDC reads |
| Backpressure | Hard | Natural |

Rule: high-volume real-time → push (Kafka). Scheduled / bounded → pull (Airflow jobs).

Tradeoff #6 — CAP Theorem for distributed stores

Of {Consistency, Availability, Partition tolerance}, you can only guarantee 2 during a partition.

| System | Picks | Example |
|---|---|---|
| CP | Consistency + Partition tolerance | HBase, MongoDB (strict), ZooKeeper |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB (default) |
| CA | Not possible in distributed systems | (single-node only) |

Interview rule: if they ask "eventual consistency" → you're in AP. If they ask "strong consistency" → you're in CP.

Tradeoff #7 — Synchronous vs Asynchronous + Queue

Sync: Service A calls Service B and waits for the response. Fast but fragile.
Async via queue: Service A publishes an event → Queue → Service B consumes.

When async wins:

  • B is slow or flaky → A doesn't block
  • Retry/DLQ needed
  • Multiple consumers of same event (pub/sub)

When sync wins:

  • Need immediate answer (booking confirmation)
  • Simple, low latency
  • Fewer moving parts
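The async-with-retries pattern can be sketched with an in-memory queue standing in for Kafka/SQS (all names here are illustrative, not a specific library's API):

```python
from collections import deque

def consume(queue: deque, handler, max_retries: int = 3) -> list:
    """Drain a queue; retry each message up to max_retries, then dead-letter it."""
    dlq = []
    while queue:
        msg = queue.popleft()
        attempts = msg.setdefault("attempts", 0)
        try:
            handler(msg)
        except Exception:
            msg["attempts"] = attempts + 1
            if msg["attempts"] >= max_retries:
                dlq.append(msg)      # give up: park for manual inspection
            else:
                queue.append(msg)    # requeue for a later retry
    return dlq

# A handler that fails permanently on one poison message.
def handler(msg):
    if msg["id"] == "bad":
        raise ValueError("unparseable payload")

q = deque([{"id": "a"}, {"id": "bad"}, {"id": "b"}])
dead = consume(q, handler)
print([m["id"] for m in dead])  # only the poison message lands in the DLQ
```

Note that the consumer never blocks the producer: failures are absorbed by the retry loop and the DLQ, which is exactly why async wins when B is flaky.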

Tradeoff #8 — Columnar vs Row storage

ROW-ORIENTED:
[id=1, name=alice, amt=100, date=2026-04-01]
[id=2, name=bob, amt=200, date=2026-04-02]
[id=3, name=carol, amt=300, date=2026-04-03]
COLUMN-ORIENTED:
id: [1, 2, 3]
name: [alice, bob, carol]
amt: [100, 200, 300]
date: [2026-04-01, 2026-04-02, 2026-04-03]

Row wins: "give me order 42" (single-row reads, OLTP).
Column wins: "sum amount by month" (scans one column, compressible).

Columnar also compresses way better because same-type values cluster.
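The access-pattern difference shows up even in plain Python, modeling the two layouts as a list of records vs a dict of arrays (a toy illustration, not a storage engine):

```python
# Row-oriented: one record per row — natural for point lookups.
rows = [
    {"id": 1, "name": "alice", "amt": 100},
    {"id": 2, "name": "bob",   "amt": 200},
    {"id": 3, "name": "carol", "amt": 300},
]

# Column-oriented: one array per column — natural for scans/aggregates.
cols = {
    "id":   [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amt":  [100, 200, 300],
}

# Point lookup ("give me order 2"): read one record from the row layout.
order = next(r for r in rows if r["id"] == 2)

# Aggregate ("sum amount"): touch only one column in the columnar layout,
# instead of deserializing every full record.
total = sum(cols["amt"])
print(order["name"], total)
```

The columnar aggregate never looks at `name` or `id`, and because `amt` is a homogeneous array it compresses well, which is the point made above.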

SECTION 2 — 3 CLASSIC DATA PLATFORM DESIGNS

Design 1: Clickstream analytics platform (real-time dashboard)

Requirements: track every page view, show live MAU / top pages / funnel metrics.

Architecture:

📐 Architecture Diagram
Web app ──(HTTP)──▶ Gateway/Collector
                         │
                         ▼
                       Kafka (topic=page_views, partitioned by user_id)
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
      Flink/Spark   S3 (raw logs)  ClickHouse
      (aggregations)  (long-term)   (serving OLAP)
            │
            ▼
    Redis/Materialized  ──▶ Dashboard API ──▶ Grafana/App
    view (1-min rollups)

Key decisions:

  • Kafka as buffer → handles traffic spikes + replay for reprocessing
  • Flink for stateful aggregations (tumbling windows)
  • S3 = cold storage, ClickHouse = hot OLAP serving
  • Redis holds latest minute rollups for <1s dashboards

Scale math: 10 M events/day = ~115 events/sec. Kafka single partition handles this easily; 3-6 partitions for room.
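The scale math is worth doing explicitly; a quick back-of-envelope (the 5x peak factor is an assumption, not from the original requirements):

```python
events_per_day = 10_000_000
seconds_per_day = 24 * 60 * 60           # 86_400

avg_events_per_sec = events_per_day / seconds_per_day
print(int(avg_events_per_sec))           # ~115 events/sec on average

# Real traffic is bursty — budget for, say, a 5x peak over the average.
peak_events_per_sec = avg_events_per_sec * 5
print(int(peak_events_per_sec))          # still trivial for one Kafka partition
```

Even the assumed peak is far below what a single partition sustains, which is why 3-6 partitions is about headroom and consumer parallelism, not throughput.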

Design 2: Batch data warehouse (ELT)

Requirements: load 50 source systems (CRM, ERP, app DB), serve finance + product BI.

Architecture:

Sources (50)
     │  Fivetran / Airbyte / CDC
     ▼
Raw zone (S3 / Snowflake bronze schema)
     │  dbt models (SQL transformations)
     ▼
Silver (cleaned, conformed)
     │  dbt
     ▼
Gold (business marts)
     │
     ▼
BI (Tableau/Looker) + Reverse ETL (back to Salesforce)

  • Orchestration: Airflow DAGs → dbt run nightly at 2am UTC
  • Data quality: dbt tests (not_null, unique, relationships) + Great Expectations on critical tables
  • Cost control: Snowflake auto_suspend=60s on BI warehouse, separate warehouse for ETL

Design 3: ML feature store + training + serving

Requirements: serve user features to recommendation model online + train weekly offline.

Architecture:

Event sources ──> Kafka ──> Flink (feature computation)
                              │
                              ├──> Offline store (Delta table, historical)
                              │        │
                              │        ▼
                              │    Training pipeline (Spark ML / SageMaker)
                              │
                              └──> Online store (Redis/DynamoDB)
                                       │
                                       ▼
                                  Serving API (low-latency reads)
                                       │
                                       ▼
                                  Recommendation model (inference)

Feast or Tecton handles the dual write to the offline and online stores and guarantees feature parity between training and serving (avoiding training-serving skew).

SECTION 3 — SCALE MATH (MEMORIZE THESE)

Ballpark numbers every data engineer should know:

| Resource | Latency |
|---|---|
| L1 cache | 1 ns |
| RAM access | 100 ns |
| SSD read | 150 μs |
| HDD seek | 10 ms |
| Network, same datacenter | 500 μs |
| Network, cross-region | 100-200 ms |

Bandwidth:

  • 1 Gbps network → 125 MB/s
  • 10 Gbps → 1.25 GB/s
  • SSD throughput → 500 MB/s - 3 GB/s (NVMe)

Data volumes:

  • 1M rows of 1 KB each = 1 GB
  • 1B rows of 100 B each = 100 GB
  • Daily 10M events × 500 B = 5 GB/day = 1.8 TB/year
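The data-volume figures above are straightforward to verify (decimal units, 1 GB = 10^9 B):

```python
KB = 1_000
GB = 1_000_000_000

# 1M rows × 1 KB each
print(1_000_000 * 1 * KB / GB)       # 1.0 GB

# 1B rows × 100 B each
print(1_000_000_000 * 100 / GB)      # 100.0 GB

# 10M events/day × 500 B each
daily_gb = 10_000_000 * 500 / GB     # 5.0 GB/day
yearly_tb = daily_gb * 365 / 1_000   # ≈ 1.8 TB/year
print(daily_gb)
```

Being able to do this conversion (rows × bytes → GB/day → TB/year) on the spot is the whole skill being tested.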

Throughput estimates:

  • Kafka single partition: 10-30 MB/s
  • Postgres writes: ~10K/s on good hardware
  • Redis: 100K ops/sec single node
  • Cassandra: 10K writes/sec/node

Use these to sanity-check interview claims. If someone says "a single Postgres instance handles 1M writes/sec," that's off by roughly two orders of magnitude.

SECTION 4 — COMMON GOTCHAS (WHAT FAILS IN PROD)

Gotcha 1: "We'll just add caching"

Caching solves reads, not writes. Adds invalidation complexity. Stale data bugs. Cache stampede on expiry.

Gotcha 2: "We'll shard later"

Resharding a production DB is painful (downtime, dual-write, rehash). Plan partitioning key from day 1.

Gotcha 3: "Kafka solves everything"

Kafka is a log, not a DB. Cannot query by key. Cannot update. Use Kafka for event transport + replay, NOT as source of truth for lookups.

Gotcha 4: "Exactly-once"

True exactly-once requires idempotent consumers + transactional producers + coordinated commits. Most pipelines get "effectively once" via idempotent writes (upserts).
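"Effectively once" via idempotent writes can be sketched with an upsert keyed by a unique event id — replaying the same batch after a crash does not change the result (illustrative, not a specific framework's API):

```python
def apply_events(store: dict, events: list) -> dict:
    """Idempotent upsert: each event carries a unique id, so replays
    overwrite with identical values instead of double-counting."""
    for e in events:
        store[e["event_id"]] = e["amount"]   # upsert by key
    return store

events = [
    {"event_id": "e1", "amount": 100},
    {"event_id": "e2", "amount": 200},
]

store = {}
apply_events(store, events)
apply_events(store, events)       # replay after a crash: absorbed
total = sum(store.values())
print(total)  # 300, not 600
```

Contrast with an append-only counter, where the same replay would double the total — that is the at-least-once failure mode the upsert avoids.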

Gotcha 5: "Join everything in Spark"

Shuffle-heavy. Prefer: broadcast small tables, pre-join at ingestion, denormalize in gold layer.

Gotcha 6: "Store everything in JSON"

Schema-on-read hides bugs. Migrations become archaeology. Use Parquet/Avro with explicit schemas.
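The schema-on-write discipline can be approximated even without Parquet tooling: reject records that fail an explicit schema at ingestion time. A toy validator (real pipelines would rely on Parquet/Avro schemas or a schema registry):

```python
SCHEMA = {"id": int, "name": str, "amt": int}

def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of schema violations (empty list = valid record)."""
    errors = []
    for field, typ in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors

print(validate({"id": 1, "name": "alice", "amt": 100}))  # [] — valid
print(validate({"id": "1", "name": "alice"}))            # type + missing-field errors
```

Failing loudly at the bronze layer is what turns "migrations become archaeology" into an immediate, attributable error.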

SECTION 5 — TIMED MOCK INTERVIEW (60 MIN)

Q1 (15 min) — "Design a data platform for a fintech processing 100K transactions/sec"

Answer structure:

  1. Clarify: read/write ratio? regulatory retention? fraud detection latency SLA?
  2. Ingest: transactions → Kafka (partition by account_id for per-account ordering, 50+ partitions)
  3. Processing:
    • Stream: Flink for real-time fraud scoring (<100 ms latency)
    • Batch: daily reconciliation in Spark on Delta
  4. Store:
    • Hot: Cassandra for account balance (AP for availability)
    • Warm: Delta Lake on S3 (7-year retention for compliance)
    • Analytics: BigQuery/Snowflake for finance team
  5. Serve: gRPC API reading Cassandra + feature store (Redis) for fraud model
  6. Scale math: 100K tx/sec × 500 B = 50 MB/s. Kafka cluster: 6 brokers, 50 partitions.
  7. Failure modes: Kafka replication=3 + min.insync=2, multi-AZ, DLQ for failed fraud scoring

Q2 (10 min) — "A dashboard is slow. It queries a 10 TB Delta table. Fix it."

Checklist:

  1. Partition pruning — is query filtering on partition column?
  2. Z-ORDER / Liquid Clustering on frequent filter columns
  3. OPTIMIZE for small-file compaction
  4. Materialized aggregates (gold table, pre-aggregated daily rollups)
  5. Serve dashboard from materialized view, not raw table
  6. Warehouse sizing + result cache
  7. Consider moving to ClickHouse / Druid if latency < 1s needed

Q3 (8 min) — "How do you handle schema evolution in a lakehouse?"

  1. Use Parquet/Delta — schema metadata built-in
  2. Additive changes safe (add nullable column)
  3. Breaking changes (rename, drop, type change) → version the table OR use column mapping (Delta)
  4. Contract tests in CI — validate schema before deploy
  5. Consumer code: SELECT col1, col2 not SELECT * to survive new columns
  6. Data contracts (Protobuf/Avro registry) for producer-consumer agreements

Q4 (8 min) — "Design ingestion from 200 legacy Oracle databases into lakehouse"

  1. CDC: log-based CDC (Debezium/Goldengate/DMS) → Kafka topics per table
  2. Staging: raw CDC events land as Delta append-only table (bronze)
  3. Merge: hourly MERGE INTO silver_table USING bronze_stream — applies upserts/deletes
  4. Schema registry: Avro in Confluent registry, enforce contracts
  5. Bootstrap: initial snapshot via Spark JDBC read, then switch to CDC
  6. Monitoring: lag per table, DLQ for unparseable events, watermark delay alerts
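The merge in step 3 boils down to applying CDC operations in order, keyed by primary key — a minimal in-memory version of what Delta's MERGE INTO does at scale (event shape here is illustrative, not Debezium's actual envelope):

```python
def apply_cdc(table: dict, events: list) -> dict:
    """Apply insert/update/delete CDC events keyed by primary key."""
    for e in events:
        if e["op"] in ("insert", "update"):
            table[e["pk"]] = e["row"]     # upsert
        elif e["op"] == "delete":
            table.pop(e["pk"], None)      # tolerate replayed deletes

    return table

events = [
    {"op": "insert", "pk": 1, "row": {"name": "alice"}},
    {"op": "insert", "pk": 2, "row": {"name": "bob"}},
    {"op": "update", "pk": 1, "row": {"name": "alicia"}},
    {"op": "delete", "pk": 2, "row": None},
]

silver = apply_cdc({}, events)
print(silver)  # {1: {'name': 'alicia'}}
```

Order matters: applying events out of sequence (e.g. delete before insert) yields a different table, which is why CDC pipelines preserve per-key ordering via the Kafka partition key.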

Q5 (10 min) — "Dashboard shows wrong number. How do you debug data quality?"

  1. Reproduce: what query, what time, what cell shows wrong value?
  2. Walk backwards: dashboard → gold table → silver → bronze → source
  3. Check at each layer: row counts, aggregates, joins
  4. Find the break: "silver has 1M rows, gold has 900K" → loss in aggregation
  5. Common culprits:
    • INNER vs LEFT JOIN dropping rows
    • Timezone mismatch in date filters
    • Dedup logic keeping wrong row
    • Late-arriving data + non-idempotent pipeline
  6. Preventive: dbt tests for uniqueness / not-null / row counts / referential integrity
  7. Communicate: file bug, timeline of when-broken, impact scope
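Step 4's "find the break" is mechanical once you have per-layer row counts; a sketch of the comparison (layer names and counts are made up for illustration):

```python
def find_break(counts: dict, tolerance: float = 0.0):
    """Walk layers in pipeline order; return the first transition whose
    row count drops by more than `tolerance` (a fraction) vs the previous
    layer, or None if counts are consistent."""
    layers = list(counts)
    for prev, cur in zip(layers, layers[1:]):
        if counts[cur] < counts[prev] * (1 - tolerance):
            return f"drop between {prev} and {cur}"
    return None

counts = {"bronze": 1_000_000, "silver": 1_000_000, "gold": 900_000}
print(find_break(counts))  # drop between silver and gold
```

In practice you would also compare key aggregates (sums, distinct keys) per layer, since joins and dedup bugs can corrupt values without changing row counts.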

Q6 (9 min) — "Build reporting pipeline for team of 3 data engineers, 5 analysts, 100 stakeholders"

  • Storage: Snowflake (SMB starter tier) OR BigQuery (pay-per-query)
  • Transform: dbt for SQL models, git-versioned, code-reviewed
  • Orchestrate: Airflow (or Dagster) for non-dbt tasks; dbt Cloud scheduler for dbt
  • Ingest: Fivetran for SaaS sources (Salesforce, HubSpot, Stripe); custom Python for long-tail
  • BI: Looker/Metabase for analysts + self-serve for stakeholders
  • Quality: dbt tests on every PR + monitoring dashboard in Metabase
  • Governance: RBAC via Snowflake roles, PII tagged in dbt metadata

SECTION 6 — FRAMEWORKS TO ANSWER ANY DESIGN QUESTION

When you hear a design prompt, follow this 7-step framework:

  1. Clarify — scale (QPS, volume), latency SLA, consistency needs, retention
  2. Capacity estimates — rows/sec, GB/day, storage over 3 years
  3. API design — what does the consumer call? REST? gRPC? SQL?
  4. High-level architecture — draw 5-8 boxes: ingest, store, process, serve
  5. Deep-dive a tricky component — partitioning, replication, failure mode
  6. Tradeoffs — call out 2-3 decisions and alternatives
  7. Monitoring + failure — what breaks first? how do you detect?

Timing for a 45-min interview:

  • 5 min clarify
  • 10 min HLD
  • 15 min deep-dive
  • 10 min tradeoffs + scaling
  • 5 min ops/monitoring

SECTION 7 — FINAL READINESS CHECKLIST

  • Can I draw INGEST-STORE-SERVE and place real tools in each box?
  • Do I know 5 tradeoffs with numbers (batch vs stream, row vs col, sync vs async)?
  • Can I estimate throughput from event size × rate?
  • Do I know when to pick CP vs AP?
  • Can I design clickstream / warehouse / feature store from memory?
  • Do I know Lambda vs Kappa and 2026 default?
  • Can I debug a slow 10 TB query (5+ actions)?
  • Do I know schema evolution strategies for lakehouse?
  • Can I run the 7-step framework on any new prompt?
  • Do I have ballpark latency + bandwidth numbers memorized?

If all 10 = YES, you're data-system-design interview-ready.

Remember INGEST-STORE-SERVE. Everything else is details.