Day 2: ETL Patterns & Pipeline Design — Quick Recall Guide
- ⚡ = Must remember (95% chance of being asked)
- 🔑 = Key concept (core understanding needed)
- ⚠️ = Common trap (interviewers love to test this)
- 🧠 = Memory Map (mnemonic/acronym — memorize this!)
- 📝 = One-liner (flash-card style — cover answer, test yourself)
🧠 MASTER MEMORY MAP — Day 2
SECTION 1: MEDALLION ARCHITECTURE
🧠 Memory Map: Medallion Layers
⚡ MUST KNOW DIRECT QUESTIONS
**Q: What is the Medallion Architecture?**
A data organization pattern with 3 layers: Bronze (raw), Silver (clean), Gold (business-ready). Data flows through the layers with increasing quality.
**Q: What goes in Bronze?**
Raw data exactly as received — no transformations. Just append with metadata columns (`_ingested_at`, `_source_file`). Never update or delete in Bronze.
**Q: What goes in Silver?**
Cleaned, deduplicated, validated data. Apply MERGE for upserts, fix data types, remove nulls, apply business keys, tag PII columns.
**Q: What goes in Gold?**
Business-ready aggregations, KPIs, star schema. Optimized for fast queries by BI tools (Power BI, Tableau). Pre-computed metrics like daily revenue per route.
**Q: Who uses each layer?**
Bronze = data engineers (debugging). Silver = engineers + data scientists. Gold = business analysts + BI tools.
**Q: When would you deviate from the standard three layers?**
- Skip Silver: source data is already clean → Bronze straight to Gold
- Skip Gold: a real-time dashboard reads from Silver directly
- Add layers: a "feature layer" between Silver and Gold for ML features
**Q: What is late-arriving data?**
Data that arrives after its actual event time. Example: a booking from March 1 arrives on March 5 (the airline system was offline). Handle with MERGE on the business key — it updates Silver even if the data arrives late.
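The late-data behavior can be sketched in plain Python (no Spark): an upsert keyed on the business key lands a row correctly no matter when it arrives. Table and column names here are illustrative.

```python
# Minimal sketch: Silver modeled as a dict keyed by the business key.
# Upserting (like MERGE) makes arrival order irrelevant for late data.

def upsert(silver, rows, key="booking_id"):
    """Apply rows to the Silver 'table', inserting or overwriting by key."""
    for row in rows:
        silver[row[key]] = row
    return silver

silver = {}
upsert(silver, [{"booking_id": "B2", "event_date": "2024-03-02"}])
# A March 1 booking arrives days later — the upsert still lands it cleanly:
upsert(silver, [{"booking_id": "B1", "event_date": "2024-03-01"}])
```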
🔑 MID-LEVEL QUESTIONS
| Layer | Strategy |
|---|---|
| Bronze | Just append — it's append-only anyway. Add _ingested_at to track when it arrived |
| Silver | MERGE on business key — upserts handle late data naturally |
| Gold | Recompute affected aggregation windows (reprocess March 1 aggregations) |
5 essential columns:
- `_ingested_at` — when data was loaded (TIMESTAMP)
- `_source_file` — which file it came from (STRING)
- `_batch_id` — which batch run loaded it (STRING)
- `_source_system` — which system sent it (STRING)
- `_raw_payload` — original data if JSON (for debugging)
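A plain-Python sketch of stamping these metadata columns onto a raw record before it lands in Bronze; the function name and sample values are illustrative, only the column names come from the list above.

```python
from datetime import datetime, timezone
import json

def stamp_bronze_metadata(record, source_file, batch_id, source_system):
    """Return a Bronze row: the raw record plus the five audit columns."""
    return {
        **record,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source_file": source_file,
        "_batch_id": batch_id,
        "_source_system": source_system,
        "_raw_payload": json.dumps(record),  # keep the original for debugging
    }

row = stamp_bronze_metadata({"booking_id": "B1"}, "bookings_001.json",
                            "batch-42", "reservations")
```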
- Medallion = centralized pipeline (one team owns Bronze→Silver→Gold)
- Data Mesh = decentralized (each domain team owns their own Bronze→Silver→Gold)
- Can combine both: each domain team follows Medallion within their domain
SECTION 2: SCD (SLOWLY CHANGING DIMENSIONS)
🧠 Memory Map: SCD Types
⚡ MUST KNOW DIRECT QUESTIONS
**Q: What is an SCD?**
Slowly Changing Dimension — how you handle changes to dimension tables over time. Example: a passenger changes their address — do you overwrite, keep history, or ignore?
**Q: SCD Type 1?**
Overwrite the old value with the new value. No history kept. Simple, but you lose what the old value was.
**Q: SCD Type 2?**
Add a new row for each change. Keep the old row with `end_date` and `is_current = false`. Full history preserved. Most common in data warehousing.
**Q: SCD Type 3?**
Add a column for the previous value (e.g., `current_address` + `previous_address`). Limited history — only tracks the last change.
**Q: Why can't a single MERGE action implement SCD Type 2?**
Because for ONE changed record, you need TWO operations:
- UPDATE the old row (close it: set `end_date`, `is_current = false`)
- INSERT a new row (open it: `start_date = today`, `is_current = true`)
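The two-operation dance can be shown in plain Python on a list-of-dicts dimension; column names follow the convention above, everything else is illustrative.

```python
from datetime import date

def scd2_apply(dim, key, new_address, today=None):
    """SCD Type 2: close the current row, then append a new current row."""
    today = today or date.today().isoformat()
    for row in dim:
        if row["business_key"] == key and row["is_current"]:
            row["end_date"] = today      # operation 1: UPDATE (close old row)
            row["is_current"] = False
    dim.append({                          # operation 2: INSERT (open new row)
        "business_key": key, "address": new_address,
        "start_date": today, "end_date": None, "is_current": True,
    })
    return dim

dim = [{"business_key": "P1", "address": "Old St", "start_date": "2023-01-01",
        "end_date": None, "is_current": True}]
scd2_apply(dim, "P1", "New Ave", today="2024-03-01")
```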
🔑 MID-LEVEL QUESTIONS
Problem: MERGE matches on booking_id. For a changed record, you need to UPDATE the old row AND INSERT a new row. But MERGE only does one action per matched key. Solution: Create a merge_key column in the staged data:
- Row to UPDATE: `merge_key = booking_id` (matches target → WHEN MATCHED → UPDATE)
- Row to INSERT: `merge_key = NULL` (never matches → WHEN NOT MATCHED → INSERT)
```sql
-- Step 1: Create staged changes with TWO rows per changed record
WITH staged_changes AS (
  -- Row 1: Close the old record (merge_key = booking_id → will MATCH)
  SELECT booking_id AS merge_key, booking_id, address,
         current_timestamp AS end_date, false AS is_current
  FROM changes WHERE change_type = 'U'
  UNION ALL
  -- Row 2: Insert new record (merge_key = NULL → will NOT MATCH → INSERT)
  SELECT NULL AS merge_key, booking_id, new_address,
         NULL AS end_date, true AS is_current
  FROM changes WHERE change_type = 'U'
)
-- Step 2: MERGE with merge_key
MERGE INTO dim_passenger USING staged_changes
ON dim_passenger.booking_id = staged_changes.merge_key
   AND dim_passenger.is_current = true
WHEN MATCHED THEN UPDATE SET end_date = ..., is_current = false
WHEN NOT MATCHED THEN INSERT (...)
```
Use Lakeflow apply_changes — it handles SCD Type 2 automatically:
```python
dlt.apply_changes(
    target="dim_passenger",     # Target table
    source="raw_passengers",    # Source stream
    keys=["passenger_id"],      # Business key
    sequence_by="updated_at",   # How to order changes
    stored_as_scd_type=2        # SCD Type 2 automatically!
)
```
SECTION 3: CDC (CHANGE DATA CAPTURE)
🧠 Memory Map: CDC Pipeline
⚡ MUST KNOW DIRECT QUESTIONS
**Q: What is CDC?**
Change Data Capture — captures every INSERT, UPDATE, DELETE from a source database and sends it downstream in real-time or near-real-time.
**Q: What is Debezium?**
An open-source CDC connector that reads database transaction logs (Oracle redo logs, MySQL binlog) and streams changes to Kafka. Zero impact on source database performance.
**Q: Why read transaction logs instead of querying the source directly?**
- Direct queries = poll the database → high load on the source, miss deletes, miss intermediate changes
- Debezium = reads transaction logs → zero load on the source, captures ALL changes including deletes
**Q: CDC vs CDF?**
CDC = capture changes FROM an external source (Oracle → Delta). CDF = track changes WITHIN Delta Lake (Delta → downstream). CDF (Change Data Feed) is a Delta Lake feature that records which rows changed (insert/update/delete) in a Delta table, so downstream consumers can read ONLY the changes instead of the full table.
```sql
-- Enable CDF on a table
ALTER TABLE bookings SET TBLPROPERTIES ('delta.enableChangeDataFeed' = true);

-- Read only changes since version 5
SELECT * FROM table_changes('bookings', 5);
```
3 columns automatically added:
- `_change_type` — "insert", "update_preimage", "update_postimage", "delete"
- `_commit_version` — which Delta version
- `_commit_timestamp` — when the change happened
🔑 MID-LEVEL QUESTIONS
**Q: How do you handle source deletes in Silver?**
Two approaches:
- Soft delete: set `is_deleted = true` and keep the row (for an audit trail)
- Hard delete: actually delete the row from Silver (for GDPR compliance)
```sql
MERGE INTO silver_bookings t USING bronze_cdc s
ON t.booking_id = s.booking_id
WHEN MATCHED AND s.op = 'd' THEN DELETE          -- Hard delete
WHEN MATCHED AND s.op = 'u' THEN UPDATE SET ...
WHEN NOT MATCHED AND s.op = 'c' THEN INSERT ...
```
**Q: How do you handle out-of-order CDC events?**
Use a sequence column (like an updated_at timestamp or lsn — log sequence number). During MERGE, only apply the change if the incoming record is NEWER:

```sql
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET ...
```
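The same guard can be sketched in plain Python: a change is applied only when its sequence value beats what the target already holds, so replaying old events is harmless. Column names are illustrative.

```python
def apply_cdc(silver, events):
    """Apply CDC events idempotently: keep a change only if it is NEWER
    (by updated_at) than what Silver already has for that key."""
    for ev in events:
        current = silver.get(ev["booking_id"])
        if current is None or ev["updated_at"] > current["updated_at"]:
            silver[ev["booking_id"]] = ev
    return silver

silver = {}
# Events arrive out of order: the older update must NOT overwrite the newer one
apply_cdc(silver, [
    {"booking_id": "B1", "status": "confirmed", "updated_at": "2024-03-05T10:00"},
    {"booking_id": "B1", "status": "pending",   "updated_at": "2024-03-01T09:00"},
])
```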
SECTION 4: AUTO LOADER
🧠 Memory Map: Auto Loader
AUTO LOADER = "Automatically ingest new files as they arrive"
cloudFiles = the Spark source name for Auto Loader
TWO MODES:
Directory Listing → Scans folder for new files (simple, small folders)
File Notification → Azure Event Grid notifies when new file arrives (scalable, huge folders)
┌─────────────────────────────────────────────┐
│ < 10,000 files/day → Directory Listing │
│ > 10,000 files/day → File Notification │
└─────────────────────────────────────────────┘
SCHEMA EVOLUTION:
New column appears → Auto Loader adds it automatically
Unexpected data → Goes to _rescued_data column (not lost!)
Remember: "Auto Loader = cloudFiles + Directory or Notification + Schema rescue"
AUTO LOADER vs COPY INTO:
Auto Loader = streaming (always on, processes new files automatically)
COPY INTO = batch (run manually, idempotent — safe to re-run)
⚡ MUST KNOW DIRECT QUESTIONS
**Q: What is Auto Loader?**
A Databricks feature that automatically ingests new files from cloud storage (ADLS Gen2) as they arrive. Uses cloudFiles as the Spark source format.
**Q: What are its two file-discovery modes?**
- Directory listing — scans the folder periodically (good for <10K files/day)
- File notification — Azure Event Grid sends notifications (good for >10K files/day, more efficient)
**Q: What is `_rescued_data`?**
A special column where Auto Loader puts data that doesn't match the expected schema — instead of failing, unexpected fields are saved in this column for debugging.
**Q: How do you handle schema evolution?**
Set `cloudFiles.schemaEvolutionMode = "addNewColumns"` — when the source adds new columns, Auto Loader auto-adds them to the target table. The pipeline restarts to pick up the new schema.
| | Auto Loader | COPY INTO |
|---|---|---|
| Type | Streaming (always on) | Batch (run manually) |
| Scalability | Millions of files | Thousands of files |
| Schema evolution | Auto-detects | No auto-detection |
| File tracking | Checkpoint (remembers processed files) | Idempotent (safe to re-run) |
| Use when | Continuous ingestion | One-time or scheduled loads |
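The "file tracking" row can be sketched in plain Python: a checkpoint set remembers which files were already ingested, so a re-run never double-loads. Real Auto Loader persists this state in the stream checkpoint; the function below is an illustrative stand-in.

```python
def ingest(checkpoint, arrived_files):
    """Auto Loader-style file tracking: skip files already in the
    checkpoint, record and return only the genuinely new ones."""
    new_files = [f for f in arrived_files if f not in checkpoint]
    checkpoint.update(new_files)
    return new_files  # only these get loaded this run

checkpoint = set()
first = ingest(checkpoint, ["2024-03-01.json", "2024-03-02.json"])
# Re-run with one new file: old files are skipped, only the new one loads
second = ingest(checkpoint, ["2024-03-01.json", "2024-03-02.json", "2024-03-03.json"])
```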
SECTION 5: LAKEFLOW DECLARATIVE PIPELINES (formerly DLT)
🧠 Memory Map: Lakeflow
LAKEFLOW = "Declarative ETL — you say WHAT, it does HOW"
Old name: DLT (Delta Live Tables) → renamed to "Lakeflow Declarative Pipelines" (2025)
TWO TABLE TYPES:
Streaming Table = append-only, processes new data only (like a river)
Materialized View = recomputed from scratch when data changes (like a snapshot)
┌───────────────────────────────────────────────┐
│ Append-only data (logs, events) → Streaming │
│ Aggregations, lookups → Materialized View │
└───────────────────────────────────────────────┘
DATA QUALITY = EXPECTATIONS (3 levels):
EXPECT → warn only (log bad rows, keep them)
EXPECT OR DROP → silently drop bad rows
EXPECT OR FAIL → stop pipeline on bad data
Remember: "WDF" = Warn, Drop, Fail
W — expect (Warn)
D — expect or DROP
F — expect or FAIL
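The three W/D/F behaviors can be modeled in plain Python on a list of rows; this is a sketch of the Lakeflow semantics, not the real engine, and all names are illustrative.

```python
def apply_expectation(rows, predicate, mode):
    """Warn: keep bad rows and count them. Drop: remove them.
    Fail: raise as soon as any row violates the rule."""
    bad = [r for r in rows if not predicate(r)]
    if mode == "warn":
        return rows, len(bad)                               # EXPECT
    if mode == "drop":
        return [r for r in rows if predicate(r)], len(bad)  # EXPECT OR DROP
    if mode == "fail" and bad:                              # EXPECT OR FAIL
        raise ValueError(f"expectation failed on {len(bad)} rows")
    return rows, 0

rows = [{"amount": 50}, {"amount": -5}]
valid = lambda r: r["amount"] > 0
kept_warn, bad_warn = apply_expectation(rows, valid, "warn")
kept_drop, bad_drop = apply_expectation(rows, valid, "drop")
```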
PIPELINE MODES:
TRIGGERED = run once, then stop (batch — for cost savings)
CONTINUOUS = always running (streaming — for low latency)
⚡ MUST KNOW DIRECT QUESTIONS
**Q: What is Lakeflow Declarative Pipelines?**
A declarative framework for building ETL pipelines. You define WHAT your tables should look like (SQL/Python), and Lakeflow figures out HOW to execute, manage dependencies, handle retries, and enforce data quality.
**Q: What is DLT called now?**
Lakeflow Declarative Pipelines (2025). The product is the same, just rebranded.
**Q: What is a streaming table?**
An append-only table that processes only NEW data (incremental). Like a river — new water flows through, old water is already downstream. Best for: logs, events, CDC streams.
**Q: What is a materialized view?**
A table that is fully recomputed when the underlying data changes. Like a fresh photo of your data. Best for: aggregations, star schema, Gold layer tables.
**Q: What are the three expectation levels?**
- `EXPECT` — warn only, bad rows kept (log for monitoring)
- `EXPECT OR DROP` — silently remove bad rows
- `EXPECT OR FAIL` — stop the entire pipeline
**Q: What is apply_changes?**
A Lakeflow function that automatically handles CDC/SCD logic. Give it a source stream, target table, keys, and SCD type — it does the MERGE internally.
**Q: TRIGGERED vs CONTINUOUS mode?**
- TRIGGERED: runs once, processes all available data, stops (cheaper, for batch)
- CONTINUOUS: runs forever, processes data as it arrives (more expensive, for real-time)
🔑 MID-LEVEL QUESTIONS
| | Streaming Table | Materialized View |
|---|---|---|
| Processing | Only NEW data (incremental) | Recomputes ALL data |
| Good for | High-volume append (CDC, logs) | Aggregations, joins, Gold layer |
| Can UPDATE existing rows? | No (append-only) | Yes (full recompute) |
| Cost | Low (reads only new data) | Higher (reprocesses everything) |
| Example | raw_bookings Bronze | daily_revenue Gold |
Bad rows are not just dropped — they're sent to a quarantine table for investigation:
```python
import dlt

@dlt.table
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_bookings():
    return spark.readStream.table("bronze_bookings")

@dlt.table
def quarantine_bookings():  # Capture what was dropped
    return spark.readStream.table("bronze_bookings").filter("amount <= 0")
```
**Q: Sketch a full Bronze → Silver → Gold Lakeflow pipeline.**

```python
import dlt
from pyspark.sql import functions as F

# Bronze: Auto Loader ingests raw files
@dlt.table
def bronze_bookings():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/bookings/"))

# Silver: Clean + validate
@dlt.table
@dlt.expect_or_drop("valid_id", "booking_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_bookings():
    return (dlt.read_stream("bronze_bookings")
            .select("booking_id", "passenger_id", "amount", "booking_date"))

# Gold: Aggregate
@dlt.table
def gold_daily_revenue():
    return (dlt.read("silver_bookings")
            .groupBy("booking_date")
            .agg(F.sum("amount").alias("total_revenue")))
```
SECTION 6: PIPELINE SCENARIOS
🔑 MID-LEVEL QUESTIONS
**Q: A big MERGE into Silver is slow — how do you speed it up?**
5-step fix:
- Liquid Clustering on merge key (biggest win — scans only relevant files)
- Filter source to only today's changes (don't MERGE entire history)
- OPTIMIZE target table first (compact small files)
- Enable low shuffle merge: `spark.databricks.delta.merge.lowShuffle.enabled = true`
- Check for data skew: one key has millions of rows → use salting
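The salting idea in step 5 can be sketched in plain Python: one hot key is split into N sub-keys so shuffle/MERGE work spreads across tasks instead of piling onto one. The key name and salt count are illustrative.

```python
import random

def salted_key(key, n_salts=8, rng=random):
    """Append a random salt suffix so one skewed key fans out
    into n_salts buckets (join the other side on all salts)."""
    return f"{key}#{rng.randrange(n_salts)}"

# Millions of rows for one skewed key now land in up to 8 buckets:
buckets = {salted_key("FREQUENT_FLYER") for _ in range(10_000)}
```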
**Q: How do you enforce data quality across the pipeline?**
4 layers:
- Schema validation (Bronze) — reject wrong formats at ingestion
- Row-level rules (Silver) — Lakeflow Expectations (NULL checks, range checks)
- Cross-table rules (Gold) — referential integrity (every booking has a valid passenger)
- Statistical monitoring — Lakehouse Monitoring detects drift (column mean shifted 20%)
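The statistical-monitoring layer can be sketched as a simple mean-shift check — a toy stand-in for what Lakehouse Monitoring does, with the 20% threshold taken from the example above and all data illustrative.

```python
def mean_drift(baseline, current, threshold=0.20):
    """Flag drift when the column mean shifts more than `threshold`
    relative to the baseline mean. Returns (drifted?, relative shift)."""
    base = sum(baseline) / len(baseline)
    cur = sum(current) / len(current)
    shift = abs(cur - base) / abs(base)
    return shift > threshold, shift

# Baseline mean 100, current mean 130 → 30% shift → drift flagged
drifted, shift = mean_drift([100, 110, 90], current=[130, 140, 120])
```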
🧠 FINAL REVISION — Day 2 Summary Card
┌─────────────────────────────────────────────────────────────┐
│ DAY 2: ETL PIPELINES │
├─────────────────────────────────────────────────────────────┤
│ │
│ MEDALLION = Bronze(raw) → Silver(clean) → Gold(business) │
│ Bronze: append-only + metadata (_ingested_at, _source) │
│ Silver: MERGE + dedup + PII tagging │
│ Gold: aggregations for BI + ML │
│ │
│ SCD Type 1 = Overwrite (no history) │
│ SCD Type 2 = Add new row (full history) ⭐ │
│ merge_key trick: NULL key → NOT MATCHED → INSERT new row │
│ Easy way: Lakeflow apply_changes(stored_as_scd_type=2) │
│ │
│ CDC = Oracle → Debezium → Kafka → Auto Loader → Delta │
│ CDC ops: c=create, u=update, d=delete, r=read │
│ CDF ≠ CDC! CDF = Delta's internal change tracking │
│ │
│ Auto Loader = cloudFiles source │
│ Two modes: Directory Listing (<10K) / Notification (>10K) │
│ Schema evolution: addNewColumns + _rescued_data │
│ Auto Loader (streaming) vs COPY INTO (batch) │
│ │
│ Lakeflow (was DLT): │
│ Streaming Table = append-only, incremental │
│ Materialized View = full recompute │
│ Expectations: Warn / Drop / Fail ("WDF") │
│ apply_changes = auto SCD Type 2 │
│ Modes: TRIGGERED (batch) vs CONTINUOUS (streaming) │
│ │
│ TOP 5 THINGS TO SAY IN INTERVIEW: │
│ 1. "Medallion with Bronze append-only + Silver MERGE" │
│ 2. "SCD Type 2 with merge_key trick or apply_changes" │
│ 3. "CDC: Debezium reads redo logs, zero source impact" │
│ 4. "Auto Loader with notification mode for scale" │
│ 5. "Lakeflow Expectations: Warn/Drop/Fail for quality" │
│ │
└─────────────────────────────────────────────────────────────┘
- First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
- Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
- Before interview (15 min): Read ONLY the Final Revision Summary Card