🧱 Databricks · Section 12 of 17
Day 2: ETL Patterns & Pipeline Design — Quick Recall Guide
🗺️ Memory Map
How to use this file:
  • ⚡ = Must remember (95% chance of being asked)
  • 🔑 = Key concept (core understanding needed)
  • ⚠️ = Common trap (interviewers love to test this)
  • 🧠 = Memory Map (mnemonic/acronym — memorize this!)
  • 📝 = One-liner (flash-card style — cover answer, test yourself)
Reading strategy: Read Memory Maps FIRST → then Direct Questions → then Mid-Level.

🧠 MASTER MEMORY MAP — Day 2

🧠 ETL PIPELINE DESIGN = "MACS-L"
  M — Medallion Architecture (Bronze → Silver → Gold)
  A — Auto Loader (streaming file ingestion)
  C — CDC (Change Data Capture — Oracle → Kafka → Delta)
  S — SCD Type 2 (Slowly Changing Dimensions — history tracking)
  L — Lakeflow Declarative Pipelines (formerly DLT)

MEDALLION = "BSG" (like the TV show Battlestar Galactica)
  B — Bronze (raw, as-is, append-only)
  S — Silver (clean, deduplicated, validated)
  G — Gold (aggregated, business-ready)

AUTO LOADER = "DNS"
  D — Directory listing mode (small folders, simple)
  N — Notification mode (huge folders, uses Azure Event Grid)
  S — Schema evolution (auto-detects new columns → rescue)

SECTION 1: MEDALLION ARCHITECTURE

🧠 Memory Map: Medallion Layers

🧠 Think: COOKING ANALOGY
BRONZE = Raw ingredients from market (dirty, uncut)
→ Append only, never update
→ Add metadata: _ingested_at, _source_file, _batch_id
→ Engineers read this for debugging
SILVER = Washed, cut, measured ingredients
→ Deduplicated (MERGE/upsert)
→ Schema validated, nulls handled
→ PII tagged (Unity Catalog)
→ Engineers + Scientists read this
GOLD = Final dish served to customer
→ Aggregated (daily revenue, route stats)
→ Star schema for BI tools
→ Power BI, Tableau read this
KEY DESIGN DECISIONS per layer:
Bronze: partition by ingestion_date (when data arrived)
Silver: cluster by business_key (booking_id)
Gold: cluster by query pattern (airport, date)

⚡ MUST KNOW DIRECT QUESTIONS

Q1: What is Medallion Architecture?

A data organization pattern with 3 layers: Bronze (raw), Silver (clean), Gold (business-ready). Data flows through layers with increasing quality.

Q2: What goes into the Bronze layer?

Raw data exactly as received — no transformations. Just append with metadata columns (_ingested_at, _source_file). Never update or delete in Bronze.

Q3: What goes into the Silver layer?

Cleaned, deduplicated, validated data. Apply MERGE for upserts, fix data types, remove nulls, apply business keys, tag PII columns.

Q4: What goes into the Gold layer?

Business-ready aggregations, KPIs, star schema. Optimized for fast queries by BI tools (Power BI, Tableau). Pre-computed metrics like daily revenue per route.

Q5: Who reads each layer?

Bronze = data engineers (debugging). Silver = engineers + data scientists. Gold = business analysts + BI tools.

⚠️ Q6: When would you skip a layer?

  • Skip Silver: Source data is already clean → Bronze straight to Gold
  • Skip Gold: Real-time dashboard reads from Silver directly
  • Add layers: "Feature layer" between Silver and Gold for ML features

Q7: What is late-arriving data?

Data that arrives after its actual event time. Example: A booking from March 1 arrives on March 5 (airline system was offline). Handle with MERGE on business key — it updates Silver even if it arrives late.

🔑 MID-LEVEL QUESTIONS

Q8: How do you handle late-arriving data in each layer?

  • Bronze: Just append — it's append-only anyway. Add _ingested_at to track when it arrived
  • Silver: MERGE on business key — upserts handle late data naturally
  • Gold: Recompute affected aggregation windows (reprocess March 1 aggregations)
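The strategies above can be sketched in plain Python. This is an illustrative stand-in for a Delta MERGE plus a Gold window recompute, not PySpark; the field names (booking_id, booking_date, amount) are assumptions:

```python
from datetime import date

def upsert_silver(silver: dict, event: dict) -> None:
    """Silver strategy: MERGE on business key, so a late row simply upserts."""
    silver[event["booking_id"]] = event

def affected_windows(late_batch: list) -> set:
    """Gold strategy: find which aggregation windows a late batch touches."""
    return {e["booking_date"] for e in late_batch}

def daily_revenue(silver: dict, day: date) -> float:
    """Recompute one Gold window from Silver."""
    return sum(e["amount"] for e in silver.values() if e["booking_date"] == day)

# A March 1 booking arrives on March 5: upsert it, then recompute only March 1
silver = {}
late = {"booking_id": "B1", "booking_date": date(2024, 3, 1), "amount": 100.0}
upsert_silver(silver, late)
recomputed = {d: daily_revenue(silver, d) for d in affected_windows([late])}
```

The key point the sketch shows: only the windows in `affected_windows` get reprocessed, not the whole Gold table.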

Q9: What metadata should Bronze have?

5 essential columns:

  1. _ingested_at — when data was loaded (TIMESTAMP)
  2. _source_file — which file it came from (STRING)
  3. _batch_id — which batch run loaded it (STRING)
  4. _source_system — which system sent it (STRING)
  5. _raw_payload — original data if JSON (for debugging)
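A minimal Python sketch of wrapping one raw record with these five columns. In a real pipeline they'd be added with PySpark `withColumn`; this toy version just shows the shape:

```python
import json
import uuid
from datetime import datetime, timezone

def to_bronze(raw_line: str, source_file: str, source_system: str, batch_id: str) -> dict:
    """Wrap one raw record with the five Bronze metadata columns."""
    return {
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source_file": source_file,
        "_batch_id": batch_id,
        "_source_system": source_system,
        "_raw_payload": raw_line,       # original payload kept for debugging
        **json.loads(raw_line),         # parsed fields alongside the metadata
    }

row = to_bronze('{"booking_id": "B1"}', "bookings_001.json",
                "oracle_bookings", str(uuid.uuid4()))
```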

Q10: Medallion vs Data Mesh — what's the difference?

  • Medallion = centralized pipeline (one team owns Bronze→Silver→Gold)
  • Data Mesh = decentralized (each domain team owns their own Bronze→Silver→Gold)
  • Can combine both: each domain team follows Medallion within their domain

SECTION 2: SCD (SLOWLY CHANGING DIMENSIONS)

🧠 Memory Map: SCD Types

🧠 SCD = "How do you handle dimension changes over time?"
Think: A PASSENGER CHANGES THEIR ADDRESS
SCD Type 0 = "Never change" → Ignore the update (keep original forever)
SCD Type 1 = "Overwrite" → Replace old address with new (no history)
SCD Type 2 = "Add new row" → Keep old row + add new row (FULL history) ⭐
SCD Type 3 = "Add new column" → Add previous_address column (limited history)
Remember: "0-N-O-C" = Zero change, No history, Old+new rows, Column added
⚡ INTERVIEW STAR: SCD TYPE 2 = Most asked! Know the code!
SCD Type 2 columns:
  booking_id → business key (never changes)
  passenger_name → can change
  address → can change
  start_date → when this version became active
  end_date → when this version was replaced (NULL = current)
  is_current → TRUE for latest version, FALSE for old

⚡ MUST KNOW DIRECT QUESTIONS

Q11: What is SCD?

Slowly Changing Dimension — how you handle changes to dimension tables over time. Example: a passenger changes their address — do you overwrite, keep history, or ignore?

Q12: What is SCD Type 1?

Overwrite the old value with the new value. No history kept. Simple but you lose what the old value was.

Q13: What is SCD Type 2?

Add a new row for each change. Keep the old row with end_date and is_current = false. Full history preserved. Most common in data warehousing.
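The Type 2 mechanics (close the current row, append a new one) can be simulated in plain Python. This is a sketch of the logic only, not the Delta MERGE itself:

```python
from datetime import date

def scd2_apply(dim: list, key: str, new_row: dict, today: date) -> None:
    """SCD Type 2: close the current version, then append the new version."""
    for row in dim:
        if row[key] == new_row[key] and row["is_current"]:
            row["end_date"] = today        # close the old version
            row["is_current"] = False
    dim.append({**new_row, "start_date": today,
                "end_date": None, "is_current": True})

dim = [{"booking_id": "B1", "address": "Old St", "start_date": date(2024, 1, 1),
        "end_date": None, "is_current": True}]
scd2_apply(dim, "booking_id",
           {"booking_id": "B1", "address": "New Ave"}, date(2024, 3, 1))
```

After the call, the table holds two rows for B1: the closed historical row and the new current one, which is exactly what the MERGE in Q16 produces.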

Q14: What is SCD Type 3?

Add a column for the previous value (e.g., current_address + previous_address). Limited history — only tracks the last change.

⚠️ Q15: Why is SCD Type 2 the hardest to implement with MERGE?

Because for ONE changed record, you need TWO operations:

  1. UPDATE the old row (close it: set end_date, is_current = false)
  2. INSERT a new row (open it: start_date = today, is_current = true)
Standard MERGE can't do both for the same key — that's why we use the merge_key trick.

🔑 MID-LEVEL QUESTIONS

Q16: Explain the merge_key trick for SCD Type 2 (MOST IMPORTANT)

Problem: MERGE matches on booking_id. For a changed record, you need to UPDATE the old row AND INSERT a new row. But MERGE only does one action per matched key. Solution: Create a merge_key column in the staged data:

  • Row to UPDATE: merge_key = booking_id (matches target → WHEN MATCHED → UPDATE)
  • Row to INSERT: merge_key = NULL (never matches → WHEN NOT MATCHED → INSERT)
sql
-- Stage TWO rows per changed record, then MERGE on merge_key
MERGE INTO dim_passenger USING (
  -- Row 1: Close the old record (merge_key = booking_id → will MATCH)
  SELECT booking_id AS merge_key, booking_id, address,
         current_timestamp() AS end_date, false AS is_current
  FROM changes WHERE change_type = 'U'
  UNION ALL
  -- Row 2: Insert the new record (merge_key = NULL → will NOT MATCH → INSERT)
  SELECT NULL AS merge_key, booking_id, new_address AS address,
         CAST(NULL AS TIMESTAMP) AS end_date, true AS is_current
  FROM changes WHERE change_type = 'U'
) staged_changes
ON dim_passenger.booking_id = staged_changes.merge_key
   AND dim_passenger.is_current = true
WHEN MATCHED THEN
  UPDATE SET end_date = current_timestamp(), is_current = false
WHEN NOT MATCHED THEN
  INSERT (booking_id, address, start_date, end_date, is_current)
  VALUES (staged_changes.booking_id, staged_changes.address,
          current_timestamp(), NULL, true)

Q17: What is the EASY way to do SCD Type 2?

Use Lakeflow apply_changes — it handles SCD Type 2 automatically:

python
import dlt

dlt.create_streaming_table("dim_passenger")  # Target must be declared first

dlt.apply_changes(
  target="dim_passenger",       # Target streaming table
  source="raw_passengers",      # Source stream
  keys=["passenger_id"],        # Business key
  sequence_by="updated_at",     # How to order changes
  stored_as_scd_type=2          # SCD Type 2 automatically!
)
No merge_key trick needed — Lakeflow handles the UPDATE + INSERT logic internally.

SECTION 3: CDC (CHANGE DATA CAPTURE)

🧠 Memory Map: CDC Pipeline

🧠 CDC = "Capture every change from source database"
AMADEUS CDC PIPELINE
Oracle DB → Debezium → Kafka → Auto Loader → Delta Lake
  Oracle DB: Source (legacy booking system)
  Debezium: Reads Oracle's redo logs (no impact on DB performance)
  Kafka: Message bus (buffers + distributes changes)
  Auto Loader: Reads from Kafka landing zone in ADLS
  Delta Lake: Final destination (Bronze → Silver → Gold)
Remember: "O-D-K-A-D" = Oracle → Debezium → Kafka → Auto Loader → Delta
CDC message types:
  "c" = CREATE (new booking)
  "u" = UPDATE (booking changed)
  "d" = DELETE (booking cancelled)
  "r" = READ (initial snapshot/backfill)
Remember: "CUDR" (sounds like "could-er")
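A toy Python applier for these four ops. It sketches how Silver derives the current state from a stream of CDC events; the event shape loosely follows Debezium's op/after fields and is an assumption:

```python
def apply_cdc(state: dict, event: dict) -> None:
    """Fold one Debezium-style event into a key-to-row dict ('r' replays like 'c')."""
    op, key = event["op"], event["key"]
    if op in ("c", "r", "u"):
        state[key] = event["after"]     # create/snapshot/update: keep latest image
    elif op == "d":
        state.pop(key, None)            # delete: booking cancelled

state = {}
for ev in [{"op": "c", "key": "B1", "after": {"status": "booked"}},
           {"op": "u", "key": "B1", "after": {"status": "changed"}},
           {"op": "d", "key": "B1", "after": None}]:
    apply_cdc(state, ev)
# After create, update, delete: B1 is gone from current state
```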

⚡ MUST KNOW DIRECT QUESTIONS

Q18: What is CDC?

Change Data Capture — captures every INSERT, UPDATE, DELETE from a source database and sends it downstream in real-time or near-real-time.

Q19What is Debezium?

Open-source CDC connector that reads database transaction logs (Oracle redo logs, MySQL binlog) and streams changes to Kafka. Zero impact on source database performance.

Q20: Why use Debezium instead of direct queries?

  • Direct queries = poll the database → high load on source, miss deletes, miss intermediate changes
  • Debezium = reads transaction logs → zero load on source, captures ALL changes including deletes

⚠️ Q21: What is CDF (Change Data Feed)? How is it different from CDC?

CDC = capture changes FROM an external source (Oracle → Delta).
CDF = track changes WITHIN Delta Lake (Delta → downstream).
CDF is a Delta Lake feature that records which rows changed (insert/update/delete) in a Delta table. Downstream consumers can read ONLY the changes instead of the full table.

sql
-- Enable CDF on a table
ALTER TABLE bookings SET TBLPROPERTIES ('delta.enableChangeDataFeed' = true);
-- Read only changes since version 5
SELECT * FROM table_changes('bookings', 5);

Q22: What columns does CDF add?

3 columns automatically added:

  • _change_type — "insert", "update_preimage", "update_postimage", "delete"
  • _commit_version — which Delta version
  • _commit_timestamp — when the change happened

🔑 MID-LEVEL QUESTIONS

Q23: Design a CDC pipeline for Amadeus (Oracle → Delta Lake)

Oracle (bookings DB)
↓ Debezium reads redo logs
Kafka (topic: bookings.changes)
↓ Kafka Connect writes JSON to ADLS
ADLS Gen2 (/raw/cdc/bookings/)
↓ Auto Loader (cloudFiles) reads new files
Bronze Delta Table (raw CDC events — all c/u/d records)
↓ MERGE on booking_id + apply CDC logic
Silver Delta Table (current state of all bookings)
↓ Aggregate
Gold Delta Table (daily booking stats per route)

Q24: How do you handle CDC deletes in Silver?

Two approaches:

  1. Soft delete: Set is_deleted = true, keep the row (for audit trail)
  2. Hard delete: Actually delete from Silver (for GDPR compliance)
sql
MERGE INTO silver_bookings t USING bronze_cdc s
ON t.booking_id = s.booking_id
WHEN MATCHED AND s.op = 'd' THEN DELETE  -- Hard delete
WHEN MATCHED AND s.op = 'u' THEN UPDATE SET ...
WHEN NOT MATCHED AND s.op = 'c' THEN INSERT ...

Q25: What if CDC events arrive out of order?

Use a sequence column (like updated_at timestamp or lsn — log sequence number). During MERGE, only apply the change if the incoming record is NEWER:

sql
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET ...
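The same guard in plain Python, with a dict standing in for the target table (a sketch, not PySpark):

```python
def merge_if_newer(target: dict, incoming: dict, key: str, seq: str) -> bool:
    """Apply an incoming CDC row only if its sequence value is newer."""
    current = target.get(incoming[key])
    if current is not None and current[seq] >= incoming[seq]:
        return False                    # stale event: skip it
    target[incoming[key]] = incoming
    return True

bookings = {}
merge_if_newer(bookings, {"booking_id": "B1", "updated_at": 5, "fare": 100},
               "booking_id", "updated_at")
# An older event (updated_at = 3) arrives late and must NOT overwrite
applied = merge_if_newer(bookings, {"booking_id": "B1", "updated_at": 3, "fare": 90},
                         "booking_id", "updated_at")
```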

SECTION 4: AUTO LOADER

🧠 Memory Map: Auto Loader

📐 Architecture Diagram
AUTO LOADER = "Automatically ingest new files as they arrive"

cloudFiles = the Spark source name for Auto Loader

TWO MODES:
    Directory Listing → Scans folder for new files (simple, small folders)
    File Notification → Azure Event Grid notifies when new file arrives (scalable, huge folders)

    ┌─────────────────────────────────────────────┐
    │ < 10,000 files/day → Directory Listing      │
    │ > 10,000 files/day → File Notification      │
    └─────────────────────────────────────────────┘

SCHEMA EVOLUTION:
    New column appears → Auto Loader adds it automatically
    Unexpected data → Goes to _rescued_data column (not lost!)

Remember: "Auto Loader = cloudFiles + Directory or Notification + Schema rescue"

AUTO LOADER vs COPY INTO:
    Auto Loader = streaming (always on, processes new files automatically)
    COPY INTO   = batch (run manually, idempotent — safe to re-run)

⚡ MUST KNOW DIRECT QUESTIONS

Q26: What is Auto Loader?

A Databricks feature that automatically ingests new files from cloud storage (ADLS Gen2) as they arrive. Uses cloudFiles as the Spark source format.

Q27: What are the two modes of Auto Loader?

  1. Directory listing — scans folder periodically (good for <10K files/day)
  2. File notification — Azure Event Grid sends notifications (good for >10K files/day, more efficient)

Q28: What is _rescued_data?

A special column where Auto Loader puts data that doesn't match the expected schema — instead of failing, unexpected fields are saved in this column for debugging.
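The rescue behavior can be mimicked in plain Python. Auto Loader does this internally; here EXPECTED is an assumed schema and the function just shows the idea of "never lose a field":

```python
import json

EXPECTED = {"booking_id", "amount"}   # assumed target schema for this sketch

def parse_with_rescue(raw: str) -> dict:
    """Keep expected fields as columns; stash anything else in _rescued_data."""
    record = json.loads(raw)
    row = {k: record[k] for k in EXPECTED if k in record}
    extras = {k: v for k, v in record.items() if k not in EXPECTED}
    row["_rescued_data"] = json.dumps(extras) if extras else None
    return row

row = parse_with_rescue('{"booking_id": "B1", "amount": 100, "new_col": "surprise"}')
```

Instead of failing on `new_col`, the unexpected field survives in `_rescued_data` for later inspection.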

Q29: How does Auto Loader handle schema evolution?

Set cloudFiles.schemaEvolutionMode = "addNewColumns" — when source adds new columns, Auto Loader auto-adds them to the target table. Pipeline restarts to pick up the new schema.

⚠️ Q30: Auto Loader vs COPY INTO — when to use which?
💡 Interview Tip
                   Auto Loader                             COPY INTO
Type               Streaming (always on)                   Batch (run manually)
Scalability        Millions of files                       Thousands of files
Schema evolution   Auto-detects                            No auto-detection
File tracking      Checkpoint (remembers processed files)  Idempotent (safe to re-run)
Use when           Continuous ingestion                    One-time or scheduled loads

SECTION 5: LAKEFLOW DECLARATIVE PIPELINES (formerly DLT)

🧠 Memory Map: Lakeflow

📐 Architecture Diagram
LAKEFLOW = "Declarative ETL — you say WHAT, it does HOW"

Old name: DLT (Delta Live Tables) → renamed to "Lakeflow Declarative Pipelines" (2025)

TWO TABLE TYPES:
    Streaming Table = append-only, processes new data only (like a river)
    Materialized View = recomputed from scratch when data changes (like a snapshot)

    ┌───────────────────────────────────────────────┐
    │ Append-only data (logs, events) → Streaming   │
    │ Aggregations, lookups → Materialized View     │
    └───────────────────────────────────────────────┘

DATA QUALITY = EXPECTATIONS (3 levels):
    EXPECT        → warn only (log bad rows, keep them)
    EXPECT OR DROP → silently drop bad rows
    EXPECT OR FAIL → stop pipeline on bad data

Remember: "WDF" = Warn, Drop, Fail
    W — expect (Warn)
    D — expect or DROP
    F — expect or FAIL

PIPELINE MODES:
    TRIGGERED   = run once, then stop (batch — for cost savings)
    CONTINUOUS  = always running (streaming — for low latency)

⚡ MUST KNOW DIRECT QUESTIONS

Q31: What is Lakeflow (DLT)?

A declarative framework for building ETL pipelines. You define WHAT your tables should look like (SQL/Python), and Lakeflow figures out HOW to execute, manage dependencies, handle retries, and enforce data quality.

Q32: What was DLT renamed to?

Lakeflow Declarative Pipelines (2025). The product is the same, just rebranded.

Q33: What is a Streaming Table?

An append-only table that processes only NEW data (incremental). Like a river — new water flows through, old water is already downstream. Best for: logs, events, CDC streams.

Q34: What is a Materialized View?

A table that is fully recomputed when underlying data changes. Like a fresh photo of your data. Best for: aggregations, star schema, Gold layer tables.

Q35: What are the 3 levels of Expectations?

  1. EXPECT — warn only, bad rows kept (log for monitoring)
  2. EXPECT OR DROP — silently remove bad rows
  3. EXPECT OR FAIL — stop the entire pipeline
Amadeus: "For booking amounts, we use EXPECT OR FAIL — a negative fare means corrupted data, stop immediately."
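The three semantics in one toy Python function. This sketches the behavior only, not the Lakeflow API:

```python
def run_expectation(rows: list, predicate, mode: str):
    """Evaluate one expectation; return (kept_rows, violation_count)."""
    good = [r for r in rows if predicate(r)]
    bad = [r for r in rows if not predicate(r)]
    if mode == "expect":
        return rows, len(bad)           # Warn: keep everything, log the count
    if mode == "expect_or_drop":
        return good, len(bad)           # Drop: silently remove bad rows
    if mode == "expect_or_fail":
        if bad:
            raise ValueError(f"{len(bad)} rows violated expectation")  # Stop pipeline
        return rows, 0
    raise ValueError(f"unknown mode: {mode}")

rows = [{"amount": 100}, {"amount": -5}]
valid = lambda r: r["amount"] > 0
kept_warn, violations = run_expectation(rows, valid, "expect")
kept_drop, _ = run_expectation(rows, valid, "expect_or_drop")
```

`expect_or_fail` on the same rows would raise, which is the "negative fare, stop immediately" case from the Amadeus example.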

Q36: What is apply_changes?

A Lakeflow function that automatically handles CDC/SCD logic. Give it a source stream, target table, keys, and SCD type — it does the MERGE internally.

Q37: TRIGGERED vs CONTINUOUS mode?

  • TRIGGERED: Runs once, processes all available data, stops. (cheaper, for batch)
  • CONTINUOUS: Runs forever, processes data as it arrives. (expensive, for real-time)

🔑 MID-LEVEL QUESTIONS

Q38: Streaming Table vs Materialized View — when to use which?

                          Streaming Table                 Materialized View
Processing                Only NEW data (incremental)     Recomputes ALL data
Good for                  High-volume append (CDC, logs)  Aggregations, joins, Gold layer
Can UPDATE existing rows? No (append-only)                Yes (full recompute)
Cost                      Low (reads only new data)       Higher (reprocesses everything)
Example                   raw_bookings Bronze             daily_revenue Gold
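The cost difference in miniature (plain Python, not Spark): the streaming table folds only each new batch into running totals, while the materialized view rescans every row on refresh. Both end at the same answer:

```python
class StreamingTable:
    """Incremental: folds only NEW rows into running totals (append-only)."""
    def __init__(self):
        self.totals = {}
    def process_new(self, new_rows: list) -> None:
        for r in new_rows:              # touches only the new batch
            self.totals[r["date"]] = self.totals.get(r["date"], 0) + r["amount"]

def materialized_view(all_rows: list) -> dict:
    """Full recompute: rebuilds the aggregate from ALL rows every refresh."""
    totals = {}
    for r in all_rows:
        totals[r["date"]] = totals.get(r["date"], 0) + r["amount"]
    return totals

day1 = [{"date": "d1", "amount": 10}]
day2 = [{"date": "d1", "amount": 5}, {"date": "d2", "amount": 7}]
st = StreamingTable()
st.process_new(day1)
st.process_new(day2)                    # incremental: only day2 rows read
mv = materialized_view(day1 + day2)     # recompute: all rows read again
```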

Q39: How do Expectations work with the Quarantine pattern?

Bad rows are not just dropped — they're sent to a quarantine table for investigation:

python
import dlt

@dlt.table
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_bookings():
    return spark.readStream.table("bronze_bookings")

@dlt.table
def quarantine_bookings():  # Capture what was dropped
    return spark.readStream.table("bronze_bookings").filter("amount <= 0")

Q40: How do you build a complete Lakeflow pipeline?

python
import dlt
from pyspark.sql import functions as F

# Bronze: Auto Loader ingests raw files
@dlt.table
def bronze_bookings():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/bookings/"))

# Silver: Clean + validate
@dlt.table
@dlt.expect_or_drop("valid_id", "booking_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_bookings():
    return (dlt.read_stream("bronze_bookings")
            .select("booking_id", "passenger_id", "amount", "booking_date"))

# Gold: Aggregate
@dlt.table
def gold_daily_revenue():
    return (dlt.read("silver_bookings")
            .groupBy("booking_date")
            .agg(F.sum("amount").alias("total_revenue")))

SECTION 6: PIPELINE SCENARIOS

🔑 MID-LEVEL QUESTIONS

Q41: How would you design a booking pipeline for Amadeus?

🧠 Memory Map
Sources:
  Oracle (bookings) → Debezium → Kafka → ADLS landing zone
  Flight APIs (JSON) → Azure Event Hubs → ADLS landing zone
  Loyalty system (CSV) → SFTP → ADLS landing zone
Bronze (Auto Loader):
  cloudFiles reads from 3 landing zones → 3 Bronze tables
  Schema: raw + _ingested_at + _source_file
Silver (Lakeflow + MERGE):
  Deduplicate on booking_id
  Apply SCD Type 2 for passenger dimensions
  Tag PII columns (Unity Catalog)
  Quality: EXPECT OR FAIL for null booking_id
Gold (Materialized Views):
  daily_bookings_by_route (Power BI dashboard)
  passenger_360 (customer analytics)
  revenue_per_airline (partner reporting)
⚠️ Q42: MERGE takes 3 hours on a fact table. How do you fix it?

5-step fix:

  1. Liquid Clustering on merge key (biggest win — scans only relevant files)
  2. Filter source to only today's changes (don't MERGE entire history)
  3. OPTIMIZE target table first (compact small files)
  4. Enable low shuffle merge: spark.databricks.delta.merge.enableLowShuffle = true
  5. Check for data skew: one key has millions of rows → use salting
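Salting (step 5) in miniature, in plain Python: a stable hash of a secondary column spreads one hot key across a fixed number of buckets. The same idea applies to a Spark merge/join key; the column names are assumptions:

```python
from zlib import crc32

NUM_SALTS = 4   # number of buckets to spread the hot key across

def salt_key(hot_key: str, secondary: str) -> str:
    """Derive a deterministic salted key from a stable secondary column."""
    return f"{hot_key}#{crc32(secondary.encode()) % NUM_SALTS}"

a = salt_key("HOT_ROUTE", "booking-001")
b = salt_key("HOT_ROUTE", "booking-001")   # same inputs, same bucket
c = salt_key("HOT_ROUTE", "booking-002")   # may land in a different bucket
```

Deterministic salting (crc32 rather than random) matters for MERGE: the same row must always map to the same salted key, or re-runs would scatter updates.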

Q43: How do you build a Data Quality Framework?

4 layers:

  1. Schema validation (Bronze) — reject wrong formats at ingestion
  2. Row-level rules (Silver) — Lakeflow Expectations (NULL checks, range checks)
  3. Cross-table rules (Gold) — referential integrity (every booking has a valid passenger)
  4. Statistical monitoring — Lakehouse Monitoring detects drift (column mean shifted 20%)
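Layer 4 in miniature: a plain-Python mean-shift check. The 20% threshold is the example figure from above; in practice Lakehouse Monitoring computes this for you:

```python
def mean_drift(baseline: list, current: list, threshold: float = 0.20) -> bool:
    """Flag drift when the column mean shifts by more than `threshold` (relative)."""
    base = sum(baseline) / len(baseline)
    cur = sum(current) / len(current)
    return abs(cur - base) / abs(base) > threshold

drifted = mean_drift([100.0] * 10, [130.0] * 10)   # 30% shift: flag it
stable = mean_drift([100.0] * 10, [105.0] * 10)    # 5% shift: fine
```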

🧠 FINAL REVISION — Day 2 Summary Card

📐 Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│                    DAY 2: ETL PIPELINES                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  MEDALLION = Bronze(raw) → Silver(clean) → Gold(business)   │
│  Bronze: append-only + metadata (_ingested_at, _source)     │
│  Silver: MERGE + dedup + PII tagging                        │
│  Gold: aggregations for BI + ML                             │
│                                                             │
│  SCD Type 1 = Overwrite (no history)                        │
│  SCD Type 2 = Add new row (full history) ⭐                 │
│  merge_key trick: NULL key → NOT MATCHED → INSERT new row   │
│  Easy way: Lakeflow apply_changes(stored_as_scd_type=2)     │
│                                                             │
│  CDC = Oracle → Debezium → Kafka → Auto Loader → Delta      │
│  CDC ops: c=create, u=update, d=delete, r=read              │
│  CDF ≠ CDC! CDF = Delta's internal change tracking          │
│                                                             │
│  Auto Loader = cloudFiles source                            │
│  Two modes: Directory Listing (<10K) / Notification (>10K)  │
│  Schema evolution: addNewColumns + _rescued_data             │
│  Auto Loader (streaming) vs COPY INTO (batch)               │
│                                                             │
│  Lakeflow (was DLT):                                        │
│  Streaming Table = append-only, incremental                 │
│  Materialized View = full recompute                         │
│  Expectations: Warn / Drop / Fail ("WDF")                   │
│  apply_changes = auto SCD Type 2                            │
│  Modes: TRIGGERED (batch) vs CONTINUOUS (streaming)         │
│                                                             │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                          │
│  1. "Medallion with Bronze append-only + Silver MERGE"      │
│  2. "SCD Type 2 with merge_key trick or apply_changes"      │
│  3. "CDC: Debezium reads redo logs, zero source impact"     │
│  4. "Auto Loader with notification mode for scale"          │
│  5. "Lakeflow Expectations: Warn/Drop/Fail for quality"     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Study tip: Read this file in three passes:
  1. First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
  2. Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
  3. Before interview (15 min): Read ONLY the Final Revision Summary Card