Databricks · Section 11 of 17

Day 2: ETL Patterns & Pipeline Design (Azure Databricks)
💡 Interview Tip
Time: 6-7 hours | Priority: HIGH — ETL design is 25-30% of Databricks interviews
Amadeus Context: Oracle CDC → Kafka → Delta Lake, flight booking pipelines, fare pricing updates
Approach: Every topic starts with a simple explanation, then goes to interview-level depth

SECTION 1: MEDALLION ARCHITECTURE (1 hour)

Q1: What is Medallion Architecture?

Simple Explanation: Medallion Architecture is a way to organize your data into 3 layers (like 3 levels of cleaning). Raw data comes in, gets cleaned step by step, and ends up ready for business reports and ML.

Think of it like cooking:

  • Bronze = Raw ingredients from the market (dirty, uncut, as-is)
  • Silver = Washed, cut, measured ingredients (clean, validated, ready to cook)
  • Gold = Final dish served to the customer (aggregated, formatted for business use)

Why do we need it? Without layers, you'd have a mess — raw data mixed with cleaned data, no one knows what's trustworthy. Medallion gives you a clear pipeline: raw → clean → business-ready.

BRONZE (Raw Layer)

  • What goes in: raw data exactly as received from the source; no transformations; just append, never update
  • Who reads it: only data engineers (for debugging)
  • Metadata added: _ingested_at (when), _source_file (from where), _batch_id (which batch)

SILVER (Clean Layer)

  • What goes in: deduplicated data; validated (nulls removed); data types fixed; business keys applied
  • Who reads it: data engineers + data scientists
  • Transformations: MERGE (upsert), SCD Type 2 for dimensions, PII masking/tagging

GOLD (Business Layer)

  • What goes in: aggregated metrics; star schema for BI; pre-computed KPIs; wide tables for ML
  • Who reads it: business analysts, BI tools (Power BI), ML models
  • Optimized for: fast queries, self-service analytics, dashboard performance

Design decisions per layer (Amadeus example):

| Decision | Bronze | Silver | Gold |
|---|---|---|---|
| Format | Delta (store raw JSON/CSV as Delta) | Delta | Delta |
| Clustering | By ingestion date | By business key (booking_id) | By query pattern (airport, date) |
| Quality checks | Schema validation only | Null checks, range checks, dedup | Business rule validation |
| Retention | 90 days raw | 7 years (regulatory for travel) | Latest + 2 years |
| Who reads | Data engineers only | Engineers + scientists | Analysts + BI + ML |
| Update pattern | Append only (never update) | MERGE (upsert) | Overwrite partition or MERGE |
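
The Bronze metadata columns (_ingested_at, _source_file, _batch_id) are typically stamped onto every record at ingestion time. Here is a minimal pure-Python sketch of that enrichment step; in a real pipeline you would add these with PySpark's `withColumn`, and the function name `enrich_bronze_record` is illustrative:

```python
from datetime import datetime, timezone

def enrich_bronze_record(record: dict, source_file: str, batch_id: str) -> dict:
    """Attach Bronze-layer metadata without touching the raw payload."""
    enriched = dict(record)  # copy: Bronze never mutates the raw event
    enriched["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    enriched["_source_file"] = source_file
    enriched["_batch_id"] = batch_id
    return enriched

raw = {"booking_id": "BK-1", "status": "CONFIRMED"}
bronze_row = enrich_bronze_record(raw, "bookings/2026-03-01.json", "batch-042")
```

The raw payload stays untouched, so Bronze remains a faithful copy of the source; the metadata rides alongside it for debugging and replay.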

Interview tip: Always explain Medallion with a real example: "In our booking pipeline, Bronze stores raw airline JSON as-is, Silver cleans and deduplicates bookings with MERGE, and Gold has daily revenue aggregations that Power BI reads."

Q2: When would you NOT follow Medallion Architecture?

Simple Explanation: Medallion is a guideline, not a strict rule. Sometimes you skip layers or add extra layers:

  1. Real-time dashboards: Data is already clean → read directly from Silver (skip Gold)
  2. Simple data: Source is already clean → Bronze directly to Gold (skip Silver)
  3. ML features: Add a "Feature" layer between Silver and Gold for ML feature tables
  4. Data Mesh: Each team (bookings team, flights team) owns their own Bronze→Silver→Gold
  5. Reverse ETL: Gold data pushed back to external systems (CRM, marketing tools)

Interview tip: When they ask "When would you deviate?", say: "It depends on the use case. For our real-time flight status dashboard, we skip Gold and read directly from Silver with materialized views — adding Gold would add unnecessary latency."

Q3: How do you handle late-arriving data?

Simple Explanation: In real life, data doesn't always arrive on time. A booking created on March 1 might reach your pipeline on March 5 (airline system was down, or batch was delayed). This is called late-arriving data. Each layer handles it differently.

Amadeus scenario: An airline's system was offline for 4 days. When it comes back, it sends 4 days of bookings at once.

```sql
-- BRONZE: No problem! Bronze is append-only.
-- Late data just gets appended with _ingested_at = March 5 (when it actually arrived).
-- We keep both the actual booking_date (March 1) and _ingested_at (March 5).

-- SILVER: MERGE handles late data naturally.
MERGE INTO silver.bookings t              -- t = existing clean bookings
USING bronze.bookings_new s               -- s = new data (includes late-arriving)
ON t.booking_id = s.booking_id            -- Match by booking ID

WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET *                          -- Update only if the new data is newer;
                                          -- late data with an older updated_at
                                          -- won't overwrite newer records

WHEN NOT MATCHED THEN
    INSERT *                              -- Brand-new bookings get inserted;
                                          -- late bookings never seen before get added
```

```python
# GOLD: Reprocess the affected date partitions.
# Since late data for March 1 arrived on March 5,
# we need to recompute Gold aggregations for March 1-5.
# replaceWhere = "only overwrite rows matching this predicate, leave everything else untouched"
df.write.format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "booking_date >= '2026-03-01'") \
    .saveAsTable("gold.daily_bookings")
```

SECTION 2: SCD IMPLEMENTATIONS (1.5 hours)

Q4: What is SCD (Slowly Changing Dimension)?

Simple Explanation: In any database, some data changes over time. A passenger's address changes when they move. A loyalty tier goes from Silver to Gold. The question is: how do you handle these changes?

SCD = Slowly Changing Dimension — it's a set of techniques for handling data that changes over time in your dimension tables (like passenger, product, customer tables).

Real-world analogy: A phone contact changes their number. Do you:

  • Type 1: Just overwrite the old number (you lose the old number forever)
  • Type 2: Keep both numbers with dates ("old number: until March 2026, new number: from March 2026")
  • Type 3: Add a column "previous_number" (keeps only ONE previous value)

| Type | What It Does | Do You Keep History? | Amadeus Example |
|---|---|---|---|
| Type 0 | Never update — data is static | No | Airport codes (BLR, DEL, JFK never change) |
| Type 1 | Overwrite old value with new | No — old value is lost | Correcting a typo in passenger name |
| Type 2 | Add new row, close old row | Full history | Passenger changes address → keep both old and new |
| Type 3 | Add "previous value" column | Only 1 previous | Loyalty tier: current=Gold, previous=Silver |
| Type 6 | Hybrid of 1+2+3 | Full + current + previous | Complex regulatory dimensions |
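
Type 1 and Type 3 are simple enough to show directly. A plain-Python sketch of both update styles (the loyalty-tier example from the table; `scd_type1`/`scd_type3` are illustrative names, and in practice these are SQL UPDATEs or MERGEs):

```python
def scd_type1(row: dict, new_value: str) -> dict:
    """Type 1: overwrite in place. The old value is lost forever."""
    updated = dict(row)
    updated["loyalty_tier"] = new_value
    return updated

def scd_type3(row: dict, new_value: str) -> dict:
    """Type 3: keep exactly ONE previous value in a dedicated column."""
    updated = dict(row)
    updated["previous_loyalty_tier"] = updated["loyalty_tier"]  # remember old
    updated["loyalty_tier"] = new_value                         # set new
    return updated

row = {"passenger_id": "PAX-100", "loyalty_tier": "Silver"}
t1 = scd_type1(row, "Gold")  # only the new tier survives
t3 = scd_type3(row, "Gold")  # previous_loyalty_tier = "Silver" is kept
```

Type 2, which keeps the full history as extra rows, is the subject of the next question.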

Interview tip: SCD Type 2 is the #1 most asked coding question in Databricks interviews. Know it by heart.

Q5: SCD Type 2 — FULL IMPLEMENTATION (The #1 most asked question)

Simple Explanation: SCD Type 2 keeps full history of changes. When a passenger changes their address, we DON'T overwrite the old row. Instead, we:

  1. Mark the old row as "no longer current" (set is_current = FALSE, set effective_end = now)
  2. Insert a NEW row with the new address (set is_current = TRUE, set effective_end = 9999-12-31)

This way, you can always answer: "What was this passenger's address on January 15, 2025?"

Example: Passenger moves from Bangalore to Mumbai:

Before (one current row):

| surrogate_key | passenger_id | address | is_current | effective_start | effective_end |
|---|---|---|---|---|---|
| 1 | PAX-100 | Bangalore | TRUE | 2024-01-01 | 9999-12-31 |

After the move (old row closed, new row current):

| surrogate_key | passenger_id | address | is_current | effective_start | effective_end |
|---|---|---|---|---|---|
| 1 | PAX-100 | Bangalore | FALSE | 2024-01-01 | 2026-03-24 |
| 2 | PAX-100 | Mumbai | TRUE | 2026-03-24 | 9999-12-31 |
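
With those two versions in place, the point-in-time question is answered by filtering on the effective window. A plain-Python sketch of that lookup over the rows above (in SQL this is a WHERE clause on effective_start/effective_end; `address_as_of` is an illustrative name):

```python
from datetime import date

# The two versions of PAX-100 from the table above
dim_passenger = [
    {"passenger_id": "PAX-100", "address": "Bangalore", "is_current": False,
     "effective_start": date(2024, 1, 1), "effective_end": date(2026, 3, 24)},
    {"passenger_id": "PAX-100", "address": "Mumbai", "is_current": True,
     "effective_start": date(2026, 3, 24), "effective_end": date(9999, 12, 31)},
]

def address_as_of(rows, passenger_id, as_of):
    """Return the address whose effective window covers as_of (half-open interval)."""
    for r in rows:
        if (r["passenger_id"] == passenger_id
                and r["effective_start"] <= as_of < r["effective_end"]):
            return r["address"]
    return None
```

On 2025-01-15 the lookup returns Bangalore; on any date from 2026-03-24 onward it returns Mumbai.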

Table structure:

```sql
CREATE TABLE dim_passenger (
    surrogate_key BIGINT GENERATED ALWAYS AS IDENTITY,
    -- surrogate_key: Auto-generated unique ID for each row (not the business key)
    -- Why? Because passenger_id can have multiple rows (one per version)

    passenger_id STRING,           -- Business key (the real passenger ID from source system)
    full_name STRING,
    email STRING,
    address STRING,
    loyalty_tier STRING,

    hash_value STRING,             -- MD5 hash of all tracked columns (name + email + address + loyalty)
    -- Why hash? To quickly check if anything changed. Compare hash instead of comparing every column.

    effective_start TIMESTAMP,     -- When this version became active
    effective_end TIMESTAMP,       -- When this version was replaced (9999-12-31 = still current)
    is_current BOOLEAN             -- TRUE = this is the latest version of this passenger
) USING DELTA
CLUSTER BY (passenger_id);         -- Cluster for fast lookups by passenger_id
```
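
The hash_value comparison can be demonstrated outside SQL. Below is a Python equivalent of `MD5(CONCAT_WS('||', ...))` used for change detection; note this sketch does not skip None values the way SQL's CONCAT_WS skips NULLs:

```python
import hashlib

def row_hash(full_name: str, email: str, address: str, loyalty_tier: str) -> str:
    """One hash over all tracked columns, mimicking MD5(CONCAT_WS('||', ...))."""
    joined = "||".join([full_name, email, address, loyalty_tier])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

old = row_hash("Asha Rao", "asha@example.com", "Bangalore", "Silver")
new = row_hash("Asha Rao", "asha@example.com", "Mumbai", "Silver")
# address changed → hash changed → a new SCD Type 2 version is needed
```

Comparing one hash per row is much cheaper than comparing every tracked column pair by pair, which is exactly why the dimension stores hash_value.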

The Merge Key Trick — this is the clever part:

The challenge: In a single MERGE, we need to do TWO things:

  1. UPDATE the old row (close it: is_current = FALSE)
  2. INSERT the new row (new version: is_current = TRUE)

But MERGE's WHEN MATCHED only does updates, and WHEN NOT MATCHED only does inserts. How do we do both?

The trick: Create TWO rows for each changed passenger in the source:

  • Row 1: merge_key = passenger_id → This MATCHES the target → triggers UPDATE (closes old row)
  • Row 2: merge_key = NULL → NULL never matches anything → triggers NOT MATCHED → INSERT (new row)
```sql
-- Step 1: Prepare two rows per changed passenger
WITH staged_changes AS (

    -- === ROW TYPE A: INSERT new current version ===
    -- merge_key = NULL → will NOT match target → goes to WHEN NOT MATCHED → INSERT
    SELECT
        s.passenger_id,
        s.full_name,
        s.email,
        s.address,
        s.loyalty_tier,
        MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) AS hash_value,
        -- MD5 creates a hash of all tracked columns
        -- If any column changes, the hash changes → we know something changed
        NULL AS merge_key
        -- NULL merge_key = this row will NEVER match target
        -- So it goes to WHEN NOT MATCHED → INSERT as new current row
    FROM staging_passenger s
    JOIN dim_passenger t
        ON s.passenger_id = t.passenger_id    -- Find existing passenger
        AND t.is_current = TRUE                -- Only compare with current version
        AND MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) != t.hash_value
        -- Only include passengers whose data actually CHANGED (hash is different)

    UNION ALL

    -- === ROW TYPE B: CLOSE the old version ===
    -- merge_key = passenger_id → WILL match target → goes to WHEN MATCHED → UPDATE
    SELECT
        s.passenger_id,
        s.full_name,
        s.email,
        s.address,
        s.loyalty_tier,
        MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) AS hash_value,
        s.passenger_id AS merge_key
        -- Real passenger_id as merge_key = this row WILL match target
        -- So it goes to WHEN MATCHED → UPDATE (close the old row)
    FROM staging_passenger s
    JOIN dim_passenger t
        ON s.passenger_id = t.passenger_id
        AND t.is_current = TRUE
        AND MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) != t.hash_value

    UNION ALL

    -- === ROW TYPE C: Brand new passengers (never seen before) ===
    -- merge_key = NULL → WHEN NOT MATCHED → INSERT
    SELECT
        s.passenger_id,
        s.full_name,
        s.email,
        s.address,
        s.loyalty_tier,
        MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) AS hash_value,
        NULL AS merge_key              -- New passenger → insert
    FROM staging_passenger s
    LEFT JOIN dim_passenger t
        ON s.passenger_id = t.passenger_id AND t.is_current = TRUE
    WHERE t.passenger_id IS NULL       -- Not found in target → brand new passenger
)

-- Step 2: MERGE does both close + insert in ONE atomic operation
MERGE INTO dim_passenger AS target
USING staged_changes AS source
ON target.passenger_id = source.merge_key AND target.is_current = TRUE
-- Match condition uses merge_key (not passenger_id!)
-- Row Type B (merge_key = real ID) → matches → WHEN MATCHED
-- Row Type A & C (merge_key = NULL) → never matches → WHEN NOT MATCHED

-- Close the old record (Row Type B triggered this)
WHEN MATCHED THEN UPDATE SET
    target.is_current = FALSE,                    -- Mark as no longer current
    target.effective_end = current_timestamp()     -- Record when it was closed

-- Insert new current record (Row Type A and C triggered this)
WHEN NOT MATCHED THEN INSERT (
    passenger_id, full_name, email, address, loyalty_tier,
    hash_value, effective_start, effective_end, is_current
) VALUES (
    source.passenger_id, source.full_name, source.email, source.address,
    source.loyalty_tier, source.hash_value,
    current_timestamp(),                          -- This version starts NOW
    CAST('9999-12-31' AS TIMESTAMP),              -- Far future = "no end date yet"
    TRUE                                          -- This is the current version
);
```

Summary of the trick:

| Row Type | merge_key | What Happens | Action |
|---|---|---|---|
| A (changed passenger - new row) | NULL | Doesn't match target | INSERT new current version |
| B (changed passenger - close old) | Real ID | Matches target | UPDATE: close old row |
| C (brand new passenger) | NULL | Doesn't match target | INSERT first version |
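
The same three-row staging logic can be sketched in plain Python to make the merge_key trick concrete. This is a simplification of the SQL, checking only one column instead of a hash; `stage_changes` is an illustrative name:

```python
def stage_changes(staging: list, current_dim: list) -> list:
    """Produce the MERGE input: Types A and C get merge_key=None (→ INSERT),
    Type B gets merge_key=<business key> (→ UPDATE that closes the old row)."""
    current = {r["passenger_id"]: r for r in current_dim if r["is_current"]}
    staged = []
    for s in staging:
        t = current.get(s["passenger_id"])
        if t is None:
            staged.append({**s, "merge_key": None})               # Type C: brand new
        elif s["address"] != t["address"]:                        # simplified change check
            staged.append({**s, "merge_key": None})               # Type A: insert new version
            staged.append({**s, "merge_key": s["passenger_id"]})  # Type B: close old version
    return staged

dim = [{"passenger_id": "PAX-100", "address": "Bangalore", "is_current": True}]
stg = [{"passenger_id": "PAX-100", "address": "Mumbai"},   # changed → rows A + B
       {"passenger_id": "PAX-200", "address": "Delhi"}]    # new → row C
rows = stage_changes(stg, dim)  # 3 staged rows total
```

Unchanged passengers produce no staged rows at all, which is what the hash comparison achieves in the SQL version.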

Interview tip: Practice writing this from memory. Draw the "before/after" table on paper to visualize. The merge_key trick is the key insight — be ready to explain WHY we use NULL.

Q6: SCD Type 2 the EASY way — Lakeflow apply_changes

Simple Explanation: The merge_key trick above is complex. Databricks has a MUCH easier way to do SCD Type 2 using Lakeflow Declarative Pipelines (formerly DLT). One function call — apply_changes() — does everything automatically.

When to use which:

  • Merge key trick: When you need full control, or when not using Lakeflow
  • apply_changes: When using Lakeflow pipelines (recommended — simpler and less error-prone)
```python
import dlt  # This is the Lakeflow (formerly DLT) library

# Step 1: Define the source — where new passenger data comes from
@dlt.table                                   # @dlt.table = "this is a Lakeflow table"
def staging_passenger():
    # readStream = read data as a stream (continuously)
    return (spark.readStream
        .format("delta")                     # Source is a Delta table
        .table("bronze.passenger_events"))   # Read from the Bronze layer

# Step 2: Create the SCD Type 2 dimension table — just ONE function call!
dlt.create_streaming_table("dim_passenger")  # Create empty streaming table

dlt.apply_changes(
    target="dim_passenger",                # Where to write (target dimension table)
    source="staging_passenger",            # Where to read from (staging table above)
    keys=["passenger_id"],                 # Business key — how to identify a passenger
    sequence_by="updated_at",              # How to order events — latest wins
    stored_as_scd_type=2,                  # Tell Lakeflow: "use SCD Type 2 pattern"
    track_history_column_list=[            # Which columns to watch for changes
        "full_name", "email", "address", "loyalty_tier"
    ]
    # If any of these columns change, a new version is created
)

# That's it! Databricks automatically handles:
# ✅ surrogate_key generation
# ✅ effective_start and effective_end timestamps
# ✅ is_current flag
# ✅ Deduplication of source data
# ✅ Ordering by sequence_by column
# ✅ Closing old rows and inserting new rows
```

Interview tip: Mention BOTH approaches. "I can implement SCD Type 2 manually using the merge_key trick for full control, but in production I prefer apply_changes in Lakeflow because it's simpler and handles edge cases automatically."

SECTION 3: CHANGE DATA CAPTURE (CDC) (1 hour)

Q7: What is CDC? Design a pipeline: Oracle → Kafka → Delta Lake

Simple Explanation: CDC (Change Data Capture) means capturing every change (insert, update, delete) that happens in a source database and streaming those changes to your data platform.

Why do we need it? The old way: Run a batch job every night that copies the ENTIRE table from Oracle to your data lake (full extract). This is slow, expensive, and data is always 12-24 hours stale.

The CDC way: Capture only the CHANGES as they happen in real-time. A new booking? Stream it. A cancellation? Stream it. Data arrives in minutes, not hours.

Real-world analogy: Instead of photocopying the entire phone book every day to check for changes, you set up a notification: "Tell me whenever someone changes their number." That's CDC.

How does Debezium work? (the most popular CDC tool) Debezium reads the Oracle redo logs (database's internal change diary). It doesn't query the database — so there's ZERO impact on Oracle performance. It produces a Kafka message for every insert, update, or delete.

📐 Architecture Diagram
End-to-end flow for Amadeus:

ORACLE (Legacy System)       KAFKA                    DATABRICKS
═══════════════════         ═════                    ═══════════
Bookings table          →   Debezium reads      →   BRONZE: Raw CDC events
  INSERT new booking        Oracle redo logs         (append every event as-is)
  UPDATE booking status     and sends to Kafka            │
  DELETE cancelled          topics                        ▼
                                                    SILVER: Apply changes
Passengers table        →   One Kafka topic     →     MERGE (for facts)
  UPDATE address            per source table           apply_changes (for dims)
  UPDATE loyalty tier                                      │
                                                          ▼
Flights table           →                        →   GOLD: Business metrics
  UPDATE schedule                                     Daily revenue, route stats
  UPDATE gate assignment                              ML feature tables

Architecture decisions explained:

1. SOURCE: Debezium Oracle Connector
  • Reads Oracle redo logs (NOT SQL queries → zero DB performance impact)
  • Captures every INSERT, UPDATE, DELETE with before/after values
  • Each message looks like:

    {
      "op": "u",                                    // Operation: c=create, u=update, d=delete
      "before": {"id": 1, "status": "CONFIRMED"},   // Old values
      "after":  {"id": 1, "status": "CHECKED_IN"},  // New values
      "ts_ms": 1711234567890                        // When the change happened in Oracle
    }

  • No code changes needed in Oracle — just configure the connector

2. KAFKA (Azure Event Hubs in Amadeus — managed Kafka service)
  • One topic per source table (bookings_cdc, passengers_cdc, flights_cdc)
  • Messages partitioned by primary key → ensures ordering per record
  • Retention: 7 days (if the pipeline fails, you can replay from Kafka)
  • Why Kafka? It decouples source from destination. Oracle doesn't need to know about Databricks.

3. BRONZE LAYER: Store raw CDC events as-is
  • Use Auto Loader or the Kafka connector to read from Event Hubs
  • Append every event — never update or delete in Bronze
  • Add metadata: _ingested_at, _kafka_offset, _kafka_partition
  • Keep the full CDC structure: {before, after, op, ts_ms}

4. SILVER LAYER: Apply the changes to create clean tables
  • Fact tables (bookings): use MERGE to apply inserts/updates/deletes
  • Dimension tables (passengers): use apply_changes for SCD Type 2
  • Use ts_ms to handle out-of-order events (latest timestamp wins)

5. GOLD LAYER: Business aggregations
  • Daily booking revenue by route
  • Airline performance metrics
  • Passenger 360 view for ML
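
The op codes in the Debezium message drive what the Silver layer does with each event. A plain-Python sketch of that routing (the event shape follows the sample message in step 1; `route_cdc_event` is an illustrative name, and in production this logic runs inside a Spark transformation):

```python
def route_cdc_event(event: dict) -> tuple:
    """Map a Debezium-style CDC event to a Silver-layer action and the row to apply."""
    op = event["op"]
    if op == "c":               # create → insert the 'after' image
        return ("insert", event["after"])
    if op == "u":               # update → apply the 'after' image
        return ("upsert", event["after"])
    if op == "d":               # delete → only 'before' is populated
        return ("delete", event["before"])
    raise ValueError(f"unknown op: {op}")

evt = {"op": "u",
       "before": {"id": 1, "status": "CONFIRMED"},
       "after":  {"id": 1, "status": "CHECKED_IN"},
       "ts_ms": 1711234567890}
action, row = route_cdc_event(evt)  # → ("upsert", {"id": 1, "status": "CHECKED_IN"})
```

In the MERGE that follows, the ts_ms field is compared against the target's timestamp so that an out-of-order older event never overwrites a newer row.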

Interview tip: This is a VERY likely question for Amadeus since they're migrating from Oracle. Be ready to draw this architecture and explain each component.

Q8: What is Change Data Feed (CDF)? How is it different from CDC?

Simple Explanation: People often confuse CDC and CDF. They sound similar but are different things:

  • CDC = Capturing changes from an external source system (Oracle → Kafka → Delta)
  • CDF = Tracking changes that happen inside a Delta table (Delta table feature)

CDF (Change Data Feed) is a Delta Lake feature. When enabled, Delta records what changed in a table — every insert, update, delete — so downstream consumers can read ONLY the changes instead of the full table.

Real-world analogy:

  • CDC = A reporter covering live events at the airport (captures external events)
  • CDF = The airport's flight status board history (records what changed inside the airport system)

Why is CDF useful? Without CDF: Gold layer reads the ENTIRE Silver table every day (slow, wasteful). With CDF: Gold layer reads only the rows that changed since last run (fast, efficient).

```sql
-- Step 1: Enable CDF on a table
ALTER TABLE bookings SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- After this, Delta starts tracking every change to this table

-- Step 2: Read changes between versions
SELECT * FROM table_changes('bookings', 5, 10);
-- Returns ONLY the rows that changed between version 5 and version 10
-- Much faster than reading the entire table!

-- Step 3: Read changes by timestamp
SELECT * FROM table_changes('bookings', '2026-03-01', '2026-03-15');
-- What changed in the bookings table during March 1-15?
```

```python
# Step 4: Streaming read of changes (for real-time downstream processing)
# This continuously reads new changes as they happen.
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 5) \
    .table("bookings")
```

CDF adds these extra columns to the output:

| Column | What It Tells You | Example Values |
|---|---|---|
| _change_type | What kind of change | insert, update_preimage (old value), update_postimage (new value), delete |
| _commit_version | Which Delta version | 6, 7, 8... |
| _commit_timestamp | When the change happened | 2026-03-15 10:30:00 |

Use cases:

  • Silver → Gold incremental: Gold layer reads only changed Silver rows (not the full table)
  • Audit trail: "What changes were made to this booking last week?"
  • Downstream sync: Push changes to another system (e.g., sync to a search index)
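
The Silver→Gold incremental pattern boils down to: keep only CDF rows newer than the last processed version, and skip the pre-images (old values are not applied downstream). A plain-Python sketch over rows shaped like the CDF output above; `changes_to_apply` is an illustrative name:

```python
def changes_to_apply(cdf_rows: list, last_version: int) -> list:
    """Keep inserts, update post-images, and deletes committed after last_version."""
    keep = ("insert", "update_postimage", "delete")
    return [r for r in cdf_rows
            if r["_commit_version"] > last_version
            and r["_change_type"] in keep]

cdf = [
    {"booking_id": "BK-1", "_change_type": "insert",           "_commit_version": 6},
    {"booking_id": "BK-1", "_change_type": "update_preimage",  "_commit_version": 7},
    {"booking_id": "BK-1", "_change_type": "update_postimage", "_commit_version": 7},
    {"booking_id": "BK-2", "_change_type": "insert",           "_commit_version": 5},
]
to_apply = changes_to_apply(cdf, last_version=5)
# BK-2's version-5 insert was already processed last run, and the
# pre-image is dropped, so only 2 rows remain to apply to Gold
```

This is the work the streaming CDF reader does for you: the checkpoint remembers last_version, and each micro-batch contains only the surviving change rows.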

SECTION 4: AUTO LOADER (45 min)

Q9: What is Auto Loader?

Simple Explanation: Auto Loader is a Databricks feature that automatically detects and loads new files from cloud storage (like ADLS Gen2). You point it at a folder, and whenever new files appear, Auto Loader picks them up and loads them into a Delta table.

Why do we need it? Without Auto Loader: You write custom code to check "are there new files?", track which files were already processed, handle failures, etc. Lots of boilerplate. With Auto Loader: Just point to a folder and say "load new files." It handles everything automatically — file tracking, exactly-once processing, schema detection.

Real-world analogy: Auto Loader is like a mailroom worker. New mail (files) arrives in the mailbox (ADLS folder). The worker automatically picks it up, stamps it as "processed", and delivers it to the right desk (Delta table). You don't need to keep checking the mailbox yourself.

Two modes:

| Mode | How It Finds New Files | Best For |
|---|---|---|
| Directory Listing (default) | Lists all files in the folder, compares with checkpoint to find new ones | Simple setups, <10K files/day |
| File Notification | Azure Event Grid sends a notification when a new file arrives | High volume (>10K files/day), lower latency |
```python
# Auto Loader example: load booking JSON files from ADLS Gen2 into a Bronze Delta table

# READING: Auto Loader detects and reads new files
df = (spark.readStream
    .format("cloudFiles")                      # "cloudFiles" = Auto Loader
    .option("cloudFiles.format", "json")       # Source files are JSON format
    .option("cloudFiles.schemaLocation",
            "/checkpoints/bookings/schema")    # Where to store the inferred schema
    .option("cloudFiles.inferColumnTypes",
            "true")                            # Auto-detect column types (int, string, etc.)
    # The ADLS Gen2 folder to watch for new files;
    # Auto Loader will ONLY process files it hasn't seen before
    .load("abfss://raw@storage.dfs.core.windows.net/bookings/"))

# WRITING: Write new data to the Bronze Delta table
(df.writeStream
    .format("delta")                           # Write as Delta format
    .option("checkpointLocation",
            "/checkpoints/bookings/")          # Checkpoint = "bookmark" tracking progress
    .option("mergeSchema", "true")             # If the schema changes, merge new columns in
    .trigger(availableNow=True)                # Process all available files RIGHT NOW, then stop
    .toTable("bronze.bookings"))               # Target Delta table name

# trigger options:
# - availableNow=True          → process all new files, then stop (for scheduled batch jobs)
# - processingTime="5 minutes" → check for new files every 5 minutes (for near-real-time)
# - once=True                  → process once and stop (one-time backfill)
```

Interview tip: Auto Loader is the recommended way to ingest files in Databricks. If asked "How would you ingest files?", always mention Auto Loader first.

Q10: How does Auto Loader handle schema changes?

Simple Explanation: In real life, source systems change their data format. An airline might add a new field "baggage_count" to their booking JSON. Auto Loader can handle this automatically — you choose how:

| Mode | What Happens When New Columns Appear | When to Use |
|---|---|---|
| addNewColumns (default) | New columns are added to the Delta table automatically | Dev/testing — flexible |
| rescue | New columns are stored in a special _rescued_data column as JSON | Production — safest |
| failOnNewColumns | Pipeline STOPS with an error — you must handle it manually | Strict schema control |
| none | New columns are silently ignored | When you don't care about new fields |
```python
# RECOMMENDED FOR PRODUCTION: rescue mode
# "rescue" = if a file has unexpected columns, put them in _rescued_data.
# Your pipeline NEVER breaks because of schema changes; you inspect
# _rescued_data later and decide what to do.
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema/")
    .load(path))

# After loading, check _rescued_data for unexpected columns:
# df.filter("_rescued_data IS NOT NULL").show()
# If you find valid new columns, update your schema to include them
```

Real-world scenario: Airline X starts sending "meal_preference" in their booking JSON. Without rescue mode, your pipeline might break. With rescue mode, "meal_preference" goes into _rescued_data, pipeline continues running, and you handle it when ready.
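
The inspect-and-decide step on _rescued_data can be shown in plain Python: the column holds a JSON string of whatever didn't fit the schema, and counting which fields appear tells you what is worth promoting. A sketch under the scenario above (`rescued_field_counts` is an illustrative name):

```python
import json
from collections import Counter

rows = [
    {"booking_id": "BK-1", "_rescued_data": None},                           # fit the schema
    {"booking_id": "BK-2", "_rescued_data": '{"meal_preference": "VEG"}'},   # unexpected field
    {"booking_id": "BK-3", "_rescued_data": '{"meal_preference": "NONVEG"}'},
]

def rescued_field_counts(rows: list) -> Counter:
    """Count which unexpected fields show up, to decide what to promote into the schema."""
    counts = Counter()
    for r in rows:
        if r["_rescued_data"]:
            counts.update(json.loads(r["_rescued_data"]).keys())
    return counts

counts = rescued_field_counts(rows)  # meal_preference appears twice → worth adding
```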

Q11: Auto Loader vs COPY INTO — when to use which?

Simple Explanation: Both load files into Delta tables. The difference is: Auto Loader is a streaming approach (continuously watches for new files), while COPY INTO is a batch SQL command (you run it manually or on a schedule).

| Aspect | Auto Loader | COPY INTO |
|---|---|---|
| Type | Streaming (continuous) | Batch (SQL command, run manually) |
| How it tracks files | Automatic checkpoint (never reprocesses) | Tracks via file metadata |
| Schema evolution | Full support (rescue mode, etc.) | Limited |
| Scale | Handles millions of files | Handles thousands of files |
| Exactly-once | Yes (checkpoint-based) | Yes (idempotent — safe to rerun) |
| When to use | Regular/frequent ingestion | One-time loads, ad-hoc imports |
```sql
-- COPY INTO example (simple one-off load):
COPY INTO bronze.bookings
FROM 'abfss://raw@storage.dfs.core.windows.net/bookings/'
FILEFORMAT = JSON
-- This loads all files from the folder. If you run it again, it skips already-loaded files.
```

Rule of thumb: Always prefer Auto Loader. Use COPY INTO only for simple, one-time, ad-hoc loads.

SECTION 5: LAKEFLOW DECLARATIVE PIPELINES (formerly DLT) (1 hour)

Q12: What is Lakeflow? (formerly called DLT / Delta Live Tables)

Simple Explanation: Lakeflow Declarative Pipelines is Databricks' managed ETL framework. Instead of writing complex pipeline code (read data, clean it, write it, handle errors, manage retries...), you just DECLARE what your tables should look like, and Databricks handles the rest.

Why "declarative"? Because you declare WHAT you want, not HOW to do it:

  • Regular code: "Read from this source, filter out nulls, join with this table, write to this target, handle errors..."
  • Declarative: "This table should have no nulls and should come from that source." Databricks figures out the HOW.

Name history (important for interviews!):

  • Before June 2025: Called "Delta Live Tables (DLT)"
  • After June 2025: Renamed to "Lakeflow Declarative Pipelines"
  • In Apache Spark: The core technology was contributed as "Spark Declarative Pipelines"
  • Use the new name in interviews to show you're up to date!

What Lakeflow gives you (that regular code doesn't):

  • Built-in data quality checks (expectations)
  • Automatic lineage tracking (see which table depends on which)
  • Managed infrastructure (serverless available — no cluster management)
  • Automatic recovery from failures
  • Easy SCD Type 2 with apply_changes (see Q6 above)

Two table types in Lakeflow:

| Type | What It Does | When to Use |
|---|---|---|
| Streaming Table | Processes ONLY new data (incremental) | Append-heavy data, CDC events, real-time |
| Materialized View | Recomputes the ENTIRE result each time | Aggregations, complex joins, Gold layer summaries |

Real-world analogy:

  • Streaming Table = A conveyor belt that processes only new items
  • Materialized View = A report that's regenerated from scratch each time you ask for it

Q13: What are Data Quality Expectations? (3 levels)

Simple Explanation: Expectations are data quality rules you attach to your Lakeflow tables. They check every row as it flows through your pipeline. Think of them as quality inspectors on an assembly line.

There are 3 levels depending on what should happen when bad data is found:

```python
import dlt  # Lakeflow library

@dlt.table  # This decorator says "this function defines a Lakeflow table"

# Level 1: EXPECT — log the bad row but KEEP it (monitoring only)
@dlt.expect("valid_booking_id", "booking_id IS NOT NULL")
# "If booking_id is null, log a warning but still keep the row"
# Use when: You want to MONITOR data quality without blocking the pipeline

# Level 2: EXPECT_OR_DROP — silently DROP bad rows
@dlt.expect_or_drop("valid_amount", "fare_amount > 0")
# "If fare_amount is 0 or negative, drop the row (don't include it in the table)"
# Use when: Bad rows are useless garbage — just filter them out

# Level 3: EXPECT_OR_FAIL — STOP the entire pipeline
@dlt.expect_or_fail("valid_currency", "currency IN ('USD','EUR','GBP','INR')")
# "If currency is not one of these, STOP everything. Something is very wrong."
# Use when: This is a CRITICAL data integrity issue — pipeline must not continue

def silver_bookings():
    return spark.readStream.table("bronze.bookings")
```

Summary:

| Expectation | Bad Data? | Pipeline | Real Use |
|---|---|---|---|
| @dlt.expect | Kept + logged | Continues | "Track how many bad rows we get" (monitoring) |
| @dlt.expect_or_drop | Dropped silently | Continues | "Remove known junk data" (filtering) |
| @dlt.expect_or_fail | N/A | STOPS | "Something critical is broken — investigate!" |

New in 2026: Expectations are now stored in Unity Catalog — version-controlled, auditable, and shareable across pipelines. You can define an expectation once and reuse it in multiple pipelines.

Interview tip: "In our Amadeus pipeline, we use expect at Bronze to monitor raw data quality, expect_or_drop at Silver to filter invalid bookings, and expect_or_fail at Gold for critical business rules like 'departure date must be after booking date.'"

Q14: What is the difference between Materialized View and Streaming Table?

Simple Explanation:

Streaming Table = Processes only NEW data. Like a factory assembly line — it only works on items that just arrived. It remembers what it already processed and never redoes old work.

Materialized View = Recomputes EVERYTHING from scratch. Like a daily summary report — it reads all the data and produces a fresh result each time. Slower, but always accurate.

python — editable
import dlt

# STREAMING TABLE — incremental processing (only new data)
@dlt.table
def silver_bookings():
    return spark.readStream.table("LIVE.bronze_bookings_raw")
    # readStream = streaming = only processes NEW rows since last run
    # Use for: Continuously adding new bookings as they arrive

# MATERIALIZED VIEW — full recompute (all data, every time)
@dlt.table
def gold_daily_revenue():
    return spark.sql("""
        SELECT
            booking_date,
            departure_airport,
            arrival_airport,
            SUM(fare_amount) AS total_revenue,     -- Sum all fares
            COUNT(*) AS booking_count               -- Count all bookings
        FROM LIVE.silver_bookings                   -- Read from Silver layer
        GROUP BY booking_date, departure_airport, arrival_airport
    """)
    # No readStream = batch = recomputes the entire aggregation each time
    # Use for: Gold aggregations that need to reflect all data including updates

| Aspect             | Streaming Table                           | Materialized View                |
|--------------------|-------------------------------------------|----------------------------------|
| Processing         | Only new/unprocessed data                 | Everything from scratch          |
| Speed              | Fast (less work each run)                 | Slower (reads all data)          |
| Source requirement | Must be a streaming source                | Any table or view                |
| Best for           | Bronze→Silver (append-heavy, CDC)         | Gold layer (aggregations, joins) |
| Handles updates?   | Appends only (unless using apply_changes) | Yes (recomputes all)             |

Q15: What is TRIGGERED vs CONTINUOUS pipeline mode?

Simple Explanation: When you create a Lakeflow pipeline, you choose how it runs:

| Mode       | What It Does                            | Analogy                              | When to Use                            |
|------------|-----------------------------------------|--------------------------------------|----------------------------------------|
| TRIGGERED  | Process all available data, then STOP   | Like a batch job — run, finish, done | Scheduled runs (daily at 6am, hourly)  |
| CONTINUOUS | Run FOREVER, process data as it arrives | Like a 24/7 factory line             | Real-time use cases (live dashboards)  |

Amadeus example:

  • Daily revenue report → TRIGGERED (run once a day at midnight)
  • Live flight status board → CONTINUOUS (update within seconds)
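In the pipeline's settings payload, this choice comes down to a single `continuous` flag. The sketch below shows the two configurations as Python dicts; the pipeline names and the rest of the fields are illustrative placeholders, not a complete settings schema:

```python
# Sketch of Lakeflow pipeline settings — the run mode is controlled by the
# "continuous" flag. Names and other fields here are illustrative placeholders.

triggered_pipeline = {
    "name": "daily_revenue_pipeline",   # hypothetical name
    "continuous": False,                # TRIGGERED: process available data, then stop
    # A schedule (e.g., a daily job at midnight) kicks off each triggered run.
}

continuous_pipeline = {
    "name": "live_flight_status_pipeline",  # hypothetical name
    "continuous": True,                     # CONTINUOUS: run 24/7, process as data arrives
}
```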

SECTION 6: SCENARIO-BASED PIPELINE DESIGN (1 hour)

Q16: Design a booking pipeline for Amadeus (10 billion events/day)

Simple Explanation: This is a "system design" interview question. They want to see HOW you'd build an end-to-end pipeline. Think about: sources, ingestion, layers, compute, governance, monitoring.

📐 Architecture Diagram
ARCHITECTURE OVERVIEW:
═══════════════════════════════════════════════════════════════

SOURCES (200+ airlines send data in different ways):
├── REST APIs → push booking events
├── SFTP files → daily CSV/JSON file drops
└── Kafka streams → real-time event streams

INGESTION:
├── Files from SFTP → Auto Loader (watches ADLS folder for new files)
└── Kafka streams → Kafka connector (reads from Azure Event Hubs)

BRONZE LAYER (raw data, as-is):
├── What: Store every event exactly as received — never modify
├── How: Streaming tables via Lakeflow Pipelines
├── Schema: raw JSON with extra metadata columns:
│   - _ingested_at (when we received it)
│   - _source_airline (which airline sent it)
│   - _batch_id (which batch it belongs to)
├── Clustering: BY (ingestion_date, source_airline)
├── Retention: 90 days (for debugging and replay)
└── Quality: @dlt.expect_or_drop("valid_booking_id", "booking_id IS NOT NULL")

SILVER LAYER (cleaned, validated):
├── What: Deduplicated, validated, business keys applied
├── Deduplication: ROW_NUMBER by booking_id + updated_at (keep latest)
├── Quality expectations:
│   - expect: passenger_count > 0 (monitor)
│   - expect_or_drop: booking_id IS NOT NULL (filter junk)
│   - expect_or_fail: departure_date > booking_date (critical rule)
├── Dimensions: SCD Type 2 for dim_passenger (using apply_changes)
├── Facts: SCD Type 1 for fact_bookings (MERGE upsert — latest value overwrites)
└── Clustering: BY (booking_id)

GOLD LAYER (business-ready):
├── daily_booking_revenue → Materialized View (aggregation by route)
├── airline_performance_metrics → Materialized View
├── passenger_360 → Wide denormalized table for ML models
└── Clustering: BY (booking_date, departure_airport)

COMPUTE (which machines run what):
├── Ingestion: Serverless streaming (auto-scales with data volume)
├── Silver MERGE: Job cluster + Photon (heavy transformations need power)
├── Gold queries: Serverless SQL Warehouse (BI tools like Power BI connect here)
└── ML: GPU cluster for fare prediction models

GOVERNANCE (who can access what):
├── Unity Catalog: travel_catalog.bookings.* (3-level namespace)
├── PII masking: passenger email and phone masked for non-authorized users
├── Row-level security: each airline partner sees only their own bookings
└── GDPR: deletion pipeline for "right to be forgotten" requests

MONITORING (is everything healthy?):
├── Lakeflow expectations dashboard (% of rows passing quality checks)
├── Row count anomaly detection (new Feb 2026 feature)
├── SLA tracking: Bronze < 5 min latency, Silver < 30 min, Gold < 1 hour
└── PagerDuty alerts for pipeline failures or expectation breaches

Interview tip: Draw this on a whiteboard. Start from sources on the left, flow to Gold on the right. Mention compute, governance, and monitoring separately — it shows you think about production systems, not just data transformations.
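The Silver deduplication step above (keep the latest record per booking_id) is the same logic that `ROW_NUMBER() OVER (PARTITION BY booking_id ORDER BY updated_at DESC) = 1` expresses in SQL. A plain-Python sketch, with illustrative records:

```python
# Keep only the latest record per booking_id — plain-Python equivalent of
# ROW_NUMBER() OVER (PARTITION BY booking_id ORDER BY updated_at DESC) = 1.

def dedupe_latest(records):
    latest = {}
    for rec in records:
        key = rec["booking_id"]
        # Replace the stored record if this one is newer
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())

records = [
    {"booking_id": "B1", "updated_at": "2026-02-01T10:00", "status": "PENDING"},
    {"booking_id": "B1", "updated_at": "2026-02-01T12:00", "status": "CONFIRMED"},
    {"booking_id": "B2", "updated_at": "2026-02-01T11:00", "status": "CONFIRMED"},
]
deduped = dedupe_latest(records)
# Two rows survive; B1 keeps only its latest status (CONFIRMED)
```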

Q17: Scenario — Our MERGE on booking fact table takes 3 hours. How do you fix it?

Simple Explanation: This is a debugging scenario. The approach: first INVESTIGATE (why is it slow?), then FIX (apply optimizations in order of impact).

sql
-- STEP 1: INVESTIGATE — check the table's health
DESCRIBE DETAIL fact_bookings;
-- Key things to look at:
-- numFiles: 50,000 → TOO MANY! Small file problem (healthy target is ~1 GB per file)
-- sizeInBytes: 2 TB → 50K files means the average file is only ~40 MB; at ~1 GB each, 2 TB should be ~2,000 files
sql
-- STEP 2: FIX (in order of impact — do #1 first, then check if it's fast enough)

-- Fix 1: Compact files (biggest impact if you have small file problem)
OPTIMIZE fact_bookings;
-- Combines 50,000 small files → ~2,000 files of ~1 GB each
-- MERGE now creates ~2,000 tasks instead of 50,000 → much faster

-- Fix 2: Add partition column to MERGE ON clause (partition pruning)
MERGE INTO fact_bookings t USING staging s
ON t.booking_id = s.booking_id
   AND t.booking_date = s.booking_date;
-- booking_date = partition column → MERGE only scans today's data, not all 2 TB

-- Fix 3: Z-ORDER or Liquid Clustering on the merge key
OPTIMIZE fact_bookings ZORDER BY (booking_id);
-- Groups similar booking_ids together → data skipping works better during MERGE
-- OR for new tables: ALTER TABLE fact_bookings CLUSTER BY (booking_date, booking_id);

-- Fix 4: Switch to Photon runtime (3-5x faster MERGE, no code changes)
-- Just change the cluster runtime to Photon in the job configuration

-- Fix 5: Filter source to only merge rows that actually changed
-- (See Day 1, Q7 for the full technique)

-- Fix 6: Prevent future small files
ALTER TABLE fact_bookings SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',   -- Coalesce on write
    'delta.autoOptimize.autoCompact' = 'true'       -- Auto-compact after writes
);

Q18: Scenario — Migrate Oracle CDC to Delta Lake for Amadeus

Simple Explanation: Amadeus currently uses old Sqoop jobs that do a FULL COPY of Oracle tables every night (slow, 12-24 hours stale). The goal: switch to real-time CDC using Debezium → Kafka → Delta Lake (< 5 minute latency).

🗂️ CURRENT STATE (what we have now):
Oracle DB with 500 tables
Sqoop batch jobs run every night (full table copy → ADLS → Hive tables)
Problem 1: Data is 12-24 hours stale (not real-time)
Problem 2: Full copies are expensive (copying 500 tables every night)
Problem 3: No ACID, no governance, no quality checks
TARGET STATE (what we want):
Real-time CDC → < 5 minute latency
Delta Lake on ADLS Gen2 → ACID transactions
Unity Catalog → governance and PII protection
Lakeflow → automated quality checks
MIGRATION PLAN:
Phase 1 (Weeks 1-4): SET UP FOUNDATION
Deploy Debezium Oracle connector → Azure Event Hubs (managed Kafka)
│ Debezium reads Oracle redo logs → produces change events to Kafka topics
Set up Databricks workspace with Unity Catalog
│ Create catalogs: travel_prod, travel_dev, travel_staging
Bronze layer: Auto Loader reads from Event Hubs
│ Store raw CDC events as-is (append-only)
Test with 1 non-critical table first (e.g., flight_schedules)
Phase 2 (Weeks 5-8): MIGRATE TOP 20 TABLES
Migrate the 20 most important tables (by business value)
│ bookings, passengers, flights, fares, routes...
Silver: MERGE for fact tables, apply_changes (SCD2) for dimensions
Validate DAILY: count rows in Oracle vs Delta (must match!)
Run BOTH old Sqoop AND new CDC in parallel for 2 weeks (safety net)
Phase 3 (Weeks 9-12): MIGRATE REMAINING 480 TABLES
Migrate all remaining tables using the same pattern
Decommission Sqoop jobs one by one (after validation)
Build Gold layer: recreate existing reports and dashboards
Connect BI tools (Power BI) to Databricks SQL Warehouse
Phase 4 (Weeks 13-16): OPTIMIZE
Enable Predictive Optimization on all managed tables
Set up Lakeflow Pipelines for critical data paths
Build GDPR deletion pipeline for passenger PII
Cost optimization: right-size clusters, use Spot VMs
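The Phase 2 daily validation ("count rows in Oracle vs Delta — must match!") can be sketched as a simple reconciliation check. The table names and counts below are illustrative; in practice each number comes from a `COUNT(*)` against the respective system:

```python
# Sketch of the Phase 2 daily validation: compare per-table row counts from
# Oracle against Delta and report mismatches. Counts here are illustrative.

def reconcile(oracle_counts, delta_counts, tolerance=0):
    """Return tables whose Oracle/Delta row counts differ by more than tolerance."""
    mismatches = {}
    for table, oracle_n in oracle_counts.items():
        delta_n = delta_counts.get(table, 0)
        if abs(oracle_n - delta_n) > tolerance:
            mismatches[table] = {"oracle": oracle_n, "delta": delta_n}
    return mismatches

oracle_counts = {"bookings": 1_000_000, "passengers": 250_000, "flights": 80_000}
delta_counts  = {"bookings": 1_000_000, "passengers": 249_990, "flights": 80_000}
bad = reconcile(oracle_counts, delta_counts)
# Only "passengers" is flagged — investigate before decommissioning that Sqoop job
```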

Q19: How do you design a Data Quality Framework?

Simple Explanation: Data quality = making sure your data is correct, complete, and timely at every layer. Without quality checks, bad data silently flows through your pipeline and shows up in business reports — very dangerous.

python
# BRONZE LAYER: Basic checks — "did we receive the minimum required fields?"
@dlt.expect_or_drop("has_required_fields",
    "booking_id IS NOT NULL AND flight_id IS NOT NULL")
# Drop rows that don't even have basic required fields — they're garbage

# SILVER LAYER: Business rule checks — "does this data make sense?"
@dlt.expect("valid_dates", "departure_date >= booking_date")
# A flight can't depart BEFORE it was booked — flag if it does (but keep the row)

@dlt.expect("valid_fare", "fare_amount > 0 AND fare_amount < 100000")
# Fare should be positive and reasonable (not $0 or $1 million)

@dlt.expect_or_fail("valid_airport", "airport_is_known")
# Airport code must exist in our reference table — if not, something is VERY wrong, STOP pipeline.
# Note: expectation expressions must be simple boolean expressions — subqueries like
# "IN (SELECT ...)" are not supported. Derive a flag column first (e.g., LEFT JOIN
# ref_airports and set airport_is_known = iata_code IS NOT NULL), then fail on the flag.

# GOLD LAYER: Completeness checks
@dlt.expect("complete_data", "passenger_name IS NOT NULL")
# Gold layer should have complete data — flag if passenger name is missing

# MONITORING (NEW — February 2026 feature):
# Data Quality Monitoring Anomaly Detection
# Databricks automatically detects unexpected changes in:
# - Row counts (suddenly 50% fewer rows than usual)
# - Null rates (null % jumped from 1% to 20%)
# - Value distributions (a column that's usually 99% "USD" is now 50% "USD")
# No configuration needed — it learns normal patterns and alerts on anomalies
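The null-rate part of that anomaly detection can be sketched as a baseline comparison. The threshold and rates below are illustrative only; the managed Databricks feature learns its baselines automatically rather than using a fixed cutoff:

```python
# Sketch of null-rate anomaly detection: compare today's null rate for a column
# against a historical baseline and flag large jumps. Threshold and rates are
# illustrative — not the actual Databricks algorithm.

def null_rate(rows, column):
    """Fraction of rows where `column` is None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def is_anomalous(today_rate, baseline_rate, max_jump=0.05):
    """Flag if the null rate rose more than `max_jump` above the baseline."""
    return today_rate - baseline_rate > max_jump

# 20 of 100 rows missing an email → today's null rate is 0.20
rows = [{"email": None} if i < 20 else {"email": "x@example.com"} for i in range(100)]
today = null_rate(rows, "email")
alert = is_anomalous(today, baseline_rate=0.01)  # 0.20 vs 0.01 baseline → alert
```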

Interview tip: Show you think about quality at EVERY layer, not just one. "We put expect_or_drop at Bronze for garbage filtering, expect at Silver for business rules monitoring, and expect_or_fail at Gold for critical data integrity."

QUICK REVISION CHECKLIST — DAY 2

Test yourself — can you answer each in 2-3 minutes?

  • What is Medallion Architecture? Explain Bronze/Silver/Gold with Amadeus examples. (Q1)
  • When would you NOT follow Medallion? (Q2)
  • What is SCD? Explain Types 0, 1, 2, 3 with examples. (Q4)
  • Can you write SCD Type 2 with the merge_key trick? Explain how the trick works. (Q5)
  • Can you write SCD Type 2 with apply_changes in Lakeflow? (Q6)
  • What is CDC? Design Oracle → Kafka → Delta Lake pipeline. (Q7)
  • What is CDF (Change Data Feed)? How is it different from CDC? (Q8)
  • What is Auto Loader? Explain the two modes. (Q9)
  • How does Auto Loader handle schema changes? What is rescue mode? (Q10)
  • Auto Loader vs COPY INTO — when to use each? (Q11)
  • What is Lakeflow? What was it called before? (Q12)
  • What are the 3 levels of expectations? Give examples. (Q13)
  • Streaming Table vs Materialized View — when to use each? (Q14)
  • Can you design a 10B events/day booking pipeline? (Q16)
  • How would you fix a 3-hour slow MERGE? (Q17)
  • Design an Oracle → CDC → Delta Lake migration plan. (Q18)