Day 2: ETL Patterns & Pipeline Design (Azure Databricks)
SECTION 1: MEDALLION ARCHITECTURE (1 hour)
Q1: What is Medallion Architecture?
Simple Explanation: Medallion Architecture is a way to organize your data into 3 layers (like 3 levels of cleaning). Raw data comes in, gets cleaned step by step, and ends up ready for business reports and ML.
Think of it like cooking:
- Bronze = Raw ingredients from the market (dirty, uncut, as-is)
- Silver = Washed, cut, measured ingredients (clean, validated, ready to cook)
- Gold = Final dish served to the customer (aggregated, formatted for business use)
Why do we need it? Without layers, you'd have a mess — raw data mixed with cleaned data, no one knows what's trustworthy. Medallion gives you a clear pipeline: raw → clean → business-ready.
Design decisions per layer (Amadeus example):
| Decision | Bronze | Silver | Gold |
|---|---|---|---|
| Format | Delta (store raw JSON/CSV as Delta) | Delta | Delta |
| Clustering | By ingestion date | By business key (booking_id) | By query pattern (airport, date) |
| Quality checks | Schema validation only | Null checks, range checks, dedup | Business rule validation |
| Retention | 90 days raw | 7 years (regulatory for travel) | Latest + 2 years |
| Who reads | Data engineers only | Engineers + scientists | Analysts + BI + ML |
| Update pattern | Append only (never update) | MERGE (upsert) | Overwrite partition or MERGE |
Interview tip: Always explain Medallion with a real example: "In our booking pipeline, Bronze stores raw airline JSON as-is, Silver cleans and deduplicates bookings with MERGE, and Gold has daily revenue aggregations that Power BI reads."
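The three-layer flow can be modeled end-to-end with a tiny in-memory sketch (plain Python standing in for Delta tables; the field names and values are illustrative, not a real Amadeus schema):

```python
# Toy medallion flow: raw events -> validated/deduplicated records -> daily aggregate.
# Plain Python stands in for PySpark/Delta; data is illustrative only.

raw_events = [  # Bronze: append-only, duplicates and junk kept as-is
    {"booking_id": "B1", "booking_date": "2026-03-01", "fare": 120.0},
    {"booking_id": "B1", "booking_date": "2026-03-01", "fare": 120.0},  # duplicate
    {"booking_id": "B2", "booking_date": "2026-03-01", "fare": 300.0},
    {"booking_id": None, "booking_date": "2026-03-02", "fare": 50.0},   # junk row
]

# Silver: validate (non-null business key) and deduplicate by business key
silver = {}
for e in raw_events:
    if e["booking_id"] is not None:
        silver[e["booking_id"]] = e  # last version per booking_id wins

# Gold: aggregate for business consumption (revenue per day)
gold = {}
for rec in silver.values():
    gold[rec["booking_date"]] = gold.get(rec["booking_date"], 0.0) + rec["fare"]

print(gold)  # {'2026-03-01': 420.0}
```

Each stage only ever reads the previous layer, which is the property that makes the layers independently testable and replayable.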
Q2: When would you NOT follow Medallion Architecture?
Simple Explanation: Medallion is a guideline, not a strict rule. Sometimes you skip layers or add extra layers:
- Real-time dashboards: Data is already clean → read directly from Silver (skip Gold)
- Simple data: Source is already clean → Bronze directly to Gold (skip Silver)
- ML features: Add a "Feature" layer between Silver and Gold for ML feature tables
- Data Mesh: Each team (bookings team, flights team) owns their own Bronze→Silver→Gold
- Reverse ETL: Gold data pushed back to external systems (CRM, marketing tools)
Interview tip: When they ask "When would you deviate?", say: "It depends on the use case. For our real-time flight status dashboard, we skip Gold and read directly from Silver with materialized views — adding Gold would add unnecessary latency."
Q3: How do you handle late-arriving data?
Simple Explanation: In real life, data doesn't always arrive on time. A booking created on March 1 might reach your pipeline on March 5 (airline system was down, or batch was delayed). This is called late-arriving data. Each layer handles it differently.
Amadeus scenario: An airline's system was offline for 4 days. When it comes back, it sends 4 days of bookings at once.
-- BRONZE: No problem! Bronze is append-only.
-- Late data just gets appended with _ingested_at = March 5 (when it actually arrived).
-- We keep both the actual booking_date (March 1) and _ingested_at (March 5).
-- SILVER: MERGE handles late data naturally!
MERGE INTO silver.bookings t -- t = existing clean bookings
USING bronze.bookings_new s -- s = new data (includes late-arriving)
ON t.booking_id = s.booking_id -- Match by booking ID
WHEN MATCHED AND s.updated_at > t.updated_at THEN
UPDATE SET * -- Update only if new data is newer
-- Late data with older updated_at won't overwrite newer records
WHEN NOT MATCHED THEN
INSERT * -- Brand new bookings get inserted
-- Late bookings that were never seen before get added
# GOLD: Reprocess affected date partitions
# Since late data from March 1 arrived on March 5,
# we need to recompute Gold aggregations for March 1-5
df.write.format("delta") \
.mode("overwrite") \
    .option("replaceWhere", "booking_date >= '2026-03-01'") \
    .saveAsTable("gold.daily_bookings")
# replaceWhere = "only overwrite the partitions matching this predicate (the March dates), leave everything else untouched"
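Working out which Gold partitions actually need the rewrite can be sketched in plain Python (toy event data; field names mirror the examples above):

```python
# Given newly ingested events (including late arrivals), find which
# booking_date partitions Gold must recompute. Pure-Python sketch.

new_events = [
    {"booking_id": "B10", "booking_date": "2026-03-01", "_ingested_at": "2026-03-05"},
    {"booking_id": "B11", "booking_date": "2026-03-03", "_ingested_at": "2026-03-05"},
    {"booking_id": "B12", "booking_date": "2026-03-05", "_ingested_at": "2026-03-05"},
]

# Distinct business dates in the late-arriving batch = affected partitions
affected = sorted({e["booking_date"] for e in new_events})
print(affected)  # ['2026-03-01', '2026-03-03', '2026-03-05']

# The earliest affected date drives the replaceWhere predicate
replace_where = f"booking_date >= '{affected[0]}'"
print(replace_where)  # booking_date >= '2026-03-01'
```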
SECTION 2: SCD IMPLEMENTATIONS (1.5 hours)
Q4: What is SCD (Slowly Changing Dimension)?
Simple Explanation: In any database, some data changes over time. A passenger's address changes when they move. A loyalty tier goes from Silver to Gold. The question is: how do you handle these changes?
SCD = Slowly Changing Dimension — it's a set of techniques for handling data that changes over time in your dimension tables (like passenger, product, customer tables).
Real-world analogy: A phone contact changes their number. Do you:
- Type 1: Just overwrite the old number (you lose the old number forever)
- Type 2: Keep both numbers with dates ("old number: until March 2026, new number: from March 2026")
- Type 3: Add a column "previous_number" (keeps only ONE previous value)
| Type | What It Does | Do You Keep History? | Amadeus Example |
|---|---|---|---|
| Type 0 | Never update — data is static | No | Airport codes (BLR, DEL, JFK never change) |
| Type 1 | Overwrite old value with new | No — old value is lost | Correcting a typo in passenger name |
| Type 2 | Add new row, close old row | Full history | Passenger changes address → keep both old and new |
| Type 3 | Add "previous value" column | Only 1 previous | Loyalty tier: current=Gold, previous=Silver |
| Type 6 | Hybrid of 1+2+3 | Full + current + previous | Complex regulatory dimensions |
Interview tip: SCD Type 2 is the #1 most asked coding question in Databricks interviews. Know it by heart.
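The difference between the three main types can be shown with one change (address Bangalore → Mumbai) handled three ways, in plain Python (toy row; column names follow the table above):

```python
from copy import deepcopy

# One change (address Bangalore -> Mumbai) handled three SCD ways.
row = {"passenger_id": "PAX-100", "address": "Bangalore", "prev_address": None}

# Type 1: overwrite in place -> history lost
t1 = deepcopy(row)
t1["address"] = "Mumbai"

# Type 2: close the old version, insert a new version -> full history
t2 = [
    {**row, "is_current": False, "effective_end": "2026-03-24"},
    {"passenger_id": "PAX-100", "address": "Mumbai",
     "is_current": True, "effective_end": "9999-12-31"},
]

# Type 3: keep exactly ONE previous value in a dedicated column
t3 = deepcopy(row)
t3["prev_address"], t3["address"] = t3["address"], "Mumbai"

print(t1["address"])               # Mumbai (old value gone)
print([r["address"] for r in t2])  # ['Bangalore', 'Mumbai'] (both kept)
print(t3["prev_address"])          # Bangalore (one previous value kept)
```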
Q5: SCD Type 2 — FULL IMPLEMENTATION (The #1 most asked question)
Simple Explanation: SCD Type 2 keeps full history of changes. When a passenger changes their address, we DON'T overwrite the old row. Instead, we:
- Mark the old row as "no longer current" (set is_current = FALSE, set effective_end = now)
- Insert a NEW row with the new address (set is_current = TRUE, set effective_end = 9999-12-31)
This way, you can always answer: "What was this passenger's address on January 15, 2025?"
Example: Passenger moves from Bangalore to Mumbai:
Before:
| surrogate_key | passenger_id | address | is_current | effective_start | effective_end |
|---|---|---|---|---|---|
| 1 | PAX-100 | Bangalore | TRUE | 2024-01-01 | 9999-12-31 |
After:
| surrogate_key | passenger_id | address | is_current | effective_start | effective_end |
|---|---|---|---|---|---|
| 1 | PAX-100 | Bangalore | FALSE | 2024-01-01 | 2026-03-24 |
| 2 | PAX-100 | Mumbai | TRUE | 2026-03-24 | 9999-12-31 |
Table structure:
CREATE TABLE dim_passenger (
surrogate_key BIGINT GENERATED ALWAYS AS IDENTITY,
-- surrogate_key: Auto-generated unique ID for each row (not the business key)
-- Why? Because passenger_id can have multiple rows (one per version)
passenger_id STRING, -- Business key (the real passenger ID from source system)
full_name STRING,
email STRING,
address STRING,
loyalty_tier STRING,
hash_value STRING, -- MD5 hash of all tracked columns (name + email + address + loyalty)
-- Why hash? To quickly check if anything changed. Compare hash instead of comparing every column.
effective_start TIMESTAMP, -- When this version became active
effective_end TIMESTAMP, -- When this version was replaced (9999-12-31 = still current)
is_current BOOLEAN -- TRUE = this is the latest version of this passenger
) USING DELTA
CLUSTER BY (passenger_id); -- Cluster for fast lookups by passenger_id
The Merge Key Trick — this is the clever part:
The challenge: In a single MERGE, we need to do TWO things:
- UPDATE the old row (close it: is_current = FALSE)
- INSERT the new row (new version: is_current = TRUE)
But MERGE's WHEN MATCHED only does updates, and WHEN NOT MATCHED only does inserts. How do we do both?
The trick: Create TWO rows for each changed passenger in the source:
- Row 1: merge_key = passenger_id → this MATCHES the target → triggers WHEN MATCHED → UPDATE (closes the old row)
- Row 2: merge_key = NULL → NULL never matches anything → triggers WHEN NOT MATCHED → INSERT (the new row)
-- Step 1: Prepare two rows per changed passenger
WITH staged_changes AS (
-- === ROW TYPE A: INSERT new current version ===
-- merge_key = NULL → will NOT match target → goes to WHEN NOT MATCHED → INSERT
SELECT
s.passenger_id,
s.full_name,
s.email,
s.address,
s.loyalty_tier,
MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) AS hash_value,
-- MD5 creates a hash of all tracked columns
-- If any column changes, the hash changes → we know something changed
NULL AS merge_key
-- NULL merge_key = this row will NEVER match target
-- So it goes to WHEN NOT MATCHED → INSERT as new current row
FROM staging_passenger s
JOIN dim_passenger t
ON s.passenger_id = t.passenger_id -- Find existing passenger
AND t.is_current = TRUE -- Only compare with current version
AND MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) != t.hash_value
-- Only include passengers whose data actually CHANGED (hash is different)
UNION ALL
-- === ROW TYPE B: CLOSE the old version ===
-- merge_key = passenger_id → WILL match target → goes to WHEN MATCHED → UPDATE
SELECT
s.passenger_id,
s.full_name,
s.email,
s.address,
s.loyalty_tier,
MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) AS hash_value,
s.passenger_id AS merge_key
-- Real passenger_id as merge_key = this row WILL match target
-- So it goes to WHEN MATCHED → UPDATE (close the old row)
FROM staging_passenger s
JOIN dim_passenger t
ON s.passenger_id = t.passenger_id
AND t.is_current = TRUE
AND MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) != t.hash_value
UNION ALL
-- === ROW TYPE C: Brand new passengers (never seen before) ===
-- merge_key = NULL → WHEN NOT MATCHED → INSERT
SELECT
s.passenger_id,
s.full_name,
s.email,
s.address,
s.loyalty_tier,
MD5(CONCAT_WS('||', s.full_name, s.email, s.address, s.loyalty_tier)) AS hash_value,
NULL AS merge_key -- New passenger → insert
FROM staging_passenger s
LEFT JOIN dim_passenger t
ON s.passenger_id = t.passenger_id AND t.is_current = TRUE
WHERE t.passenger_id IS NULL -- Not found in target → brand new passenger
)
-- Step 2: MERGE does both close + insert in ONE atomic operation
MERGE INTO dim_passenger AS target
USING staged_changes AS source
ON target.passenger_id = source.merge_key AND target.is_current = TRUE
-- Match condition uses merge_key (not passenger_id!)
-- Row Type B (merge_key = real ID) → matches → WHEN MATCHED
-- Row Type A & C (merge_key = NULL) → never matches → WHEN NOT MATCHED
-- Close the old record (Row Type B triggered this)
WHEN MATCHED THEN UPDATE SET
target.is_current = FALSE, -- Mark as no longer current
target.effective_end = current_timestamp() -- Record when it was closed
-- Insert new current record (Row Type A and C triggered this)
WHEN NOT MATCHED THEN INSERT (
passenger_id, full_name, email, address, loyalty_tier,
hash_value, effective_start, effective_end, is_current
) VALUES (
source.passenger_id, source.full_name, source.email, source.address,
source.loyalty_tier, source.hash_value,
current_timestamp(), -- This version starts NOW
CAST('9999-12-31' AS TIMESTAMP), -- Far future = "no end date yet"
TRUE -- This is the current version
);
Summary of the trick:
| Row Type | merge_key | What Happens | Action |
|---|---|---|---|
| A (changed passenger - new row) | NULL | Doesn't match target | INSERT new current version |
| B (changed passenger - close old) | Real ID | Matches target | UPDATE: close old row |
| C (brand new passenger) | NULL | Doesn't match target | INSERT first version |
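The whole trick can be simulated on tiny in-memory tables in plain Python (one changed passenger, one brand-new passenger; names and columns are illustrative):

```python
import hashlib

# Pure-Python simulation of the merge_key trick.
def row_hash(r):
    return hashlib.md5("||".join([r["name"], r["address"]]).encode()).hexdigest()

target = [{"passenger_id": "PAX-1", "name": "Asha", "address": "Bangalore",
           "is_current": True, "effective_end": "9999-12-31"}]
target[0]["hash_value"] = row_hash(target[0])

staging = [{"passenger_id": "PAX-1", "name": "Asha", "address": "Mumbai"},  # changed
           {"passenger_id": "PAX-2", "name": "Ravi", "address": "Delhi"}]   # brand new

current = {r["passenger_id"]: r for r in target if r["is_current"]}

# Build staged_changes: two rows per changed passenger, one per new passenger
staged = []
for s in staging:
    t = current.get(s["passenger_id"])
    if t and row_hash(s) != t["hash_value"]:
        staged.append({**s, "merge_key": None})               # Row A: insert new version
        staged.append({**s, "merge_key": s["passenger_id"]})  # Row B: close old version
    elif t is None:
        staged.append({**s, "merge_key": None})               # Row C: brand-new insert

# MERGE: match on merge_key; None never matches, so it falls through to INSERT
for s in staged:
    if s["merge_key"] in current:                # WHEN MATCHED: close the old row
        old = current[s["merge_key"]]
        old["is_current"] = False
        old["effective_end"] = "2026-03-24"
    else:                                        # WHEN NOT MATCHED: insert new current row
        target.append({"passenger_id": s["passenger_id"], "name": s["name"],
                       "address": s["address"], "hash_value": row_hash(s),
                       "is_current": True, "effective_end": "9999-12-31"})

print(len(target))  # 3 rows: closed PAX-1, new PAX-1, new PAX-2
```

The final table has three rows, two of them current, exactly matching the before/after picture for SCD Type 2.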
Interview tip: Practice writing this from memory. Draw the "before/after" table on paper to visualize. The merge_key trick is the key insight — be ready to explain WHY we use NULL.
Q6: SCD Type 2 the EASY way — Lakeflow apply_changes
Simple Explanation:
The merge_key trick above is complex. Databricks has a MUCH easier way to do SCD Type 2 using Lakeflow Declarative Pipelines (formerly DLT). One function call — apply_changes() — does everything automatically.
When to use which:
- Merge key trick: When you need full control, or when not using Lakeflow
- apply_changes: When using Lakeflow pipelines (recommended — simpler and less error-prone)
import dlt # This is the Lakeflow (formerly DLT) library
# Step 1: Define the source — where new passenger data comes from
@dlt.table # @dlt.table = "this is a Lakeflow table"
def staging_passenger():
    return (spark.readStream                    # readStream = read data as a stream (continuously)
            .format("delta")                    # source is a Delta table
            .table("bronze.passenger_events"))  # read from the Bronze layer
# Step 2: Create the SCD Type 2 dimension table — just ONE function call!
dlt.create_streaming_table("dim_passenger") # Create empty streaming table
dlt.apply_changes(
target="dim_passenger", # Where to write (target dimension table)
source="staging_passenger", # Where to read from (staging table above)
keys=["passenger_id"], # Business key — how to identify a passenger
sequence_by="updated_at", # How to order events — latest wins
stored_as_scd_type=2, # Tell Lakeflow: "use SCD Type 2 pattern"
track_history_column_list=[ # Which columns to watch for changes
"full_name", "email", "address", "loyalty_tier"
]
# If any of these columns change, a new version is created
)
# That's it! Databricks automatically handles:
# ✅ surrogate_key generation
# ✅ effective_start and effective_end timestamps
# ✅ is_current flag
# ✅ Deduplication of source data
# ✅ Ordering by sequence_by column
# ✅ Closing old rows and inserting new rows
Interview tip: Mention BOTH approaches. "I can implement SCD Type 2 manually using the merge_key trick for full control, but in production I prefer apply_changes in Lakeflow because it's simpler and handles edge cases automatically."
SECTION 3: CHANGE DATA CAPTURE (CDC) (1 hour)
Q7: What is CDC? Design a pipeline: Oracle → Kafka → Delta Lake
Simple Explanation: CDC (Change Data Capture) means capturing every change (insert, update, delete) that happens in a source database and streaming those changes to your data platform.
Why do we need it? The old way: Run a batch job every night that copies the ENTIRE table from Oracle to your data lake (full extract). This is slow, expensive, and data is always 12-24 hours stale.
The CDC way: Capture only the CHANGES as they happen in real-time. A new booking? Stream it. A cancellation? Stream it. Data arrives in minutes, not hours.
Real-world analogy: Instead of photocopying the entire phone book every day to check for changes, you set up a notification: "Tell me whenever someone changes their number." That's CDC.
How does Debezium work? (the most popular CDC tool) Debezium reads the Oracle redo logs (database's internal change diary). It doesn't query the database — so there's ZERO impact on Oracle performance. It produces a Kafka message for every insert, update, or delete.
End-to-end flow for Amadeus:
ORACLE (Legacy System) KAFKA DATABRICKS
═══════════════════ ═════ ═══════════
Bookings table → Debezium reads → BRONZE: Raw CDC events
INSERT new booking Oracle redo logs (append every event as-is)
UPDATE booking status and sends to Kafka │
DELETE cancelled topics ▼
SILVER: Apply changes
Passengers table → One Kafka topic → MERGE (for facts)
UPDATE address per source table apply_changes (for dims)
UPDATE loyalty tier │
▼
Flights table → → GOLD: Business metrics
UPDATE schedule Daily revenue, route stats
UPDATE gate assignment ML feature tables
Interview tip: This is a VERY likely question for Amadeus since they're migrating from Oracle. Be ready to draw this architecture and explain each component.
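The core of the Silver "apply changes" step is just replaying insert/update/delete events against current state. A toy consumer, sketched in plain Python with a simplified Debezium-style event shape (op codes "c" = create, "u" = update, "d" = delete; real Debezium payloads carry more fields):

```python
# Toy CDC consumer: apply a stream of change events to an in-memory table.
events = [
    {"op": "c", "key": "B1", "after": {"status": "BOOKED", "fare": 120.0}},
    {"op": "c", "key": "B2", "after": {"status": "BOOKED", "fare": 300.0}},
    {"op": "u", "key": "B1", "after": {"status": "CHECKED_IN", "fare": 120.0}},
    {"op": "d", "key": "B2", "after": None},
]

table = {}
for e in events:
    if e["op"] in ("c", "u"):
        table[e["key"]] = e["after"]   # upsert the latest image of the row
    elif e["op"] == "d":
        table.pop(e["key"], None)      # delete; ignore if already gone

print(table)  # {'B1': {'status': 'CHECKED_IN', 'fare': 120.0}}
```

In the real pipeline this replay is what MERGE (for facts) and apply_changes (for dimensions) do at scale.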
Q8: What is Change Data Feed (CDF)? How is it different from CDC?
Simple Explanation: People often confuse CDC and CDF. They sound similar but are different things:
- CDC = Capturing changes from an external source system (Oracle → Kafka → Delta)
- CDF = Tracking changes that happen inside a Delta table (Delta table feature)
CDF (Change Data Feed) is a Delta Lake feature. When enabled, Delta records what changed in a table — every insert, update, delete — so downstream consumers can read ONLY the changes instead of the full table.
Real-world analogy:
- CDC = A reporter covering live events at the airport (captures external events)
- CDF = The airport's flight status board history (records what changed inside the airport system)
Why is CDF useful? Without CDF: Gold layer reads the ENTIRE Silver table every day (slow, wasteful). With CDF: Gold layer reads only the rows that changed since last run (fast, efficient).
-- Step 1: Enable CDF on a table
ALTER TABLE bookings SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- After this, Delta starts tracking every change to this table
-- Step 2: Read changes between versions
SELECT * FROM table_changes('bookings', 5, 10);
-- Returns ONLY the rows that changed between version 5 and version 10
-- Much faster than reading the entire table!
-- Step 3: Read changes by timestamp
SELECT * FROM table_changes('bookings', '2026-03-01', '2026-03-15');
-- What changed in the bookings table during March 1-15?
-- Step 4: Streaming read of changes (for real-time downstream processing)
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 5) \
    .table("bookings")
# readChangeFeed = enable CDF reading; startingVersion = start from version 5
-- This continuously reads new changes as they happen
CDF adds these extra columns to the output:
| Column | What It Tells You | Example Values |
|---|---|---|
| _change_type | What kind of change | insert, update_preimage (old value), update_postimage (new value), delete |
| _commit_version | Which Delta version | 6, 7, 8... |
| _commit_timestamp | When the change happened | 2026-03-15 10:30:00 |
Use cases:
- Silver → Gold incremental: Gold layer reads only changed Silver rows (not the full table)
- Audit trail: "What changes were made to this booking last week?"
- Downstream sync: Push changes to another system (e.g., sync to a search index)
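The "Silver → Gold incremental" use case works because the four change types let you update an aggregate without rescanning: inserts and update_postimage rows add, deletes and update_preimage rows subtract. A toy sketch (column names mirror the CDF table above; data is illustrative):

```python
# Maintain a Gold running total from CDF-style change rows
# instead of rescanning the whole Silver table.
changes = [
    {"_change_type": "insert",           "booking_date": "2026-03-01", "fare": 100.0},
    {"_change_type": "insert",           "booking_date": "2026-03-01", "fare": 200.0},
    {"_change_type": "update_preimage",  "booking_date": "2026-03-01", "fare": 200.0},
    {"_change_type": "update_postimage", "booking_date": "2026-03-01", "fare": 250.0},
    {"_change_type": "delete",           "booking_date": "2026-03-01", "fare": 100.0},
]

revenue = {}
for c in changes:
    # new images add to the total, old images and deletes subtract
    sign = 1 if c["_change_type"] in ("insert", "update_postimage") else -1
    d = c["booking_date"]
    revenue[d] = revenue.get(d, 0.0) + sign * c["fare"]

print(revenue)  # {'2026-03-01': 250.0}
```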
SECTION 4: AUTO LOADER (45 min)
Q9: What is Auto Loader?
Simple Explanation: Auto Loader is a Databricks feature that automatically detects and loads new files from cloud storage (like ADLS Gen2). You point it at a folder, and whenever new files appear, Auto Loader picks them up and loads them into a Delta table.
Why do we need it? Without Auto Loader: You write custom code to check "are there new files?", track which files were already processed, handle failures, etc. Lots of boilerplate. With Auto Loader: Just point to a folder and say "load new files." It handles everything automatically — file tracking, exactly-once processing, schema detection.
Real-world analogy: Auto Loader is like a mailroom worker. New mail (files) arrives in the mailbox (ADLS folder). The worker automatically picks it up, stamps it as "processed", and delivers it to the right desk (Delta table). You don't need to keep checking the mailbox yourself.
Two modes:
| Mode | How It Finds New Files | Best For |
|---|---|---|
| Directory Listing (default) | Lists all files in the folder, compares with checkpoint to find new ones | Simple setups, <10K files/day |
| File Notification | Azure Event Grid sends a notification when a new file arrives | High volume (>10K files/day), lower latency |
# Auto Loader example: Load booking JSON files from ADLS Gen2 into Bronze Delta table
# READING: Auto Loader detects and reads new files
df = (spark.readStream
      .format("cloudFiles")                           # "cloudFiles" = Auto Loader
      .option("cloudFiles.format", "json")            # source files are JSON format
      .option("cloudFiles.schemaLocation",
              "/checkpoints/bookings/schema")         # where to store the inferred schema
      .option("cloudFiles.inferColumnTypes", "true")  # auto-detect column types (int, string, etc.)
      .load("abfss://raw@storage.dfs.core.windows.net/bookings/"))
# ↑ The ADLS Gen2 folder to watch for new files
# Auto Loader will ONLY process files it hasn't seen before
# WRITING: Write new data to Bronze Delta table
(df.writeStream
   .format("delta")                                           # write as Delta format
   .option("checkpointLocation", "/checkpoints/bookings/")    # checkpoint = "bookmark" tracking progress
   .option("mergeSchema", "true")                             # if the schema changes, merge new columns in
   .trigger(availableNow=True)                                # process all available files RIGHT NOW, then stop
   .toTable("bronze.bookings"))                               # target Delta table name
# trigger options:
# - availableNow=True → Process all new files, then stop (for scheduled batch jobs)
# - processingTime="5 min" → Check for new files every 5 minutes (for near-real-time)
# - once=True → Process once and stop (one-time backfill)
Interview tip: Auto Loader is the recommended way to ingest files in Databricks. If asked "How would you ingest files?", always mention Auto Loader first.
Q10: How does Auto Loader handle schema changes?
Simple Explanation: In real life, source systems change their data format. An airline might add a new field "baggage_count" to their booking JSON. Auto Loader can handle this automatically — you choose how:
| Mode | What Happens When New Columns Appear | When to Use |
|---|---|---|
| addNewColumns (default) | New columns are added to the Delta table automatically | Dev/testing — flexible |
| rescue | New columns are stored in a special _rescued_data column as JSON | Production — safest |
| failOnNewColumns | Pipeline STOPS with an error — you must manually handle it | Strict schema control |
| none | New columns are silently ignored | When you don't care about new fields |
# RECOMMENDED FOR PRODUCTION: rescue mode
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaEvolutionMode", "rescue")
      # ↑ "rescue" = if a file has unexpected columns, put them in _rescued_data
      #   This way, your pipeline NEVER breaks because of schema changes
      #   You can inspect _rescued_data later and decide what to do
      .option("cloudFiles.schemaLocation", "/checkpoints/schema/")
      .load(path))
# After loading, check _rescued_data for unexpected columns:
# df.filter("_rescued_data IS NOT NULL").show()
# If you find valid new columns, update your schema to include them
Real-world scenario: Airline X starts sending "meal_preference" in their booking JSON. Without rescue mode, your pipeline might break. With rescue mode, "meal_preference" goes into _rescued_data, pipeline continues running, and you handle it when ready.
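What rescue mode does to a record can be shown with a small pure-Python sketch (the column split and JSON packing are illustrative, not Auto Loader internals):

```python
import json

# Simulate "rescue" schema evolution: known columns land in the table,
# unexpected columns are packed into _rescued_data as JSON.
known_columns = {"booking_id", "fare_amount"}

def rescue(record):
    row = {k: v for k, v in record.items() if k in known_columns}
    extras = {k: v for k, v in record.items() if k not in known_columns}
    row["_rescued_data"] = json.dumps(extras) if extras else None
    return row

good = rescue({"booking_id": "B1", "fare_amount": 120.0})
surprise = rescue({"booking_id": "B2", "fare_amount": 90.0, "meal_preference": "VEG"})

print(good["_rescued_data"])      # None
print(surprise["_rescued_data"])  # {"meal_preference": "VEG"}
```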
Q11: Auto Loader vs COPY INTO — when to use which?
Simple Explanation: Both load files into Delta tables. The difference is: Auto Loader is a streaming approach (continuously watches for new files), while COPY INTO is a batch SQL command (you run it manually or on a schedule).
| Aspect | Auto Loader | COPY INTO |
|---|---|---|
| Type | Streaming (continuous) | Batch (SQL command, run manually) |
| How it tracks files | Automatic checkpoint (never reprocesses) | Tracks via file metadata |
| Schema evolution | Full support (rescue mode, etc.) | Limited |
| Scale | Handles millions of files | Handles thousands of files |
| Exactly-once | Yes (checkpoint-based) | Yes (idempotent — safe to rerun) |
| When to use | Regular/frequent ingestion | One-time loads, ad-hoc imports |
-- COPY INTO example (simple one-off load):
COPY INTO bronze.bookings
FROM 'abfss://raw@storage.dfs.core.windows.net/bookings/'
FILEFORMAT = JSON
-- This loads all files from the folder. If you run it again, it skips already-loaded files.
Rule of thumb: Always prefer Auto Loader. Use COPY INTO only for simple, one-time, ad-hoc loads.
SECTION 5: LAKEFLOW DECLARATIVE PIPELINES (formerly DLT) (1 hour)
Q12: What is Lakeflow? (formerly called DLT / Delta Live Tables)
Simple Explanation: Lakeflow Declarative Pipelines is Databricks' managed ETL framework. Instead of writing complex pipeline code (read data, clean it, write it, handle errors, manage retries...), you just DECLARE what your tables should look like, and Databricks handles the rest.
Why "declarative"? Because you declare WHAT you want, not HOW to do it:
- Regular code: "Read from this source, filter out nulls, join with this table, write to this target, handle errors..."
- Declarative: "This table should have no nulls and should come from that source." Databricks figures out the HOW.
Name history (important for interviews!):
- Before June 2025: Called "Delta Live Tables (DLT)"
- After June 2025: Renamed to "Lakeflow Declarative Pipelines"
- In Apache Spark: The core technology was contributed as "Spark Declarative Pipelines"
- Use the new name in interviews to show you're up to date!
What Lakeflow gives you (that regular code doesn't):
- Built-in data quality checks (expectations)
- Automatic lineage tracking (see which table depends on which)
- Managed infrastructure (serverless available — no cluster management)
- Automatic recovery from failures
- Easy SCD Type 2 with apply_changes (see Q6 above)
Two table types in Lakeflow:
| Type | What It Does | When to Use |
|---|---|---|
| Streaming Table | Processes ONLY new data (incremental) | Append-heavy data, CDC events, real-time |
| Materialized View | Recomputes the ENTIRE result each time | Aggregations, complex joins, Gold layer summaries |
Real-world analogy:
- Streaming Table = A conveyor belt that processes only new items
- Materialized View = A report that's regenerated from scratch each time you ask for it
Q13: What are Data Quality Expectations? (3 levels)
Simple Explanation: Expectations are data quality rules you attach to your Lakeflow tables. They check every row as it flows through your pipeline. Think of them as quality inspectors on an assembly line.
There are 3 levels depending on what should happen when bad data is found:
import dlt # Lakeflow library
@dlt.table # This decorator says "this function defines a Lakeflow table"
# Level 1: EXPECT — log the bad row but KEEP it (monitoring only)
@dlt.expect("valid_booking_id", "booking_id IS NOT NULL")
# "If booking_id is null, log a warning but still keep the row"
# Use when: You want to MONITOR data quality without blocking the pipeline
# Level 2: EXPECT_OR_DROP — silently DROP bad rows
@dlt.expect_or_drop("valid_amount", "fare_amount > 0")
# "If fare_amount is 0 or negative, drop the row (don't include it in the table)"
# Use when: Bad rows are useless garbage — just filter them out
# Level 3: EXPECT_OR_FAIL — STOP the entire pipeline
@dlt.expect_or_fail("valid_currency", "currency IN ('USD','EUR','GBP','INR')")
# "If currency is not one of these, STOP everything. Something is very wrong."
# Use when: This is a CRITICAL data integrity issue — pipeline must not continue
def silver_bookings():
return spark.readStream.table("bronze.bookings")
Summary:
| Expectation | Bad Data? | Pipeline | Real Use |
|---|---|---|---|
| @dlt.expect | Kept + logged | Continues | "Track how many bad rows we get" (monitoring) |
| @dlt.expect_or_drop | Dropped silently | Continues | "Remove known junk data" (filtering) |
| @dlt.expect_or_fail | N/A | STOPS | "Something critical is broken — investigate!" |
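The three behaviors can be demonstrated with a toy row-by-row evaluator in plain Python (the rules mirror the decorators above; this is a conceptual sketch, not Lakeflow internals):

```python
# Toy evaluator for the three expectation levels:
# expect (warn, keep), expect_or_drop (filter), expect_or_fail (raise).
rows = [
    {"booking_id": "B1", "fare_amount": 100.0, "currency": "USD"},
    {"booking_id": None, "fare_amount": 50.0,  "currency": "EUR"},   # warned, but kept
    {"booking_id": "B2", "fare_amount": -10.0, "currency": "INR"},   # dropped
]

warnings, kept = 0, []
for r in rows:
    if r["booking_id"] is None:                      # expect: log it, keep the row
        warnings += 1
    if not r["fare_amount"] > 0:                     # expect_or_drop: filter the row out
        continue
    if r["currency"] not in ("USD", "EUR", "GBP", "INR"):
        raise ValueError("expect_or_fail: invalid currency, stop the pipeline")
    kept.append(r)

print(warnings, len(kept))  # 1 2
```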
New in 2026: Expectations are now stored in Unity Catalog — version-controlled, auditable, and shareable across pipelines. You can define an expectation once and reuse it in multiple pipelines.
Interview tip: "In our Amadeus pipeline, we use expect at Bronze to monitor raw data quality, expect_or_drop at Silver to filter invalid bookings, and expect_or_fail at Gold for critical business rules like 'departure date must be after booking date.'"
Q14: What is the difference between Materialized View and Streaming Table?
Simple Explanation:
Streaming Table = Processes only NEW data. Like a factory assembly line — it only works on items that just arrived. It remembers what it already processed and never redoes old work.
Materialized View = Recomputes EVERYTHING from scratch. Like a daily summary report — it reads all the data and produces a fresh result each time. Slower, but always accurate.
import dlt
# STREAMING TABLE — incremental processing (only new data)
@dlt.table
def silver_bookings():
return spark.readStream.table("LIVE.bronze_bookings_raw")
# readStream = streaming = only processes NEW rows since last run
# Use for: Continuously adding new bookings as they arrive
# MATERIALIZED VIEW — full recompute (all data, every time)
@dlt.table
def gold_daily_revenue():
return spark.sql("""
SELECT
booking_date,
departure_airport,
arrival_airport,
SUM(fare_amount) AS total_revenue, -- Sum all fares
COUNT(*) AS booking_count -- Count all bookings
FROM LIVE.silver_bookings -- Read from Silver layer
GROUP BY booking_date, departure_airport, arrival_airport
""")
# No readStream = batch = recomputes the entire aggregation each time
# Use for: Gold aggregations that need to reflect all data including updates
| Aspect | Streaming Table | Materialized View |
|---|---|---|
| Processing | Only new/unprocessed data | Everything from scratch |
| Speed | Fast (less work each run) | Slower (reads all data) |
| Source requirement | Must be a streaming source | Any table or view |
| Best for | Bronze→Silver (append-heavy, CDC) | Gold layer (aggregations, joins) |
| Handles updates? | Appends only (unless using apply_changes) | Yes (recomputes all) |
Q15: What is TRIGGERED vs CONTINUOUS pipeline mode?
Simple Explanation: When you create a Lakeflow pipeline, you choose how it runs:
| Mode | What It Does | Analogy | When to Use |
|---|---|---|---|
| TRIGGERED | Process all available data, then STOP | Like a batch job — run, finish, done | Scheduled runs (daily at 6am, hourly) |
| CONTINUOUS | Run FOREVER, process data as it arrives | Like a 24/7 factory line | Real-time use cases (live dashboards) |
Amadeus example:
- Daily revenue report → TRIGGERED (run once a day at midnight)
- Live flight status board → CONTINUOUS (update within seconds)
SECTION 6: SCENARIO-BASED PIPELINE DESIGN (1 hour)
Q16: Design a booking pipeline for Amadeus (10 billion events/day)
Simple Explanation: This is a "system design" interview question. They want to see HOW you'd build an end-to-end pipeline. Think about: sources, ingestion, layers, compute, governance, monitoring.
ARCHITECTURE OVERVIEW:
═══════════════════════════════════════════════════════════════
SOURCES (200+ airlines send data in different ways):
├── REST APIs → push booking events
├── SFTP files → daily CSV/JSON file drops
└── Kafka streams → real-time event streams
INGESTION:
├── Files from SFTP → Auto Loader (watches ADLS folder for new files)
└── Kafka streams → Kafka connector (reads from Azure Event Hubs)
BRONZE LAYER (raw data, as-is):
├── What: Store every event exactly as received — never modify
├── How: Streaming tables via Lakeflow Pipelines
├── Schema: raw JSON with extra metadata columns:
│ - _ingested_at (when we received it)
│ - _source_airline (which airline sent it)
│ - _batch_id (which batch it belongs to)
├── Clustering: BY (ingestion_date, source_airline)
├── Retention: 90 days (for debugging and replay)
└── Quality: @dlt.expect_or_drop("booking_id IS NOT NULL")
SILVER LAYER (cleaned, validated):
├── What: Deduplicated, validated, business keys applied
├── Deduplication: ROW_NUMBER by booking_id + updated_at (keep latest)
├── Quality expectations:
│ - expect: passenger_count > 0 (monitor)
│ - expect_or_drop: booking_id IS NOT NULL (filter junk)
│ - expect_or_fail: departure_date > booking_date (critical rule)
├── Dimensions: SCD Type 2 for dim_passenger (using apply_changes)
├── Facts: SCD Type 1 for fact_bookings (using MERGE — overwrite)
└── Clustering: BY (booking_id)
GOLD LAYER (business-ready):
├── daily_booking_revenue → Materialized View (aggregation by route)
├── airline_performance_metrics → Materialized View
├── passenger_360 → Wide denormalized table for ML models
└── Clustering: BY (booking_date, departure_airport)
COMPUTE (which machines run what):
├── Ingestion: Serverless streaming (auto-scales with data volume)
├── Silver MERGE: Job cluster + Photon (heavy transformations need power)
├── Gold queries: Serverless SQL Warehouse (BI tools like Power BI connect here)
└── ML: GPU cluster for fare prediction models
GOVERNANCE (who can access what):
├── Unity Catalog: travel_catalog.bookings.* (3-level namespace)
├── PII masking: passenger email and phone masked for non-authorized users
├── Row-level security: each airline partner sees only their own bookings
└── GDPR: deletion pipeline for "right to be forgotten" requests
MONITORING (is everything healthy?):
├── Lakeflow expectations dashboard (% of rows passing quality checks)
├── Row count anomaly detection (new Feb 2026 feature)
├── SLA tracking: Bronze < 5 min latency, Silver < 30 min, Gold < 1 hour
└── PagerDuty alerts for pipeline failures or expectation breaches
Interview tip: Draw this on a whiteboard. Start from sources on the left, flow to Gold on the right. Mention compute, governance, and monitoring separately — it shows you think about production systems, not just data transformations.
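The Silver dedup rule above (keep the latest record per booking_id by updated_at) is usually written as a ROW_NUMBER window in Spark, but the logic is easy to show in plain Python. This is an illustrative sketch, not the pipeline code; the field names mirror the example above:

```python
# Keep-latest dedup: for each booking_id, retain the row with the greatest
# updated_at. Mirrors ROW_NUMBER() OVER (PARTITION BY booking_id
# ORDER BY updated_at DESC) = 1 in the Silver layer.
def dedup_latest(rows):
    latest = {}
    for row in rows:
        key = row["booking_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

events = [
    {"booking_id": "B1", "updated_at": 1, "status": "PENDING"},
    {"booking_id": "B1", "updated_at": 3, "status": "CONFIRMED"},
    {"booking_id": "B2", "updated_at": 2, "status": "PENDING"},
]
deduped = dedup_latest(events)  # B1 keeps its latest (CONFIRMED) version
```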
Q17: Scenario — Our MERGE on booking fact table takes 3 hours. How do you fix it?
Simple Explanation: This is a debugging scenario. The approach: first INVESTIGATE (why is it slow?), then FIX (apply optimizations in order of impact).
-- STEP 1: INVESTIGATE — check the table's health
DESCRIBE DETAIL fact_bookings;
-- Key things to look at:
-- numFiles: 50,000 → TOO MANY! Small file problem (should be ~2,000 files for 2 TB at ~1 GB each)
-- sizeInBytes: 2 TB → That's the data size. 50K files means avg file is only 40 MB (should be ~1 GB)
-- STEP 2: FIX (in order of impact — do #1 first, then check if it's fast enough)
-- Fix 1: Compact files (biggest impact if you have small file problem)
OPTIMIZE fact_bookings;
-- Combines 50,000 small files → ~2,000 large files (~1 GB each)
-- MERGE now launches ~2,000 tasks instead of 50,000 → much faster
-- Fix 2: Add partition column to MERGE ON clause (partition pruning)
MERGE INTO fact_bookings t USING staging s
ON t.booking_id = s.booking_id
AND t.booking_date = s.booking_date;
-- booking_date = partition column → MERGE only scans today's data, not all 2 TB
-- Fix 3: Z-ORDER or Liquid Clustering on the merge key
OPTIMIZE fact_bookings ZORDER BY (booking_id);
-- Groups similar booking_ids together → data skipping works better during MERGE
-- OR for new tables: ALTER TABLE fact_bookings CLUSTER BY (booking_date, booking_id);
-- Fix 4: Switch to Photon runtime (3-5x faster MERGE, no code changes)
-- Just change the cluster runtime to Photon in the job configuration
-- Fix 5: Filter source to only merge rows that actually changed
-- (See Day 1, Q7 for the full technique)
-- Fix 6: Prevent future small files
ALTER TABLE fact_bookings SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true', -- Coalesce on write
'delta.autoOptimize.autoCompact' = 'true' -- Auto-compact after writes
);
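The Step 1 diagnosis is just arithmetic on two DESCRIBE DETAIL fields. A quick sanity check, using the numbers from the scenario (the 1 GB target size is a common compaction default, not something DESCRIBE DETAIL reports):

```python
# Diagnose the small-file problem from DESCRIBE DETAIL output.
num_files = 50_000
size_in_bytes = 2 * 10**12        # 2 TB, as reported in sizeInBytes
target_file_size = 10**9          # ~1 GB: a common post-OPTIMIZE target

avg_file_size_mb = size_in_bytes / num_files / 10**6    # 40 MB -> far too small
healthy_file_count = size_in_bytes // target_file_size  # ~2,000 files
```

If avg_file_size_mb is tens of MB rather than hundreds-to-1000, OPTIMIZE (Fix 1) is almost certainly the highest-impact first step.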
Q18: Scenario — Migrate Oracle CDC to Delta Lake for Amadeus
Simple Explanation: Amadeus currently uses old Sqoop jobs that do a FULL COPY of Oracle tables every night (slow, 12-24 hours stale). The goal: switch to real-time CDC using Debezium → Kafka → Delta Lake (< 5 minute latency).
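The heart of that Debezium → Kafka → Delta pipeline is applying change events to the target in commit order. A Spark-free sketch of the apply logic (Debezium's real envelope has more fields; key, op, ts_ms, and after here are simplified stand-ins, and in production this step would be a Delta MERGE or apply_changes):

```python
# Apply Debezium-style CDC events to a keyed table state.
# op codes follow Debezium: "c" = create, "u" = update, "d" = delete.
def apply_cdc(state, events):
    for e in sorted(events, key=lambda e: e["ts_ms"]):  # replay in commit order
        key = e["key"]
        if e["op"] == "d":
            state.pop(key, None)      # delete: remove the row
        else:
            state[key] = e["after"]   # create/update: upsert the latest row image
    return state

state = {}
events = [
    {"key": "B1", "op": "c", "ts_ms": 1, "after": {"fare": 100}},
    {"key": "B1", "op": "u", "ts_ms": 2, "after": {"fare": 120}},
    {"key": "B2", "op": "c", "ts_ms": 3, "after": {"fare": 80}},
    {"key": "B2", "op": "d", "ts_ms": 4, "after": None},
]
apply_cdc(state, events)  # B1 ends at its updated fare; B2 is deleted
```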
Q19: How do you design a Data Quality Framework?
Simple Explanation: Data quality = making sure your data is correct, complete, and timely at every layer. Without quality checks, bad data silently flows through your pipeline and shows up in business reports — very dangerous.
# BRONZE LAYER: Basic checks — "did we receive the minimum required fields?"
@dlt.expect_or_drop("has_required_fields",
"booking_id IS NOT NULL AND flight_id IS NOT NULL")
# Drop rows that don't even have basic required fields — they're garbage
# SILVER LAYER: Business rule checks — "does this data make sense?"
@dlt.expect("valid_dates", "departure_date >= booking_date")
# A flight can't depart BEFORE it was booked — flag if it does (but keep the row)
@dlt.expect("valid_fare", "fare_amount > 0 AND fare_amount < 100000")
# Fare should be positive and reasonable (not $0 or $1 million)
@dlt.expect_or_fail("valid_airport", "airport_match IS NOT NULL")
# Airport code must exist in our reference table; if not, something is VERY wrong, STOP pipeline
# Note: expectations are plain SQL expressions over the row and can't run subqueries.
# LEFT JOIN to ref_airports in the table's query first, exposing the matched
# iata_code as airport_match; a NULL match then trips this expectation.
# GOLD LAYER: Completeness checks
@dlt.expect("complete_data", "passenger_name IS NOT NULL")
# Gold layer should have complete data — flag if passenger name is missing
# MONITORING (NEW — February 2026 feature):
# Data Quality Monitoring Anomaly Detection
# Databricks automatically detects unexpected changes in:
# - Row counts (suddenly 50% fewer rows than usual)
# - Null rates (null % jumped from 1% to 20%)
# - Value distributions (a column that's usually 99% "USD" is now 50% "USD")
# No configuration needed — it learns normal patterns and alerts on anomalies
Interview tip: Show you think about quality at EVERY layer, not just one. "We use expect_or_drop at Bronze to filter garbage, expect at Silver to monitor business rules without losing rows, and expect_or_fail on critical integrity rules that must stop the pipeline."
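The three expectation actions only differ in what happens to a failing row. A plain-Python sketch of those semantics (illustrative only; real Lakeflow evaluates SQL expressions per row, not Python lambdas):

```python
# Mimic Lakeflow expectation actions on a batch of rows.
#   warn -> keep the row, count the violation (expect)
#   drop -> filter the row out                (expect_or_drop)
#   fail -> raise and stop the pipeline       (expect_or_fail)
def run_expectations(rows, checks):
    kept, metrics = [], {name: 0 for name, _, _ in checks}
    for row in rows:
        keep = True
        for name, predicate, action in checks:
            if not predicate(row):
                metrics[name] += 1
                if action == "fail":
                    raise ValueError(f"expectation '{name}' violated: {row}")
                if action == "drop":
                    keep = False
        if keep:
            kept.append(row)
    return kept, metrics

checks = [
    ("has_booking_id", lambda r: r.get("booking_id") is not None, "drop"),
    ("valid_fare", lambda r: 0 < r.get("fare", 0) < 100_000, "warn"),
]
rows = [
    {"booking_id": "B1", "fare": 250},   # passes everything
    {"booking_id": None, "fare": 90},    # dropped: no booking_id
    {"booking_id": "B2", "fare": -5},    # kept, but flagged in metrics
]
kept, metrics = run_expectations(rows, checks)
```

The metrics dict plays the role of the expectations dashboard: it tells you how often each rule fired even for rows you chose to keep.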
QUICK REVISION CHECKLIST — DAY 2
Test yourself — can you answer each in 2-3 minutes?
- What is Medallion Architecture? Explain Bronze/Silver/Gold with Amadeus examples. (Q1)
- When would you NOT follow Medallion? (Q2)
- What is SCD? Explain Types 0, 1, 2, 3 with examples. (Q4)
- Can you write SCD Type 2 with the merge_key trick? Explain how the trick works. (Q5)
- Can you write SCD Type 2 with apply_changes in Lakeflow? (Q6)
- What is CDC? Design Oracle → Kafka → Delta Lake pipeline. (Q7)
- What is CDF (Change Data Feed)? How is it different from CDC? (Q8)
- What is Auto Loader? Explain the two modes. (Q9)
- How does Auto Loader handle schema changes? What is rescue mode? (Q10)
- Auto Loader vs COPY INTO — when to use each? (Q11)
- What is Lakeflow? What was it called before? (Q12)
- What are the 3 levels of expectations? Give examples. (Q13)
- Streaming Table vs Materialized View — when to use each? (Q14)
- Can you design a 10B events/day booking pipeline? (Q16)
- How would you fix a 3-hour slow MERGE? (Q17)
- Design an Oracle → CDC → Delta Lake migration plan. (Q18)