Databricks · Section 9 of 17

Day 1: Delta Lake & Lakehouse Deep Dive (Azure Databricks)
💡 Interview Tip
Time: 6-7 hours | Priority: HIGHEST — Delta Lake is 30-40% of any Databricks interview
Amadeus Context: Travel booking tables with billions of rows, fare pricing history, passenger PII
Approach: Every topic starts with a simple explanation → then interview-level depth

SECTION 1: DELTA LAKE INTERNALS (1.5 hours)

Q1: What is Delta Lake? And what is the transaction log?

Simple Explanation: Think of a normal data lake — you store files (like Parquet) in cloud storage (Azure ADLS). But there's a big problem: if two people write to the same folder at the same time, data can get corrupted. There's no "undo" button. There's no way to know what changed.

Delta Lake solves this. It adds a "smart layer" on top of your Parquet files. This smart layer is called the transaction log (stored in a folder called _delta_log/). It's like a diary that records every change — "file X was added", "file Y was removed", "schema changed", etc.

Real-world analogy: Imagine a hotel booking register. Every time a booking is made or cancelled, the receptionist writes it in a numbered logbook (commit 1, commit 2, commit 3...). If someone asks "what did our bookings look like yesterday?", you can replay the logbook up to yesterday. That logbook = Delta transaction log.

Why do we need it?

  • Without Delta: Two booking agents update same file → data gets corrupted
  • With Delta: Transaction log ensures only one change goes through at a time (like a queue)

Technical details:

🗂️ bookings_table/                              -- Your table folder on ADLS Gen2
    _delta_log/                                  -- THE TRANSACTION LOG (the "diary")
        00000000000000000000.json                -- Commit 0: table was created
        00000000000000000001.json                -- Commit 1: 1000 bookings inserted
        00000000000000000002.json                -- Commit 2: 50 bookings updated
        00000000000000000010.checkpoint.parquet  -- Checkpoint (summary of first 10 commits)
        _last_checkpoint                         -- Points to the latest checkpoint file
    part-00000-abc123.snappy.parquet             -- Actual data file 1
    part-00001-def456.snappy.parquet             -- Actual data file 2

What's inside each JSON commit file?

  • add → "I added this new Parquet file" (new data was written)
  • remove → "I logically deleted this file" (but file is still physically there until VACUUM cleans it)
  • metaData → "The table schema changed" or "table properties changed"
  • commitInfo → "Who did this, when, what operation (INSERT/UPDATE/DELETE)"
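The add/remove replay logic can be sketched in a few lines of plain Python. This is an illustrative toy (the commit entries below are made up, and real _delta_log files store one JSON action per line), not Delta's implementation:

```python
# Hypothetical, simplified commit entries. Each commit is a list of actions,
# mirroring the add/remove actions described above.
commits = [
    [{"add": {"path": "part-00000-abc123.snappy.parquet"}}],   # commit 0
    [{"add": {"path": "part-00001-def456.snappy.parquet"}}],   # commit 1
    [{"remove": {"path": "part-00000-abc123.snappy.parquet"}}, # commit 2
     {"add": {"path": "part-00002-ghi789.snappy.parquet"}}],
]

def replay(commits):
    """Replay add/remove actions to get the current set of live data files."""
    live = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                # Logical delete: the file leaves the snapshot but stays on
                # storage until VACUUM physically removes it.
                live.discard(action["remove"]["path"])
    return live

print(sorted(replay(commits)))
# → ['part-00001-def456.snappy.parquet', 'part-00002-ghi789.snappy.parquet']
```

The current table state is exactly "whatever files survive the replay" — which is why reading the log is enough to reconstruct any version.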

What is ACID? (You know this from databases, same concept here):

Property | What It Means | How Delta Does It
Atomicity | Either ALL changes apply, or NONE apply. No half-done writes. | Each commit is a single JSON file — it either fully writes or doesn't
Consistency | Data always follows the rules (schema). You can't insert wrong data types. | Schema enforcement rejects mismatched columns/types
Isolation | Readers don't see half-written data. Each reader sees a clean snapshot. | Snapshot isolation — when you start a query, you see the table as it was at that moment
Durability | Once data is committed, it won't be lost (even if the server crashes). | Data is stored as Parquet files on ADLS Gen2 (cloud storage = durable)

Amadeus example: "When 10 booking agents update the same passenger table simultaneously, Delta's transaction log ensures no partial writes corrupt the table. Each agent's changes are atomic — either fully applied or not at all."

Interview tip: They won't ask "What is ACID?". They'll ask "How does Delta Lake ensure data consistency when multiple pipelines write to the same table?" — answer with the transaction log + optimistic concurrency.

Q2: What are checkpoint files? Why are they important?

Simple Explanation: Imagine your transaction log has 10,000 commits (10,000 JSON files). To read the current table state, you'd need to read ALL 10,000 files — very slow!

A checkpoint is a summary file. Every 10 commits (by default), Delta creates a single Parquet file that says "here's the complete state of the table right now." So instead of reading 10,000 files, you read 1 checkpoint + the few commits after it.

Real-world analogy: Like a bank account statement. Instead of adding up every transaction since the account was opened, the monthly statement gives you the current balance. You only need to add transactions after the statement date.

Key points:

  • Created every 10 commits (configurable via delta.checkpointInterval)
  • Format: Parquet (not JSON) — faster to read
  • _last_checkpoint file → tells Delta which checkpoint is the latest
  • Without checkpoints: reading table = replaying ALL commits (slow!)
  • With checkpoints: reading table = read latest checkpoint + only a few recent JSONs (fast!)
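The speed-up comes from starting at the checkpoint instead of commit 0. A plain-Python sketch of that idea (file names and commit contents are invented for illustration):

```python
def load_snapshot(checkpoint_state, commits_after):
    """Start from the checkpoint's full file list, then apply only the few
    commits written after it -- no need to replay the whole log."""
    live = set(checkpoint_state)
    for actions in commits_after:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

# The checkpoint at version 10 already summarizes commits 0-10...
checkpoint = {"part-a.parquet", "part-b.parquet"}
# ...so only commits 11 and 12 need replaying on top of it.
later = [
    [{"add": {"path": "part-c.parquet"}}],      # commit 11
    [{"remove": {"path": "part-a.parquet"}}],   # commit 12
]
print(sorted(load_snapshot(checkpoint, later)))
# → ['part-b.parquet', 'part-c.parquet']
```

Two commits replayed instead of twelve — and the gap only grows as the log gets longer.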

Q3: What is Optimistic Concurrency Control?

Simple Explanation: When two people try to update the same table at the same time, Delta uses "optimistic concurrency" to handle it. The word "optimistic" means: Delta ASSUMES there won't be a conflict, lets both work, and only checks for conflicts at commit time.

Real-world analogy: Two travel agents are updating different passenger records in the same table. Agent A updates passenger 1's email. Agent B updates passenger 2's address. Both started with version 5 of the table. Agent A finishes first and creates version 6. When Agent B tries to commit, Delta checks: "Did Agent A touch the same data as Agent B?" If no → Agent B's changes go in as version 7. If yes → conflict error.

🗂️ Writer A reads version 5 (fare update batch)
   Writer B reads version 5 (booking cancellation batch)
   Writer A commits version 6 → succeeds (it was first)
   Writer B tries to commit version 6:
       Sees: "Wait, version 6 already exists!"
       Reads version 6 to see what Writer A changed
       Checks for LOGICAL conflict:
           Different files/rows touched → No conflict → Auto-retries as version 7 ✓
           Same files touched → CONFLICT → throws ConcurrentModificationException ✗
   Automatic retry happens up to 3 times (configurable)
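The commit-time conflict check above can be sketched in plain Python. This is a simplified illustration (real Delta also auto-retries non-conflicting commits and tracks more than file names); all names here are invented:

```python
class ConcurrentModificationException(Exception):
    pass

def try_commit(log, read_version, touched):
    """Optimistic concurrency: assume no conflict, verify only at commit time.
    `log` is a list of per-commit touched-file sets (index = version number)."""
    # Check every commit that landed AFTER the version we read.
    for version in range(read_version + 1, len(log)):
        if touched & log[version]:           # same files -> logical conflict
            raise ConcurrentModificationException(
                f"version {version} touched the same files")
    log.append(touched)                      # our changes become the next version
    return len(log) - 1

log = [{"part-0.parquet"}]                   # version 0: initial data
# Writers A and B both read version 0, then write to DIFFERENT files:
a = try_commit(log, read_version=0, touched={"fares.parquet"})
b = try_commit(log, read_version=0, touched={"cancellations.parquet"})
print(a, b)                                  # → 1 2
```

A third writer that had read version 0 and touched fares.parquet would now hit the exception, because version 1 already modified that file.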

Two isolation levels:

Level | Default? | When to Use
WriteSerializable | Yes | Normal workloads — two pipelines writing to different parts of the table can run in parallel
Serializable | No | Strict audit/compliance tables — even reads during writes can cause conflicts

Interview tip: If asked "How do you handle concurrent writes?", mention: optimistic concurrency + partition your writes by date/region so different pipelines touch different files → no conflicts.

Q4: What is data skipping? How do file-level statistics work?

Simple Explanation: When you query a Delta table, you don't want to read ALL the data files. That's wasteful. Delta stores min and max values for each column in each data file (inside the transaction log). When you run a query with a WHERE clause, Delta checks: "Can this file possibly contain matching rows?" If the answer is NO, it SKIPS the file entirely.

Real-world analogy: You have 100 file folders, each labeled with date ranges (Jan 1-15, Jan 16-31, Feb 1-15, ...). If someone asks for bookings on March 15, you skip ALL folders that don't include March 15. You only open the relevant folder.

sql
-- Query: Find bookings on a specific date
SELECT * FROM bookings WHERE booking_date = '2026-03-15'

-- What Delta does behind the scenes:
-- File A: min(booking_date) = 2026-01-01, max = 2026-01-31
--         → SKIP! March 15 can't possibly be in this file
-- File B: min(booking_date) = 2026-03-01, max = 2026-03-31
--         → READ! March 15 might be in this file
-- File C: min(booking_date) = 2026-06-01, max = 2026-06-30
--         → SKIP! March 15 can't be here either

Key details:

  • Stats are stored for the first 32 columns by default
  • Config: delta.dataSkippingNumIndexedCols (default 32)
  • Works best when data is sorted/clustered (that's why OPTIMIZE + Z-ORDER helps!)
  • For a 10 TB booking table, data skipping can reduce scan from 10 TB to just 50 GB

Why this matters for Amadeus: With billions of booking records, scanning the entire table for one date would take ages. Data skipping makes queries fast by reading only the relevant files.
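The pruning check itself is just a range test per file. A plain-Python sketch (the file names and statistics below are made up to mirror the SQL example above):

```python
# Hypothetical per-file min/max stats, as Delta keeps them in the transaction log.
file_stats = {
    "part-A.parquet": {"min": "2026-01-01", "max": "2026-01-31"},
    "part-B.parquet": {"min": "2026-03-01", "max": "2026-03-31"},
    "part-C.parquet": {"min": "2026-06-01", "max": "2026-06-30"},
}

def files_to_read(stats, value):
    """Keep only files whose [min, max] range could contain the value;
    every other file is skipped without being opened."""
    return [f for f, s in stats.items() if s["min"] <= value <= s["max"]]

# ISO date strings compare correctly as plain strings.
print(files_to_read(file_stats, "2026-03-15"))
# → ['part-B.parquet']
```

One file read out of three — scale the same ratio to a 10 TB table and the savings become dramatic.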

SECTION 2: MERGE INTO — ALL SCENARIOS (1.5 hours)

Q5: What is MERGE? Basic syntax (upsert)

Simple Explanation: MERGE is the most important operation in Databricks. It combines INSERT + UPDATE + DELETE into a single command. In simple words: "Look at my new data (source). Compare it with existing data (target). If a record already exists → update it. If it's new → insert it. If it's cancelled → delete it."

This is called an upsert (update + insert).

Why do we need it? Without MERGE, you'd need to write 3 separate queries (one for insert, one for update, one for delete) — and they wouldn't be atomic. MERGE does everything in one atomic operation.

Real-world analogy: Amadeus receives a daily file of booking changes. Some are new bookings (INSERT), some are updates to existing bookings (UPDATE), some are cancellations (DELETE). MERGE handles all three in one go.

sql
-- MERGE = Compare source (new data) with target (existing table), then act
MERGE INTO bookings_fact AS t            -- t = target (our existing bookings table)
USING bookings_staging AS s              -- s = source (new booking data that just arrived)
ON t.booking_id = s.booking_id           -- Match condition: how to find matching records

-- Case 1: Booking exists AND is cancelled → delete it
WHEN MATCHED AND s.status = 'CANCELLED' THEN
    DELETE

-- Case 2: Booking exists AND has been updated → update the record
WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET
        t.status = s.status,             -- Update status (e.g., CONFIRMED → CHECKED_IN)
        t.fare_amount = s.fare_amount,   -- Maybe fare was recalculated
        t.passenger_count = s.passenger_count,
        t.updated_at = s.updated_at      -- Track when this change happened

-- Case 3: Booking doesn't exist in target → it's a brand new booking → insert
WHEN NOT MATCHED THEN
    INSERT (booking_id, flight_id, passenger_id, status, fare_amount,
            passenger_count, created_at, updated_at)
    VALUES (s.booking_id, s.flight_id, s.passenger_id, s.status, s.fare_amount,
            s.passenger_count, s.created_at, s.updated_at)

-- Case 4: Record exists in target but NOT in source → orphaned data → clean up
-- (Databricks extension — not available in standard SQL)
WHEN NOT MATCHED BY SOURCE AND t.status = 'PENDING' THEN
    DELETE

Interview tip: Be ready to write MERGE from memory. This is the #1 coding question in Databricks interviews. Practice it 3-4 times.

Q6: What happens when source has duplicate keys? How to fix?

Simple Explanation: MERGE requires that each target row matches at most ONE source row. If your source data has duplicate booking_ids (e.g., two records for booking ABC123), MERGE doesn't know which one to use → it throws an error.

The fix: Deduplicate the source data BEFORE merging. Keep only the latest record per key.

sql
-- Problem: bookings_staging has 2 rows for booking_id = 'ABC123'
-- Solution: Use ROW_NUMBER to keep only the latest one

WITH deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY booking_id        -- Group by booking_id
            ORDER BY updated_at DESC       -- Latest record first
        ) AS rn                            -- rn=1 means the latest record
    FROM bookings_staging
)
SELECT * FROM deduped WHERE rn = 1         -- Keep only the latest record per booking

Why duplicates happen in real life:

  • Source system sent the same event twice (retry)
  • Multiple Kafka partitions delivered the same record
  • File was reprocessed accidentally
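The keep-latest-per-key logic from the ROW_NUMBER query above, expressed in plain Python (row contents are invented; one dictionary lookup per row replaces the window function):

```python
def dedupe_latest(rows, key="booking_id", ts="updated_at"):
    """Keep only the most recent row per key -- same idea as
    ROW_NUMBER() ... ORDER BY updated_at DESC ... WHERE rn = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

staging = [
    {"booking_id": "ABC123", "status": "CONFIRMED", "updated_at": "2026-03-15T09:00"},
    {"booking_id": "ABC123", "status": "CANCELLED", "updated_at": "2026-03-15T11:30"},
    {"booking_id": "XYZ789", "status": "CONFIRMED", "updated_at": "2026-03-15T10:00"},
]
deduped = dedupe_latest(staging)
print(len(deduped))   # → 2  (the 09:00 ABC123 row is dropped)
```

After this, every target row can match at most one source row, so the MERGE succeeds.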

Q7: How to make MERGE faster? (CRITICAL — commonly asked)

Simple Explanation: The biggest problem with MERGE is: it has to scan the ENTIRE target table to find matching records. If your bookings table has 2 billion rows, MERGE reads all 2 billion rows just to match a few thousand new records. This is VERY slow.

Why is it slow? MERGE works like this: for every row in the source, scan the entire target to find a match. More target data = slower MERGE.

6 ways to make it faster:

1. Partition pruning — tell MERGE which partition to look in:

sql
-- WITHOUT partition column: MERGE scans ALL data across ALL dates
MERGE INTO bookings t USING staging s
ON t.booking_id = s.booking_id              -- Scans 2 billion rows!

-- WITH partition column: MERGE only scans matching date partitions
MERGE INTO bookings t USING staging s
ON t.booking_id = s.booking_id
   AND t.booking_date = s.booking_date      -- Only scans today's partition!
-- This is like telling MERGE: "only look in March 2026 folder, not all folders"

2. Z-ORDER on merge key — organize data so matching records are close together:

sql
-- Z-ORDER sorts/groups data by booking_id within files
-- This makes data skipping work better during MERGE
OPTIMIZE bookings ZORDER BY (booking_id);
-- Now when MERGE looks for booking ABC123, it can skip most files
-- (See Q10 below for full explanation of Z-ORDER)

3. Don't merge rows that haven't changed — filter source first:

sql
-- Problem: Source has 1 million rows, but only 10,000 actually changed
-- Without filter: MERGE processes all 1 million rows (wasteful)
-- With filter: MERGE only processes 10,000 changed rows (fast!)

MERGE INTO bookings t
USING (
    SELECT s.* FROM staging s
    LEFT JOIN bookings t ON s.booking_id = t.booking_id
    WHERE t.booking_id IS NULL                  -- Brand new bookings (not in target)
       OR s.hash_value != t.hash_value          -- Changed bookings (different data)
    -- Unchanged bookings are filtered out — saves time!
) AS s
ON t.booking_id = s.booking_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

4. Broadcast small source — if source is small, send it to all workers:

python
# What is broadcast? When source is small (e.g., 10K rows),
# Spark sends a copy to every worker machine.
# This avoids an expensive "shuffle" (moving data between machines).
from delta.tables import DeltaTable
from pyspark.sql.functions import broadcast

target = DeltaTable.forName(spark, "bookings")   # the target Delta table
target.alias("t").merge(
    broadcast(source_df).alias("s"),       # Send source to all workers
    "t.booking_id = s.booking_id"          # Workers can match locally
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
# Use when: source < 100 MB. Don't use when source is large.

5. Compact target first — fewer files = fewer tasks = faster:

sql
-- If target has 50,000 small files, MERGE creates 50,000 tasks (slow!)
-- OPTIMIZE combines small files into ~1 GB files (e.g., 500 files)
OPTIMIZE bookings;
-- Now MERGE only creates ~500 tasks (much faster!)
-- See Q9 below for full explanation of OPTIMIZE

6. Use Photon runtime — C++ engine that's 3-5x faster for MERGE:

  • Just select "Photon" runtime when creating your cluster
  • No code changes needed — same SQL, just runs faster
  • See Day 3 for full Photon explanation

Q8: What is schema evolution with MERGE?

Simple Explanation: Sometimes your source data has NEW columns that don't exist in the target table yet. For example, the airline adds a "loyalty_tier" field to passenger data. Normally, MERGE would fail because the target table doesn't have this column.

With schema evolution, Delta automatically adds the new column to the target table during MERGE.

sql
-- Step 1: Enable auto schema merge (tell Delta: "it's okay if source has new columns")
SET spark.databricks.delta.schema.autoMerge.enabled = true;

-- Step 2: MERGE as normal — new columns from source get added to target automatically
MERGE INTO passengers t                -- Target: has columns (id, name, email)
USING staging s                        -- Source: has columns (id, name, email, loyalty_tier) ← NEW!
ON t.passenger_id = s.passenger_id
WHEN MATCHED THEN UPDATE SET *         -- * means "all columns" — includes loyalty_tier
WHEN NOT MATCHED THEN INSERT *         -- New column "loyalty_tier" is auto-added to target table
-- After this: target table now has (id, name, email, loyalty_tier)

When to use: When source systems add new fields over time (very common in real life).
When NOT to use: When you want strict schema control (e.g., regulatory tables where schema changes need approval).

SECTION 3: OPTIMIZE, VACUUM, Z-ORDER, LIQUID CLUSTERING (1 hour)

Q9: What is OPTIMIZE? What is VACUUM? What's the difference?

Simple Explanation:

OPTIMIZE = File compaction. Over time, your table accumulates many small files (especially with streaming or frequent small writes). Small files are bad for performance because each file means a separate read operation. OPTIMIZE combines many small files into fewer large files (~1 GB each).

Real-world analogy: You have 10,000 Post-it notes scattered on your desk. OPTIMIZE = combining them into 10 neat notebooks. Much easier to find things now!

VACUUM = Garbage collection. When you UPDATE or DELETE data in Delta, the old files are NOT deleted immediately (they're kept for time travel). Over time, these old unused files pile up and waste storage. VACUUM physically deletes old files that are no longer needed.

Real-world analogy: VACUUM = throwing away old drafts that you no longer need. Once thrown away, you can't go back to them.

Aspect | OPTIMIZE | VACUUM
What it does | Combines small files → fewer large files | Deletes old unused files from storage
Why you need it | Small files → slow reads | Old files → wasted storage cost
Is it destructive? | No — old files still exist after | YES — old files are permanently deleted
Affects time travel? | No | Yes — you can't time travel to versions whose files were vacuumed
Default retention | N/A | 7 days (files older than 7 days get deleted)
How often to run | Daily or after large writes | Weekly or after OPTIMIZE
sql
-- OPTIMIZE: Combine small files into ~1 GB files
OPTIMIZE flight_schedules;
-- Example: 10,000 small files (1 MB each) → 10 large files (1 GB each)
-- Result: Queries that used to take 5 minutes now take 30 seconds

-- OPTIMIZE + Z-ORDER: Combine files AND sort data by a column
-- (See Q10 for what Z-ORDER means)
OPTIMIZE flight_schedules ZORDER BY (departure_airport, flight_date);

-- VACUUM: Delete old files that are older than 7 days (default)
VACUUM flight_schedules;
-- This frees up storage space on ADLS Gen2

-- You can specify custom retention period:
VACUUM flight_schedules RETAIN 168 HOURS;  -- 168 hours = 7 days

-- DANGEROUS — never do this in production:
-- VACUUM flight_schedules RETAIN 0 HOURS;
-- This deletes ALL old files immediately, breaking ALL time travel!

Interview tip: Always mention OPTIMIZE and VACUUM together — they're a pair. OPTIMIZE creates new files, VACUUM cleans up the old ones.
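The core of compaction is just greedy bin-packing: keep adding small files to an output file until it reaches the target size. A toy sketch (file sizes invented; real OPTIMIZE also considers clustering and concurrency):

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Greedily group small files into ~target-sized output files."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current output file once adding another would overflow it.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 10,000 one-megabyte files compact into ~10 one-gigabyte output files.
plan = plan_compaction([1] * 10_000)
print(len(plan))   # → 10
```

Fewer, larger files means fewer read tasks and fewer per-file stats to check — which is exactly why queries speed up after OPTIMIZE.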

Q10: What is Z-Ordering? What is Partitioning? What is Liquid Clustering?

These are 3 different ways to organize your data for faster queries. Let's understand each one:

PARTITIONING — The oldest and simplest approach

Simple Explanation: Partitioning creates separate folders for each value of a column. If you partition by booking_date, each date gets its own folder. When you query WHERE booking_date = '2026-03-15', Spark only reads the March 15 folder.

Real-world analogy: Filing cabinet with one drawer per month. Need January data? Open only the January drawer.

Problem: Only works for low-cardinality columns (few unique values). If you partition by passenger_id (millions of unique values), you get millions of tiny folders = disaster (called "over-partitioning").

sql
-- Good: booking_date has ~365 values per year → manageable
CREATE TABLE bookings (...) PARTITIONED BY (booking_date);

-- BAD: passenger_id has millions of values → millions of tiny folders!
-- CREATE TABLE bookings (...) PARTITIONED BY (passenger_id);  -- DON'T DO THIS

Z-ORDERING — Sort data WITHIN files for better data skipping

Simple Explanation: Z-ORDER doesn't create separate folders. Instead, it sorts and groups related data WITHIN the Parquet files so that similar values are close together. This makes data skipping (Q4) much more effective.

Real-world analogy: Imagine a library. Partitioning = separate rooms per genre. Z-ORDER = within each room, books are sorted by author name. If you want "books by author X", you go to a specific shelf, not search the entire room.

Why "Z-ORDER"? It uses a mathematical technique called "Z-curve" (space-filling curve) to sort data on multiple columns simultaneously. You don't need to understand the math — just know it groups similar values together.

sql
-- Z-ORDER is always used WITH OPTIMIZE (not standalone)
OPTIMIZE bookings ZORDER BY (passenger_id);
-- This reorganizes all files so that passenger_id values are grouped together
-- Now queries like WHERE passenger_id = 'PAX-123' can skip most files

-- You can Z-ORDER on up to 4 columns (more than 4 = less effective)
OPTIMIZE bookings ZORDER BY (departure_airport, booking_date);

Limitations of Z-ORDER:

  • Must run manually (OPTIMIZE ZORDER BY ...)
  • Rewrites ALL files every time (slow for large tables)
  • Can't change Z-ORDER columns easily

LIQUID CLUSTERING — The NEW and BEST approach (2024+)

Simple Explanation: Liquid Clustering is Databricks' latest solution that replaces BOTH partitioning and Z-ordering. It automatically organizes data as you write, only processes new/changed data (incremental), and you can change clustering columns anytime without rewriting the whole table.

Think of it as "smart auto-organizing" — Delta figures out the best way to arrange your data based on the columns you specify.

sql
-- Create table with Liquid Clustering (replaces PARTITIONED BY + ZORDER)
CREATE TABLE bookings (
    booking_id LONG,
    passenger_id LONG,
    booking_date DATE,
    fare_amount DECIMAL(10,2)
) CLUSTER BY (booking_date, passenger_id);  -- ← Use CLUSTER BY instead of PARTITIONED BY
-- Delta will automatically organize data by these columns

-- Change clustering columns anytime — NO full rewrite needed!
ALTER TABLE bookings CLUSTER BY (departure_airport, booking_date);
-- Only NEW writes use the new clustering. Old data gets reorganized gradually.

-- Trigger optimization (only processes new/changed data)
OPTIMIZE bookings;
-- Unlike Z-ORDER, this is INCREMENTAL — fast even on huge tables

Comparison summary:

Aspect | Partitioning | Z-Ordering | Liquid Clustering
What it does | Separate folders per value | Sorts data within files | Auto-organizes data
Good for | Low cardinality (date, country) | High cardinality (user_id) | Any cardinality
Applied when | On write | Manual OPTIMIZE command | Automatically on writes
Incremental | N/A | No (rewrites all files) | Yes (only new/changed)
Change columns | Requires full rewrite | Must re-OPTIMIZE everything | Just ALTER TABLE
Replaces others? | No | No | YES — replaces both

Recommendation for Amadeus:

  • New tables: Always use Liquid Clustering
  • Existing partitioned tables: Migrate to Liquid Clustering when possible
  • Z-ORDER tips: If you're on older tables, max 4 columns, choose columns used in WHERE/JOIN/MERGE

Interview tip: If they ask "How would you organize a new bookings table?", answer: "I'd use Liquid Clustering with CLUSTER BY (booking_date, departure_airport) because it's incremental, automatic, and I can change the keys later without rewriting data."

Q11: What are Deletion Vectors?

Simple Explanation: Normally, when you DELETE or UPDATE even a single row in a Parquet file, Delta has to rewrite the ENTIRE file. If the file is 1 GB, Delta writes a new 1 GB file just to remove one row. This is very expensive.

Deletion Vectors solve this by creating a tiny separate file that says "row #47 in file X is deleted." The original file stays untouched. When reading, Delta checks the deletion vector and skips that row. The actual file rewrite happens later during OPTIMIZE (not immediately).

Real-world analogy: Instead of reprinting an entire 500-page book because of one typo, you just stick a Post-it note on the page saying "ignore this line." The actual reprint happens later when convenient.

  • Without Deletion Vectors: DELETE 1 row from 1 GB file → rewrite entire 1 GB file (slow)
  • With Deletion Vectors: DELETE 1 row → write tiny marker file (~bytes) (fast!)
  • Cleanup: OPTIMIZE will do the actual file rewrite later
sql
-- Enable deletion vectors on a table
ALTER TABLE bookings SET TBLPROPERTIES (
    'delta.enableDeletionVectors' = 'true'    -- Tells Delta: use DV for this table
);
-- After this, DELETEs and UPDATEs become much faster
-- Reads are slightly slower (must check deletion vectors) — but usually worth it

Trade-off: Writes get much faster, reads get slightly slower. Run OPTIMIZE periodically to clean up.
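The read-side bookkeeping is simple: a per-file set of deleted row positions (in practice a compressed bitmap) that the reader skips. A plain-Python sketch with invented file and row names:

```python
# A deletion vector maps a data file to the row positions logically deleted
# in it. The data file itself is never touched by the DELETE.
deletion_vectors = {"part-00000.parquet": {1}}   # row #1 is deleted

def read_file(path, rows, dvs):
    """Return the file's rows, skipping any position marked in its
    deletion vector -- this is the small read-time cost of DVs."""
    dead = dvs.get(path, set())
    return [row for i, row in enumerate(rows) if i not in dead]

rows = ["PAX-100", "PAX-101", "PAX-102"]
print(read_file("part-00000.parquet", rows, deletion_vectors))
# → ['PAX-100', 'PAX-102']
```

OPTIMIZE later rewrites the file without the dead rows and drops the vector, restoring full read speed.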

New in Delta 4.1 (2025): You can enable deletion vectors without blocking concurrent writes (conflict-free enablement).

Q12: What are the important Delta table properties?

Simple Explanation: Delta tables have settings (called "table properties") that control behavior — like auto-compaction, change tracking, retention periods, etc. These are set using ALTER TABLE ... SET TBLPROPERTIES.

sql
ALTER TABLE bookings SET TBLPROPERTIES (

    -- AUTO OPTIMIZATION: Automatically fix small files problem
    'delta.autoOptimize.optimizeWrite' = 'true',
    -- What: When writing data, Delta automatically combines small output files
    -- Why: Prevents the small file problem without manually running OPTIMIZE

    'delta.autoOptimize.autoCompact' = 'true',
    -- What: After each write, Delta automatically runs a mini-OPTIMIZE
    -- Why: Keeps files at a healthy size over time

    -- CHANGE DATA FEED (CDF): Track what changed row-by-row
    'delta.enableChangeDataFeed' = 'true',
    -- What: Records every INSERT/UPDATE/DELETE at the row level
    -- Why: Downstream tables can read only the changes (not the full table)
    -- Example: Gold layer reads only changed Silver rows → faster pipeline

    -- RETENTION: How long to keep old data for time travel
    'delta.logRetentionDuration' = 'interval 30 days',
    -- What: Keep commit logs for 30 days (for DESCRIBE HISTORY)

    'delta.deletedFileRetentionDuration' = 'interval 7 days',
    -- What: VACUUM won't delete files newer than 7 days
    -- Why: Protects running queries and time travel for 7 days

    -- COLUMN MAPPING: Enable column rename and drop
    'delta.columnMapping.mode' = 'name',
    -- What: Maps columns by name instead of position
    -- Why: Allows ALTER TABLE RENAME COLUMN and DROP COLUMN
    -- Without this: you can't rename or drop columns in Delta

    -- DELETION VECTORS: Faster deletes/updates
    'delta.enableDeletionVectors' = 'true'
    -- What: Mark rows as deleted without rewriting files (see Q11)
);

Interview tip: Know the top 3: autoOptimize, enableChangeDataFeed, and columnMapping.mode. These are the most commonly discussed in interviews.

SECTION 4: TIME TRAVEL & RECOVERY (30 min)

Q13: What is Time Travel? How to query old versions of a table?

Simple Explanation: Because Delta keeps a log of every change (the transaction log), you can "go back in time" and see what the data looked like at any previous point. This is called Time Travel.

Why is it useful?

  • Debugging: "The report showed wrong numbers yesterday — let me check yesterday's data"
  • Recovery: "Someone accidentally deleted 1000 bookings — let me restore them"
  • Auditing: "What did the passenger table look like before the migration?"
sql
-- METHOD 1: Query by version number
-- (Every commit gets a version: 0, 1, 2, 3, ...)
SELECT * FROM bookings VERSION AS OF 5;     -- See table as it was at version 5
SELECT * FROM bookings@v5;                   -- Shorthand for the same thing

-- METHOD 2: Query by timestamp
-- (Go back to a specific date/time)
SELECT * FROM bookings TIMESTAMP AS OF '2026-03-15 10:30:00';

-- See full history of all changes
DESCRIBE HISTORY bookings;
-- Shows: version, timestamp, operation (INSERT/DELETE/MERGE), user, metrics

-- RESTORE: Roll back the entire table to a previous version
RESTORE TABLE bookings TO VERSION AS OF 5;
-- WARNING: This creates a NEW version (not destructive) — you can undo the restore too!

Limits to remember:

  • Default data retention: 7 days — can't time travel beyond this (VACUUM deletes old files)
  • Default log retention: 30 days — DESCRIBE HISTORY works for 30 days
  • VACUUM breaks time travel for vacuumed versions
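How does TIMESTAMP AS OF pick a version? Conceptually, it resolves to the latest version committed at or before the requested time. A plain-Python sketch (timestamps invented; real Delta reads commit times from the log):

```python
def version_as_of(commit_timestamps, ts):
    """Resolve TIMESTAMP AS OF: the latest version whose commit time
    is <= the requested timestamp (index = version number)."""
    candidates = [v for v, t in enumerate(commit_timestamps) if t <= ts]
    if not candidates:
        raise ValueError("timestamp is before the table's first commit")
    return max(candidates)

# Commit times of versions 0, 1, 2 (ISO strings compare correctly):
history = ["2026-03-14 08:00:00", "2026-03-15 09:15:00", "2026-03-15 12:40:00"]
print(version_as_of(history, "2026-03-15 10:30:00"))
# → 1  (version 2 was committed after 10:30, so version 1 is the snapshot)
```

Once the version is resolved, the query is served exactly like VERSION AS OF — which is why both limits above (data and log retention) apply equally to timestamp queries.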

Q14: Scenario — Someone accidentally deleted critical passenger data 3 days ago. How to recover?

Simple Explanation: This is a very common interview scenario. The answer uses time travel to see the data before the delete, then restore it.

sql
-- Step 1: Find the version BEFORE the accidental delete
DESCRIBE HISTORY dim_passenger;
-- Look at the output — find the DELETE operation
-- Let's say the DELETE was at version 42
-- So we want version 41 (the version just before the delete)

-- Step 2 Option A: FULL RESTORE — simplest, rolls back entire table
RESTORE TABLE dim_passenger TO VERSION AS OF 41;
-- This makes the table look exactly like version 41
-- But it's a new version (43), so nothing is lost — you can undo this too

-- Step 2 Option B: SELECTIVE RESTORE — only bring back deleted rows
-- Use MERGE to insert only the rows that are missing (were deleted)
MERGE INTO dim_passenger AS target                        -- Current table (missing rows)
USING dim_passenger VERSION AS OF 41 AS source            -- Old version (has all rows)
ON target.passenger_id = source.passenger_id              -- Match by business key
WHEN NOT MATCHED THEN INSERT *;
-- This only inserts rows that exist in old version but NOT in current table
-- i.e., exactly the rows that were deleted

Interview tip: Option B is the better answer — it shows you understand MERGE + time travel together, and it's more surgical (doesn't overwrite any changes that happened after the delete).

SECTION 5: LAKEHOUSE ARCHITECTURE (30 min)

Q15: What is a Data Lakehouse? How is it different from Data Lake and Data Warehouse?

Simple Explanation:

Data Lake = Cheap cloud storage (like ADLS Gen2 or S3) where you dump all your raw data in any format (JSON, CSV, Parquet). It's cheap and flexible, but messy — no ACID transactions, no schema enforcement, slow for BI queries.

Data Warehouse = Expensive, structured database (like Azure Synapse, Snowflake) optimized for BI queries. Fast and well-governed, but expensive and bad for ML/unstructured data.

Data Lakehouse = The BEST of both. It takes the cheap storage of a data lake, adds Delta Lake for ACID transactions and schema enforcement, adds Photon for fast BI queries, and adds Unity Catalog for governance. You get data lake flexibility + data warehouse reliability at data lake prices.

Real-world analogy:

  • Data Lake = Big messy warehouse (cheap rent, hard to find things)
  • Data Warehouse = Expensive organized office (everything in place, but high rent)
  • Lakehouse = Organized warehouse (cheap rent + everything labeled and easy to find)
Aspect | Data Lake | Data Warehouse | Lakehouse
Storage cost | Cheap (ADLS/S3) | Expensive (proprietary) | Cheap (ADLS/S3)
File format | Open (Parquet, JSON) | Proprietary (locked in) | Open (Delta, Iceberg)
ACID transactions | No (data can get corrupted) | Yes | Yes (Delta Lake)
Schema | Schema-on-read (messy) | Schema-on-write (strict) | Both (flexible)
BI query speed | Slow | Fast | Fast (Photon engine)
ML support | Good | Poor | Excellent
Governance | Limited | Strong | Strong (Unity Catalog)

Q16: What technologies make the Lakehouse possible?

These are the key building blocks — know what each one does:

  1. Delta Lake → Adds ACID transactions to cloud storage (the foundation of lakehouse)
  2. Photon Engine → C++ query engine that makes queries as fast as a data warehouse (see Day 3)
  3. Unity Catalog → Centralized governance — who can access what data (see Day 3)
  4. Serverless SQL Warehouses → BI tools (Power BI, Tableau) connect directly to Databricks
  5. MLflow → Manage the ML lifecycle (track experiments, deploy models)
  6. Lakeflow Declarative Pipelines → Build ETL pipelines with built-in data quality checks (see Day 2)

Interview tip: When asked "Why Databricks over Snowflake?", mention: open formats (no vendor lock-in), unified BI + ML on one platform, Delta Lake is open source, and Photon gives warehouse-level speed on open data.

SECTION 6: NEW 2025-2026 FEATURES (30 min)

Q17: What's new in Delta Lake 4.x? (Mention 2-3 in interview to show you're up to date)

| Feature | Version | Simple Explanation |
|---|---|---|
| Variant Data Type | 4.0 | Store messy JSON data without defining a schema first. Useful when a source sends unpredictable JSON structures. |
| Type Widening | 4.0 | Change a column type (e.g., INT → BIGINT) without rewriting all data files. Before this, you had to recreate the table! |
| Coordinated Commits | 4.0 | Multiple writers from different systems can write to the same table safely. Useful for multi-cloud setups. |
| Delta Connect | 4.0 | Run Delta operations (MERGE, etc.) remotely over Spark Connect — no need to run on the same cluster. |
| Conflict-Free Deletion Vectors | 4.1 | Enable deletion vectors on a table without blocking other writers. Previously, enabling DVs required exclusive access. |
| Server-Side Planning | 4.1 | Query planning is done by the catalog server instead of the client — faster startup for large tables. |
| Atomic CTAS | 4.1 | CREATE TABLE AS SELECT is now fully atomic — if it fails midway, no partial table is left behind. |

Breaking change: Delta 4.x needs Spark 4.x and Java 17+ (older versions won't work).

Interview tip: Mention Variant Data Type and Type Widening — they solve real problems that interviewers care about.
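The two features worth memorizing can be sketched in a few lines of Databricks SQL. This is a hedged illustration, not official syntax from this guide: the table and column names (`travel_catalog.bookings.raw_events`, `seat_count`) are hypothetical, and the `delta.enableTypeWidening` table property assumes a Delta 4.0+ runtime.

```sql
-- VARIANT (Delta 4.0): store unpredictable JSON without defining a schema first.
CREATE TABLE travel_catalog.bookings.raw_events (
    event_id   BIGINT,
    seat_count INT,          -- intentionally narrow; widened below
    payload    VARIANT       -- semi-structured JSON, shape decided at read time
);

INSERT INTO travel_catalog.bookings.raw_events
SELECT 1, 180, PARSE_JSON('{"type": "booking", "pnr": "ABC123"}');

-- Extract fields with path syntax, casting as needed:
SELECT payload:type::STRING AS event_type,
       payload:pnr::STRING  AS pnr
FROM travel_catalog.bookings.raw_events;

-- TYPE WIDENING (Delta 4.0): widen INT -> BIGINT as a metadata-only change,
-- without rewriting existing Parquet data files.
ALTER TABLE travel_catalog.bookings.raw_events
    SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');

ALTER TABLE travel_catalog.bookings.raw_events
    ALTER COLUMN seat_count TYPE BIGINT;   -- no table recreation needed
```

The interview point the sketch supports: before type widening, this column change meant recreating the table and rewriting every file; now it is a single ALTER statement.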

Q18: What is Predictive Optimization?

Simple Explanation: Remember how we said you need to manually run OPTIMIZE, VACUUM, and ANALYZE TABLE to keep your tables healthy? Predictive Optimization does this AUTOMATICALLY. Databricks watches how your tables are used and runs these commands at the right time, without you scheduling anything.

Real-world analogy: Like a self-cleaning oven. Instead of manually scheduling "clean the oven every Sunday," the oven detects when it's dirty and cleans itself.

Key points:

  • Automatically runs OPTIMIZE (file compaction)
  • Automatically runs VACUUM (cleanup old files)
  • Automatically runs ANALYZE TABLE (refresh statistics)
  • Enabled by default on all new Unity Catalog managed tables (since 2025)
  • Learns your table's access patterns to optimize scheduling
  • No configuration needed — just use managed tables!

Amadeus answer: "For our 500+ Delta tables, Predictive Optimization eliminates the need for manual OPTIMIZE/VACUUM scheduling — the platform learns each table's access patterns and optimizes automatically. This saves our team hours of maintenance work."
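Although Predictive Optimization is on by default for new Unity Catalog managed tables, it can also be controlled explicitly. A hedged sketch of the control statements, using the hypothetical `travel_catalog` names from earlier examples:

```sql
-- Enable for everything in a catalog (schemas and tables inherit it):
ALTER CATALOG travel_catalog ENABLE PREDICTIVE OPTIMIZATION;

-- A schema can explicitly inherit the catalog-level setting:
ALTER SCHEMA travel_catalog.bookings INHERIT PREDICTIVE OPTIMIZATION;

-- Opt a single hot table out, e.g. one with its own tuned OPTIMIZE job:
ALTER TABLE travel_catalog.bookings.flights DISABLE PREDICTIVE OPTIMIZATION;
```

Settings cascade from catalog to schema to table, so the usual pattern is to enable once at the catalog level and only override the exceptions.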

Q19: What is Lakebase?

Simple Explanation: Lakebase is a brand NEW feature (GA on Azure March 2026). It's a serverless PostgreSQL-compatible database built into Databricks.

Why does it exist? Delta Lake is great for analytics (batch queries, BI), but NOT great for low-latency application queries (like "get this passenger's details in 10 ms for the mobile app"). Lakebase fills this gap — it's a real database for application use cases, running inside Databricks.

Key features:

  • Scale-to-zero: When nobody is querying, it costs $0 (shuts down automatically)
  • Database branching: Create an instant copy of your database for testing (like git branch for databases!)
  • Instant restore: Go back to any point in time if something goes wrong
  • Auto-failover HA: If one server fails, another takes over automatically

When to use Lakebase vs Delta tables:

| Use Case | Lakebase | Delta Table |
|---|---|---|
| App backend (API serving, low-latency lookups, CRUD) | ✅ Best choice | ❌ Too slow / not designed for this |
| Batch analytics (BI, reporting) | ❌ Not designed for this | ✅ Best choice |
| Feature serving for ML models | ✅ Good (low latency) | ✅ Good (batch) |

Q20: What is the difference between Managed and External tables?

Simple Explanation: When you create a table in Unity Catalog, you choose where the data is stored:

  • Managed Table: Databricks decides where to store the data (in a Databricks-managed location). If you DROP the table, BOTH the metadata AND the data are deleted. Simple and recommended for most cases.

  • External Table: YOU specify where the data lives (e.g., a specific ADLS Gen2 path). If you DROP the table, only the metadata is removed — the actual data files survive. Use this when data must persist even if the table definition is removed, or when data is shared with other systems.

Real-world analogy:

  • Managed = Renting a furnished apartment. If you end the lease, furniture goes too.
  • External = Renting an empty apartment and bringing your own furniture. If you end the lease, you take your furniture with you.

| Aspect | Managed Table | External Table |
|---|---|---|
| Where data lives | Databricks-managed location (auto) | Your ADLS Gen2 path (you specify) |
| DROP TABLE | Deletes metadata AND data | Deletes metadata ONLY — data survives |
| Predictive Optimization | ✅ Works automatically | ❌ Not supported |
| Best for | Most tables (default choice) | Data shared with other systems, legacy data |
| Governance | Full Unity Catalog governance | Needs External Location + Storage Credential setup |
```sql
-- Managed table (recommended for new tables — simpler to manage)
CREATE TABLE travel_catalog.bookings.flights (
    flight_id BIGINT,                  -- Unique flight identifier
    departure STRING,                  -- Departure airport code (e.g., BLR)
    arrival   STRING                   -- Arrival airport code (e.g., DEL)
);
-- Databricks stores the data in its managed location automatically.
-- DROP TABLE will delete both the metadata and the data.

-- External table (for data that must survive table drops)
CREATE TABLE travel_catalog.bookings.legacy_flights (
    flight_id BIGINT,
    departure STRING,
    arrival   STRING
) LOCATION 'abfss://container@storage.dfs.core.windows.net/legacy/flights/';
-- Data lives at YOUR ADLS Gen2 path.
-- DROP TABLE only removes the table definition; the data files stay in ADLS.
```

Amadeus use case: "We use managed tables for new Delta tables (Predictive Optimization works automatically). We use external tables for legacy data migrated from Oracle that other systems also read."

QUICK REVISION CHECKLIST — DAY 1

Test yourself — can you answer each in 2-3 minutes?

  • What is Delta Lake? What problem does it solve? (Q1)
  • What is the transaction log and how does it ensure ACID? (Q1)
  • What are checkpoint files and why are they needed? (Q2)
  • How does optimistic concurrency control work? (Q3)
  • What is data skipping and how do file-level statistics help? (Q4)
  • Can you write a MERGE with all 4 clauses (matched, not-matched, delete, not-matched-by-source)? (Q5)
  • How do you handle duplicate keys in source during MERGE? (Q6)
  • List 6 ways to optimize a slow MERGE. (Q7)
  • What is OPTIMIZE? What is VACUUM? What's the difference? (Q9)
  • What is Z-ORDER? What is Partitioning? What is Liquid Clustering? When to use each? (Q10)
  • What are Deletion Vectors and why are they useful? (Q11)
  • How does Time Travel work? How to recover deleted data? (Q13, Q14)
  • What is a Lakehouse? How is it different from Data Lake and Data Warehouse? (Q15)
  • What are 2-3 new features in Delta Lake 4.x? (Q17)
  • What is Predictive Optimization? (Q18)
  • What is the difference between Managed and External tables? (Q20)