Delta Lake — Complete Interview Guide
Memory Map
SECTION 0: DELTA LAKE vs DELTA TABLE — The #1 Confusion
The Simple Answer
Real-World Analogy
Technical Difference — What Each Actually IS
┌─────────────────────────────────────────────────────────────┐
│ DELTA LAKE (Technology)                                     │
│                                                             │
│  ┌─────────────────┐   ┌─────────────────┐                  │
│  │ Delta Table:    │   │ Delta Table:    │                  │
│  │ orders          │   │ customers       │                  │
│  │                 │   │                 │                  │
│  │ 📁 _delta_log/  │   │ 📁 _delta_log/  │ ← each table     │
│  │ 📄 data.parquet │   │ 📄 data.parquet │   has its OWN    │
│  └─────────────────┘   └─────────────────┘   transaction    │
│                                              log            │
│  Provides: ACID, Schema Enforcement, Time Travel,           │
│  MERGE, OPTIMIZE, VACUUM, Data Skipping                     │
└─────────────────────────────────────────────────────────────┘
Side-by-Side Comparison
| Aspect | Delta Lake | Delta Table |
|---|---|---|
| What is it? | Open-source storage layer/framework | A single table stored in Delta format |
| Created by | Databricks (open-sourced in 2019) | You, using CREATE TABLE ... USING DELTA |
| How many? | ONE per environment (it's the technology) | MANY — you create hundreds of Delta tables |
| Contains | The protocol, the rules, the engine | Parquet data files + _delta_log/ folder |
| Analogy | The operating system (Windows) | One file on that operating system |
| Lives where? | Runs inside Spark/Databricks runtime | Stored on S3 / ADLS / GCS / DBFS |
| Version | Delta Lake 3.x (the protocol version) | Table has its own version history (0, 1, 2, 3...) |
How You Create a Delta Table (using Delta Lake)
# Method 1: SQL
spark.sql("""
CREATE TABLE orders (
order_id INT,
customer STRING,
amount DOUBLE,
order_date DATE
) USING DELTA -- ← tells Spark: "use Delta Lake format"
LOCATION 's3://my-bucket/orders/'
""")
# Method 2: DataFrame write
df.write.format("delta").save("s3://my-bucket/orders/")
# ↑
# this "delta" = use Delta Lake technology
# Method 3: In Databricks (default since DBR 8.0+)
spark.sql("CREATE TABLE orders (...)")
# ↑ In Databricks, DELTA is the DEFAULT format
# You don't even need to say USING DELTA!
# What gets created on disk:
# s3://my-bucket/orders/
# ├── _delta_log/ ← Delta Lake adds this (transaction log)
# │ └── 00000000000000000000.json ← First commit: "table created with schema..."
# ├── part-00000.snappy.parquet ← Actual data (still Parquet!)
# └── part-00001.snappy.parquet ← More data
The Key Insight — Delta Table = Parquet + Transaction Log
What Delta Lake GIVES to a Delta Table
Interview Answers — What to Say
"Delta Lake is an open-source storage layer that sits on top of cloud storage like S3 or ADLS. It adds ACID transactions, schema enforcement, time travel, and scalable metadata handling to data lakes. Under the hood, data is still stored as Parquet files — Delta Lake adds a transaction log that tracks every change."
"A Delta table is a specific table stored in Delta format. It's a collection of Parquet data files plus a _delta_log folder that contains the transaction history. Every Delta table has its own independent transaction log."
"Delta Lake is the technology — the protocol and engine. A Delta table is one table that uses that technology. It's like MySQL vs a table in MySQL. Delta Lake provides the ACID guarantees, and each Delta table is an individual dataset that benefits from those guarantees."
What NOT to Say:
Quick Memory Trick
SECTION 1: WHY DELTA LAKE EXISTS
Q01 — What is Delta Lake? Why do we need it?
Question: Your interviewer asks: "Why can't we just use Parquet files on a data lake? Why do we need Delta Lake?"
The Problem (Plain Parquet):
The Solution (Delta Lake):
Delta Lake adds a transaction log (_delta_log/) on top of Parquet files. This log tracks every change — which files were added, removed, or modified. It turns a chaotic collection of files into a proper table with ACID transactions, schema enforcement, and version history.
One-liner for interviews: "Delta Lake is an open-source storage layer that adds ACID transactions, scalable metadata handling, and time travel to existing data lakes built on Parquet."
Q02 — What are the key features of Delta Lake? (The Top 8)
Question: "List the main features of Delta Lake and explain why each matters."
| # | Feature | What It Does | Why It Matters |
|---|---|---|---|
| 1 | ACID Transactions | Every write is atomic — all or nothing | No more corrupted tables from failed writes |
| 2 | Time Travel | Query any historical version of the table | Undo mistakes, audit changes, debug issues |
| 3 | Schema Enforcement | Rejects writes that don't match the table schema | Prevents bad data from entering your table |
| 4 | Schema Evolution | Automatically adds new columns when needed | Handles evolving source systems gracefully |
| 5 | MERGE (Upsert) | INSERT + UPDATE + DELETE in one atomic operation | The #1 operation for data engineering pipelines |
| 6 | Data Skipping | Skips files that can't contain query results | Makes queries 10-100x faster on large tables |
| 7 | Unified Batch + Streaming | Same table for both batch and streaming writes | No separate streaming tables needed |
| 8 | Open Format | Data stored as Parquet (open standard) | No vendor lock-in — read with Spark, Trino, Presto, etc. |
Interview Tip: Don't just list features — connect them to PROBLEMS they solve. "We needed MERGE because our booking system sends daily change feeds that contain inserts, updates, and cancellations — all in one file."
What NOT to Say: "Delta Lake is a database." — No, it's a STORAGE LAYER on top of cloud storage. The data still lives as Parquet files on ADLS/S3/GCS.
Q03 — How does Delta Lake ensure ACID transactions?
Question: "Walk me through how each ACID property is implemented in Delta Lake."
| Property | What It Means | How Delta Does It |
|---|---|---|
| Atomicity | All changes in a transaction succeed, or none do | Each commit writes ONE JSON file atomically using put-if-absent semantics (atomic rename on HDFS). If the commit file isn't written, nothing changes. |
| Consistency | Data always follows schema rules | Schema enforcement rejects writes with wrong column types/names |
| Isolation | Concurrent readers/writers don't interfere | Snapshot isolation — readers see the table as it was when they started reading. Writers use optimistic concurrency (see Q05). |
| Durability | Committed data survives crashes | Data is stored as Parquet on cloud storage (S3/ADLS/GCS) — inherently durable |
Follow-up they'll ask: "What happens if a write fails midway?"
Interview Tip: "The key insight is that the commit JSON file is the SINGLE point of atomicity. Data files can be partially written — it doesn't matter until the commit JSON makes them visible."
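The put-if-absent idea can be sketched in plain Python using exclusive file creation. This is a simplified stand-in for the real commit protocol, not Delta's actual implementation; the function name and action shape are illustrative:

```python
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Atomically publish a commit: create <version>.json only if it does
    not already exist. Open mode 'x' (exclusive create) fails when the file
    is present, mirroring cloud put-if-absent semantics."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:  # 'x' = create, fail if exists
            json.dump(actions, f)
        return True                 # we won the commit for this version
    except FileExistsError:
        return False                # another writer committed first; retry at version+1

log_dir = tempfile.mkdtemp()
assert commit(log_dir, 0, {"add": ["part-00000.parquet"]}) is True
assert commit(log_dir, 0, {"add": ["part-00001.parquet"]}) is False  # loser retries at v1
```

This is why partially written Parquet files are harmless: until a commit file for version N exists, no reader ever sees them.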
SECTION 2: TRANSACTION LOG DEEP DIVE
Q04 — What is the Delta Transaction Log (_delta_log)? What's inside each commit?
Question: "Explain the Delta transaction log in detail. What exactly is stored in each commit file?"
Each JSON commit file contains these actions:
// Example: 00000000000000000005.json (Commit #5)
{
"commitInfo": {
"timestamp": 1711670400000,
"operation": "MERGE",
"operationParameters": {"predicate": "t.id = s.id"},
"operationMetrics": {
"numTargetRowsInserted": "1500",
"numTargetRowsUpdated": "300",
"numTargetRowsDeleted": "50"
}
},
"add": {
"path": "part-00000-new-file.snappy.parquet",
"size": 1073741824,
"partitionValues": {"date": "2026-03-29"},
"stats": "{\"numRecords\":50000,\"minValues\":{\"id\":1},\"maxValues\":{\"id\":50000}}"
},
"remove": {
"path": "part-00000-old-file.snappy.parquet",
"deletionTimestamp": 1711670400000,
"dataChange": true
}
}
| Action | What It Records | Example |
|---|---|---|
add | New Parquet file was added to the table | After INSERT, MERGE, or OPTIMIZE |
remove | File was logically deleted (still physically exists!) | After UPDATE, DELETE, or OPTIMIZE |
metaData | Schema change, table properties changed | After ALTER TABLE or schema evolution |
protocol | Reader/writer version requirements | After enabling new features (e.g., Deletion Vectors) |
commitInfo | Who, when, what operation, and metrics | Every commit has this |
Critical detail: remove is LOGICAL, not physical
Interview Tip: "The transaction log is append-only — you never modify existing commit files, you only add new ones. This is what makes it safe for concurrent readers."
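The add/remove replay above can be sketched in a few lines of plain Python. This is a toy model (in-memory commits instead of real JSON files) just to show how the current snapshot falls out of the append-only log:

```python
# Three toy commits, in _delta_log order. Each commit is a list of actions.
commits = [
    [{"add": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}],     # v0: initial load
    [{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}],  # v1: UPDATE rewrote a
    [{"add": {"path": "d.parquet"}}],                                     # v2: append
]

def live_files(commits):
    """Replay add/remove actions in commit order to compute the files
    that make up the CURRENT snapshot of the table."""
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])  # logical delete only
    return files

assert live_files(commits) == {"b.parquet", "c.parquet", "d.parquet"}
```

Note that a.parquet drops out of the snapshot at v1 but still exists on disk, which is exactly what makes time travel to v0 possible until VACUUM removes it.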
Q05 — What are checkpoint files? Why are they critical?
Question: "Without checkpoints, what happens when a Delta table has 100,000 commits?"
Problem without checkpoints:
Solution — Checkpoints:
Key details:
- Checkpoint interval: delta.checkpointInterval (default: 10 commits)
- A _last_checkpoint file points to the latest checkpoint
- Checkpoints are stored in Parquet format (columnar, fast to read)
- Multi-part checkpoints exist for very large tables (splits into multiple files)
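The reader path with checkpoints can be sketched as: load the full state from the checkpoint, then replay only the JSON commits written after it. A toy model (dict/list stand-ins for the real Parquet and JSON files):

```python
def snapshot(checkpoint, tail_commits):
    """Start from the checkpointed file set (the checkpoint is a full
    snapshot of table state), then replay only the commits written
    AFTER the checkpoint version instead of the whole log."""
    files = set(checkpoint["files"])
    for actions in tail_commits:  # e.g. commits 11..14 after a v10 checkpoint
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

checkpoint = {"version": 10, "files": ["a.parquet", "b.parquet"]}
tail = [[{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}]]  # commit 11
assert snapshot(checkpoint, tail) == {"b.parquet", "c.parquet"}
```

With 100,000 commits and a checkpoint every 10, a reader processes one Parquet checkpoint plus at most 9 small JSON files instead of 100,000.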
Q06 — How does Optimistic Concurrency Control work?
Question: "Two pipelines write to the same Delta table at the same time. What happens?"
Two isolation levels:
| Level | Default? | Behavior | Use When |
|---|---|---|---|
| WriteSerializable | Yes | Concurrent writes to different files succeed | Normal ETL workloads |
| Serializable | No | Even reads during writes can conflict | Strict audit/compliance tables |
How to AVOID conflicts in practice:
- Partition writes by date/region so different pipelines touch different files
- Add the partition column to the MERGE condition: ON t.id = s.id AND t.date = s.date
- Use Liquid Clustering to minimize file overlap
Interview Tip: "The word 'optimistic' is key — Delta assumes no conflict, lets both writers proceed, and only checks at commit time. This is efficient because most concurrent writes DON'T conflict."
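The commit-time conflict check can be sketched as a set-overlap test. This is a simplified model of the WriteSerializable case (real conflict detection also looks at operation types and predicates), with made-up file names:

```python
def can_commit(my_read_version, my_files_read, log):
    """Optimistic check at commit time: did any commit made AFTER the
    version we read touch files our transaction depends on?"""
    for version, touched_files in log:
        if version > my_read_version and touched_files & my_files_read:
            return False  # conflict → caller must re-read and retry
    return True

# Committed log: commit v5 rewrote p1, commit v6 rewrote p9
log = [(5, {"p1.parquet"}), (6, {"p9.parquet"})]

# Writer A read at v5 and only touches the p2/p3 partition files → no overlap
assert can_commit(5, {"p2.parquet", "p3.parquet"}, log) is True
# Writer B read at v4 and depends on p1, which commit v5 rewrote → conflict
assert can_commit(4, {"p1.parquet"}, log) is False
```

This is why partitioning writers onto different dates/regions works: their file sets never intersect, so the check always passes.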
SECTION 3: MERGE — THE MOST IMPORTANT COMMAND
Q07 — Write a MERGE from memory (upsert with all clauses)
Question: "Write a MERGE that handles inserts, updates, and deletes in one command."
This is the #1 coding question in Databricks interviews. Memorize it.
-- MERGE = Compare source (new data) with target (existing table), then act
MERGE INTO orders_fact AS t -- t = target (existing table)
USING orders_staging AS s -- s = source (new data)
ON t.order_id = s.order_id -- Match condition
-- Case 1: Record exists AND is cancelled → delete it
WHEN MATCHED AND s.status = 'CANCELLED' THEN
DELETE
-- Case 2: Record exists AND has changed → update it
WHEN MATCHED AND s.updated_at > t.updated_at THEN
UPDATE SET
t.status = s.status,
t.amount = s.amount,
t.updated_at = s.updated_at
-- Case 3: Record doesn't exist in target → insert it
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, status, amount, created_at, updated_at)
VALUES (s.order_id, s.customer_id, s.status, s.amount, s.created_at, s.updated_at)
-- Case 4: Record in target but NOT in source → clean up (Databricks extension)
WHEN NOT MATCHED BY SOURCE AND t.status = 'PENDING' THEN
DELETE
PySpark equivalent:
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "orders_fact")
target.alias("t").merge(
source_df.alias("s"),
"t.order_id = s.order_id"
).whenMatchedDelete(
condition="s.status = 'CANCELLED'"
).whenMatchedUpdate(
condition="s.updated_at > t.updated_at",
set={"status": "s.status", "amount": "s.amount", "updated_at": "s.updated_at"}
).whenNotMatchedInsertAll(
).execute()
Q08 — How do you handle duplicate keys in source during MERGE?
Question: "Your staging table has duplicate order_ids. MERGE throws an error. How do you fix it?"
Why it fails: MERGE requires each target row to match AT MOST one source row. Duplicate keys = ambiguous → error.
The fix — deduplicate source with ROW_NUMBER:
WITH deduped_source AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY order_id -- Group duplicates together
ORDER BY updated_at DESC -- Keep the LATEST record
) AS rn
FROM orders_staging
)
MERGE INTO orders_fact AS t
USING (SELECT * FROM deduped_source WHERE rn = 1) AS s -- Only latest per key
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
Why duplicates happen in real life:
- Kafka retry delivered the same event twice
- Source system sent the same file twice
- Backfill job reprocessed already-loaded data
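The keep-the-latest-per-key logic of the ROW_NUMBER pattern above can be expressed in plain Python (dict rows stand in for the staging table):

```python
def dedupe_latest(rows, key="order_id", order_col="updated_at"):
    """Keep only the newest row per key — the plain-Python analogue of
    ROW_NUMBER() OVER (PARTITION BY key ORDER BY order_col DESC) ... rn = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_col] > latest[k][order_col]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"order_id": 1, "status": "PENDING", "updated_at": "2026-03-01"},
    {"order_id": 1, "status": "SHIPPED", "updated_at": "2026-03-02"},  # retry, newer
    {"order_id": 2, "status": "PENDING", "updated_at": "2026-03-01"},
]
out = dedupe_latest(rows)
assert len(out) == 2
assert next(r for r in out if r["order_id"] == 1)["status"] == "SHIPPED"
```

After this step every key appears exactly once in the source, so MERGE's "at most one source row per target row" rule is satisfied.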
Q09 — How to make MERGE faster? (6 techniques)
Question: "Our MERGE takes 3 hours on a 500 GB table. How do you optimize it?"
Why MERGE is slow: It scans the ENTIRE target to find matching records. 500 GB table + 10K source rows = reads all 500 GB just to match 10K rows.
1. Add partition column to match condition:
-- BAD: Scans ALL data across ALL dates
MERGE INTO orders t USING staging s
ON t.order_id = s.order_id -- Scans 500 GB!
-- GOOD: Only scans matching date partitions
MERGE INTO orders t USING staging s
ON t.order_id = s.order_id AND t.order_date = s.order_date -- Scans ~2 GB!
2. Z-ORDER or Liquid Cluster on merge key:
OPTIMIZE orders ZORDER BY (order_id);
-- Data skipping now works for MERGE — skips irrelevant files
3. Filter unchanged rows from source:
MERGE INTO orders t
USING (
SELECT s.* FROM staging s
LEFT JOIN orders t ON s.order_id = t.order_id
WHERE t.order_id IS NULL OR s.hash_value != t.hash_value
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
4. Broadcast small source:
from pyspark.sql.functions import broadcast

target.alias("t").merge(
    broadcast(source_df).alias("s"),  # Source < 100 MB → broadcast to all workers
    "t.order_id = s.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
5. Compact target first:
OPTIMIZE orders; -- 50,000 small files → 500 large files → fewer tasks
6. Use Photon runtime:
- Select "Photon" when creating cluster — 3-5x faster for MERGE
- No code changes needed
Q10 — MERGE with Schema Evolution
Question: "Source data has a new column that doesn't exist in the target. How do you handle it?"
-- Enable auto schema merge
SET spark.databricks.delta.schema.autoMerge.enabled = true;
-- MERGE as normal — new columns from source auto-added to target
MERGE INTO customers t
USING staging s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET * -- * includes new columns
WHEN NOT MATCHED THEN INSERT * -- New column auto-added to target schema
When to use: Source systems add new fields over time (very common). When NOT to use: Strict regulatory tables where schema changes need approval.
SECTION 4: OPTIMIZE, VACUUM, AND THE SMALL FILE PROBLEM
Q11 — What is OPTIMIZE? When is it good? When is it bad?
Question: "Explain OPTIMIZE. What are its trade-offs?"
What it does: Combines many small files into fewer large files (~1 GB each).
-- Basic OPTIMIZE
OPTIMIZE orders;
-- Example: 10,000 files × 1 MB each → 10 files × 1 GB each
-- OPTIMIZE specific partition
OPTIMIZE orders WHERE order_date >= '2026-03-01';
-- OPTIMIZE with Z-ORDER (organize data within files)
OPTIMIZE orders ZORDER BY (customer_id, order_date);
GOOD things about OPTIMIZE:
| Benefit | Why |
|---|---|
| Faster reads | Fewer files = fewer I/O operations = faster queries |
| Better data skipping | Sorted data has tighter min/max ranges per file |
| Faster MERGE | Fewer files = fewer tasks for Spark |
| Non-destructive | Old files still exist (for time travel) until VACUUM |
BAD things about OPTIMIZE:
| Downside | Why |
|---|---|
| Rewrites ALL files | Even unchanged files get rewritten (Z-ORDER does full rewrite) |
| Expensive for large tables | 1 TB table = reads 1 TB + writes 1 TB = 2 TB of I/O |
| Doubles storage temporarily | Old files + new files exist until VACUUM |
| Not incremental (with Z-ORDER) | OPTIMIZE without Z-ORDER is incremental in newer versions |
| Blocks concurrent writes briefly | Takes a commit lock during the commit phase |
Best practices:
Interview Tip:
"OPTIMIZE is a WRITE operation that benefits READS. Run it during low-traffic windows. For streaming tables, use autoOptimize table property instead of manual OPTIMIZE."
Q12 — What is VACUUM? What are the risks?
Question: "What does VACUUM do? Can I run VACUUM RETAIN 0 HOURS? What happens?"
What VACUUM does:
-- Default: delete files older than 7 days
VACUUM orders;
-- Custom retention
VACUUM orders RETAIN 168 HOURS; -- 168 hours = 7 days
-- DANGEROUS — delete ALL old files immediately
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM orders RETAIN 0 HOURS;
Risks of VACUUM RETAIN 0 HOURS:
| Risk | Explanation |
|---|---|
| Breaks ALL time travel | Can't query any previous version — old files are gone |
| Active readers may fail | A running query that started before VACUUM may reference deleted files → FileNotFoundException |
| No recovery | Deleted files are gone permanently — no undo |
| Concurrent write failures | Writers that started with old versions lose their reference files |
Rule: NEVER use RETAIN 0 HOURS in production. Default 7 days is a safety net.
OPTIMIZE vs VACUUM — they work as a pair:
Q13 — The Small File Problem — Root Causes and Solutions
Question: "Your Delta table has 50,000 small files (each <1 MB). Queries take 45 minutes. How do you fix it?"
Why small files are bad:
Root causes of small files:
| Cause | Example |
|---|---|
| Streaming micro-batches | Processing every 10 seconds → 6 files/minute → 8,640 files/day |
| Over-partitioning | PARTITIONED BY (date, hour, region, category) → millions of tiny partitions |
| Frequent small appends | 100-row inserts every minute |
| Too many shuffle partitions | spark.sql.shuffle.partitions = 200 writes 200 files per write |
Immediate fix:
-- Step 1: Compact files now
OPTIMIZE slow_table;
-- 50,000 × 1 MB → ~50 × 1 GB
-- Step 2: Clean up old files
VACUUM slow_table;
-- Step 3: Refresh statistics
ANALYZE TABLE slow_table COMPUTE STATISTICS;
Prevent future small files:
-- Option A: Auto-optimization table properties
ALTER TABLE my_table SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true', -- Coalesce during write
'delta.autoOptimize.autoCompact' = 'true' -- Mini-OPTIMIZE after each write
);
-- Option B: For streaming — increase the trigger interval (PySpark)
--   .trigger(processingTime="5 minutes")   instead of "10 seconds"
-- Option C: For new tables — use Liquid Clustering
CREATE TABLE orders (...) CLUSTER BY (order_date, customer_id);
-- Handles compaction automatically
-- Option D: Reduce shuffle partitions (PySpark)
--   spark.conf.set("spark.sql.shuffle.partitions", "auto")   # Adaptive (AQE)
SECTION 5: DATA ORGANIZATION — PARTITIONING vs Z-ORDER vs LIQUID CLUSTERING
Q14 — Partitioning — When is it good? When is it bad?
Question: "When should you partition a Delta table? When should you NOT?"
-- Creating a partitioned table
CREATE TABLE bookings (
booking_id LONG,
passenger_name STRING,
booking_date DATE,
amount DECIMAL(10,2)
) PARTITIONED BY (booking_date);
GOOD — Use partitioning when:
- Column has low cardinality (few unique values): date, country, region
- Each partition has at least 1 GB of data
- Most queries filter on the partition column
- Example: PARTITIONED BY (year) where each year has 50 GB+ of data
BAD — Don't partition when:
- Column has high cardinality (many unique values): user_id, order_id
- Partitions are too small (< 1 GB each) → small file problem
- You filter on multiple columns → partitioning only helps ONE column
- Example: PARTITIONED BY (customer_id) with 10 million customers = disaster
The 1 GB rule:
Q15 — Z-ORDER — How does it work internally?
Question: "What is Z-ORDER? How is it different from sorting? How many columns can you Z-ORDER on?"
What Z-ORDER does:
Regular sorting (ORDER BY) sorts by one column, then another. Z-ORDER uses a space-filling curve to interleave multiple columns, so data is co-located on ALL specified columns simultaneously.
-- Z-ORDER is always used WITH OPTIMIZE
OPTIMIZE orders ZORDER BY (customer_id, order_date);
How it helps — data skipping becomes effective:
Limitations of Z-ORDER:
| Limitation | Explanation |
|---|---|
| Max 4 columns effective | More columns = less effective (diminishing returns) |
| Full rewrite every time | OPTIMIZE ZORDER rewrites ALL files — expensive on large tables |
| Not incremental | Can't Z-ORDER just the new data — must redo everything |
| Manual operation | Must run OPTIMIZE manually or schedule it |
| Can't change columns easily | Changing Z-ORDER columns requires full rewrite |
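The bit-interleaving idea behind the space-filling curve can be sketched for two integer columns. This is a minimal Morton-code illustration, not Delta's actual Z-ORDER implementation (which also handles other types via range partitioning):

```python
def z_value(x, y, bits=8):
    """Interleave the bits of two column values into one Morton (Z-order)
    key. Sorting rows by this key co-locates rows that are close in BOTH
    dimensions, which tightens per-file min/max ranges for both columns."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions ← x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  ← y
    return z

assert z_value(0b0000, 0b0000) == 0
assert z_value(0b1111, 0b0000) == 0b01010101  # x bits land in even positions
assert z_value(0b0000, 0b1111) == 0b10101010  # y bits land in odd positions
```

Contrast with ORDER BY (x, y): a plain sort gives perfect locality on x but scatters y, so data skipping on y alone barely works; the interleaved key trades a little locality on each column for usable locality on all of them.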
Q16 — Liquid Clustering — The Modern Replacement
Question: "What is Liquid Clustering? Why does Databricks say it replaces both partitioning AND Z-ORDER?"
-- Create table with Liquid Clustering
CREATE TABLE orders (
order_id LONG,
customer_id LONG,
order_date DATE,
amount DECIMAL(10,2)
) CLUSTER BY (order_date, customer_id);
-- Change clustering columns anytime — NO full rewrite!
ALTER TABLE orders CLUSTER BY (region, order_date);
-- OPTIMIZE is now INCREMENTAL — only processes new/changed data
OPTIMIZE orders;
Why Liquid Clustering is better:
| Aspect | Partitioning | Z-Ordering | Liquid Clustering |
|---|---|---|---|
| How it works | Separate folders per value | Sorts data within files | Auto-organizes data incrementally |
| Good for | Low cardinality only | High cardinality | Any cardinality |
| Applied when | On write (rigid) | Manual OPTIMIZE | Automatically on writes |
| Incremental | N/A | No (full rewrite) | Yes (only new/changed data) |
| Change columns | Requires full table rewrite | Must re-OPTIMIZE everything | Just ALTER TABLE |
| Small file handling | Can cause small files | Doesn't help | Prevents small files |
| Replaces others? | — | No | Yes — replaces BOTH |
When to use what (2026 recommendation):
Interview Tip:
"For new tables, I always use CLUSTER BY instead of PARTITIONED BY. It's incremental, handles any cardinality, and I can change the clustering columns without rewriting the table."
Q17 — Data Skipping and File-Level Statistics
Question: "How does Delta Lake skip files during queries? What are file-level statistics?"
SELECT * FROM orders WHERE order_date = '2026-03-15' AND customer_id = 42
Behind the scenes:
Key details:
- Stats stored for the first 32 columns by default
- Config: delta.dataSkippingNumIndexedCols (default 32)
- Move frequently filtered columns into the first 32 positions of your schema
- Works BEST when data is sorted/clustered (Z-ORDER or Liquid Clustering)
- Stats include: min, max, null count, row count per file
Why Z-ORDER/Clustering improves data skipping:
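A minimal sketch of stats-based pruning (file names and ranges are made up): on unclustered data every file's min/max tends to span the whole key range, so nothing can be skipped; after clustering, each file covers a tight range and most files are eliminated before any I/O happens.

```python
def files_to_scan(files, column, value):
    """Keep only files whose [min, max] range for the column could
    contain the predicate value — stats-based data skipping in one line."""
    return [f["path"] for f in files
            if f["min"][column] <= value <= f["max"][column]]

# After clustering on customer_id, each file covers a tight id range:
files = [
    {"path": "f1.parquet", "min": {"customer_id": 1},   "max": {"customer_id": 100}},
    {"path": "f2.parquet", "min": {"customer_id": 101}, "max": {"customer_id": 200}},
    {"path": "f3.parquet", "min": {"customer_id": 201}, "max": {"customer_id": 300}},
]
assert files_to_scan(files, "customer_id", 42) == ["f1.parquet"]  # 2 of 3 files skipped
```

With unsorted data, all three files would report min=1, max=300 and every query would scan everything; the stats only become selective once the data is physically organized.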
SECTION 6: TIME TRAVEL, CLONES, AND RECOVERY
Q18 — Time Travel — Query any version of your table
Question: "How does time travel work? Show me all the ways to use it."
-- METHOD 1: Query by version number
SELECT * FROM orders VERSION AS OF 5;
SELECT * FROM orders@v5; -- Shorthand
-- METHOD 2: Query by timestamp
SELECT * FROM orders TIMESTAMP AS OF '2026-03-15 10:30:00';
-- See full history of changes
DESCRIBE HISTORY orders;
-- Shows: version, timestamp, operation, user, metrics
-- RESTORE: Roll back the entire table to a previous version
RESTORE TABLE orders TO VERSION AS OF 5;
-- Creates a NEW version (non-destructive) — you can undo the restore too!
-- Compare two versions
SELECT * FROM orders@v10 EXCEPT ALL SELECT * FROM orders@v5;
-- Shows rows that exist in v10 but not v5
Limits:
- Data retention: 7 days default (VACUUM deletes files older than this)
- Log retention: 30 days default (DESCRIBE HISTORY works for 30 days)
- After VACUUM, time travel is broken for vacuumed versions
Interview Tip: "Time travel works because Delta keeps old Parquet files around. VACUUM is what deletes them. So time travel range = VACUUM retention period."
Q19 — Scenario: Accidental DELETE — How to recover?
Question: "Someone ran DELETE FROM customers WHERE region = 'APAC' by mistake. 2 million rows deleted. How do you recover?"
-- Step 1: Find the version BEFORE the accidental delete
DESCRIBE HISTORY customers;
-- Look for the DELETE operation — say it was at version 42
-- We want version 41 (just before the delete)
-- Option A: FULL RESTORE (simple but blunt)
RESTORE TABLE customers TO VERSION AS OF 41;
-- Table looks exactly like version 41
-- BUT: any legitimate changes between 42 and now are also lost!
-- Option B: SELECTIVE RESTORE (surgical — better answer!)
MERGE INTO customers AS target
USING customers VERSION AS OF 41 AS source
ON target.customer_id = source.customer_id
WHEN NOT MATCHED THEN INSERT *;
-- Only inserts rows that were deleted (exist in v41 but not current)
-- Preserves any legitimate changes that happened after the delete
Interview Tip: "Option B is the better answer — it shows you understand MERGE + time travel together, and it's surgical (doesn't overwrite changes made after the accidental delete)."
Q20 — Deep Clone vs Shallow Clone
Question: "What are clones? When would you use deep vs shallow?"
| Aspect | Deep Clone | Shallow Clone |
|---|---|---|
| Copies data? | YES — full independent copy | NO — references source files |
| Independent? | Fully independent | Depends on source files |
| Speed | Slow (copies all data) | Fast (copies only metadata) |
| Use case | Production backups, migration | Testing, experimentation |
| VACUUM safe? | Yes — owns its data | NO — vacuuming source breaks clone |
| Storage cost | 2x (full copy) | Minimal (metadata only) |
-- Deep clone: Full independent copy
CREATE TABLE orders_backup DEEP CLONE orders;
-- Subsequent runs are INCREMENTAL (only copies changes since last clone)
-- Shallow clone: Fast reference copy
CREATE TABLE orders_test SHALLOW CLONE orders;
-- Perfect for testing — make changes without affecting production
-- WARNING: If source is VACUUMed, shallow clone may break!
SECTION 7: SPECIAL FEATURES
Q21 — Change Data Feed (CDF)
Question: "How do you track what rows changed in a Delta table? How does downstream consume only the changes?"
-- Step 1: Enable Change Data Feed on the table
ALTER TABLE orders SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- Step 2: Read only the changes between two versions
SELECT * FROM table_changes('orders', 5, 10);
-- Returns ALL changes between version 5 and 10 with metadata:
What CDF returns:
| Change Type | Meaning |
|---|---|
insert | New row was added |
update_preimage | Row BEFORE the update (old values) |
update_postimage | Row AFTER the update (new values) |
delete | Row was deleted |
Why CDF matters for data engineering:
PySpark to read changes:
# Read changes since version 5
changes_df = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 5) \
.table("orders")
# Read changes since a timestamp
changes_df = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingTimestamp", "2026-03-15") \
.table("orders")
Interview Tip: "CDF is essential for incremental ETL in medallion architecture. Instead of reprocessing all Silver data to update Gold, we read only the changes. This reduces pipeline runtime from hours to minutes."
Q22 — Deletion Vectors — Faster DELETEs and UPDATEs
Question: "How do Deletion Vectors improve UPDATE/DELETE performance?"
Without Deletion Vectors:
With Deletion Vectors:
-- Enable deletion vectors
ALTER TABLE orders SET TBLPROPERTIES (
'delta.enableDeletionVectors' = 'true'
);
Trade-offs:
| Aspect | Without DV | With DV |
|---|---|---|
| Write speed (UPDATE/DELETE) | Slow (full file rewrite) | Fast (tiny marker file) |
| Read speed | Normal | Slightly slower (must check DV files) |
| Storage | Clean | Accumulates DV files until OPTIMIZE |
| Best for | Small tables | Large tables with frequent updates |
Interview Tip: "Deletion Vectors change the write pattern from copy-on-write to a more efficient mark-and-sweep approach. The actual file rewrite is deferred to OPTIMIZE, which runs during off-peak hours."
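The read path with a deletion vector can be sketched as a position filter. A toy model: the real vector is a compressed bitmap (RaringBitmap-style) stored alongside the Parquet file, but the idea is the same:

```python
def read_with_dv(file_rows, deletion_vector):
    """A DELETE with deletion vectors writes only a small set/bitmap of
    deleted row POSITIONS; readers filter those positions out, instead of
    the writer rewriting the whole Parquet file up front."""
    return [row for pos, row in enumerate(file_rows)
            if pos not in deletion_vector]

file_rows = ["row0", "row1", "row2", "row3"]  # contents of one Parquet file
dv = {1, 3}                                   # DELETE marked positions 1 and 3
assert read_with_dv(file_rows, dv) == ["row0", "row2"]
```

Deleting 2 rows from a 1 GB file thus costs a few bytes of marker data at write time; the 1 GB rewrite is deferred until OPTIMIZE merges the vector back into clean files.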
Q23 — Delta Sharing — Share data without copying
Question: "How can you share Delta tables with external organizations without copying data?"
How it works:
Provider (your org) Recipient (partner org)
┌─────────────────┐ ┌─────────────────┐
│ Delta Table on │ │ Any client: │
│ your storage │───── REST API ────→│ - Spark │
│ (S3/ADLS/GCS) │ (open protocol) │ - pandas │
└─────────────────┘ │ - Power BI │
↑ │ - Databricks │
Data stays HERE └─────────────────┘
(never copied) Reads directly from
your storage
# Recipient reads shared data — one line!
df = spark.read.format("deltaSharing").load("profile.json#share.schema.table")
Key benefits:
- No data copying — recipient reads directly from your storage
- Open protocol — not locked to Databricks (works with any client)
- Provider controls access — revoke anytime, audit all reads
- Live data — recipient always sees the latest version
Q24 — UniForm — One table, all formats
Question: "What is UniForm? Why does it matter?"
Problem: Different tools expect different table formats:
- Databricks → Delta Lake
- Snowflake → Iceberg
- Trino/Presto → Hudi or Iceberg
- BigQuery → Iceberg
UniForm solution: Write once as Delta, automatically generates Iceberg and Hudi metadata.
-- Enable UniForm on a table
ALTER TABLE orders SET TBLPROPERTIES (
'delta.universalFormat.enabledFormats' = 'iceberg'
);
-- Now this Delta table can be read as an Iceberg table by Snowflake, Trino, etc.
-- Data is NOT duplicated — only metadata is generated in Iceberg format
Why it matters:
- One copy of data, readable by ALL engines
- No ETL to copy data between formats
- Eliminates "format wars" (Delta vs Iceberg vs Hudi)
Q25 — Schema Evolution — All the options
Question: "How does Delta handle schema changes? What options are available?"
Option 1: Add new columns (mergeSchema):
df.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.saveAsTable("orders")
# New columns in df are auto-added to the table schema
# Existing columns are preserved
Option 2: Replace entire schema (overwriteSchema):
df.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable("orders")
# Completely replaces the schema — destructive!
Option 3: Rename/drop columns (requires column mapping):
-- Enable column mapping first
ALTER TABLE orders SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name');
-- Now you can rename and drop
ALTER TABLE orders RENAME COLUMN old_name TO new_name;
ALTER TABLE orders DROP COLUMN unused_column;
Option 4: Type widening (Delta 4.0+):
-- Change column type without rewriting data
ALTER TABLE orders SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
ALTER TABLE orders ALTER COLUMN amount TYPE DECIMAL(20,2);
-- INT → BIGINT, FLOAT → DOUBLE, etc. — NO full rewrite needed!
Q26 — Important Delta Table Properties (Know These!)
Question: "What are the most important Delta table properties?"
ALTER TABLE orders SET TBLPROPERTIES (
-- AUTO OPTIMIZATION: Fix small files automatically
'delta.autoOptimize.optimizeWrite' = 'true',
-- Coalesces small output files during write
'delta.autoOptimize.autoCompact' = 'true',
-- Runs mini-OPTIMIZE after each write
-- CHANGE DATA FEED: Track row-level changes for incremental ETL
'delta.enableChangeDataFeed' = 'true',
-- DELETION VECTORS: Faster UPDATE/DELETE
'delta.enableDeletionVectors' = 'true',
-- COLUMN MAPPING: Enable rename/drop columns
'delta.columnMapping.mode' = 'name',
-- RETENTION: How long to keep old data
'delta.logRetentionDuration' = 'interval 30 days',
'delta.deletedFileRetentionDuration' = 'interval 7 days',
-- TYPE WIDENING: Allow type changes without rewrite
'delta.enableTypeWidening' = 'true',
-- UNIFORM: Auto-generate Iceberg metadata
'delta.universalFormat.enabledFormats' = 'iceberg'
);
SECTION 8: DELTA LAKE vs COMPETITORS
Q27 — Delta Lake vs Apache Iceberg vs Apache Hudi
Question: "How does Delta Lake compare to Iceberg and Hudi? Why would you choose one over another?"
| Aspect | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Created by | Databricks | Netflix → Apache | Uber → Apache |
| Best with | Databricks, Spark | Snowflake, Trino, Spark, Flink | Spark, Flink |
| ACID | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes (limited) |
| MERGE performance | Excellent (Photon) | Good | Good (MoR tables) |
| Streaming | Excellent | Good | Excellent |
| Schema Evolution | Good | Excellent (best) | Good |
| Partition Evolution | Via Liquid Clustering | Native (hidden partitioning) | Manual |
| Multi-engine support | UniForm bridges gap | Best (truly engine-agnostic) | Good |
| Community/Ecosystem | Large (Databricks-led) | Fastest growing | Smaller |
| Vendor independence | Databricks-optimized | Most vendor-neutral | Less common |
When to choose each:
- Delta Lake — you're on Databricks/Spark; deepest integration (Photon, Unity Catalog) and excellent MERGE/streaming performance
- Apache Iceberg — multi-engine stack (Snowflake, Trino, Flink, Spark); best schema/partition evolution and the most vendor-neutral choice
- Apache Hudi — streaming-heavy upsert workloads on Spark/Flink, especially if you need Merge-on-Read tables
Interview Tip: "Don't bash other formats. Say: 'Delta is the natural choice on Databricks because of deep integration with Photon, Unity Catalog, and Predictive Optimization. But UniForm lets us expose tables as Iceberg for teams using Snowflake.'"
Q28 — Managed vs External Tables
Question: "What is the difference? When do you use each?"
| Aspect | Managed Table | External Table |
|---|---|---|
| Data location | Databricks-managed (auto) | YOUR storage path (you specify) |
| DROP TABLE | Deletes metadata AND data | Deletes metadata ONLY — data survives |
| Predictive Optimization | Yes (automatic) | No |
| Best for | Most tables (default) | Shared data, legacy migration |
-- Managed (recommended for new tables)
CREATE TABLE catalog.schema.orders (order_id LONG, amount DECIMAL);
-- Data stored in Databricks-managed location
-- DROP TABLE deletes everything
-- External (for data shared with other systems)
CREATE TABLE catalog.schema.legacy_orders (order_id LONG, amount DECIMAL)
LOCATION 'abfss://container@storage.dfs.core.windows.net/legacy/orders/';
-- Data stays at YOUR path even if table is dropped
SECTION 9: PRODUCTION SCENARIOS & GOTCHAS
Q29 — Predictive Optimization — Auto-maintenance
Question: "What is Predictive Optimization? How does it work?"
Predictive Optimization is Databricks' auto-maintenance system. It watches your tables and automatically runs OPTIMIZE, VACUUM, and ANALYZE TABLE at the right time.
Key points:
- Enabled by default on all new Unity Catalog managed tables
- Learns access patterns → optimizes during low-traffic windows
- Automatically runs: OPTIMIZE (compaction), VACUUM (cleanup), ANALYZE TABLE (refresh stats)
- No configuration needed — just use managed tables
- Not available on external tables
Interview Tip: "With Predictive Optimization, I no longer need to schedule nightly OPTIMIZE/VACUUM jobs. The platform handles it. This is one reason I prefer managed tables over external tables."
Q30 — Scenario: Design a Delta Lake pipeline for a real-time order system
Question: "Design a medallion architecture for an e-commerce order system using Delta Lake."
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   BRONZE    │      │   SILVER    │      │    GOLD     │
│  Raw data   │────→│   Cleaned   │────→│ Aggregated  │
│   as-is     │      │  validated  │      │  business   │
└─────────────┘      └─────────────┘      └─────────────┘

BRONZE LAYER:
- Auto Loader ingests raw JSON/CSV from cloud storage
- CLUSTER BY (ingest_date)
- Schema: raw_payload STRING, source STRING, ingest_timestamp TIMESTAMP
- enableChangeDataFeed = true (for Silver to read changes)

SILVER LAYER:
- Reads BRONZE changes via CDF (incremental!)
- MERGE to deduplicate and validate
- CLUSTER BY (order_date, customer_id)
- Schema enforcement + data quality checks (expectations)
- enableChangeDataFeed = true (for Gold to read changes)

GOLD LAYER:
- Reads SILVER changes via CDF
- Aggregations: daily revenue, customer metrics, product performance
- Optimized for BI queries (pre-computed)
- CLUSTER BY (report_date)
Key decisions in this design:
- Liquid Clustering everywhere (not partitioning)
- CDF enabled at each layer for incremental processing
- Managed tables for Predictive Optimization
- Auto Loader for exactly-once ingestion
- MERGE for deduplication at Silver layer
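The CDF handoff between layers can be sketched in plain Python — a toy model of how a downstream table applies the change rows CDF exposes. The change types (`insert`, `update_preimage`, `update_postimage`, `delete`) are real CDF values; the data structures are illustrative, not the Delta API:

```python
# Toy model of Change Data Feed consumption: the Silver layer tracks a
# last-processed version and applies only row-level changes committed after
# it. Dicts stand in for tables; this mirrors the semantics, not the engine.

def apply_changes(table, changes):
    """Apply CDF-style change rows (change_type, key, row) to a dict table."""
    for change_type, key, row in changes:
        if change_type in ("insert", "update_postimage"):
            table[key] = row              # upsert the new row image
        elif change_type == "delete":
            table.pop(key, None)          # remove the deleted key
        # "update_preimage" rows carry the OLD values; consumers usually skip them
    return table

# What the change feed for two commits might surface, row by row
feed = [
    ("insert",           1, {"amount": 100}),
    ("update_preimage",  1, {"amount": 100}),
    ("update_postimage", 1, {"amount": 150}),
    ("insert",           2, {"amount": 200}),
    ("delete",           2, None),
]

silver = apply_changes({}, feed)
print(silver)   # {1: {'amount': 150}} — only the net effect survives
```

This is why CDF makes Silver incremental: it processes only the changed rows, never a full re-scan of Bronze.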
Q31 — Common production gotchas with Delta Lake
Question: "What are the most common issues you've seen with Delta Lake in production?"
| Gotcha | Symptom | Fix |
|---|---|---|
| Small file problem | Queries slow, too many Spark tasks | OPTIMIZE + autoOptimize properties |
| VACUUM too aggressive | Time travel broken, readers fail with FileNotFoundException | Keep default 7-day retention, never RETAIN 0 |
| MERGE without partition filter | MERGE takes hours on large tables | Add partition column to ON clause |
| Over-partitioning | Thousands of tiny files per partition | Switch to Liquid Clustering |
| Schema mismatch | Write fails with schema error | Use mergeSchema or overwriteSchema as appropriate |
| Concurrent write conflicts | ConcurrentModificationException | Partition writes so pipelines touch different files |
| Stale statistics | Data skipping not working, queries slow | ANALYZE TABLE or enable Predictive Optimization |
| Z-ORDER on too many columns | No improvement in query speed | Max 4 columns, choose most filtered ones |
| Not using Photon | MERGE and queries 3-5x slower than needed | Switch cluster to Photon runtime |
| Streaming + OPTIMIZE conflict | OPTIMIZE blocks streaming writes | Use autoCompact instead of manual OPTIMIZE |
SECTION 10: NEW FEATURES (2025-2026)
Q32 — What's new in Delta Lake 4.x?
Question: "What are the latest features? Mention 2-3 to show you're up to date."
| Feature | Version | What It Does |
|---|---|---|
| Variant Data Type | 4.0 | Store semi-structured JSON without schema — query nested fields directly |
| Type Widening | 4.0 | Change column type (INT → BIGINT) without rewriting data |
| Coordinated Commits | 4.0 | Multiple engines can safely write to the same Delta table |
| UniForm | 4.0 | Auto-generate Iceberg/Hudi metadata for cross-engine reads |
| Conflict-Free DV | 4.1 | Enable Deletion Vectors without blocking concurrent writers |
| Atomic CTAS | 4.1 | CREATE TABLE AS SELECT is fully atomic (no partial tables on failure) |
| Lakebase | 2026 | Serverless PostgreSQL-compatible database inside Databricks |
Interview power move: "I'd use Liquid Clustering with Deletion Vectors enabled and CDF turned on — that gives me automatic compaction, fast updates, and incremental ETL. For cross-team sharing with Snowflake users, I'd enable UniForm so they read the same table as Iceberg."
QUICK REVISION CHECKLIST
Can you answer each in 2-3 minutes?
- Why do we need Delta Lake over plain Parquet? (Q01)
- List 8 key features of Delta Lake (Q02)
- How does Delta ensure ACID? What happens if a write fails midway? (Q03)
- What's inside a transaction log commit? (Q04)
- What are checkpoints and why are they critical? (Q05)
- How does optimistic concurrency work? (Q06)
- Write a MERGE from memory with all 4 clauses (Q07)
- How to handle duplicate keys in MERGE? (Q08)
- List 6 ways to make MERGE faster (Q09)
- What is OPTIMIZE? Good and bad? (Q11)
- What is VACUUM? Risks of RETAIN 0? (Q12)
- How to fix the small file problem? (Q13)
- Partitioning vs Z-ORDER vs Liquid Clustering — when to use each? (Q14-Q16)
- How does data skipping work? (Q17)
- How does Time Travel work? Recover from accidental delete? (Q18-Q19)
- What is CDF? How does it help incremental ETL? (Q21)
- What are Deletion Vectors? Trade-offs? (Q22)
- Delta vs Iceberg vs Hudi — when to choose each? (Q27)
- Design a medallion architecture with Delta (Q30)
- Name 3 new features in Delta 4.x (Q32)
APPENDIX: LEARN BY WATCHING + DOING
(This section replaces YouTube videos — read + copy-paste + see the output)
LAB 1 — Watch the Transaction Log Grow in Real Time
Goal: See EXACTLY what Delta writes on every operation. No magic.
Where: Databricks Community Edition (free) OR any Databricks workspace
Time: 10 minutes
Setup (copy-paste this first)
# ─── SETUP ────────────────────────────────────────────────
# Create a fresh Delta table in a clean location
path = "/tmp/delta_lab/orders"
# Clean up anything from previous runs
dbutils.fs.rm(path, recurse=True) # ← wipes old data (safe, /tmp only)
# Create initial data
from pyspark.sql import Row
data = [
Row(order_id=1, customer="Alice", amount=100.0),
Row(order_id=2, customer="Bob", amount=200.0),
Row(order_id=3, customer="Carol", amount=300.0),
]
df = spark.createDataFrame(data)
# Write as Delta — this creates the table AND the _delta_log
df.write.format("delta").save(path)
print("✅ Delta table created at:", path)
Step 1 — Look at what Delta just created
# List files in the folder
files = dbutils.fs.ls(path)
for f in files:
    print(f.name)
# EXPECTED OUTPUT:
# _delta_log/ ← the transaction log folder
# part-00000-abc123.snappy.parquet ← actual data (Parquet)
# part-00001-def456.snappy.parquet
# Now peek inside _delta_log
log_files = dbutils.fs.ls(f"{path}/_delta_log")
for f in log_files:
    print(f.name)
# EXPECTED OUTPUT:
# 00000000000000000000.json ← ONE commit file. This is commit #0.
Step 2 — Read commit #0 (see the "birth certificate" of the table)
# Read the first commit JSON
commit_0 = spark.read.json(f"{path}/_delta_log/00000000000000000000.json")
commit_0.show(truncate=False)
# EXPECTED OUTPUT (5 rows, one per action):
# ┌──────────┬────────────────────────────────────────┐
# │ action │ details │
# ├──────────┼────────────────────────────────────────┤
# │ protocol │ minReaderVersion=1, minWriterVersion=2 │ ← compat info
# │ metaData │ schema={order_id INT, customer STR,..} │ ← table schema
# │ add │ path=part-00000-abc.parquet, size=921 │ ← data file 1
# │ add │ path=part-00001-def.parquet, size=874 │ ← data file 2
# │ commitInfo│ operation=WRITE, timestamp=... │ ← who did what
# └──────────┴────────────────────────────────────────┘
Step 3 — Do an UPDATE. Watch a NEW commit appear.
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, path)
# Update Alice's amount from 100 → 150
dt.update(
condition = "customer = 'Alice'",
set = {"amount": "150.0"}
)
# Now check _delta_log AGAIN
log_files = dbutils.fs.ls(f"{path}/_delta_log")
for f in log_files:
    print(f.name)
# EXPECTED OUTPUT:
# 00000000000000000000.json ← original commit (unchanged)
# 00000000000000000001.json ← NEW commit! ✨ This is the UPDATE
Step 4 — Read commit #1. See what Delta did behind the scenes.
commit_1 = spark.read.json(f"{path}/_delta_log/00000000000000000001.json")
commit_1.show(truncate=False)
# EXPECTED OUTPUT:
# ┌───────────┬───────────────────────────────────────────┐
# │ action │ details │
# ├───────────┼───────────────────────────────────────────┤
# │ remove │ path=part-00000-abc.parquet (OLD file) │ ← old file marked removed
# │ add │ path=part-00002-xyz.parquet (NEW file) │ ← new file with updated row
# │ commitInfo│ operation=UPDATE, numUpdatedRows=1 │
# └───────────┴───────────────────────────────────────────┘
# 🧠 KEY INSIGHT:
# Delta NEVER modifies a Parquet file in place.
# It writes a NEW file with the updated row,
# and marks the OLD file as "removed" in the log.
# The old file still EXISTS on disk until VACUUM runs!
Step 5 — Check the history (Delta's "audit log")
dt.history().select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)
# EXPECTED OUTPUT:
# ┌───────┬──────────────────┬─────────┬────────────────────────┐
# │version│ timestamp │ operation│ operationMetrics │
# ├───────┼──────────────────┼─────────┼────────────────────────┤
# │ 1 │ 2026-04-05 10:31 │ UPDATE │ {numUpdatedRows: 1} │
# │ 0 │ 2026-04-05 10:30 │ WRITE │ {numOutputRows: 3} │
# └───────┴──────────────────┴─────────┴────────────────────────┘
#
# Every commit = one row in DESCRIBE HISTORY. That's your audit trail.
Step 6 — TIME TRAVEL (go back to BEFORE the update)
# Read version 0 (the original, before Alice got updated)
df_before = spark.read.format("delta").option("versionAsOf", 0).load(path)
df_before.show()
# EXPECTED OUTPUT (Alice is still 100.0):
# +--------+--------+------+
# |order_id|customer|amount|
# +--------+--------+------+
# | 1| Alice| 100.0| ← original value!
# | 2| Bob| 200.0|
# | 3| Carol| 300.0|
# +--------+--------+------+
# Now read current version (version 1)
df_now = spark.read.format("delta").load(path)
df_now.show()
# EXPECTED OUTPUT:
# +--------+--------+------+
# |order_id|customer|amount|
# +--------+--------+------+
# | 1| Alice| 150.0| ← updated value
# | 2| Bob| 200.0|
# | 3| Carol| 300.0|
# +--------+--------+------+
#
# 🎯 Same table. Two different versions. This is Time Travel.
Now you've SEEN the transaction log grow. No video needed.
LAB 2 — Watch MERGE Work (The #1 Delta Operation)
Goal: See insert + update + delete happen in ONE atomic operation.
# ─── SETUP target table ───────────────────────────
target_path = "/tmp/delta_lab/customers"
dbutils.fs.rm(target_path, recurse=True)
target_data = [
Row(id=1, name="Alice", status="active"),
Row(id=2, name="Bob", status="active"),
Row(id=3, name="Carol", status="active"),
]
spark.createDataFrame(target_data).write.format("delta").save(target_path)
print("📋 TARGET (before MERGE):")
spark.read.format("delta").load(target_path).show()
# +---+-----+------+
# | id| name|status|
# +---+-----+------+
# | 1|Alice|active|
# | 2| Bob|active|
# | 3|Carol|active|
# +---+-----+------+
# ─── Build source (daily change feed) ──────────────
source_data = [
Row(id=2, name="Bob", status="inactive"), # UPDATE (Bob deactivated)
Row(id=3, name="Carol", status="DELETE_ME"), # DELETE (special marker)
Row(id=4, name="Dave", status="active"), # INSERT (new customer)
]
source_df = spark.createDataFrame(source_data)
print("📥 SOURCE (daily change feed):")
source_df.show()
# +---+----+---------+
# | id|name| status|
# +---+----+---------+
# | 2| Bob| inactive| ← will UPDATE existing row
# | 3|Carol|DELETE_ME| ← will DELETE existing row
# | 4|Dave| active| ← will INSERT new row
# +---+----+---------+
# ─── THE MERGE ─────────────────────────────────────
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, target_path)
dt.alias("t").merge(
source_df.alias("s"),
"t.id = s.id" # ← match on id
).whenMatchedDelete(
condition = "s.status = 'DELETE_ME'" # ← if marker, delete
).whenMatchedUpdate(
set = {"name": "s.name", "status": "s.status"} # ← else update
).whenNotMatchedInsert(
values = {"id": "s.id", "name": "s.name", "status": "s.status"} # ← new rows: insert
).execute()
print("📋 TARGET (after MERGE):")
spark.read.format("delta").load(target_path).orderBy("id").show()
# +---+-----+--------+
# | id| name| status|
# +---+-----+--------+
# | 1|Alice| active| ← unchanged
# | 2| Bob|inactive| ← UPDATED ✅
# | 4| Dave| active| ← INSERTED ✅
# +---+-----+--------+
# ← Carol (id=3) DELETED ✅
# ─── Verify: one commit for the whole thing ────────
dt.history().select("version", "operation", "operationMetrics").show(truncate=False)
# +-------+-----------+--------------------------------+
# |version| operation | operationMetrics               |
# +-------+-----------+--------------------------------+
# |   1   | MERGE     | numTargetRowsInserted=1        |
# |       |           | numTargetRowsUpdated=1         |
# |       |           | numTargetRowsDeleted=1         |
# |   0   | WRITE     | numOutputRows=3                |
# +-------+-----------+--------------------------------+
#
# 🎯 ONE atomic commit handled INSERT + UPDATE + DELETE together.
# That's why MERGE is the most important Delta operation.
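The clause logic of that MERGE can be mirrored in a few lines of plain Python. Dicts stand in for the target table — this is a sketch of the semantics (matched clauses are tried in order, delete condition first), not the Delta engine:

```python
# Plain-Python model of Lab 2's MERGE: for each source row, matched keys hit
# the when-matched clauses in order (conditional delete, then update);
# unmatched keys hit when-not-matched-insert. All in one pass.

target = {1: ("Alice", "active"), 2: ("Bob", "active"), 3: ("Carol", "active")}
source = [(2, "Bob", "inactive"), (3, "Carol", "DELETE_ME"), (4, "Dave", "active")]

for sid, name, status in source:
    if sid in target:                       # MATCHED
        if status == "DELETE_ME":           # whenMatchedDelete(condition=...)
            del target[sid]
        else:                               # whenMatchedUpdate
            target[sid] = (name, status)
    else:                                   # NOT MATCHED
        target[sid] = (name, status)        # whenNotMatchedInsert

print(sorted(target.items()))
# [(1, ('Alice', 'active')), (2, ('Bob', 'inactive')), (4, ('Dave', 'active'))]
```

Same net result as the lab: Bob updated, Carol deleted, Dave inserted, Alice untouched — except Delta does it as ONE atomic commit.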
LAB 3 — Watch OPTIMIZE Fix the Small File Problem
# ─── Create a table with MANY small files (simulating streaming) ─────
small_files_path = "/tmp/delta_lab/small_files"
dbutils.fs.rm(small_files_path, recurse=True)
# Write 10 times, one row each time → 10 tiny files
for i in range(10):
    spark.createDataFrame([Row(id=i, val=f"row_{i}")]) \
        .write.format("delta").mode("append").save(small_files_path)
# Count Parquet files
parquet_files = [f.name for f in dbutils.fs.ls(small_files_path) if f.name.endswith(".parquet")]
print(f"🔴 Parquet files BEFORE OPTIMIZE: {len(parquet_files)}")
# EXPECTED: 🔴 Parquet files BEFORE OPTIMIZE: 10
#
# 🧠 Each file is tiny (few hundred bytes) → disaster for query performance
# ─── Run OPTIMIZE ───────────────────────────────────
spark.sql(f"OPTIMIZE delta.`{small_files_path}`")
parquet_files = [f.name for f in dbutils.fs.ls(small_files_path) if f.name.endswith(".parquet")]
print(f"🟢 Parquet files AFTER OPTIMIZE: {len(parquet_files)}")
# EXPECTED: 🟢 Parquet files AFTER OPTIMIZE: 11
#
# Wait — 11? Why not 1?
# → OPTIMIZE created 1 NEW compacted file (with all 10 rows)
# → The 10 OLD files are marked "removed" in _delta_log but still on disk
# → VACUUM will clean them up later
#
# Verify: check the log
dt = DeltaTable.forPath(spark, small_files_path)
dt.history().select("version", "operation", "operationMetrics").show(truncate=False)
# Latest row: operation=OPTIMIZE, numFilesAdded=1, numFilesRemoved=10 ✅
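The "11 files, not 1" result can be modeled in a few lines of plain Python (illustrative file names; this models the bookkeeping, not real Delta internals): OPTIMIZE adds one compacted file and only *logically* removes the old ones, and physical deletion waits for VACUUM.

```python
# Toy model of OPTIMIZE + VACUUM bookkeeping: a file list on "disk" plus a
# set of files the _delta_log has marked as removed.

on_disk = [f"part-{i:05d}.parquet" for i in range(10)]   # 10 small files
removed_in_log = set()

def optimize(files, removed):
    live = [f for f in files if f not in removed]
    files.append("part-compacted.parquet")   # new big file written to disk
    removed.update(live)                     # old files marked removed in the log
    return files, removed

def vacuum(files, removed):
    # physically delete files the log no longer considers live
    return [f for f in files if f not in removed], removed

on_disk, removed_in_log = optimize(on_disk, removed_in_log)
print(len(on_disk))   # 11 — exactly what the lab shows after OPTIMIZE

on_disk, removed_in_log = vacuum(on_disk, removed_in_log)
print(len(on_disk))   # 1 — only after VACUUM do the 10 old files disappear
```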
VISUAL ANIMATION 1 — Optimistic Concurrency Conflict
What happens when TWO jobs try to update the same table at the same time?
TIME → T0 T1 T2 T3
│ │ │ │
JOB A ├─ reads v5 ────┼─ writes data ──┼─ commits v6 ─✅ │
│ │ │ │
JOB B ├─ reads v5 ────┼─ writes data ──┼────────────────┼─ tries to commit v6 ❌
│ │ │ │ CONFLICT!
│ │ │ │ Job A already wrote v6
│ │ │ │
│ │ │ ├─ Delta checks:
│ │ │ │ "Did Job A touch the
│ │ │ │ same files as me?"
│ │ │ │
│ │ │ ├─ IF NO overlap:
│ │ │ │ retry commit as v7 ✅
│ │ │ │
│ │ │ └─ IF YES overlap:
│ │ │ throw ConcurrentModification ❌
│ │ │ → user must retry the WHOLE job
KEY INSIGHT:
• Delta does NOT lock the table while writing (unlike traditional DBs)
• Both jobs write data files in PARALLEL — no slowdown
• Conflict is detected ONLY at commit time (when writing the JSON)
• Only ONE commit can "win" version 6 → the other must retry or fail
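The same protocol fits in a toy compare-and-swap on the log (plain Python, not Delta's actual implementation): a commit succeeds only if the version it claims is still free; losers with disjoint files retry at the next version, losers with overlapping files fail.

```python
# Toy optimistic concurrency: the log maps version -> (job, files_touched).
# Committing = claiming the next free version; overlap check decides
# auto-retry vs ConcurrentModificationException.

log = {}   # version -> (job, files_touched)

def try_commit(read_version, job, files):
    version = read_version + 1
    while version in log:                    # another job already won this slot
        _, winner_files = log[version]
        if winner_files & files:             # overlapping files → real conflict
            raise RuntimeError("ConcurrentModificationException")
        version += 1                         # disjoint files → retry next slot
    log[version] = (job, files)
    return version

print(try_commit(5, "A", {"f1"}))   # 6 — Job A wins version 6
print(try_commit(5, "B", {"f2"}))   # 7 — touched different files, auto-retried as v7
try:
    try_commit(5, "C", {"f1"})      # overlaps Job A's files at v6
except RuntimeError as e:
    print(e)                        # ConcurrentModificationException
```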
VISUAL ANIMATION 2 — Time Travel Mechanics
DELTA TABLE /customers/ — Timeline of Operations:
v0: CREATE v1: INSERT 100 v2: UPDATE 20 v3: DELETE 5
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ empty │ │ file-A │ │ file-A │ removed │ file-A │ removed
│ │ → │ │ → │ file-B │ │ file-B │
│ │ │ │ │ file-C │ added │ file-C │
│ │ │ │ │ │ │ file-D │ added
└─────────┘ └─────────┘ └─────────┘ └─────────┘
_delta_log _delta_log _delta_log _delta_log
/0.json /0.json /0.json /0.json
/1.json /1.json /1.json
/2.json /2.json
/3.json
TIME TRAVEL QUERIES:
SELECT * ... VERSION AS OF 0 → reads NO files (empty table)
SELECT * ... VERSION AS OF 1 → reads file-A (100 rows)
SELECT * ... VERSION AS OF 2 → reads file-B, file-C (80 old + 20 updated)
SELECT * ... VERSION AS OF 3 → reads file-B, file-C, file-D (75 rows after delete)
🧠 KEY INSIGHT:
Delta doesn't "roll back" anything for Time Travel.
It just reads the OLD files that are STILL on disk
(because VACUUM hasn't removed them yet).
Time Travel = reading old files according to an old commit.
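That insight fits in one small function: replay the log's add/remove actions up to version N and you have the file set for that version. The log below mirrors the timeline above (illustrative tuples, not real Delta JSON):

```python
# Time Travel as log replay: a "snapshot" at version N is just the set of
# files that are still live after applying commits 0..N.

log = [
    [],                                                            # v0: CREATE
    [("add", "file-A")],                                           # v1: INSERT
    [("remove", "file-A"), ("add", "file-B"), ("add", "file-C")],  # v2: UPDATE
    [("add", "file-D")],                                           # v3: DELETE
]

def snapshot(log, version):
    live = set()
    for commit in log[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            else:
                live.discard(path)
    return sorted(live)

print(snapshot(log, 1))   # ['file-A']
print(snapshot(log, 3))   # ['file-B', 'file-C', 'file-D']
```

No rollback, no restore — `VERSION AS OF 1` is just a replay that stops early. It only breaks when VACUUM has physically deleted file-A.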
VISUAL ANIMATION 3 — Z-ORDER Rearranging Data
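A minimal sketch of the mechanism in plain Python: Z-ORDER sorts rows by the bit-interleaved (Morton) value of the clustered columns, so rows close in BOTH dimensions land in the same file and each file's min/max stats cover a tight range — which is exactly what data skipping needs. (Illustrative toy, not the engine.)

```python
# Morton (Z-order) curve in miniature: interleave the bits of two columns,
# sort by the interleaved value, and watch each "file" cover a tight
# quadrant in BOTH dimensions.

def z_value(x, y, bits=4):
    """Interleave the bits of x and y: x3 y3 x2 y2 x1 y1 x0 y0."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)   # x bits at odd positions
        z |= ((y >> i) & 1) << (2 * i)       # y bits at even positions
    return z

rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: z_value(*r))

# Cut into 4 "files" of 4 rows and show each file's per-column min/max
for chunk in (rows[i:i + 4] for i in range(0, 16, 4)):
    xs, ys = [r[0] for r in chunk], [r[1] for r in chunk]
    print(chunk, "x:", (min(xs), max(xs)), "y:", (min(ys), max(ys)))
```

Each "file" covers one quadrant (e.g. x in 0..1 AND y in 0..1), so a filter like `x = 3 AND y = 0` prunes 3 of the 4 files via min/max stats. A plain sort by x alone would give every file the full y range — no skipping on y.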
GOTCHAS — The Weird Errors You'll Hit (and How to Fix Them)
Gotcha 1: VACUUM RETAIN 0 HOURS — "Invalid retention" Error
Why it happens: Delta's default minimum retention is 7 days. VACUUM RETAIN 0 HOURS would delete files that RUNNING QUERIES might still be reading → data corruption.
Fix:
# DANGER: only disable if you KNOW no concurrent reads are running
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql(f"VACUUM delta.`{path}` RETAIN 0 HOURS")
# Then TURN IT BACK ON:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
Interview trap: If they ask "how would you immediately VACUUM a Delta table?" and you say "RETAIN 0 HOURS," they'll follow up with "What's the risk?" → answer: concurrent readers can get FileNotFoundException.
Gotcha 2: Schema Mismatch — "A schema mismatch detected when writing"
Why it happens: Your source DataFrame has a column the target table doesn't have (or different type).
Fix:
# Option 1: allow new columns to be added automatically
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(path)  # ← the mergeSchema option is what solves it
# Option 2 (for MERGE operations):
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
Interview trap: "How is this different from schema ENFORCEMENT?" → Enforcement REJECTS bad writes by default. mergeSchema=true opts in to schema EVOLUTION.
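The enforcement-vs-evolution policy in miniature, as plain Python (dicts stand in for schemas; this models the behavior, not the Spark API):

```python
# Schema enforcement vs evolution: by default a write with unknown columns is
# rejected; with merge_schema the target schema is widened to the union.

def write(target_schema, incoming_schema, merge_schema=False):
    extra = set(incoming_schema) - set(target_schema)
    if extra and not merge_schema:
        # ENFORCEMENT: reject the bad write outright
        raise ValueError(f"schema mismatch: unexpected columns {sorted(extra)}")
    # EVOLUTION: accept and add the new columns
    return {**target_schema, **incoming_schema}

target = {"order_id": "int", "amount": "double"}
incoming = {"order_id": "int", "amount": "double", "coupon": "string"}

try:
    write(target, incoming)                    # enforcement kicks in
except ValueError as e:
    print(e)                                   # schema mismatch: ... ['coupon']

print(write(target, incoming, merge_schema=True))
# {'order_id': 'int', 'amount': 'double', 'coupon': 'string'}
```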
Gotcha 3: ConcurrentAppendException
Why it happens: Two jobs MERGED into the same partition at the same time. Both added files → conflict at commit time.
Fix:
# Add a filter in your MERGE condition to restrict partitions
dt.alias("t").merge(
source.alias("s"),
"t.date = '2026-04-05' AND t.id = s.id" # ← partition filter narrows conflict scope
).whenMatchedUpdate(...).execute()
# Or: retry with exponential backoff
import time
for attempt in range(3):
    try:
        # your merge
        break
    except Exception as e:
        if "ConcurrentAppend" in str(e):
            time.sleep(2 ** attempt)
        else:
            raise
Gotcha 4: MERGE with Duplicate Keys in Source
Why it happens: Your source has 2+ rows with the same join key → Delta can't decide which one wins.
Fix:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc
# Deduplicate source BEFORE merging
w = Window.partitionBy("id").orderBy(desc("updated_at"))
source_deduped = source_df.withColumn("rn", row_number().over(w)) \
.filter("rn = 1") \
.drop("rn")
# Now source has ONE row per id — merge will succeed
Interview trap: This is THE most common MERGE production bug. If you know this fix, you sound senior.
Gotcha 5: Small Files Problem in Streaming
Why it happens: Streaming commits every micro-batch, and each trigger writes new small files — a minute-level trigger alone produces over a thousand tiny files per day.
Fix:
# Enable auto-compaction + optimized writes on the table
spark.sql(f"""
ALTER TABLE delta.`{path}`
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
)
""")
# Also: run scheduled OPTIMIZE daily
spark.sql(f"OPTIMIZE delta.`{path}`")
Gotcha 6: "Delta Table not found" when path looks correct
Why it happens: Either:
- You wrote as Parquet but are trying to read as Delta
- You deleted _delta_log/ manually (DON'T)
- Path typo
Fix:
# Always verify before reading
print(dbutils.fs.ls(path)) # check path exists
print(dbutils.fs.ls(f"{path}/_delta_log")) # check _delta_log exists
# Convert plain Parquet to Delta if needed:
from delta.tables import DeltaTable
DeltaTable.convertToDelta(spark, f"parquet.`{path}`")
MOCK INTERVIEW — Live Walkthrough
Scenario: Senior Data Engineer Round (30 min Delta deep-dive)
INTERVIEWER: "We have a 2 TB Delta table being updated every hour by a MERGE job. Lately, queries are getting slower and slower. Walk me through how you'd debug this."
❌ BAD ANSWER:
"I would run OPTIMIZE to fix it."
(Too short. No reasoning. No investigation. Red flag.)
✅ GOOD ANSWER:
"Three things come to mind, and I'd investigate in order. First, I'd run DESCRIBE HISTORY and DESCRIBE DETAIL to check the file count and size distribution — if we have thousands of small files from hourly MERGEs, that's the small-file problem and OPTIMIZE would help. Second, I'd check if any column has high-cardinality filters in queries — if yes, we might need Z-ORDER on that column for data skipping. Third, I'd check the MERGE condition itself — if it's not partition-pruning, every MERGE is rewriting files across the whole table. Let me describe how I'd verify each..."
FOLLOW-UP TRAP: (interviewer nods) "Good. Now — OPTIMIZE is running but taking 4 hours and blocking the pipeline. What do you do?"
❌ BAD ANSWER: "Run it less often."
✅ GOOD ANSWER:
"I'd do three things. First, run OPTIMIZE with a partition filter so it only compacts recent partitions — OPTIMIZE table WHERE date >= current_date() - 7. Second, enable auto-compaction at the table level so small files get compacted incrementally during writes, reducing how much OPTIMIZE has to do. Third, if we're on a recent Databricks runtime, I'd consider switching from partitioning + Z-ORDER to Liquid Clustering, which handles incremental clustering automatically without blocking writes."
INTERVIEWER: "A user accidentally ran DELETE FROM customers WHERE status = 'active' — deleted 4 million rows. It's been 2 hours. Can you recover?"
❌ BAD ANSWER: "No, the data is gone."
❌ BAD ANSWER 2: "Yes, just restore from backup." (This is Delta — there's no separate backup needed)
✅ GOOD ANSWER:
"Yes, absolutely — Delta's time travel makes this easy. Two hours ago is well within default retention. I'd run DESCRIBE HISTORY customers to find the version BEFORE the DELETE, then either: option A — RESTORE TABLE customers TO VERSION AS OF to restore the whole table, or option B — if we want to preserve changes AFTER the bad DELETE, do INSERT INTO customers SELECT * FROM customers VERSION AS OF to re-insert just the deleted rows. Option B is safer in production because it doesn't rewind other legitimate changes."
FOLLOW-UP TRAP: "What if it had been 10 days ago, not 2 hours?"
✅ GOOD ANSWER:
"Then it depends on retention settings. Default is 7 days for data file retention, 30 days for log retention. If VACUUM ran with default retention, the old data files would already be deleted from storage — even though the transaction log would still reference them. Time Travel would fail with FileNotFoundException. The prevention is setting delta.deletedFileRetentionDuration to something higher like 30 days for critical tables, or taking scheduled shallow clones as point-in-time snapshots."
INTERVIEWER: "Explain what happens on disk when you run UPDATE customers SET email = LOWER(email)."
❌ BAD ANSWER: "It updates the rows."
✅ GOOD ANSWER:
"Nothing is modified in place. Delta reads the Parquet files containing matching rows, writes NEW Parquet files with the lowercased emails, and writes a new JSON commit to _delta_log that has remove entries for all the old files and add entries for all the new files. The old files stay on disk until VACUUM runs, which is what enables time travel back to the pre-update version. Note: with Deletion Vectors enabled, UPDATE can be faster because it writes a small vector file marking which rows changed, avoiding the full file rewrite."

INTERVIEWER: "We have Job A doing hourly MERGE into table X, and Job B doing hourly OPTIMIZE on table X. They started conflicting. Why, and how do you fix it?"
❌ BAD ANSWER: "Run them at different times."
✅ GOOD ANSWER:
"This is an optimistic concurrency conflict. OPTIMIZE marks a set of files as 'removed' and adds one compacted file. If MERGE reads those same files and also tries to add/remove files in an overlapping set, Delta throws ConcurrentAppendException at commit time. Two fixes: First, OPTIMIZE is idempotent and retryable — wrap it with retry logic so a conflict just triggers a retry after the MERGE finishes. Second, partition the table so OPTIMIZE targets OLD partitions and MERGE targets the CURRENT partition — e.g., OPTIMIZE WHERE date < current_date(). That way they work on disjoint file sets and never conflict."
FINAL READINESS CHECK
If you can do all of these, you're video-free:
- Run LAB 1 end-to-end without looking at the answers
- Run LAB 2 and explain each MERGE clause from memory
- Run LAB 3 and explain why you see 11 files after OPTIMIZE (not 1)
- Draw the Optimistic Concurrency animation on a whiteboard
- Draw the Z-ORDER animation on a whiteboard
- Name all 6 Gotchas and how to fix each
- Answer all 4 mock interview questions using the "good answer" framework
If yes → you don't need a single video. You're ready.