Delta Lake — Complete Interview Guide
Memory Map
SECTION 0: DELTA LAKE vs DELTA TABLE — The #1 Confusion
The Simple Answer
Real-World Analogy
Technical Difference — What Each Actually IS
┌─────────────────────────────────────────────────────────────┐
│ DELTA LAKE (Technology)                                     │
│                                                             │
│  ┌─────────────────┐   ┌─────────────────┐                  │
│  │ Delta Table:    │   │ Delta Table:    │                  │
│  │ orders          │   │ customers       │                  │
│  │                 │   │                 │                  │
│  │ 📁 _delta_log/  │   │ 📁 _delta_log/  │ ← each table     │
│  │ 📄 data.parquet │   │ 📄 data.parquet │   has its OWN    │
│  └─────────────────┘   └─────────────────┘   transaction    │
│                                              log            │
│  Provides: ACID, Schema Enforcement, Time Travel,           │
│  MERGE, OPTIMIZE, VACUUM, Data Skipping                     │
└─────────────────────────────────────────────────────────────┘
Side-by-Side Comparison
| Aspect | Delta Lake | Delta Table |
|---|---|---|
| What is it? | Open-source storage layer/framework | A single table stored in Delta format |
| Created by | Databricks (open-sourced in 2019) | You, using CREATE TABLE ... USING DELTA |
| How many? | ONE per environment (it's the technology) | MANY — you create hundreds of Delta tables |
| Contains | The protocol, the rules, the engine | Parquet data files + _delta_log/ folder |
| Analogy | The operating system (Windows) | One file on that operating system |
| Lives where? | Runs inside Spark/Databricks runtime | Stored on S3 / ADLS / GCS / DBFS |
| Version | Delta Lake 3.x (the protocol version) | Table has its own version history (0, 1, 2, 3...) |
How You Create a Delta Table (using Delta Lake)
# Method 1: SQL
spark.sql("""
CREATE TABLE orders (
order_id INT,
customer STRING,
amount DOUBLE,
order_date DATE
) USING DELTA -- ← tells Spark: "use Delta Lake format"
LOCATION 's3://my-bucket/orders/'
""")
# Method 2: DataFrame write
df.write.format("delta").save("s3://my-bucket/orders/")
# ↑
# this "delta" = use Delta Lake technology
# Method 3: In Databricks (default since DBR 8.0+)
spark.sql("CREATE TABLE orders (...)")
# ↑ In Databricks, DELTA is the DEFAULT format
# You don't even need to say USING DELTA!
# What gets created on disk:
# s3://my-bucket/orders/
# ├── _delta_log/ ← Delta Lake adds this (transaction log)
# │ └── 00000000000000000000.json ← First commit: "table created with schema..."
# ├── part-00000.snappy.parquet ← Actual data (still Parquet!)
# └── part-00001.snappy.parquet ← More data
The Key Insight — Delta Table = Parquet + Transaction Log
What Delta Lake GIVES to a Delta Table
Interview Answers — What to Say
"Delta Lake is an open-source storage layer that sits on top of cloud storage like S3 or ADLS. It adds ACID transactions, schema enforcement, time travel, and scalable metadata handling to data lakes. Under the hood, data is still stored as Parquet files — Delta Lake adds a transaction log that tracks every change."
"A Delta table is a specific table stored in Delta format. It's a collection of Parquet data files plus a _delta_log folder that contains the transaction history. Every Delta table has its own independent transaction log."
"Delta Lake is the technology — the protocol and engine. A Delta table is one table that uses that technology. It's like MySQL vs a table in MySQL. Delta Lake provides the ACID guarantees, and each Delta table is an individual dataset that benefits from those guarantees."
What NOT to Say:
Quick Memory Trick
SECTION 1: WHY DELTA LAKE EXISTS
Q01 — What is Delta Lake? Why do we need it?
Question: Your interviewer asks: "Why can't we just use Parquet files on a data lake? Why do we need Delta Lake?"
The Problem (Plain Parquet):
The Solution (Delta Lake):
Delta Lake adds a transaction log (_delta_log/) on top of Parquet files. This log tracks every change — which files were added, removed, or modified. It turns a chaotic collection of files into a proper table with ACID transactions, schema enforcement, and version history.
One-liner for interviews: "Delta Lake is an open-source storage layer that adds ACID transactions, scalable metadata handling, and time travel to existing data lakes built on Parquet."
Q02 — What are the key features of Delta Lake? (The Top 8)
Question: "List the main features of Delta Lake and explain why each matters."
| # | Feature | What It Does | Why It Matters |
|---|---|---|---|
| 1 | ACID Transactions | Every write is atomic — all or nothing | No more corrupted tables from failed writes |
| 2 | Time Travel | Query any historical version of the table | Undo mistakes, audit changes, debug issues |
| 3 | Schema Enforcement | Rejects writes that don't match the table schema | Prevents bad data from entering your table |
| 4 | Schema Evolution | Automatically adds new columns when needed | Handles evolving source systems gracefully |
| 5 | MERGE (Upsert) | INSERT + UPDATE + DELETE in one atomic operation | The #1 operation for data engineering pipelines |
| 6 | Data Skipping | Skips files that can't contain query results | Makes queries 10-100x faster on large tables |
| 7 | Unified Batch + Streaming | Same table for both batch and streaming writes | No separate streaming tables needed |
| 8 | Open Format | Data stored as Parquet (open standard) | No vendor lock-in — read with Spark, Trino, Presto, etc. |
Interview Tip: Don't just list features — connect them to PROBLEMS they solve. "We needed MERGE because our booking system sends daily change feeds that contain inserts, updates, and cancellations — all in one file."
What NOT to Say: "Delta Lake is a database." — No, it's a STORAGE LAYER on top of cloud storage. The data still lives as Parquet files on ADLS/S3/GCS.
Q03 — How does Delta Lake ensure ACID transactions?
Question: "Walk me through how each ACID property is implemented in Delta Lake."
| Property | What It Means | How Delta Does It |
|---|---|---|
| Atomicity | All changes in a transaction succeed, or none do | Each commit writes ONE JSON file atomically using put-if-absent semantics (atomic rename on HDFS). If the commit file isn't written, nothing changes. |
| Consistency | Data always follows schema rules | Schema enforcement rejects writes with wrong column types/names |
| Isolation | Concurrent readers/writers don't interfere | Snapshot isolation — readers see the table as it was when they started reading. Writers use optimistic concurrency (see Q05). |
| Durability | Committed data survives crashes | Data is stored as Parquet on cloud storage (S3/ADLS/GCS) — inherently durable |
Follow-up they'll ask: "What happens if a write fails midway?"
Interview Tip: "The key insight is that the commit JSON file is the SINGLE point of atomicity. Data files can be partially written — it doesn't matter until the commit JSON makes them visible."
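The put-if-absent idea can be sketched in plain Python using exclusive file creation. This is a simplified stand-in for the real commit protocol, not Delta's actual implementation; the function name and action shape are illustrative:

```python
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Atomically publish a commit: create <version>.json only if it does
    not already exist. Open mode 'x' (exclusive create) fails when the file
    is present, mirroring cloud put-if-absent semantics."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:  # 'x' = create, fail if exists
            json.dump(actions, f)
        return True                 # we won the commit for this version
    except FileExistsError:
        return False                # another writer committed first; retry at version+1

log_dir = tempfile.mkdtemp()
assert commit(log_dir, 0, {"add": ["part-00000.parquet"]}) is True
assert commit(log_dir, 0, {"add": ["part-00001.parquet"]}) is False  # loser retries at v1
```

This is why partially written Parquet files are harmless: until a commit file for version N exists, no reader ever sees them.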
SECTION 2: TRANSACTION LOG DEEP DIVE
Q04 — What is the Delta Transaction Log (_delta_log)? What's inside each commit?
Question: "Explain the Delta transaction log in detail. What exactly is stored in each commit file?"
Each JSON commit file contains these actions:
// Example: 00000000000000000005.json (Commit #5)
{
"commitInfo": {
"timestamp": 1711670400000,
"operation": "MERGE",
"operationParameters": {"predicate": "t.id = s.id"},
"operationMetrics": {
"numTargetRowsInserted": "1500",
"numTargetRowsUpdated": "300",
"numTargetRowsDeleted": "50"
}
},
"add": {
"path": "part-00000-new-file.snappy.parquet",
"size": 1073741824,
"partitionValues": {"date": "2026-03-29"},
"stats": "{\"numRecords\":50000,\"minValues\":{\"id\":1},\"maxValues\":{\"id\":50000}}"
},
"remove": {
"path": "part-00000-old-file.snappy.parquet",
"deletionTimestamp": 1711670400000,
"dataChange": true
}
}
| Action | What It Records | Example |
|---|---|---|
add | New Parquet file was added to the table | After INSERT, MERGE, or OPTIMIZE |
remove | File was logically deleted (still physically exists!) | After UPDATE, DELETE, or OPTIMIZE |
metaData | Schema change, table properties changed | After ALTER TABLE or schema evolution |
protocol | Reader/writer version requirements | After enabling new features (e.g., Deletion Vectors) |
commitInfo | Who, when, what operation, and metrics | Every commit has this |
Critical detail: remove is LOGICAL, not physical
Interview Tip: "The transaction log is append-only — you never modify existing commit files, you only add new ones. This is what makes it safe for concurrent readers."
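The add/remove replay above can be sketched in a few lines of plain Python. This is a toy model (in-memory commits instead of real JSON files) just to show how the current snapshot falls out of the append-only log:

```python
# Three toy commits, in _delta_log order. Each commit is a list of actions.
commits = [
    [{"add": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}],     # v0: initial load
    [{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}],  # v1: UPDATE rewrote a
    [{"add": {"path": "d.parquet"}}],                                     # v2: append
]

def live_files(commits):
    """Replay add/remove actions in commit order to compute the files
    that make up the CURRENT snapshot of the table."""
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])  # logical delete only
    return files

assert live_files(commits) == {"b.parquet", "c.parquet", "d.parquet"}
```

Note that a.parquet drops out of the snapshot at v1 but still exists on disk, which is exactly what makes time travel to v0 possible until VACUUM removes it.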
Q05 — What are checkpoint files? Why are they critical?
Question: "Without checkpoints, what happens when a Delta table has 100,000 commits?"
Problem without checkpoints:
Solution — Checkpoints:
Key details:
- Checkpoint interval: delta.checkpointInterval (default: 10 commits)
- A _last_checkpoint file points to the latest checkpoint
- Checkpoints are stored in Parquet format (columnar, fast to read)
- Multi-part checkpoints exist for very large tables (splits into multiple files)
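The reader path with checkpoints can be sketched as: load the full state from the checkpoint, then replay only the JSON commits written after it. A toy model (dict/list stand-ins for the real Parquet and JSON files):

```python
def snapshot(checkpoint, tail_commits):
    """Start from the checkpointed file set (the checkpoint is a full
    snapshot of table state), then replay only the commits written
    AFTER the checkpoint version instead of the whole log."""
    files = set(checkpoint["files"])
    for actions in tail_commits:  # e.g. commits 11..14 after a v10 checkpoint
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

checkpoint = {"version": 10, "files": ["a.parquet", "b.parquet"]}
tail = [[{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}]]  # commit 11
assert snapshot(checkpoint, tail) == {"b.parquet", "c.parquet"}
```

With 100,000 commits and a checkpoint every 10, a reader processes one Parquet checkpoint plus at most 9 small JSON files instead of 100,000.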
Q06 — How does Optimistic Concurrency Control work?
Question: "Two pipelines write to the same Delta table at the same time. What happens?"
Two isolation levels:
| Level | Default? | Behavior | Use When |
|---|---|---|---|
| WriteSerializable | Yes | Concurrent writes to different files succeed | Normal ETL workloads |
| Serializable | No | Even reads during writes can conflict | Strict audit/compliance tables |
How to AVOID conflicts in practice:
- Partition writes by date/region so different pipelines touch different files
- Add the partition column to the MERGE condition: ON t.id = s.id AND t.date = s.date
- Use Liquid Clustering to minimize file overlap
Interview Tip: "The word 'optimistic' is key — Delta assumes no conflict, lets both writers proceed, and only checks at commit time. This is efficient because most concurrent writes DON'T conflict."
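The commit-time conflict check can be sketched as a set-overlap test. This is a simplified model of the WriteSerializable case (real conflict detection also looks at operation types and predicates), with made-up file names:

```python
def can_commit(my_read_version, my_files_read, log):
    """Optimistic check at commit time: did any commit made AFTER the
    version we read touch files our transaction depends on?"""
    for version, touched_files in log:
        if version > my_read_version and touched_files & my_files_read:
            return False  # conflict → caller must re-read and retry
    return True

# Committed log: commit v5 rewrote p1, commit v6 rewrote p9
log = [(5, {"p1.parquet"}), (6, {"p9.parquet"})]

# Writer A read at v5 and only touches the p2/p3 partition files → no overlap
assert can_commit(5, {"p2.parquet", "p3.parquet"}, log) is True
# Writer B read at v4 and depends on p1, which commit v5 rewrote → conflict
assert can_commit(4, {"p1.parquet"}, log) is False
```

This is why partitioning writers onto different dates/regions works: their file sets never intersect, so the check always passes.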
SECTION 3: MERGE — THE MOST IMPORTANT COMMAND
Q07 — Write a MERGE from memory (upsert with all clauses)
Question: "Write a MERGE that handles inserts, updates, and deletes in one command."
This is the #1 coding question in Databricks interviews. Memorize it.
-- MERGE = Compare source (new data) with target (existing table), then act
MERGE INTO orders_fact AS t -- t = target (existing table)
USING orders_staging AS s -- s = source (new data)
ON t.order_id = s.order_id -- Match condition
-- Case 1: Record exists AND is cancelled → delete it
WHEN MATCHED AND s.status = 'CANCELLED' THEN
DELETE
-- Case 2: Record exists AND has changed → update it
WHEN MATCHED AND s.updated_at > t.updated_at THEN
UPDATE SET
t.status = s.status,
t.amount = s.amount,
t.updated_at = s.updated_at
-- Case 3: Record doesn't exist in target → insert it
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, status, amount, created_at, updated_at)
VALUES (s.order_id, s.customer_id, s.status, s.amount, s.created_at, s.updated_at)
-- Case 4: Record in target but NOT in source → clean up (Databricks extension)
WHEN NOT MATCHED BY SOURCE AND t.status = 'PENDING' THEN
DELETE
PySpark equivalent:
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "orders_fact")
target.alias("t").merge(
source_df.alias("s"),
"t.order_id = s.order_id"
).whenMatchedDelete(
condition="s.status = 'CANCELLED'"
).whenMatchedUpdate(
condition="s.updated_at > t.updated_at",
set={"status": "s.status", "amount": "s.amount", "updated_at": "s.updated_at"}
).whenNotMatchedInsertAll(
).execute()
Q08 — How do you handle duplicate keys in source during MERGE?
Question: "Your staging table has duplicate order_ids. MERGE throws an error. How do you fix it?"
Why it fails: MERGE requires each target row to match AT MOST one source row. Duplicate keys = ambiguous → error.
The fix — deduplicate source with ROW_NUMBER:
WITH deduped_source AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY order_id -- Group duplicates together
ORDER BY updated_at DESC -- Keep the LATEST record
) AS rn
FROM orders_staging
)
MERGE INTO orders_fact AS t
USING (SELECT * FROM deduped_source WHERE rn = 1) AS s -- Only latest per key
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
Why duplicates happen in real life:
- Kafka retry delivered the same event twice
- Source system sent the same file twice
- Backfill job reprocessed already-loaded data
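The keep-the-latest-per-key logic of the ROW_NUMBER pattern above can be expressed in plain Python (dict rows stand in for the staging table):

```python
def dedupe_latest(rows, key="order_id", order_col="updated_at"):
    """Keep only the newest row per key — the plain-Python analogue of
    ROW_NUMBER() OVER (PARTITION BY key ORDER BY order_col DESC) ... rn = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_col] > latest[k][order_col]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"order_id": 1, "status": "PENDING", "updated_at": "2026-03-01"},
    {"order_id": 1, "status": "SHIPPED", "updated_at": "2026-03-02"},  # retry, newer
    {"order_id": 2, "status": "PENDING", "updated_at": "2026-03-01"},
]
out = dedupe_latest(rows)
assert len(out) == 2
assert next(r for r in out if r["order_id"] == 1)["status"] == "SHIPPED"
```

After this step every key appears exactly once in the source, so MERGE's "at most one source row per target row" rule is satisfied.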
Q09 — How to make MERGE faster? (6 techniques)
Question: "Our MERGE takes 3 hours on a 500 GB table. How do you optimize it?"
Why MERGE is slow: It scans the ENTIRE target to find matching records. 500 GB table + 10K source rows = reads all 500 GB just to match 10K rows.
1. Add partition column to match condition:
-- BAD: Scans ALL data across ALL dates
MERGE INTO orders t USING staging s
ON t.order_id = s.order_id -- Scans 500 GB!
-- GOOD: Only scans matching date partitions
MERGE INTO orders t USING staging s
ON t.order_id = s.order_id AND t.order_date = s.order_date -- Scans ~2 GB!
2. Z-ORDER or Liquid Cluster on merge key:
OPTIMIZE orders ZORDER BY (order_id);
-- Data skipping now works for MERGE — skips irrelevant files
3. Filter unchanged rows from source:
MERGE INTO orders t
USING (
SELECT s.* FROM staging s
LEFT JOIN orders t ON s.order_id = t.order_id
WHERE t.order_id IS NULL OR s.hash_value != t.hash_value
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
4. Broadcast small source:
from pyspark.sql.functions import broadcast

target.alias("t").merge(
    broadcast(source_df).alias("s"),  # Source < 100 MB → broadcast to all workers
    "t.order_id = s.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
5. Compact target first:
OPTIMIZE orders; -- 50,000 small files → 500 large files → fewer tasks
6. Use Photon runtime:
- Select "Photon" when creating cluster — 3-5x faster for MERGE
- No code changes needed
Q10 — MERGE with Schema Evolution
Question: "Source data has a new column that doesn't exist in the target. How do you handle it?"
-- Enable auto schema merge
SET spark.databricks.delta.schema.autoMerge.enabled = true;
-- MERGE as normal — new columns from source auto-added to target
MERGE INTO customers t
USING staging s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET * -- * includes new columns
WHEN NOT MATCHED THEN INSERT * -- New column auto-added to target schema
When to use: Source systems add new fields over time (very common). When NOT to use: Strict regulatory tables where schema changes need approval.
SECTION 4: OPTIMIZE, VACUUM, AND THE SMALL FILE PROBLEM
Q11 — What is OPTIMIZE? When is it good? When is it bad?
Question: "Explain OPTIMIZE. What are its trade-offs?"
What it does: Combines many small files into fewer large files (~1 GB each).
-- Basic OPTIMIZE
OPTIMIZE orders;
-- Example: 10,000 files × 1 MB each → 10 files × 1 GB each
-- OPTIMIZE specific partition
OPTIMIZE orders WHERE order_date >= '2026-03-01';
-- OPTIMIZE with Z-ORDER (organize data within files)
OPTIMIZE orders ZORDER BY (customer_id, order_date);
GOOD things about OPTIMIZE:
| Benefit | Why |
|---|---|
| Faster reads | Fewer files = fewer I/O operations = faster queries |
| Better data skipping | Sorted data has tighter min/max ranges per file |
| Faster MERGE | Fewer files = fewer tasks for Spark |
| Non-destructive | Old files still exist (for time travel) until VACUUM |
BAD things about OPTIMIZE:
| Downside | Why |
|---|---|
| Rewrites ALL files | Even unchanged files get rewritten (Z-ORDER does full rewrite) |
| Expensive for large tables | 1 TB table = reads 1 TB + writes 1 TB = 2 TB of I/O |
| Doubles storage temporarily | Old files + new files exist until VACUUM |
| Not incremental (with Z-ORDER) | OPTIMIZE without Z-ORDER is incremental in newer versions |
| Blocks concurrent writes briefly | Takes a commit lock during the commit phase |
Best practices:
Interview Tip:
"OPTIMIZE is a WRITE operation that benefits READS. Run it during low-traffic windows. For streaming tables, use autoOptimize table property instead of manual OPTIMIZE."
Q12 — What is VACUUM? What are the risks?
Question: "What does VACUUM do? Can I run VACUUM RETAIN 0 HOURS? What happens?"
What VACUUM does:
-- Default: delete files older than 7 days
VACUUM orders;
-- Custom retention
VACUUM orders RETAIN 168 HOURS; -- 168 hours = 7 days
-- DANGEROUS — delete ALL old files immediately
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM orders RETAIN 0 HOURS;
Risks of VACUUM RETAIN 0 HOURS:
| Risk | Explanation |
|---|---|
| Breaks ALL time travel | Can't query any previous version — old files are gone |
| Active readers may fail | A running query that started before VACUUM may reference deleted files → FileNotFoundException |
| No recovery | Deleted files are gone permanently — no undo |
| Concurrent write failures | Writers that started with old versions lose their reference files |
Rule: NEVER use RETAIN 0 HOURS in production. Default 7 days is a safety net.
OPTIMIZE vs VACUUM — they work as a pair:
Q13 — The Small File Problem — Root Causes and Solutions
Question: "Your Delta table has 50,000 small files (each <1 MB). Queries take 45 minutes. How do you fix it?"
Why small files are bad:
Root causes of small files:
| Cause | Example |
|---|---|
| Streaming micro-batches | Processing every 10 seconds → 6 files/minute → 8,640 files/day |
| Over-partitioning | PARTITIONED BY (date, hour, region, category) → millions of tiny partitions |
| Frequent small appends | 100-row inserts every minute |
| Too many shuffle partitions | spark.sql.shuffle.partitions = 200 writes 200 files per write |
Immediate fix:
-- Step 1: Compact files now
OPTIMIZE slow_table;
-- 50,000 × 1 MB → ~50 × 1 GB
-- Step 2: Clean up old files
VACUUM slow_table;
-- Step 3: Refresh statistics
ANALYZE TABLE slow_table COMPUTE STATISTICS;
Prevent future small files:
-- Option A: Auto-optimization table properties
ALTER TABLE my_table SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true', -- Coalesce during write
'delta.autoOptimize.autoCompact' = 'true' -- Mini-OPTIMIZE after each write
);
-- Option B: For streaming — increase the trigger interval (PySpark)
--   .trigger(processingTime="5 minutes")   instead of "10 seconds"
-- Option C: For new tables — use Liquid Clustering
CREATE TABLE orders (...) CLUSTER BY (order_date, customer_id);
-- Handles compaction automatically
-- Option D: Reduce shuffle partitions (PySpark)
--   spark.conf.set("spark.sql.shuffle.partitions", "auto")   # Adaptive (AQE)
SECTION 5: DATA ORGANIZATION — PARTITIONING vs Z-ORDER vs LIQUID CLUSTERING
Q14 — Partitioning — When is it good? When is it bad?
Question: "When should you partition a Delta table? When should you NOT?"
-- Creating a partitioned table
CREATE TABLE bookings (
booking_id LONG,
passenger_name STRING,
booking_date DATE,
amount DECIMAL(10,2)
) PARTITIONED BY (booking_date);
GOOD — Use partitioning when:
- Column has low cardinality (few unique values): date, country, region
- Each partition has at least 1 GB of data
- Most queries filter on the partition column
- Example: PARTITIONED BY (year) where each year has 50 GB+ of data
BAD — Don't partition when:
- Column has high cardinality (many unique values): user_id, order_id
- Partitions are too small (< 1 GB each) → small file problem
- You filter on multiple columns → partitioning only helps ONE column
- Example: PARTITIONED BY (customer_id) with 10 million customers = disaster
The 1 GB rule:
Q15 — Z-ORDER — How does it work internally?
Question: "What is Z-ORDER? How is it different from sorting? How many columns can you Z-ORDER on?"
What Z-ORDER does:
Regular sorting (ORDER BY) sorts by one column, then another. Z-ORDER uses a space-filling curve to interleave multiple columns, so data is co-located on ALL specified columns simultaneously.
-- Z-ORDER is always used WITH OPTIMIZE
OPTIMIZE orders ZORDER BY (customer_id, order_date);
How it helps — data skipping becomes effective:
Limitations of Z-ORDER:
| Limitation | Explanation |
|---|---|
| Max 4 columns effective | More columns = less effective (diminishing returns) |
| Full rewrite every time | OPTIMIZE ZORDER rewrites ALL files — expensive on large tables |
| Not incremental | Can't Z-ORDER just the new data — must redo everything |
| Manual operation | Must run OPTIMIZE manually or schedule it |
| Can't change columns easily | Changing Z-ORDER columns requires full rewrite |
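The bit-interleaving idea behind the space-filling curve can be sketched for two integer columns. This is a minimal Morton-code illustration, not Delta's actual Z-ORDER implementation (which also handles other types via range partitioning):

```python
def z_value(x, y, bits=8):
    """Interleave the bits of two column values into one Morton (Z-order)
    key. Sorting rows by this key co-locates rows that are close in BOTH
    dimensions, which tightens per-file min/max ranges for both columns."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions ← x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  ← y
    return z

assert z_value(0b0000, 0b0000) == 0
assert z_value(0b1111, 0b0000) == 0b01010101  # x bits land in even positions
assert z_value(0b0000, 0b1111) == 0b10101010  # y bits land in odd positions
```

Contrast with ORDER BY (x, y): a plain sort gives perfect locality on x but scatters y, so data skipping on y alone barely works; the interleaved key trades a little locality on each column for usable locality on all of them.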
Q16 — Liquid Clustering — The Modern Replacement
Question: "What is Liquid Clustering? Why does Databricks say it replaces both partitioning AND Z-ORDER?"
-- Create table with Liquid Clustering
CREATE TABLE orders (
order_id LONG,
customer_id LONG,
order_date DATE,
amount DECIMAL(10,2)
) CLUSTER BY (order_date, customer_id);
-- Change clustering columns anytime — NO full rewrite!
ALTER TABLE orders CLUSTER BY (region, order_date);
-- OPTIMIZE is now INCREMENTAL — only processes new/changed data
OPTIMIZE orders;
Why Liquid Clustering is better:
| Aspect | Partitioning | Z-Ordering | Liquid Clustering |
|---|---|---|---|
| How it works | Separate folders per value | Sorts data within files | Auto-organizes data incrementally |
| Good for | Low cardinality only | High cardinality | Any cardinality |
| Applied when | On write (rigid) | Manual OPTIMIZE | Automatically on writes |
| Incremental | N/A | No (full rewrite) | Yes (only new/changed data) |
| Change columns | Requires full table rewrite | Must re-OPTIMIZE everything | Just ALTER TABLE |
| Small file handling | Can cause small files | Doesn't help | Prevents small files |
| Replaces others? | — | No | Yes — replaces BOTH |
When to use what (2026 recommendation):
Interview Tip:
"For new tables, I always use CLUSTER BY instead of PARTITIONED BY. It's incremental, handles any cardinality, and I can change the clustering columns without rewriting the table."
Q17 — Data Skipping and File-Level Statistics
Question: "How does Delta Lake skip files during queries? What are file-level statistics?"
SELECT * FROM orders WHERE order_date = '2026-03-15' AND customer_id = 42
Behind the scenes:
Key details:
- Stats stored for the first 32 columns by default
- Config: delta.dataSkippingNumIndexedCols (default 32)
- Move frequently filtered columns into the first 32 positions of your schema
- Works BEST when data is sorted/clustered (Z-ORDER or Liquid Clustering)
- Stats include: min, max, null count, row count per file
Why Z-ORDER/Clustering improves data skipping:
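A minimal sketch of stats-based pruning (file names and ranges are made up): on unclustered data every file's min/max tends to span the whole key range, so nothing can be skipped; after clustering, each file covers a tight range and most files are eliminated before any I/O happens.

```python
def files_to_scan(files, column, value):
    """Keep only files whose [min, max] range for the column could
    contain the predicate value — stats-based data skipping in one line."""
    return [f["path"] for f in files
            if f["min"][column] <= value <= f["max"][column]]

# After clustering on customer_id, each file covers a tight id range:
files = [
    {"path": "f1.parquet", "min": {"customer_id": 1},   "max": {"customer_id": 100}},
    {"path": "f2.parquet", "min": {"customer_id": 101}, "max": {"customer_id": 200}},
    {"path": "f3.parquet", "min": {"customer_id": 201}, "max": {"customer_id": 300}},
]
assert files_to_scan(files, "customer_id", 42) == ["f1.parquet"]  # 2 of 3 files skipped
```

With unsorted data, all three files would report min=1, max=300 and every query would scan everything; the stats only become selective once the data is physically organized.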
SECTION 6: TIME TRAVEL, CLONES, AND RECOVERY
Q18 — Time Travel — Query any version of your table
Question: "How does time travel work? Show me all the ways to use it."
-- METHOD 1: Query by version number
SELECT * FROM orders VERSION AS OF 5;
SELECT * FROM orders@v5; -- Shorthand
-- METHOD 2: Query by timestamp
SELECT * FROM orders TIMESTAMP AS OF '2026-03-15 10:30:00';
-- See full history of changes
DESCRIBE HISTORY orders;
-- Shows: version, timestamp, operation, user, metrics
-- RESTORE: Roll back the entire table to a previous version
RESTORE TABLE orders TO VERSION AS OF 5;
-- Creates a NEW version (non-destructive) — you can undo the restore too!
-- Compare two versions
SELECT * FROM orders@v10 EXCEPT ALL SELECT * FROM orders@v5;
-- Shows rows that exist in v10 but not v5
Limits:
- Data retention: 7 days default (VACUUM deletes files older than this)
- Log retention: 30 days default (DESCRIBE HISTORY works for 30 days)
- After VACUUM, time travel is broken for vacuumed versions
Interview Tip: "Time travel works because Delta keeps old Parquet files around. VACUUM is what deletes them. So time travel range = VACUUM retention period."
Q19 — Scenario: Accidental DELETE — How to recover?
Question: "Someone ran DELETE FROM customers WHERE region = 'APAC' by mistake. 2 million rows deleted. How do you recover?"
-- Step 1: Find the version BEFORE the accidental delete
DESCRIBE HISTORY customers;
-- Look for the DELETE operation — say it was at version 42
-- We want version 41 (just before the delete)
-- Option A: FULL RESTORE (simple but blunt)
RESTORE TABLE customers TO VERSION AS OF 41;
-- Table looks exactly like version 41
-- BUT: any legitimate changes between 42 and now are also lost!
-- Option B: SELECTIVE RESTORE (surgical — better answer!)
MERGE INTO customers AS target
USING customers VERSION AS OF 41 AS source
ON target.customer_id = source.customer_id
WHEN NOT MATCHED THEN INSERT *;
-- Only inserts rows that were deleted (exist in v41 but not current)
-- Preserves any legitimate changes that happened after the delete
Interview Tip: "Option B is the better answer — it shows you understand MERGE + time travel together, and it's surgical (doesn't overwrite changes made after the accidental delete)."
Q20 — Deep Clone vs Shallow Clone
Question: "What are clones? When would you use deep vs shallow?"
| Aspect | Deep Clone | Shallow Clone |
|---|---|---|
| Copies data? | YES — full independent copy | NO — references source files |
| Independent? | Fully independent | Depends on source files |
| Speed | Slow (copies all data) | Fast (copies only metadata) |
| Use case | Production backups, migration | Testing, experimentation |
| VACUUM safe? | Yes — owns its data | NO — vacuuming source breaks clone |
| Storage cost | 2x (full copy) | Minimal (metadata only) |
-- Deep clone: Full independent copy
CREATE TABLE orders_backup DEEP CLONE orders;
-- Subsequent runs are INCREMENTAL (only copies changes since last clone)
-- Shallow clone: Fast reference copy
CREATE TABLE orders_test SHALLOW CLONE orders;
-- Perfect for testing — make changes without affecting production
-- WARNING: If source is VACUUMed, shallow clone may break!
SECTION 7: SPECIAL FEATURES
Q21 — Change Data Feed (CDF)
Question: "How do you track what rows changed in a Delta table? How does downstream consume only the changes?"
-- Step 1: Enable Change Data Feed on the table
ALTER TABLE orders SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- Step 2: Read only the changes between two versions
SELECT * FROM table_changes('orders', 5, 10);
-- Returns ALL changes between version 5 and 10 with metadata:
What CDF returns:
| Change Type | Meaning |
|---|---|
insert | New row was added |
update_preimage | Row BEFORE the update (old values) |
update_postimage | Row AFTER the update (new values) |
delete | Row was deleted |
Why CDF matters for data engineering:
PySpark to read changes:
# Read changes since version 5
changes_df = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 5) \
.table("orders")
# Read changes since a timestamp
changes_df = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingTimestamp", "2026-03-15") \
.table("orders")
Interview Tip: "CDF is essential for incremental ETL in medallion architecture. Instead of reprocessing all Silver data to update Gold, we read only the changes. This reduces pipeline runtime from hours to minutes."
Q22 — Deletion Vectors — Faster DELETEs and UPDATEs
Question: "How do Deletion Vectors improve UPDATE/DELETE performance?"
Without Deletion Vectors:
With Deletion Vectors:
-- Enable deletion vectors
ALTER TABLE orders SET TBLPROPERTIES (
'delta.enableDeletionVectors' = 'true'
);
Trade-offs:
| Aspect | Without DV | With DV |
|---|---|---|
| Write speed (UPDATE/DELETE) | Slow (full file rewrite) | Fast (tiny marker file) |
| Read speed | Normal | Slightly slower (must check DV files) |
| Storage | Clean | Accumulates DV files until OPTIMIZE |
| Best for | Small tables | Large tables with frequent updates |
Interview Tip: "Deletion Vectors change the write pattern from copy-on-write to a more efficient mark-and-sweep approach. The actual file rewrite is deferred to OPTIMIZE, which runs during off-peak hours."
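The read path with a deletion vector can be sketched as a position filter. A toy model: the real vector is a compressed bitmap (RaringBitmap-style) stored alongside the Parquet file, but the idea is the same:

```python
def read_with_dv(file_rows, deletion_vector):
    """A DELETE with deletion vectors writes only a small set/bitmap of
    deleted row POSITIONS; readers filter those positions out, instead of
    the writer rewriting the whole Parquet file up front."""
    return [row for pos, row in enumerate(file_rows)
            if pos not in deletion_vector]

file_rows = ["row0", "row1", "row2", "row3"]  # contents of one Parquet file
dv = {1, 3}                                   # DELETE marked positions 1 and 3
assert read_with_dv(file_rows, dv) == ["row0", "row2"]
```

Deleting 2 rows from a 1 GB file thus costs a few bytes of marker data at write time; the 1 GB rewrite is deferred until OPTIMIZE merges the vector back into clean files.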
Q23 — Delta Sharing — Share data without copying
Question: "How can you share Delta tables with external organizations without copying data?"
How it works:
Provider (your org) Recipient (partner org)
┌─────────────────┐ ┌─────────────────┐
│ Delta Table on │ │ Any client: │
│ your storage │───── REST API ────→│ - Spark │
│ (S3/ADLS/GCS) │ (open protocol) │ - pandas │
└─────────────────┘ │ - Power BI │
↑ │ - Databricks │
Data stays HERE └─────────────────┘
(never copied) Reads directly from
your storage
# Recipient reads shared data — one line!
df = spark.read.format("deltaSharing").load("profile.json#share.schema.table")
Key benefits:
- No data copying — recipient reads directly from your storage
- Open protocol — not locked to Databricks (works with any client)
- Provider controls access — revoke anytime, audit all reads
- Live data — recipient always sees the latest version
Q24 — UniForm — One table, all formats
Question: "What is UniForm? Why does it matter?"
Problem: Different tools expect different table formats:
- Databricks → Delta Lake
- Snowflake → Iceberg
- Trino/Presto → Hudi or Iceberg
- BigQuery → Iceberg
UniForm solution: Write once as Delta, automatically generates Iceberg and Hudi metadata.
-- Enable UniForm on a table
ALTER TABLE orders SET TBLPROPERTIES (
'delta.universalFormat.enabledFormats' = 'iceberg'
);
-- Now this Delta table can be read as an Iceberg table by Snowflake, Trino, etc.
-- Data is NOT duplicated — only metadata is generated in Iceberg format
Why it matters:
- One copy of data, readable by ALL engines
- No ETL to copy data between formats
- Eliminates "format wars" (Delta vs Iceberg vs Hudi)
Q25 — Schema Evolution — All the options
Question: "How does Delta handle schema changes? What options are available?"
Option 1: Add new columns (mergeSchema):
df.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.saveAsTable("orders")
# New columns in df are auto-added to the table schema
# Existing columns are preserved
Option 2: Replace entire schema (overwriteSchema):
df.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable("orders")
# Completely replaces the schema — destructive!
Option 3: Rename/drop columns (requires column mapping):
-- Enable column mapping first
ALTER TABLE orders SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name');
-- Now you can rename and drop
ALTER TABLE orders RENAME COLUMN old_name TO new_name;
ALTER TABLE orders DROP COLUMN unused_column;
Option 4: Type widening (Delta 4.0+):
-- Change column type without rewriting data
ALTER TABLE orders SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
ALTER TABLE orders ALTER COLUMN amount TYPE DECIMAL(20,2);
-- INT → BIGINT, FLOAT → DOUBLE, etc. — NO full rewrite needed!
Q26 — Important Delta Table Properties (Know These!)
Question: "What are the most important Delta table properties?"
ALTER TABLE orders SET TBLPROPERTIES (
-- AUTO OPTIMIZATION: Fix small files automatically
'delta.autoOptimize.optimizeWrite' = 'true',
-- Coalesces small output files during write
'delta.autoOptimize.autoCompact' = 'true',
-- Runs mini-OPTIMIZE after each write
-- CHANGE DATA FEED: Track row-level changes for incremental ETL
'delta.enableChangeDataFeed' = 'true',
-- DELETION VECTORS: Faster UPDATE/DELETE
'delta.enableDeletionVectors' = 'true',
-- COLUMN MAPPING: Enable rename/drop columns
'delta.columnMapping.mode' = 'name',
-- RETENTION: How long to keep old data
'delta.logRetentionDuration' = 'interval 30 days',
'delta.deletedFileRetentionDuration' = 'interval 7 days',
-- TYPE WIDENING: Allow type changes without rewrite
'delta.enableTypeWidening' = 'true',
-- UNIFORM: Auto-generate Iceberg metadata
'delta.universalFormat.enabledFormats' = 'iceberg'
);
SECTION 8: DELTA LAKE vs COMPETITORS
Q27 — Delta Lake vs Apache Iceberg vs Apache Hudi
Question: "How does Delta Lake compare to Iceberg and Hudi? Why would you choose one over another?"
| Aspect | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Created by | Databricks | Netflix → Apache | Uber → Apache |
| Best with | Databricks, Spark | Snowflake, Trino, Spark, Flink | Spark, Flink |
| ACID | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes (limited) |
| MERGE performance | Excellent (Photon) | Good | Good (MoR tables) |
| Streaming | Excellent | Good | Excellent |
| Schema Evolution | Good | Excellent (best) | Good |
| Partition Evolution | Via Liquid Clustering | Native (hidden partitioning) | Manual |
| Multi-engine support | UniForm bridges gap | Best (truly engine-agnostic) | Good |
| Community/Ecosystem | Large (Databricks-led) | Fastest growing | Smaller |
| Vendor independence | Databricks-optimized | Most vendor-neutral | Less common |
When to choose each:
- Delta Lake — you're on Databricks/Spark; deepest integration (Photon, Unity Catalog) and excellent MERGE/streaming performance
- Apache Iceberg — multi-engine stack (Snowflake, Trino, Flink, Spark); best schema/partition evolution and the most vendor-neutral choice
- Apache Hudi — streaming-heavy upsert workloads on Spark/Flink, especially if you need Merge-on-Read tables
Interview Tip: "Don't bash other formats. Say: 'Delta is the natural choice on Databricks because of deep integration with Photon, Unity Catalog, and Predictive Optimization. But UniForm lets us expose tables as Iceberg for teams using Snowflake.'"
Q28 — Managed vs External Tables
Question: "What is the difference? When do you use each?"
| Aspect | Managed Table | External Table |
|---|---|---|
| Data location | Databricks-managed (auto) | YOUR storage path (you specify) |
| DROP TABLE | Deletes metadata AND data | Deletes metadata ONLY — data survives |
| Predictive Optimization | Yes (automatic) | No |
| Best for | Most tables (default) | Shared data, legacy migration |
-- Managed (recommended for new tables)
CREATE TABLE catalog.schema.orders (order_id LONG, amount DECIMAL);
-- Data stored in Databricks-managed location
-- DROP TABLE deletes everything
-- External (for data shared with other systems)
CREATE TABLE catalog.schema.legacy_orders (order_id LONG, amount DECIMAL)
LOCATION 'abfss://container@storage.dfs.core.windows.net/legacy/orders/';
-- Data stays at YOUR path even if table is dropped
SECTION 9: PRODUCTION SCENARIOS & GOTCHAS
Q29 — Predictive Optimization — Auto-maintenance
Question: "What is Predictive Optimization? How does it work?"
Predictive Optimization is Databricks' auto-maintenance system. It watches your tables and automatically runs OPTIMIZE, VACUUM, and ANALYZE TABLE at the right time.
Key points:
- Enabled by default on all new Unity Catalog managed tables
- Learns access patterns → optimizes during low-traffic windows
- Automatically runs: OPTIMIZE (compaction), VACUUM (cleanup), ANALYZE TABLE (refresh stats)
- No configuration needed — just use managed tables
- Not available on external tables
Interview Tip: "With Predictive Optimization, I no longer need to schedule nightly OPTIMIZE/VACUUM jobs. The platform handles it. This is one reason I prefer managed tables over external tables."
Q30 — Scenario: Design a Delta Lake pipeline for a real-time order system
Question: "Design a medallion architecture for an e-commerce order system using Delta Lake."
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   BRONZE    │      │   SILVER    │      │    GOLD     │
│  Raw data   │────→│   Cleaned   │────→│ Aggregated  │
│   as-is     │      │  validated  │      │  business   │
└─────────────┘      └─────────────┘      └─────────────┘

BRONZE LAYER:
- Auto Loader ingests raw JSON/CSV from cloud storage
- CLUSTER BY (ingest_date)
- Schema: raw_payload STRING, source STRING, ingest_timestamp TIMESTAMP
- enableChangeDataFeed = true (for Silver to read changes)

SILVER LAYER:
- Reads BRONZE changes via CDF (incremental!)
- MERGE to deduplicate and validate
- CLUSTER BY (order_date, customer_id)
- Schema enforcement + data quality checks (expectations)
- enableChangeDataFeed = true (for Gold to read changes)

GOLD LAYER:
- Reads SILVER changes via CDF
- Aggregations: daily revenue, customer metrics, product performance
- Optimized for BI queries (pre-computed)
- CLUSTER BY (report_date)
Key decisions in this design:
- Liquid Clustering everywhere (not partitioning)
- CDF enabled at each layer for incremental processing
- Managed tables for Predictive Optimization
- Auto Loader for exactly-once ingestion
- MERGE for deduplication at Silver layer
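The CDF handoff between layers can be sketched in plain Python — a toy model of how a downstream table applies the change rows CDF exposes. The change types (`insert`, `update_preimage`, `update_postimage`, `delete`) are real CDF values; the data structures are illustrative, not the Delta API:

```python
# Toy model of Change Data Feed consumption: the Silver layer tracks a
# last-processed version and applies only row-level changes committed after
# it. Dicts stand in for tables; this mirrors the semantics, not the engine.

def apply_changes(table, changes):
    """Apply CDF-style change rows (change_type, key, row) to a dict table."""
    for change_type, key, row in changes:
        if change_type in ("insert", "update_postimage"):
            table[key] = row              # upsert the new row image
        elif change_type == "delete":
            table.pop(key, None)          # remove the deleted key
        # "update_preimage" rows carry the OLD values; consumers usually skip them
    return table

# What the change feed for two commits might surface, row by row
feed = [
    ("insert",           1, {"amount": 100}),
    ("update_preimage",  1, {"amount": 100}),
    ("update_postimage", 1, {"amount": 150}),
    ("insert",           2, {"amount": 200}),
    ("delete",           2, None),
]

silver = apply_changes({}, feed)
print(silver)   # {1: {'amount': 150}} — only the net effect survives
```

This is why CDF makes Silver incremental: it processes only the changed rows, never a full re-scan of Bronze.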
Q31 — Common production gotchas with Delta Lake
Question: "What are the most common issues you've seen with Delta Lake in production?"
| Gotcha | Symptom | Fix |
|---|---|---|
| Small file problem | Queries slow, too many Spark tasks | OPTIMIZE + autoOptimize properties |
| VACUUM too aggressive | Time travel broken, readers fail with FileNotFoundException | Keep default 7-day retention, never RETAIN 0 |
| MERGE without partition filter | MERGE takes hours on large tables | Add partition column to ON clause |
| Over-partitioning | Thousands of tiny files per partition | Switch to Liquid Clustering |
| Schema mismatch | Write fails with schema error | Use mergeSchema or overwriteSchema as appropriate |
| Concurrent write conflicts | ConcurrentModificationException | Partition writes so pipelines touch different files |
| Stale statistics | Data skipping not working, queries slow | ANALYZE TABLE or enable Predictive Optimization |
| Z-ORDER on too many columns | No improvement in query speed | Max 4 columns, choose most filtered ones |
| Not using Photon | MERGE and queries 3-5x slower than needed | Switch cluster to Photon runtime |
| Streaming + OPTIMIZE conflict | OPTIMIZE blocks streaming writes | Use autoCompact instead of manual OPTIMIZE |
SECTION 10: NEW FEATURES (2025-2026)
Q32 — What's new in Delta Lake 4.x?
Question: "What are the latest features? Mention 2-3 to show you're up to date."
| Feature | Version | What It Does |
|---|---|---|
| Variant Data Type | 4.0 | Store semi-structured JSON without schema — query nested fields directly |
| Type Widening | 4.0 | Change column type (INT → BIGINT) without rewriting data |
| Coordinated Commits | 4.0 | Multiple engines can safely write to the same Delta table |
| UniForm | 4.0 | Auto-generate Iceberg/Hudi metadata for cross-engine reads |
| Conflict-Free DV | 4.1 | Enable Deletion Vectors without blocking concurrent writers |
| Atomic CTAS | 4.1 | CREATE TABLE AS SELECT is fully atomic (no partial tables on failure) |
| Lakebase | 2026 | Serverless PostgreSQL-compatible database inside Databricks |
Interview power move: "I'd use Liquid Clustering with Deletion Vectors enabled and CDF turned on — that gives me automatic compaction, fast updates, and incremental ETL. For cross-team sharing with Snowflake users, I'd enable UniForm so they read the same table as Iceberg."
QUICK REVISION CHECKLIST
Can you answer each in 2-3 minutes?
- Why do we need Delta Lake over plain Parquet? (Q01)
- List 8 key features of Delta Lake (Q02)
- How does Delta ensure ACID? What happens if a write fails midway? (Q03)
- What's inside a transaction log commit? (Q04)
- What are checkpoints and why are they critical? (Q05)
- How does optimistic concurrency work? (Q06)
- Write a MERGE from memory with all 4 clauses (Q07)
- How to handle duplicate keys in MERGE? (Q08)
- List 6 ways to make MERGE faster (Q09)
- What is OPTIMIZE? Good and bad? (Q11)
- What is VACUUM? Risks of RETAIN 0? (Q12)
- How to fix the small file problem? (Q13)
- Partitioning vs Z-ORDER vs Liquid Clustering — when to use each? (Q14-Q16)
- How does data skipping work? (Q17)
- How does Time Travel work? Recover from accidental delete? (Q18-Q19)
- What is CDF? How does it help incremental ETL? (Q21)
- What are Deletion Vectors? Trade-offs? (Q22)
- Delta vs Iceberg vs Hudi — when to choose each? (Q27)
- Design a medallion architecture with Delta (Q30)
- Name 3 new features in Delta 4.x (Q32)
APPENDIX: LEARN BY WATCHING + DOING
(This section replaces YouTube videos — read + copy-paste + see the output)
LAB 1 — Watch the Transaction Log Grow in Real Time
Goal: See EXACTLY what Delta writes on every operation. No magic.
Where: Databricks Community Edition (free) OR any Databricks workspace
Time: 10 minutes
Setup (copy-paste this first)
# ─── SETUP ────────────────────────────────────────────────
# Create a fresh Delta table in a clean location
path = "/tmp/delta_lab/orders"
# Clean up anything from previous runs
dbutils.fs.rm(path, recurse=True) # ← wipes old data (safe, /tmp only)
# Create initial data
from pyspark.sql import Row
data = [
Row(order_id=1, customer="Alice", amount=100.0),
Row(order_id=2, customer="Bob", amount=200.0),
Row(order_id=3, customer="Carol", amount=300.0),
]
df = spark.createDataFrame(data)
# Write as Delta — this creates the table AND the _delta_log
df.write.format("delta").save(path)
print("✅ Delta table created at:", path)
Step 1 — Look at what Delta just created
# List files in the folder
files = dbutils.fs.ls(path)
for f in files:
    print(f.name)
# EXPECTED OUTPUT:
# _delta_log/ ← the transaction log folder
# part-00000-abc123.snappy.parquet ← actual data (Parquet)
# part-00001-def456.snappy.parquet
# Now peek inside _delta_log
log_files = dbutils.fs.ls(f"{path}/_delta_log")
for f in log_files:
    print(f.name)
# EXPECTED OUTPUT:
# 00000000000000000000.json ← ONE commit file. This is commit #0.
Step 2 — Read commit #0 (see the "birth certificate" of the table)
# Read the first commit JSON
commit_0 = spark.read.json(f"{path}/_delta_log/00000000000000000000.json")
commit_0.show(truncate=False)
# EXPECTED OUTPUT (5 rows, one per action):
# ┌──────────┬────────────────────────────────────────┐
# │ action │ details │
# ├──────────┼────────────────────────────────────────┤
# │ protocol │ minReaderVersion=1, minWriterVersion=2 │ ← compat info
# │ metaData │ schema={order_id INT, customer STR,..} │ ← table schema
# │ add │ path=part-00000-abc.parquet, size=921 │ ← data file 1
# │ add │ path=part-00001-def.parquet, size=874 │ ← data file 2
# │ commitInfo│ operation=WRITE, timestamp=... │ ← who did what
# └──────────┴────────────────────────────────────────┘
Step 3 — Do an UPDATE. Watch a NEW commit appear.
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, path)
# Update Alice's amount from 100 → 150
dt.update(
condition = "customer = 'Alice'",
set = {"amount": "150.0"}
)
# Now check _delta_log AGAIN
log_files = dbutils.fs.ls(f"{path}/_delta_log")
for f in log_files:
    print(f.name)
# EXPECTED OUTPUT:
# 00000000000000000000.json ← original commit (unchanged)
# 00000000000000000001.json ← NEW commit! ✨ This is the UPDATE
Step 4 — Read commit #1. See what Delta did behind the scenes.
commit_1 = spark.read.json(f"{path}/_delta_log/00000000000000000001.json")
commit_1.show(truncate=False)
# EXPECTED OUTPUT:
# ┌───────────┬───────────────────────────────────────────┐
# │ action │ details │
# ├───────────┼───────────────────────────────────────────┤
# │ remove │ path=part-00000-abc.parquet (OLD file) │ ← old file marked removed
# │ add │ path=part-00002-xyz.parquet (NEW file) │ ← new file with updated row
# │ commitInfo│ operation=UPDATE, numUpdatedRows=1 │
# └───────────┴───────────────────────────────────────────┘
# 🧠 KEY INSIGHT:
# Delta NEVER modifies a Parquet file in place.
# It writes a NEW file with the updated row,
# and marks the OLD file as "removed" in the log.
# The old file still EXISTS on disk until VACUUM runs!
Step 5 — Check the history (Delta's "audit log")
dt.history().select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)
# EXPECTED OUTPUT:
# ┌───────┬──────────────────┬─────────┬────────────────────────┐
# │version│ timestamp │ operation│ operationMetrics │
# ├───────┼──────────────────┼─────────┼────────────────────────┤
# │ 1 │ 2026-04-05 10:31 │ UPDATE │ {numUpdatedRows: 1} │
# │ 0 │ 2026-04-05 10:30 │ WRITE │ {numOutputRows: 3} │
# └───────┴──────────────────┴─────────┴────────────────────────┘
#
# Every commit = one row in DESCRIBE HISTORY. That's your audit trail.
Step 6 — TIME TRAVEL (go back to BEFORE the update)
# Read version 0 (the original, before Alice got updated)
df_before = spark.read.format("delta").option("versionAsOf", 0).load(path)
df_before.show()
# EXPECTED OUTPUT (Alice is still 100.0):
# +--------+--------+------+
# |order_id|customer|amount|
# +--------+--------+------+
# | 1| Alice| 100.0| ← original value!
# | 2| Bob| 200.0|
# | 3| Carol| 300.0|
# +--------+--------+------+
# Now read current version (version 1)
df_now = spark.read.format("delta").load(path)
df_now.show()
# EXPECTED OUTPUT:
# +--------+--------+------+
# |order_id|customer|amount|
# +--------+--------+------+
# | 1| Alice| 150.0| ← updated value
# | 2| Bob| 200.0|
# | 3| Carol| 300.0|
# +--------+--------+------+
#
# 🎯 Same table. Two different versions. This is Time Travel.
Now you've SEEN the transaction log grow. No video needed.
LAB 2 — Watch MERGE Work (The #1 Delta Operation)
Goal: See insert + update + delete happen in ONE atomic operation.
# ─── SETUP target table ───────────────────────────
target_path = "/tmp/delta_lab/customers"
dbutils.fs.rm(target_path, recurse=True)
target_data = [
Row(id=1, name="Alice", status="active"),
Row(id=2, name="Bob", status="active"),
Row(id=3, name="Carol", status="active"),
]
spark.createDataFrame(target_data).write.format("delta").save(target_path)
print("📋 TARGET (before MERGE):")
spark.read.format("delta").load(target_path).show()
# +---+-----+------+
# | id| name|status|
# +---+-----+------+
# | 1|Alice|active|
# | 2| Bob|active|
# | 3|Carol|active|
# +---+-----+------+
# ─── Build source (daily change feed) ──────────────
source_data = [
Row(id=2, name="Bob", status="inactive"), # UPDATE (Bob deactivated)
Row(id=3, name="Carol", status="DELETE_ME"), # DELETE (special marker)
Row(id=4, name="Dave", status="active"), # INSERT (new customer)
]
source_df = spark.createDataFrame(source_data)
print("📥 SOURCE (daily change feed):")
source_df.show()
# +---+----+---------+
# | id|name| status|
# +---+----+---------+
# | 2| Bob| inactive| ← will UPDATE existing row
# | 3|Carol|DELETE_ME| ← will DELETE existing row
# | 4|Dave| active| ← will INSERT new row
# +---+----+---------+
# ─── THE MERGE ─────────────────────────────────────
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, target_path)
dt.alias("t").merge(
source_df.alias("s"),
"t.id = s.id" # ← match on id
).whenMatchedDelete(
condition = "s.status = 'DELETE_ME'" # ← if marker, delete
).whenMatchedUpdate(
set = {"name": "s.name", "status": "s.status"} # ← else update
).whenNotMatchedInsert(
values = {"id": "s.id", "name": "s.name", "status": "s.status"} # ← new rows: insert
).execute()
print("📋 TARGET (after MERGE):")
spark.read.format("delta").load(target_path).orderBy("id").show()
# +---+-----+--------+
# | id| name| status|
# +---+-----+--------+
# | 1|Alice| active| ← unchanged
# | 2| Bob|inactive| ← UPDATED ✅
# | 4| Dave| active| ← INSERTED ✅
# +---+-----+--------+
# ← Carol (id=3) DELETED ✅
# ─── Verify: one commit for the whole thing ────────
dt.history().select("version", "operation", "operationMetrics").show(truncate=False)
# +-------+-----------+--------------------------------+
# |version| operation | operationMetrics               |
# +-------+-----------+--------------------------------+
# |   1   | MERGE     | numTargetRowsInserted=1        |
# |       |           | numTargetRowsUpdated=1         |
# |       |           | numTargetRowsDeleted=1         |
# |   0   | WRITE     | numOutputRows=3                |
# +-------+-----------+--------------------------------+
#
# 🎯 ONE atomic commit handled INSERT + UPDATE + DELETE together.
# That's why MERGE is the most important Delta operation.
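The clause logic of that MERGE can be mirrored in a few lines of plain Python. Dicts stand in for the target table — this is a sketch of the semantics (matched clauses are tried in order, delete condition first), not the Delta engine:

```python
# Plain-Python model of Lab 2's MERGE: for each source row, matched keys hit
# the when-matched clauses in order (conditional delete, then update);
# unmatched keys hit when-not-matched-insert. All in one pass.

target = {1: ("Alice", "active"), 2: ("Bob", "active"), 3: ("Carol", "active")}
source = [(2, "Bob", "inactive"), (3, "Carol", "DELETE_ME"), (4, "Dave", "active")]

for sid, name, status in source:
    if sid in target:                       # MATCHED
        if status == "DELETE_ME":           # whenMatchedDelete(condition=...)
            del target[sid]
        else:                               # whenMatchedUpdate
            target[sid] = (name, status)
    else:                                   # NOT MATCHED
        target[sid] = (name, status)        # whenNotMatchedInsert

print(sorted(target.items()))
# [(1, ('Alice', 'active')), (2, ('Bob', 'inactive')), (4, ('Dave', 'active'))]
```

Same net result as the lab: Bob updated, Carol deleted, Dave inserted, Alice untouched — except Delta does it as ONE atomic commit.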
LAB 3 — Watch OPTIMIZE Fix the Small File Problem
# ─── Create a table with MANY small files (simulating streaming) ─────
small_files_path = "/tmp/delta_lab/small_files"
dbutils.fs.rm(small_files_path, recurse=True)
# Write 10 times, one row each time → 10 tiny files
for i in range(10):
    spark.createDataFrame([Row(id=i, val=f"row_{i}")]) \
        .write.format("delta").mode("append").save(small_files_path)
# Count Parquet files
parquet_files = [f.name for f in dbutils.fs.ls(small_files_path) if f.name.endswith(".parquet")]
print(f"🔴 Parquet files BEFORE OPTIMIZE: {len(parquet_files)}")
# EXPECTED: 🔴 Parquet files BEFORE OPTIMIZE: 10
#
# 🧠 Each file is tiny (few hundred bytes) → disaster for query performance
# ─── Run OPTIMIZE ───────────────────────────────────
spark.sql(f"OPTIMIZE delta.`{small_files_path}`")
parquet_files = [f.name for f in dbutils.fs.ls(small_files_path) if f.name.endswith(".parquet")]
print(f"🟢 Parquet files AFTER OPTIMIZE: {len(parquet_files)}")
# EXPECTED: 🟢 Parquet files AFTER OPTIMIZE: 11
#
# Wait — 11? Why not 1?
# → OPTIMIZE created 1 NEW compacted file (with all 10 rows)
# → The 10 OLD files are marked "removed" in _delta_log but still on disk
# → VACUUM will clean them up later
#
# Verify: check the log
dt = DeltaTable.forPath(spark, small_files_path)
dt.history().select("version", "operation", "operationMetrics").show(truncate=False)
# Latest row: operation=OPTIMIZE, numFilesAdded=1, numFilesRemoved=10 ✅
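The "11 files, not 1" result can be modeled in a few lines of plain Python (illustrative file names; this models the bookkeeping, not real Delta internals): OPTIMIZE adds one compacted file and only *logically* removes the old ones, and physical deletion waits for VACUUM.

```python
# Toy model of OPTIMIZE + VACUUM bookkeeping: a file list on "disk" plus a
# set of files the _delta_log has marked as removed.

on_disk = [f"part-{i:05d}.parquet" for i in range(10)]   # 10 small files
removed_in_log = set()

def optimize(files, removed):
    live = [f for f in files if f not in removed]
    files.append("part-compacted.parquet")   # new big file written to disk
    removed.update(live)                     # old files marked removed in the log
    return files, removed

def vacuum(files, removed):
    # physically delete files the log no longer considers live
    return [f for f in files if f not in removed], removed

on_disk, removed_in_log = optimize(on_disk, removed_in_log)
print(len(on_disk))   # 11 — exactly what the lab shows after OPTIMIZE

on_disk, removed_in_log = vacuum(on_disk, removed_in_log)
print(len(on_disk))   # 1 — only after VACUUM do the 10 old files disappear
```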
VISUAL ANIMATION 1 — Optimistic Concurrency Conflict
What happens when TWO jobs try to update the same table at the same time?
TIME → T0 T1 T2 T3
│ │ │ │
JOB A ├─ reads v5 ────┼─ writes data ──┼─ commits v6 ─✅ │
│ │ │ │
JOB B ├─ reads v5 ────┼─ writes data ──┼────────────────┼─ tries to commit v6 ❌
│ │ │ │ CONFLICT!
│ │ │ │ Job A already wrote v6
│ │ │ │
│ │ │ ├─ Delta checks:
│ │ │ │ "Did Job A touch the
│ │ │ │ same files as me?"
│ │ │ │
│ │ │ ├─ IF NO overlap:
│ │ │ │ retry commit as v7 ✅
│ │ │ │
│ │ │ └─ IF YES overlap:
│ │ │ throw ConcurrentModification ❌
│ │ │ → user must retry the WHOLE job
KEY INSIGHT:
• Delta does NOT lock the table while writing (unlike traditional DBs)
• Both jobs write data files in PARALLEL — no slowdown
• Conflict is detected ONLY at commit time (when writing the JSON)
• Only ONE commit can "win" version 6 → the other must retry or fail
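The same protocol fits in a toy compare-and-swap on the log (plain Python, not Delta's actual implementation): a commit succeeds only if the version it claims is still free; losers with disjoint files retry at the next version, losers with overlapping files fail.

```python
# Toy optimistic concurrency: the log maps version -> (job, files_touched).
# Committing = claiming the next free version; overlap check decides
# auto-retry vs ConcurrentModificationException.

log = {}   # version -> (job, files_touched)

def try_commit(read_version, job, files):
    version = read_version + 1
    while version in log:                    # another job already won this slot
        _, winner_files = log[version]
        if winner_files & files:             # overlapping files → real conflict
            raise RuntimeError("ConcurrentModificationException")
        version += 1                         # disjoint files → retry next slot
    log[version] = (job, files)
    return version

print(try_commit(5, "A", {"f1"}))   # 6 — Job A wins version 6
print(try_commit(5, "B", {"f2"}))   # 7 — touched different files, auto-retried as v7
try:
    try_commit(5, "C", {"f1"})      # overlaps Job A's files at v6
except RuntimeError as e:
    print(e)                        # ConcurrentModificationException
```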
VISUAL ANIMATION 2 — Time Travel Mechanics
DELTA TABLE /customers/ — Timeline of Operations:
v0: CREATE v1: INSERT 100 v2: UPDATE 20 v3: DELETE 5
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ empty │ │ file-A │ │ file-A │ removed │ file-A │ removed
│ │ → │ │ → │ file-B │ │ file-B │
│ │ │ │ │ file-C │ added │ file-C │
│ │ │ │ │ │ │ file-D │ added
└─────────┘ └─────────┘ └─────────┘ └─────────┘
_delta_log _delta_log _delta_log _delta_log
/0.json /0.json /0.json /0.json
/1.json /1.json /1.json
/2.json /2.json
/3.json
TIME TRAVEL QUERIES:
SELECT * ... VERSION AS OF 0 → reads NO files (empty table)
SELECT * ... VERSION AS OF 1 → reads file-A (100 rows)
SELECT * ... VERSION AS OF 2 → reads file-B, file-C (80 old + 20 updated)
SELECT * ... VERSION AS OF 3 → reads file-B, file-C, file-D (75 rows after delete)
🧠 KEY INSIGHT:
Delta doesn't "roll back" anything for Time Travel.
It just reads the OLD files that are STILL on disk
(because VACUUM hasn't removed them yet).
Time Travel = reading old files according to an old commit.
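That insight fits in one small function: replay the log's add/remove actions up to version N and you have the file set for that version. The log below mirrors the timeline above (illustrative tuples, not real Delta JSON):

```python
# Time Travel as log replay: a "snapshot" at version N is just the set of
# files that are still live after applying commits 0..N.

log = [
    [],                                                            # v0: CREATE
    [("add", "file-A")],                                           # v1: INSERT
    [("remove", "file-A"), ("add", "file-B"), ("add", "file-C")],  # v2: UPDATE
    [("add", "file-D")],                                           # v3: DELETE
]

def snapshot(log, version):
    live = set()
    for commit in log[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            else:
                live.discard(path)
    return sorted(live)

print(snapshot(log, 1))   # ['file-A']
print(snapshot(log, 3))   # ['file-B', 'file-C', 'file-D']
```

No rollback, no restore — `VERSION AS OF 1` is just a replay that stops early. It only breaks when VACUUM has physically deleted file-A.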
VISUAL ANIMATION 3 — Z-ORDER Rearranging Data
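A minimal sketch of the mechanism in plain Python: Z-ORDER sorts rows by the bit-interleaved (Morton) value of the clustered columns, so rows close in BOTH dimensions land in the same file and each file's min/max stats cover a tight range — which is exactly what data skipping needs. (Illustrative toy, not the engine.)

```python
# Morton (Z-order) curve in miniature: interleave the bits of two columns,
# sort by the interleaved value, and watch each "file" cover a tight
# quadrant in BOTH dimensions.

def z_value(x, y, bits=4):
    """Interleave the bits of x and y: x3 y3 x2 y2 x1 y1 x0 y0."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)   # x bits at odd positions
        z |= ((y >> i) & 1) << (2 * i)       # y bits at even positions
    return z

rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: z_value(*r))

# Cut into 4 "files" of 4 rows and show each file's per-column min/max
for chunk in (rows[i:i + 4] for i in range(0, 16, 4)):
    xs, ys = [r[0] for r in chunk], [r[1] for r in chunk]
    print(chunk, "x:", (min(xs), max(xs)), "y:", (min(ys), max(ys)))
```

Each "file" covers one quadrant (e.g. x in 0..1 AND y in 0..1), so a filter like `x = 3 AND y = 0` prunes 3 of the 4 files via min/max stats. A plain sort by x alone would give every file the full y range — no skipping on y.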
GOTCHAS — The Weird Errors You'll Hit (and How to Fix Them)
Gotcha 1: VACUUM RETAIN 0 HOURS — "Invalid retention" Error
Why it happens: Delta's default minimum retention is 7 days. VACUUM RETAIN 0 HOURS would delete files that RUNNING QUERIES might still be reading → data corruption.
Fix:
# DANGER: only disable if you KNOW no concurrent reads are running
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql(f"VACUUM delta.`{path}` RETAIN 0 HOURS")
# Then TURN IT BACK ON:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
Interview trap: If they ask "how would you immediately VACUUM a Delta table?" and you say "RETAIN 0 HOURS," they'll follow up with "What's the risk?" → answer: concurrent readers can get FileNotFoundException.
Gotcha 2: Schema Mismatch — "A schema mismatch detected when writing"
Why it happens: Your source DataFrame has a column the target table doesn't have (or different type).
Fix:
# Option 1: allow new columns to be added automatically
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(path)  # ← the mergeSchema option is what solves it
# Option 2 (for MERGE operations):
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
Interview trap: "How is this different from schema ENFORCEMENT?" → Enforcement REJECTS bad writes by default. mergeSchema=true opts in to schema EVOLUTION.
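The enforcement-vs-evolution policy in miniature, as plain Python (dicts stand in for schemas; this models the behavior, not the Spark API):

```python
# Schema enforcement vs evolution: by default a write with unknown columns is
# rejected; with merge_schema the target schema is widened to the union.

def write(target_schema, incoming_schema, merge_schema=False):
    extra = set(incoming_schema) - set(target_schema)
    if extra and not merge_schema:
        # ENFORCEMENT: reject the bad write outright
        raise ValueError(f"schema mismatch: unexpected columns {sorted(extra)}")
    # EVOLUTION: accept and add the new columns
    return {**target_schema, **incoming_schema}

target = {"order_id": "int", "amount": "double"}
incoming = {"order_id": "int", "amount": "double", "coupon": "string"}

try:
    write(target, incoming)                    # enforcement kicks in
except ValueError as e:
    print(e)                                   # schema mismatch: ... ['coupon']

print(write(target, incoming, merge_schema=True))
# {'order_id': 'int', 'amount': 'double', 'coupon': 'string'}
```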
Gotcha 3: ConcurrentAppendException
Why it happens: Two jobs MERGED into the same partition at the same time. Both added files → conflict at commit time.
Fix:
# Add a filter in your MERGE condition to restrict partitions
dt.alias("t").merge(
source.alias("s"),
"t.date = '2026-04-05' AND t.id = s.id" # ← partition filter narrows conflict scope
).whenMatchedUpdate(...).execute()
# Or: retry with exponential backoff
import time
for attempt in range(3):
    try:
        # your merge
        break
    except Exception as e:
        if "ConcurrentAppend" in str(e):
            time.sleep(2 ** attempt)
        else:
            raise
Gotcha 4: MERGE with Duplicate Keys in Source
Why it happens: Your source has 2+ rows with the same join key → Delta can't decide which one wins.
Fix:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc
# Deduplicate source BEFORE merging
w = Window.partitionBy("id").orderBy(desc("updated_at"))
source_deduped = source_df.withColumn("rn", row_number().over(w)) \
.filter("rn = 1") \
.drop("rn")
# Now source has ONE row per id — merge will succeed
Interview trap: This is THE most common MERGE production bug. If you know this fix, you sound senior.
Gotcha 5: Small Files Problem in Streaming
Why it happens: Streaming commits every micro-batch, and each trigger writes new small files — a minute-level trigger alone produces over a thousand tiny files per day.
Fix:
# Enable auto-compaction + optimized writes on the table
spark.sql(f"""
ALTER TABLE delta.`{path}`
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
)
""")
# Also: run scheduled OPTIMIZE daily
spark.sql(f"OPTIMIZE delta.`{path}`")
Gotcha 6: "Delta Table not found" when path looks correct
Why it happens: Either:
- You wrote as Parquet but are trying to read as Delta
- You deleted _delta_log/ manually (DON'T)
- Path typo
Fix:
# Always verify before reading
print(dbutils.fs.ls(path)) # check path exists
print(dbutils.fs.ls(f"{path}/_delta_log")) # check _delta_log exists
# Convert plain Parquet to Delta if needed:
from delta.tables import DeltaTable
DeltaTable.convertToDelta(spark, f"parquet.`{path}`")
MOCK INTERVIEW — Live Walkthrough
Scenario: Senior Data Engineer Round (30 min Delta deep-dive)
INTERVIEWER: "We have a 2 TB Delta table being updated every hour by a MERGE job. Lately, queries are getting slower and slower. Walk me through how you'd debug this."
❌ BAD ANSWER:
"I would run OPTIMIZE to fix it."
(Too short. No reasoning. No investigation. Red flag.)
✅ GOOD ANSWER:
"Three things come to mind, and I'd investigate in order. First, I'd run DESCRIBE HISTORY and DESCRIBE DETAIL to check the file count and size distribution — if we have thousands of small files from hourly MERGEs, that's the small-file problem and OPTIMIZE would help. Second, I'd check if any column has high-cardinality filters in queries — if yes, we might need Z-ORDER on that column for data skipping. Third, I'd check the MERGE condition itself — if it's not partition-pruning, every MERGE is rewriting files across the whole table. Let me describe how I'd verify each..."
FOLLOW-UP TRAP: (interviewer nods) "Good. Now — OPTIMIZE is running but taking 4 hours and blocking the pipeline. What do you do?"
❌ BAD ANSWER: "Run it less often."
✅ GOOD ANSWER:
"I'd do three things. First, run OPTIMIZE with a partition filter so it only compacts recent partitions — OPTIMIZE table WHERE date >= current_date() - 7. Second, enable auto-compaction at the table level so small files get compacted incrementally during writes, reducing how much OPTIMIZE has to do. Third, if we're on a recent Databricks runtime, I'd consider switching from partitioning + Z-ORDER to Liquid Clustering, which handles incremental clustering automatically without blocking writes."
INTERVIEWER: "A user accidentally ran DELETE FROM customers WHERE status = 'active' — deleted 4 million rows. It's been 2 hours. Can you recover?"
❌ BAD ANSWER: "No, the data is gone."
❌ BAD ANSWER 2: "Yes, just restore from backup." (This is Delta — there's no separate backup needed)
✅ GOOD ANSWER:
"Yes, absolutely — Delta's time travel makes this easy. Two hours ago is well within default retention. I'd run DESCRIBE HISTORY customers to find the version BEFORE the DELETE, then either: option A — RESTORE TABLE customers TO VERSION AS OF to restore the whole table, or option B — if we want to preserve changes AFTER the bad DELETE, do INSERT INTO customers SELECT * FROM customers VERSION AS OF to re-insert just the deleted rows. Option B is safer in production because it doesn't rewind other legitimate changes."
FOLLOW-UP TRAP: "What if it had been 10 days ago, not 2 hours?"
✅ GOOD ANSWER:
"Then it depends on retention settings. Default is 7 days for data file retention, 30 days for log retention. If VACUUM ran with default retention, the old data files would already be deleted from storage — even though the transaction log would still reference them. Time Travel would fail with FileNotFoundException. The prevention is setting delta.deletedFileRetentionDuration to something higher like 30 days for critical tables, or taking scheduled shallow clones as point-in-time snapshots."
INTERVIEWER: "Explain what happens on disk when you run UPDATE customers SET email = LOWER(email)."
❌ BAD ANSWER: "It updates the rows."
✅ GOOD ANSWER:
"Nothing is modified in place. Delta reads the Parquet files containing matching rows, writes NEW Parquet files with the lowercased emails, and writes a new JSON commit to _delta_log that has remove entries for all the old files and add entries for all the new files. The old files stay on disk until VACUUM runs, which is what enables time travel back to the pre-update version. Note: with Deletion Vectors enabled, UPDATE can be faster because it writes a small vector file marking which rows changed, avoiding the full file rewrite."

INTERVIEWER: "We have Job A doing hourly MERGE into table X, and Job B doing hourly OPTIMIZE on table X. They started conflicting. Why, and how do you fix it?"
❌ BAD ANSWER: "Run them at different times."
✅ GOOD ANSWER:
"This is an optimistic concurrency conflict. OPTIMIZE marks a set of files as 'removed' and adds one compacted file. If MERGE reads those same files and also tries to add/remove files in an overlapping set, Delta throws ConcurrentAppendException at commit time. Two fixes: First, OPTIMIZE is idempotent and retryable — wrap it with retry logic so a conflict just triggers a retry after the MERGE finishes. Second, partition the table so OPTIMIZE targets OLD partitions and MERGE targets the CURRENT partition — e.g., OPTIMIZE WHERE date < current_date(). That way they work on disjoint file sets and never conflict."
FINAL READINESS CHECK
If you can do all of these, you're video-free:
- Run LAB 1 end-to-end without looking at the answers
- Run LAB 2 and explain each MERGE clause from memory
- Run LAB 3 and explain why you see 11 files after OPTIMIZE (not 1)
- Draw the Optimistic Concurrency animation on a whiteboard
- Draw the Z-ORDER animation on a whiteboard
- Name all 6 Gotchas and how to fix each
- Answer all 4 mock interview questions using the "good answer" framework
If yes → you don't need a single video. You're ready.