Delta Lake Advanced Masterclass -- Never Get Caught Off Guard Again
💡 Interview Tip
Prep Time: 8-10 hours
Purpose: Covers every advanced Delta Lake topic interviewers ask about -- especially the ones that trip up experienced candidates
Covers: OPTIMIZE deep dive, Liquid Clustering vs Z-ORDER vs Partitioning, UniForm, Change Data Feed, Deletion Vectors, Predictive Optimization, Small File Problem, VACUUM Gotchas, Delta vs Iceberg vs Hudi
TABLE OF CONTENTS
- OPTIMIZE -- When It Helps and When It Hurts
- Liquid Clustering vs Z-ORDER vs Partitioning -- The Complete Decision Guide
- Deletion Vectors -- Internal Mechanics
- Change Data Feed (CDF) -- CDC with Delta Lake
- UniForm -- Universal Format
- Predictive Optimization
- The Small File Problem -- Root Causes and Real Solutions
- VACUUM -- Risks, Production Incidents, and Gotchas
- Delta Lake vs Apache Iceberg vs Apache Hudi
- Rapid-Fire Interview Questions with Traps
SECTION 1: OPTIMIZE -- When It Helps and When It Hurts
Q1: What exactly does OPTIMIZE do internally? Walk me through the mechanics.
Answer:
OPTIMIZE is a table maintenance command that compacts small files into larger, optimally sized files. But understanding the mechanics is what separates a good answer from a great one.
Step-by-step internal process:
📋 Overview

```sql
OPTIMIZE my_table
```
Step 1: Read the transaction log to get the current list of active files.
Step 2: Identify "small" files (below the target size threshold).
  - Default target: 1 GB per file (configurable via `spark.databricks.delta.optimize.maxFileSize`)
Step 3: Group small files by partition (if the table is partitioned).
Step 4: For each group:
  a. Read the small files
  b. Rewrite their contents into fewer, larger files (target ~1 GB each)
  c. Write the new Parquet files to storage
Step 5: Commit a new transaction log entry that marks the old small files as removed and the new compacted files as added. The old files stay on storage (to support time travel) until VACUUM deletes them.
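Conceptually, Steps 2-4 are a bin-packing problem: collect files below the target size and group them into compaction bins of roughly the target output size. The sketch below illustrates that idea with hypothetical file sizes and a simplified first-fit heuristic; it is not Delta Lake's actual implementation.

```python
# Sketch of OPTIMIZE-style compaction planning: group small files into
# bins of roughly the target output size (~1 GB each).
# Hypothetical sizes and a simplified first-fit heuristic -- not
# Delta Lake's real algorithm.

TARGET = 1 * 1024**3  # 1 GB target output file size


def plan_compaction(file_sizes, target=TARGET):
    """Return lists of file sizes; each list is rewritten as one larger file."""
    # Files already at or above the target are left alone.
    small = sorted(s for s in file_sizes if s < target)
    bins, current, total = [], [], 0
    for size in small:
        # Start a new bin once adding this file would exceed the target.
        if current and total + size > target:
            bins.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins


# Fifty 10 MB files plus one already-large 2 GB file:
sizes = [10 * 1024**2] * 50 + [2 * 1024**3]
plan = plan_compaction(sizes)
# The 2 GB file is skipped; the fifty small files (500 MB total) fit in one bin.
print(len(plan), sum(len(b) for b in plan))
```

Real OPTIMIZE also parallelizes the rewrite across the cluster and packs per partition, but the planning intuition is the same.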