Day 1: Delta Lake & Lakehouse — Quick Recall Guide
- ⚡ = Must remember (95% chance of being asked)
- 🔑 = Key concept (core understanding needed)
- ⚠️ = Common trap (interviewers love to test this)
- 🧠 = Memory Map (mnemonic/acronym — memorize this!)
- 📝 = One-liner (flash-card style — cover answer, test yourself)
🧠 MASTER MEMORY MAP — Day 1
SECTION 1: DELTA LAKE INTERNALS
🧠 Memory Map: Transaction Log
⚡ MUST KNOW DIRECT QUESTIONS (Cover the answer, test yourself!)
**What is Delta Lake?**
An open-source storage layer that adds ACID transactions, schema enforcement, and time travel on top of Parquet files in a data lake (such as ADLS Gen2).

**Where does the transaction log live?**
In the `_delta_log/` folder — a series of JSON commit files plus Parquet checkpoint files.

**What is the transaction log?**
A folder (`_delta_log/`) that records every change as a numbered JSON file. It's the single source of truth for the table's current state.

**What does ACID mean for Delta?**
Atomicity (all or nothing), Consistency (schema rules enforced), Isolation (readers don't see partial writes), Durability (committed data survives crashes).

**What file format does Delta use?**
Parquet. Delta Lake = Parquet files + transaction log. The log is what makes it "Delta."

**What is a checkpoint?**
A Parquet summary of the table state, created every 10 commits. Instead of reading 10,000 JSON files, Delta reads 1 checkpoint + the recent JSONs.

**What is `_last_checkpoint`?**
A small file that tells Delta which checkpoint is the latest — so it doesn't have to scan the entire `_delta_log/` folder.

**What is snapshot isolation?**
When you start reading a table, you see it as it was at that exact moment — even if someone writes new data while you're reading. You always get a consistent view.

**What is a write conflict?**
When two writers modify the same files. Example: Writers A and B both UPDATE rows in the same partition → one gets a `ConcurrentModificationException` and must retry.

**What is data skipping?**
Delta stores min/max statistics for each data file. When you query `WHERE booking_date = '2026-03-15'`, Delta skips files where min > March 15 or max < March 15, reading only the relevant files.

**How many columns get statistics?**
The first 32, configurable via `delta.dataSkippingNumIndexedCols`. ⚠️ Put your most-filtered columns FIRST in the schema!
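If a frequently filtered column sits beyond the first 32, the table property can be raised. A sketch (the table name `bookings` is illustrative):

```sql
-- Collect min/max statistics on the first 40 columns instead of the default 32
ALTER TABLE bookings
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');
```

Note that raising this adds stats-collection overhead on every write, so reordering the schema is usually the cheaper fix.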
🔑 MID-LEVEL QUESTIONS
**How does Delta read a table? (walk through it)**
- Read `_last_checkpoint` → find the latest checkpoint
- Read that checkpoint file (Parquet) → get the base state
- Read all JSON commits AFTER the checkpoint → apply the recent changes
- Result: the current list of valid Parquet data files
- Read only those Parquet files → return the query results
**What happens during a write?**
- Spark writes the new Parquet file(s) to the table folder
- Delta creates a new JSON commit in `_delta_log/` with an `"add"` action pointing to the new file(s)
- The commit is atomic — either the JSON file is fully written, or it's not
**How does DELETE work internally?**
- Delta identifies which Parquet files contain the rows to delete
- Reads those files, removes the matching rows, and writes NEW Parquet files with the remaining rows
- Creates a commit with `"remove"` (old files) + `"add"` (new files)
- The old files are NOT physically deleted — they stay until VACUUM cleans them up
**Does DELETE free disk space?**
NO! DELETE only marks files as "removed" in the log. The physical Parquet files remain on disk; you need VACUUM to physically delete them. This is why time travel works — the old files are still there.
**How does optimistic concurrency control resolve conflicts?**
- Two writers start at the same snapshot (version 5)
- Writer A commits first → version 6 (success)
- Writer B tries to commit → Delta checks: "Did A modify the same files I'm modifying?"
- If NO conflict (different files/partitions) → auto-resolve, Writer B becomes version 7
- If YES conflict (same files) → `ConcurrentModificationException` → retry from version 6
**What are the main actions in a commit file?**
| Action | Meaning |
|---|---|
| `add` | New Parquet file added |
| `remove` | Parquet file logically removed |
| `metaData` | Schema or table properties changed |
| `commitInfo` | Who, when, what operation |
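A minimal commit file might look like the following (each action is its own line — the log is newline-delimited JSON; paths, timestamps, and sizes are illustrative, and real commits carry extra metadata such as partition values and column stats):

```json
{"commitInfo": {"timestamp": 1742400000000, "operation": "WRITE"}}
{"add": {"path": "part-00000-abc.snappy.parquet", "size": 1048576, "dataChange": true}}
{"remove": {"path": "part-00000-old.snappy.parquet", "dataChange": true}}
```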
SECTION 2: MERGE INTO (UPSERT)
🧠 Memory Map: MERGE
⚡ MUST KNOW DIRECT QUESTIONS
**What is MERGE INTO?**
A single SQL command that does INSERT + UPDATE + DELETE in one atomic operation. Also called "upsert" (update + insert).

**What does "upsert" mean?**
Update existing rows if they match; insert new rows if they don't. MERGE does this in a single pass.

**Can MERGE also delete rows?**
Yes! Add `WHEN MATCHED AND source.deleted = true THEN DELETE` — three operations in one command.
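Putting all three clauses together — a sketch with illustrative names (`bookings`, `updates`, `booking_id`, and the soft-delete flag `deleted` are assumptions, not fixed API):

```sql
MERGE INTO bookings AS t
USING updates AS s
  ON t.booking_id = s.booking_id
WHEN MATCHED AND s.deleted = true THEN DELETE   -- source flags the row for removal
WHEN MATCHED THEN UPDATE SET *                  -- overwrite the existing row
WHEN NOT MATCHED THEN INSERT *;                 -- brand-new key → insert
```

`UPDATE SET *` / `INSERT *` copy all columns by name, which assumes source and target schemas line up.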
**What happens if multiple source rows match one target row?**
`UnsupportedOperationException` — "Cannot perform MERGE as multiple source rows matched." You MUST deduplicate the source first.

**How do you deduplicate the source?**
Use a `ROW_NUMBER()` window function to keep only the latest record per key:
```sql
WITH deduped AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY booking_id ORDER BY updated_at DESC
  ) AS rn FROM source_data
)
SELECT * FROM deduped WHERE rn = 1
```
**What is schema evolution in MERGE?**
When the source has NEW columns that the target doesn't have, MERGE can auto-add them. Enable with:

```sql
SET spark.databricks.delta.schema.autoMerge.enabled = true
```

Or, in the DataFrame API: `.option("mergeSchema", "true")`
🔑 MID-LEVEL QUESTIONS
**How does MERGE execute internally?**
- Join: inner-join the target with the source on the match condition
- Classify: each row is classified as "matched" or "not matched"
- Write: rewrite the affected data files with the updates + append new files for the inserts
**How do you speed up MERGE on a large table?**
MERGE must scan the ENTIRE target to find matches. Optimizations:
- Liquid Clustering on the merge key → scans only the relevant files (biggest win)
- Filter the source early → `MERGE INTO target USING (SELECT * FROM source WHERE date = today)`
- OPTIMIZE before MERGE → fewer small files to scan
- Low shuffle merge → reduces data movement between nodes
- Partition by the merge key → prunes irrelevant partitions (older approach)
**How does the SCD Type 2 merge_key trick work?**
Problem: in SCD Type 2, when a record changes, you need to do two things at once:
- Close the old row (set `end_date`, `is_current = false`)
- Insert a new row (with the new values, `is_current = true`)

A single MERGE clause can't both update and insert for the same match, so you feed MERGE a source with a synthetic `merge_key` column:
- For rows to UPDATE: `merge_key = booking_id` (matches the target)
- For rows to INSERT: `merge_key = NULL` (never matches → falls through to NOT MATCHED → INSERT)
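A simplified sketch of the trick (all names — `bookings_scd2`, `staged_updates`, `booking_id`, `hotel` — are illustrative; in practice the NULL-key branch is usually restricted to rows whose values actually changed):

```sql
MERGE INTO bookings_scd2 AS t
USING (
  -- Branch 1: real key → WHEN MATCHED → closes the current version
  SELECT u.booking_id AS merge_key, u.* FROM staged_updates u
  UNION ALL
  -- Branch 2: NULL key → never matches → WHEN NOT MATCHED → inserts the new version
  SELECT NULL AS merge_key, u.* FROM staged_updates u
) AS s
ON t.booking_id = s.merge_key AND t.is_current = true
WHEN MATCHED AND t.hotel <> s.hotel THEN
  UPDATE SET t.is_current = false, t.end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (booking_id, hotel, start_date, end_date, is_current)
  VALUES (s.booking_id, s.hotel, current_date(), NULL, true);
```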
SECTION 3: OPTIMIZE, VACUUM, Z-ORDER, LIQUID CLUSTERING
🧠 Memory Map: File Performance
⚡ MUST KNOW DIRECT QUESTIONS
**What does OPTIMIZE do?**
Compacts many small Parquet files into fewer large files (target: ~1 GB each). Reads get faster because there are fewer files to open.

**What does VACUUM do?**
Physically deletes old Parquet files that are no longer referenced by the transaction log. Frees up storage.

**What is the default VACUUM retention?**
7 days (168 hours). Files older than this are deleted. ⚠️ Setting it lower breaks time travel!
**Can you VACUUM with zero retention?**
Yes, but it's DANGEROUS — it breaks time travel and concurrent reads. Only use it for GDPR "right to be forgotten":

```sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM bookings RETAIN 0 HOURS;
```
**What does Z-ORDER do?**
Sorts data within files by specific columns so that similar values are stored together. When you filter by those columns, Delta reads far fewer files.

**What is Liquid Clustering?**
The replacement for partitioning + Z-ORDER (Databricks 2024+). It automatically maintains the data layout — no manual `OPTIMIZE ... ZORDER` needed. You just define clustering keys at table creation.
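Declaring clustering keys, sketched with illustrative table and column names:

```sql
-- Define clustering once at creation; Databricks maintains the layout from then on
CREATE TABLE bookings (
  booking_id   BIGINT,
  booking_date DATE,
  customer_id  BIGINT
) CLUSTER BY (booking_date, customer_id);

-- Changing keys later is a lightweight operation — no full table rewrite
ALTER TABLE bookings CLUSTER BY (customer_id);
```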
**Can Liquid Clustering be combined with partitioning?**
NO! They are mutually exclusive. Liquid Clustering replaces partitioning — you must choose one or the other.

**What are deletion vectors?**
Instead of rewriting an entire Parquet file to delete a few rows, Delta marks the rows as "deleted" in a small side file. The data file stays untouched. Much faster deletes/updates.

**Are deletion vectors on by default?**
Yes, on Databricks (since 2024). They're controlled by a table property: `delta.enableDeletionVectors = true`.
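Enabling them explicitly on an existing table uses the property named above (table name illustrative):

```sql
ALTER TABLE bookings
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');
```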
🔑 MID-LEVEL QUESTIONS
**OPTIMIZE vs VACUUM?**
| | OPTIMIZE | VACUUM |
|---|---|---|
| What | Merges small files → big files | Deletes old unused files |
| Goal | Faster reads | Save storage space |
| Data loss? | Never | Old versions become unreadable |
| When to run | After many small writes | After OPTIMIZE (clean up old files) |
| Best practice | Run daily | Run weekly |
**Z-ORDER vs Liquid Clustering?**
| | Z-ORDER | Liquid Clustering |
|---|---|---|
| Era | Old (before 2024) | New (2024+, recommended) |
| Setup | Run manually: `OPTIMIZE ... ZORDER BY` | Define once: `CLUSTER BY` at table creation |
| Maintenance | Must re-run after every write | Auto-maintains incrementally |
| With partitioning? | Yes, works with partitions | NO — replaces partitioning |
| Column changes | Must rewrite the entire table | `ALTER TABLE ... CLUSTER BY (new_cols)` — easy |
| Use when | Legacy tables, can't migrate | All new tables (always prefer this) |
**When should you NOT partition a table?**
- When the partition column has HIGH cardinality (>10,000 values) → too many tiny folders
- When the table is small (<1 TB) → partitioning adds overhead
- When using Liquid Clustering → they're mutually exclusive
- Rule of thumb: each partition should be >1 GB. If not, don't partition.
**Why are many small files a problem?**
- Each file needs a separate read operation → high I/O overhead
- More files = more metadata in the transaction log → slower query planning
- Cloud storage (ADLS) charges per API call → more files = higher cost
- Fix: OPTIMIZE (compaction), or Auto Loader with trigger-based batching
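The two compaction forms side by side (table and column names illustrative):

```sql
OPTIMIZE bookings;                           -- compact small files toward ~1 GB targets
OPTIMIZE bookings ZORDER BY (customer_id);   -- legacy: compact AND co-locate by column
```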
SECTION 4: TIME TRAVEL & RECOVERY
🧠 Memory Map: Time Travel
⚡ MUST KNOW DIRECT QUESTIONS
**What is time travel?**
The ability to query previous versions of a table using version numbers or timestamps. It works because old Parquet files are kept until VACUUM removes them.

**How do you query an old version?**
`SELECT * FROM bookings VERSION AS OF 5;` or `SELECT * FROM bookings TIMESTAMP AS OF '2026-03-20';`

**How do you see a table's history?**
`DESCRIBE HISTORY bookings;` — shows every version, timestamp, operation, and user.

**How do you roll back a bad write?**
`RESTORE TABLE bookings TO VERSION AS OF 5;` — rolls back to version 5. This creates a NEW version (safe — it doesn't rewrite history).

**When does time travel stop working?**
After VACUUM runs — it physically deletes the old files. Default retention is 7 days. If you VACUUM with 0 hours, ALL old versions are gone immediately.
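The three commands combine into a typical recovery flow (table name and version number are illustrative):

```sql
DESCRIBE HISTORY bookings;                   -- 1. find the last good version
SELECT * FROM bookings VERSION AS OF 5;      -- 2. verify that version looks right
RESTORE TABLE bookings TO VERSION AS OF 5;   -- 3. roll back (appends a new version)
```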
SECTION 5: LAKEHOUSE ARCHITECTURE
🧠 Memory Map: Lakehouse
⚡ MUST KNOW DIRECT QUESTIONS
**What is a Lakehouse?**
A single platform combining the cheap storage of a data lake with the ACID transactions, schema enforcement, and governance of a data warehouse. No need for a separate lake + warehouse.

**What are the key components of the Databricks Lakehouse?**
- Delta Lake → ACID transactions on data lake files
- Photon Engine → fast SQL queries (warehouse speed on lake data)
- Unity Catalog → governance, security, lineage
- Serverless SQL Warehouses → on-demand compute, no cluster management
**Managed vs external tables?**
| | Managed | External |
|---|---|---|
| Data location | Databricks controls | You control (your ADLS path) |
| DROP TABLE | Deletes data + metadata | Deletes metadata ONLY, data stays |
| Use when | Most cases (simpler) | Data shared across platforms |
SECTION 6: NEW 2025-2026 FEATURES
🧠 Memory Map: What's New
⚡ MUST KNOW DIRECT QUESTIONS
**What is Predictive Optimization?**
Databricks automatically runs OPTIMIZE and VACUUM based on table usage patterns — you don't schedule them manually anymore. Requires Unity Catalog.

**What is Lakebase?**
A new feature (GA March 2026) that lets Delta tables handle OLTP workloads (point lookups, single-row updates) — like a traditional database (PostgreSQL) but built on Delta Lake.

**What are multi-table transactions?**
`BEGIN ATOMIC ... END` — allows changes to multiple tables in a single atomic transaction. Either ALL tables update or NONE do. New in Databricks 2025.

**What is Iceberg interoperability in Unity Catalog?**
Allows external tools (that only speak Iceberg/Hive) to read your Delta tables without conversion. Unity Catalog rewrites the metadata on the fly.

**What is the VARIANT data type?**
A new data type in Delta Lake 4.x that stores semi-structured JSON data natively — no need to stringify JSON. Faster queries on nested JSON fields.
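A sketch of VARIANT usage (the `events` table and its fields are illustrative); Databricks SQL parses JSON into a variant with `PARSE_JSON` and navigates fields with the `:` operator:

```sql
CREATE TABLE events (id BIGINT, payload VARIANT);

INSERT INTO events
SELECT 1, PARSE_JSON('{"user": {"id": 42}, "type": "click"}');

-- Navigate nested fields directly; cast to a concrete type for comparisons
SELECT payload:user.id::BIGINT AS user_id
FROM events
WHERE payload:type::STRING = 'click';
```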
🧠 FINAL REVISION — Day 1 Summary Card
┌─────────────────────────────────────────────────────────────┐
│ DAY 1: DELTA LAKE │
├─────────────────────────────────────────────────────────────┤
│ │
│ DELTA = Parquet files + Transaction Log (_delta_log/) │
│ Transaction Log = JSON commits + Checkpoints (every 10) │
│ ACID = Atomicity, Consistency, Isolation, Durability │
│ Concurrency = Optimistic (assume no conflict, check later) │
│ │
│ MERGE = Match→Update, No Match→Insert ("MU-NI") │
│ ⚠️ Source must be deduplicated (ROW_NUMBER trick) │
│ SCD2 merge_key trick = NULL key for new rows → INSERT │
│ 6 optimizations = "FSCPZL" │
│ │
│ OPTIMIZE = compact small files (run daily) │
│ VACUUM = delete old files (run weekly, default 7 days) │
│ Z-ORDER = sort by columns (OLD way) │
│ Liquid Clustering = auto-sort (NEW way, replaces Z+Part) │
│ Deletion Vectors = mark rows deleted without rewriting │
│ │
│ Time Travel = VERSION AS OF / TIMESTAMP AS OF │
│ RESTORE = undo mistakes (creates new version, safe) │
│ ⚠️ Only works within VACUUM retention (default 7 days) │
│ │
│ Lakehouse = Lake PRICE + Warehouse FEATURES │
│ New: Predictive Opt, Lakebase, Multi-table Tx, Variant │
│ │
│ TOP 5 THINGS TO SAY IN INTERVIEW: │
│ 1. "Transaction log ensures ACID on cloud storage" │
│ 2. "MERGE with Liquid Clustering for fast upserts" │
│ 3. "Liquid Clustering replaces partitioning + Z-ORDER" │
│ 4. "VACUUM retention balances storage vs time travel" │
│ 5. "Predictive Optimization auto-manages file layout" │
│ │
└─────────────────────────────────────────────────────────────┘
- First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
- Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
- Before interview (15 min): Read ONLY the Final Revision Summary Card