🧱 Databricks · Section 10 of 17

Day 1: Delta Lake & Lakehouse — Quick Recall Guide

🗺️ Memory Map
How to use this file:
  • ⚡ = Must remember (95% chance of being asked)
  • 🔑 = Key concept (core understanding needed)
  • ⚠️ = Common trap (interviewers love to test this)
  • 🧠 = Memory Map (mnemonic/acronym — memorize this!)
  • 📝 = One-liner (flash-card style — cover answer, test yourself)
Reading strategy: Read Memory Maps FIRST → then Direct Questions → then Mid-Level. Memory Maps give you the skeleton. Questions fill in the details.

🧠 MASTER MEMORY MAP — Day 1

🧠 DELTA LAKE = "ACTS on your Data Lake"
  • A = ACID transactions (Atomicity, Consistency, Isolation, Durability)
  • C = Checkpoints (summary every 10 commits)
  • T = Transaction log (_delta_log/ folder = the brain)
  • S = Schema enforcement (rejects bad data)

File Performance = "OVZLC" (Old Way → New Way)
  • O = OPTIMIZE (compacts small files into big files)
  • V = VACUUM (deletes old unused files)
  • Z = Z-ORDER (sorts data for fast lookups — OLD way)
  • L = Liquid Clustering (REPLACES partitioning + Z-ORDER — NEW way)
  • C = Clustering keys (the columns you cluster by)

Time Travel = "VTA"
  • V = Version number (VERSION AS OF 5)
  • T = Timestamp (TIMESTAMP AS OF '2026-03-20')
  • A = Action to recover (RESTORE TABLE ... TO VERSION AS OF)

Lakehouse = "Lake + Warehouse = Best of Both"
  • Lake = cheap storage, any format, schema-on-read
  • Warehouse = ACID, SQL, schema enforcement, fast queries
  • Lakehouse = all of the above on ONE platform

SECTION 1: DELTA LAKE INTERNALS

🧠 Memory Map: Transaction Log

🗂️ _delta_log/ = "The BRAIN of Delta Lake"
  • JSON files = individual commits (diary entries)
  • Checkpoint = summary every 10 commits (monthly bank statement)
  • _last_checkpoint = pointer to latest summary
Remember: "JC-L" = JSON → Checkpoint → Last_checkpoint
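For intuition, a hypothetical bookings table's folder might look like this (file names illustrative; checkpoints are written as Parquet):

```
bookings/
├── part-00000-....snappy.parquet          ← data files (plain Parquet)
├── part-00001-....snappy.parquet
└── _delta_log/
    ├── 00000000000000000000.json          ← commit 0
    ├── ...
    ├── 00000000000000000009.json          ← commit 9
    ├── 00000000000000000010.checkpoint.parquet   ← summary at commit 10
    └── _last_checkpoint                   ← pointer to latest checkpoint
```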

⚡ MUST KNOW DIRECT QUESTIONS (Cover the answer, test yourself!)

Q1: What is Delta Lake?

An open-source storage layer that adds ACID transactions, schema enforcement, and time travel on top of Parquet files in a data lake (like ADLS Gen2).

Q2: Where does Delta Lake store its metadata?

In the _delta_log/ folder — a series of JSON commit files + Parquet checkpoint files.

Q3: What is the transaction log?

A folder (_delta_log/) that records every change as a numbered JSON file. It's the single source of truth for the table's current state.

Q4: What are the 4 properties of ACID?

Atomicity (all or nothing), Consistency (schema rules enforced), Isolation (readers don't see partial writes), Durability (committed data survives crashes).

Q5: What format are Delta data files stored in?

Parquet format. Delta Lake = Parquet files + transaction log. The log is what makes it "Delta."

Q6: What is a checkpoint file?

A Parquet summary of the table state, created every 10 commits. Instead of reading 10,000 JSON files, read 1 checkpoint + recent JSONs.

Q7: What is _last_checkpoint?

A small file that tells Delta which checkpoint is the latest — so it doesn't have to scan the entire _delta_log/ folder.

Q8: What is snapshot isolation?

When you start reading a table, you see it as it was at that exact moment — even if someone writes new data while you're reading. You always get a consistent view.

Q9: What is optimistic concurrency control?

Delta assumes no conflict, lets multiple writers work in parallel, and checks for conflicts only at commit time. If a conflict occurs, the losing writer retries automatically.
⚠️ Q10: When does optimistic concurrency FAIL (conflict)?

When two writers modify the same files. Example: Writer A and B both UPDATE rows in the same partition → one gets a ConcurrentModificationException and must retry.

Q11: What is data skipping?

Delta stores min/max statistics for each data file. When you query WHERE booking_date = '2026-03-15', Delta skips files where min > March 15 or max < March 15. Reads only relevant files.

Q12: How many columns have statistics by default?

First 32 columns. Configurable via delta.dataSkippingNumIndexedCols. ⚠️ Put your most-filtered columns FIRST in schema!
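As a sketch, the indexed-column count can be raised as a table property (table name `bookings` is illustrative):

```sql
-- Collect min/max statistics for the first 48 columns instead of the default 32
ALTER TABLE bookings
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '48');
```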

🔑 MID-LEVEL QUESTIONS

Q13: How does Delta read a table? (Step by step)

  1. Read _last_checkpoint → find latest checkpoint
  2. Read that checkpoint file (Parquet) → get base state
  3. Read all JSON commits AFTER the checkpoint → apply recent changes
  4. Result: current list of valid Parquet data files
  5. Read only those Parquet files → return query results

Q14: What happens internally when you INSERT data?

  1. Spark writes new Parquet file(s) to the table folder
  2. Delta creates a new JSON commit in _delta_log/ with "add" action pointing to the new file(s)
  3. The commit is atomic — either the JSON file is fully written, or it's not
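Schematically, such a commit file holds one JSON action per line, along these lines (fields abbreviated, values illustrative):

```
{"commitInfo": {"timestamp": 1742457600000, "operation": "WRITE"}}
{"add": {"path": "part-00002-abc.snappy.parquet", "size": 1048576,
         "dataChange": true, "stats": "{\"numRecords\":1000}"}}
```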

Q15: What happens internally when you DELETE data?

  1. Delta identifies which Parquet files contain the rows to delete
  2. Reads those files, removes matching rows, writes NEW Parquet files with remaining rows
  3. Creates a commit with "remove" (old files) + "add" (new files)
  4. Old files are NOT physically deleted — they stay until VACUUM cleans them

⚠️ Q16: Does DELETE physically remove files?

NO! DELETE only marks files as "removed" in the log. The physical Parquet files remain on disk. You need VACUUM to physically delete them. This is why time travel works — old files are still there.

Q17: How does Delta handle concurrent writes? (Conflict resolution)

  • Two writers start at same snapshot (version 5)
  • Writer A commits first → version 6 (success)
  • Writer B tries to commit → Delta checks: "Did A modify the same files I'm modifying?"
    • If NO conflict (different files/partitions) → auto-resolve, Writer B becomes version 7
    • If YES conflict (same files) → ConcurrentModificationException → retry from version 6

Q18: What are the 4 types of actions in a commit JSON?

  Action       Meaning
  add          New Parquet file added
  remove       Parquet file logically removed
  metaData     Schema or table properties changed
  commitInfo   Who, when, what operation

SECTION 2: MERGE INTO (UPSERT)

🧠 Memory Map: MERGE

🧠 MERGE = "Match → Update, No Match → Insert"

  MERGE INTO target USING source      -- Compare target with source
  ON target.id = source.id            -- Match condition
  WHEN MATCHED THEN UPDATE SET ...    -- Found? → Update
  WHEN NOT MATCHED THEN INSERT ...    -- Not found? → Insert

Remember: "MU-NI" = Matched → Update, NotMatched → Insert

6 MERGE Optimizations = "FSCPZL"
  • F = Filter early (WHERE clause in source subquery)
  • S = Small file compaction (OPTIMIZE before MERGE)
  • C = Cluster by merge key (Liquid Clustering)
  • P = Partition pruning (match on partition column)
  • Z = Z-ORDER on merge key (if not using Liquid Clustering)
  • L = Low shuffle merge (set spark.databricks.delta.merge.enableLowShuffle = true)

⚡ MUST KNOW DIRECT QUESTIONS

Q19: What is MERGE INTO?

A single SQL command that does INSERT + UPDATE + DELETE in one atomic operation. Also called "upsert" (update + insert).

Q20: What is an upsert?

Update existing rows if they match, Insert new rows if they don't. MERGE does this in a single pass.

Q21: Can MERGE also DELETE rows?

Yes! Add WHEN MATCHED AND source.deleted = true THEN DELETE — three operations in one command.
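A sketch of all three operations in one statement (table and column names, e.g. bookings / bookings_updates, are illustrative):

```sql
MERGE INTO bookings AS t
USING bookings_updates AS s
  ON t.booking_id = s.booking_id
WHEN MATCHED AND s.deleted = true THEN
  DELETE                                          -- matched + flagged → remove
WHEN MATCHED THEN
  UPDATE SET t.status = s.status,
             t.updated_at = s.updated_at          -- matched → update in place
WHEN NOT MATCHED THEN
  INSERT (booking_id, status, updated_at)
  VALUES (s.booking_id, s.status, s.updated_at);  -- new key → insert
```

Clause order matters: the more specific WHEN MATCHED AND ... condition must come before the plain WHEN MATCHED.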

⚠️ Q22: What happens if the source has duplicate keys in MERGE?

Error! Delta throws UnsupportedOperationException — "Cannot perform MERGE as multiple source rows matched." You MUST deduplicate the source first.
Q23: How to fix duplicate keys in source?

Use ROW_NUMBER() window function to keep only the latest record per key:

```sql
WITH deduped AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY booking_id ORDER BY updated_at DESC
  ) AS rn FROM source_data
)
SELECT * FROM deduped WHERE rn = 1
```

Q24: What is schema evolution in MERGE?

When source has NEW columns that target doesn't have, MERGE can auto-add them. Enable with:
  SET spark.databricks.delta.schema.autoMerge.enabled = true;
or .option("mergeSchema", "true")

🔑 MID-LEVEL QUESTIONS

Q25: How does MERGE work internally? (3 steps)

  1. Join: Inner join target with source on the match condition
  2. Classify: Each row is classified as "matched" or "not matched"
  3. Write: Rewrite affected data files with updates + append new files for inserts

Q26: Why is MERGE slow on large tables? How to optimize?

MERGE must scan the ENTIRE target to find matches. Optimizations:

  • Liquid Clustering on merge key → scans only relevant files (biggest win)
  • Filter source early → MERGE INTO target USING (SELECT * FROM source WHERE date = today)
  • OPTIMIZE before MERGE → fewer small files to scan
  • Low shuffle merge → reduces data movement between nodes
  • Partition by merge key → prunes irrelevant partitions (old approach)

⚠️ Q27: What is the merge_key trick for SCD Type 2?

Problem: In SCD Type 2, when a record changes, you need to:

  1. Close the old row (set end_date, is_current = false)
  2. Insert a new row (with new values, is_current = true)
But standard MERGE can't INSERT and UPDATE for the SAME key!
Trick: Create a merge_key column:
  • For rows to UPDATE: merge_key = booking_id (matches target)
  • For rows to INSERT: merge_key = NULL (never matches → goes to NOT MATCHED → INSERT)
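A sketch of the trick (schema is illustrative: a bookings_dim dimension with is_current/end_date, staged changes in updates):

```sql
MERGE INTO bookings_dim AS t
USING (
  -- Pass 1: all updates — key matches CLOSE the current row, misses INSERT new keys
  SELECT u.booking_id AS merge_key, u.* FROM updates u
  UNION ALL
  -- Pass 2: changed EXISTING rows again with NULL key → never match → INSERT new version
  SELECT NULL AS merge_key, u.*
  FROM updates u
  JOIN bookings_dim d ON u.booking_id = d.booking_id
  WHERE d.is_current = true AND u.status <> d.status
) AS s
  ON t.booking_id = s.merge_key AND t.is_current = true
WHEN MATCHED AND t.status <> s.status THEN
  UPDATE SET t.is_current = false, t.end_date = s.updated_at   -- close old row
WHEN NOT MATCHED THEN
  INSERT (booking_id, status, start_date, end_date, is_current)
  VALUES (s.booking_id, s.status, s.updated_at, NULL, true);   -- open new row
```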

SECTION 3: OPTIMIZE, VACUUM, Z-ORDER, LIQUID CLUSTERING

🧠 Memory Map: File Performance

🧠 File Performance: OPTIMIZE → VACUUM → Z-ORDER → Liquid Clustering

THE SMALL FILE PROBLEM
  • 1000 tiny files (1 MB each) = SLOW to read (too many file opens)
  • 1 huge file (1 TB) = SLOW to read (can't parallelize)
  • Sweet spot = files between 128 MB and 1 GB

OPTIMIZE = "Combines small files into big files"
  OPTIMIZE bookings;  -- Compacts ALL small files
  Result: 1000 tiny files → 10 large files

VACUUM = "Deletes old unused files from disk"
  VACUUM bookings RETAIN 168 HOURS;  -- Delete files older than 7 days
  Default retention: 7 days (168 hours)
  ⚠️ VACUUM < 7 days = BREAKS time travel!

Z-ORDER = "Sorts data by columns for fast lookups" (OLD way)
  OPTIMIZE bookings ZORDER BY (airport_code, booking_date);
  Good for: 1-4 columns that you filter on often

LIQUID CLUSTERING = "Smart Z-ORDER that auto-maintains" (NEW way — 2024+)
  CREATE TABLE bookings CLUSTER BY (airport_code, booking_date);
  ⚠️ REPLACES both partitioning AND Z-ORDER
  ⚠️ Cannot use Liquid Clustering WITH partitioning or Z-ORDER

Remember: "OV-ZL" = Optimize, Vacuum, Z-Order → Liquid (old → new)

⚡ MUST KNOW DIRECT QUESTIONS

Q28: What is OPTIMIZE?

Compacts many small Parquet files into fewer large files (target: ~1 GB each). Faster reads because fewer files to open.

Q29: What is VACUUM?

Physically deletes old Parquet files that are no longer referenced by the transaction log. Frees up storage.

Q30: What is the default VACUUM retention?

7 days (168 hours). Files older than this are deleted. ⚠️ Setting it lower breaks time travel!

⚠️ Q31: Can you VACUUM with 0 hours retention?

Yes, but DANGEROUS — breaks time travel and concurrent reads. Only use for GDPR "right to be forgotten":

```sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM bookings RETAIN 0 HOURS;
```

Q32: What is Z-ORDER?

Sorts data within files by specific columns so that similar values are stored together. When you filter by those columns, Delta reads far fewer files.

Q33: What is Liquid Clustering?

The new replacement for partitioning + Z-ORDER (Databricks 2024+). Automatically maintains data layout — no manual OPTIMIZE ZORDER needed. You just define clustering keys at table creation.

⚠️ Q34: Can you use Liquid Clustering with partitioning?

NO! They are mutually exclusive. Liquid Clustering replaces partitioning. You must choose one or the other.

Q35: What are Deletion Vectors?

Instead of rewriting an entire Parquet file to delete a few rows, Delta marks rows as "deleted" in a small side file. The data file stays untouched. Much faster deletes/updates.

Q36: Are Deletion Vectors enabled by default?

Yes, on Databricks (since 2024). They're a table property: delta.enableDeletionVectors = true.
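To set the property explicitly on an existing table (table name illustrative):

```sql
-- Enable deletion vectors so DELETE/UPDATE mark rows instead of rewriting files
ALTER TABLE bookings
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
```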

🔑 MID-LEVEL QUESTIONS

Q37: OPTIMIZE vs VACUUM — what's the difference?

                  OPTIMIZE                         VACUUM
  What            Merges small files → big files   Deletes old unused files
  Goal            Faster reads                     Save storage space
  Data loss?      Never                            Old versions become unreadable
  When to run     After many small writes          After OPTIMIZE (clean up old files)
  Best practice   Run daily                        Run weekly

Q38: Z-ORDER vs Liquid Clustering — when to use which?

                       Z-ORDER                                Liquid Clustering
  Era                  Old (before 2024)                      New (2024+, recommended)
  Setup                Run manually: OPTIMIZE ... ZORDER BY   Define once: CLUSTER BY at table creation
  Maintenance          Must re-run after every write          Auto-maintains incrementally
  With partitioning?   Yes, works with partitions             NO — replaces partitioning
  Column changes       Must rewrite entire table              ALTER TABLE ... CLUSTER BY (new_cols) — easy
  Use when             Legacy tables, can't migrate           All new tables (always prefer this)

Q39: When should you NOT use partitioning?

  • When partition column has HIGH cardinality (>10,000 values) → too many tiny folders
  • When table is small (<1 TB) → partitioning adds overhead
  • When using Liquid Clustering → they're mutually exclusive
  • Rule of thumb: Each partition should be >1 GB. If not, don't partition.

⚠️ Q40: What's wrong with too many small files? (The Small File Problem)

  • Each file needs a separate read operation → high I/O overhead
  • More files = more metadata in the transaction log → slower planning
  • Cloud storage (ADLS) charges per API call → more files = higher cost
  • Fix: OPTIMIZE (compaction) or Auto Loader with trigger-based batching

SECTION 4: TIME TRAVEL & RECOVERY

🧠 Memory Map: Time Travel

🧠 TIME TRAVEL = "Read old versions of a table"

Two ways to travel back:
  • VERSION: SELECT * FROM bookings VERSION AS OF 5;
  • TIME: SELECT * FROM bookings TIMESTAMP AS OF '2026-03-20';

DESCRIBE HISTORY = "See all versions and what changed"
  DESCRIBE HISTORY bookings;
  Shows: version, timestamp, operation, who did it

RESTORE = "Undo mistakes"
  RESTORE TABLE bookings TO VERSION AS OF 5;
  ⚠️ This creates a NEW version (doesn't delete history)

Remember: "VTR" = Version, Timestamp, Restore
⚠️ TRAP: Time travel only works within the VACUUM retention period!
Default = 7 days. After VACUUM runs → old versions are GONE forever.

⚡ MUST KNOW DIRECT QUESTIONS

Q41: What is time travel in Delta Lake?

Ability to query previous versions of a table using version numbers or timestamps. Works because old Parquet files are kept until VACUUM removes them.

Q42: How to query a specific version?

SELECT * FROM bookings VERSION AS OF 5; or TIMESTAMP AS OF '2026-03-20';

Q43: How to see all versions of a table?

DESCRIBE HISTORY bookings; — shows every version, timestamp, operation, and user.

Q44: How to recover from accidental deletion?

RESTORE TABLE bookings TO VERSION AS OF 5; — rolls back to version 5. This creates a NEW version (safe, doesn't rewrite history).
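The full recovery workflow, as a sketch (version 5 and table name illustrative):

```sql
-- 1. Find the last good version
DESCRIBE HISTORY bookings;

-- 2. Sanity-check the data at that version BEFORE restoring
SELECT COUNT(*) FROM bookings VERSION AS OF 5;

-- 3. Roll the table back (recorded as a new version in the history)
RESTORE TABLE bookings TO VERSION AS OF 5;
```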

⚠️ Q45: When does time travel stop working?

After VACUUM runs — it physically deletes old files. Default retention is 7 days. If you VACUUM with 0 hours, ALL old versions are gone immediately.

SECTION 5: LAKEHOUSE ARCHITECTURE

🧠 Memory Map: Lakehouse

  DATA LAKE         DATA WAREHOUSE      LAKEHOUSE
  ─────────         ──────────────      ─────────
  Cheap storage     Expensive storage   Cheap storage ✓
  Any format        Structured only     Any format ✓
  No ACID           Full ACID           Full ACID ✓
  Schema-on-read    Schema-on-write     Both ✓
  Slow queries      Fast queries        Fast queries (Photon) ✓
  No governance     Full governance     Full governance (Unity Catalog) ✓
  Good for ML       Bad for ML          Good for ML ✓
  Bad for BI        Good for BI         Good for BI ✓
Remember: "Lakehouse = Lake PRICE + Warehouse FEATURES"

⚡ MUST KNOW DIRECT QUESTIONS

Q46: What is a Data Lakehouse?

A single platform combining the cheap storage of a data lake with the ACID transactions, schema enforcement, and governance of a data warehouse. No need for separate lake + warehouse.

Q47: What makes Lakehouse possible? (4 technologies)

  1. Delta Lake → ACID transactions on data lake files
  2. Photon Engine → Fast SQL queries (warehouse-speed on lake data)
  3. Unity Catalog → Governance, security, lineage
  4. Serverless SQL Warehouses → On-demand compute, no cluster management

Q48: Managed Table vs External Table?

                  Managed                    External
  Data location   Databricks controls        You control (your ADLS path)
  DROP TABLE      Deletes data + metadata    Deletes metadata ONLY, data stays
  Use when        Most cases (simpler)       Data shared across platforms

SECTION 6: NEW 2025-2026 FEATURES

🧠 Memory Map: What's New

NEW FEATURES = "PLMCV"
  • P = Predictive Optimization (auto OPTIMIZE + VACUUM)
  • L = Lakebase (OLTP on Delta Lake — like PostgreSQL)
  • M = Multi-table Transactions (BEGIN ATOMIC ... END)
  • C = Compatibility Mode (Iceberg clients read Delta tables)
  • V = Variant type (store JSON natively in Delta columns)
Remember: "PLM-CV" = Product Lifecycle Management for your CV (career!)

⚡ MUST KNOW DIRECT QUESTIONS

Q49: What is Predictive Optimization?

Databricks automatically runs OPTIMIZE and VACUUM based on table usage patterns. You don't schedule these manually anymore. Unity Catalog required.

Q50: What is Lakebase?

A new feature (GA March 2026) that allows Delta tables to handle OLTP workloads (point lookups, single-row updates) — like a traditional database (PostgreSQL) but built on Delta Lake.

Q51: What are Multi-table Transactions?

BEGIN ATOMIC ... END — allows changes to multiple tables in a single atomic transaction. Either ALL tables update or NONE do. New in Databricks 2025.
Q52: What is Compatibility Mode?

Allows external tools (that only speak Iceberg/Hive) to read your Delta tables without conversion. Unity Catalog rewrites metadata on-the-fly.

Q53: What is the Variant type?

A new data type in Delta Lake 4.x that stores semi-structured JSON data natively — no need to stringify JSON. Faster queries on nested JSON fields.

🧠 FINAL REVISION — Day 1 Summary Card

📐 Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│                    DAY 1: DELTA LAKE                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  DELTA = Parquet files + Transaction Log (_delta_log/)      │
│  Transaction Log = JSON commits + Checkpoints (every 10)    │
│  ACID = Atomicity, Consistency, Isolation, Durability       │
│  Concurrency = Optimistic (assume no conflict, check later) │
│                                                             │
│  MERGE = Match→Update, No Match→Insert ("MU-NI")           │
│  ⚠️ Source must be deduplicated (ROW_NUMBER trick)           │
│  SCD2 merge_key trick = NULL key for new rows → INSERT      │
│  6 optimizations = "FSCPZL"                                 │
│                                                             │
│  OPTIMIZE = compact small files (run daily)                 │
│  VACUUM = delete old files (run weekly, default 7 days)     │
│  Z-ORDER = sort by columns (OLD way)                        │
│  Liquid Clustering = auto-sort (NEW way, replaces Z+Part)   │
│  Deletion Vectors = mark rows deleted without rewriting     │
│                                                             │
│  Time Travel = VERSION AS OF / TIMESTAMP AS OF              │
│  RESTORE = undo mistakes (creates new version, safe)        │
│  ⚠️ Only works within VACUUM retention (default 7 days)     │
│                                                             │
│  Lakehouse = Lake PRICE + Warehouse FEATURES                │
│  New: Predictive Opt, Lakebase, Multi-table Tx, Variant    │
│                                                             │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                          │
│  1. "Transaction log ensures ACID on cloud storage"         │
│  2. "MERGE with Liquid Clustering for fast upserts"         │
│  3. "Liquid Clustering replaces partitioning + Z-ORDER"     │
│  4. "VACUUM retention balances storage vs time travel"      │
│  5. "Predictive Optimization auto-manages file layout"      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Study tip: Read this file TWICE:
  1. First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
  2. Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
  3. Before interview (15 min): Read ONLY the Final Revision Summary Card