Databricks · Section 4 of 17

File 03: Delta Lake & Lakehouse Architecture
Level: Senior/Lead (10+ years) — Deep internals and optimization
Focus: Transaction log, MERGE scenarios, optimization, architecture decisions

SECTION 1: DELTA LAKE INTERNALS

Q1: What is the Delta Lake transaction log (_delta_log)? Explain how it ensures ACID transactions.

Answer: The _delta_log/ directory is an ordered record of every transaction performed on a Delta table.

Structure:

my_table/
  _delta_log/
    00000000000000000000.json                ← Commit 0
    00000000000000000001.json                ← Commit 1
    ...
    00000000000000000010.checkpoint.parquet  ← Checkpoint at commit 10
    _last_checkpoint                         ← Points to latest checkpoint
  part-00000-abc123.snappy.parquet
  part-00001-def456.snappy.parquet
  ...

Each JSON commit file contains:

  • add actions: New Parquet files added
  • remove actions: Files logically deleted (still physically present until VACUUM)
  • metaData: Schema changes, table properties
  • protocol: Reader/writer version
  • commitInfo: Timestamp, operation, user, metrics

How ACID is ensured:

  • Atomicity: Each commit writes a single JSON file atomically using put-if-absent (filesystem rename)
  • Consistency: Schema enforcement rejects mismatched writes
  • Isolation: Snapshot isolation — readers see a consistent snapshot
  • Durability: Data persisted as Parquet files on durable cloud storage
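To make the commit-file contents concrete, here is a small sketch, assuming a simplified action layout (field names and values are illustrative, not the full protocol): each line of a real `0000...N.json` commit file is one JSON action, and replaying a commit means collecting its add and remove actions.

```python
import json

# Hypothetical, simplified commit file: in a real _delta_log, each line of
# a versioned .json file is one JSON action (add, remove, metaData, ...).
commit_lines = [
    '{"commitInfo": {"operation": "MERGE", "timestamp": 1736899200000}}',
    '{"add": {"path": "part-00002-xyz.snappy.parquet", "size": 1048576, "dataChange": true}}',
    '{"remove": {"path": "part-00000-abc123.snappy.parquet", "dataChange": true}}',
]

actions = [json.loads(line) for line in commit_lines]
added   = [a["add"]["path"]    for a in actions if "add"    in a]
removed = [a["remove"]["path"] for a in actions if "remove" in a]

print(added)    # files this commit makes visible
print(removed)  # files logically deleted (still on disk until VACUUM)
```

The key point: a reader that replays these actions in commit order always arrives at a consistent file list, which is what makes snapshot isolation possible.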

Q2: What are checkpoint files? Why are they critical?

Answer: Every 10 commits (default), Delta creates a Parquet checkpoint file that consolidates the full table state at that point.

Why needed:

  • Without checkpoints: Reading version 10,000 requires replaying 10,000 JSON files
  • With checkpoints: Read the latest checkpoint + only subsequent JSON files
  • _last_checkpoint file stores the latest checkpoint version

Configurable: delta.checkpointInterval (default: 10)
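The benefit of checkpoints can be shown with a toy log-replay model (pure Python, with commits reduced to lists of add/remove pairs; the real checkpoint is a Parquet file of consolidated actions):

```python
# Toy model of log replay: each commit is a list of (action, path) pairs.
commits = {
    0: [("add", "f0.parquet")],
    1: [("add", "f1.parquet")],
    2: [("remove", "f0.parquet"), ("add", "f2.parquet")],
}

def replay(live, actions):
    for action, path in actions:
        (live.add if action == "add" else live.discard)(path)
    return live

# Without a checkpoint: replay every commit from version 0.
live = set()
for v in sorted(commits):
    replay(live, commits[v])

# With a checkpoint at version 1: start from the consolidated state,
# then replay only the commits after it.
checkpoint_state = {"f0.parquet", "f1.parquet"}   # full state as of commit 1
live_fast = replay(set(checkpoint_state), commits[2])

print(live == live_fast)  # True: same snapshot, far fewer files to read
```

With thousands of commits, the second path reads one checkpoint plus a handful of JSON tails instead of the whole log.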

Q3: Explain optimistic concurrency control in Delta Lake. What happens when two writers conflict?

Answer:

Writer A reads version 5
Writer B reads version 5
Writer A commits version 6 (succeeds: first to commit)
Writer B attempts to commit version 6:
  Checks: version 6 already exists!
  Reads version 6 to see what Writer A changed
  Checks for a LOGICAL conflict:
    Writers touched DIFFERENT files → no conflict → retries as version 7 ✓
    Writers touched SAME files/predicates → conflict → ConcurrentModificationException ✗

Retries are automatic, up to a configurable limit (spark.databricks.delta.maxCommitAttempts).

Isolation levels:

| Level | Default? | Behavior |
|---|---|---|
| WriteSerializable | Yes | Writers see consistent snapshots; non-conflicting concurrent writes succeed |
| Serializable | No | Strictest; even reads during writes can cause conflicts |
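The commit protocol can be sketched in a few lines of plain Python. This is a minimal model, not Delta's implementation: winning a version is modeled as put-if-absent into a dict, and conflict detection is reduced to a file-set intersection (real Delta inspects the actions in the winning commit against the loser's read set).

```python
log = {}  # version -> set of files touched by that commit

def try_commit(read_version, files_touched, max_retries=3):
    version = read_version + 1
    for _ in range(max_retries):
        if version not in log:            # put-if-absent succeeds: we win
            log[version] = files_touched
            return version
        if log[version] & files_touched:  # overlapping files: logical conflict
            raise RuntimeError("ConcurrentModificationException")
        version += 1                      # disjoint files: retry at next version
    raise RuntimeError("too many retries")

v_a = try_commit(5, {"part-1.parquet"})   # Writer A lands version 6
v_b = try_commit(5, {"part-2.parquet"})   # Writer B loses 6, retries as 7
print(v_a, v_b)  # 6 7
```

A third writer touching `part-1.parquet` from the same read version would raise, mirroring the ConcurrentModificationException case above.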

Q4: Explain file-level statistics and data skipping.

Answer: Delta stores min/max statistics for the first 32 columns (by default) in the transaction log.

How data skipping works:

sql
SELECT * FROM orders WHERE order_date = '2025-01-15'

  1. Delta reads file statistics from the transaction log
  2. For each file, checks: min(order_date) <= '2025-01-15' AND max(order_date) >= '2025-01-15'
  3. Files where the condition is impossible are skipped entirely
  4. Only matching files are read

Configuration: delta.dataSkippingNumIndexedCols (default 32)
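The planner's check in step 2 is simple enough to sketch directly. A minimal model, assuming per-file min/max stats like those Delta stores in the log (paths and dates are made up):

```python
# Per-file min/max statistics, as recorded in the transaction log.
file_stats = [
    {"path": "part-0.parquet", "min": "2025-01-01", "max": "2025-01-10"},
    {"path": "part-1.parquet", "min": "2025-01-10", "max": "2025-01-20"},
    {"path": "part-2.parquet", "min": "2025-02-01", "max": "2025-02-28"},
]

def files_to_read(stats, value):
    # Keep a file only if the predicate value could fall inside [min, max];
    # every other file is skipped without being opened.
    return [s["path"] for s in stats if s["min"] <= value <= s["max"]]

print(files_to_read(file_stats, "2025-01-15"))  # ['part-1.parquet']
```

For the equality predicate above, two of three files are pruned before any Parquet data is touched, which is the whole value of data skipping.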

Q5: What is the difference between OPTIMIZE and VACUUM?

Answer:

| Aspect | OPTIMIZE | VACUUM |
|---|---|---|
| Purpose | Compacts small files into larger ones | Physically deletes old unreferenced files |
| Performance impact | Yes — improves reads | No direct performance gain — frees storage |
| Data safety | Non-destructive (old files still exist) | DESTRUCTIVE — files removed permanently |
| Default retention | N/A | 7 days (delta.deletedFileRetentionDuration) |
| Time travel impact | None | Breaks time travel before vacuum threshold |

sql
-- Compact files (target ~1 GB per file)
OPTIMIZE orders;

-- Compact + co-locate by column
OPTIMIZE orders ZORDER BY (customer_id);

-- Clean up old files (7-day retention)
VACUUM orders;

-- Clean up with custom retention
VACUUM orders RETAIN 168 HOURS;  -- 7 days

Q6: Can you run VACUUM with retention of 0 hours? What are the risks?

Answer: Yes, but you must disable the safety check:

sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM my_table RETAIN 0 HOURS;

Risks:

  1. Breaks ALL time travel — can't query any previous version
  2. Concurrent readers may fail — readers that started before VACUUM may reference deleted files
  3. No recovery — deleted data is gone permanently

Rule: NEVER do this in production. Use default 7-day retention.

Q7: What happens if a write fails midway in Delta Lake?

Answer:

  • Data files (Parquet) may be partially written to storage
  • But the transaction log entry is never committed (atomic operation)
  • On next read, Delta only considers files referenced in committed log entries
  • The orphaned Parquet files are cleaned up by VACUUM
  • This is the key benefit of ACID — partial writes don't corrupt the table
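The "orphaned files are invisible" point can be shown with a tiny set model (file names are made up; the real reader builds the visible set by replaying the log, as in Q1):

```python
# Readers derive the visible file list from COMMITTED log entries only,
# so Parquet files from a failed write are never served.
files_on_disk = {"part-0.parquet", "part-1.parquet", "part-2-partial.parquet"}
committed_adds = {"part-0.parquet", "part-1.parquet"}  # from _delta_log

visible = files_on_disk & committed_adds
orphans = files_on_disk - committed_adds   # VACUUM eventually removes these

print(sorted(visible))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(orphans))  # ['part-2-partial.parquet']
```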

SECTION 2: MERGE INTO — ALL SCENARIOS

Q8: Explain the MERGE INTO syntax. Write a basic upsert.

Answer:

sql
MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id                         -- Match condition

WHEN MATCHED AND s.op = 'DELETE' THEN
    DELETE                              -- Delete matching rows

WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET *                        -- Update with all source columns

WHEN NOT MATCHED THEN
    INSERT *                            -- Insert all source columns

WHEN NOT MATCHED BY SOURCE THEN
    DELETE                              -- Delete target rows not in source (Databricks extension)

PySpark equivalent:

python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")

(target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll(condition="s.updated_at > t.updated_at")  # UPDATE SET *
    .whenNotMatchedInsertAll()                                      # INSERT *
    .execute())

# For explicit columns, use .whenMatchedUpdate(
#     condition="s.updated_at > t.updated_at",
#     set={"name": "s.name", "email": "s.email"})

Q9: How do you handle duplicate keys in the source during MERGE?

Answer: If source has duplicate keys matching the same target row, MERGE throws an error: "Cannot perform Merge as multiple source rows matched...".

Solution: Deduplicate source first:

python
from pyspark.sql.functions import row_number, col
from pyspark.sql import Window

w = Window.partitionBy("id").orderBy(col("updated_at").desc())
deduped_source = source_df \
    .withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .drop("rn")

# Now merge with deduped source
target.alias("t").merge(deduped_source.alias("s"), "t.id = s.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

Q10: Design a MERGE for an e-commerce order system (create, update, cancel).

Answer:

sql
MERGE INTO orders_fact t
USING orders_staging s
ON t.order_id = s.order_id

-- Cancel: mark as deleted
WHEN MATCHED AND s.status = 'CANCELLED' THEN DELETE

-- Update: only if newer
WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET
        t.status = s.status,
        t.amount = s.amount,
        t.shipping_address = s.shipping_address,
        t.updated_at = s.updated_at

-- New order
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, status, amount, shipping_address, created_at, updated_at)
    VALUES (s.order_id, s.customer_id, s.status, s.amount, s.shipping_address, s.created_at, s.updated_at)

Q11: What is the performance concern with MERGE? How do you optimize it?

Answer: MERGE scans the entire target table to find matches. For large tables, this is the bottleneck.

Optimization techniques:

1. Partition pruning — add partition columns to ON clause:

sql
MERGE INTO target t
USING source s
ON t.id = s.id AND t.date = s.date  -- date is partition column → prunes partitions

2. Z-ORDER on merge key:

sql
OPTIMIZE target ZORDER BY (id);  -- Data skipping on the merge key

3. Reduce source data before merge:

python
# Reduce the source before merging: don't merge 100M rows if only 1M changed
from pyspark.sql.functions import col

new_records = source_df.join(target_df, "id", "left_anti")   # not yet in target
changed_records = (
    source_df.alias("s")
    .join(target_df.alias("t"), "id", "inner")
    .filter(col("s.hash") != col("t.hash"))                  # content changed
    .select("s.*")                                           # keep only source columns
)
reduced_source = new_records.unionByName(changed_records)

4. Broadcast small source:

python
from pyspark.sql.functions import broadcast

target.alias("t").merge(
    broadcast(source_df).alias("s"),  # broadcast hint if the source is small
    "t.id = s.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

5. Compact target first:

sql
OPTIMIZE target;  -- Fewer, larger files = fewer tasks

6. Use Photon runtime — significantly faster MERGE operations.

Q12: MERGE with schema evolution — how does it work?

Answer:

python
# Enable automatic schema merge
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Now source with new columns will automatically add them to target
target.alias("t").merge(
    source_df.alias("s"),  # source has columns not in target
    "t.id = s.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
# New columns from source are added to target table

SECTION 3: TIME TRAVEL & RECOVERY

Q13: How does time travel work? Show all query methods.

Answer:

sql
-- By version number
SELECT * FROM orders VERSION AS OF 5;
SELECT * FROM orders@v5;

-- By timestamp
SELECT * FROM orders TIMESTAMP AS OF '2025-01-15 10:30:00';

-- Check history first
DESCRIBE HISTORY orders;

-- Restore entire table
RESTORE TABLE orders TO VERSION AS OF 5;
RESTORE TABLE orders TO TIMESTAMP AS OF '2025-01-15';

PySpark:

python
# By version
df = spark.read.format("delta").option("versionAsOf", 5).load(path)

# By timestamp
df = spark.read.format("delta").option("timestampAsOf", "2025-01-15").load(path)

Q14: Scenario — You accidentally deleted critical data 3 days ago. How do you recover?

Answer:

sql
-- Step 1: Check history to find the version before the delete
DESCRIBE HISTORY customers;
-- Let's say the DELETE was version 42, so we want version 41

-- Step 2: Option A — Restore entire table (fastest, simplest)
RESTORE TABLE customers TO VERSION AS OF 41;

-- Step 2: Option B — Selectively restore only deleted rows
INSERT INTO customers
SELECT * FROM customers VERSION AS OF 41
WHERE customer_id NOT IN (SELECT customer_id FROM customers);

-- Step 2: Option C — MERGE for surgical recovery
MERGE INTO customers AS target
USING customers VERSION AS OF 41 AS source
ON target.customer_id = source.customer_id
WHEN NOT MATCHED THEN INSERT *;

Time travel limits:

  • Default data retention: 7 days (delta.deletedFileRetentionDuration)
  • Default log retention: 30 days (delta.logRetentionDuration)
  • VACUUM removes old files → breaks time travel for those versions

SECTION 4: Z-ORDERING, LIQUID CLUSTERING, OPTIMIZATION

Q15: What is Z-Ordering? How does it differ from partitioning?

Answer:

| Aspect | Partitioning | Z-Ordering |
|---|---|---|
| How it works | Creates separate directories per value | Co-locates related data within files using a space-filling Z-curve |
| Cardinality | Low cardinality (< 1000 values) | High cardinality (user_id, order_id) |
| Query pattern | Almost always filter on partition column | Frequently filter on the Z-ordered column |
| Overhead | Over-partitioning creates small files | Requires running OPTIMIZE |
| Combination | Can be combined | Z-ORDER should NOT use partition columns |

sql
-- Partition by date (low cardinality), Z-order by customer_id (high cardinality)
CREATE TABLE orders (
    order_id LONG,
    customer_id LONG,
    order_date DATE,
    amount DECIMAL(10,2)
) PARTITIONED BY (order_date);

OPTIMIZE orders ZORDER BY (customer_id);

Z-ORDER best practices:

  • Max 4 columns (effectiveness decreases with more)
  • Choose columns in WHERE, JOIN, MERGE conditions
  • High cardinality columns benefit most
  • NOT idempotent — running again rewrites files
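The "space-filling Z-curve" mentioned above has a compact illustration: interleave the bits of two column values into one sort key, and rows that are close in EITHER column end up near each other on disk. This is a conceptual sketch of the curve itself, not Delta's actual clustering code:

```python
def z_value(x, y, bits=8):
    # Interleave the bits of x and y: even bit positions come from x,
    # odd positions from y, producing one sortable "Z-order" key.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

rows = [(3, 7), (3, 8), (200, 7), (200, 8)]
rows.sort(key=lambda r: z_value(*r))
print(rows)  # nearby (x, y) pairs land adjacent in the sorted order
```

Because both dimensions contribute to the key, a range filter on either column touches a small, contiguous slice of files, which is why Z-ordered min/max statistics skip so effectively.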

Q16: What is Liquid Clustering? How does it improve over Z-Ordering + Partitioning?

Answer: Liquid Clustering (GA in Databricks) is the next-generation data layout that replaces both partitioning and Z-Ordering.

| Aspect | Z-Ordering | Liquid Clustering |
|---|---|---|
| When applied | Manual OPTIMIZE ZORDER BY | Automatic on writes (or OPTIMIZE) |
| Incremental | No (rewrites ALL files) | Yes (only new/changed data) |
| Change keys | Must re-OPTIMIZE everything | Just ALTER — new writes use new keys |
| Partitioning | Separate concept | Replaces partitioning |
| Small files | Doesn't address | Handles automatically |

sql
-- Create with liquid clustering (replaces PARTITIONED BY + ZORDER)
CREATE TABLE orders (
    order_id LONG,
    customer_id LONG,
    order_date DATE,
    amount DECIMAL(10,2)
) CLUSTER BY (customer_id, order_date);

-- Change clustering keys without rewriting data
ALTER TABLE orders CLUSTER BY (order_date, region);

-- Trigger optimization
OPTIMIZE orders;

When to use what (2024+ recommendation):

  • New tables: Use Liquid Clustering
  • Existing tables with partitioning: Migrate to Liquid Clustering if possible
  • Legacy tables: Continue with partitioning + Z-Ordering

Q17: What are Deletion Vectors? How do they improve UPDATE/DELETE performance?

Answer: Instead of rewriting entire Parquet files for DELETE/UPDATE/MERGE, deletion vectors mark individual rows as deleted in a separate lightweight file.

Without deletion vectors:

  • DELETE 1 row from a 1 GB Parquet file → rewrite entire 1 GB file
  • MERGE affecting 100 files → rewrite all 100 files

With deletion vectors:

  • DELETE 1 row → write a tiny deletion vector file (~bytes)
  • Reads filter out deleted rows using the deletion vector
  • Physical rewrite is deferred to next OPTIMIZE
sql
-- Enable deletion vectors
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

Trade-off: Slightly slower reads (must apply deletion vectors) but much faster writes. OPTIMIZE reclaims space.
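The read path is easy to picture: rows stay in the Parquet file, and the reader masks out the positions listed in the deletion vector. A minimal model (Delta actually stores the positions in a compressed roaring bitmap, not a Python set):

```python
# Rows as stored in an immutable Parquet file.
rows = ["order-0", "order-1", "order-2", "order-3"]

# A DELETE writes only this: the positions of deleted rows.
deletion_vector = {1, 3}

# Readers apply the mask on the fly; no file rewrite happens.
visible = [row for pos, row in enumerate(rows) if pos not in deletion_vector]
print(visible)  # ['order-0', 'order-2']
```

A later OPTIMIZE rewrites the file without the masked rows and drops the vector, which is the deferred physical cleanup described above.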

Q18: What are the most important Delta table properties?

Answer:

sql
ALTER TABLE my_table SET TBLPROPERTIES (
    -- Write optimization
    'delta.autoOptimize.optimizeWrite' = 'true',         -- Coalesce small files on write
    'delta.autoOptimize.autoCompact' = 'true',            -- Auto compaction after writes

    -- Change Data Feed
    'delta.enableChangeDataFeed' = 'true',                -- Track row-level changes

    -- Retention
    'delta.logRetentionDuration' = 'interval 30 days',    -- How long to keep commit logs
    'delta.deletedFileRetentionDuration' = 'interval 7 days',  -- VACUUM threshold

    -- Column mapping (enables column rename/drop)
    'delta.columnMapping.mode' = 'name',

    -- Deletion vectors (faster deletes)
    'delta.enableDeletionVectors' = 'true',

    -- Checkpoint interval
    'delta.checkpointInterval' = '10'
);

Q19: What is the difference between managed and external tables in Databricks?

Answer:

| Aspect | Managed Table | External Table |
|---|---|---|
| Storage | Databricks-managed location | User-specified external location |
| DROP TABLE | Deletes both metadata AND data | Deletes only metadata — data survives |
| Use case | Default for most tables | Shared data, data that must survive table drops |
| Unity Catalog | Managed by metastore | Requires External Location grant |

sql
-- Managed (data stored in default warehouse location)
CREATE TABLE managed_orders (id LONG, amount DECIMAL);

-- External (data at your specified location)
CREATE TABLE external_orders (id LONG, amount DECIMAL)
LOCATION 's3://my-bucket/orders/';

Q20: What are clone operations? Explain deep clone vs shallow clone.

Answer:

| Aspect | Deep Clone | Shallow Clone |
|---|---|---|
| Copies data? | YES (full copy) | NO (references source files) |
| Independent? | Yes — fully independent | No — depends on source data files |
| Speed | Slow (copies all data) | Fast (copies only metadata) |
| Use case | Production copies, migration | Testing, experimentation, quick snapshots |
| VACUUM safe? | Yes | NO — vacuuming the source can break the clone |

sql
-- Deep clone (full copy)
CREATE TABLE orders_backup DEEP CLONE orders;

-- Shallow clone (metadata only)
CREATE TABLE orders_test SHALLOW CLONE orders;

-- Incremental deep clone (re-running copies only changes since the last clone)
CREATE OR REPLACE TABLE orders_backup DEEP CLONE orders;

Q21: Explain schema evolution in Delta Lake. What are the options?

Answer:

Add new columns during write:

python
# mergeSchema — adds new columns, preserves existing
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("my_table")

Replace entire schema:

python
# overwriteSchema — replaces schema completely
df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("my_table")

Column rename/drop (requires column mapping):

sql
-- Enable column mapping first
ALTER TABLE my_table SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name');

-- Now you can rename and drop columns
ALTER TABLE my_table RENAME COLUMN old_name TO new_name;
ALTER TABLE my_table DROP COLUMN unused_column;

SECTION 5: LAKEHOUSE ARCHITECTURE

Q22: What is a data lakehouse? How does it differ from data lake and data warehouse?

Answer:

| Aspect | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage | Cheap object storage | Proprietary | Cheap object storage |
| Format | Open (Parquet, ORC) | Proprietary | Open (Delta, Iceberg, Hudi) |
| ACID | No | Yes | Yes |
| Schema | Schema-on-read | Schema-on-write | Both |
| Performance | Slow for BI | Fast for BI | Fast (OPTIMIZE, caching, Photon) |
| ML support | Good | Poor | Excellent |
| Governance | Limited | Strong | Strong (Unity Catalog) |
| Cost | Low | High | Low-Medium |

Q23: What key technologies enable the lakehouse?

Answer:

  1. Delta Lake / Iceberg / Hudi: ACID transactions on data lakes
  2. Photon / vectorized engines: Warehouse-level query performance
  3. Unity Catalog: Unified governance across all data assets
  4. Serverless compute: On-demand, auto-scaling
  5. SQL endpoints / SQL Warehouses: Direct BI tool connectivity
  6. MLflow: Integrated ML lifecycle management

Q24: What problems does the lakehouse solve?

Answer:

  1. Eliminates the "two-tier" architecture (data lake + data warehouse)
    • No more ETL from lake to warehouse
    • Single copy of data serves all workloads
  2. Reduces data duplication and ETL complexity
  3. Single source of truth for BI and ML
  4. Open formats prevent vendor lock-in
  5. Cost-effective storage with warehouse-level performance
  6. Unified governance across all workloads

Q25: Scenario — Your company has a data lake on S3, a Snowflake warehouse, and a separate ML platform. The CEO wants to consolidate. How do you design the lakehouse migration?

Answer:

Phase 1: Foundation (Weeks 1-4)
  • Set up Databricks workspace with Unity Catalog
  • Configure cloud storage (S3/ADLS) as external locations
  • Establish IAM roles / service principals
  • Define medallion architecture standards

Phase 2: Data Migration (Weeks 5-12)
  • Bronze: Point Auto Loader at existing S3 raw data
  • Silver: Migrate Snowflake transformation logic to Databricks SQL/PySpark
  • Gold: Recreate Snowflake aggregation tables as Delta tables
  • Validate data parity between old and new

Phase 3: ML Integration (Weeks 10-14)
  • Migrate ML feature engineering to Silver/Gold layers
  • Register models in MLflow (Unity Catalog)
  • Make feature tables accessible to both BI and ML

Phase 4: BI Cutover (Weeks 13-16)
  • Connect BI tools to Databricks SQL Warehouse
  • Validate report parity
  • Decommission Snowflake

Key decisions:
  • Use managed tables for new data, external tables for migrated data
  • Liquid Clustering instead of partitioning for new tables
  • DLT for critical pipeline quality guarantees
  • Unity Catalog for governance from day 1

Q26: What is Delta Sharing? How does it work?

Answer: Delta Sharing is an open protocol for secure data sharing across organizations.

How it works:

  1. Provider shares a Delta table via a Delta Sharing Server
  2. Provider generates a sharing profile (JSON with endpoint + credentials)
  3. Recipient uses any client (Spark, pandas, Power BI) to read shared data
  4. Data is read directly from the provider's storage — no copying
  5. Recipient gets read-only access with the provider's access controls
python
# Recipient reads shared data
df = spark.read.format("deltaSharing").load("profile.json#share.schema.table")

Key benefits:

  • No data copying (cost-effective)
  • Open protocol (not Databricks-specific)
  • Audit logging on provider side
  • Fine-grained access control

Q27: Scenario — A Delta table has 10,000 small files (each <1 MB). Queries are slow. How do you fix this?

Answer:

sql
-- Step 1: Compact files immediately
OPTIMIZE slow_table;                                    -- Merge into ~1 GB files
OPTIMIZE slow_table ZORDER BY (frequently_filtered_col); -- Plus data skipping

-- Step 2: Clean up old files
VACUUM slow_table RETAIN 168 HOURS;

-- Step 3: Prevent future small files
ALTER TABLE slow_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',     -- Coalesce on write
    'delta.autoOptimize.autoCompact' = 'true'         -- Auto compact after writes
);

-- Step 4: For streaming sources, increase the trigger interval
--   (PySpark: .trigger(processingTime="5 minutes") instead of "10 seconds")

-- Step 5: For new tables, use Liquid Clustering
-- CREATE TABLE ... CLUSTER BY (col)  -- Handles compaction automatically

Q28: What is the difference between DataFrame.write.mode("overwrite") and REPLACE TABLE in Delta?

Answer:

| Aspect | mode("overwrite") | REPLACE TABLE |
|---|---|---|
| Scope | Overwrites data (optionally per partition) | Replaces entire table definition |
| Schema | Keeps existing schema (unless overwriteSchema=true) | Can change schema |
| History | Maintains history (time travel works) | Maintains history |
| Partition overwrite | Supports replaceWhere for surgical overwrites | N/A |

python
# Overwrite specific partitions only
df.write.format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "date = '2025-01-15'") \
    .saveAsTable("orders")

# This is IDEMPOTENT — safe for retry/re-run

Q29: How do you implement write-audit-publish (WAP) pattern with Delta Lake?

Answer: WAP ensures data quality before making data visible to consumers.

python
# Step 1: Write to a staging area (or use table clones)
staging = "staging_orders"
df.write.format("delta").mode("overwrite").saveAsTable(staging)

# Step 2: Audit — run quality checks
quality_check = spark.sql(f"""
    SELECT
        COUNT(*) as total_rows,
        SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) as null_ids,
        SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) as negative_amounts
    FROM {staging}
""").collect()[0]

assert quality_check["null_ids"] == 0, "Null order IDs found!"
assert quality_check["negative_amounts"] == 0, "Negative amounts found!"

# Step 3: Publish — atomically swap
spark.sql(f"""
    INSERT OVERWRITE TABLE production_orders
    SELECT * FROM {staging}
""")

Alternative with Delta's RESTORE: if the quality check fails after writing to production, roll back with RESTORE TABLE production_orders TO VERSION AS OF <version-before-the-write>.