Azure Databricks — Question Bank (L1/L2/L3)
TOPIC 1: DELTA LAKE (Internals, Transaction Log, ACID, MERGE, OPTIMIZE, VACUUM, Z-ORDER, Liquid Clustering)
L1 — Direct / Simple Questions
What is Delta Lake and why was it created?
Open-source storage layer that brings ACID transactions, schema enforcement, and time travel to data lakes. Created to solve the reliability problems of raw data lakes (no transactions, no schema control, corrupt reads from concurrent writes).
What file format does Delta Lake use under the hood?
Apache Parquet files plus a JSON-based transaction log (_delta_log). Data is stored as Parquet; the log tracks which Parquet files are valid for each table version.
What is the _delta_log directory and what does it contain?
A directory inside every Delta table that stores the transaction log: a sequence of JSON files (one per commit) recording every add/remove of Parquet files. It is the single source of truth for the table's state.
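To make the add/remove mechanics concrete, here is a minimal pure-Python sketch of how a reader could reconstruct the live file set by replaying log actions in commit order. The file names and action shapes are illustrative only; real _delta_log commits also carry protocol, metaData, and commitInfo actions and many more fields per action.

```python
# Illustrative commit entries, shaped loosely like the JSON actions in
# _delta_log/00000000000000000000.json, ...1.json, etc. (simplified).
commits = [
    [{"add": {"path": "part-0000.parquet"}}],
    [{"add": {"path": "part-0001.parquet"}}],
    [{"remove": {"path": "part-0000.parquet"}},   # e.g. a file rewritten by MERGE
     {"add": {"path": "part-0002.parquet"}}],
]

def replay(commits):
    """Replay add/remove actions in order to get the live file set."""
    live = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

print(sorted(replay(commits)))
# ['part-0001.parquet', 'part-0002.parquet']
```

Note how the removed file is dropped from the live set but its Parquet data is not deleted; that is why time travel works and why VACUUM is needed to physically clean up.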
What are the four ACID properties and how does Delta Lake guarantee them?
Atomicity (commits are all-or-nothing via the transaction log), Consistency (schema enforcement rejects bad writes), Isolation (optimistic concurrency control with conflict detection), Durability (data stored on cloud storage like ADLS Gen2).
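The Isolation point can be sketched in a few lines of pure Python: each writer reads the table at some version and, at commit time, fails if another writer committed in between. This toy model is an assumption for illustration; real Delta is finer-grained and only rejects commits that actually conflict with the files the transaction read or wrote.

```python
class ConcurrentModificationError(Exception):
    pass

class Table:
    """Toy table whose state is just a monotonically increasing version."""
    def __init__(self):
        self.version = 0

    def commit(self, read_version):
        # Optimistic concurrency: succeed only if no other writer has
        # committed since this transaction read the table.
        if read_version != self.version:
            raise ConcurrentModificationError(
                f"table at v{self.version}, txn read v{read_version}")
        self.version += 1
        return self.version

t = Table()
v = t.version          # txn A reads at v0
t.commit(t.version)    # txn B commits first -> table now at v1
try:
    t.commit(v)        # txn A's commit is rejected
except ConcurrentModificationError as e:
    print("conflict:", e)
```

The losing transaction can then re-read the newer table state and retry, which is exactly the retry loop Delta writers perform internally.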
What is a checkpoint file in the Delta transaction log?
A Parquet file created every 10 commits by default that snapshots the entire table state. It avoids reading all previous JSON commits from scratch — readers start from the latest checkpoint and replay only newer commits.
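A minimal sketch of that read path: take the checkpoint's snapshot of live files, then replay only commits with a higher version. The data shapes here are simplified assumptions; real checkpoints are Parquet files with one row per action, not a ready-made set.

```python
# Checkpoint at version 10: a snapshot of the live file set.
checkpoint = {"version": 10,
              "live_files": {"part-0007.parquet", "part-0009.parquet"}}

# JSON commits newer than the checkpoint, keyed by version.
newer_commits = {
    11: [{"remove": {"path": "part-0007.parquet"}},
         {"add": {"path": "part-0010.parquet"}}],
    12: [{"add": {"path": "part-0011.parquet"}}],
}

def snapshot(checkpoint, commits):
    """Reconstruct table state: checkpoint plus replay of newer commits only."""
    live = set(checkpoint["live_files"])
    for version in sorted(v for v in commits if v > checkpoint["version"]):
        for action in commits[version]:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

print(sorted(snapshot(checkpoint, newer_commits)))
# ['part-0009.parquet', 'part-0010.parquet', 'part-0011.parquet']
```

Whatever happened in commits 0 through 10 never needs to be re-read, which keeps snapshot construction cheap even for tables with millions of commits.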
What is schema enforcement in Delta Lake?
Delta Lake rejects writes that don't match the table's schema (wrong column names, types, or missing required columns). It prevents silent data corruption by failing the write immediately.
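The enforcement idea can be sketched as a simple validation pass over an incoming row: reject on any column-name or type mismatch instead of silently writing bad data. This is a toy model under stated assumptions; Delta's real check also covers nullability, nested fields, and case sensitivity, and is driven by the schema stored in the transaction log.

```python
# Hypothetical table schema: column name -> expected Python type.
table_schema = {"id": int, "name": str, "amount": float}

def enforce_schema(schema, row):
    """Fail fast if the row's columns or value types don't match the schema."""
    if set(row) != set(schema):
        raise ValueError(f"column mismatch: {sorted(set(row) ^ set(schema))}")
    for col, typ in schema.items():
        if not isinstance(row[col], typ):
            raise ValueError(f"type mismatch on '{col}': expected "
                             f"{typ.__name__}, got {type(row[col]).__name__}")

enforce_schema(table_schema, {"id": 1, "name": "a", "amount": 9.5})  # passes
try:
    # Misspelled column name -> the whole write is rejected.
    enforce_schema(table_schema, {"id": 1, "name": "a", "amout": 9.5})
except ValueError as e:
    print("write rejected:", e)
```

Failing the whole write (rather than dropping or coercing bad columns) is the design choice that keeps downstream readers from ever seeing a half-corrupted table.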