Databricks + PySpark Interview Preparation — Master Index & Mind Map
💡 Interview Tip
Covers: Beginner to Architect level — everything you need for any data engineering interview
Last Updated: 2026-02-27
Total Questions: 200+ across all files
MIND MAP — Topics & Coverage
DATABRICKS INTERVIEW
1. SPARK ARCHITECTURE & INTERNALS (File: 01)
Application Lifecycle (Driver, Executors, Cluster Manager)
DAG Scheduler vs Task Scheduler
Narrow vs Wide Dependencies
Catalyst Optimizer (4 Phases)
Tungsten & Whole-Stage CodeGen
Adaptive Query Execution (AQE)
Memory Management (Unified Model)
Execution vs Storage Memory
On-Heap vs Off-Heap
OOM Debugging (Driver vs Executor)
Shuffle Deep Dive
Shuffle Write (Map Side)
Shuffle Read (Reduce Side)
External Shuffle Service
FetchFailedException
Serialization (Java vs Kryo vs Tungsten vs Arrow)
Fault Tolerance & Speculation
2. PYSPARK ADVANCED OPERATIONS (File: 02)
DataFrame API Deep Dive
select vs withColumn vs selectExpr
withColumn Anti-Pattern
Lazy Evaluation Benefits
Joins (5 Physical Strategies)
Broadcast Hash Join
Sort-Merge Join
Shuffle Hash Join
BroadcastNestedLoop Join
Cartesian Product
Window Functions
row_number vs rank vs dense_rank
Running Totals, Moving Averages
rowsBetween vs rangeBetween
Data Skew Handling
Salting Technique
AQE Skew Join
Isolate-and-Union
Two-Phase Aggregation
UDFs & Performance
Row-at-a-Time UDF (Slow)
Pandas UDF / Vectorized UDF
mapInPandas
Partitioning
repartition vs coalesce
Hash vs Range Partitioning
Bucketing
Caching & Checkpointing
Structured Streaming
Micro-Batch vs Continuous
Output Modes (Append/Complete/Update)
Watermarking
Stream-Stream Joins
foreachBatch Pattern
Exactly-Once Semantics
State Store Management
Coding Challenges (15+ Problems)
3. DELTA LAKE & LAKEHOUSE (File: 03)
Delta Lake Internals
Transaction Log (_delta_log)
Checkpoint Files
Optimistic Concurrency Control
File-Level Statistics & Data Skipping
Isolation Levels
MERGE INTO — All Scenarios
Basic MERGE
Handling Duplicate Keys
MERGE Performance Optimization
Schema Evolution with MERGE
Time Travel & Recovery
OPTIMIZE, VACUUM, ZORDER
Z-Ordering vs Liquid Clustering
Deletion Vectors
Schema Evolution
Clone Operations (Deep vs Shallow)
Table Properties (Important Settings)
Managed vs External Tables
Lakehouse Architecture
Data Lake vs Warehouse vs Lakehouse
Key Enabling Technologies
4. ETL SCENARIOS & PIPELINE DESIGN (File: 04)
SCD Implementations
SCD Type 1 (Overwrite)
SCD Type 2 (History — Full Code)
SCD Type 3 (Previous Value)
Merge Key Trick Explained
Change Data Capture (CDC)
Debezium CDC Pipeline Design
Delta Lake Change Data Feed (CDF)
CDC Event Types
Medallion Architecture (Bronze/Silver/Gold)
Design Decisions per Layer
Data Quality Between Layers
When to Deviate
Late-Arriving Data Handling
Auto Loader
Directory Listing vs File Notification
Schema Evolution Modes
Auto Loader vs COPY INTO
Delta Live Tables (DLT)
Expectations (3 Levels)
SCD Type 2 with apply_changes
Materialized View vs Streaming Table
COMPLETE vs TRIGGERED Mode
Scenario-Based Pipeline Design (10+ Scenarios)
Clickstream 10B events/day
8-hour Pipeline Optimization
Data Mesh on Databricks
Oracle CDC to Delta Lake
Batch + Streaming Unified
Data Quality Framework
500K Small Files Fix
Streaming Pipeline Falling Behind
2-Year Backfill
Multi-Tenant Architecture
3-Hour MERGE Optimization
Real-Time Fraud Detection
Pipeline Health Monitoring
5. PERFORMANCE TUNING & PRODUCTION (File: 05)
Reading Spark Execution Plans
Spark UI Debugging Methodology
Key Metrics to Monitor
Partition Tuning
spark.sql.shuffle.partitions
Optimal Partition Size
Small File Problem
Join Optimization
All Join Strategies Compared
Broadcast Threshold Tuning
Dynamic Partition Pruning
Spill Debugging
Predicate Pushdown Verification
File Format Selection
Unity Catalog
Three-Level Namespace
Row-Level & Column-Level Security
Data Lineage
Storage Credentials & External Locations
Delta Sharing
Hive Metastore Migration
Photon Engine
When It Helps
When It Doesn't
Databricks Workflows & Orchestration
Workflows vs Airflow
Job Cluster vs All-Purpose
Task Parameter Passing
Databricks Asset Bundles (CI/CD)
Cost Management Strategies
CI/CD for Databricks Pipelines
Data Governance at Scale
System Design Questions (Leadership Level)
6. DELTA LAKE ADVANCED MASTERCLASS (File: DB_06)
OPTIMIZE Deep Dive
Internal Mechanics (bin-packing vs Z-ORDER)
When OPTIMIZE Hurts (5 anti-patterns)
Cost Analysis
Liquid Clustering vs Z-ORDER vs Partitioning
Complete Comparison Table
Incremental vs Full Rewrite (key insight)
Hilbert Curve vs Z-Order Curve
Migration Path
Deletion Vectors Internals
RoaringBitmap Mechanics
Read/Write Trade-offs
DV + MERGE Interaction
Change Data Feed (CDF/CDC)
4 Change Types
Streaming CDF Patterns
5 Gotchas (not retroactive, INSERT OVERWRITE, etc.)
UniForm (Universal Format)
Delta + Iceberg + Hudi Metadata
Feature Asymmetry Table
One-Way Sync Limitation
Predictive Optimization
3 Levels: Optimized Writes vs Auto Compact vs PO
Unity Catalog Requirement
VACUUM Retention Gotcha
Small File Problem (Complete Playbook)
6 Root Causes
7 Solutions
VACUUM Risks & Production Incidents
4 Real Incident Scenarios
Production Best Practices Checklist
Delta vs Iceberg vs Hudi
Feature Comparison Matrix
Decision Framework
Hidden Partitioning Deep Dive
Rapid-Fire Interview Questions (10+)
Protocol Versions (irreversible upgrades)
Schema Enforcement vs Evolution
WAP Pattern, Row Tracking, Type Widening, VARIANT
Gotcha Questions with Traps
TOPIC FREQUENCY IN REAL INTERVIEWS
🔴 Very High: Delta MERGE/SCD2, Performance Tuning, Medallion Architecture
🟠 High: Data Skew, Streaming, Unity Catalog, Auto Loader, AQE, OPTIMIZE/ZORDER
🟡 Medium-High: DLT, Window Functions, CI/CD, Photon
🟢 Medium: Liquid Clustering, Deletion Vectors, Delta Sharing
QUESTION TRACKER (Prevents Duplicates)
Already Covered Topics (DO NOT RE-ADD):
- Spark application lifecycle
- DAG Scheduler vs Task Scheduler
- Narrow vs Wide dependencies
- Catalyst optimizer (4 phases)
- AQE (3 optimizations)
- Whole-stage codegen & Tungsten
- Unified memory management
- OOM debugging (driver vs executor)
- Shuffle process (write/read)
- External shuffle service
- FetchFailedException
- Serialization options (Java/Kryo/Arrow/Tungsten)
- Accumulators & Broadcast variables
- Speculative execution
- Block Manager
- DataFrame select/withColumn/selectExpr
- withColumn anti-pattern
- All 5 join strategies
- Data skew handling (4 techniques)
- Window functions (row_number/rank/dense_rank)
- Running total, moving average, pct_of_total
- rowsBetween vs rangeBetween
- repartition vs coalesce
- Hash vs Range partitioning
- Bucketing
- cache/persist/checkpoint
- UDF types & performance
- Pandas UDF / mapInPandas
- Structured Streaming model
- Output modes (append/complete/update)
- Watermarking
- Stream-stream joins
- foreachBatch pattern
- Exactly-once semantics
- State store management
- Delta transaction log
- Checkpoint files
- Optimistic concurrency control
- File-level statistics
- MERGE INTO (all scenarios)
- Time travel
- OPTIMIZE / VACUUM / ZORDER
- Z-Ordering vs Liquid Clustering
- Deletion vectors
- Schema evolution
- Deep clone vs Shallow clone
- Managed vs External tables
- Delta table properties
- SCD Type 1, 2, 3 (full code)
- Merge key trick
- CDC patterns (Debezium, CDF)
- Medallion architecture
- Auto Loader (modes, schema evolution)
- DLT (expectations, apply_changes)
- 12+ scenario-based pipeline designs
- Spark UI debugging
- Partition tuning
- Small file problem
- Predicate pushdown
- Dynamic partition pruning
- Spill debugging
- Unity Catalog (all aspects)
- Photon engine
- Databricks Workflows
- Asset Bundles / CI/CD
- Cost management
- Data governance
- System design (leadership)
- Py4J gateway architecture
- Cost-based optimization (CBO)
- Coding challenges (15+ problems)
HOW TO USE THIS PREPARATION
If you have 1 week:
- Day 1-2: File 01 (Architecture) + File 03 (Delta Lake) — understand the foundation
- Day 3-4: File 04 (ETL Scenarios) — practice SCD Type 2 code, CDC pipelines
- Day 5-6: File 02 (PySpark) — coding challenges, streaming, joins
- Day 7: File 05 (Production) — Unity Catalog, tuning, system design
If you have 3 days:
- Day 1: File 03 + File 04 (Delta Lake + ETL — most asked topics)
- Day 2: File 01 + File 02 (Architecture + PySpark coding)
- Day 3: File 05 (Production + system design)
If you have 1 day:
Focus on: SCD Type 2 code, MERGE scenarios, Medallion architecture, Data skew handling, AQE, Unity Catalog, and the scenario-based questions in File 04.
KEY INTERVIEW PATTERNS FOR 10+ YEARS EXPERIENCE
- They won't ask "What is Spark?" — They'll ask "Walk me through how you'd debug a 4-hour job that's failing on a 10 TB dataset"
- Scenario > Definition — "Design a CDC pipeline from Oracle to Delta Lake" > "What is CDC?"
- Trade-offs > Single Answer — "When would you NOT use broadcast join?" > "What is a broadcast join?"
- Architecture > Code — "How would you design a multi-tenant data platform?" > "Write a SELECT query"
- Depth > Breadth — They'll drill into YOUR answers. If you mention Z-Ordering, expect "How does it work internally with space-filling curves?"