Databricks + PySpark Interview Preparation — Master Index & Mind Map

Databricks · Section 1 of 17
MIND MAP — Topics & Coverage

🗂️ DATABRICKS INTERVIEW

1. SPARK ARCHITECTURE & INTERNALS (File: 01)
Application Lifecycle (Driver, Executors, Cluster Manager)
DAG Scheduler vs Task Scheduler
Narrow vs Wide Dependencies
Catalyst Optimizer (4 Phases)
Tungsten & Whole-Stage CodeGen
Adaptive Query Execution (AQE)
Memory Management (Unified Model)
Execution vs Storage Memory
On-Heap vs Off-Heap
OOM Debugging (Driver vs Executor)
Shuffle Deep Dive
Shuffle Write (Map Side)
Shuffle Read (Reduce Side)
External Shuffle Service
FetchFailedException
Serialization (Java vs Kryo vs Tungsten vs Arrow)
Fault Tolerance & Speculation
2. PYSPARK ADVANCED OPERATIONS (File: 02)
DataFrame API Deep Dive
select vs withColumn vs selectExpr
withColumn Anti-Pattern
Lazy Evaluation Benefits
Joins (5 Physical Strategies)
Broadcast Hash Join
Sort-Merge Join
Shuffle Hash Join
BroadcastNestedLoop Join
Cartesian Product
Window Functions
row_number vs rank vs dense_rank
Running Totals, Moving Averages
rowsBetween vs rangeBetween
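The rowsBetween vs rangeBetween distinction above is a frequent follow-up. A minimal pure-Python sketch (no Spark session needed) of the semantic difference for a running sum: a ROWS frame ends at the current physical row, while a RANGE frame pulls in every row whose ordering value ties with the current one.

```python
def running_sum_rows(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    total, out = 0, []
    for _, amount in rows:
        total += amount
        out.append(total)
    return out

def running_sum_range(rows):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    out = []
    for order_key, _ in rows:
        # Every row whose ordering value <= the current one is in the
        # frame, so ties on the ordering column enter together.
        out.append(sum(a for k, a in rows if k <= order_key))
    return out

# Two rows tie on day 2, so the two frames diverge there.
data = [(1, 10), (2, 20), (2, 30), (3, 40)]  # (day, amount), sorted by day
print(running_sum_rows(data))   # [10, 30, 60, 100]
print(running_sum_range(data))  # [10, 60, 60, 100]
```

In PySpark the same contrast appears as `Window.orderBy("day").rowsBetween(...)` vs `.rangeBetween(...)`; the tie behavior is why range frames are the safer default for date-based running totals.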
Data Skew Handling
Salting Technique
AQE Skew Join
Isolate-and-Union
Two-Phase Aggregation
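Of the skew techniques listed, salting is the one interviewers most often ask candidates to sketch. A minimal pure-Python illustration (names like `N_SALTS` are made up for the example): the hot key is rewritten as `key#salt` with a random salt in `[0, N)`, spreading rows that previously hashed to one shuffle partition across N, while the small side of the join is exploded with all N salts so every salted variant still finds its match.

```python
import random

N_SALTS = 4  # illustrative; tune to the observed skew factor

def salt_fact_key(key):
    """Large (skewed) side: attach one random salt per row."""
    return f"{key}#{random.randrange(N_SALTS)}"

def explode_dim_key(key):
    """Small side: emit one copy per salt so the join still matches."""
    return [f"{key}#{s}" for s in range(N_SALTS)]

random.seed(0)
hot_rows = [salt_fact_key("user_42") for _ in range(1000)]
buckets = {k: hot_rows.count(k) for k in set(hot_rows)}
print(buckets)  # roughly 250 rows per salted key instead of 1000 on one
assert set(buckets) <= set(explode_dim_key("user_42"))
```

In real PySpark this becomes a `concat(col("key"), lit("#"), (rand() * N_SALTS).cast("int"))` on the fact side and an `explode` of a salt array on the dimension side, joined on the salted key.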
UDFs & Performance
Row-at-a-Time UDF (Slow)
Pandas UDF / Vectorized UDF
mapInPandas
Partitioning
repartition vs coalesce
Hash vs Range Partitioning
Bucketing
Caching & Checkpointing
Structured Streaming
Micro-Batch vs Continuous
Output Modes (Append/Complete/Update)
Watermarking
Stream-Stream Joins
foreachBatch Pattern
Exactly-Once Semantics
State Store Management
Coding Challenges (15+ Problems)
3. DELTA LAKE & LAKEHOUSE (File: 03)
Delta Lake Internals
Transaction Log (_delta_log)
Checkpoint Files
Optimistic Concurrency Control
File-Level Statistics & Data Skipping
Isolation Levels
MERGE INTO — All Scenarios
Basic MERGE
Handling Duplicate Keys
MERGE Performance Optimization
Schema Evolution with MERGE
Time Travel & Recovery
OPTIMIZE, VACUUM, ZORDER
Z-Ordering vs Liquid Clustering
Deletion Vectors
Schema Evolution
Clone Operations (Deep vs Shallow)
Table Properties (Important Settings)
Managed vs External Tables
Lakehouse Architecture
Data Lake vs Warehouse vs Lakehouse
Key Enabling Technologies
4. ETL SCENARIOS & PIPELINE DESIGN (File: 04)
SCD Implementations
SCD Type 1 (Overwrite)
SCD Type 2 (History — Full Code)
SCD Type 3 (Previous Value)
Merge Key Trick Explained
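The "merge key trick" above deserves a concrete sketch, since it is the part of SCD Type 2 candidates most often fumble. MERGE applies each source row at most once, but a changed dimension row needs two actions: expire the current version and insert the new one. The staged source therefore carries each changed key twice — once with the real merge key (matches, so it expires the old row) and once with a NULL merge key (never matches, so it falls into WHEN NOT MATCHED and inserts). A pure-Python sketch with illustrative names:

```python
def stage_for_scd2(updates, current_keys):
    """Build the doubled-up source for an SCD Type 2 MERGE."""
    staged = []
    for row in updates:
        if row["id"] in current_keys:
            # Changed row: this copy matches and expires the old version.
            staged.append({"merge_key": row["id"], **row})
            # NULL merge key never matches -> WHEN NOT MATCHED inserts
            # the new version of the same business key.
            staged.append({"merge_key": None, **row})
        else:
            # Brand-new business key: a single insert-only row suffices.
            staged.append({"merge_key": row["id"], **row})
    return staged

updates = [{"id": 1, "city": "Pune"}, {"id": 9, "city": "Oslo"}]
staged = stage_for_scd2(updates, current_keys={1})
print([r["merge_key"] for r in staged])  # [1, None, 9]
```

In Delta Lake the same shape is produced with a union of the updates DataFrame against itself (one copy with `merge_key = id`, one with `merge_key = NULL`), then `MERGE ... ON target.id = source.merge_key`.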
Change Data Capture (CDC)
Debezium CDC Pipeline Design
Delta Lake Change Data Feed (CDF)
CDC Event Types
Medallion Architecture (Bronze/Silver/Gold)
Design Decisions per Layer
Data Quality Between Layers
When to Deviate
Late-Arriving Data Handling
Auto Loader
Directory Listing vs File Notification
Schema Evolution Modes
Auto Loader vs COPY INTO
Delta Live Tables (DLT)
Expectations (3 Levels)
SCD Type 2 with apply_changes
Materialized View vs Streaming Table
CONTINUOUS vs TRIGGERED Mode
Scenario-Based Pipeline Design (10+ Scenarios)
Clickstream 10B events/day
8-hour Pipeline Optimization
Data Mesh on Databricks
Oracle CDC to Delta Lake
Batch + Streaming Unified
Data Quality Framework
500K Small Files Fix
Streaming Pipeline Falling Behind
2-Year Backfill
Multi-Tenant Architecture
3-Hour MERGE Optimization
Real-Time Fraud Detection
Pipeline Health Monitoring
5. PERFORMANCE TUNING & PRODUCTION (File: 05)
Reading Spark Execution Plans
Spark UI Debugging Methodology
Key Metrics to Monitor
Partition Tuning
spark.sql.shuffle.partitions
Optimal Partition Size
Small File Problem
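A common interview follow-up on `spark.sql.shuffle.partitions` is "what value would you set, and why?" A back-of-the-envelope helper for the usual rule of thumb — size each shuffle partition around 100-200 MB (128 MB target here). The constants are heuristics for illustration, not Spark defaults for any particular workload:

```python
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128 MB per shuffle partition

def suggest_shuffle_partitions(shuffle_bytes, min_partitions=200):
    """Suggest a starting value for spark.sql.shuffle.partitions,
    read from the stage's 'Shuffle Write' size in the Spark UI."""
    needed = -(-shuffle_bytes // TARGET_PARTITION_BYTES)  # ceiling division
    return max(int(needed), min_partitions)

# e.g. a stage shuffling ~500 GB:
print(suggest_shuffle_partitions(500 * 1024**3))  # 4000
```

With AQE enabled, `spark.sql.adaptive.coalescePartitions.enabled` makes the exact number less critical (Spark coalesces small post-shuffle partitions), but a sane starting value still matters for the largest stages.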
Join Optimization
All Join Strategies Compared
Broadcast Threshold Tuning
Dynamic Partition Pruning
Spill Debugging
Predicate Pushdown Verification
File Format Selection
Unity Catalog
Three-Level Namespace
Row-Level & Column-Level Security
Data Lineage
Storage Credentials & External Locations
Delta Sharing
Hive Metastore Migration
Photon Engine
When It Helps
When It Doesn't
Databricks Workflows & Orchestration
Workflows vs Airflow
Job Cluster vs All-Purpose
Task Parameter Passing
Databricks Asset Bundles (CI/CD)
Cost Management Strategies
CI/CD for Databricks Pipelines
Data Governance at Scale
System Design Questions (Leadership Level)
TOPIC FREQUENCY IN REAL INTERVIEWS
🔴 Very High: Delta MERGE/SCD2, Performance Tuning, Medallion Architecture
🟠 High: Data Skew, Streaming, Unity Catalog, Auto Loader, AQE, OPTIMIZE/ZORDER
🟡 Medium-High: DLT, Window Functions, CI/CD, Photon
🟢 Medium: Liquid Clustering, Deletion Vectors, Delta Sharing

FILE STRUCTURE

| File | Topic | Questions |
|------|-------|-----------|
| 01_Spark_Architecture_and_Internals.md | Spark core, Catalyst, Memory, Shuffle, Serialization | ~40 |
| 02_PySpark_Advanced_Operations.md | DataFrames, Joins, Windows, Streaming, Coding | ~50 |
| 03_Delta_Lake_and_Lakehouse.md | Delta internals, MERGE, Time Travel, Optimization | ~40 |
| 04_ETL_Scenarios_and_Design.md | SCD, CDC, Medallion, DLT, Pipeline Design | ~45 |
| 05_Performance_Tuning_and_Production.md | Tuning, Unity Catalog, Photon, Governance, Leadership | ~40 |

QUESTION TRACKER (Prevents Duplicates)

Already Covered Topics (DO NOT RE-ADD):

  • Spark application lifecycle
  • DAG Scheduler vs Task Scheduler
  • Narrow vs Wide dependencies
  • Catalyst optimizer (4 phases)
  • AQE (3 optimizations)
  • Whole-stage codegen & Tungsten
  • Unified memory management
  • OOM debugging (driver vs executor)
  • Shuffle process (write/read)
  • External shuffle service
  • FetchFailedException
  • Serialization options (Java/Kryo/Arrow/Tungsten)
  • Accumulators & Broadcast variables
  • Speculative execution
  • Block Manager
  • DataFrame select/withColumn/selectExpr
  • withColumn anti-pattern
  • All 5 join strategies
  • Data skew handling (4 techniques)
  • Window functions (row_number/rank/dense_rank)
  • Running total, moving average, pct_of_total
  • rowsBetween vs rangeBetween
  • repartition vs coalesce
  • Hash vs Range partitioning
  • Bucketing
  • cache/persist/checkpoint
  • UDF types & performance
  • Pandas UDF / mapInPandas
  • Structured Streaming model
  • Output modes (append/complete/update)
  • Watermarking
  • Stream-stream joins
  • foreachBatch pattern
  • Exactly-once semantics
  • State store management
  • Delta transaction log
  • Checkpoint files
  • Optimistic concurrency control
  • File-level statistics
  • MERGE INTO (all scenarios)
  • Time travel
  • OPTIMIZE / VACUUM / ZORDER
  • Z-Ordering vs Liquid Clustering
  • Deletion vectors
  • Schema evolution
  • Deep clone vs Shallow clone
  • Managed vs External tables
  • Delta table properties
  • SCD Type 1, 2, 3 (full code)
  • Merge key trick
  • CDC patterns (Debezium, CDF)
  • Medallion architecture
  • Auto Loader (modes, schema evolution)
  • DLT (expectations, apply_changes)
  • 12+ scenario-based pipeline designs
  • Spark UI debugging
  • Partition tuning
  • Small file problem
  • Predicate pushdown
  • Dynamic partition pruning
  • Spill debugging
  • Unity Catalog (all aspects)
  • Photon engine
  • Databricks Workflows
  • Asset Bundles / CI/CD
  • Cost management
  • Data governance
  • System design (leadership)
  • Py4J gateway architecture
  • Cost-based optimization (CBO)
  • Coding challenges (15+ problems)

HOW TO USE THIS PREPARATION

If you have 1 week:

  1. Day 1-2: File 01 (Architecture) + File 03 (Delta Lake) — understand the foundation
  2. Day 3-4: File 04 (ETL Scenarios) — practice SCD Type 2 code, CDC pipelines
  3. Day 5-6: File 02 (PySpark) — coding challenges, streaming, joins
  4. Day 7: File 05 (Production) — Unity Catalog, tuning, system design

If you have 3 days:

  1. Day 1: File 03 + File 04 (Delta Lake + ETL — most asked topics)
  2. Day 2: File 01 + File 02 (Architecture + PySpark coding)
  3. Day 3: File 05 (Production + system design)

If you have 1 day:

Focus on: SCD Type 2 code, MERGE scenarios, Medallion architecture, Data skew handling, AQE, Unity Catalog, and the scenario-based questions in File 04.

KEY INTERVIEW PATTERNS FOR 10+ YEARS EXPERIENCE

  1. They won't ask "What is Spark?" — They'll ask "Walk me through how you'd debug a 4-hour job that's failing on a 10 TB dataset"
  2. Scenario > Definition — "Design a CDC pipeline from Oracle to Delta Lake" > "What is CDC?"
  3. Trade-offs > Single Answer — "When would you NOT use broadcast join?" > "What is a broadcast join?"
  4. Architecture > Code — "How would you design a multi-tenant data platform?" > "Write a SELECT query"
  5. Depth > Breadth — They'll drill into YOUR answers. If you mention Z-Ordering, expect "How does it work internally with space-filling curves?"