Databricks + PySpark Interview Preparation — Master Index & Mind Map

Databricks · Section 1 of 17
MIND MAP — Topics & Coverage

🗂️ DATABRICKS INTERVIEW

1. SPARK ARCHITECTURE & INTERNALS (File: 01)
Application Lifecycle (Driver, Executors, Cluster Manager)
DAG Scheduler vs Task Scheduler
Narrow vs Wide Dependencies
Catalyst Optimizer (4 Phases)
Tungsten & Whole-Stage CodeGen
Adaptive Query Execution (AQE)
Memory Management (Unified Model)
Execution vs Storage Memory
On-Heap vs Off-Heap
OOM Debugging (Driver vs Executor)
Shuffle Deep Dive
Shuffle Write (Map Side)
Shuffle Read (Reduce Side)
External Shuffle Service
FetchFailedException
Serialization (Java vs Kryo vs Tungsten vs Arrow)
Fault Tolerance & Speculation
2. PYSPARK ADVANCED OPERATIONS (File: 02)
DataFrame API Deep Dive
select vs withColumn vs selectExpr
withColumn Anti-Pattern
Lazy Evaluation Benefits
Joins (5 Physical Strategies)
Broadcast Hash Join
Sort-Merge Join
Shuffle Hash Join
BroadcastNestedLoop Join
Cartesian Product
Window Functions
row_number vs rank vs dense_rank
Running Totals, Moving Averages
rowsBetween vs rangeBetween
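The rowsBetween vs rangeBetween distinction above is a frequent follow-up. A minimal pure-Python sketch (no Spark session needed) of the semantic difference for a running sum: a ROWS frame ends at the current physical row, while a RANGE frame pulls in every row whose ordering value ties with the current one.

```python
def running_sum_rows(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    total, out = 0, []
    for _, amount in rows:
        total += amount
        out.append(total)
    return out

def running_sum_range(rows):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    out = []
    for order_key, _ in rows:
        # Every row whose ordering value <= the current one is in the
        # frame, so ties on the ordering column enter together.
        out.append(sum(a for k, a in rows if k <= order_key))
    return out

# Two rows tie on day 2, so the two frames diverge there.
data = [(1, 10), (2, 20), (2, 30), (3, 40)]  # (day, amount), sorted by day
print(running_sum_rows(data))   # [10, 30, 60, 100]
print(running_sum_range(data))  # [10, 60, 60, 100]
```

In PySpark the same contrast appears as `Window.orderBy("day").rowsBetween(...)` vs `.rangeBetween(...)`; the tie behavior is why range frames are the safer default for date-based running totals.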
Data Skew Handling
Salting Technique
AQE Skew Join
Isolate-and-Union
Two-Phase Aggregation
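Of the skew techniques listed, salting is the one interviewers most often ask candidates to sketch. A minimal pure-Python illustration (names like `N_SALTS` are made up for the example): the hot key is rewritten as `key#salt` with a random salt in `[0, N)`, spreading rows that previously hashed to one shuffle partition across N, while the small side of the join is exploded with all N salts so every salted variant still finds its match.

```python
import random

N_SALTS = 4  # illustrative; tune to the observed skew factor

def salt_fact_key(key):
    """Large (skewed) side: attach one random salt per row."""
    return f"{key}#{random.randrange(N_SALTS)}"

def explode_dim_key(key):
    """Small side: emit one copy per salt so the join still matches."""
    return [f"{key}#{s}" for s in range(N_SALTS)]

random.seed(0)
hot_rows = [salt_fact_key("user_42") for _ in range(1000)]
buckets = {k: hot_rows.count(k) for k in set(hot_rows)}
print(buckets)  # roughly 250 rows per salted key instead of 1000 on one
assert set(buckets) <= set(explode_dim_key("user_42"))
```

In real PySpark this becomes a `concat(col("key"), lit("#"), (rand() * N_SALTS).cast("int"))` on the fact side and an `explode` of a salt array on the dimension side, joined on the salted key.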
UDFs & Performance
Row-at-a-Time UDF (Slow)
Pandas UDF / Vectorized UDF
mapInPandas
Partitioning
repartition vs coalesce
Hash vs Range Partitioning
Bucketing
Caching & Checkpointing
Structured Streaming
Micro-Batch vs Continuous
Output Modes (Append/Complete/Update)
Watermarking
Stream-Stream Joins
foreachBatch Pattern
Exactly-Once Semantics
State Store Management
Coding Challenges (15+ Problems)
3. DELTA LAKE & LAKEHOUSE (File: 03)
Delta Lake Internals
Transaction Log (_delta_log)
Checkpoint Files
Optimistic Concurrency Control
File-Level Statistics & Data Skipping
Isolation Levels
MERGE INTO — All Scenarios
Basic MERGE
Handling Duplicate Keys
MERGE Performance Optimization
Schema Evolution with MERGE
Time Travel & Recovery
OPTIMIZE, VACUUM, ZORDER
Z-Ordering vs Liquid Clustering
Deletion Vectors
Schema Evolution
Clone Operations (Deep vs Shallow)
Table Properties (Important Settings)
Managed vs External Tables
Lakehouse Architecture
Data Lake vs Warehouse vs Lakehouse
Key Enabling Technologies
4. ETL SCENARIOS & PIPELINE DESIGN (File: 04)
SCD Implementations
SCD Type 1 (Overwrite)
SCD Type 2 (History — Full Code)
SCD Type 3 (Previous Value)
Merge Key Trick Explained
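The "merge key trick" above deserves a concrete sketch, since it is the part of SCD Type 2 candidates most often fumble. MERGE applies each source row at most once, but a changed dimension row needs two actions: expire the current version and insert the new one. The staged source therefore carries each changed key twice — once with the real merge key (matches, so it expires the old row) and once with a NULL merge key (never matches, so it falls into WHEN NOT MATCHED and inserts). A pure-Python sketch with illustrative names:

```python
def stage_for_scd2(updates, current_keys):
    """Build the doubled-up source for an SCD Type 2 MERGE."""
    staged = []
    for row in updates:
        if row["id"] in current_keys:
            # Changed row: this copy matches and expires the old version.
            staged.append({"merge_key": row["id"], **row})
            # NULL merge key never matches -> WHEN NOT MATCHED inserts
            # the new version of the same business key.
            staged.append({"merge_key": None, **row})
        else:
            # Brand-new business key: a single insert-only row suffices.
            staged.append({"merge_key": row["id"], **row})
    return staged

updates = [{"id": 1, "city": "Pune"}, {"id": 9, "city": "Oslo"}]
staged = stage_for_scd2(updates, current_keys={1})
print([r["merge_key"] for r in staged])  # [1, None, 9]
```

In Delta Lake the same shape is produced with a union of the updates DataFrame against itself (one copy with `merge_key = id`, one with `merge_key = NULL`), then `MERGE ... ON target.id = source.merge_key`.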
Change Data Capture (CDC)
Debezium CDC Pipeline Design
Delta Lake Change Data Feed (CDF)
CDC Event Types
Medallion Architecture (Bronze/Silver/Gold)
Design Decisions per Layer
Data Quality Between Layers
When to Deviate
Late-Arriving Data Handling
Auto Loader
Directory Listing vs File Notification
Schema Evolution Modes
Auto Loader vs COPY INTO
Delta Live Tables (DLT)
Expectations (3 Levels)
SCD Type 2 with apply_changes
Materialized View vs Streaming Table
CONTINUOUS vs TRIGGERED Mode
Scenario-Based Pipeline Design (10+ Scenarios)
Clickstream 10B events/day
8-hour Pipeline Optimization
Data Mesh on Databricks
Oracle CDC to Delta Lake
Batch + Streaming Unified
Data Quality Framework
500K Small Files Fix
Streaming Pipeline Falling Behind
2-Year Backfill
Multi-Tenant Architecture
3-Hour MERGE Optimization
Real-Time Fraud Detection
Pipeline Health Monitoring
5. PERFORMANCE TUNING & PRODUCTION (File: 05)
Reading Spark Execution Plans
Spark UI Debugging Methodology
Key Metrics to Monitor
Partition Tuning
spark.sql.shuffle.partitions
Optimal Partition Size
Small File Problem
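A common interview follow-up on `spark.sql.shuffle.partitions` is "what value would you set, and why?" A back-of-the-envelope helper for the usual rule of thumb — size each shuffle partition around 100-200 MB (128 MB target here). The constants are heuristics for illustration, not Spark defaults for any particular workload:

```python
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128 MB per shuffle partition

def suggest_shuffle_partitions(shuffle_bytes, min_partitions=200):
    """Suggest a starting value for spark.sql.shuffle.partitions,
    read from the stage's 'Shuffle Write' size in the Spark UI."""
    needed = -(-shuffle_bytes // TARGET_PARTITION_BYTES)  # ceiling division
    return max(int(needed), min_partitions)

# e.g. a stage shuffling ~500 GB:
print(suggest_shuffle_partitions(500 * 1024**3))  # 4000
```

With AQE enabled, `spark.sql.adaptive.coalescePartitions.enabled` makes the exact number less critical (Spark coalesces small post-shuffle partitions), but a sane starting value still matters for the largest stages.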
Join Optimization
All Join Strategies Compared
Broadcast Threshold Tuning
Dynamic Partition Pruning
Spill Debugging
Predicate Pushdown Verification
File Format Selection
Unity Catalog
Three-Level Namespace
Row-Level & Column-Level Security
Data Lineage
Storage Credentials & External Locations
Delta Sharing
Hive Metastore Migration
Photon Engine
When It Helps
When It Doesn't
Databricks Workflows & Orchestration
Workflows vs Airflow
Job Cluster vs All-Purpose
Task Parameter Passing
Databricks Asset Bundles (CI/CD)
Cost Management Strategies
CI/CD for Databricks Pipelines
Data Governance at Scale
System Design Questions (Leadership Level)
TOPIC FREQUENCY IN REAL INTERVIEWS
🔴 Very High: Delta MERGE/SCD2, Performance Tuning, Medallion Architecture
🟠 High: Data Skew, Streaming, Unity Catalog, Auto Loader, AQE, OPTIMIZE/ZORDER
🟡 Medium-High: DLT, Window Functions, CI/CD, Photon
🟢 Medium: Liquid Clustering, Deletion Vectors, Delta Sharing

FILE STRUCTURE

| File | Topic | Questions |
|------|-------|-----------|
| 01_Spark_Architecture_and_Internals.md | Spark core, Catalyst, Memory, Shuffle, Serialization | ~40 |
| 02_PySpark_Advanced_Operations.md | DataFrames, Joins, Windows, Streaming, Coding | ~50 |
| 03_Delta_Lake_and_Lakehouse.md | Delta internals, MERGE, Time Travel, Optimization | ~40 |
| 04_ETL_Scenarios_and_Design.md | SCD, CDC, Medallion, DLT, Pipeline Design | ~45 |
| 05_Performance_Tuning_and_Production.md | Tuning, Unity Catalog, Photon, Governance, Leadership | ~40 |

QUESTION TRACKER (Prevents Duplicates)

Already Covered Topics (DO NOT RE-ADD):

  • Spark application lifecycle
  • DAG Scheduler vs Task Scheduler
  • Narrow vs Wide dependencies
  • Catalyst optimizer (4 phases)
  • AQE (3 optimizations)
  • Whole-stage codegen & Tungsten
  • Unified memory management
  • OOM debugging (driver vs executor)
  • Shuffle process (write/read)
  • External shuffle service
  • FetchFailedException
  • Serialization options (Java/Kryo/Arrow/Tungsten)
  • Accumulators & Broadcast variables
  • Speculative execution
  • Block Manager
  • DataFrame select/withColumn/selectExpr
  • withColumn anti-pattern
  • All 5 join strategies
  • Data skew handling (4 techniques)
  • Window functions (row_number/rank/dense_rank)
  • Running total, moving average, pct_of_total
  • rowsBetween vs rangeBetween
  • repartition vs coalesce
  • Hash vs Range partitioning
  • Bucketing
  • cache/persist/checkpoint
  • UDF types & performance
  • Pandas UDF / mapInPandas
  • Structured Streaming model
  • Output modes (append/complete/update)
  • Watermarking
  • Stream-stream joins
  • foreachBatch pattern
  • Exactly-once semantics
  • State store management
  • Delta transaction log
  • Checkpoint files
  • Optimistic concurrency control
  • File-level statistics
  • MERGE INTO (all scenarios)
  • Time travel
  • OPTIMIZE / VACUUM / ZORDER
  • Z-Ordering vs Liquid Clustering
  • Deletion vectors
  • Schema evolution
  • Deep clone vs Shallow clone
  • Managed vs External tables
  • Delta table properties
  • SCD Type 1, 2, 3 (full code)
  • Merge key trick
  • CDC patterns (Debezium, CDF)
  • Medallion architecture
  • Auto Loader (modes, schema evolution)
  • DLT (expectations, apply_changes)
  • 12+ scenario-based pipeline designs
  • Spark UI debugging
  • Partition tuning
  • Small file problem
  • Predicate pushdown
  • Dynamic partition pruning
  • Spill debugging
  • Unity Catalog (all aspects)
  • Photon engine
  • Databricks Workflows
  • Asset Bundles / CI/CD
  • Cost management
  • Data governance
  • System design (leadership)
  • Py4J gateway architecture
  • Cost-based optimization (CBO)
  • Coding challenges (15+ problems)

HOW TO USE THIS PREPARATION

If you have 1 week:

  1. Day 1-2: File 01 (Architecture) + File 03 (Delta Lake) — understand the foundation
  2. Day 3-4: File 04 (ETL Scenarios) — practice SCD Type 2 code, CDC pipelines
  3. Day 5-6: File 02 (PySpark) — coding challenges, streaming, joins
  4. Day 7: File 05 (Production) — Unity Catalog, tuning, system design

If you have 3 days:

  1. Day 1: File 03 + File 04 (Delta Lake + ETL — most asked topics)
  2. Day 2: File 01 + File 02 (Architecture + PySpark coding)
  3. Day 3: File 05 (Production + system design)

If you have 1 day:

Focus on: SCD Type 2 code, MERGE scenarios, Medallion architecture, Data skew handling, AQE, Unity Catalog, and the scenario-based questions in File 04.

KEY INTERVIEW PATTERNS FOR 10+ YEARS EXPERIENCE

  1. They won't ask "What is Spark?" — They'll ask "Walk me through how you'd debug a 4-hour job that's failing on a 10 TB dataset"
  2. Scenario > Definition — "Design a CDC pipeline from Oracle to Delta Lake" > "What is CDC?"
  3. Trade-offs > Single Answer — "When would you NOT use broadcast join?" > "What is a broadcast join?"
  4. Architecture > Code — "How would you design a multi-tenant data platform?" > "Write a SELECT query"
  5. Depth > Breadth — They'll drill into YOUR answers. If you mention Z-Ordering, expect "How does it work internally with space-filling curves?"