Optimization + Performance
Senior-level deep dive: Catalyst, Tungsten, AQE, join strategies, skew, and configs. This is the #1 differentiator topic at the Senior Data Engineer level.
THE SENIOR ENGINEER MINDSET
🧠 Memory Map
When someone says "the Spark job is slow", I don't guess.
I open the Spark UI:
1. Stages tab → skewed tasks (is one task 10x slower than the median?)
2. SQL tab → physical plan (Sort-Merge Joins that should be Broadcasts?)
3. Executors tab → GC time > 10%? → memory pressure
4. Storage tab → a missing cache that gets recomputed?
5. Environment tab → shuffle.partitions = 200 for 10 GB of data?
The answer is ALWAYS one of:
Data skew → salting or AQE skew handling
Wrong join strategy → broadcast hint or AQE
Wrong partition count → repartition / coalesce / shuffle.partitions
DataFrame recomputed multiple times → cache()
GC pressure → G1GC + memory config tuning
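The first fix, salting, can be sketched in plain Python without Spark (names and bucket count here are illustrative): a hot key is split into N salted sub-keys so one partition no longer receives all of its rows.

```python
import random
from collections import Counter

SALT_BUCKETS = 4  # illustrative: how many sub-keys the hot key is split into

def salt_key(key: str, num_buckets: int = SALT_BUCKETS) -> str:
    """Append a random salt so a single hot key spreads over num_buckets partitions."""
    return f"{key}_{random.randrange(num_buckets)}"

# 1000 rows all share the hot key "US" -> unsalted, one partition gets them all
rows = ["US"] * 1000
salted = Counter(salt_key(k) for k in rows)

# Each salted sub-key now holds only a fraction of the original rows
assert set(salted) <= {f"US_{i}" for i in range(SALT_BUCKETS)}
assert sum(salted.values()) == 1000
```

On the other side of the join, each key must be exploded into all SALT_BUCKETS variants so matches still line up; AQE's skew-join handling automates a similar split at runtime.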
PART 1: CATALYST OPTIMIZER
The 4 Phases
🧠 Memory Map
USER CODE (DataFrame/SQL)
↓
1. ANALYSIS PHASE
Parse query → Unresolved Logical Plan
Resolve columns/tables against Catalog
Result: Resolved Logical Plan
↓
2. LOGICAL OPTIMIZATION PHASE
Apply rule-based optimizations:
• Predicate Pushdown → move filters as early as possible
• Column Pruning → read only the needed columns
• Constant Folding → evaluate 1 + 1 → 2 at plan time
• Join Reordering → put the smaller table first
Result: Optimized Logical Plan
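The payoff of predicate pushdown can be illustrated in plain Python with a toy nested-loop join (a sketch of the idea, not Spark's implementation): filtering before the join cuts the number of row comparisons by 10x here while producing the same result.

```python
# Toy tables: (id, country) facts and (id, name) dims
facts = [(i, "US" if i % 10 == 0 else "EU") for i in range(1000)]
dims = [(i, f"name_{i}") for i in range(1000)]

def join_then_filter():
    """Naive plan: join everything, filter afterwards."""
    comparisons, out = 0, []
    for f in facts:
        for d in dims:
            comparisons += 1
            if f[0] == d[0] and f[1] == "US":
                out.append((f[0], d[1]))
    return out, comparisons

def filter_then_join():
    """Pushed-down plan: filter first, join only the survivors."""
    comparisons, out = 0, []
    for f in (f for f in facts if f[1] == "US"):
        for d in dims:
            comparisons += 1
            if f[0] == d[0]:
                out.append((f[0], d[1]))
    return out, comparisons

slow, n_slow = join_then_filter()
fast, n_fast = filter_then_join()
assert slow == fast        # identical result
assert n_fast < n_slow     # 100_000 vs 1_000_000 comparisons
```

When the source is Parquet, Spark pushes the predicate even further, into the file reader, so filtered row groups are never read at all.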
↓
3. PHYSICAL PLANNING PHASE
Generate multiple Physical Plans (join strategies, scan strategies)
Cost-Based Optimizer (CBO) picks cheapest plan
Result: Selected Physical Plan
↓
4. CODE GENERATION (Tungsten Whole-Stage CodeGen)
Fuses the stage's operators and generates Java code, compiled to bytecode
Eliminates per-row virtual function calls between operators
Result: Compiled code is executed
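The virtual-call elimination can be mimicked in plain Python (a sketch of the concept, not actual codegen): a Volcano-style chain of operator objects dispatches per row, while whole-stage codegen fuses the same plan into one tight loop.

```python
# Volcano-style interpretation: one dispatch per operator per row
class Scan:
    def __init__(self, data): self.data = data
    def rows(self): return iter(self.data)

class Filter:
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def rows(self): return (r for r in self.child.rows() if self.pred(r))

class Project:
    def __init__(self, child, fn): self.child, self.fn = child, fn
    def rows(self): return (self.fn(r) for r in self.child.rows())

data = list(range(1_000))
plan = Project(Filter(Scan(data), lambda r: r % 2 == 0), lambda r: r * 10)
interpreted = list(plan.rows())

# "Whole-stage codegen": the same plan fused into a single loop, no dispatch
fused = [r * 10 for r in data if r % 2 == 0]

assert interpreted == fused  # same answer, far fewer per-row function calls
```

Tungsten generates this fused form as Java source per stage, compiles it with Janino, and keeps the data in CPU registers instead of bouncing between operator iterators.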
Predicate Pushdown — When It Works
python
# WORKS — Spark pushes filter into Parquet reader
df = spark.read.parquet("data/")