PySpark
Optimization + Performance
PySpark · Section 7 of 9


Senior-level deep dive: Catalyst, Tungsten, AQE, join strategies, skew, and configs. This is the #1 differentiator topic at the Senior Data Engineer level.

THE SENIOR ENGINEER MINDSET

🧠 Memory Map
When someone says "the Spark job is slow", I don't guess.
I open Spark UI:
1. Stages tab → skewed tasks (one task 10× slower than the median?)
2. SQL tab → physical plan (Sort-Merge Joins that should be Broadcast?)
3. Executors tab → GC time > 10%? → memory pressure
4. Storage tab → missing cache that's being recomputed?
5. Environment tab → shuffle.partitions = 200 for 10 GB of data?
The answer is ALWAYS one of:
Data skew → salting or AQE skew handling
Wrong join strategy → broadcast hint or AQE
Partitions wrong → repartition / coalesce / shuffle.partitions
DataFrame recomputed multiple times → cache()
GC pressure → G1GC + memory config tuning

PART 1: CATALYST OPTIMIZER

The 4 Phases

🧠 Memory Map
USER CODE (DataFrame/SQL)
1. ANALYSIS PHASE
Parse query → Unresolved Logical Plan
Resolve columns/tables against Catalog
Result: Resolved Logical Plan
2. LOGICAL OPTIMIZATION PHASE
Apply rule-based optimizations:
• Predicate Pushdown → move filters as early as possible
• Column Pruning → read only the columns needed
• Constant Folding → 1 + 1 → 2 at plan time
• Join Reordering → put smaller tables first
Result: Optimized Logical Plan
3. PHYSICAL PLANNING PHASE
Generate multiple Physical Plans (join strategies, scan strategies)
Cost-Based Optimizer (CBO) picks cheapest plan
Result: Selected Physical Plan
4. CODE GENERATION (Tungsten Whole-Stage CodeGen)
Generates Java bytecode for the physical plan
Eliminates virtual function calls
Result: Executed compiled code

Predicate Pushdown — When It Works

python — editable
# WORKS — Spark pushes filter into Parquet reader
df = spark.read.parquet("data/")