PySpark Day 3 — Optimization + Performance
Senior-level deep dive: Catalyst, Tungsten, AQE, join strategies, skew, configs. This is the #1 differentiator topic at the Senior Data Engineer level.
THE SENIOR ENGINEER MINDSET
🧠 Memory Map
When someone says "the Spark job is slow", I don't guess.
I open Spark UI:
1. Stages tab→skewed tasks (one task 10x slower than median?)
2. SQL tab→physical plan (Sort-Merge Joins that should be Broadcast?)
3. Executors tab→GC time > 10%? → memory pressure
4. Storage tab→missing cache that's recomputed?
5. Environment tab→shuffle.partitions = 200 for 10 GB data?
The answer is ALWAYS one of:
Data skew → salting or AQE skew handling
Wrong join strategy → broadcast hint or AQE
Partitions wrong → repartition / coalesce / shuffle.partitions
DataFrame recomputed multiple times → cache()
GC pressure → G1GC + memory config tuning
PART 1: CATALYST OPTIMIZER
The 4 Phases
🧠 Memory Map
USER CODE (DataFrame/SQL)
↓
1. ANALYSIS PHASE
Parse query → Unresolved Logical Plan
Resolve columns/tables against Catalog
Result: Resolved Logical Plan
↓
2. LOGICAL OPTIMIZATION PHASE
Apply rule-based optimizations:
• Predicate Pushdown→move filters as early as possible
• Column Pruning→only read needed columns
• Constant Folding→1 + 1 → 2 at plan time
• Join Reordering→smaller table first
Result: Optimized Logical Plan
↓
3. PHYSICAL PLANNING PHASE
Generate multiple Physical Plans (join strategies, scan strategies)
Cost-Based Optimizer (CBO) picks cheapest plan
Result: Selected Physical Plan
↓
4. CODE GENERATION (Tungsten Whole-Stage CodeGen)
Generates Java code for the physical plan (compiled to JVM bytecode by Janino)
Eliminates virtual function calls
Result: Executed compiled code
Predicate Pushdown — When It Works
python — editable
# WORKS — Spark pushes filter into Parquet reader
from pyspark.sql.functions import col, year
df = spark.read.parquet("data/")
df.filter(col("year") == 2024).select("name", "amount")
# → Parquet reader skips row groups where year ≠ 2024
# WORKS — Pushes into JDBC source
df = spark.read.jdbc(...).filter(col("id") > 1000)
# → Sends WHERE id > 1000 to database
# DOES NOT WORK — UDF blocks pushdown
df.filter(my_udf(col("year")) == 2024) # Catalyst can't push UDF into source
# DOES NOT WORK — computed column loses pushdown
df.withColumn("yr", year(col("date"))).filter(col("yr") == 2024)
# ↑ use df.filter(year(col("date")) == 2024) BEFORE withColumn instead
Column Pruning
python — editable
# Catalyst automatically prunes: reads only "id" and "name" from Parquet
df = spark.read.parquet("data/").select("id", "name")
# Parquet is columnar — unread columns are never deserialized
Force Catalyst to Show Plan
python — editable
df.explain() # Physical plan only
df.explain(True) # All 4 plans
df.explain("formatted") # Formatted with indentation
df.explain("cost") # With cost statistics
df.explain("codegen") # Generated Java code
PART 2: TUNGSTEN ENGINE
3 Key Features
1. OFF-HEAP MEMORY MANAGEMENT
Custom memory allocator (bypasses Java heap/GC)
Stores data in binary format (not Java objects)
No GC pressure for data storage
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 4g
2. CACHE-AWARE COMPUTATION
Organizes data access patterns for CPU cache efficiency
Sort and hash operations optimized for L1/L2 cache
Column-based in-memory format
3. WHOLE-STAGE CODE GENERATION (CodeGen)
Collapses multiple operators into one compiled JVM function
Eliminates virtual function calls per row
~10x improvement in CPU efficiency
Applied automatically by Catalyst
PART 3: REPARTITION vs COALESCE
python — editable
# repartition(N) — WIDE transformation (full shuffle)
df.repartition(200) # increase OR decrease partitions
df.repartition(200, "col") # hash partition by column (for joins!)
df.repartition(200, "col1", "col2") # hash by multiple columns
# coalesce(N) — NARROW transformation (no shuffle)
df.coalesce(10) # only DECREASE partitions (merge adjacent)
# cannot increase partition count
WHEN TO USE:
repartition():
- Before a join (ensure even distribution)
- When increasing partition count
- When you want balanced partitions by column value
- Upstream of a large shuffle operation
coalesce():
- After a filter that reduces data significantly
- Before writing output (reduce file count)
- When current partitions are already somewhat balanced
⚠️ TRAP: coalesce(1) creates 1 partition → writes 1 file → bottleneck!
Use repartition(N) then write, not coalesce(1) for large data.
spark.sql.shuffle.partitions Tuning
python — editable
# Default: 200 (good for ~200 GB data)
# Rule of thumb: 1 partition per 100 MB-200 MB of shuffled data
spark.conf.set("spark.sql.shuffle.partitions", "400")
# TOO FEW partitions (e.g., 10 for 1 TB data):
# → Each partition too large → OOM, slow, no parallelism
# TOO MANY partitions (e.g., 200 for 1 MB data):
# → Thousands of tiny files → scheduler overhead, small file problem
# AQE automatically adjusts this! (see Part 5)
# Check current setting:
spark.conf.get("spark.sql.shuffle.partitions")
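The rule of thumb above can be turned into a quick sizing helper. A plain-Python sketch (the function name and the 150 MB midpoint are my own choices, not a Spark API):

```python
def suggest_shuffle_partitions(shuffle_bytes, target_mb=150, min_partitions=1):
    """Suggest a spark.sql.shuffle.partitions value from shuffled data volume.

    Applies the 100-200 MB-per-partition rule of thumb (target_mb defaults
    to the midpoint). Purely illustrative — AQE can coalesce at runtime too.
    """
    target_bytes = target_mb * 1024 * 1024
    return max(min_partitions, -(-shuffle_bytes // target_bytes))  # ceiling division

# 1 TB of shuffled data at ~150 MB per partition
print(suggest_shuffle_partitions(1 * 1024**4))  # → 6991
```

With AQE enabled, set this to the expected peak and let coalescing shrink it for small runs.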
PART 4: JOIN STRATEGIES
The 5 Join Strategies
1. BROADCAST HASH JOIN (BHJ) — FASTEST
─────────────────────────────────────────────────────────
When: One table fits in memory (< autoBroadcastJoinThreshold)
How: Driver broadcasts small table to ALL Executors
Each Executor does local hash join — NO SHUFFLE
Best for: Fact-dimension joins (orders JOIN countries)
Trigger:
• Auto: spark.sql.autoBroadcastJoinThreshold = 10 MB (default)
• Manual hint: df1.join(broadcast(df2), "key")
2. SHUFFLE HASH JOIN (SHJ)
─────────────────────────────────────────────────────────
When: One side is small enough to build hash map in memory
(but too big to broadcast)
How: Shuffle both sides by join key
Build hash map from smaller side
Probe hash map with larger side (no sort needed)
Best for: Medium-sized tables
3. SORT-MERGE JOIN (SMJ) — DEFAULT for large tables
─────────────────────────────────────────────────────────
When: Both tables are large (neither fits in memory for hash)
How: Shuffle both sides by join key
Sort both sides by join key
Merge (like merge sort) — no hash map needed
Best for: Large-large joins, equi-joins
Note: Most robust, but most expensive (shuffle + sort + merge)
4. BROADCAST NESTED LOOP JOIN (BNLJ) — SLOW
─────────────────────────────────────────────────────────
When: Non-equi joins (>, <, >=, <=, !=, BETWEEN, LIKE)
How: Broadcasts smaller table, nested loops for each row pair
Very slow O(n×m) — avoid when possible
Use case: "Find all orders within 30 days of each event"
5. CARTESIAN JOIN — SLOWEST
─────────────────────────────────────────────────────────
When: CROSS JOIN with no join condition
Result: M × N rows (every combination)
Requires: spark.sql.crossJoin.enabled = true
Use case: Generating all combinations (usually intentional)
Broadcast Join — Full Details
python — editable
from pyspark.sql.functions import broadcast
# Method 1: Hint in code (override Catalyst's decision)
result = large_df.join(broadcast(small_df), "customer_id")
# Method 2: SQL hint
spark.sql("""
SELECT /*+ BROADCAST(countries) */ *
FROM orders o
JOIN countries c ON o.country_code = c.code
""")
# Threshold: auto-broadcast tables smaller than this
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50m") # 50 MB
# Set to -1 to DISABLE auto-broadcast (for testing SMJ)
# When to manually broadcast:
# 1. Table is >10 MB but <driver_memory (auto-threshold is too conservative)
# 2. AQE hasn't kicked in yet (early in job)
# 3. You know table is small but stats aren't computed (no ANALYZE TABLE)
# ⚠️ DANGER: Broadcasting a 10 GB table → Driver/Executor OOM
Join Strategy Decision Tree
🧠 Memory Map
Question: How big are the tables?
Small + Any→BROADCAST HASH JOIN (always prefer!)
↓ (both large)
Equi-join?
Yes→SORT-MERGE JOIN (shuffle + sort + merge)
No→BROADCAST NESTED LOOP JOIN (slow, avoid if possible)
Non-equi but small side→BROADCAST NESTED LOOP (more tolerable)
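The decision tree can be sketched as a tiny plain-Python function. This approximates Catalyst's behavior rather than reproducing its code; real planning also weighs hints, statistics, and config, and the threshold below just mirrors the default autoBroadcastJoinThreshold:

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # default spark.sql.autoBroadcastJoinThreshold

def pick_join_strategy(left_bytes, right_bytes, equi_join=True):
    """Approximate the join-strategy decision tree (illustrative only)."""
    smaller = min(left_bytes, right_bytes)
    if equi_join:
        if smaller <= BROADCAST_THRESHOLD:
            return "BroadcastHashJoin"       # small side fits → broadcast, no shuffle
        return "SortMergeJoin"               # both large → shuffle + sort + merge
    # non-equi predicates can't use a hash map or merge on a key
    return "BroadcastNestedLoopJoin"

print(pick_join_strategy(100 * 1024**3, 5 * 1024**2))             # → BroadcastHashJoin
print(pick_join_strategy(100 * 1024**3, 80 * 1024**3))            # → SortMergeJoin
print(pick_join_strategy(1 * 1024**3, 1 * 1024**2, equi_join=False))  # → BroadcastNestedLoopJoin
```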
PART 5: AQE — ADAPTIVE QUERY EXECUTION
What is AQE?
AQE is a Spark 3.0+ feature that re-optimizes query plans at runtime using actual runtime statistics rather than estimates. It is enabled by default since Spark 3.2.
python — editable
# Enable AQE (default ON in Spark 3.2+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
AQE Feature 1: Coalescing Shuffle Partitions
PROBLEM
shuffle.partitions = 200 (default)
But your data is small→200 tiny partitions → scheduler overhead
AQE SOLUTION
After each shuffle stage, AQE looks at actual partition sizes
Merges small partitions into larger ones
Result: Fewer, more optimal partitions (no manual tuning!)
CONFIG
spark.sql.adaptive.coalescePartitions.enabled = true (default: true with AQE)
spark.sql.adaptive.advisoryPartitionSizeInBytes = 64m # target size per partition
spark.sql.adaptive.coalescePartitions.minPartitionNum = 1 # minimum after coalescing
AQE Feature 2: Dynamic Join Strategy Switching
PROBLEM
At plan time, Catalyst estimates table A is 500 MB (no broadcast)
But at runtime, after filtering, table A is actually 5 MB
Original plan: Sort-Merge Join (expensive)
AQE SOLUTION
At runtime, after filter executes, AQE rechecks actual table size
If now small enough→switches to Broadcast Hash Join on the fly!
No code change needed.
CONFIG
spark.sql.adaptive.localShuffleReader.enabled = true
# AQE can avoid network shuffle entirely when possible
AQE Feature 3: Skew Join Optimization
PROBLEM
One join key (e.g., "US") appears in 80% of rows
One task processes 80% of data→1 task runs for hours while others finish
AQE SOLUTION
Detects skewed partitions at runtime (using actual sizes)
Splits skewed partitions into smaller sub-partitions
Duplicates the matching data from the other side
Processes sub-partitions in parallel
CONFIG
spark.sql.adaptive.skewJoin.enabled = true (default: true with AQE)
spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5 # 5x median = skewed
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256m # and > 256 MB
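A partition is flagged as skewed only when it exceeds BOTH conditions: factor × median AND the absolute byte threshold. A plain-Python restatement of that rule (illustrative, not Spark source):

```python
from statistics import median

def skewed_partitions(sizes, factor=5, threshold_bytes=256 * 1024 * 1024):
    """Return indices of partitions AQE would flag as skewed.

    Mirrors the two conditions behind skewedPartitionFactor and
    skewedPartitionThresholdInBytes (a sketch, not the Spark implementation).
    """
    med = median(sizes)
    return [i for i, size in enumerate(sizes)
            if size > factor * med and size > threshold_bytes]

MB = 1024 * 1024
print(skewed_partitions([100 * MB, 120 * MB, 110 * MB, 2048 * MB]))  # → [3]
```

Note the second condition: a relatively huge partition in a tiny dataset is NOT flagged, because splitting sub-256 MB partitions isn't worth the overhead.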
AQE Full Config Block
python — editable
spark = SparkSession.builder \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
.config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m") \
.config("spark.sql.adaptive.skewJoin.enabled", "true") \
.config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5") \
.config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m") \
.getOrCreate()
PART 6: DATA SKEW — DETECTION + SOLUTIONS
Detecting Skew
python — editable
# Method 1: Spark UI → Stages → look for one task MUCH slower than median
# "Task Duration: median=5s, max=500s" → SKEW
# Method 2: Check key distribution
from pyspark.sql.functions import col
df.groupBy("country_code").count().orderBy(col("count").desc()).show(20)
# "US: 50M rows, UK: 2M, FR: 1M" → US is skewed key
# Method 3: Check partition sizes
from pyspark.sql.functions import spark_partition_id, col
df.withColumn("pid", spark_partition_id()) \
.groupBy("pid").count() \
.orderBy(col("count").desc()) \
.show(20)
# Large variance in count → skew
Solution 1: Salting (Manual Fix)
python — editable
from pyspark.sql.functions import col, concat, lit, rand
# Step 1: Add salt to the large table's skewed key
SALT_FACTOR = 10
large_df_salted = large_df.withColumn(
"salted_key",
concat(col("country_code"), lit("_"), (rand() * SALT_FACTOR).cast("int").cast("string"))
)
# Step 2: Explode the small table to match all salt values
from pyspark.sql.functions import array, explode
small_df_exploded = small_df.withColumn(
"salt", array([lit(i) for i in range(SALT_FACTOR)])
).withColumn("salt", explode("salt")) \
.withColumn("salted_key",
concat(col("country_code"), lit("_"), col("salt").cast("string")))
# Step 3: Join on salted key
result = large_df_salted.join(small_df_exploded, "salted_key", "inner")
# Step 4: Drop salt columns
result = result.drop("salted_key", "salt")
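To see why this works outside Spark: the random suffix spreads the hot key's rows across SALT_FACTOR distinct join keys, while the exploded small side guarantees every salted row still finds its match. A plain-Python sketch of the key rewrite:

```python
import random

SALT_FACTOR = 10

def salt_key(key, rng):
    # large side: one random salt per row
    return f"{key}_{rng.randrange(SALT_FACTOR)}"

def explode_key(key):
    # small side: every salt value, so every salted row still matches
    return [f"{key}_{i}" for i in range(SALT_FACTOR)]

rng = random.Random(42)
hot_rows = [salt_key("US", rng) for _ in range(1000)]  # 1000 rows of the hot key
buckets = {k: hot_rows.count(k) for k in set(hot_rows)}
print(len(buckets))        # the hot key is now spread over 10 salted keys
print(sorted(explode_key("US"))[:3])  # → ['US_0', 'US_1', 'US_2']
```

The trade-off: the small side grows by SALT_FACTOR×, so salting only pays off when the small side really is small.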
Solution 2: Broadcast the Small Side (If Possible)
python — editable
# If the skewed join is small-large, just broadcast the small side
result = large_df.join(broadcast(small_df), "country_code")
Solution 3: Let AQE Handle It (Spark 3.0+)
python — editable
# Enable AQE skew join handling (splits hot partitions automatically)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# No code change! AQE detects and handles at runtime.
Solution 4: Two-Phase Aggregation (for groupBy skew)
python — editable
# Phase 1: Group by (key, salt) to partially aggregate within partitions
from pyspark.sql.functions import rand, sum  # note: shadows built-in sum
SALT = 10
partial = df.withColumn("salt", (rand() * SALT).cast("int")) \
.groupBy("country_code", "salt") \
.agg(sum("amount").alias("partial_sum"))
# Phase 2: Final aggregation drops salt
final = partial.groupBy("country_code") \
.agg(sum("partial_sum").alias("total_amount"))
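The two phases produce the exact same totals as a direct groupBy because sum is associative; the salt only changes how the work is split. A plain-Python check of that property:

```python
import random
from collections import defaultdict

SALT = 10
rng = random.Random(0)
rows = [("US", 5)] * 800 + [("UK", 3)] * 200  # skewed toward "US"

# Phase 1: aggregate by (key, salt) — spreads the hot key over SALT groups
partial = defaultdict(int)
for key, amount in rows:
    partial[(key, rng.randrange(SALT))] += amount

# Phase 2: drop the salt and combine the partial sums
final = defaultdict(int)
for (key, _salt), partial_sum in partial.items():
    final[key] += partial_sum

print(dict(final))  # identical to a direct group-and-sum over rows
```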
PART 7: CACHE vs PERSIST — DEEP DIVE
Storage Levels Comparison
| Level | Memory | Disk | Serialized | Replicated |
|---|---|---|---|---|
| MEMORY_ONLY | ✓ | ✗ | ✗ | ✗ |
| MEMORY_ONLY_2 | ✓ | ✗ | ✗ | ✓ |
| MEMORY_AND_DISK | ✓ | ✓ | ✗ | ✗ |
| MEMORY_AND_DISK_2 | ✓ | ✓ | ✗ | ✓ |
| MEMORY_ONLY_SER | ✓ | ✗ | ✓ | ✗ |
| MEMORY_AND_DISK_SER | ✓ | ✓ | ✓ | ✗ |
| DISK_ONLY | ✗ | ✓ | ✓ | ✗ |
| OFF_HEAP | ✓ (off-heap) | ✓ | ✓ | ✗ |
python — editable
from pyspark.storagelevel import StorageLevel
df.cache() # MEMORY_AND_DISK
df.persist(StorageLevel.MEMORY_ONLY) # fastest, may recompute
df.persist(StorageLevel.MEMORY_AND_DISK) # safest
df.persist(StorageLevel.DISK_ONLY) # slowest, reliable
df.persist(StorageLevel.MEMORY_ONLY_SER) # smaller memory, slower
df.persist(StorageLevel.OFF_HEAP) # Tungsten managed
df.unpersist() # release cache
# For RDD:
rdd.cache()
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.unpersist()
When to Cache
CACHE when:
✓ DataFrame used in 2+ actions (count + show + write)
✓ Iterative algorithms (MLlib PageRank-like)
✓ Interactive queries in notebooks (avoid recompute)
✓ Joins where one side is reused multiple times
DO NOT CACHE when:
✗ DataFrame used only once
✗ Data is too large for executor memory (causes cache thrashing)
✗ Data freshness required (cache can serve stale data)
✗ Simple linear pipeline with no branching
PART 8: GROUPBY AGGREGATION — REDUCEBYKEY vs AGGREGATEBYKEY vs GROUPBYKEY
| METHOD | Pre-aggregation | Output = Input | Use when |
|---|---|---|---|
| reduceByKey | YES (local) | YES (same type) | sum, max, min, product |
| aggregateByKey | YES (local) | NO (any type) | average, collect set |
| combineByKey | YES (local) | NO (any type) | most flexible, complex |
| groupByKey | NO | NO (iterable of values) | need ALL values as list |
python — editable
# Average per key using aggregateByKey (output type ≠ input type)
rdd = sc.parallelize([("a", 10), ("b", 20), ("a", 30), ("b", 40)])
seqOp = lambda acc, val: (acc[0] + val, acc[1] + 1) # (sum, count)
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])
result = rdd.aggregateByKey((0, 0), seqOp, combOp) \
.mapValues(lambda x: x[0] / x[1])
# → [("a", 20.0), ("b", 30.0)]
# vs DataFrame way (preferred):
from pyspark.sql.functions import avg
df.groupBy("key").agg(avg("value"))  # let Catalyst handle it
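What aggregateByKey is doing under the hood: seqOp folds each value into a per-partition accumulator, combOp merges accumulators after the shuffle. A plain-Python replay of the RDD example above:

```python
from collections import defaultdict

zero = (0, 0)                                        # (sum, count)
seq_op = lambda acc, val: (acc[0] + val, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

partitions = [[("a", 10), ("b", 20)], [("a", 30), ("b", 40)]]

# per-partition fold (what each executor does locally, BEFORE the shuffle)
local = []
for part in partitions:
    acc = defaultdict(lambda: zero)
    for k, v in part:
        acc[k] = seq_op(acc[k], v)
    local.append(acc)

# cross-partition merge (what happens AFTER the shuffle)
merged = defaultdict(lambda: zero)
for acc in local:
    for k, a in acc.items():
        merged[k] = comb_op(merged[k], a)

averages = {k: s / c for k, (s, c) in merged.items()}
print(averages)  # → {'a': 20.0, 'b': 30.0}
```

Because seqOp runs before the shuffle, only one (sum, count) pair per key per partition crosses the network — that's the whole point of pre-aggregation.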
PART 9: SMALL FILES PROBLEM
Detection
Signs:
Job takes long to start (many tasks scheduled for tiny files)
Executor logs: "Task completed in 10 ms" → overhead dominates
Output: ls shows 10,000 small files (< 10 MB each)
Causes
1. Streaming writes (each micro-batch writes many files)
2. Partitioned writes with many partition values (partitionBy with high cardinality)
3. After filter that produces tiny result sets per partition
4. coalesce(1) then repartition creates small outputs
Solutions
python — editable
# Solution 1: Coalesce before writing
df.coalesce(10).write.parquet("output/") # reduce to 10 output files
# Solution 2: Repartition by actual data volume
target_size_mb = 128
avg_row_size_mb = 0.001                       # your own per-row estimate — Spark has no built-in for this
data_size_mb = df.count() * avg_row_size_mb
n_partitions = max(1, int(data_size_mb / target_size_mb))
df.repartition(n_partitions).write.parquet("output/")
# Solution 3: Delta OPTIMIZE (compact small files into larger ones)
# OPTIMIZE delta.`/path/to/delta` ZORDER BY (col)
spark.sql("OPTIMIZE delta.`/path/` ZORDER BY (customer_id)")
# Solution 4: Hive/Spark Config (auto merge small files)
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") # 128 MB per partition
spark.conf.set("spark.sql.files.openCostInBytes", "4194304") # 4 MB open cost estimate
PART 10: SPARK CONFIGURATIONS — COMPLETE REFERENCE
Memory Configuration
python — editable
# --- EXECUTOR MEMORY ---
spark.executor.memory = "4g" # JVM heap for executors
spark.executor.memoryOverhead = "512m" # Off-heap overhead (Python, native)
# Default: max(executor.memory * 0.1, 384 MB)
spark.memory.fraction = 0.6 # fraction of heap for Spark (default: 0.6)
spark.memory.storageFraction = 0.5 # within Spark memory: storage vs execution
# Memory layout (4g executor):
# Total heap = 4g
# User memory = 4g × (1 - 0.6) = 1.6g (UDFs, user data structures)
# Spark memory = 4g × 0.6 = 2.4g
# Storage (cache) = 2.4g × 0.5 = 1.2g
# Execution (joins, shuffles) = 2.4g × 0.5 = 1.2g (both can borrow from each other)
# --- DRIVER MEMORY ---
spark.driver.memory = "4g" # Driver JVM heap
spark.driver.maxResultSize = "2g" # max result from collect() etc.
# --- OFF-HEAP (Tungsten) ---
spark.memory.offHeap.enabled = "true"
spark.memory.offHeap.size = "4g"
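The memory-layout arithmetic above as a small calculator (plain Python, names are mine; this ignores the ~300 MB of reserved memory Spark subtracts from the heap first):

```python
def memory_layout(heap_gb, fraction=0.6, storage_fraction=0.5):
    """Split an executor heap into the regions defined by
    spark.memory.fraction and spark.memory.storageFraction.
    Simplified: omits Spark's ~300 MB reserved system memory."""
    user = heap_gb * (1 - fraction)          # UDFs, user data structures
    spark_mem = heap_gb * fraction           # unified Spark memory
    storage = spark_mem * storage_fraction   # cache region
    execution = spark_mem - storage          # joins, shuffles, sorts
    return {"user": user, "spark": spark_mem,
            "storage": storage, "execution": execution}

print(memory_layout(4))  # the 4g example above: 1.6g user, 1.2g storage, 1.2g execution
```

Remember that storage and execution can borrow from each other at runtime; storageFraction only sets the floor storage memory is protected down to.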
Executor Configuration
python — editable
spark.executor.cores = "4" # cores per executor (2-5 recommended)
spark.executor.instances = "10" # number of executors (static allocation)
# Dynamic allocation config below
# Rule of thumb:
# cores: 4-5 per executor (too many → memory contention, too few → underutilized)
# memory: 4g per core (so 4 cores → 16g executor)
Shuffle Configuration
python — editable
spark.sql.shuffle.partitions = "200"     # default (tune: 1 partition per 100-200 MB)
spark.shuffle.compress = "true" # compress shuffle data (default: true)
spark.shuffle.spill.compress = "true" # compress spilled shuffle data
spark.io.compression.codec = "snappy" # snappy (fast), lz4, zstd (best ratio)
Parallelism Configuration
python — editable
spark.default.parallelism = "200"        # default for RDD operations (= shuffle partitions)
spark.sql.shuffle.partitions = "200" # for DataFrame/SQL operations
spark.sql.files.maxPartitionBytes = "134217728" # 128 MB max per input partition
spark.sql.files.openCostInBytes = "4194304" # 4 MB estimated open cost
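Roughly, Spark derives the actual input split size from these two configs plus total data size and parallelism. A simplified plain-Python paraphrase (treat this as an approximation of FilePartition's logic, not the exact implementation):

```python
def max_split_bytes(total_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,   # maxPartitionBytes default
                    open_cost=4 * 1024 * 1024,               # openCostInBytes default
                    default_parallelism=200):
    """Approximate the input split size Spark computes for file sources.

    Small files are 'padded' by open_cost so many tiny files still get
    packed into reasonably sized partitions.
    """
    bytes_per_core = (total_bytes + num_files * open_cost) // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

# 10 GB in 80 files with 200-way default parallelism → ~53 MB splits
print(max_split_bytes(10 * 1024**3, 80))
```

The takeaway: with small total data, splits shrink below 128 MB to keep all cores busy, but never below the open-cost floor.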
Dynamic Allocation
python — editable
spark.dynamicAllocation.enabled = "true" # auto scale executors
spark.dynamicAllocation.minExecutors = "1"
spark.dynamicAllocation.maxExecutors = "50"
spark.dynamicAllocation.initialExecutors = "5"
spark.dynamicAllocation.executorIdleTimeout = "60s" # remove idle after 60s
spark.dynamicAllocation.schedulerBacklogTimeout = "1s" # add executor after 1s backlog
spark.shuffle.service.enabled = "true" # required for dynamic allocation
Serialization Configuration
python — editable
spark.serializer = "org.apache.spark.serializer.KryoSerializer" # faster than Java
spark.kryo.registrationRequired = "false"
spark.kryo.registrator = "com.myapp.MyKryoRegistrator" # custom for custom classes
# Kryo vs Java serialization:
# Java: safe, works with any class, 3-4x larger output
# Kryo: faster, ~3-10x smaller, needs class registration for best performance
JVM GC Configuration
python — editable
# G1GC is recommended (better for large heaps)
spark.executor.extraJavaOptions = "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4"
spark.driver.extraJavaOptions = "-XX:+UseG1GC"
Production SparkSession Template
python — editable
spark = SparkSession.builder \
.appName("ProductionJob") \
.config("spark.executor.memory", "8g") \
.config("spark.executor.cores", "4") \
.config("spark.executor.memoryOverhead", "1g") \
.config("spark.driver.memory", "4g") \
.config("spark.driver.maxResultSize", "2g") \
.config("spark.sql.shuffle.partitions", "400") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
.config("spark.sql.adaptive.skewJoin.enabled", "true") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.sql.autoBroadcastJoinThreshold", "50m") \
.config("spark.dynamicAllocation.enabled", "true") \
.config("spark.dynamicAllocation.minExecutors", "2") \
.config("spark.dynamicAllocation.maxExecutors", "50") \
.config("spark.shuffle.service.enabled", "true") \
.config("spark.executor.extraJavaOptions", "-XX:+UseG1GC") \
.getOrCreate()
PART 11: SPARK UI — HOW TO READ IT
Tabs and What to Look For
🧠 Memory Map
JOBS TAB
→ Each action = 1 job
→ Click job to see stages
STAGES TAB
→ Each stage has: Tasks completed, Duration, Input/Output/Shuffle R/W
→ "Task Duration" distribution: if max >> median→SKEW
→ "GC Time" > 10% of task time→memory pressure
→ Failed tasks→OOM or node failure
SQL TAB (most important!)
→ Click "Details" on a query
→ See Physical Plan with metrics
→ Look for:
• "FileScan"→"PushedFilters" → is predicate pushed down?
• "BroadcastHashJoin" vs "SortMergeJoin"
• "Exchange" = shuffle (wide transformation)
• "Sort" = sort happening
→ Numbers on each node: rows, bytes in/out
EXECUTORS TAB
→ GC Time column→if high → reduce executor memory fraction or use G1GC
→ Storage Memory Used→is cache working?
→ Failed Tasks→which executor is problematic?
STORAGE TAB
→ Cached RDDs/DataFrames
→ Fraction cached, memory used
→ If cached fraction < 1.0→data doesn't fit in cache
ENVIRONMENT TAB
→ All Spark configs active in this session
→ Verify your config changes took effect
Reading the SQL Physical Plan
FileScan parquet [col1, col2, col3] ← column pruning working
PushedFilters: [EqualTo(year, 2024)] ← predicate pushdown working
PartitionCount: 100 ← how many partitions scanned
HashAggregate(keys=[dept], functions=[sum]) ← final aggregation (after shuffle)
Exchange hashpartitioning(dept, 200) ← SHUFFLE (200 = shuffle.partitions)
HashAggregate(keys=[dept], functions=[partial_sum]) ← partial (local) pre-aggregation
BroadcastHashJoin [id = id] ← BROADCAST JOIN (fast!)
BroadcastExchange HashedRelationBroadcastMode ← small table broadcasted
SortMergeJoin [id = id] ← SORT-MERGE JOIN (shuffle + sort)
Sort [id ASC]
Exchange hashpartitioning(id, 200)
Sort [id ASC]
Exchange hashpartitioning(id, 200)
PART 12: CHECKPOINTING vs CACHING
CACHING
Purpose: Reuse data in same application (avoid recomputation)
Storage: Executor memory/disk (not persistent)
Lineage: Preserved (can recompute from lineage if lost)
Speed: Fast to read (memory), slower (disk)
Scope: Current SparkSession lifetime
df.cache()
df.persist(StorageLevel.MEMORY_AND_DISK)
CHECKPOINTING
Purpose: Cut DAG lineage for very long iterative jobs
Save state that survives application restart
Storage: HDFS / S3 (reliable, persistent)
Lineage: CUT (checkpoint IS the source — no upstream lineage)
Speed: Slower (HDFS write, but saves re-traversing long DAG)
Scope: Survives application restarts
spark.sparkContext.setCheckpointDir("hdfs:///checkpoints/")
df.checkpoint() # triggers action, writes to HDFS, returns new DF
WHEN TO CHECKPOINT (vs cache)
✓ Iterative ML-like jobs (PageRank, k-means): after each iteration
✓ Streaming: checkpoint state for fault tolerance
✓ Very long lineage chains (DAG > 50 stages) to avoid recomputation cost
✓ When data must survive Driver restart
PATTERN: Cache first, then checkpoint
df.cache()
df.checkpoint() # triggers action (reads from cache → writes to HDFS)
df.unpersist() # release cache after checkpoint
PART 13: SCENARIO-BASED QUESTIONS
Scenario 1: "Our Spark job takes 6 hours. How do you debug?"
python — editable
"""
STEP 1: Open Spark UI → Stages tab
- Find the slowest stage
- Check: is one task WAY slower than median? → DATA SKEW
- Check: is input size per task >1 GB? → TOO FEW PARTITIONS
STEP 2: SQL tab → physical plan
- SortMergeJoin where one side is small? → switch to BroadcastHashJoin
df.join(broadcast(small_df), "key")
- FileScan without PushedFilters? → predicate pushdown not working
Fix: filter BEFORE any UDF/withColumn computed columns
STEP 3: Check configurations
- shuffle.partitions: default 200 for 2 TB data? → set to 2000
- AQE enabled? spark.sql.adaptive.enabled = true
- Using groupByKey instead of reduceByKey? → high shuffle data
STEP 4: Check data distribution
df.groupBy("join_key").count().orderBy(col("count").desc()).show(10)
# If top key has 50% of rows → add salting
STEP 5: Check cache utilization
- Is an expensive DataFrame recomputed multiple times?
- Storage tab: is cache evicted (fraction < 1.0)?
"""
Scenario 2: "Executor OOM errors"
python — editable
"""
CAUSE ANALYSIS:
1. Partition too large (data per task > executor memory)
2. Caching more data than memory can hold
3. groupByKey collecting too many values per key
4. Large broadcast (broadcasting too-large table)
FIXES:
1. Increase partitions:
df.repartition(2 * int(spark.conf.get("spark.sql.shuffle.partitions")))
2. Increase executor memory:
spark.executor.memory = "16g"
spark.executor.memoryOverhead = "2g"
3. Replace groupByKey with reduceByKey/aggregateByKey:
# Bad:
rdd.groupByKey().mapValues(sum)
# Good:
rdd.reduceByKey(lambda a, b: a + b)
4. Use MEMORY_AND_DISK instead of MEMORY_ONLY:
df.persist(StorageLevel.MEMORY_AND_DISK)
5. Check broadcast threshold (maybe accidentally broadcasting large table):
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
"""
Scenario 3: "Job runs fine daily, but fails every Monday"
python — editable
"""
ANALYSIS: Monday has 7x data (weekend accumulation)
INVESTIGATION:
1. Check input file size on Monday vs other days
2. Check shuffle data in Spark UI (much larger on Monday?)
FIX — Make job adaptive to data volume:
1. Enable AQE (auto-coalesces/splits partitions at runtime):
spark.sql.adaptive.enabled = true
2. Set shuffle.partitions based on expected peak:
spark.sql.shuffle.partitions = 2000 # for Monday's 7x data
3. Use dynamic allocation:
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.maxExecutors = 100
4. Add checkpointing in middle of job:
after_heavy_join.checkpoint() # if job fails, resume from here
"""
Scenario 4: "Write to Parquet creates 10,000 small files"
python — editable
"""
PROBLEM: partitionBy("date", "region") with 365 dates × 50 regions = 18,250 partitions
Each partition writes its own file → tiny files
SOLUTIONS:
1. Coalesce WITHIN each partition before write (using repartition):
df.repartition(200, "date", "region") # 200 files total, co-located by key
.write.partitionBy("date", "region").parquet("output/")
2. Use Delta OPTIMIZE to compact after write:
spark.sql("OPTIMIZE delta.`/path/` ZORDER BY (date, region)")
3. Adjust spark.sql.files.maxPartitionBytes:
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456") # 256 MB
4. For streaming: trigger OPTIMIZE periodically (not per micro-batch)
"""
PART 14: DYNAMIC ALLOCATION
python — editable
# Dynamic Allocation = Spark requests/releases Executors based on workload
# ENABLE:
spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true # REQUIRED: keeps shuffle files after executor released
# SCALING UP:
spark.dynamicAllocation.schedulerBacklogTimeout = "1s" # add exec if backlog > 1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout = "5s" # keep adding every 5s
# SCALING DOWN:
spark.dynamicAllocation.executorIdleTimeout = "60s" # release executor idle >60s
# LIMITS:
spark.dynamicAllocation.minExecutors = "2" # always keep at least 2
spark.dynamicAllocation.maxExecutors = "100" # never exceed 100
# WHY shuffle service is required:
# When executor is released, shuffle files it wrote are LOST
# External shuffle service (running on YARN NM) keeps those files available
# Without it: releasing executors causes downstream stages to fail
# WHEN TO USE dynamic allocation:
# ✓ Shared clusters (don't hog resources when idle)
# ✓ Jobs with varying workload stages (small start, large shuffle middle)
# ✓ Interactive workloads (notebooks)
# WHEN TO PREFER STATIC:
# ✓ Consistent workload (same data size every run)
# ✓ SLA-critical jobs (dynamic scaling adds latency)
PART 15: FULL OPTIMIZATION CHECKLIST
BEFORE WRITING SPARK CODE
□ Is data partitioned correctly for this query? (partition pruning)
□ Which join strategy should I use? (size of tables)
□ Should I broadcast the small table?
□ What's my target partition size? (100-200 MB per partition)
WHILE WRITING
□ Filters applied early (before joins/aggregations)
□ No groupByKey→use reduceByKey/aggregateByKey
□ UDFs avoided where built-ins suffice
□ DataFrame cached before reuse (2+ actions)
□ Explicit schema defined (no inferSchema=True)
□ JDBC reads have numPartitions configured
CONFIG CHECKLIST
□ spark.sql.adaptive.enabled = true
□ spark.sql.shuffle.partitions tuned (not default 200 for large data)
□ spark.executor.memory adequate (no OOM)
□ spark.sql.autoBroadcastJoinThreshold tuned (> 10 MB for modern clusters)
□ Kryo serializer enabled (for RDD-heavy jobs)
□ Dynamic allocation configured (for shared clusters)
AFTER JOB RUNS
□ Spark UI: any skewed stages? (max task >> median task)
□ Spark UI: SortMergeJoin that could be BroadcastHashJoin?
□ GC Time on Executors tab > 10%?→reduce memory fraction or tune GC
□ Output file count reasonable? (not 10,000 tiny files)
□ AQE actually triggered? (check Physical Plan for AQE annotations)
QUICK COMPARISON TABLES
Repartition vs Coalesce
| Property | repartition(N) | coalesce(N) |
|---|---|---|
| Transformation | WIDE (full shuffle) | NARROW (no shuffle) |
| Direction | Increase OR decrease | DECREASE only |
| Balance | Evenly balanced | May be uneven (merges adjacent partitions) |
| Speed | Slower (shuffle) | Faster (no shuffle) |
| When to use | Before joins, on skewed data | After filter, before write to reduce file count |
| Column-based | YES (repartition by column) | NO |
Join Strategy Comparison
| Strategy | When | Shuffle | Sort | Speed |
|---|---|---|---|---|
| BroadcastHashJoin | Small + Any size | NO | NO | FASTEST |
| ShuffleHashJoin | Medium + Medium | YES | NO | Fast |
| SortMergeJoin | Large + Large (equi-join) | YES | YES | Slow |
| BroadcastNLJ | Non-equi join | NO (broadcast) | NO | Slowest |
| CartesianJoin | No condition (CROSS) | YES | NO | Slowest |
AQE Features
| Feature | Problem Solved | Config |
|---|---|---|
| Coalesce Shuffle Partitions | Too many small partitions | coalescePartitions.enabled |
| Dynamic Join Strategy | SMJ → BHJ at runtime | (enabled with AQE) |
| Skew Join Optimization | Hot key tasks | skewJoin.enabled |