PySpark
Optimization + Performance — Quick Recall
PySpark · Section 6 of 9

🧠 MASTER MNEMONICS

🧠 CATALYST 4 PHASES → "ALPC"
A = Analysis (resolve column names and types against the Catalog: Unresolved Logical Plan → resolved Logical Plan)
L = Logical Optimization (rule-based: predicate pushdown, column pruning, constant folding)
P = Physical Planning (generate candidate physical plans and join strategies; the cost model picks the cheapest)
C = Code Generation (Tungsten whole-stage CodeGen → JVM bytecode)
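
To make the "L" phase concrete, here is a toy sketch of constant folding on a tiny expression tree. It is not Catalyst's implementation: `Lit`, `Col`, `Add`, and `fold_constants` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(expr: Expr) -> Expr:
    """Bottom-up rewrite: replace Add(Lit, Lit) with a single Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr


# price + (1 + 2) is rewritten to price + 3 once, at plan time,
# instead of adding 1 + 2 again for every row.
plan = Add(Col("price"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
```

The rule only touches sub-trees made entirely of literals, which is why it is always safe to apply: the result does not depend on the data.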

TUNGSTEN 3 FEATURES → "OCW"
O = Off-heap memory (bypasses GC, compact binary format)
C = Cache-aware operations (CPU L1/L2 cache efficiency)
W = Whole-stage CodeGen (collapses operators into one compiled JVM function)
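
The "W" idea can be pictured with a small pure-Python analogy. Spark generates Java source and compiles it to JVM bytecode; the sketch below only mirrors the shape of the transformation, and both function names are hypothetical.

```python
def operator_at_a_time(rows):
    """Each operator is a separate generator, so every row crosses several call boundaries."""
    filtered = (r for r in rows if r % 2 == 0)   # Filter operator
    projected = (r * 10 for r in filtered)       # Project operator
    return sum(projected)                        # Aggregate operator


def whole_stage_style(rows):
    """The same three operators collapsed into a single tight loop."""
    total = 0
    for r in rows:
        if r % 2 == 0:
            total += r * 10
    return total


result_chained = operator_at_a_time(range(10))
result_fused = whole_stage_style(range(10))
```

Both functions return the same value; the fused version simply avoids the per-row hand-off between operators, which is the overhead whole-stage CodeGen targets.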
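
JOIN STRATEGIES → "BSS-BC" (roughly fastest → slowest)
B = Broadcast Hash Join (no shuffle; small side is below spark.sql.autoBroadcastJoinThreshold or carries a broadcast hint)
S = Shuffle Hash Join (shuffle; medium tables; builds an in-memory hash map per partition)
S = Sort-Merge Join (shuffle + sort; large-to-large equi-joins; the default)
B = Broadcast Nested Loop Join (non-equi joins, O(n×m))
C = Cartesian Product Join (CROSS JOIN with no join condition)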
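
The difference between the first and third strategies is easier to remember after seeing both algorithms on plain Python lists. This is a single-machine sketch of the core logic only (no shuffle, no partitions); the function names and sample data are made up for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large, small):
    """Build a hash map from the small side, then stream the large side past it."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, lv, sv) for key, lv in large for sv in lookup.get(key, [])]


def sort_merge_join(left, right):
    """Sort both sides by key, then advance two cursors in step."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key ...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # ... and pair it with every left row sharing the same key.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out


orders = [(2, "o1"), (1, "o2"), (2, "o3"), (3, "o4")]
countries = [(1, "IN"), (2, "US")]
hash_result = broadcast_hash_join(orders, countries)
merge_result = sort_merge_join(orders, countries)
```

Both produce the same inner-join rows. The hash version needs the small side to fit in memory; the merge version needs neither side in memory at once but pays for the sort.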

AQE 3 FEATURES → "CDS"
C = Coalesce shuffle partitions (merge tiny post-shuffle partitions)
D = Dynamic join switching (SMJ → BHJ at runtime if a table turns out small)
S = Skew join optimization (split hot partitions automatically)
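
As a configuration sketch, the flags below switch on AQE together with its coalescing and skew-join features. The helper `enable_aqe` is a hypothetical convenience wrapper, and `spark` is assumed to be an existing SparkSession.

```python
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",                     # master switch
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # C: coalesce partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # S: skew join handling
    # D: dynamic join switching has no separate toggle in this sketch;
    # it is assumed to follow from the master switch.
}


def enable_aqe(spark) -> None:
    """Apply the AQE flags to a live session (`spark` is assumed to be a SparkSession)."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
```

With a running session this would be called once as `enable_aqe(spark)` before the queries of interest.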

DATA SKEW SOLUTIONS → "LABS"
L = Let AQE handle it (Spark 3.0+; enabled by default since 3.2)
A = Add a broadcast hint (if one side is small → broadcast(df))
B = Build salted keys (append a random salt to spread hot keys across partitions)
S = Split aggregation into 2 phases (partial aggregation → final aggregation)
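
The "B" and "S" ideas combine naturally, and the mechanics can be shown without a cluster. The sketch below counts events per key in two phases using a random salt; `NUM_SALTS`, the function name, and the sample data are assumptions chosen for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical salt range; a real job would tune this to the skew


def salted_two_phase_count(keys, num_salts=NUM_SALTS, seed=0):
    rng = random.Random(seed)
    # Phase 1: partial aggregation on (key, salt), so one hot key
    # is spread over up to `num_salts` groups instead of a single one.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Phase 2: drop the salt and combine the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


events = ["hot"] * 1000 + ["a", "b", "c"]
partial, final = salted_two_phase_count(events)
```

In PySpark the same shape is usually a `groupBy` on the key plus a salt column, followed by a second `groupBy` on the key alone. The final totals match an unsalted count; only the intermediate groups are smaller.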

SECTION 1: CATALYST FLASH CARDS

Q: What are the 4 phases of Catalyst? (ALPC)

🧠 Memory Map
1. Analysis: parse + resolve column names/types against the Catalog
2. Logical Opt: rule-based (predicate pushdown, column pruning, constant folding)
3. Physical: generate N physical plans, CBO picks the cheapest
4. CodeGen: Tungsten whole-stage codegen → JVM bytecode
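
One way to see these phases on a real query is `DataFrame.explain`. The two wrappers below are hypothetical helpers; `df` is assumed to be a PySpark DataFrame on Spark 3.0+, where the `mode` argument is available.

```python
def print_catalyst_plans(df) -> None:
    """Print the parsed, analyzed, and optimized logical plans plus the physical plan."""
    df.explain(mode="extended")


def print_generated_code(df) -> None:
    """Print the code produced by whole-stage codegen, where it applies."""
    df.explain(mode="codegen")
```

With a live session, a call such as `print_catalyst_plans(spark.range(10).filter("id > 5"))` prints sections headed `== Parsed Logical Plan ==`, `== Analyzed Logical Plan ==`, `== Optimized Logical Plan ==`, and `== Physical Plan ==`, which line up with phases 1 to 3.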

**Q: