Optimization + Performance — Quick Recall
🧠 MASTER MNEMONICS
🧠 CATALYST 4 PHASES → "ALPC"
A: Analysis (parser emits an Unresolved Logical Plan; the analyzer resolves columns and types against the Catalog)
L: Logical Opt (predicate pushdown, column pruning, constant folding)
P: Physical Plan (generate candidate plans, pick join strategy; cost-based optimizer (CBO) picks the cheapest)
C: Code Generation (Tungsten whole-stage CodeGen → JVM bytecode)
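To make the "L" phase concrete, here is a minimal sketch of one rule-based rewrite, constant folding, on a toy expression tree. The `Lit`, `Col`, `Add` classes and `fold_constants` are hypothetical names invented for illustration; this is plain Python, not Catalyst's actual API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(expr: Expr) -> Expr:
    """Bottom-up rewrite: replace Add(Lit, Lit) with a single Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr  # literals and column references are left untouched


# price + (1 + 2) is rewritten to price + 3 before any row is read.
plan = Add(Col("price"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
print(optimized)
```

The point to remember: logical optimization rewrites the plan tree itself, so the saving applies once per query rather than once per row.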
TUNGSTEN 3 FEATURES → "OCW"
O: Off-heap memory (bypass GC, binary format)
C: Cache-aware ops (CPU L1/L2 cache efficiency)
W: Whole-stage CodeGen (collapse operators → compiled JVM function)
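The "W" idea, collapsing a chain of operators into one compiled function, can be sketched in plain Python. This is only an analogy: Spark generates Java source and compiles it to JVM bytecode, whereas the hypothetical `generate_fused_source` below emits Python source. The filter, project, and sum operators are illustrative assumptions.

```python
def operator_at_a_time(rows):
    """Each operator is its own generator; rows pass through one call per operator."""
    filtered = (r for r in rows if r > 10)   # Filter operator
    projected = (r * 2 for r in filtered)    # Project operator
    return sum(projected)                    # Aggregate operator


def generate_fused_source(predicate: str, projection: str) -> str:
    """Emit source for a single function that fuses filter, project and sum."""
    return (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for r in rows:\n"
        f"        if {predicate}:\n"
        f"            total += {projection}\n"
        "    return total\n"
    )


# "Compile" the generated source once, then reuse the fused function.
source = generate_fused_source("r > 10", "r * 2")
namespace = {}
exec(compile(source, "<generated>", "exec"), namespace)
fused = namespace["fused"]

rows = list(range(20))
fused_total = fused(rows)
print(source)
print(fused_total == operator_at_a_time(rows))
```

Both versions compute the same result; the fused one does it in a single tight loop with no per-operator hand-off, which is the effect whole-stage codegen aims for.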
JOIN STRATEGIES → "BSS-BC" (fastest → slowest)
B: Broadcast Hash Join (NO shuffle; small side < spark.sql.autoBroadcastJoinThreshold, 10 MB by default, or broadcast hint)
S: Shuffle Hash Join (shuffle both sides, in-memory hash map on the smaller side; medium tables)
S: Sort-Merge Join (shuffle + sort both sides; large-large; DEFAULT for equi-joins)
B: Broadcast Nested Loop (non-equi joins, O(n×m))
C: Cartesian Join (CROSS JOIN, no condition)
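The ordering above can be turned into a small decision function. This is a deliberately simplified sketch that mirrors the card, not Spark's real planner logic (which also weighs hints, join type, and key sortability). The function name `pick_join_strategy` and its parameters are hypothetical; the 10 MB value is Spark's default for `spark.sql.autoBroadcastJoinThreshold`.

```python
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # default autoBroadcastJoinThreshold


def pick_join_strategy(
    left_bytes: int,
    right_bytes: int,
    has_equi_keys: bool,
    has_condition: bool = True,
    prefer_sort_merge: bool = True,
    threshold: int = BROADCAST_THRESHOLD_BYTES,
) -> str:
    """Simplified mnemonic-level chooser; assumes no join hints are present."""
    smaller_side = min(left_bytes, right_bytes)
    if has_equi_keys:
        if smaller_side <= threshold:
            return "BroadcastHashJoin"      # no shuffle needed
        if not prefer_sort_merge:
            return "ShuffledHashJoin"       # shuffle, then hash the smaller side
        return "SortMergeJoin"              # shuffle + sort, the default
    if not has_condition:
        return "CartesianProduct"           # CROSS JOIN
    return "BroadcastNestedLoopJoin"        # non-equi condition, O(n x m)


MB = 1024 * 1024
print(pick_join_strategy(5 * MB, 50_000 * MB, has_equi_keys=True))
print(pick_join_strategy(20_000 * MB, 50_000 * MB, has_equi_keys=True))
print(pick_join_strategy(20_000 * MB, 50_000 * MB, has_equi_keys=False))
```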
AQE 3 FEATURES → "CDS"
C: Coalesce shuffle partitions (merge tiny post-shuffle partitions)
D: Dynamic join switching (SMJ → BHJ at runtime if one side's actual size falls below the broadcast threshold)
S: Skew join optimization (split hot partitions automatically)
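The "C" feature can be pictured as greedy merging of adjacent post-shuffle partitions up to a target size. The sketch below is a toy model under that assumption, not AQE's actual implementation; `coalesce_partitions` and the sample sizes are hypothetical, and the 64 MB target echoes the default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
def coalesce_partitions(sizes, target_size):
    """Greedily group adjacent partitions so each group stays near target_size.

    Returns a list of groups, each a list of original partition indices.
    A partition already larger than target_size ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Eight post-shuffle partitions (sizes in MB), most of them tiny.
sizes_mb = [5, 3, 2, 60, 1, 1, 4, 70]
groups = coalesce_partitions(sizes_mb, target_size=64)
print(groups)
```

Only adjacent partitions are merged, so the result still covers contiguous ranges of the original shuffle output, and fewer tasks are scheduled for the next stage.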
DATA SKEW SOLUTIONS → "LABS"
L: Let AQE handle it (Spark 3.0+; enabled by default since 3.2)
A: Add broadcast hint (if one side is small → broadcast(df))
B: Build salted keys (append a random salt to spread hot keys across partitions)
S: Split aggregation into 2 phases (partial agg → final agg)
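The "B" and "S" items combine naturally: salt the key, aggregate per (key, salt), then drop the salt and aggregate again. Below is a minimal pure-Python sketch of that pattern, with dictionaries standing in for shuffle partitions. `salted_two_phase_sum`, `NUM_SALTS`, and the sample records are hypothetical names and data chosen for illustration.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # assumed number of salt buckets per key


def salted_two_phase_sum(records, num_salts=NUM_SALTS, seed=0):
    """Sum values per key via a salted partial aggregation and a final merge."""
    rng = random.Random(seed)

    # Phase 1: partial aggregation on (key, salt); a hot key is split
    # across up to num_salts groups instead of landing in a single one.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial[(key, salt)] += value

    # Phase 2: drop the salt and combine the (few) partial results per key.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final), dict(partial)


# One hot key with 1000 rows and one cold key with 10 rows.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partial = salted_two_phase_sum(records)
largest_hot_group = max(v for (k, _s), v in partial.items() if k == "hot")
print(totals)
print(largest_hot_group)
```

The totals match a direct group-by, but no single group in phase 1 has to process all rows of the hot key. This only works for aggregations that can be merged from partial results, such as sum, count, min, and max.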
SECTION 1: CATALYST FLASH CARDS
Q: What are the 4 phases of Catalyst? (ALPC)
🧠 Memory Map
1. Analysis → Resolve column names/types of the parsed (unresolved) plan against the Catalog
2. Logical Opt → Rule-based: predicate pushdown, column pruning, constant folding
3. Physical → Generate N candidate physical plans; cost-based optimizer (CBO) picks the cheapest
4. CodeGen → Tungsten whole-stage codegen → JVM bytecode
**Q: