PySpark Day 3 — Quick Recall: Optimization + Performance
🧠 MASTER MNEMONICS
SECTION 1: CATALYST FLASH CARDS
Predicate pushdown: moving filter conditions as close to the data source as possible (before joins/aggregations). For Parquet: skips row groups. For JDBC: adds a WHERE clause to the SQL sent to the database.
Column pruning: Catalyst reads only the columns referenced in the query. Parquet (columnar) makes this very efficient — unneeded columns are never deserialized from disk.
df.explain("formatted") # shows all 4 plan phases
df.explain("cost") # shows with CBO cost estimates
# Look for: PushedFilters in FileScan = pushdown working
# Look for: BroadcastHashJoin = correct join chosen
SECTION 2: TUNGSTEN FLASH CARDS
Tungsten: Spark's low-level execution engine for memory + CPU efficiency.
Whole-stage code generation: eliminates virtual function call overhead per row (e.g., calling filter() then map() on every row). Instead, Spark generates a single compiled Java method that processes all operators in one loop → ~10x CPU efficiency.
spark.conf.set("spark.memory.offHeap.enabled", "true")
spark.conf.set("spark.memory.offHeap.size", "4g")
SECTION 3: REPARTITION vs COALESCE FLASH CARDS
| Property | repartition(N) | coalesce(N) |
|---|---|---|
| Type | WIDE (full shuffle) | NARROW (no shuffle) |
| Direction | Up or down | DOWN ONLY |
| Balance | Perfectly even | May be uneven |
| Speed | Slow (shuffle cost) | Fast (local merge) |
| By column? | YES (.repartition(N,col)) | NO |
⚠️ TRAP: coalesce(1) = single partition = single writer = bottleneck for large data. Use repartition(small_N) instead.
SECTION 4: JOIN STRATEGIES FLASH CARDS
SECTION 5: AQE FLASH CARDS (MOST IMPORTANT!)
Adaptive Query Execution — Spark 3.0+, default ON in Spark 3.2+. Re-optimizes query plans at runtime using actual statistics (not estimates).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
Does AQE make manual tuning unnecessary? Not completely. AQE helps with partition coalescing, join switching, and skew — but you still need to: tune executor memory, choose repartition/coalesce, add broadcast hints where statistics are missing, and handle the small files problem after writes.
SECTION 6: DATA SKEW FLASH CARDS
Salting works for JOIN skew (one key appears too much on one side). It appends a random number (0-9) to the hot key in the large table, then replicates the matching small table 10 times with each salt value. The hot key is distributed across 10 tasks instead of 1.
SECTION 7: CACHE + PERSIST FLASH CARDS
from pyspark import StorageLevel

df.cache()                                # = MEMORY_AND_DISK for DataFrames (Spark 3.x)
df.persist()                              # same as cache()
df.persist(StorageLevel.MEMORY_ONLY)      # fast, recomputes on eviction
df.persist(StorageLevel.MEMORY_AND_DISK)  # spills to disk, safer
df.persist(StorageLevel.DISK_ONLY)        # slowest, always disk
df.persist(StorageLevel.MEMORY_ONLY_SER)  # smaller memory, slower access (Scala/Java API; PySpark data is always serialized)
df.persist(StorageLevel.OFF_HEAP)         # Tungsten off-heap
df.unpersist()                            # release cache
| Property | Cache | Checkpoint |
|---|---|---|
| Storage | Executor mem/disk | HDFS / S3 (persistent) |
| Lineage | KEPT | CUT |
| Scope | App lifetime only | Survives restarts |
| Use when | Reuse in same job | Long iterative (cut DAG) |
SECTION 8: SPARK CONFIGS FLASH CARDS
spark.executor.memory # JVM heap for tasks + storage
spark.executor.memoryOverhead # off-heap (Python UDFs, native)
spark.driver.memory # Driver JVM heap (collect() lives here)
spark.driver.maxResultSize # max data sent to Driver (default 1g)
spark.memory.fraction # % of heap for Spark (default 0.6 = 60%)
spark.memory.storageFraction # % of Spark memory for cache (0.5 = 50%)
spark.sql.shuffle.partitions # number of post-shuffle partitions (default 200)
spark.shuffle.compress # compress shuffle data (default true)
spark.io.compression.codec # codec: snappy (fast) vs lz4 vs zstd
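Worked example of how spark.memory.fraction and spark.memory.storageFraction split an 8 GB executor heap (300 MB reserved memory is the Spark default; plain arithmetic, numbers approximate):

```python
heap_gb = 8.0
reserved_gb = 0.3                      # ~300 MB reserved by Spark itself
usable = heap_gb - reserved_gb         # 7.7 GB available to divide

spark_memory = usable * 0.6            # spark.memory.fraction = 0.6 → unified Spark memory
storage = spark_memory * 0.5           # spark.memory.storageFraction = 0.5 → cache share
execution = spark_memory - storage     # remainder for shuffles / joins / sorts

print(round(spark_memory, 2), round(storage, 2), round(execution, 2))
# 4.62 2.31 2.31
```

These are soft boundaries: within unified memory, execution can evict cached blocks down to the storage fraction, and storage can borrow free execution memory.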
When an Executor is released, it takes its shuffle files with it. The External Shuffle Service (running on YARN NodeManager) keeps shuffle files available after the Executor is gone. Without it, releasing executors breaks downstream stages.
SECTION 9: SPARK UI READING FLASH CARDS
How to spot skew: Stages tab → click a Stage → "Task Duration" summary table. If max >> median (e.g., max=500s, median=5s) → skew.
How to check the join strategy: SQL tab → click a query → expand the physical plan. Look for a BroadcastHashJoin node. If you see SortMergeJoin instead → opportunity to add a broadcast hint.
High GC time means: memory pressure. Too many Java objects in heap → GC runs too long. Fix: reduce executor memory fraction, use G1GC, use Kryo, use a serialized storage level for cache, or add more memory.
What an Exchange node means: a shuffle operation. Each Exchange = data moves across the network. Minimize Exchanges = minimize shuffles = faster job.
How to verify pushdown: SQL tab → physical plan → the FileScan parquet node should show PushedFilters: [EqualTo(...)]. If PushedFilters is empty, pushdown is not working.
SECTION 10: SMALL FILES FLASH CARDS
# 1. Coalesce before write
df.coalesce(10).write.parquet("output/")
# 2. Repartition by target file size (128 MB per file)
n_parts = max(1, estimated_gb * 1024 // 128)
df.repartition(n_parts).write.parquet("output/")
# 3. Delta OPTIMIZE (compact existing small files)
spark.sql("OPTIMIZE delta.`/path/` ZORDER BY (date)")
PYSPARK ULTRA CHEAT SHEET — ALL 3 DAYS
╔═════════════════════════════════════════════════════════════════════════╗
║ PYSPARK — ALL 3 DAYS CHEAT SHEET ║
╠═════════════════════════════════════════════════════════════════════════╣
║ ║
║ DAY 1: ARCHITECTURE + RDD ║
║ ───────────────────────────────────────────────────────────────────── ║
║ Architecture (DCE): Driver + Cluster Manager + Executors ║
║ DAG → Stages (at shuffle) → Tasks (1 per partition) ║
║ Narrow: map,flatMap,filter,union,coalesce → NO shuffle ║
║ Wide: groupBy,join,sort,distinct,repartition → shuffle+new stage ║
║ ║
║ RDD Properties (RD-ILP): Resilient|Distributed|Immutable|Lazy|Part ║
║ map() → 1-to-1 | flatMap() → 1-to-N (flattens) ║
║ mapPartitions() → 1 call per partition (DB conn, model load) ║
║ reduceByKey → LOCAL pre-agg + shuffle (PREFER) ║
║ groupByKey → ALL values shuffle (AVOID for aggregation) ║
║ aggregateByKey → output type ≠ input type (avg = sum/count) ║
║ ║
║ Persistence (MDSRO): ║
║ MEMORY_ONLY > MEMORY_AND_DISK > DISK_ONLY > MEMORY_ONLY_SER ║
║ cache() = persist(MEMORY_AND_DISK) ║
║ Broadcast = send once per Executor (not per task) ║
║ Accumulator = add-only in tasks, read-only on Driver ║
║ Checkpoint = HDFS, cuts lineage, survives restart ║
║ ║
║ DAY 2: DATAFRAME + SPARKSQL ║
║ ───────────────────────────────────────────────────────────────────── ║
║ Read (CPJ-OAD): CSV|Parquet|JSON|ORC|Avro|Delta ║
║ Always explicit StructType in prod (no inferSchema) ║
║ Multiple files: glob("path/*/"), list, directory, unionByName ║
║ input_file_name() → track source file ║
║ JDBC: numPartitions + partitionColumn (else 1 thread!) ║
║ ║
║ Window (POF): partitionBy().orderBy().rowsBetween() ║
║ ROW_NUMBER: unique | RANK: gaps | DENSE_RANK: no gaps ║
║ Top-N: DENSE_RANK | Dedup: ROW_NUMBER = 1 ║
║ LAG/LEAD: first row=NULL, last row=NULL ║
║ ║
║ UDF Speed: Built-in >> Pandas UDF (Arrow) >> Python UDF ║
║ Pandas UDF: batch-based, vectorized, Arrow zero-copy, 3-100x faster ║
║ UDFs are opaque to Catalyst → no optimization! ║
║ ║
║ Joins: inner|left|right|full|cross|left_anti|left_semi ║
║ left_anti = NOT IN | left_semi = EXISTS (no right cols) ║
║ Write modes (OACEI): overwrite|append|ignore|error ║
║ partitionBy → directories | bucketBy → hash for join opt ║
║ explode → drops null rows | explode_outer → keeps null rows ║
║ ║
║ DAY 3: OPTIMIZATION ║
║ ───────────────────────────────────────────────────────────────────── ║
║ Catalyst (ALPC): Analysis→Logical Opt→Physical→CodeGen ║
║ Predicate pushdown: filter early, NO UDFs in filter, no computed cols ║
║ Column pruning: Parquet reads only referenced columns ║
║ Tungsten (OCW): Off-heap|Cache-aware|Whole-stage CodeGen ║
║ ║
║ repartition(N): WIDE, any direction, even, by column ║
║ coalesce(N): NARROW, down only, faster, may be uneven ║
║ shuffle.partitions: tune to 1 per 100-200 MB shuffled data ║
║ ║
║ Joins (BSS-BC fastest→slowest): ║
║ BroadcastHash(<threshold/hint) → ShuffleHash → SortMerge ║
║ → BroadcastNestedLoop(non-equi) → Cartesian(CROSS) ║
║ Broadcast hint: large_df.join(broadcast(small_df), "key") ║
║ SMJ default for large-large equi-joins ║
║ ║
║ AQE (CDS) — enable: spark.sql.adaptive.enabled=true ║
║ C = Coalesce: merge tiny post-shuffle partitions automatically ║
║ D = Dynamic: SMJ→BHJ at runtime if table shrinks after filter ║
║ S = Skew: detect+split hot partitions, duplicate other side ║
║ ║
║ Skew solutions (LABS): ║
║ L=AQE auto | A=broadcast hint | B=salt keys | S=2-phase agg ║
║ ║
║ Key Configs: ║
║ executor.memory + memoryOverhead | driver.memory | shuffle.partitions ║
║ autoBroadcastJoinThreshold=50m | adaptive.enabled=true ║
║ dynamicAllocation.enabled=true (needs shuffle.service.enabled=true) ║
║ serializer=KryoSerializer (for RDD-heavy shuffle jobs) ║
║ ║
╠═════════════════════════════════════════════════════════════════════════╣
║ ║
║ TOP 10 INTERVIEW TRAPS: ║
║ 1. groupByKey → use reduceByKey (pre-aggregation = less shuffle) ║
║ 2. RANK ≠ DENSE_RANK: top-N with ties → DENSE_RANK ║
║ 3. collect() on large data → Driver OOM ║
║ 4. coalesce cannot increase partitions (use repartition) ║
║ 5. inferSchema=True → full extra scan (always explicit in prod) ║
║ 6. UDF in filter = no predicate pushdown ║
║ 7. JDBC without numPartitions = 1 thread (serial read) ║
║ 8. explode drops null rows (use explode_outer) ║
║ 9. Accumulator in map() may double-count on retries ║
║ 10. Dynamic allocation needs External Shuffle Service (or files lost) ║
║ ║
╠═════════════════════════════════════════════════════════════════════════╣
║ ║
║ DEBUGGING SLOW JOBS — WHERE TO LOOK: ║
║ Stages tab → max task >> median → SKEW ║
║ SQL tab → SortMergeJoin on small table → add broadcast hint ║
║ Executors tab → GC > 10% → memory pressure, tune G1GC ║
║ SQL plan → Exchange = shuffle = network cost = minimize ║
║ SQL plan → no PushedFilters → predicate pushdown broken ║
║ ║
╚═════════════════════════════════════════════════════════════════════════╝
QUICK DECISION TABLE — OPTIMIZATION
PRODUCTION CONFIG TEMPLATE
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ProductionETL")
    # Memory
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.driver.memory", "4g")
    .config("spark.driver.maxResultSize", "2g")
    # AQE (most important!)
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Shuffle
    .config("spark.sql.shuffle.partitions", "400")  # tune for your data
    .config("spark.io.compression.codec", "snappy")
    # Joins
    .config("spark.sql.autoBroadcastJoinThreshold", "50m")  # 50 MB (default 10 MB)
    # Serialization
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Dynamic Allocation
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .config("spark.shuffle.service.enabled", "true")
    # GC
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")
    .getOrCreate()
)