PySpark Day 2 — Quick Recall: DataFrame + SparkSQL
PySpark · Section 5 of 8

Legend: Must know · ⚠️ Trap · 🧠 Memory map · 📝 One-liner

🧠 MASTER MNEMONICS

DATAFRAME vs RDD vs DATASET — "TYPE-OPT"
DataFrame → No type safety, Catalyst optimized, use in PySpark always
Dataset   → Type safe (Scala/Java only), NOT available in Python
RDD       → Low-level, no optimizer, use only when DF can't express logic
READ FORMATS — "CPJ-OAD"
C → CSV (inferSchema → false in prod, always specify schema)
P → Parquet (columnar, default Spark format, predicate pushdown)
J → JSON (multiline option for wrapped arrays)
O → ORC (Hive-native, great for Hive tables)
A → Avro (row-based, streaming, schema evolution)
D → Delta (ACID, time travel, upserts — production standard)
WINDOW FUNCTION SKELETON — "POF"
P → partitionBy (what resets the window)
O → orderBy (ordering within each partition)
F → Frame (ROWS BETWEEN ... — which rows to include)
UDF PERFORMANCE ORDER (fastest → slowest)
Built-in Spark functions >> Pandas UDF (Arrow) >> Python UDF >> RDD operations
WRITE MODES — "O-A-I-E"
O → overwrite (replace entire output)
A → append (add to existing)
I → ignore (skip if exists)
E → error (fail if exists — default)

SECTION 1: RDD vs DataFrame vs Dataset FLASH CARDS

Q: When do you use RDD over DataFrame?

Use RDD when:
1. Complex functional transformations not expressible in DataFrame
2. Unstructured data (text, binary) where schema doesn't apply
3. Need fine-grained control over partitioning
4. Custom partitioners
Use DataFrame otherwise (Catalyst optimization = much faster)
Q: Why is Dataset NOT available in PySpark?

Dataset requires compile-time type checking (JVM generics). Python is dynamically typed — no compile-time types.

Q: What gives DataFrame its performance advantage over RDD?

Catalyst Optimizer (logical/physical plan optimization) + Tungsten Engine (memory management, code generation)

SECTION 2: READING DATA FLASH CARDS

Q: Why avoid inferSchema=True in production?

1. Spark does a FULL SCAN of data just to infer types (costs 1 extra read)
2. May infer wrong types (e.g., "2023-01-01" as string, not date)
3. Schema instability — new data can change inferred schema
→ Always use explicit StructType in production
Q: Read CSV with explicit schema — template:
python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True)
])
df = spark.read.schema(schema).option("header", True).csv("path/to/*.csv")
Q: Read multiple files — 4 ways:
python
# 1. Glob pattern
df = spark.read.parquet("s3://bucket/data/year=2024/month=*/")

# 2. Python list
df = spark.read.csv(["file1.csv", "file2.csv", "file3.csv"])

# 3. Directory (reads all files matching format)
df = spark.read.parquet("s3://bucket/data/")

# 4. Multiple sources → union
df1 = spark.read.parquet("s3://bucket/source1/")
df2 = spark.read.parquet("s3://bucket/source2/")
df = df1.unionByName(df2, allowMissingColumns=True)
Q: How to track which file each row came from?
python
from pyspark.sql.functions import input_file_name
df = spark.read.parquet("s3://bucket/data/").withColumn("source_file", input_file_name())
Q: Read Delta table with time travel:
python
# By version
df = spark.read.format("delta").option("versionAsOf", 5).load("path/to/delta")

# By timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("path/to/delta")
Q: Read from JDBC (database):
python
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "schema.table") \
    .option("user", "user").option("password", "pass") \
    .option("numPartitions", 10) \
    .option("partitionColumn", "id") \
    .option("lowerBound", 1).option("upperBound", 1000000) \
    .load()

⚠️ TRAP: Without numPartitions + partitionColumn, JDBC reads as 1 partition (single thread!)

SECTION 3: CORE TRANSFORMATIONS FLASH CARDS

Q: select vs selectExpr — difference?
python
df.select("col1", "col2")                  # column names only
df.select(col("col1"), col("col2") * 2)    # Column expressions
df.selectExpr("col1", "col2 * 2 as doubled", "YEAR(date_col) as yr")  # SQL strings
Q: filter vs where — difference?

None. They are aliases. Both take a condition.

python
df.filter(col("age") > 18)
df.where("age > 18")  # SQL string syntax
Q: groupBy + agg — template:
python
from pyspark.sql.functions import sum, avg, count, max, min, countDistinct

df.groupBy("dept", "year") \
  .agg(
      sum("salary").alias("total_salary"),
      avg("salary").alias("avg_salary"),
      countDistinct("employee_id").alias("headcount")
  )
Q: All join types — quick reference:
python
# Join types: inner, left, right, full, left_anti, left_semi, cross
df1.join(df2, on="id", how="inner")        # default
df1.join(df2, on="id", how="left")         # all left rows
df1.join(df2, on="id", how="left_anti")    # in df1 NOT in df2
df1.join(df2, on="id", how="left_semi")    # in df1 WHERE exists in df2 (no df2 cols)
df1.join(df2, df1.id == df2.id, how="inner")  # explicit condition (different col names)
Q: left_anti vs left_semi — use cases?
left_anti → "Find customers with NO orders" (NOT IN equivalent)
left_semi → "Find customers who HAVE orders" (EXISTS equivalent, no order cols)
Q: withColumn — create/update column:
python
from pyspark.sql.functions import col, lit, when, round

df.withColumn("tax", col("amount") * 0.18) \
  .withColumn("category", when(col("amount") > 1000, "high").otherwise("low")) \
  .withColumn("pi", lit(3.14159))
Q: Handle NULLs — common patterns:
python
df.na.drop()                            # drop rows with ANY null
df.na.drop(subset=["col1", "col2"])     # drop rows with null in these cols
df.na.fill(0)                           # fill all numeric nulls with 0
df.na.fill({"col1": 0, "col2": "N/A"}) # fill per column
df.filter(col("col1").isNotNull())      # filter out nulls
df.filter(col("col1").isNull())         # find nulls

SECTION 4: WINDOW FUNCTIONS FLASH CARDS

Q: Window function skeleton — 3 parts (POF):
python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, rank, dense_rank, lag, lead, sum, avg

# 1. Define the window spec
w = Window.partitionBy("dept").orderBy(col("salary").desc())

# 2. Apply function
df.withColumn("rank", dense_rank().over(w))
Q: RANK vs DENSE_RANK vs ROW_NUMBER — differences?
🧠 Memory Map
Values: [100, 100, 90, 80]
ROW_NUMBER: 1, 2, 3, 4 → unique (use for dedup, partition-first)
RANK:       1, 1, 3, 4 → gaps (use when ties should skip positions)
DENSE_RANK: 1, 1, 2, 3 → no gaps (use for Top-N with ties)
Interview: Top-N per group → ALWAYS use DENSE_RANK
Q: Running total + rolling 7-day average:
python
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, avg, col

# Running total
w_running = Window.partitionBy("user_id").orderBy("date") \
                  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_total", sum("amount").over(w_running))

# Rolling 7-day average (6 previous rows + current)
w_rolling = Window.partitionBy("user_id").orderBy("date") \
                  .rowsBetween(-6, Window.currentRow)
df.withColumn("rolling_7d_avg", avg("amount").over(w_rolling))
Q: LAG and LEAD — use cases:
python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag, lead

w = Window.partitionBy("user_id").orderBy("date")

df.withColumn("prev_amount", lag("amount", 1).over(w)) \
  .withColumn("next_amount", lead("amount", 1).over(w)) \
  .withColumn("pct_change",
      (col("amount") - col("prev_amount")) / col("prev_amount") * 100)
  • LAG → MoM/YoY change, detect consecutive events
  • LEAD → "What did user buy next?", time to next event

⚠️ TRAP: First row LAG = null. Last row LEAD = null. Handle with coalesce or isNull.

Q: What is ROWS BETWEEN vs RANGE BETWEEN?
ROWS BETWEEN  → Physical rows (recommended, predictable)
RANGE BETWEEN → Logical range based on ORDER BY values (tricky with duplicates)
Always use ROWS BETWEEN unless you specifically need RANGE semantics.

SECTION 5: UDF FLASH CARDS

Q: Python UDF vs Pandas UDF — key difference?
Python UDF:
  • Row-by-row Python function
  • Data serialized: JVM → Python → JVM for EVERY row
  • No vectorization
  • Slowest
Pandas UDF (Arrow-based):
  • Operates on Pandas Series (batch of rows)
  • Apache Arrow eliminates serialization overhead
  • Vectorized operations
  • 3-100x faster than Python UDF
  • Requires pyarrow installed
Q: Python UDF — template:
python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def clean_phone(phone):
    return phone.replace("-", "").replace(" ", "") if phone else None

df.withColumn("clean_phone", clean_phone(col("phone")))
Q: Pandas UDF — template:
python
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import DoubleType
import pandas as pd

@pandas_udf(DoubleType())
def apply_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.18

df.withColumn("total_with_tax", apply_tax(col("amount")))
Q: When to use UDFs vs built-in functions?
Built-in   → ALWAYS try first (Catalyst can optimize)
UDF        → Complex business logic that can't be expressed with built-ins
Pandas UDF → Heavy computation (ML inference, string parsing) on batches
Performance order (fastest → slowest):
spark.sql functions > Pandas UDF > Python UDF

⚠️ TRAP: UDFs are BLACK BOXES to Catalyst — no optimization, no predicate pushdown.

SECTION 6: WRITING DATA FLASH CARDS

Q: Write modes — O-A-I-E:
python
df.write.mode("overwrite").parquet("output/path")   # replace all
df.write.mode("append").parquet("output/path")       # add to existing
df.write.mode("ignore").parquet("output/path")       # skip if exists
df.write.mode("error").parquet("output/path")        # fail if exists (DEFAULT)
Q: partitionBy vs bucketBy — difference?
partitionBy:
  • Creates directory hierarchy: output/year=2024/month=01/
  • Good for: date-based filtering, Hive partition pruning
  • Files per partition: can create many small files
  • Used for: storage/query filtering efficiency
bucketBy:
  • Distributes data into fixed N buckets by hash
  • Must saveAsTable (Hive metastore)
  • Pre-shuffles data for future joins on the bucket column
  • Used for: eliminating shuffle in repeated joins
df.write.partitionBy("year", "month").parquet("output/")
df.write.bucketBy(50, "customer_id").sortBy("order_date").saveAsTable("orders")
Q: Write to Delta with upsert (merge):
python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/path/to/delta")
delta_table.alias("target").merge(
    new_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

SECTION 7: SPARKSQL FLASH CARDS

Q: createTempView vs createOrReplaceTempView vs createGlobalTempView:
python
df.createTempView("orders")              # fails if view exists
df.createOrReplaceTempView("orders")     # overwrites if exists (use this!)
df.createGlobalTempView("orders")        # accessible across SparkSessions
# access global: spark.sql("SELECT * FROM global_temp.orders")

# scope:
# TempView        → only in this SparkSession
# GlobalTempView  → all SparkSessions in same application
Q: Spark SQL — example query:
python
result = spark.sql("""
    SELECT dept, year, SUM(salary) as total
    FROM employees
    WHERE year >= 2020
    GROUP BY dept, year
    HAVING SUM(salary) > 1000000
    ORDER BY total DESC
""")
Q: Explain a query plan:
python
df.explain()          # physical plan only
df.explain(True)      # all plans: parsed, analyzed, optimized, physical
df.explain("cost")    # logical plan with cost statistics, when available

SECTION 8: NESTED DATA FLASH CARDS

Q: Access struct field:
python
df.select("address.city")           # dot notation
df.select(col("address.city"))      # Column API
df.withColumn("city", col("address.city"))
Q: explode vs explode_outer — difference?
python
from pyspark.sql.functions import explode, explode_outer

# explode → drops rows where array is null/empty
df.withColumn("tag", explode(col("tags")))

# explode_outer → keeps rows where array is null/empty (produces NULL row)
df.withColumn("tag", explode_outer(col("tags")))

⚠️ ALWAYS use explode_outer if you need to preserve all rows
Q: Flatten nested JSON — step by step:
python
from pyspark.sql.functions import col, explode

# Input: {"user": {"id": 1, "name": "Alice"}, "items": [{"sku": "A1"}, {"sku": "B2"}]}
df.select(
    col("user.id").alias("user_id"),
    col("user.name").alias("user_name"),
    explode("items").alias("item")          # one row per item
).select("user_id", "user_name", "item.sku")
Q: Map column operations:
python
from pyspark.sql.functions import map_keys, map_values, col

# Access map value by key
df.select(col("metadata")["env"].alias("environment"))

# Get all keys/values
df.select(map_keys("metadata").alias("keys"),
          map_values("metadata").alias("values"))

🧠 DAY 2 ULTRA SUMMARY CARD

📐 Summary Diagram
╔═══════════════════════════════════════════════════════════════╗
║         PYSPARK DAY 2 — DATAFRAME + SPARKSQL                  ║
╠═══════════════════════════════════════════════════════════════╣
║                                                               ║
║  READ FORMATS (CPJ-OAD):                                      ║
║  CSV(header+schema) | Parquet(col) | JSON(multiline)          ║
║  ORC(Hive) | Avro(stream) | Delta(ACID+time travel)           ║
║  ⚠ Always explicit schema in prod (inferSchema = 1 extra scan) ║
║  ⚠ JDBC needs numPartitions+partitionColumn or 1 thread!      ║
║                                                               ║
║  MULTIPLE FILES:                                              ║
║  Glob: "path/year=2024/month=*/"                              ║
║  List: read.csv(["f1.csv","f2.csv"])                          ║
║  Directory: read.parquet("dir/")                              ║
║  Union: df1.unionByName(df2, allowMissingColumns=True)        ║
║  Track source: input_file_name()                              ║
║                                                               ║
║  WINDOW FUNCTIONS (POF):                                      ║
║  W = Window.partitionBy("col").orderBy("col")                 ║
║  ROW_NUMBER: unique | RANK: gaps | DENSE_RANK: no gaps        ║
║  Top-N → DENSE_RANK | Dedup → ROW_NUMBER = 1                  ║
║  Running total: rowsBetween(unboundedPreceding, currentRow)   ║
║  Rolling N: rowsBetween(-(N-1), currentRow)                   ║
║  LAG/LEAD: first row LAG=NULL, last row LEAD=NULL             ║
║                                                               ║
║  UDF PERFORMANCE: Built-in >> Pandas UDF >> Python UDF        ║
║  Python UDF → row-by-row, JVM↔Python serialization           ║
║  Pandas UDF → Arrow-based batch, vectorized, 3-100x faster    ║
║  ⚠ UDFs are opaque to Catalyst — no optimization!             ║
║                                                               ║
║  JOIN TYPES:                                                  ║
║  inner | left | right | full | cross                          ║
║  left_anti → NOT IN equivalent (in A, NOT in B)               ║
║  left_semi → EXISTS equivalent (in A WHERE exists in B)       ║
║                                                               ║
║  WRITE MODES (O-A-I-E): overwrite | append | ignore | error   ║
║  partitionBy → directory hierarchy, query pruning             ║
║  bucketBy   → hash bucketing, eliminates join shuffles        ║
║  Delta merge → whenMatchedUpdateAll + whenNotMatchedInsertAll  ║
║                                                               ║
║  SPARKSQL:                                                    ║
║  createOrReplaceTempView → session-scoped                     ║
║  createGlobalTempView → app-scoped (global_temp.name)         ║
║  df.explain(True) → see all 4 plan phases                     ║
║                                                               ║
║  NESTED DATA:                                                 ║
║  struct: col("address.city") or "address.city"                ║
║  array:  explode (drops nulls), explode_outer (keeps nulls)   ║
║  map:    col("metadata")["key"] for value access              ║
║                                                               ║
║  TOP TRAPS:                                                   ║
║  1. inferSchema = extra scan (always explicit in prod)        ║
║  2. RANK ≠ DENSE_RANK for top-N with ties                     ║
║  3. UDF black box = no Catalyst optimization                  ║
║  4. explode drops null rows (use explode_outer)               ║
║  5. JDBC default = 1 partition = single thread!               ║
║  6. left_anti ≠ left_semi (know the difference!)             ║
║                                                               ║
╚═══════════════════════════════════════════════════════════════╝

QUICK DECISION TABLE

🧠 TASK → SOLUTION
────────────────────────────────────────────────────────────────────────
Top-3 products per category            → DENSE_RANK OVER(PARTITION BY cat)
Remove duplicate rows (keep latest)    → ROW_NUMBER OVER(PARTITION BY key ORDER BY ts DESC) = 1
Running total per user                 → SUM OVER(PARTITION BY user ORDER BY date ROWS BETWEEN...)
Previous row value (MoM change)        → LAG(col, 1) OVER(PARTITION BY user ORDER BY date)
Track which file each row came from    → input_file_name()
Combine 2 DFs with different schemas   → unionByName(allowMissingColumns=True)
Complex business logic in column       → Pandas UDF (prefer) or Python UDF
Find customers with no orders          → left_anti join
Find customers with at least one order → left_semi join
Save data partitioned by date for Hive → write.partitionBy("year","month").parquet()
Pre-shuffle for repeated joins         → bucketBy(N, "join_col").saveAsTable()
Upsert/merge into Delta table          → DeltaTable.merge().whenMatched().whenNotMatched()
Expand array column to multiple rows   → explode_outer (keeps null rows)
Access nested JSON field               → col("parent.child") dot notation