DataFrame + SparkSQL — Quick Recall
🧠 MASTER MNEMONICS
DATAFRAME vs RDD vs DATASET → "TYPE-OPT"
DataFrame → No type safety, Catalyst optimized, use in PySpark always
Dataset → Type safe (Scala/Java only), NOT available in Python
RDD → Low-level, no optimizer, use only when DF can't express logic
READ FORMATS → "CPJ-OAD" (example below)
C → CSV (inferSchema → false in prod, always specify schema)
P → Parquet (columnar, default Spark format, predicate pushdown)
J → JSON (multiline option for wrapped arrays)
O → ORC (Hive-native, great for Hive tables)
A → Avro (row-based, streaming, schema evolution)
D → Delta (ACID, time travel, upserts — production standard)
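A minimal PySpark read sketch (paths, schema, and column names are hypothetical, not from the original notes):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("read-formats").getOrCreate()

# CSV: pass an explicit schema; inferSchema costs an extra pass over the data
orders_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("amount", DoubleType(), True),
])
csv_df = (spark.read
          .schema(orders_schema)
          .option("header", "true")
          .csv("/data/orders.csv"))

# Parquet: columnar, so the filter below can be pushed down to the file scan
parquet_df = spark.read.parquet("/data/orders_parquet").filter("amount > 100")

# JSON: multiLine handles records wrapped across lines (e.g. a top-level array)
json_df = spark.read.option("multiLine", "true").json("/data/orders.json")
```

(Avro and Delta need their respective packages on the classpath; same `spark.read.format(...)` pattern.)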
WINDOW FUNCTION SKELETON → "POF" (example below)
P → partitionBy (what resets the window)
O → orderBy (within partition ordering)
F → Frame (ROWS BETWEEN ... — what rows to include)
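A minimal "POF" sketch, assuming a hypothetical `orders_df` with `customer_id`, `order_date`, and `amount` columns:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

w = (
    Window
    .partitionBy("customer_id")   # P: window resets for every customer
    .orderBy("order_date")        # O: ordering inside each partition
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)  # F: which rows to include
)

running = orders_df.withColumn("running_spend", F.sum("amount").over(w))
```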
UDF PERFORMANCE ORDER (fastest → slowest)
Built-in Spark functions >> Pandas UDF (Arrow) >> Python UDF >> RDD operations
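A sketch of the three UDF tiers doing the same transform, assuming a hypothetical `df` with an `amount` column:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

# 1. Built-in function: stays inside Catalyst/Tungsten, no Python round-trip
df1 = df.withColumn("amount_x2", F.col("amount") * 2)

# 2. Pandas UDF: rows move to Python in Arrow batches, processed vectorized with pandas
@pandas_udf(DoubleType())
def double_vec(s: pd.Series) -> pd.Series:
    return s * 2

df2 = df.withColumn("amount_x2", double_vec("amount"))

# 3. Plain Python UDF: one row at a time through pickle serialization, slowest
double_py = udf(lambda x: None if x is None else x * 2, DoubleType())
df3 = df.withColumn("amount_x2", double_py("amount"))
```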
WRITE MODES → "OAIE" (example below)
O → overwrite (replace entire output)
A → append (add to existing)
I → ignore (skip if exists)
E → error / errorifexists (fail if exists — default)
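A minimal sketch of the four save modes (output path and DataFrame are hypothetical):

```python
df.write.mode("overwrite").parquet("/out/orders")      # replace existing output
df.write.mode("append").parquet("/out/orders")         # add to existing output
df.write.mode("ignore").parquet("/out/orders")         # no-op if the path already exists
df.write.mode("errorifexists").parquet("/out/orders")  # default: fail if the path exists
```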
SECTION 1: RDD vs DataFrame vs Dataset FLASH CARDS
Q: When do you use RDD over DataFrame?
A: Use RDD when:
1. Complex functional transformations not expressible in DataFrame
2. Unstructured data (text, binary) where schema doesn't apply
3. Need fine-grained control over partitioning
4. Custom partitioners
Use DataFrame otherwise (Catalyst optimization = much faster)
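A minimal sketch of two RDD-only cases, custom key partitioning and raw log text (keys, paths, and partition counts are hypothetical):

```python
sc = spark.sparkContext

# Custom partitioner: pin each key to a specific partition (not expressible with DataFrames)
pairs = sc.parallelize([("eu", 1), ("us", 2), ("apac", 3)])
region_to_part = {"eu": 0, "us": 1, "apac": 2}
partitioned = pairs.partitionBy(3, lambda key: region_to_part.get(key, 0))

# Unstructured text where no schema applies
lines = sc.textFile("/data/app.log")
error_count = lines.filter(lambda line: "ERROR" in line).count()
```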
Q: Why is Dataset NOT available in PySpark?
A: Dataset requires compile-time type checking (JVM generics). Python is dynamically typed — no compile-time types.
Q: What gives DataFrame its performance advantage over RDD?
A: The Catalyst Optimizer (logical plan → optimized logical plan → physical plan) plus Tungsten's optimized execution, neither of which applies to hand-written RDD code.
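To see what Catalyst produced for a query, print the plans with `explain()`. A minimal sketch, assuming a hypothetical `df` with `amount` and `customer_id` columns:

```python
from pyspark.sql import functions as F

agg = df.filter(F.col("amount") > 100).groupBy("customer_id").count()

# mode="extended" prints the parsed, analyzed, and optimized logical plans plus the physical plan
agg.explain(mode="extended")
```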