DataFrame + SparkSQL — Quick Recall
PySpark · Section 4 of 9



🧠 MASTER MNEMONICS

DATAFRAME vs RDD vs DATASET — "TYPE-OPT"
DataFrame — no type safety, Catalyst optimized, use in PySpark always
Dataset — type safe (Scala/Java only), NOT available in Python
RDD — low-level, no optimizer, use only when DF can't express logic
READ FORMATS — "CPJ-OAD"
C — CSV (inferSchema → false in prod; always specify a schema)
P — Parquet (columnar, default Spark format, predicate pushdown)
J — JSON (multiline option for wrapped arrays)
O — ORC (Hive-native, great for Hive tables)
A — Avro (row-based, streaming, schema evolution)
D — Delta (ACID, time travel, upserts — production standard)
WINDOW FUNCTION SKELETON — "POF"
P — partitionBy (what resets the window)
O — orderBy (ordering within the partition)
F — Frame (ROWS BETWEEN ... — which rows to include)
UDF PERFORMANCE ORDER (fastest → slowest)
Built-in Spark functions >> Pandas UDF (Arrow) >> Python UDF >> RDD operations
WRITE MODES — "OAIE"
O — overwrite (replace entire output)
A — append (add to existing)
I — ignore (skip write if output exists)
E — error / errorifexists (fail if exists — default)

SECTION 1: RDD vs DataFrame vs Dataset FLASH CARDS

Q: When do you use RDD over DataFrame?

A: Use RDD when:
1. Complex functional transformations not expressible in the DataFrame API
2. Unstructured data (text, binary) where a schema doesn't apply
3. You need fine-grained control over partitioning
4. Custom partitioners
Use DataFrame otherwise (Catalyst optimization = much faster).
Q: Why is Dataset NOT available in PySpark?

A: Dataset requires compile-time type checking (JVM generics). Python is dynamically typed, so there are no compile-time types to check.

Q: What gives DataFrame its performance advantage over RDD?

A: Catalyst Optimizer (logica