DataFrame + SparkSQL — Deep Interview Guide
💡 Interview Tip
DataFrames are the primary API for PySpark. Know these inside out.
Code questions here are very common — interviewers will ask you to write code live.
MASTER MEMORY MAP — Day 2
RDD vs DataFrame vs Dataset = "Low -> High -> Typed"
RDD: Low-level, no schema, no optimizer, Python/Java/Scala
DataFrame: High-level, schema, Catalyst/Tungsten optimized, SQL-like
Dataset: High-level, TYPED (compile-time safety), JVM only (Scala/Java)
PySpark: Only RDD + DataFrame (no Dataset in Python!)
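A quick contrast in code, as a hedged sketch: the names, ages, and the age-18 filter are invented for illustration.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD: positional tuples, no schema, no Catalyst optimization
rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 15)])
adults_rdd = rdd.filter(lambda row: row[1] >= 18)  # row[1]: you must remember positions

# DataFrame: named columns with a schema; the filter goes through Catalyst
df = spark.createDataFrame(rdd, ["name", "age"])
adults_df = df.filter(df.age >= 18)
adults_df.show()
```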
SPARKSESSION"Entry point for EVERYTHING"
spark.read.* -> read data
spark.sql("...") -> run SQL
spark.sparkContext -> access RDD API
spark.createDataFrame() -> create DF from RDD or list
spark.catalog.* -> manage databases, tables, functions
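A minimal sketch that exercises each entry point; the app name, table name, and sample rows are made up.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])  # create DF from a list
df.createOrReplaceTempView("demo_table")

spark.sql("SELECT id FROM demo_table WHERE val = 'a'").show()    # run SQL
rdd = spark.sparkContext.parallelize(range(5))                   # drop down to the RDD API
print(spark.catalog.listTables())                                # inspect the catalog
```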
READING FILES"Format.Option.Schema.Load"
format: csv, json, parquet, orc, delta, jdbc, avro
option: header, inferSchema, delimiter, mode
schema: StructType([StructField(...)]) — ALWAYS explicit in production!
load: .load("/path/") or .(path) shortcut
READ MULTIPLE FILES"Glob/List/Directory"
Glob: spark.read.csv("/data/2024/*") <- all files matching pattern
List: spark.read.csv(["/f1", "/f2"]) <- specific files
Dir: spark.read.parquet("/data/") <- all parquet in dir
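All three shapes in one sketch; the paths are placeholders, not real files.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_glob = spark.read.csv("/data/2024/*", header=True)                     # glob pattern
df_list = spark.read.csv(["/data/f1.csv", "/data/f2.csv"], header=True)   # explicit list
df_dir  = spark.read.parquet("/data/")                                    # whole directory
```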
TRANSFORMATIONS"SWGJW-NAU"
S = select() / withColumn()
W = where() / filter()
G = groupBy() + agg()
J = join()
W = withColumnRenamed() / drop()
N = na.fill() / na.drop()
A = alias() / cast()
U = union() / unionByName()
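One chained sketch that walks the mnemonic in order; the columns (name, age, dept, salary) and sample rows are invented.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("alice", 30, "eng", 100.0), ("bob", None, "ops", None)],
    ["name", "age", "dept", "salary"])

result = (df
    .select("name", "age", "dept", "salary")          # S: select
    .where(F.col("dept").isNotNull())                 # W: where/filter
    .na.fill({"age": 0, "salary": 0.0})               # N: null handling
    .withColumn("bonus", F.col("salary") * 0.1)       # S: withColumn
    .withColumnRenamed("dept", "department")          # W: rename
    .groupBy("department")                            # G: groupBy
    .agg(F.avg("salary").alias("avg_salary")))        # G + A: agg/alias

# J and U compose the same way:
#   df.join(other_df, "dept")        # join on a key
#   df.unionByName(other_df)         # union matching columns by name
result.show()
```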
WINDOW FUNCTIONS"PARTITION + ORDER + FRAME"
Window.partitionBy("col").orderBy("col")
rowsBetween(Window.unboundedPreceding, Window.currentRow)
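A small sketch combining a ranking window with a running-total frame; the dept/salary columns are assumptions for illustration.
```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("eng", "alice", 100), ("eng", "bob", 80), ("ops", "carol", 90)],
    ["dept", "name", "salary"])

w = Window.partitionBy("dept").orderBy(F.col("salary").desc())          # PARTITION + ORDER
running = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)   # + FRAME

(df.withColumn("rank", F.row_number().over(w))
   .withColumn("running_total", F.sum("salary").over(running))
   .show())
```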
COLUMN OPERATIONS"col / lit / when / cast"
col("name") -> reference a column
lit(100) -> create a constant column
when().otherwise() -> conditional logic (CASE WHEN)
cast("int") -> change a column's data type
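A short sketch of all four in one chain; the columns and the age-18 cutoff are invented.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", "30"), ("bob", "15")], ["name", "age"])

df = (df
    .withColumn("age", F.col("age").cast("int"))       # col + cast: string -> int
    .withColumn("country", F.lit("US"))                # lit: constant column
    .withColumn("bracket",
        F.when(F.col("age") >= 18, "adult")
         .otherwise("minor")))                         # when/otherwise: CASE WHEN
df.show()
```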