Day 2: DataFrame + SparkSQL — Deep Interview Guide
MASTER MEMORY MAP — Day 2
SECTION 1: RDD vs DataFrame vs Dataset
Definition: A DataFrame is a distributed collection of data organized into named columns, equivalent to a table in a relational database, optimized by Spark's Catalyst optimizer and Tungsten execution engine.
Simple Explanation: Think of an RDD as a raw list of items with no labels. A DataFrame is like an Excel spreadsheet with column headers. A Dataset adds strict data typing (only in Scala/Java).
Real-world Analogy: RDD is a box of unsorted papers. DataFrame is papers organized in a filing cabinet with labeled folders. Dataset is a filing cabinet where each folder only accepts a specific document type.
| Feature | RDD | DataFrame | Dataset (Scala/Java) |
|---|---|---|---|
| API Level | Low-level | High-level SQL | High-level typed |
| Schema | No | Yes (column names) | Yes (typed case class) |
| Type Safety | Python runtime | Runtime only | COMPILE TIME |
| Catalyst Optim. | None | Full | Full |
| Tungsten | None | Full | Full |
| Performance | Slowest | Fast | Fast |
| Language | Python/Scala/Java | All languages | Scala/Java ONLY |
| Null Handling | Manual | Automatic | Automatic |
| When to use | Unstructured data, complex custom logic | Structured data, SQL-like ops (most cases) | N/A in PySpark |
What NOT to say: "Dataset is available in PySpark." It is not — Dataset is Scala/Java only. In PySpark, DataFrame IS the Dataset[Row] equivalent.
SECTION 2: SparkSession + DataFrame Creation
What is SparkSession?
Definition: SparkSession is the unified entry point for all Spark functionality — reading data, running SQL, accessing the catalog, and configuring the application.
Simple Explanation: Before you do anything in Spark, you need a SparkSession. It is the single door you walk through to access everything Spark offers.
Interview tip: mention getOrCreate() — it prevents creating multiple sessions in the same application. Mention that SparkSession replaced the older SparkContext + SQLContext + HiveContext pattern.
What NOT to say: "You need a SparkContext to use DataFrames." SparkSession has been the entry point since Spark 2.0. SparkContext is still used internally, but you access it via spark.sparkContext.
Creating SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("DataPipeline") \
.master("yarn") \
.config("spark.executor.memory", "8g") \
.config("spark.executor.cores", "4") \
.config("spark.executor.instances", "10") \
.config("spark.sql.shuffle.partitions", "400") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024)) \
.enableHiveSupport() \
.getOrCreate()
# getOrCreate() -> creates new session OR returns existing one
# Prevents creating multiple SparkSessions in the same application
DataFrame Creation Methods
Definition: A DataFrame can be created from files (CSV, JSON, Parquet, etc.), databases (JDBC), in-memory data (lists, RDDs), or existing tables.
# --- FROM A PYTHON LIST ---
data = [("Alice", 30, "Engineering"), ("Bob", 25, "Marketing")]
columns = ["name", "age", "department"]
df = spark.createDataFrame(data, columns)
# --- FROM AN RDD ---
rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
df = rdd.toDF(["name", "age"])
# --- FROM A PANDAS DATAFRAME ---
import pandas as pd
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
df = spark.createDataFrame(pdf)
# --- FROM AN EXISTING TABLE (Hive/Delta metastore) ---
df = spark.table("database_name.table_name")
# --- FROM A RANGE ---
df = spark.range(0, 100, 1) # single column 'id' with values 0-99
Interview tip: use spark.createDataFrame() for testing and spark.read for production.
SECTION 3: Schema Definition (StructType, StructField)
Definition: A schema in PySpark defines the structure of a DataFrame — column names, data types, and nullability — using StructType (the table) and StructField (each column).
Simple Explanation: A schema is the blueprint of your data. Just like a building blueprint specifies where every room goes, a schema specifies what every column looks like before any data arrives.
Real-world Analogy: Schema is like the column headers and data validation rules in an Excel template — it says "Column A must be text, Column B must be a number, Column C cannot be blank."
from pyspark.sql.types import (
StructType, StructField,
StringType, IntegerType, LongType, DoubleType, FloatType,
BooleanType, DateType, TimestampType, ArrayType, MapType
)
# Define schema explicitly (no inferSchema scan needed)
booking_schema = StructType([
StructField("booking_id", StringType(), nullable=False),
StructField("customer_id", LongType(), nullable=True),
StructField("flight_code", StringType(), nullable=True),
StructField("amount", DoubleType(), nullable=True),
StructField("booking_date", DateType(), nullable=True),
StructField("is_cancelled", BooleanType(), nullable=True),
# Nested struct:
StructField("address", StructType([
StructField("city", StringType(), True),
StructField("country", StringType(), True)
]), True),
# Array column:
StructField("tags", ArrayType(StringType()), True),
# Map column:
StructField("metadata", MapType(StringType(), StringType()), True)
])
df = spark.read.schema(booking_schema).csv("/data/bookings.csv", header=True)
| Aspect | inferSchema=True | Explicit Schema |
|---|---|---|
| Performance | Extra pass over data | No extra pass |
| Type accuracy | Often wrong (123 as Long) | You control every type |
| Null handling | Guesses nullability | You define nullable |
| Streaming support | NOT supported | Required |
| Schema enforcement | None | Fail fast on bad data |
| Production use | NEVER | ALWAYS |
What NOT to say: "I just use inferSchema=True." This signals you have not worked on production pipelines. Also do not say "I use DDL strings for complex schemas" — StructType is the standard.
SECTION 4: Reading Data (CSV, JSON, Parquet, JDBC, Delta)
Definition: spark.read is the DataFrameReader API that loads data from external storage into a DataFrame, supporting multiple formats and configuration options.
Simple Explanation: spark.read is how you bring outside data into Spark. You specify the format, set options (like whether there is a header row), and point it to the file path.
Reading Single Files — All Formats
# --- CSV ---
df_csv = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.option("delimiter", ",") \
.option("nullValue", "N/A") \
.option("dateFormat", "yyyy-MM-dd") \
.csv("/data/bookings.csv")
# WARNING: inferSchema triggers extra read pass — use explicit schema in prod
# --- JSON ---
df_json = spark.read \
.option("multiLine", "true") \
.option("mode", "PERMISSIVE") \
.json("/data/events/*.json")
# mode options: PERMISSIVE (default, sets malformed fields to null) | DROPMALFORMED (drops bad rows) | FAILFAST (throws on first bad row)
# --- PARQUET (PREFERRED FORMAT) ---
df_parquet = spark.read.parquet("/data/warehouse/bookings/")
# Schema embedded in file (no inferSchema needed)
# Columnar format with predicate pushdown
# Efficient compression (Snappy by default)
# --- ORC ---
df_orc = spark.read.orc("/data/hive_table/")
# --- JDBC (relational database) ---
df_jdbc = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://db-host:5432/mydb") \
.option("dbtable", "bookings") \
.option("user", "etl_user") \
.option("password", "secret") \
.option("driver", "org.postgresql.Driver") \
.option("numPartitions", "10") \
.option("partitionColumn", "booking_id") \
.option("lowerBound", "1") \
.option("upperBound", "1000000") \
.load()
# --- AVRO ---
df_avro = spark.read.format("avro").load("/data/kafka_output/")
# --- DELTA LAKE ---
df_delta = spark.read.format("delta").load("/data/delta/bookings/")
# Or using table name if registered in metastore:
df_delta = spark.table("bookings_db.bookings")
Reading Multiple Files from Multiple Sources
# --- GLOB PATTERN (most common) ---
df = spark.read.csv("/data/2024/*/bookings_*.csv", header=True)
# Reads: /data/2024/jan/bookings_1.csv, /data/2024/feb/bookings_2.csv etc.
# --- LIST OF SPECIFIC FILES ---
df = spark.read.parquet(
"/data/2024/jan/bookings.parquet",
"/data/2024/feb/bookings.parquet",
"/data/2024/mar/bookings.parquet"
)
# Or pass as Python list:
files = ["/data/2024/jan/bookings.parquet", "/data/2024/feb/bookings.parquet"]
df = spark.read.parquet(*files) # unpack list
# --- DIRECTORY (all files in dir with same format) ---
df = spark.read.parquet("/data/2024/") # reads ALL parquet files in dir
# --- ADD SOURCE FILE NAME COLUMN ---
from pyspark.sql.functions import input_file_name
df = spark.read.option("header", "true") \
.csv("/data/2024/*/bookings_*.csv") \
.withColumn("source_file", input_file_name())
# --- READ MULTIPLE FORMATS AND UNION ---
df_csv = spark.read.option("header", True).csv("/landing/batch/")
df_json = spark.read.json("/landing/api/events/")
df_parquet = spark.read.parquet("/landing/warehouse/")
# Align columns then union
cols = ["booking_id", "customer_id", "amount", "booking_date"]
df_all = df_csv.select(cols) \
.union(df_json.select(cols)) \
.union(df_parquet.select(cols))
# Better: unionByName (aligns by column name, not position)
df_all = df_csv.unionByName(df_json, allowMissingColumns=True) \
.unionByName(df_parquet, allowMissingColumns=True)
# --- READ FROM DIFFERENT STORAGE SYSTEMS ---
df_s3 = spark.read.parquet("s3a://my-bucket/data/")
df_adls = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/data/")
df_hdfs = spark.read.parquet("hdfs://namenode:8020/data/")
df_local = spark.read.parquet("file:///local/path/data/")
# Combine multiple storage locations:
df_combined = spark.read.parquet(
"s3a://bucket/data/2023/",
"s3a://bucket/data/2024/",
"hdfs://namenode/archive/2022/"
)
Interview tip: for JDBC reads, mention numPartitions + partitionColumn for parallel reads.
What NOT to say: "I just use spark.read.csv(path) with defaults." This shows you do not understand production considerations like schema enforcement, read modes, or performance tuning.
SECTION 5: DataFrame Transformations
Definition: Transformations are lazy operations that define a computation plan on a DataFrame without executing it. They produce a new DataFrame (DataFrames are immutable).
Simple Explanation: Transformations are instructions you stack up. Nothing actually runs until you call an action (like .show(), .count(), .collect()). Spark waits so it can optimize the entire chain at once.
Real-world Analogy: Writing a recipe (transformations) vs actually cooking it (actions). You write all the steps first, then the chef (Catalyst optimizer) rearranges them for efficiency before cooking.
select
Definition: Projects a set of columns or expressions from a DataFrame, equivalent to SELECT in SQL.
df.select("id", "name", "amount") # by name
df.select(col("id"), col("amount") * 1.1) # with expression
df.select("*", (col("amount") * 1.1).alias("new_amount")) # all + new column
filter / where
Definition: Returns rows that satisfy a given condition. filter() and where() are identical — aliases of each other.
df.filter(col("age") > 30)
df.filter("age > 30") # SQL string form
df.where((col("country") == "IN") & (col("amount") > 100))
df.filter(col("status").isin("CONFIRMED", "PENDING"))
df.filter(col("name").startswith("A"))
df.filter(col("name").like("A%"))
df.filter(~col("is_cancelled")) # NOT condition
Interview tip: filter() and where() are the same — interviewers sometimes test this. Also mention that Spark pushes filter predicates down to the data source (predicate pushdown) for formats like Parquet and JDBC.
withColumn
Definition: Returns a new DataFrame with a column added or replaced. If the column name already exists, it replaces the column.
df.withColumn("tax", col("amount") * 0.18)
df.withColumn("amount", col("amount").cast("double")) # modify existing column
df.withColumn("status", when(col("amount") > 1000, "high")
.when(col("amount") > 100, "medium")
.otherwise("low"))
Interview tip: chaining many withColumn calls grows the query plan one projection at a time — use a single select() with multiple expressions instead for better performance.
drop
Definition: Returns a new DataFrame with specified columns removed.
df.drop("col1", "col2")
df.drop(col("temporary_column"))
distinct / dropDuplicates
Definition: distinct() removes duplicate rows considering ALL columns. dropDuplicates() removes duplicates based on a SUBSET of columns.
df.distinct() # all columns must match
df.dropDuplicates(["customer_id", "booking_date"]) # subset columns
Interview tip: use dropDuplicates for subset-based dedup. For "keep the latest record per key", use a window function with row_number() — not dropDuplicates, because dropDuplicates gives you an arbitrary row, not the most recent.
What NOT to say: "I use distinct() to deduplicate by key." distinct() considers ALL columns. If you only want unique keys, use dropDuplicates(["key_col"]).
withColumnRenamed / sort
# --- RENAME ---
df.withColumnRenamed("old_name", "new_name")
# --- SORT ---
df.orderBy("amount") # ascending
df.orderBy(col("amount").desc()) # descending
df.orderBy(col("date").asc(), col("amount").desc()) # multi-column
SECTION 6: Column Operations (col, lit, when/otherwise, cast)
Definition: Column operations are functions that create, transform, or reference individual columns within a DataFrame expression.
Simple Explanation: col() points to an existing column. lit() creates a constant value column. when().otherwise() is your IF-ELSE logic. cast() converts data types.
col() and lit()
from pyspark.sql.functions import col, lit
# col() — reference an existing column
df.select(col("name"), col("amount") * 2)
# lit() — create a constant value column
df.withColumn("country", lit("India"))
df.withColumn("multiplier", lit(1.18))
# Using a Python variable directly works via implicit conversion,
# but wrapping it in lit() makes the intent explicit:
tax_rate = 0.18
df.withColumn("tax", col("amount") * tax_rate)       # works, but implicit
df.withColumn("tax", col("amount") * lit(tax_rate))  # preferred: explicit constant
Interview tip: lit() is needed when you want to add a constant column or use a Python variable as a column value. Without lit(), Python values sometimes work due to implicit conversion, but lit() makes intent explicit.
when / otherwise (Conditional Logic)
Definition: when().otherwise() is the PySpark equivalent of SQL's CASE WHEN. It evaluates conditions in order and returns the first matching result.
from pyspark.sql.functions import when
# Simple condition
df.withColumn("category",
when(col("amount") > 1000, "high")
.when(col("amount") > 100, "medium")
.otherwise("low")
)
# Multiple conditions combined
df.withColumn("flag",
when((col("status") == "CANCELLED") & (col("amount") > 500), "refund_priority")
.when(col("status") == "CANCELLED", "standard_refund")
.otherwise("no_action")
)
# Nested when for complex logic
df.withColumn("tier",
when(col("total_spend") > 10000, "platinum")
.when((col("total_spend") > 5000) & (col("years") > 3), "gold")
.when(col("total_spend") > 1000, "silver")
.otherwise("bronze")
)
cast (Type Conversion)
Definition: Converts a column from one data type to another.
df.withColumn("amount", col("amount").cast("double"))
df.withColumn("booking_date", col("booking_date").cast("date"))
df.withColumn("amount_int", col("amount").cast(IntegerType()))
# Common cast operations:
# "string" -> "integer", "long", "double", "float", "date", "timestamp", "boolean"
What NOT to say: "I use int() or float() to convert column types." Those are Python functions, not Spark functions. Always use .cast() for DataFrame column type conversion.
SECTION 7: Aggregations (groupBy, agg, sum, avg, count, min, max)
Definition: Aggregation operations group rows by one or more columns and compute summary statistics (count, sum, average, etc.) for each group.
Simple Explanation: Aggregation is like creating a pivot table in Excel — you pick the grouping columns and then calculate totals, averages, or counts for each group.
Real-world Analogy: Counting how many passengers boarded each flight and the total revenue per flight — that is a groupBy("flight_code") with count and sum aggregations.
from pyspark.sql.functions import count, sum, avg, max, min, countDistinct, collect_list, collect_set
# --- BASIC GROUPBY + AGG ---
df.groupBy("country") \
.agg(
count("*").alias("booking_count"),
sum("amount").alias("total_amount"),
avg("amount").alias("avg_amount"),
max("amount").alias("max_amount"),
min("amount").alias("min_amount"),
countDistinct("customer_id").alias("unique_customers")
)
# --- MULTIPLE GROUPBY COLUMNS ---
df.groupBy("country", "booking_year") \
.agg(
count("*").alias("bookings"),
sum("amount").alias("revenue")
)
# --- COLLECT VALUES INTO A LIST/SET ---
df.groupBy("customer_id") \
.agg(
collect_list("product").alias("all_products"), # allows duplicates
collect_set("product").alias("unique_products") # no duplicates
)
# --- SHORTCUT METHODS (less flexible but quick) ---
df.groupBy("country").count()
df.groupBy("country").sum("amount")
df.groupBy("country").avg("amount")
df.groupBy("country").max("amount")
| Function | What it does | Null behavior |
|---|---|---|
| count("*") | Counts all rows including nulls | Counts everything |
| count("col") | Counts non-null values in column | Ignores nulls |
| countDistinct() | Counts unique non-null values | Ignores nulls |
| sum() | Sum of values | Ignores nulls |
| avg() / mean() | Average of values | Ignores nulls |
| max() / min() | Maximum / minimum value | Ignores nulls |
| collect_list() | Collects into array (with duplicates) | Includes nulls |
| collect_set() | Collects into array (unique only) | Excludes nulls |
| first() / last() | First / last value in group | Depends on ignorenulls |
Interview tip: count("*") counts all rows including nulls, while count("column_name") skips nulls. This is a common trick question. Also mention that collect_list and collect_set can cause OOM if groups are very large.
What NOT to say: "count() and count(column) are the same." They are not. count("*") includes null rows; count("col") excludes them.
SECTION 8: Joins
Definition: A join combines rows from two DataFrames based on a related column (join key), similar to SQL joins.
Simple Explanation: Joins connect two tables using a shared column. Think of matching employee IDs in a "people" table with the same IDs in a "salary" table to get each person's salary.
Real-world Analogy: You have a guest list (names + invite codes) and a seating chart (invite codes + table numbers). Joining them on invite code gives you name + table number.
from pyspark.sql.functions import broadcast, col
# --- INNER JOIN (only matching rows from both sides) ---
result = df1.join(df2, on="customer_id", how="inner")
# --- LEFT JOIN (all rows from left, matching from right, null if no match) ---
result = df1.join(df2, on=["customer_id", "date"], how="left")
# --- RIGHT JOIN (all rows from right, matching from left) ---
result = df1.join(df2, on="customer_id", how="right")
# --- FULL OUTER JOIN (all rows from both, nulls where no match) ---
result = df1.join(df2, on="customer_id", how="full")
# --- CROSS JOIN (cartesian product — every row with every row) ---
result = df1.crossJoin(df2)
# WARNING: N x M rows — use only when intentional (e.g., generating combinations)
# --- LEFT SEMI JOIN (rows in df1 that HAVE a match in df2, NO df2 columns) ---
result = df1.join(df2, on="customer_id", how="left_semi")
# Equivalent to: WHERE customer_id IN (SELECT customer_id FROM df2)
# --- LEFT ANTI JOIN (rows in df1 that do NOT have a match in df2) ---
result = df1.join(df2, on="customer_id", how="left_anti")
# Equivalent to: WHERE customer_id NOT IN (SELECT customer_id FROM df2)
# --- JOIN WITH DIFFERENT COLUMN NAMES ---
result = df1.join(df2, df1["cust_id"] == df2["customer_id"], how="inner")
# After join, drop the duplicate column:
result = result.drop(df2["customer_id"])
# --- BROADCAST JOIN (force small table broadcast — no shuffle) ---
result = large_df.join(broadcast(small_df), on="airport_code")
# Spark sends the small table to every executor (avoids shuffle of large table)
# Default broadcast threshold: 10 MB (spark.sql.autoBroadcastJoinThreshold)
| Join Type | Left rows | Right rows | When no match | Use case |
|---|---|---|---|---|
| inner | Matched | Matched | Row dropped | Only common records |
| left | ALL | Matched | Right cols = NULL | Keep all from primary table |
| right | Matched | ALL | Left cols = NULL | Keep all from lookup table |
| full | ALL | ALL | Opposite side = NULL | Merge two full datasets |
| cross | ALL x ALL | ALL x ALL | N/A (cartesian) | Generate all combinations |
| left_semi | Matched | NONE | Row dropped | EXISTS / IN subquery |
| left_anti | Unmatched | NONE | Row kept | NOT EXISTS / NOT IN subquery |
What NOT to say: "Semi join returns columns from both tables." It only returns columns from the LEFT table. Also never say "I always use inner join" — show awareness of when to use left/anti/semi.
SECTION 9: Window Functions
Definition: Window functions perform calculations across a set of rows (a "window") that are related to the current row, without collapsing rows like GROUP BY does.
Simple Explanation: GROUP BY gives you one row per group. Window functions give you one result per ROW, but that result is computed from a group of related rows. You keep all your original rows.
Real-world Analogy: In a race, GROUP BY tells you "the fastest time per age group." A window function tells you "each runner's rank within their age group" — every runner keeps their row, but now has a rank attached.
Defining a Window
from pyspark.sql.window import Window
from pyspark.sql.functions import (
row_number, rank, dense_rank, lag, lead,
sum, avg, max, min, count,
ntile, percent_rank, cume_dist,
first, last
)
# --- DEFINE A WINDOW ---
windowSpec = Window \
.partitionBy("department") \
.orderBy(col("salary").desc())
Ranking Functions (row_number, rank, dense_rank)
Definition: Ranking functions assign a position number to each row within a partition based on the ordering.
df = df.withColumn("row_num", row_number().over(windowSpec)) # 1,2,3,4,5 (no ties)
df = df.withColumn("rank", rank().over(windowSpec)) # 1,2,2,4,5 (ties skip)
df = df.withColumn("drank", dense_rank().over(windowSpec)) # 1,2,2,3,4 (ties no skip)
| Salary | row_number | rank | dense_rank |
|---|---|---|---|
| 100 | 1 | 1 | 1 |
| 90 | 2 | 2 | 2 |
| 90 | 3 | 2 | 2 |
| 80 | 4 | 4 | 3 |
| 70 | 5 | 5 | 4 |
Interview tip: for top-N per group, use row_number() or dense_rank() with a window function, filter by rank, then drop the rank column. Know the difference: row_number breaks ties arbitrarily, rank skips numbers after ties, dense_rank never skips.
What NOT to say: "I use GROUP BY with LIMIT to get top N per group." That does not work — LIMIT applies globally, not per group. You MUST use a window function.
TOP N Per Group Pattern
# Top 2 highest-paid employees per department
window = Window.partitionBy("department").orderBy(col("salary").desc())
top2 = df.withColumn("rank", dense_rank().over(window)) \
.filter(col("rank") <= 2) \
.drop("rank")
lag / lead (Access Previous/Next Row)
Definition: lag() accesses a value from a previous row; lead() accesses a value from the next row, within the window partition.
lag_spec = Window.partitionBy("customer_id").orderBy("purchase_date")
df = df.withColumn("prev_purchase", lag("amount", 1, 0).over(lag_spec))
# column, offset, default_if_null
df = df.withColumn("next_purchase", lead("amount", 1).over(lag_spec))
# Month-over-month change
df = df.withColumn("mom_change",
round((col("revenue") - lag("revenue", 1).over(lag_spec))
/ lag("revenue", 1).over(lag_spec) * 100, 2))
Running Totals and Rolling Averages
Definition: A running total accumulates values from the beginning of the partition to the current row. A rolling average computes the average over a fixed sliding window of rows.
# --- RUNNING TOTAL ---
running_spec = Window.partitionBy("region") \
.orderBy("sale_date") \
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn("running_total", sum("revenue").over(running_spec))
# --- ROLLING 7-DAY AVERAGE ---
rolling_spec = Window.partitionBy("user_id") \
.orderBy("event_date") \
.rowsBetween(-6, 0) # current + 6 before = 7 rows
df = df.withColumn("rolling_7day_avg", avg("daily_count").over(rolling_spec))
| Frame Type | Syntax | Based on |
|---|---|---|
| ROWS | rowsBetween(-2, 0) | Physical row positions |
| RANGE | rangeBetween(-7, 0) | Logical value range |
| Unbounded | Window.unboundedPreceding | From start of partition |
| Current | Window.currentRow | Current row |
ntile / percentile / first / last
# --- NTILE (divide into N equal buckets) ---
quartile_spec = Window.orderBy("spend")
df = df.withColumn("quartile", ntile(4).over(quartile_spec))
df = df.withColumn("percentile", percent_rank().over(quartile_spec))
# --- FIRST / LAST ---
first_spec = Window.partitionBy("customer_id").orderBy("purchase_date")
df = df.withColumn("first_purchase_amount",
first("amount", ignorenulls=True).over(first_spec))
What NOT to say: "Window functions are the same as GROUP BY." They are fundamentally different — GROUP BY collapses rows, window functions preserve every row. Also do not say "rowsBetween and rangeBetween are the same" — rows is physical position, range is logical value.
SECTION 10: Temp Views and Running SQL
Definition: A temporary view registers a DataFrame as a named table that can be queried with SQL. Session-scoped views are visible only within the current SparkSession; global views are visible across all sessions in the application.
Simple Explanation: Creating a temp view is like giving your DataFrame a table name so you can write SQL queries against it. The data is not copied — it is just a reference.
# --- SESSION-SCOPED VIEW (most common) ---
df.createOrReplaceTempView("bookings")
# --- GLOBAL VIEW (visible across sessions) ---
df.createOrReplaceGlobalTempView("bookings")
# Access global views with: global_temp.bookings
# --- RUNNING SQL ---
result = spark.sql("""
SELECT
country,
COUNT(*) as booking_count,
SUM(amount) as total_revenue,
ROUND(AVG(amount), 2) as avg_booking
FROM bookings
WHERE booking_date >= '2024-01-01'
AND is_cancelled = false
GROUP BY country
HAVING COUNT(*) > 100
ORDER BY total_revenue DESC
""")
# --- SQL WITH WINDOW FUNCTIONS ---
spark.sql("""
SELECT *,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank
FROM employees
""")
# --- MIXING SQL AND DATAFRAME API ---
spark.sql("SELECT * FROM bookings WHERE amount > 1000").groupBy("country").count()
Interview tip: createOrReplaceTempView is session-scoped and disappears when the session ends. For shared views across notebooks in Databricks, use createOrReplaceGlobalTempView, accessed via the global_temp database.
What NOT to say: "Temp views persist after the application ends." They do not — they live only as long as the SparkSession. Also do not confuse temp views with managed/external tables, which persist in the metastore.
SECTION 11: UDFs (User-Defined Functions)
Definition: A UDF is a custom function that extends Spark's built-in functions, allowing you to apply arbitrary Python/Scala logic to DataFrame columns.
Simple Explanation: When Spark's built-in functions cannot do what you need, you write your own function and register it as a UDF. But UDFs are slow because data must be serialized between the JVM and Python.
Real-world Analogy: Built-in Spark functions are like the kitchen tools already in a restaurant. A UDF is bringing your own tool from home — it works, but it slows everything down because the kitchen was not designed for it.
Regular Python UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType, ArrayType
# --- REGULAR PYTHON UDF ---
def clean_phone(phone):
"""Remove non-digit chars from phone number"""
import re
return re.sub(r'\D', '', phone) if phone else None
# Register as UDF
clean_phone_udf = udf(clean_phone, StringType())
# Use with DataFrame API:
df = df.withColumn("clean_phone", clean_phone_udf(col("phone")))
# Register for SQL use:
spark.udf.register("clean_phone", clean_phone, StringType())
spark.sql("SELECT clean_phone(phone) FROM customers")
UDF with Decorator
@udf(returnType=DoubleType())
def calculate_tax(amount, rate):
return amount * rate if amount and rate else 0.0
df = df.withColumn("tax", calculate_tax(col("amount"), col("tax_rate")))
Pandas UDF (Vectorized — Much Faster)
# Uses Apache Arrow for zero-copy data transfer
# Processes data in COLUMNAR BATCHES (not row by row)
# 3x to 100x faster than regular Python UDF!
from pyspark.sql.functions import pandas_udf
import pandas as pd
@pandas_udf(StringType())
def normalize_name(names: pd.Series) -> pd.Series:
"""Vectorized: processes entire column as pandas Series"""
return names.str.strip().str.title().fillna("Unknown")
df = df.withColumn("clean_name", normalize_name(col("name")))
# Pandas UDF with multiple columns:
@pandas_udf(DoubleType())
def calculate_discount(amount: pd.Series, tier: pd.Series) -> pd.Series:
discount_map = {"gold": 0.15, "silver": 0.10, "bronze": 0.05}
discount = tier.map(discount_map).fillna(0)
return amount * discount
df = df.withColumn("discount", calculate_discount(col("amount"), col("tier")))
| Type | Speed | Why |
|---|---|---|
| Built-in Spark functions | Fastest | No serialization, Catalyst optimized, codegen |
| Pandas UDF (vectorized) | Fast | Arrow-based batch transfer, columnar |
| Regular Python UDF | Slowest | Row-by-row serialization, Python<->JVM overhead |
What NOT to say: "UDFs are fine for production at scale." They should be a last resort. Also do not say "Pandas UDFs and regular UDFs have the same performance" — Pandas UDFs are 3x-100x faster due to Apache Arrow.
SECTION 12: Writing Files
Definition: df.write is the DataFrameWriter API that saves a DataFrame to external storage in various formats with configurable write modes and partitioning.
# --- WRITE MODES ---
# overwrite: replace existing data entirely
# append: add to existing data
# ignore: don't write if destination already exists
# error / errorIfExists: fail if destination exists (default)
df.write.mode("overwrite").parquet("/output/bookings/")
# --- WRITE PARTITIONED (Hive-style partitioning) ---
df.write \
.mode("overwrite") \
.partitionBy("booking_year", "booking_month") \
.parquet("/output/bookings_partitioned/")
# Creates: /output/bookings_partitioned/booking_year=2024/booking_month=01/part-00000.parquet
# --- WRITE BUCKETED (for joins -- avoids shuffle!) ---
df.write \
.mode("overwrite") \
.bucketBy(32, "customer_id") \
.sortBy("customer_id") \
.saveAsTable("default.bookings_bucketed")
# When two tables bucketed by same column + same N buckets: JOIN has NO SHUFFLE!
# --- WRITE AS DELTA ---
df.write \
.format("delta") \
.mode("overwrite") \
.save("/delta/bookings/")
# --- WRITE CSV WITH OPTIONS ---
df.write \
.option("header", "true") \
.option("delimiter", "|") \
.mode("overwrite") \
.csv("/output/bookings.csv")
# --- WRITE TO JDBC ---
df.write \
.format("jdbc") \
.option("url", "jdbc:postgresql://db:5432/mydb") \
.option("dbtable", "bookings") \
.option("user", "user") \
.option("password", "pass") \
.mode("append") \
.save()
# --- CONTROL NUMBER OF OUTPUT FILES ---
df.repartition(10).write.parquet("/output/") # exactly 10 files
df.coalesce(1).write.csv("/output/single_file/") # single file (avoid for large data)
| Mode | If path exists | If path does NOT exist |
|---|---|---|
| overwrite | Replaces all data | Creates new |
| append | Adds to existing data | Creates new |
| ignore | Does nothing (no error) | Creates new |
| error (default) | Throws error | Creates new |
Interview tip: mention partitionBy for write optimization — it enables partition pruning on reads. Mention that bucketBy only works with saveAsTable, not save(). For controlling output file count, use repartition() before write.
What NOT to say: "I use coalesce(1) to write a single file in production." This forces all data through one partition/core and is extremely slow for large datasets. Only acceptable for small lookup files.
SECTION 13: Handling NULL Values
Definition: NULL represents a missing or unknown value. PySpark provides isNull(), isNotNull(), na.fill(), na.drop(), and coalesce() for null handling.
Simple Explanation: NULLs are empty cells in your data. You need to decide: drop the row, fill with a default value, or handle them in your logic. Ignoring nulls causes silent bugs.
Real-world Analogy: A survey form with blank answers. You can throw away incomplete forms (dropna), write "N/A" in blank fields (fillna), or use the first available answer from a backup list (coalesce).
from pyspark.sql.functions import col, coalesce, lit, isnull
# --- FINDING NULLS ---
df.filter(col("amount").isNull()) # rows where amount is null
df.filter(col("amount").isNotNull()) # rows where amount is not null
# --- DROPPING ROWS WITH NULLS ---
df.na.drop() # drop rows with ANY null
df.na.drop(how="all") # drop only if ALL columns null
df.na.drop(subset=["customer_id", "amount"]) # only check these columns
df.na.drop(thresh=3) # keep rows with at least 3 non-null values (when thresh is set, how is ignored)
# --- FILLING NULLS ---
df.na.fill(0) # fill all numeric nulls with 0
df.na.fill("") # fill all string nulls with empty string
df.na.fill({"amount": 0, "country": "UNKNOWN"}) # per-column fill
# --- COALESCE (first non-null value from multiple columns) ---
df.withColumn("phone",
coalesce(col("mobile_phone"), col("home_phone"), col("work_phone"), lit("N/A"))
)
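coalesce picks the first non-null argument per row. The same idea in plain Python (a sketch of the semantics, not Spark code):

```python
def first_non_null(*values, default=None):
    # Return the first value that is not None, like SQL COALESCE
    return next((v for v in values if v is not None), default)

print(first_non_null(None, None, "555-1234", default="N/A"))  # 555-1234
print(first_non_null(None, None, default="N/A"))              # N/A
```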
# --- SAFE DIVISION (avoid null/zero errors) ---
from pyspark.sql.functions import when
df.withColumn("rate",
when(col("impressions") != 0,
col("clicks") / col("impressions"))
.otherwise(None))
# --- NULL-SAFE EQUALITY (<=>) ---
# Regular ==: NULL == NULL returns NULL (not True!)
# Null-safe <=>: NULL <=> NULL returns True
df.filter(col("a").eqNullSafe(col("b")))
NULL == NULL returns NULL, not True. This means NULL join keys never match. Always mention eqNullSafe or <=> for null-safe comparisons. In aggregations, nulls are silently ignored — know this for accurate counts.
What NOT to say: "NULL equals NULL in Spark." It does not. NULL == NULL evaluates to NULL, which is falsy. This is standard SQL three-valued logic.
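Why NULL == NULL is not True: SQL uses three-valued logic, where any comparison involving NULL yields NULL (unknown). A plain-Python sketch with None standing in for NULL:

```python
def sql_eq(a, b):
    """Regular == under three-valued logic: True, False, or None (unknown)."""
    if a is None or b is None:
        return None                 # any comparison with NULL is NULL
    return a == b

def sql_eq_null_safe(a, b):
    """Null-safe equality (<=> / eqNullSafe): NULL <=> NULL is True."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(sql_eq(None, None))            # None -> falsy, so the row is filtered out
print(sql_eq_null_safe(None, None))  # True -> rows with NULL keys can match
```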
SECTION 14: String Functions
Definition: PySpark provides built-in string functions for text manipulation including concatenation, trimming, pattern matching, and extraction.
from pyspark.sql.functions import (
concat, concat_ws, substring, trim, ltrim, rtrim,
lower, upper, initcap, length, lpad, rpad,
regexp_replace, regexp_extract, split,
instr, locate, translate, reverse,
col, lit
)
# --- CONCATENATION ---
df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))
df.withColumn("full_name", concat_ws(" ", col("first_name"), col("middle"), col("last_name")))
# concat_ws skips NULLs; concat returns NULL if any input is NULL
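The concat vs concat_ws NULL behavior, mimicked in plain Python (a semantics sketch, not Spark code):

```python
def sql_concat(*parts):
    # concat: NULL-propagating -- any None input makes the whole result None
    return None if any(p is None for p in parts) else "".join(parts)

def sql_concat_ws(sep, *parts):
    # concat_ws: skips None inputs, joins the rest with the separator
    return sep.join(p for p in parts if p is not None)

print(sql_concat("Alice", None, "Smith"))          # None
print(sql_concat_ws(" ", "Alice", None, "Smith"))  # Alice Smith
```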
# --- CASE CONVERSION ---
df.withColumn("upper_name", upper(col("name")))
df.withColumn("lower_name", lower(col("name")))
df.withColumn("title_name", initcap(col("name"))) # "alice smith" -> "Alice Smith"
# --- TRIMMING ---
df.withColumn("clean", trim(col("name"))) # both sides
df.withColumn("clean", ltrim(col("name"))) # left only
df.withColumn("clean", rtrim(col("name"))) # right only
# --- SUBSTRING ---
df.withColumn("area_code", substring(col("phone"), 1, 3)) # first 3 chars (1-indexed)
# --- REGEX REPLACE ---
df.withColumn("clean_phone", regexp_replace(col("phone"), r"[^0-9]", ""))
df.withColumn("no_special", regexp_replace(col("text"), r"[^a-zA-Z0-9 ]", ""))
# --- REGEX EXTRACT ---
df.withColumn("domain", regexp_extract(col("email"), r"@(.+)", 1))
# Group 1 extracts the domain from "user@example.com" -> "example.com"
# --- SPLIT ---
df.withColumn("parts", split(col("full_name"), " ")) # returns array
df.withColumn("first", split(col("full_name"), " ")[0]) # first element
df.withColumn("last", split(col("full_name"), " ")[1]) # second element
# --- PADDING ---
df.withColumn("padded_id", lpad(col("id"), 10, "0")) # "42" -> "0000000042"
# --- LENGTH ---
df.withColumn("name_len", length(col("name")))
| Function | Example Input | Output |
|---|---|---|
| upper("hello") | "hello" | "HELLO" |
| lower("HELLO") | "HELLO" | "hello" |
| initcap("hello") | "hello world" | "Hello World" |
| trim(" hi ") | " hi " | "hi" |
| substring(s,1,3) | "abcdef" | "abc" |
| length("hello") | "hello" | 5 |
| lpad("42",5,"0") | "42" | "00042" |
| concat_ws("-", "a", NULL) | "a", NULL | "a" (nulls skipped) |
| concat("a", NULL) | "a", NULL | NULL |
concat and concat_ws: concat returns NULL if ANY input is NULL, while concat_ws (with separator) skips NULLs. This is frequently asked.
What NOT to say: "I use Python string operations in a UDF for text manipulation." Always use built-in Spark string functions — they are optimized by Catalyst and avoid serialization overhead.
SECTION 15: Date Functions
Definition: PySpark provides built-in functions for date/timestamp creation, extraction, arithmetic, and formatting.
from pyspark.sql.functions import (
current_date, current_timestamp,
datediff, months_between, date_add, date_sub,
date_format, to_date, to_timestamp,
year, month, dayofmonth, dayofweek, dayofyear,
hour, minute, second, weekofyear, quarter,
last_day, next_day, trunc, date_trunc,
col, lit
)
# --- CURRENT DATE/TIMESTAMP ---
df.withColumn("today", current_date())
df.withColumn("now", current_timestamp())
# --- STRING TO DATE/TIMESTAMP ---
df.withColumn("parsed_date", to_date(col("date_str"), "yyyy-MM-dd"))
df.withColumn("parsed_ts", to_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))
# Common formats: "yyyy-MM-dd", "MM/dd/yyyy", "dd-MMM-yyyy", "yyyyMMdd"
# --- DATE FORMATTING ---
df.withColumn("formatted", date_format(col("booking_date"), "MMM dd, yyyy"))
# "2024-03-15" -> "Mar 15, 2024"
df.withColumn("year_month", date_format(col("booking_date"), "yyyy-MM"))
# --- DATE ARITHMETIC ---
df.withColumn("next_week", date_add(col("booking_date"), 7))
df.withColumn("last_week", date_sub(col("booking_date"), 7))
df.withColumn("days_diff", datediff(col("end_date"), col("start_date")))
df.withColumn("months_diff", months_between(col("end_date"), col("start_date")))
# --- EXTRACTING PARTS ---
df.withColumn("year", year(col("booking_date")))
df.withColumn("month", month(col("booking_date")))
df.withColumn("day", dayofmonth(col("booking_date")))
df.withColumn("dow", dayofweek(col("booking_date"))) # 1=Sunday, 7=Saturday
df.withColumn("quarter", quarter(col("booking_date")))
df.withColumn("week", weekofyear(col("booking_date")))
# --- TRUNCATION ---
df.withColumn("month_start", trunc(col("booking_date"), "month")) # first day of month
df.withColumn("year_start", trunc(col("booking_date"), "year")) # first day of year
df.withColumn("hour_start", date_trunc("hour", col("timestamp_col")))
# --- LAST DAY OF MONTH ---
df.withColumn("month_end", last_day(col("booking_date")))
# --- NEXT SPECIFIC DAY ---
df.withColumn("next_monday", next_day(col("booking_date"), "Monday"))
| Function | Example | Output |
|---|---|---|
| current_date() | - | 2024-03-15 |
| datediff(end, start) | (Mar 15, Mar 10) | 5 |
| date_add(date, 7) | (Mar 15, 7) | Mar 22 |
| months_between(end, start) | (Jun 15, Mar 15) | 3.0 |
| year(date) | Mar 15, 2024 | 2024 |
| date_format(date, "MMM yyyy") | 2024-03-15 | "Mar 2024" |
| to_date(str, "yyyy-MM-dd") | "2024-03-15" | Date object |
| trunc(date, "month") | 2024-03-15 | 2024-03-01 |
| last_day(date) | 2024-03-15 | 2024-03-31 |
datediff(end, start) takes the end date FIRST, which is counterintuitive. Also know that dayofweek returns 1 for Sunday (not Monday), which is a common gotcha.
What NOT to say: "I use Python datetime in a UDF for date calculations." Always use built-in Spark date functions. They are Catalyst-optimized and handle distributed data correctly.
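Both gotchas can be verified with Python's standard datetime. Note the numbering mismatch: Spark's dayofweek maps Sunday to 1, while Python's isoweekday() maps Monday to 1, hence the conversion below:

```python
from datetime import date

def datediff(end, start):
    # Spark's datediff(end, start): end date comes FIRST
    return (end - start).days

def spark_dayofweek(d):
    # Spark: 1=Sunday ... 7=Saturday; Python isoweekday(): 1=Monday ... 7=Sunday
    return d.isoweekday() % 7 + 1

print(datediff(date(2024, 3, 15), date(2024, 3, 10)))  # 5
print(spark_dayofweek(date(2024, 3, 17)))              # 1 (March 17, 2024 is a Sunday)
```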
SECTION 16: Nested Data — Struct, Array, Map
Definition: PySpark supports complex nested data types: StructType (nested objects), ArrayType (lists), and MapType (key-value pairs), commonly found in JSON data.
Simple Explanation: Real-world data is not always flat tables. JSON from APIs has nested objects (structs), lists (arrays), and dictionaries (maps). PySpark can read, query, and flatten all of these.
from pyspark.sql.functions import explode, explode_outer, col, flatten, map_keys, map_values
# --- NESTED STRUCT ---
# Schema: address STRUCT<city STRING, country STRING, zip STRING>
df.select(
col("id"),
col("address.city").alias("city"),
col("address.country").alias("country"),
col("address.zip").alias("zip")
)
# --- ARRAY COLUMN ---
# Schema: tags ARRAY<STRING>
# explode: creates one row per element (NULL arrays -> zero rows)
df.withColumn("tag", explode(col("tags"))) \
.select("id", "tag")
# explode_outer: creates one row per element (NULL arrays -> one row with NULL)
df.withColumn("tag", explode_outer(col("tags"))) \
.select("id", "tag")
# size: number of elements in array
from pyspark.sql.functions import size
df.withColumn("tag_count", size(col("tags")))
# contains: check if array contains value
from pyspark.sql.functions import array_contains
df.filter(array_contains(col("tags"), "business"))
# flatten: flatten nested arrays
from pyspark.sql.functions import flatten
df.withColumn("flat_tags", flatten(col("nested_tags")))
# --- MAP COLUMN ---
# Schema: metadata MAP<STRING, STRING>
# Get value by key:
df.withColumn("status", col("metadata")["status"])
df.withColumn("status", col("metadata").getItem("status"))
# Explode map into key-value rows:
# (explode on a MapType produces TWO columns, key and value, so use select, not withColumn)
df.select("id", explode(col("metadata"))) # columns: id, key, value
# --- READING NESTED JSON ---
json_data = """
{"booking_id": "B001", "customer": {"name": "Alice", "email": "a@b.com"}, "tags": ["biz", "premium"]}
"""
df = spark.read.json(spark.sparkContext.parallelize([json_data]))
# Flatten it:
df_flat = df.select(
col("booking_id"),
col("customer.name").alias("customer_name"),
col("customer.email").alias("customer_email"),
explode(col("tags")).alias("tag")
)
| Function | NULL array input | Empty array input | Output |
|---|---|---|---|
| explode | Drops row | Drops row | One row per element |
| explode_outer | Keeps row (NULL) | Keeps row (NULL) | One row per element |
| posexplode | Drops row | Drops row | Row + position index |
explode and explode_outer — explode drops null arrays silently, which can cause data loss. Use explode_outer when you want to preserve rows with null arrays.
What NOT to say: "I use UDFs to parse nested JSON." Spark handles nested JSON natively with dot notation for structs and explode for arrays. UDFs are unnecessary and slow for this.
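The explode vs explode_outer behavior from the table, as a plain-Python generator sketch (None stands in for a NULL array):

```python
def explode(rows, key):
    # One output row per array element; None or empty arrays DROP the row
    for row in rows:
        for item in (row[key] or []):
            yield {**row, key: item}

def explode_outer(rows, key):
    # Same, but None/empty arrays KEEP the row with a None element
    for row in rows:
        arr = row[key]
        if not arr:
            yield {**row, key: None}
        else:
            for item in arr:
                yield {**row, key: item}

rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": None}]
print(list(explode(rows, "tags")))        # id=2 silently dropped
print(list(explode_outer(rows, "tags")))  # id=2 kept with tags=None
```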
SECTION 17: Scenario-Based Interview Questions
Scenario 1: Reading 50 CSV files from multiple folders at once
What the interviewer is testing: Can you read multiple files efficiently with proper schema management and file tracking?
# Requirement: read all booking CSVs from 2023 and 2024 directories
# Add source file name column to track origin
from pyspark.sql.functions import input_file_name
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType
schema = StructType([
StructField("booking_id", StringType(), True),
StructField("amount", DoubleType(), True),
StructField("booking_date", DateType(), True)
])
df = spark.read \
.schema(schema) \
.option("header", "true") \
.csv(
"/data/bookings/2023/*/bookings_*.csv",
"/data/bookings/2024/*/bookings_*.csv"
) \
.withColumn("source_file", input_file_name())
print(f"Total records: {df.count()}")
print(f"Files loaded: {df.select('source_file').distinct().count()}")
Follow-up they will ask: "What if some files have extra columns?" Use unionByName(allowMissingColumns=True) or define a superset schema.
Scenario 2: Deduplicate CDC records keeping latest
What the interviewer is testing: Do you know the window function dedup pattern? This is the single most common PySpark interview coding question.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
# CDC table: same customer_id may appear multiple times (updates over time)
# Keep only the MOST RECENT record per customer_id
window = Window.partitionBy("customer_id").orderBy(col("updated_at").desc())
df_deduped = df \
.withColumn("rn", row_number().over(window)) \
.filter(col("rn") == 1) \
.drop("rn")
# In Databricks/Delta Lake: MERGE INTO handles this as SCD Type 1
Follow-up they will ask: "Why row_number() and not dense_rank()?" Because row_number() guarantees exactly 1 row per partition (breaks ties), while dense_rank() can return multiple rows if timestamps are identical.
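The same keep-latest-per-key logic in plain Python, handy for sanity-checking the window version on a small sample (a sketch, not a substitute for the distributed version):

```python
def dedupe_latest(rows, key, ts):
    # Keep the single most recent row per key, like filtering row_number() == 1
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row          # strict > keeps the first row seen on a tie
    return list(latest.values())

rows = [
    {"customer_id": "C1", "updated_at": 1, "status": "new"},
    {"customer_id": "C1", "updated_at": 3, "status": "active"},
    {"customer_id": "C2", "updated_at": 2, "status": "new"},
]
print(dedupe_latest(rows, "customer_id", "updated_at"))
```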
Scenario 3: Self-join to find pairs
What the interviewer is testing: Can you handle self-joins and avoid duplicate/self pairs?
# Find all flight booking pairs for the same customer on the same day
bookings1 = df.alias("b1")
bookings2 = df.alias("b2")
pairs = bookings1.join(bookings2,
(col("b1.customer_id") == col("b2.customer_id")) &
(col("b1.booking_date") == col("b2.booking_date")) &
(col("b1.booking_id") < col("b2.booking_id")) # avoid duplicates and self-pairs
).select(
col("b1.booking_id").alias("booking_1"),
col("b2.booking_id").alias("booking_2"),
col("b1.customer_id"),
col("b1.booking_date")
)
Follow-up they will ask: "Why < and not !=?" Using < ensures each pair appears only once (A,B but not B,A), while != would give both (A,B) and (B,A).
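The `<` trick corresponds to taking unordered combinations within each group. A plain-Python equivalent (illustrative sketch; field names match the example above):

```python
from itertools import combinations
from collections import defaultdict

def booking_pairs(rows):
    # Group by (customer_id, booking_date), then emit each unordered pair once
    groups = defaultdict(list)
    for r in rows:
        groups[(r["customer_id"], r["booking_date"])].append(r["booking_id"])
    for (cust, day), ids in groups.items():
        for a, b in combinations(sorted(ids), 2):  # a < b: no dupes, no self-pairs
            yield (a, b, cust, day)

rows = [
    {"booking_id": "B1", "customer_id": "C1", "booking_date": "2024-03-15"},
    {"booking_id": "B2", "customer_id": "C1", "booking_date": "2024-03-15"},
    {"booking_id": "B3", "customer_id": "C2", "booking_date": "2024-03-15"},
]
print(list(booking_pairs(rows)))  # [('B1', 'B2', 'C1', '2024-03-15')]
```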
Scenario 4: Find customers who bought in January but NOT in February
What the interviewer is testing: Do you know anti joins?
jan_customers = df.filter(month(col("purchase_date")) == 1).select("customer_id").distinct()
feb_customers = df.filter(month(col("purchase_date")) == 2).select("customer_id").distinct()
# Anti join: customers in Jan who are NOT in Feb
jan_only = jan_customers.join(feb_customers, on="customer_id", how="left_anti")
Follow-up they will ask: "How would you do this in SQL?"
SELECT DISTINCT customer_id
FROM purchases
WHERE MONTH(purchase_date) = 1
AND customer_id NOT IN (
SELECT DISTINCT customer_id FROM purchases WHERE MONTH(purchase_date) = 2
)
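On distinct keys, a left-anti join is just a set difference, as this plain-Python check shows. One caveat worth raising in the SQL version: if the NOT IN subquery returns any NULL customer_id, NOT IN matches no rows at all (three-valued logic), so NOT EXISTS or the anti join is safer:

```python
jan = {"C1", "C2", "C3"}   # customers who bought in January
feb = {"C2"}               # customers who bought in February

jan_only = jan - feb       # left_anti: in jan, not in feb
print(sorted(jan_only))    # ['C1', 'C3']
```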
Scenario 5: Running total of revenue by date per region
What the interviewer is testing: Window functions with frame specification.
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as spark_sum # alias avoids shadowing Python's built-in sum
window = Window.partitionBy("region") \
.orderBy("sale_date") \
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_running = df.withColumn("running_revenue", spark_sum("revenue").over(window))
Follow-up they will ask: "What if you want a 30-day rolling sum instead of a running total?" Change the frame to .rowsBetween(-29, 0) for row-based, or use rangeBetween with days cast to numeric for date-based windows.
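Both frames are easy to picture on a single region's ordered rows in plain Python (a semantics sketch, not Spark code):

```python
from itertools import accumulate

revenue = [10, 20, 5, 15]  # one region, ordered by sale_date

# Running total: rowsBetween(unboundedPreceding, currentRow)
print(list(accumulate(revenue)))  # [10, 30, 35, 50]

# Row-based rolling sum over the last 3 rows: rowsBetween(-2, 0)
rolling = [sum(revenue[max(0, i - 2): i + 1]) for i in range(len(revenue))]
print(rolling)  # [10, 30, 35, 40]
```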
Scenario 6: Replace nulls with the previous non-null value (forward fill)
What the interviewer is testing: Advanced window function usage.
from pyspark.sql.window import Window
from pyspark.sql.functions import last, col
window = Window.partitionBy("sensor_id") \
.orderBy("timestamp") \
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_filled = df.withColumn("value_filled",
last("value", ignorenulls=True).over(window)
)
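What last("value", ignorenulls=True) over that running frame computes, shown on a single sensor's ordered values in plain Python (a semantics sketch — leading nulls stay null because there is no previous value to carry forward):

```python
def forward_fill(values):
    # Carry the last non-None value forward, like last(..., ignorenulls=True)
    # over rowsBetween(unboundedPreceding, currentRow)
    out, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        out.append(last_seen)
    return out

print(forward_fill([None, 3, None, None, 7, None]))  # [None, 3, 3, 3, 7, 7]
```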