3-Day PySpark Interview Prep
WHY PYSPARK MATTERS FOR YOUR INTERVIEW
- Almost every data engineering role today requires Spark knowledge
- Senior engineers are asked about internals (DAG, stages, shuffle), not just "what is Spark"
- Coding questions are common: interviewers will ask you to write PySpark code live
- Optimization questions are the #1 differentiator at senior level
- AQE (Adaptive Query Execution) is a hot 2024-2026 topic
3-DAY SCHEDULE
Memory Map
DAY 1 (5-6 hours): ARCHITECTURE + RDD
Spark Architecture (Driver, Executor, Cluster Manager)
SparkContext vs SparkSession
DAG (Directed Acyclic Graph) — stages, tasks
Lazy Evaluation — transformations vs actions
Narrow vs Wide Transformations
RDD: what it is, 5 properties
RDD Transformations (map, flatMap, filter, reduceByKey, groupByKey)
RDD Actions (collect, count, take, reduce, foreach)
RDD Persistence (cache, persist, storage levels)
RDD Lineage + Checkpointing
Broadcast Variables + Accumulators
map() vs flatMap() vs mapPartitions()
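The last item on the list trips people up in live coding rounds. A pure-Python sketch of the three semantics (a list of lists stands in for an RDD's partitions; the data and names are illustrative, not the Spark API itself):

```python
# Pure-Python sketch of RDD map / flatMap / mapPartitions semantics.
# A list of lists stands in for an RDD split into partitions.

partitions = [[1, 2], [3, 4, 5]]  # two "partitions"

# map: exactly one output element per input element
mapped = [[x * 10 for x in part] for part in partitions]

# flatMap: each element may produce 0..n outputs, flattened per partition
flat = [[y for x in part for y in (x, x)] for part in partitions]

# mapPartitions: the function sees a whole partition (an iterator) at
# once — useful for per-partition setup like opening a DB connection
def per_partition(iterator):
    yield sum(iterator)  # one pass over the whole partition

per_part = [list(per_partition(iter(part))) for part in partitions]

print(mapped)    # [[10, 20], [30, 40, 50]]
print(flat)      # [[1, 1, 2, 2], [3, 3, 4, 4, 5, 5]]
print(per_part)  # [[3], [12]]
```

The interview point: `mapPartitions` amortizes expensive setup over a whole partition instead of paying it per element, at the cost of holding the partition-level logic yourself.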
DAY 2 (5-6 hours): DATAFRAME + SPARK SQL
RDD vs DataFrame vs Dataset — when to use each
SparkSession creation + config
Reading multiple file formats (CSV, JSON, Parquet, ORC, JDBC, Delta)
Reading multiple files from multiple sources at once
Schema: inferSchema vs explicit StructType
Core transformations: select, filter, groupBy, agg, join, union
withColumn, withColumnRenamed, drop, alias
Handling NULLs: na.drop, na.fill, isNull, isNotNull
Window Functions (ROW_NUMBER, LAG, LEAD, RANK, running totals)
UDFs: Python UDF vs Pandas UDF (Arrow-based)
Joins: types and hints
Writing: partitionBy, bucketBy, saveAsTable, write modes
explode, flatten nested JSON, struct, array columns
Spark SQL: createTempView, sql()
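Window functions are the most common Day 2 coding question. A plain-Python sketch of what ROW_NUMBER and LAG compute for "partition by dept, order by salary desc" — the data is made up, and in PySpark you would express this with `pyspark.sql.Window` instead:

```python
# Plain-Python sketch of ROW_NUMBER and LAG semantics:
# partition by dept, order by salary desc. Illustrative data only.
from itertools import groupby
from operator import itemgetter

rows = [
    {"dept": "eng", "name": "a", "salary": 100},
    {"dept": "eng", "name": "b", "salary": 120},
    {"dept": "hr",  "name": "c", "salary": 90},
]

out = []
for dept, grp in groupby(sorted(rows, key=itemgetter("dept")),
                         key=itemgetter("dept")):
    # order rows within the partition, then number them
    ordered = sorted(grp, key=itemgetter("salary"), reverse=True)
    prev_salary = None  # LAG: value from the previous row in the partition
    for i, r in enumerate(ordered, start=1):
        out.append({**r, "row_number": i, "lag_salary": prev_salary})
        prev_salary = r["salary"]

for r in out:
    print(r)  # b: rn=1 lag=None; a: rn=2 lag=120; c: rn=1 lag=None
```

Knowing the semantics this concretely makes the PySpark version (`row_number().over(Window.partitionBy("dept").orderBy(desc("salary")))`) easy to reconstruct under pressure.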
DAY 3 (5-6 hours): OPTIMIZATION + PERFORMANCE
Catalyst Optimizer: 4 phases (Analysis → Logical Opt → Physical Plan → CodeGen)
Tungsten Engine (off-heap, codegen, vectorized execution)
Predicate Pushdown — when it works and when it doesn't
repartition() vs coalesce()
spark.sql.shuffle.partitions — tuning
Join Strategies: Broadcast Hash, Shuffle Hash, Sort-Merge, Broadcast Nested Loop (BNLJ), Cartesian
Broadcast join — threshold, hints
Data Skew — detection + solutions (salting, AQE, broadcast)
AQE (Adaptive Query Execution) — 3 main features
cache() vs persist() — storage levels
Checkpointing vs caching
groupByKey vs reduceByKey vs aggregateByKey
Small files problem and solutions
Spark configurations (memory, parallelism, serialization)
Spark UI — how to read it for debugging
Dynamic Allocation
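Salting is the skew fix interviewers most want explained. The idea: append a random salt to the hot key so its rows spread across several reducers, aggregate the salted keys, then strip the salt and aggregate once more. A pure-Python sketch (names and data are illustrative):

```python
# Pure-Python sketch of salting for data skew.
# The hot key "big" would otherwise land on one reducer; salting
# spreads it over SALTS buckets, then a second pass merges partials.
import random
from collections import defaultdict

random.seed(0)
SALTS = 4
records = [("big", 1)] * 10 + [("small", 1)] * 2

# Stage 1: salt the key, aggregate per salted key (parallel-friendly)
stage1 = defaultdict(int)
for key, value in records:
    salted = f"{key}#{random.randrange(SALTS)}"
    stage1[salted] += value

# Stage 2: strip the salt, merge the partial aggregates
stage2 = defaultdict(int)
for salted, partial in stage1.items():
    stage2[salted.split("#")[0]] += partial

print(dict(stage2))  # {'big': 10, 'small': 2} — same totals, skew spread out
```

The trade-off to mention: salting doubles the aggregation work, which is why AQE's automatic skew-join splitting is often preferred on Spark 3.x.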
FILES STRUCTURE
| File | Content |
|---|---|
| PySpark_00_PLAN.md | This plan |
| PySpark_01_Architecture_RDD.md | Deep: Architecture + RDD |
| PySpark_01_Quick_Recall.md | Flash cards: Architecture + RDD |
| PySpark_02_DataFrame_SparkSQL.md | Deep: DataFrame + SparkSQL + Reading/Writing |
| PySpark_02_Quick_Recall.md | Flash cards: DataFrame + SparkSQL |
| PySpark_03_Optimization.md | Deep: ALL optimizations + configs |
| PySpark_03_Quick_Recall.md | Flash cards + Ultra Cheat Sheet |
PRIORITY MATRIX
MUST KNOW — Will definitely be asked (60%)
- DAG, stages, tasks, lazy evaluation
- Narrow vs Wide transformations
- repartition() vs coalesce()
- groupByKey vs reduceByKey (performance!)
- cache() vs persist() storage levels
- Broadcast join — threshold, when to use
- AQE — 3 features (coalesce, join switching, skew)
- Python UDF vs Pandas UDF (performance)
- RDD vs DataFrame vs Dataset
- Data skew detection + salting fix
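The groupByKey vs reduceByKey item is really about shuffle volume: reduceByKey combines values per key on each map-side partition *before* the shuffle, while groupByKey ships every record across the network. A pure-Python sketch (record counts stand in for shuffle bytes; illustrative only):

```python
# Pure-Python sketch: why reduceByKey shuffles less than groupByKey.
from collections import defaultdict

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey: every (key, value) pair crosses the shuffle boundary
group_shuffled = sum(len(part) for part in partitions)

# reduceByKey: map-side combine first — one record per key per partition
combined = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    combined.append(list(local.items()))
reduce_shuffled = sum(len(part) for part in combined)

# The final result is identical either way
final = defaultdict(int)
for part in combined:
    for k, v in part:
        final[k] += v

print(group_shuffled, reduce_shuffled, dict(final))  # 6 4 {'a': 3, 'b': 3}
```

Here 6 records cross the shuffle with groupByKey but only 4 with reduceByKey; on real data with few keys and many rows the gap is dramatic.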
SHOULD KNOW — High probability (30%)
- Catalyst 4 phases
- Tungsten engine
- Spark UI reading
- Window functions + code
- Reading multiple sources at once
- Schema inference vs explicit StructType
- Predicate pushdown
- Checkpointing vs caching
- spark.sql.shuffle.partitions
- Broadcast variables + accumulators
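For the shuffle-partitions and broadcast items above (and the AQE features in the MUST KNOW list), the relevant configs look like this. This is a sketch: the keys are real Spark SQL configs, but the defaults shown are for Spark 3.x and the right values depend on your data volume and cluster.

```
spark.sql.shuffle.partitions=200                     # shuffle partition count; tune per data volume
spark.sql.adaptive.enabled=true                      # AQE master switch (on by default since 3.2);
                                                     # also governs runtime join-strategy switching
spark.sql.adaptive.coalescePartitions.enabled=true   # AQE: coalesce small shuffle partitions
spark.sql.adaptive.skewJoin.enabled=true             # AQE: split skewed partitions at join time
spark.sql.autoBroadcastJoinThreshold=10485760        # broadcast-join threshold (10 MB default)
```

A good senior-level line: "I rarely hand-tune shuffle.partitions on Spark 3.x — I set it generously and let AQE coalesce."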
NICE TO KNOW — Differentiators (10%)
- mapPartitions() vs map()
- combineByKey vs aggregateByKey internals
- Dynamic allocation configs
- Kryo serialization
- Off-heap memory (Tungsten)
THE 10-YEAR ENGINEER FRAMING
"I've used PySpark in production for [X] years. When someone says
'the job is slow', I don't guess — I open Spark UI, look at the
Stages tab for skewed tasks, check GC time on the Executors tab,
look at the SQL plan for Sort-Merge Joins that should be Broadcast,
and check if AQE is enabled. The answer is almost always one of:
data skew, too many/few partitions, wrong join strategy, or
DataFrame recomputed multiple times instead of being cached."
SPARK ECOSYSTEM OVERVIEW
📐 Architecture Diagram
```
┌─────────────────────────────────────────────────────────┐
│ SPARK ECOSYSTEM                                         │
├─────────────────────────────────────────────────────────┤
│ LANGUAGE APIs: Python (PySpark), Scala, Java, R         │
├─────────────────────────────────────────────────────────┤
│ HIGH-LEVEL APIs:                                        │
│   Spark SQL / DataFrame API (structured)                │
│   Streaming (Structured Streaming)                      │
├─────────────────────────────────────────────────────────┤
│ CORE ENGINE:                                            │
│   Catalyst Optimizer + Tungsten Execution Engine        │
│   RDD API (low-level)                                   │
├─────────────────────────────────────────────────────────┤
│ STORAGE: HDFS, S3, ADLS, Delta Lake, Iceberg            │
├─────────────────────────────────────────────────────────┤
│ CLUSTER: YARN, Kubernetes, Mesos, Standalone            │
└─────────────────────────────────────────────────────────┘
```