3-Day PySpark Interview Prep
WHY PYSPARK MATTERS FOR YOUR INTERVIEW
- Almost every data engineering role today requires Spark knowledge
- Senior engineers are asked about internals (DAG, stages, shuffle), not just "what is Spark"
- Coding questions are common: interviewers will ask you to write PySpark code live
- Optimization questions are the #1 differentiator at senior level
- AQE (Adaptive Query Execution) is a hot 2024-2026 topic
3-DAY SCHEDULE
Memory Map
DAY 1 (5-6 hours): ARCHITECTURE + RDD
Spark Architecture (Driver, Executor, Cluster Manager)
SparkContext vs SparkSession
DAG (Directed Acyclic Graph) — stages, tasks
Lazy Evaluation — transformations vs actions
Narrow vs Wide Transformations
RDD: what it is, 5 properties
RDD Transformations (map, flatMap, filter, reduceByKey, groupByKey)
RDD Actions (collect, count, take, reduce, foreach)
RDD Persistence (cache, persist, storage levels)
RDD Lineage + Checkpointing
Broadcast Variables + Accumulators
map() vs flatMap() vs mapPartitions()
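The last item on the list trips people up in live coding rounds. A pure-Python sketch of the three semantics (a list of lists stands in for an RDD's partitions; the data and names are illustrative, not the Spark API itself):

```python
# Pure-Python sketch of RDD map / flatMap / mapPartitions semantics.
# A list of lists stands in for an RDD split into partitions.

partitions = [[1, 2], [3, 4, 5]]  # two "partitions"

# map: exactly one output element per input element
mapped = [[x * 10 for x in part] for part in partitions]

# flatMap: each element may produce 0..n outputs, flattened per partition
flat = [[y for x in part for y in (x, x)] for part in partitions]

# mapPartitions: the function sees a whole partition (an iterator) at
# once — useful for per-partition setup like opening a DB connection
def per_partition(iterator):
    yield sum(iterator)  # one pass over the whole partition

per_part = [list(per_partition(iter(part))) for part in partitions]

print(mapped)    # [[10, 20], [30, 40, 50]]
print(flat)      # [[1, 1, 2, 2], [3, 3, 4, 4, 5, 5]]
print(per_part)  # [[3], [12]]
```

The interview point: `mapPartitions` amortizes expensive setup over a whole partition instead of paying it per element, at the cost of holding the partition-level logic yourself.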
DAY 2 (5-6 hours): DATAFRAME + SPARK SQL
RDD vs DataFrame vs Dataset — when to use each
SparkSession creation + config
Reading multiple file formats (CSV, JSON, Parquet, ORC, JDBC, Delta)
Reading multiple files from multiple sources at once
Schema: inferSchema vs explicit StructType
Core transformations: select, filter, groupBy, agg, join, union
withColumn, withColumnRenamed, drop, alias
Handling NULLs: na.drop, na.fill, isNull, isNotNull
Window Functions (ROW_NUMBER, LAG, LEAD, RANK, running totals)
UDFs: Python UDF vs Pandas UDF (Arrow-based)
Joins: types and hints
Writing: partitionBy, bucketBy, saveAsTable, write modes
explode, flatten nested JSON, struct, array columns
Spark SQL: createTempView, sql()
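Window functions are the most common Day 2 coding question. A plain-Python sketch of what ROW_NUMBER and LAG compute for "partition by dept, order by salary desc" — the data is made up, and in PySpark you would express this with `pyspark.sql.Window` instead:

```python
# Plain-Python sketch of ROW_NUMBER and LAG semantics:
# partition by dept, order by salary desc. Illustrative data only.
from itertools import groupby
from operator import itemgetter

rows = [
    {"dept": "eng", "name": "a", "salary": 100},
    {"dept": "eng", "name": "b", "salary": 120},
    {"dept": "hr",  "name": "c", "salary": 90},
]

out = []
for dept, grp in groupby(sorted(rows, key=itemgetter("dept")),
                         key=itemgetter("dept")):
    # order rows within the partition, then number them
    ordered = sorted(grp, key=itemgetter("salary"), reverse=True)
    prev_salary = None  # LAG: value from the previous row in the partition
    for i, r in enumerate(ordered, start=1):
        out.append({**r, "row_number": i, "lag_salary": prev_salary})
        prev_salary = r["salary"]

for r in out:
    print(r)  # b: rn=1 lag=None; a: rn=2 lag=120; c: rn=1 lag=None
```

Knowing the semantics this concretely makes the PySpark version (`row_number().over(Window.partitionBy("dept").orderBy(desc("salary")))`) easy to reconstruct under pressure.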
DAY 3 (5-6 hours): OPTIMIZATION + PERFORMANCE
Catalyst Optimizer: 4 phases (Analysis → Logical Opt → Physical Plan → CodeGen)
Tungsten Engine (off-heap, codegen, vectorized execution)
Predicate Pushdown — when it works and when it doesn't
repartition() vs coalesce()
spark.sql.shuffle.partitions — tuning
Join Strategies: Broadcast Hash, Shuffle Hash, Sort-Merge, Broadcast Nested Loop (BNLJ), Cartesian
Broadcast join — threshold, hints
Data Skew — detection + solutions (salting, AQE, broadcast)
AQE (Adaptive Query Execution) — 3 main features
cache() vs persist() — storage levels
Checkpointing vs caching
groupByKey vs reduceByKey vs aggregateByKey
Small files problem and solutions
Spark configurations (memory, parallelism, serialization)
Spark UI — how to read it for debugging
Dynamic Allocation
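Salting is the skew fix interviewers most want explained. The idea: append a random salt to the hot key so its rows spread across several reducers, aggregate the salted keys, then strip the salt and aggregate once more. A pure-Python sketch (names and data are illustrative):

```python
# Pure-Python sketch of salting for data skew.
# The hot key "big" would otherwise land on one reducer; salting
# spreads it over SALTS buckets, then a second pass merges partials.
import random
from collections import defaultdict

random.seed(0)
SALTS = 4
records = [("big", 1)] * 10 + [("small", 1)] * 2

# Stage 1: salt the key, aggregate per salted key (parallel-friendly)
stage1 = defaultdict(int)
for key, value in records:
    salted = f"{key}#{random.randrange(SALTS)}"
    stage1[salted] += value

# Stage 2: strip the salt, merge the partial aggregates
stage2 = defaultdict(int)
for salted, partial in stage1.items():
    stage2[salted.split("#")[0]] += partial

print(dict(stage2))  # {'big': 10, 'small': 2} — same totals, skew spread out
```

The trade-off to mention: salting doubles the aggregation work, which is why AQE's automatic skew-join splitting is often preferred on Spark 3.x.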
FILES STRUCTURE
| File | Content |
|---|---|
| PySpark_00_PLAN.md | This plan |
| PySpark_01_Architecture_RDD.md | Deep: Architecture + RDD |
| PySpark_01_Quick_Recall.md | Flash cards: Architecture + RDD |
| PySpark_02_DataFrame_SparkSQL.md | Deep: DataFrame + SparkSQL + Reading/Writing |
| PySpark_02_Quick_Recall.md | Flash cards: DataFrame + SparkSQL |
| PySpark_03_Optimization.md | Deep: ALL optimizations + configs |
| PySpark_03_Quick_Recall.md | Flash cards + Ultra Cheat Sheet |
PRIORITY MATRIX
MUST KNOW — Will definitely be asked (60%)
- DAG, stages, tasks, lazy evaluation
- Narrow vs Wide transformations
- repartition() vs coalesce()
- groupByKey vs reduceByKey (performance!)
- cache() vs persist() storage levels
- Broadcast join — threshold, when to use
- AQE — 3 features (coalesce, join switching, skew)
- Python UDF vs Pandas UDF (performance)
- RDD vs DataFrame vs Dataset
- Data skew detection + salting fix
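The groupByKey vs reduceByKey item is really about shuffle volume: reduceByKey combines values per key on each map-side partition *before* the shuffle, while groupByKey ships every record across the network. A pure-Python sketch (record counts stand in for shuffle bytes; illustrative only):

```python
# Pure-Python sketch: why reduceByKey shuffles less than groupByKey.
from collections import defaultdict

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey: every (key, value) pair crosses the shuffle boundary
group_shuffled = sum(len(part) for part in partitions)

# reduceByKey: map-side combine first — one record per key per partition
combined = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    combined.append(list(local.items()))
reduce_shuffled = sum(len(part) for part in combined)

# The final result is identical either way
final = defaultdict(int)
for part in combined:
    for k, v in part:
        final[k] += v

print(group_shuffled, reduce_shuffled, dict(final))  # 6 4 {'a': 3, 'b': 3}
```

Here 6 records cross the shuffle with groupByKey but only 4 with reduceByKey; on real data with few keys and many rows the gap is dramatic.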
SHOULD KNOW — High probability (30%)
- Catalyst 4 phases
- Tungsten engine
- Spark UI reading
- Window functions + code
- Reading multiple sources at once
- Schema inference vs explicit StructType
- Predicate pushdown
- Checkpointing vs caching
- spark.sql.shuffle.partitions
- Broadcast variables + accumulators
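For the shuffle-partitions and broadcast items above (and the AQE features in the MUST KNOW list), the relevant configs look like this. This is a sketch: the keys are real Spark SQL configs, but the defaults shown are for Spark 3.x and the right values depend on your data volume and cluster.

```
spark.sql.shuffle.partitions=200                     # shuffle partition count; tune per data volume
spark.sql.adaptive.enabled=true                      # AQE master switch (on by default since 3.2);
                                                     # also governs runtime join-strategy switching
spark.sql.adaptive.coalescePartitions.enabled=true   # AQE: coalesce small shuffle partitions
spark.sql.adaptive.skewJoin.enabled=true             # AQE: split skewed partitions at join time
spark.sql.autoBroadcastJoinThreshold=10485760        # broadcast-join threshold (10 MB default)
```

A good senior-level line: "I rarely hand-tune shuffle.partitions on Spark 3.x — I set it generously and let AQE coalesce."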
NICE TO KNOW — Differentiators (10%)
- mapPartitions() vs map()
- combineByKey vs aggregateByKey internals
- Dynamic allocation configs
- Kryo serialization
- Off-heap memory (Tungsten)
THE 10-YEAR ENGINEER FRAMING
"I've used PySpark in production for [X] years. When someone says
'the job is slow', I don't guess — I open Spark UI, look at the
Stages tab for skewed tasks, check GC time on the Executors tab,
look at the SQL plan for Sort-Merge Joins that should be Broadcast,
and check if AQE is enabled. The answer is almost always one of:
data skew, too many/few partitions, wrong join strategy, or
DataFrame recomputed multiple times instead of being cached."
SPARK ECOSYSTEM OVERVIEW
📐 Architecture Diagram
```
┌─────────────────────────────────────────────────────────┐
│ SPARK ECOSYSTEM                                         │
├─────────────────────────────────────────────────────────┤
│ LANGUAGE APIs: Python (PySpark), Scala, Java, R         │
├─────────────────────────────────────────────────────────┤
│ HIGH-LEVEL APIs:                                        │
│   Spark SQL / DataFrame API (structured)                │
│   Streaming (Structured Streaming)                      │
├─────────────────────────────────────────────────────────┤
│ CORE ENGINE:                                            │
│   Catalyst Optimizer + Tungsten Execution Engine        │
│   RDD API (low-level)                                   │
├─────────────────────────────────────────────────────────┤
│ STORAGE: HDFS, S3, ADLS, Delta Lake, Iceberg            │
├─────────────────────────────────────────────────────────┤
│ CLUSTER: YARN, Kubernetes, Mesos, Standalone            │
└─────────────────────────────────────────────────────────┘
```