PySpark
3-Day PySpark Interview Prep
PySpark · Section 1 of 8

WHY PYSPARK MATTERS FOR YOUR INTERVIEW

  • Almost every data engineering role today requires Spark knowledge
  • Senior candidates are asked about internals (DAG, stages, shuffle), not just "what is Spark"
  • Live-coding questions are common: expect to write PySpark code on the spot
  • Optimization questions are the #1 differentiator at senior level
  • AQE (Adaptive Query Execution) is a hot 2024-2026 topic

3-DAY SCHEDULE

🗺️Memory Map
DAY 1 (5-6 hours): ARCHITECTURE + RDD
Spark Architecture (Driver, Executor, Cluster Manager)
SparkContext vs SparkSession
DAG (Directed Acyclic Graph) — stages, tasks
Lazy Evaluation — transformations vs actions
Narrow vs Wide Transformations
RDD: what it is, 5 properties
RDD Transformations (map, flatMap, filter, reduceByKey, groupByKey)
RDD Actions (collect, count, take, reduce, foreach)
RDD Persistence (cache, persist, storage levels)
RDD Lineage + Checkpointing
Broadcast Variables + Accumulators
map() vs flatMap() vs mapPartitions()
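
A quick way to internalize the Day 1 transformation list: the semantics of map, flatMap, and reduceByKey can be sketched in plain Python. The PySpark versions compute the same thing, just distributed across partitions, with reduceByKey merging map-side before the shuffle (data values below are made up for illustration):

```python
from itertools import chain

def rdd_map(data, f):
    # map(): exactly one output element per input element
    return [f(x) for x in data]

def rdd_flat_map(data, f):
    # flatMap(): f returns an iterable per element; results are flattened
    return list(chain.from_iterable(f(x) for x in data))

def rdd_reduce_by_key(pairs, f):
    # reduceByKey(): merge values per key with an associative function.
    # Spark also runs this merge map-side before the shuffle, which is
    # why reduceByKey beats groupByKey for aggregations.
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

# Classic word count, the shape interviewers usually ask for:
lines = ["a b a", "b c"]
words = rdd_flat_map(lines, str.split)
pairs = rdd_map(words, lambda w: (w, 1))
counts = rdd_reduce_by_key(pairs, lambda x, y: x + y)
```

In PySpark the same pipeline is `sc.textFile(path).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)`, and because of lazy evaluation nothing executes until an action such as `collect()` or `count()` is called.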
DAY 2 (5-6 hours): DATAFRAME + SPARKSQL
RDD vs DataFrame vs Dataset — when to use each
SparkSession creation + config
Reading multiple file formats (CSV, JSON, Parquet, ORC, JDBC, Delta)
Reading multiple files from multiple sources at once
Schema: inferSchema vs explicit StructType
Core transformations: select, filter, groupBy, agg, join, union
withColumn, withColumnRenamed, drop, alias
Handling NULLs: na.drop, na.fill, isNull, isNotNull
Window Functions (ROW_NUMBER, LAG, LEAD, RANK, running totals)
UDFs: Python UDF vs Pandas UDF (Arrow-based)
Joins: types and hints
Writing: partitionBy, bucketBy, saveAsTable, write modes
explode, flatten nested JSON, struct, array columns
Spark SQL: createTempView, sql()
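
The most frequently asked Day 2 coding task is "top earner per department" with a window function. In PySpark that is `row_number().over(Window.partitionBy("dept").orderBy(F.desc("salary")))`; semantically it is just sort-within-group and enumerate, which this plain-Python sketch makes explicit (toy data, invented for illustration):

```python
from collections import defaultdict

rows = [  # (dept, name, salary)
    ("eng", "ana", 120), ("eng", "bo", 150),
    ("hr",  "cy",   90), ("hr",  "di", 110),
]

# Emulates: ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)
groups = defaultdict(list)
for dept, name, salary in rows:                 # PARTITION BY dept
    groups[dept].append((name, salary))

ranked = []
for dept, members in groups.items():
    members.sort(key=lambda m: m[1], reverse=True)          # ORDER BY salary DESC
    for rn, (name, salary) in enumerate(members, start=1):  # ROW_NUMBER()
        ranked.append((dept, name, salary, rn))

top_per_dept = sorted(r for r in ranked if r[3] == 1)       # keep rn = 1
```

LAG/LEAD and running totals follow the same pattern: within each sorted partition, look at the previous/next element or keep a running accumulator.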
DAY 3 (5-6 hours): OPTIMIZATION + PERFORMANCE
Catalyst Optimizer: 4 phases (Analysis → Logical Optimization → Physical Planning → Code Generation)
Tungsten Engine (off-heap, codegen, vectorized execution)
Predicate Pushdown — when it works and when it doesn't
repartition() vs coalesce()
spark.sql.shuffle.partitions — tuning
Join Strategies: Broadcast, Shuffle Hash, Sort-Merge, BNLJ, Cartesian
Broadcast join — threshold, hints
Data Skew — detection + solutions (salting, AQE, broadcast)
AQE (Adaptive Query Execution) — 3 main features
cache() vs persist() — storage levels
Checkpointing vs caching
groupByKey vs reduceByKey vs aggregateByKey
Small files problem and solutions
Spark configurations (memory, parallelism, serialization)
Spark UI — how to read it for debugging
Dynamic Allocation
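
Salting, the go-to skew fix from the list above, is worth being able to whiteboard. The idea: append a random suffix in 0..N-1 to the hot key so its rows spread across N tasks, and replicate the other join side with all N suffixes so every row still finds its match. A minimal sketch of the key manipulation (bucket count and key names are arbitrary):

```python
import random

random.seed(0)           # deterministic for the demo
SALT_BUCKETS = 8

def salt(key):
    # Skewed side: scatter rows of a hot key over SALT_BUCKETS join keys
    return f"{key}_{random.randrange(SALT_BUCKETS)}"

def explode(key):
    # Small side: emit every possible salted key so each salted row
    # on the other side still has a matching partner
    return [f"{key}_{i}" for i in range(SALT_BUCKETS)]

hot_rows = ["user_42"] * 1000              # one key dominates the dataset
salted_keys = {salt(k) for k in hot_rows}
# user_42 now maps to up to 8 distinct join keys instead of one hot task
```

In PySpark the skewed side is typically salted with `F.concat(col, F.lit("_"), (F.rand() * N).cast("int"))` and the small side exploded from a literal array of suffixes; on Spark 3.x, enabling `spark.sql.adaptive.skewJoin.enabled` often handles moderate skew without manual salting.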

FILES STRUCTURE

File                                Content
PySpark_00_PLAN.md                  This plan
PySpark_01_Architecture_RDD.md      Deep: Architecture + RDD
PySpark_01_Quick_Recall.md          Flash cards: Architecture + RDD
PySpark_02_DataFrame_SparkSQL.md    Deep: DataFrame + SparkSQL + Reading/Writing
PySpark_02_Quick_Recall.md          Flash cards: DataFrame + SparkSQL
PySpark_03_Optimization.md          Deep: ALL optimizations + configs
PySpark_03_Quick_Recall.md          Flash cards + Ultra Cheat Sheet

PRIORITY MATRIX

MUST KNOW — Will definitely be asked (60%)

  1. DAG, stages, tasks, lazy evaluation
  2. Narrow vs Wide transformations
  3. repartition() vs coalesce()
  4. groupByKey vs reduceByKey (performance!)
  5. cache() vs persist() storage levels
  6. Broadcast join — threshold, when to use
  7. AQE — 3 features (coalesce, join switching, skew)
  8. Python UDF vs Pandas UDF (performance)
  9. RDD vs DataFrame vs Dataset
  10. Data skew detection + salting fix

SHOULD KNOW — High probability (30%)

  1. Catalyst 4 phases
  2. Tungsten engine
  3. Spark UI reading
  4. Window functions + code
  5. Reading multiple sources at once
  6. Schema inference vs explicit StructType
  7. Predicate pushdown
  8. Checkpointing vs caching
  9. spark.sql.shuffle.partitions
  10. Broadcast variables + accumulators

NICE TO KNOW — Differentiators (10%)

  1. mapPartitions() vs map()
  2. combineByKey vs aggregateByKey internals
  3. Dynamic allocation configs
  4. Kryo serialization
  5. Off-heap memory (Tungsten)
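
Several of the differentiators above (Kryo, off-heap memory, dynamic allocation) plus the must-know AQE and broadcast topics boil down to a handful of configs. An illustrative `spark-defaults.conf` fragment; the values shown are the common defaults or typical starting points, not universal recommendations:

```
# Serialization: Kryo is faster and more compact than Java serialization (RDD paths)
spark.serializer                               org.apache.spark.serializer.KryoSerializer

# AQE: enabled by default since Spark 3.2
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true

# Broadcast join threshold: 10 MB default; -1 disables auto-broadcast
spark.sql.autoBroadcastJoinThreshold           10485760

# Shuffle parallelism: 200 is the default; tune to data volume and core count
spark.sql.shuffle.partitions                   200

# Dynamic allocation (requires shuffle tracking or an external shuffle service)
spark.dynamicAllocation.enabled                true

# Off-heap memory (Tungsten)
spark.memory.offHeap.enabled                   true
spark.memory.offHeap.size                      2g
```

Being able to name the default for `autoBroadcastJoinThreshold` and `shuffle.partitions` is a cheap senior-level signal in interviews.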

THE 10-YEAR ENGINEER FRAMING

"I've used PySpark in production for [X] years. When someone says
'the job is slow', I don't guess — I open Spark UI, look at the
Stages tab for skewed tasks, check GC time on the Executors tab,
look at the SQL plan for Sort-Merge Joins that should be Broadcast,
and check if AQE is enabled. The answer is almost always one of:
data skew, too many/few partitions, wrong join strategy, or
DataFrame recomputed multiple times instead of being cached."

SPARK ECOSYSTEM OVERVIEW

📐 Architecture Diagram
┌─────────────────────────────────────────────────────────┐
│                    SPARK ECOSYSTEM                       │
├─────────────────────────────────────────────────────────┤
│  LANGUAGE APIs:  Python (PySpark), Scala, Java, R       │
├─────────────────────────────────────────────────────────┤
│  HIGH-LEVEL APIs:                                        │
│  Spark SQL / DataFrame API (structured)                 │
│  Streaming (Structured Streaming)                       │
├─────────────────────────────────────────────────────────┤
│  CORE ENGINE:                                            │
│  Catalyst Optimizer + Tungsten Execution Engine         │
│  RDD API (low-level)                                    │
├─────────────────────────────────────────────────────────┤
│  STORAGE:  HDFS, S3, ADLS, Delta Lake, Iceberg          │
├─────────────────────────────────────────────────────────┤
│  CLUSTER:  YARN, Kubernetes, Mesos, Standalone          │
└─────────────────────────────────────────────────────────┘