PySpark — Confusions, Labs, Gotchas & Mock Interview
💡 Interview Tip
Goal: After this page, you should NEVER struggle with PySpark interview questions.
Where to run labs: Databricks Community Edition (free), local PySpark install, or Google Colab with
!pip install pyspark.

Memory Map
🧠 PYSPARK MASTERY → LAZY-SHUFFLE-PLAN
───────────────────────────────────
L → Lazy evaluation (transformations vs actions)
A → Abstractions (RDD vs DataFrame vs Dataset)
Z → Zones of shuffle (narrow vs wide)
Y → "Yes, cache" (when and why)
S → Strategies for joins (broadcast / sort-merge / shuffle-hash)
H → Handling skew, small files, OOM
P → Partitioning (repartition vs coalesce)
L → Labs + mock interview
A → AQE (adaptive query execution)
N → NEVER use UDFs if you can help it
SECTION 0: TOP 8 PYSPARK CONFUSIONS — Cleared Forever
Confusion 1: RDD vs DataFrame vs Dataset
🧠 RDD → low-level, typed, NO optimization (Spark Core)
DataFrame → tabular, optimized via Catalyst, preferred default (Python has NO Dataset)
Dataset → DataFrame + compile-time types (Scala/Java only)
Real-world analogy:
- RDD = raw Python list. You control everything, Spark doesn't help.
- DataFrame = pandas-style table. Spark optimizes queries for you.
- Dataset = DataFrame + type safety (only Scala/Java).
When to use:
- DataFrame: 95% of the time. Default choice.
- RDD: only for low-level control (custom partitioning, binary data, legacy code).
Interview one-liner: "I always use DataFrame API because it goes through Catalyst optimizer and Tungsten execution engine — I get query optimization, code generation, and off-heap memory for free. RDDs lose all that."
What NOT to say: "RDDs are faster." — FALSE. DataFrames are almost always faster due to Catalyst optimizations.
Confusion 2: Transformation vs Action (Lazy Evaluation)
Transformations → LAZY — build the DAG, don't execute
Examples: select