Question Bank

PySpark Interview Questions

1 Filter employees earning > 50k
Problem: Given an employees table with columns (emp_id, name, dept, salary),
return all employees with salary > 50000, sorted by salary in descending order.
python — editable
from pyspark.sql.functions import col

# Assumes an active SparkSession named `spark` and Parquet data under employees/
employees = spark.read.parquet("employees/")

result = employees \
    .filter(col("salary") > 50000) \
    .select("emp_id", "name", "dept", "salary") \
    .orderBy(col("salary").desc())

result.show()
2 Count orders per customer
Problem: Given orders(order_id, customer_id, amount, order_date),
find total orders and total amount spent per customer.
python — editable
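from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in sum

# Assumes `orders` is an existing DataFrame with the columns listed above.
result = orders.groupBy("customer_id") \
    .agg(
        F.count("order_id").alias("total_orders"),
        F.sum("amount").alias("total_spent")
    ) \
    .orderBy(F.col("total_spent").desc())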

result.show()
3 Find duplicate rows by email
Problem: Given users(user_id, email, created_at),
find all emails that appear more than once.
python — editable
from pyspark.sql.functions import count, col

# Step 1: emails that appear more than once (this alone answers the problem)
duplicates = users.groupBy("email") \
    .agg(count("*").alias("cnt")) \
    .filter(col("cnt") > 1)

# Step 2 (optional): join back to users to get the full duplicate rows
result = users.join(duplicates, "email", "inner") \
    .select("user_id", "email", "created_at", "cnt") \
    .orderBy("email")

result.show()
4 Total sales by date
Problem: Given transactions(txn_id, date, store_id, amount),
compute daily total sales, ordered by date.
python — editable
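from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in sum

# Assumes `transactions` is an existing DataFrame with the columns listed above.
result = transactions \
    .withColumn("date", F.to_date(F.col("date"))) \
    .groupBy("date") \
    .agg(F.sum("amount").alias("daily_sales")) \
    .orderBy("date")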

result.show()
5 Add derived column (categorize salary)
Problem: Given employees(emp_id, name, salary),
add a "salary_band" column: High (> 100k), Mid (50k to 100k, inclusive), Low (< 50k).
python — editable
from pyspark.sql.functions import col, when

result = employees.withColumn(
    "salary_band",
    when(col("salary") > 100000, "High")
    .when(col("salary") >= 50000, "Mid")
    .otherwise("Low")
)

result.show()