Question Bank

PySpark Interview Questions

SECTION 1: EASY QUESTIONS (E)

1 Filter employees earning > 50k
Problem: Given an employees table with (emp_id, name, dept, salary),
return all employees with salary > 50000, sorted by salary desc.
python — editable
from pyspark.sql.functions import col

employees = spark.read.parquet("employees/")

result = employees \
    .filter(col("salary") > 50000) \
    .select("emp_id", "name", "dept", "salary") \
    .orderBy(col("salary").desc())

result.show()
2 Count orders per customer
Problem: Given orders(order_id, customer_id, amount, order_date),
find total orders and total amount spent per customer.
python — editable
from pyspark.sql.functions import count, sum, col

result = orders.groupBy("customer_id") \
    .agg(
        count("order_id").alias("total_orders"),
        sum("amount").alias("total_spent")
    ) \
    .orderBy(col("total_spent").desc())

result.show()
3 Find duplicate rows by email
Problem: Given users(user_id, email, created_at),
find all emails that appear more than once.
python — editable
from pyspark.sql.functions import count, col

# Method 1: groupBy + filter
duplicates = users.groupBy("email") \
    .agg(count("*").alias("cnt")) \
    .filter(col("cnt") > 1)

# Method 2: join back to get full rows
result = users.join(duplicates, "email", "inner") \
    .select("user_id", "email", "created_at", "cnt") \
    .orderBy("email")

result.show()
4 Total sales by date
Problem: Given transactions(txn_id, date, store_id, amount),
compute daily total sales, ordered by date.
python — editable
from pyspark.sql.functions import sum, col, to_date

result = transactions \
    .withColumn("date", to_date(col("date"))) \
    .groupBy("date") \
    .agg(sum("amount").alias("daily_sales")) \
    .orderBy("date")

result.show()
5 Add derived column (categorize salary)
Problem: Given employees(emp_id, name, salary),
add a "salary_band" column: High(>100k), Mid(50k-100k), Low(<50k)
python — editable
from pyspark.sql.functions import col, when

result = employees.withColumn(
    "salary_band",
    when(col("salary") > 100000, "High")
    .when(col("salary") >= 50000, "Mid")
    .otherwise("Low")
)

result.show()
6 Read CSV + handle nulls
Problem: Read a CSV file with nulls. Fill numeric nulls with 0,
string nulls with "Unknown". Drop rows where emp_id is null.
python — editable
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField("emp_id",  IntegerType(), nullable=True),
    StructField("name",    StringType(),  nullable=True),
    StructField("dept",    StringType(),  nullable=True),
    StructField("salary",  DoubleType(),  nullable=True)
])

df = spark.read \
    .schema(schema) \
    .option("header", True) \
    .csv("employees.csv")

result = (
    df
    .na.drop(subset=["emp_id"])                   # drop rows where emp_id is null
    .na.fill({"salary": 0.0, "dept": "Unknown"})  # fill remaining nulls
)

result.show()
7 Word count (RDD style)
Problem: Given a text file, count the frequency of each word.
Return top 10 most frequent words.
python — editable
# RDD approach (classic interview question)
rdd = sc.textFile("data/text_file.txt")

word_counts = rdd \
    .flatMap(lambda line: line.lower().split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False)

word_counts.take(10)

# DataFrame approach (prefer in practice)
from pyspark.sql.functions import explode, split, lower, col

df = spark.read.text("data/text_file.txt")
result = df \
    .withColumn("word", explode(split(lower(col("value")), "\\s+"))) \
    .groupBy("word") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(10)

result.show()

SECTION 2: MEDIUM QUESTIONS (M)

8 Max salary per department
Problem: Given employees(emp_id, name, dept, salary),
find the highest salary in each department.
Company: Amazon, Google
python — editable
from pyspark.sql.functions import max, col

result = employees.groupBy("dept") \
    .agg(max("salary").alias("max_salary")) \
    .orderBy(col("max_salary").desc())

result.show()
9 Rank employees by salary per department
Problem: Rank employees within each department by salary (highest = rank 1).
Return emp_id, name, dept, salary, rank.
Company: Amazon
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, col

w = Window.partitionBy("dept").orderBy(col("salary").desc())

result = employees.withColumn("rank", dense_rank().over(w)) \
    .select("emp_id", "name", "dept", "salary", "rank") \
    .orderBy("dept", "rank")

result.show()
10 Top 3 salaries per department
Problem: Find the top 3 distinct salary levels per department.
If multiple employees share the 3rd highest salary, include all.
Company: Amazon (LeetCode #185)
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, col

w = Window.partitionBy("dept").orderBy(col("salary").desc())

result = employees \
    .withColumn("rnk", dense_rank().over(w)) \
    .filter(col("rnk") <= 3) \
    .select("dept", "name", "salary", "rnk") \
    .orderBy("dept", "rnk")

result.show()

# KEY INSIGHT: dense_rank → tied salaries get same rank, all appear
# If you used rank() → gaps: ranks 1,1,3 (skips 2)
# If you used row_number() → only 3 rows max, misses ties
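The contrast in the comment above is easy to verify without a cluster. Below is a plain-Python sketch of the three ranking semantics (hypothetical helper names, not the PySpark API) applied to one descending-sorted partition with a tie:

```python
# Plain-Python semantics of the three window ranking functions,
# applied to one partition already sorted descending by salary
salaries = [900, 900, 800, 700]  # two employees tied at 900

def row_number(values):
    return list(range(1, len(values) + 1))              # ties broken arbitrarily

def rank(values):
    return [1 + sum(v > x for v in values) for x in values]   # gaps after ties

def dense_rank(values):
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(x) + 1 for x in values]      # no gaps

print(row_number(salaries))   # [1, 2, 3, 4]
print(rank(salaries))         # [1, 1, 3, 4]
print(dense_rank(salaries))   # [1, 1, 2, 3]
```

With `.filter(rnk <= 3)`, only `dense_rank` guarantees the top 3 distinct salary levels including all ties.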
11 Second highest salary
Problem: Find the second highest salary overall.
If no second highest exists, return null.
Company: Amazon (LeetCode #176)
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, col

w = Window.orderBy(col("salary").desc())

result = employees \
    .withColumn("rnk", dense_rank().over(w)) \
    .filter(col("rnk") == 2) \
    .select("salary") \
    .distinct()

# Handle case where no second salary exists — give an explicit schema,
# since a single all-null row cannot be type-inferred
if result.count() == 0:
    result = spark.createDataFrame([(None,)], "salary double")

result.show()
12 Running total of revenue
Problem: Given revenue(date, amount), compute cumulative revenue
over time (running total from earliest date).
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, col

w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = revenue \
    .withColumn("running_total", sum("amount").over(w)) \
    .select("date", "amount", "running_total") \
    .orderBy("date")

result.show()
13 7-day rolling average
Problem: Given daily_sales(date, sales), compute 7-day rolling
average (current day + 6 previous days).
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import avg, col, round

w = Window.orderBy("date").rowsBetween(-6, Window.currentRow)  # 6 prior + current

result = daily_sales \
    .withColumn("rolling_7d_avg", round(avg("sales").over(w), 2)) \
    .select("date", "sales", "rolling_7d_avg") \
    .orderBy("date")

result.show()
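The `rowsBetween(-6, Window.currentRow)` frame can be sanity-checked in plain Python — a hypothetical `rolling_avg_7d` helper, no Spark involved, showing that early rows average over fewer than 7 values:

```python
# Plain-Python equivalent of rowsBetween(-6, Window.currentRow):
# for day i, average sales[max(0, i-6) .. i] — shorter window at the start
def rolling_avg_7d(sales):
    out = []
    for i in range(len(sales)):
        window = sales[max(0, i - 6): i + 1]
        out.append(round(sum(window) / len(window), 2))
    return out

daily = [10, 20, 30, 40, 50, 60, 70, 80]
print(rolling_avg_7d(daily))   # [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 50.0]
```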
14 Employees earning more than their manager
Problem: Given employees(emp_id, name, salary, manager_id),
find all employees who earn more than their direct manager.
Company: Google (LeetCode #181)
python — editable
from pyspark.sql.functions import col

emp = employees.alias("e")
mgr = employees.alias("m")

result = emp.join(mgr, col("e.manager_id") == col("m.emp_id"), "inner") \
    .filter(col("e.salary") > col("m.salary")) \
    .select(
        col("e.emp_id"),
        col("e.name").alias("employee"),
        col("e.salary").alias("emp_salary"),
        col("m.name").alias("manager"),
        col("m.salary").alias("mgr_salary")
    )

result.show()
15 Customers with no orders
Problem: Given customers(customer_id, name) and orders(order_id, customer_id),
find all customers who have never placed an order.
python — editable
from pyspark.sql.functions import col

# Method 1: left_anti join (cleanest, most efficient)
result = customers.join(orders, "customer_id", "left_anti") \
    .select("customer_id", "name")

# Method 2: left join + filter null
result = customers.join(orders, "customer_id", "left") \
    .filter(col("order_id").isNull()) \
    .select(customers.customer_id, customers.name)

result.show()

# ⚡ Prefer left_anti — cleaner, Catalyst handles it well
16 Deduplicate — keep latest record
Problem: Given events(event_id, user_id, event_type, created_at) with
duplicate rows (same user_id + event_type), keep only the latest.
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

w = Window.partitionBy("user_id", "event_type").orderBy(col("created_at").desc())

result = events \
    .withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .drop("rn")

result.show()

# ⚡ Use row_number (not rank/dense_rank) — guarantees exactly 1 row per group
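The "keep latest per key" logic is the same as a last-write-wins scan — sketched here in plain Python (hypothetical `keep_latest` helper) to show why exactly one row survives per key:

```python
# Plain-Python "keep latest per (user_id, event_type)": scan in ascending time
# order and overwrite — the last write wins, like filtering row_number == 1
# over a descending-time window
def keep_latest(rows):
    latest = {}
    for user_id, event_type, created_at in sorted(rows, key=lambda r: r[2]):
        latest[(user_id, event_type)] = created_at
    return latest

rows = [("u1", "click", 1), ("u1", "click", 5), ("u2", "view", 2)]
print(keep_latest(rows))   # {('u1', 'click'): 5, ('u2', 'view'): 2}
```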
17 Month-over-Month revenue change
Problem: Given monthly_revenue(year_month, revenue), compute MoM %
change: (current - previous) / previous × 100
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, col, round

w = Window.orderBy("year_month")

result = monthly_revenue \
    .withColumn("prev_revenue", lag("revenue", 1).over(w)) \
    .withColumn(
        "mom_pct_change",
        round(
            (col("revenue") - col("prev_revenue")) / col("prev_revenue") * 100,
            2
        )
    ) \
    .select("year_month", "revenue", "prev_revenue", "mom_pct_change") \
    .orderBy("year_month")

result.show()

# ⚠️ First row: prev_revenue = null → mom_pct_change = null (expected)
# ⚠️ If prev_revenue can be 0: wrap denominator in nullif logic manually
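The null-first-row and zero-denominator cases called out above can be made concrete in plain Python (a hypothetical `mom_pct_change` helper); the same guard translates to a `when(col("prev_revenue") != 0, ...)` wrapper in Spark:

```python
# MoM % change with guards: first month (prev is None) and a zero denominator
# both yield None instead of an error
def mom_pct_change(revenues):
    out, prev = [], None
    for cur in revenues:
        if prev is None or prev == 0:
            out.append(None)     # undefined: no previous month, or division by zero
        else:
            out.append(round((cur - prev) / prev * 100, 2))
        prev = cur
    return out

print(mom_pct_change([100, 110, 0, 50]))   # [None, 10.0, -100.0, None]
```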
18 Pivot: rows to columns (long → wide)
Problem: Given sales(product, month, revenue) in long format,
pivot to wide format with each month as a column.
python — editable
from pyspark.sql.functions import sum

result = sales.groupBy("product").pivot("month").agg(sum("revenue"))
result.show()

# Output:
# product | Jan  | Feb  | Mar
# --------|------|------|-----
# iPhone  | 1000 | 1200 | 900

# Optimization: provide list of pivot values to avoid extra scan
months = ["Jan", "Feb", "Mar", "Apr", "May"]
result = sales.groupBy("product").pivot("month", months).agg(sum("revenue"))
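The shape transformation that `groupBy().pivot().agg(sum())` performs can be sketched in plain Python (hypothetical `pivot_sum` helper) — each (product, month) cell gets the summed revenue, and unseen months stay null:

```python
from collections import defaultdict

# Plain-Python sketch of groupBy("product").pivot("month").agg(sum("revenue")):
# long rows (product, month, revenue) → {product: {month: summed revenue}}
def pivot_sum(rows, months):
    wide = defaultdict(lambda: {m: None for m in months})
    for product, month, revenue in rows:
        cell = wide[product][month]
        wide[product][month] = revenue if cell is None else cell + revenue
    return dict(wide)

long_rows = [("iPhone", "Jan", 600), ("iPhone", "Jan", 400), ("iPhone", "Feb", 1200)]
print(pivot_sum(long_rows, ["Jan", "Feb", "Mar"]))
# {'iPhone': {'Jan': 1000, 'Feb': 1200, 'Mar': None}}
```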
19 Explode array column + count tags
Problem: Given posts(post_id, user_id, tags) where tags is an array
like ["sports","tech","news"], find the top 5 most used tags.
python — editable
from pyspark.sql.functions import explode, col, count

result = (
    posts
    .withColumn("tag", explode(col("tags")))   # one row per tag
    .groupBy("tag")
    .agg(count("*").alias("usage_count"))
    .orderBy(col("usage_count").desc())
    .limit(5)
)

result.show()

SECTION 3: HARD QUESTIONS (H)

20 Find consecutive purchase days
Problem: Given orders(order_id, customer_id, order_date),
find customers who placed orders on at least 3 consecutive days.
Company: Amazon
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col, count

# Step 1: Deduplicate (1 row per customer per date)
deduped = orders.select("customer_id", "order_date").distinct()

# Step 2: Assign row number per customer ordered by date
w = Window.partitionBy("customer_id").orderBy("order_date")
numbered = deduped.withColumn("rn", row_number().over(w))

# Step 3: Island key = date - rn (constant for consecutive dates)
from pyspark.sql.functions import date_sub, expr
islands = numbered.withColumn(
    "island_key",
    expr("date_sub(order_date, rn)")   # Spark SQL: date - rn = island constant
)

# Step 4: Group by island, find streaks >= 3
result = islands.groupBy("customer_id", "island_key") \
    .agg(count("*").alias("streak_len")) \
    .filter(col("streak_len") >= 3) \
    .select("customer_id") \
    .distinct()

result.show()
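The `date - rn` island trick is easiest to see with ordinal dates in plain Python — for consecutive dates the difference stays constant, so grouping by it isolates each streak:

```python
from datetime import date
from itertools import groupby

# Gap-and-islands in plain Python: for consecutive dates, ordinal - row_number
# is constant, so grouping by that difference groups each streak together
order_dates = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 7)]

keyed = [(d.toordinal() - rn, d) for rn, d in enumerate(sorted(order_dates), start=1)]
streaks = [len(list(g)) for _, g in groupby(keyed, key=lambda kv: kv[0])]

print(streaks)             # [3, 1] — one 3-day streak, then an isolated day
print(max(streaks) >= 3)   # True → this customer qualifies
```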
21 Session ID assignment (gap > 30 minutes = new session)
Problem: Given events(user_id, event_time), assign a session_id to
each event. A new session starts when gap > 30 minutes.
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, col, sum as _sum, unix_timestamp, when

w_order = Window.partitionBy("user_id").orderBy("event_time")

# Step 1: Flag new session start (gap > 30 min or first event)
flagged = events.withColumn(
    "prev_time", lag("event_time", 1).over(w_order)
).withColumn(
    "new_session",
    when(
        col("prev_time").isNull() |
        ((unix_timestamp("event_time") - unix_timestamp("prev_time")) > 1800),
        1
    ).otherwise(0)
)

# Step 2: Cumulative sum of new_session flags = session_id
w_cum = Window.partitionBy("user_id").orderBy("event_time") \
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = flagged.withColumn(
    "session_id",
    _sum("new_session").over(w_cum)
).select("user_id", "event_time", "session_id")

result.show()
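The flag-then-cumulative-sum pattern is worth internalizing; here it is as a plain-Python sessionizer (hypothetical `assign_sessions` helper, epoch seconds instead of timestamps):

```python
# Plain-Python sessionizer: flag a new session when the gap exceeds 30 min
# (1800 s), then a running sum of the flags yields the session_id — the same
# two-step logic as the Spark window version above
def assign_sessions(epoch_seconds, gap=1800):
    session_ids, session, prev = [], 0, None
    for t in sorted(epoch_seconds):
        if prev is None or t - prev > gap:
            session += 1           # new-session flag accumulated into the id
        session_ids.append(session)
        prev = t
    return session_ids

events = [0, 600, 3000, 3300, 10000]
print(assign_sessions(events))   # [1, 1, 2, 2, 3]
```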
22 Days with temperature higher than previous day
Problem: Given weather(id, record_date, temperature), find all dates
where temperature was higher than the previous day.
Company: Amazon, Google (LeetCode #197)
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, col, datediff

w = Window.orderBy("record_date")

result = weather \
    .withColumn("prev_temp", lag("temperature", 1).over(w)) \
    .withColumn("prev_date", lag("record_date", 1).over(w)) \
    .filter(
        (col("temperature") > col("prev_temp")) &
        (datediff(col("record_date"), col("prev_date")) == 1)  # must be consecutive day
    ) \
    .select("id", "record_date", "temperature")

result.show()

# ⚠️ KEY: must check datediff == 1 — data may have gaps (missing dates)
#         Without this check, you'd compare non-adjacent days
23 Longest streak of consecutive active days per user
Problem: Given logins(user_id, login_date), find the longest
consecutive daily login streak for each user.
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col, count, max as _max
from pyspark.sql.functions import expr

# Step 1: Deduplicate (1 row per user per date)
deduped = logins.select("user_id", "login_date").distinct()

# Step 2: Row number per user ordered by date
w = Window.partitionBy("user_id").orderBy("login_date")
numbered = deduped.withColumn("rn", row_number().over(w))

# Step 3: island_key = date - rn (constant for consecutive sequences)
islands = numbered.withColumn(
    "island_key",
    expr("date_sub(login_date, rn)")
)

# Step 4: Count streak length per island
streak_lengths = islands.groupBy("user_id", "island_key") \
    .agg(count("*").alias("streak_len"))

# Step 5: Max streak per user
result = streak_lengths.groupBy("user_id") \
    .agg(_max("streak_len").alias("longest_streak")) \
    .orderBy(col("longest_streak").desc())

result.show()
24 Products bought together (market basket)
Problem: Given order_items(order_id, product_id), find all pairs of
products that appear in the same order. Show top 5 co-purchased pairs.
Company: Amazon
python — editable
from pyspark.sql.functions import col, count

o1 = order_items.alias("o1")
o2 = order_items.alias("o2")

result = o1.join(
    o2,
    (col("o1.order_id") == col("o2.order_id")) &
    (col("o1.product_id") < col("o2.product_id")),   # avoid duplicates + self-pairs
    "inner"
) \
.groupBy(col("o1.product_id").alias("product_a"),
         col("o2.product_id").alias("product_b")) \
.agg(count("*").alias("co_purchase_count")) \
.orderBy(col("co_purchase_count").desc()) \
.limit(5)

result.show()

# ⚡ KEY: o1.product_id < o2.product_id ensures each pair appears once
#         Without this: (A,B) and (B,A) both appear = duplicates
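The `product_id < product_id` self-join condition is equivalent to taking unordered pairs per order — in plain Python, `itertools.combinations` does exactly that:

```python
from itertools import combinations
from collections import Counter

# Plain-Python equivalent of the self-join with o1.product_id < o2.product_id:
# combinations() emits each unordered pair exactly once per order
orders = {1: ["A", "B", "C"], 2: ["A", "B"]}

pair_counts = Counter()
for items in orders.values():
    pair_counts.update(combinations(sorted(set(items)), 2))

print(pair_counts.most_common(1))   # [(('A', 'B'), 2)]
```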
25 Funnel conversion rates
Problem: Given events(user_id, event_type) where event_type is one of:
'view', 'add_to_cart', 'purchase'
Compute conversion rate at each funnel stage.
Company: Meta, TikTok
python — editable
from pyspark.sql.functions import col, when, max as _max

# Method: MAX(CASE WHEN) per user — count users, not events
funnel = events.groupBy("user_id").agg(
    _max(when(col("event_type") == "view",         1).otherwise(0)).alias("viewed"),
    _max(when(col("event_type") == "add_to_cart",  1).otherwise(0)).alias("carted"),
    _max(when(col("event_type") == "purchase",     1).otherwise(0)).alias("purchased")
)

from pyspark.sql.functions import sum as _sum

summary = funnel.agg(
    _sum("viewed").alias("total_views"),
    _sum("carted").alias("total_carted"),
    _sum("purchased").alias("total_purchased")
)

# Compute conversion rates — collect once (each collect() triggers a Spark job)
totals = summary.collect()[0]
total_views  = totals["total_views"]
total_carted = totals["total_carted"]
total_purch  = totals["total_purchased"]

print(f"View → Cart:     {total_carted/total_views*100:.1f}%")
print(f"Cart → Purchase: {total_purch/total_carted*100:.1f}%")
print(f"Overall:         {total_purch/total_views*100:.1f}%")

# ⚡ KEY: MAX(CASE WHEN) per user ensures each user counted once per stage
#         Even if user viewed 10 times, they count as 1 viewer
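The MAX(CASE WHEN) trick reduces each user to a set of stages they ever reached; in plain Python (hypothetical `funnel_counts` helper) the same idea looks like this:

```python
# Plain-Python version of the MAX(CASE WHEN) trick: each user contributes at
# most one count per stage, no matter how many events of that type they fired
def funnel_counts(events):
    stages = {}
    for user, event_type in events:
        stages.setdefault(user, set()).add(event_type)
    return {
        "viewed":    sum("view" in s for s in stages.values()),
        "carted":    sum("add_to_cart" in s for s in stages.values()),
        "purchased": sum("purchase" in s for s in stages.values()),
    }

events = [("u1", "view"), ("u1", "view"), ("u1", "add_to_cart"),
          ("u2", "view"), ("u2", "add_to_cart"), ("u2", "purchase")]
print(funnel_counts(events))   # {'viewed': 2, 'carted': 2, 'purchased': 1}
```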
26 Dedup with composite key (keep most recent per key combo)
Problem: Given cdc_events(id, name, dept, salary, updated_at) representing
CDC stream with multiple updates per employee, keep the latest
record per (id, dept) combination.
python — editable
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

w = Window.partitionBy("id", "dept").orderBy(col("updated_at").desc())

result = cdc_events \
    .withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .drop("rn")

result.show()

# ⚡ row_number guarantees exactly 1 row per partition (unlike rank/dense_rank)
# Always use row_number for deduplication, not rank

SECTION 4: SCENARIO QUESTIONS (S)

27 Handle data skew with salting
Problem: You have orders(order_id, country_code, amount) and a countries
lookup table. 80% of orders are from "US". The join on country_code
is very slow. How do you fix it?
python — editable
from pyspark.sql.functions import rand, concat, lit, col, explode, array

SALT_FACTOR = 10

# APPROACH 1: Broadcast (if countries table is small — BEST approach)
from pyspark.sql.functions import broadcast
result = orders.join(broadcast(countries), "country_code")

# APPROACH 2: Salting (when both tables are large and skewed)

# Step 1: Salt the large orders table
orders_salted = orders.withColumn(
    "salted_key",
    concat(col("country_code"), lit("_"),
           (rand() * SALT_FACTOR).cast("int").cast("string"))
)

# Step 2: Explode countries to match all salt values (0 to SALT_FACTOR-1)
countries_exploded = countries.withColumn(
    "salt", explode(array([lit(i) for i in range(SALT_FACTOR)]))
).withColumn(
    "salted_key",
    concat(col("country_code"), lit("_"), col("salt").cast("string"))
)

# Step 3: Join on salted key
result = orders_salted.join(countries_exploded, "salted_key", "inner") \
    .drop("salted_key", "salt")

result.show()

# APPROACH 3: Enable AQE (automatic, no code change)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# AQE detects hot partitions at runtime and splits them
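Why salting works can be shown without Spark: appending a random suffix fans a hot key out over SALT_FACTOR distinct join keys, so no single partition carries the whole skewed key (plain-Python sketch, seeded for determinism):

```python
import random

# Plain-Python sketch of salting: one hot key ("US", 80% of rows) fans out
# over SALT_FACTOR buckets, so no single join partition processes all of it
SALT_FACTOR = 10
random.seed(42)  # deterministic for the example

rows = ["US"] * 80 + ["DE"] * 10 + ["FR"] * 10
salted = [f"{k}_{random.randrange(SALT_FACTOR)}" for k in rows]

us_buckets = {s for s in salted if s.startswith("US_")}
print(len(us_buckets) > 1)                             # hot key spans many buckets
print(max(salted.count(b) for b in us_buckets) < 80)   # no bucket holds all US rows
```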
28 Optimize slow join with broadcast hint
Problem: An ETL job joining transactions(100M rows) with
store_metadata(500 rows) is taking 2 hours due to Sort-Merge Join.
How do you fix it?
python — editable
from pyspark.sql.functions import broadcast

# Problem: Catalyst doesn't auto-broadcast (store_metadata stats not computed)
# Default autoBroadcastJoinThreshold = 10 MB

# SOLUTION 1: Manual broadcast hint (works immediately)
result = transactions.join(
    broadcast(store_metadata),  # forces BroadcastHashJoin, no shuffle on store_metadata
    "store_id"
)

# SOLUTION 2: Increase threshold (if you want auto-broadcast for similar tables)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800")  # 50 MB

# SOLUTION 3: Cache small table + broadcast
store_metadata.cache()
store_metadata.count()  # trigger cache
result = transactions.join(broadcast(store_metadata), "store_id")

# VERIFICATION: Check join type in plan
result.explain("formatted")
# Look for: BroadcastHashJoin (not SortMergeJoin)

# IMPACT: Sort-Merge Join = 2 shuffles + 2 sorts + merge per 100M rows
#         Broadcast Hash Join = 0 shuffles + hash lookup per 100M rows
#         Speedup: often 10-50x for large fact + small dim joins
29 Read multiple sources + track file origin
Problem: You have daily Parquet files in S3 for Jan, Feb, Mar 2024.
Read them all, add a "source_file" column, and combine.
Some files have extra columns not present in others.
python — editable
from pyspark.sql.functions import input_file_name, col, lit

# APPROACH 1: Glob pattern (all files match same schema)
df = spark.read.parquet("s3://bucket/data/year=2024/month=*/") \
    .withColumn("source_file", input_file_name())

# APPROACH 2: Read separately + unionByName (different schemas)
jan = spark.read.parquet("s3://bucket/data/year=2024/month=01/") \
    .withColumn("source_month", lit("Jan"))
feb = spark.read.parquet("s3://bucket/data/year=2024/month=02/") \
    .withColumn("source_month", lit("Feb"))
mar = spark.read.parquet("s3://bucket/data/year=2024/month=03/") \
    .withColumn("source_month", lit("Mar"))

# allowMissingColumns=True fills missing cols with null instead of failing
combined = jan.unionByName(feb, allowMissingColumns=True) \
              .unionByName(mar, allowMissingColumns=True)

combined.show()

# APPROACH 3: Python list of paths
paths = [
    "s3://bucket/2024-01/",
    "s3://bucket/2024-02/",
    "s3://bucket/2024-03/"
]
df = spark.read.parquet(*paths).withColumn("source_file", input_file_name())
30 Flatten nested JSON to flat table
Problem: You receive JSON with nested structure:
{
"order_id": 1,
"customer": {"id": 101, "name": "Alice", "city": "NYC"},
"items": [
{"sku": "A1", "qty": 2, "price": 10.0},
{"sku": "B2", "qty": 1, "price": 25.0}
]
}
Flatten to: order_id, customer_id, customer_name, city, sku, qty, price
python — editable
from pyspark.sql.functions import col, explode_outer

# Read raw JSON (schema inferred or explicit)
raw = spark.read.option("multiLine", True).json("orders/*.json")

# raw schema:
# order_id: long
# customer: struct<id: long, name: string, city: string>
# items: array<struct<sku: string, qty: int, price: double>>

# Step 1: Explode items array (one row per item)
exploded = raw.withColumn("item", explode_outer("items"))  # outer = keep null items

# Step 2: Select and flatten nested fields with dot notation
result = exploded.select(
    col("order_id"),
    col("customer.id").alias("customer_id"),
    col("customer.name").alias("customer_name"),
    col("customer.city").alias("city"),
    col("item.sku").alias("sku"),
    col("item.qty").alias("qty"),
    col("item.price").alias("price")
)

result.show()

# ⚡ explode_outer vs explode:
#   explode → drops orders with null/empty items array
#   explode_outer → keeps the order row with null item fields (safer)
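The explode vs explode_outer distinction, mimicked in plain Python on (order_id, items) rows — the functions here shadow the PySpark names only for illustration:

```python
# Plain-Python semantics of explode vs explode_outer on an array column
def explode(rows):            # drops rows whose array is empty or None
    return [(oid, item) for oid, items in rows for item in (items or [])]

def explode_outer(rows):      # keeps such rows, with a null item
    out = []
    for oid, items in rows:
        if items:
            out.extend((oid, item) for item in items)
        else:
            out.append((oid, None))
    return out

orders = [(1, ["A1", "B2"]), (2, []), (3, None)]
print(explode(orders))        # [(1, 'A1'), (1, 'B2')]
print(explode_outer(orders))  # [(1, 'A1'), (1, 'B2'), (2, None), (3, None)]
```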

SECTION 5: LARGE DATA + MEMORY + INFRASTRUCTURE SCENARIOS

31 Process 200 GB data with only 16 GB executor memory
Problem: You have 200 GB of data to process but each executor has only
16 GB of memory. The job keeps crashing with OOM errors.
Walk through how you would handle this end to end.
Company: Asked at Amazon, Google, Databricks
ROOT CAUSE ANALYSIS
The mistake is trying to hold too much data in memory at once.
Spark does NOT need to load 200 GB into RAM simultaneously.
Spark processes data in PARTITIONS — one partition per task per core.
Goal: keep each partition small enough to fit in one executor core's memory share.
python — editable
# ── STEP 1: ESTIMATE RIGHT PARTITION COUNT ──────────────────────────────────
# Rule: each partition should be ~100-200 MB in memory
# 200 GB data → 200,000 MB / 128 MB per partition = ~1,600 partitions

data_size_gb = 200
target_partition_mb = 128
n_partitions = int((data_size_gb * 1024) / target_partition_mb)  # = 1600

spark.conf.set("spark.sql.shuffle.partitions", str(n_partitions))

# ── STEP 2: TUNE EXECUTOR MEMORY ─────────────────────────────────────────────
# With 16 GB executor, 4 cores → each core gets ~4 GB working memory
# Memory layout:
#   spark.executor.memory = 14g  (leave 2g for OS + overhead)
#   spark.executor.memoryOverhead = 2g  (native/Python overhead)
#   spark.memory.fraction = 0.6  → 8.4g for Spark execution + storage
#   Each core → ~2g execution memory for shuffle/joins

# ── STEP 3: AVOID CACHING 200 GB ──────────────────────────────────────────────
# DO NOT: df.cache()  ← 200 GB won't fit, causes eviction thrashing
# DO: Process in a streaming fashion (pipeline partitions through)

# ── STEP 4: FILTER EARLY — reduce data before joins ──────────────────────────
df = (
    spark.read.parquet("s3://bucket/data/")
    .filter(col("date") >= "2024-01-01")    # push filter to Parquet reader
    .filter(col("region") == "US")          # reduce data ASAP
    .select("id", "amount", "date")         # column pruning (read only needed cols)
)

# ── STEP 5: AVOID WIDE OPERATIONS ON FULL 200 GB ──────────────────────────────
# Bad: join 200 GB table with another 200 GB table (400 GB shuffle)
# Good: pre-aggregate one side first, then join reduced result
pre_agg = large_df.groupBy("customer_id") \
    .agg(sum("amount").alias("total"))   # 200 GB → maybe 10 GB after aggregation
result = pre_agg.join(customers, "customer_id")  # join 10 GB, not 200 GB

# ── STEP 6: ENABLE AQE ───────────────────────────────────────────────────────
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# AQE will auto-split/merge partitions based on actual sizes

# ── STEP 7: WRITE IN PARTITIONS, NOT COLLECT ─────────────────────────────────
# NEVER: result.collect()  ← pulls 200 GB to Driver → Driver OOM
# DO: result.write.parquet("s3://output/")  ← each executor writes its partitions

# ── STEP 8: USE DISK SPILL (not OOM) ─────────────────────────────────────────
# If a shuffle really must exceed memory, configure spill to disk:
spark.conf.set("spark.executor.memory", "14g")
spark.conf.set("spark.memory.fraction", "0.6")
# Execution pool will spill to disk automatically (slowdown vs OOM)

# ── FULL CONFIG FOR THIS SCENARIO ────────────────────────────────────────────
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "14g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.sql.shuffle.partitions", "1600")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "134217728")  # 128 MB
    .config("spark.sql.files.maxPartitionBytes", "134217728")  # 128 MB per input partition
    .getOrCreate()
)
INTERVIEW ANSWER FRAMEWORK
1. "Spark doesn't load 200 GB at once — it processes in partitions"
2. "Target 100-200 MB per partition → ~1,600 partitions for 200 GB"
3. "Filter + column prune early to reduce data volume"
4. "Pre-aggregate before joins to reduce shuffle size"
5. "Enable AQE for runtime adaptation"
6. "Write to disk, never collect() large results"
7. "Configure disk spill as safety net (not OOM, just slower)"
32 Reading bad/corrupt data — 3 modes + handling
Problem: You are reading a CSV file from an external vendor. Some rows
have corrupt data (wrong column count, bad types, malformed JSON).
How do you handle bad records without failing the whole job?
python — editable
# ── THE 3 READ MODES ─────────────────────────────────────────────────────────
#
# PERMISSIVE  (default): Parse what you can, set corrupt record to null
#                        Adds _corrupt_record column with the bad raw line
# DROPMALFORMED:         Silently drop bad rows, keep good ones
# FAILFAST:             Throw exception on first bad record (fail the job)

# ── MODE 1: PERMISSIVE — capture bad rows for inspection ─────────────────────
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField("id",     IntegerType(),  nullable=True),
    StructField("name",   StringType(),   nullable=True),
    StructField("amount", DoubleType(),   nullable=True),
    StructField("_corrupt_record", StringType(), nullable=True)  # captures bad rows
])

df = spark.read \
    .schema(schema) \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .option("header", True) \
    .csv("data/vendor_feed.csv")

# Separate good and bad records
good_records = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
bad_records  = df.filter(col("_corrupt_record").isNotNull())

print(f"Good rows: {good_records.count()}")
print(f"Bad rows:  {bad_records.count()}")

# Save bad records for investigation / reprocessing
bad_records.write.mode("overwrite").json("s3://bucket/bad-records/")

# ── MODE 2: DROPMALFORMED — silently drop bad rows ────────────────────────────
df_clean = spark.read \
    .schema(schema) \
    .option("mode", "DROPMALFORMED") \
    .option("header", True) \
    .csv("data/vendor_feed.csv")
# Good for: non-critical data where a few bad rows are acceptable

# ── MODE 3: FAILFAST — crash on any bad row ───────────────────────────────────
df_strict = spark.read \
    .schema(schema) \
    .option("mode", "FAILFAST") \
    .option("header", True) \
    .csv("data/vendor_feed.csv")
# Good for: financial/compliance data where ANY bad row = stop everything

# ── BADRECORDSPATH — Databricks Runtime: auto-save bad records ───────────────
df = spark.read \
    .schema(schema) \
    .option("badRecordsPath", "s3://bucket/bad-records/") \
    .option("header", True) \
    .csv("data/vendor_feed.csv")
# Spark automatically writes all bad records + reasons to badRecordsPath
# Each bad record file: { "path": "...", "reason": "...", "record": "..." }

# ── JSON-SPECIFIC: multiline + corrupt handling ───────────────────────────────
df_json = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .option("multiLine", True) \
    .json("data/*.json")
# multiLine: for JSON arrays/objects spanning multiple lines

# ── PARQUET corrupt file handling ─────────────────────────────────────────────
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")   # skip corrupt files
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")   # skip missing files
df_parquet = spark.read.parquet("s3://bucket/data/")
# Great for: reading historical S3 data where some files may be deleted/corrupt

# ── NULL HANDLING AFTER READ ──────────────────────────────────────────────────
# After PERMISSIVE read: types that failed cast → null
df_clean = df \
    .filter(col("id").isNotNull()) \
    .na.fill({"amount": 0.0, "name": "UNKNOWN"})
# isNotNull drops rows where id failed to parse; na.fill handles the other nulls
🧠 PERMISSIVE → default, investigate bad data, save separately
MODE DECISION
PERMISSIVE         → default; investigate bad data, save separately
DROPMALFORMED      → data quality issues are expected, non-critical pipeline
FAILFAST           → financial/audit pipelines where bad data → stop everything
badRecordsPath     → production pipelines (auto-quarantine bad records)
ignoreCorruptFiles → reading from S3/HDFS where files may disappear
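A common production refinement on top of PERMISSIVE (a sketch — the 1% threshold and the function name are illustrative assumptions, not Spark API): tolerate a small share of bad rows, but fail loudly when the feed is broken:

```python
def check_bad_ratio(good_count, bad_count, max_bad_ratio=0.01):
    """Raise if corrupt records exceed the tolerated share of the feed."""
    total = good_count + bad_count
    if total == 0:
        raise ValueError("no records read")
    ratio = bad_count / total
    if ratio > max_bad_ratio:
        raise RuntimeError(
            f"bad-record ratio {ratio:.1%} exceeds threshold {max_bad_ratio:.1%}"
        )
    return ratio

# e.g. after a PERMISSIVE read that split good/bad records:
# check_bad_ratio(good_records.count(), bad_records.count())
```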
33 Recursively read all files from a directory tree
Problem: You have files organized like:
/data/2024/01/01/part-001.parquet
/data/2024/01/02/part-001.parquet
/data/2024/02/01/part-001.parquet
How do you read ALL files recursively in one DataFrame?
python — editable
# ── METHOD 1: recursiveFileLookup (Spark 3.0+) ────────────────────────────────
# Reads ALL files recursively under the root path — ignores partition structure
df = spark.read \
    .option("recursiveFileLookup", "true") \
    .parquet("s3://bucket/data/")
# Reads every .parquet file in the entire /data/ tree recursively

# ── METHOD 2: Glob patterns (most flexible) ───────────────────────────────────
# All files 2 levels deep:
df = spark.read.parquet("s3://bucket/data/*/*/")

# All files at any depth (globs don't recurse arbitrarily — use recursiveFileLookup):
df = spark.read.option("recursiveFileLookup", "true").parquet("s3://bucket/data/")

# Specific year range:
df = spark.read.parquet("s3://bucket/data/2024/*/")

# Multiple specific months:
df = spark.read.parquet(
    "s3://bucket/data/2024/01/",
    "s3://bucket/data/2024/02/",
    "s3://bucket/data/2024/03/"
)

# ── METHOD 3: pathGlobFilter — read only specific file types ──────────────────
# In a mixed directory (CSVs + Parquets + JSONs), read only CSVs:
df = spark.read \
    .option("pathGlobFilter", "*.csv") \
    .option("recursiveFileLookup", "true") \
    .csv("s3://bucket/mixed-data/")

# Read only files matching a date pattern:
df = spark.read \
    .option("pathGlobFilter", "2024-01-*.parquet") \
    .parquet("s3://bucket/data/")

# ── METHOD 4: modifiedAfter / modifiedBefore — time-based filtering ───────────
# Read only files modified within a given window (Spark 3.1+):
df = spark.read \
    .option("modifiedAfter", "2024-01-01T00:00:00") \
    .option("modifiedBefore", "2024-01-02T00:00:00") \
    .option("recursiveFileLookup", "true") \
    .parquet("s3://bucket/data/")

# ── METHOD 5: Python glob → collect paths → pass to Spark ────────────────────
import boto3, os

# For S3: list all parquet files recursively
s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "data/2024/"
all_paths = [
    f"s3://{bucket}/{obj['Key']}"
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
    if obj["Key"].endswith(".parquet")
]

df = spark.read.parquet(*all_paths)

# For local filesystem:
import glob
all_paths = glob.glob("/data/**/*.parquet", recursive=True)  # recursive=True is key!
df = spark.read.parquet(*all_paths)

# ── ADD SOURCE FILE TRACKING ──────────────────────────────────────────────────
from pyspark.sql.functions import input_file_name

df = spark.read \
    .option("recursiveFileLookup", "true") \
    .parquet("s3://bucket/data/") \
    .withColumn("source_file", input_file_name())

df.show(truncate=False)
# Each row knows exactly which file it came from

# ── VALIDATE WHAT WAS READ ────────────────────────────────────────────────────
df.select("source_file").distinct().show(100, truncate=False)
# Shows all unique file paths that contributed to the DataFrame
🧠 Memory Map
DECISION TABLE
Scenario                               → Method
────────────────────────────────────────────────────────────────────
All files under root (same schema)     → recursiveFileLookup=true
All files matching date pattern        → glob: "path/2024/*/"
Only .csv files in mixed directory     → pathGlobFilter="*.csv"
Files from last N hours only           → modifiedAfter option
Know exact file paths (dynamic list)   → spark.read.parquet(*paths)
Track which file each row came from    → input_file_name()
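With `source_file` attached, the `/data/YYYY/MM/DD/` layout from the problem can be parsed back into date parts. A plain-Python sketch of the regex (in Spark the same pattern would go into `regexp_extract`):

```python
import re

# Matches the /data/YYYY/MM/DD/ layout from the problem statement
PATH_RE = re.compile(r"/data/(\d{4})/(\d{2})/(\d{2})/")

def date_parts(path):
    """Return (year, month, day) strings from a path, or None if no match."""
    m = PATH_RE.search(path)
    return m.groups() if m else None

print(date_parts("/data/2024/01/02/part-001.parquet"))  # ('2024', '01', '02')
```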
34 OOM during groupBy on 500M rows
Problem: groupBy("country").agg(collect_list("event")) on 500M rows
causes executor OOM. How do you fix it?
python — editable
# ── DIAGNOSE THE PROBLEM ──────────────────────────────────────────────────────
# collect_list() = collects ALL values per key into memory on ONE executor
# If "US" has 400M events → collect_list creates 400M element list in 1 executor
# That's the OOM culprit, not groupBy itself

# ── SOLUTION 1: Don't use collect_list — aggregate instead ────────────────────
# Bad (OOM):
df.groupBy("country").agg(collect_list("event"))

# Good (if you need count/sum/stats, not the full list):
from pyspark.sql.functions import count, countDistinct, sum
df.groupBy("country").agg(
    count("event").alias("event_count"),
    countDistinct("event").alias("unique_events")
)

# ── SOLUTION 2: Cap the number of values collected per key ────────────────────
from pyspark.sql.functions import collect_list, col

# Sample first N values per key (avoid unbounded collection)
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

w = Window.partitionBy("country").orderBy("event_time")
df_limited = df.withColumn("rn", row_number().over(w)) \
               .filter(col("rn") <= 1000)   # max 1000 events per country

result = df_limited.groupBy("country").agg(collect_list("event").alias("top_events"))

# ── SOLUTION 3: Increase shuffle partitions to reduce per-partition size ───────
spark.conf.set("spark.sql.shuffle.partitions", "2000")  # more, smaller partitions
spark.conf.set("spark.executor.memory", "16g")
# More partitions = less data per task = less memory per task

# ── SOLUTION 4: Pre-filter to reduce data volume ─────────────────────────────
# Before the groupBy, eliminate data you don't need
df_filtered = df.filter(col("date") >= "2024-01-01") \
                .filter(col("is_valid") == True) \
                .select("country", "event", "event_time")   # column pruning
result = df_filtered.groupBy("country").agg(count("event"))

# ── SOLUTION 5: Use aggregateByKey (RDD) for full control ─────────────────────
# When DataFrame API can't express what you need
rdd = df.select("country", "event").rdd.map(lambda r: (r.country, r.event))

# Pre-aggregate locally: build set of unique events per partition

result_rdd = rdd.aggregateByKey(
    set(),                                           # initial accumulator = empty set
    lambda acc, val: acc | {val},                    # within partition: union set
    lambda acc1, acc2: acc1 | acc2                   # cross partition: union sets
)
result = spark.createDataFrame(
    result_rdd.map(lambda x: (x[0], list(x[1]))),
    ["country", "unique_events"]
)

# ── SOLUTION 6: Enable AQE skew handling ──────────────────────────────────────
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "3")
# Note: AQE skew handling splits skewed JOIN partitions; for a skewed groupBy
# key like "US", rely on Solutions 1-4 (one key can't be split across reducers)
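The two merge functions passed to `aggregateByKey` in Solution 5 have simple set semantics; a plain-Python check (sets stand in for the per-partition accumulators):

```python
# seqOp: fold one value into a partition-local accumulator
seq_op = lambda acc, val: acc | {val}
# combOp: merge accumulators coming from different partitions
comb_op = lambda acc1, acc2: acc1 | acc2

# Two partitions' worth of events for the same key "US":
part1, part2 = ["click", "view", "click"], ["view", "purchase"]

acc1, acc2 = set(), set()
for v in part1:
    acc1 = seq_op(acc1, v)
for v in part2:
    acc2 = seq_op(acc2, v)

print(sorted(comb_op(acc1, acc2)))  # ['click', 'purchase', 'view']
```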
35 Job works on 10GB weekdays, fails on 500GB Mondays
Problem: Daily ETL runs fine Mon-Fri on ~10 GB. Every Monday after the
weekend accumulation it processes ~500 GB and crashes.
How do you design this to handle both cases?
python — editable
# ── ROOT CAUSE: Static configs tuned for 10 GB, fail at 500 GB ────────────────

# ── SOLUTION 1: Enable AQE (auto-adapts at runtime) ───────────────────────────
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# AQE merges tiny partitions on small days, splits large ones on big days
# shuffle.partitions can be set to 2000 and AQE will coalesce on small days

# ── SOLUTION 2: Dynamic Allocation (scale executors to data) ───────────────────
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.minExecutors", "5")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "200")   # scale up for Monday
spark.conf.set("spark.shuffle.service.enabled", "true")
# On 10 GB day: ~5 executors. On 500 GB Monday: auto-scales to 100+ executors

# ── SOLUTION 3: Data-driven shuffle partition calculation ─────────────────────
# Before running the main job, estimate data size and set partitions
def estimate_partitions(spark, path, target_mb=128):
    """Estimate optimal partition count from input data size."""
    file_sizes = spark.sparkContext._jvm.org.apache.hadoop.fs \
        .FileSystem.get(spark.sparkContext._jvm.java.net.URI.create(path),
                        spark.sparkContext._jsc.hadoopConfiguration()) \
        .getContentSummary(spark.sparkContext._jvm.org.apache.hadoop.fs.Path(path)) \
        .getLength()
    size_mb = file_sizes / (1024 * 1024)
    partitions = max(200, int(size_mb / target_mb))
    return partitions

n_parts = estimate_partitions(spark, "s3://bucket/data/")
spark.conf.set("spark.sql.shuffle.partitions", str(n_parts))

# ── SOLUTION 4: Add checkpoint mid-job (survive retries on large runs) ─────────
spark.sparkContext.setCheckpointDir("s3://bucket/checkpoints/")

df = spark.read.parquet("s3://bucket/data/")

# After expensive operation: checkpoint to cut lineage + save progress
after_join = df.join(reference, "id").filter(...)
after_join.cache()
after_join.checkpoint()  # writes to S3 — if job fails, restart from here
after_join.unpersist()

# ── SOLUTION 5: Partition the input table by date for pruning ─────────────────
# Write data partitioned by date so Monday read only reads weekend partition
df.write \
    .partitionBy("year", "month", "day") \
    .mode("append") \
    .parquet("s3://bucket/data-partitioned/")

# Monday read: only reads Sat+Sun partitions (not entire history)
df_weekend = spark.read.parquet("s3://bucket/data-partitioned/") \
    .filter(col("year") == 2024) \
    .filter(col("month") == 1) \
    .filter(col("day").isin(6, 7))   # Saturday + Sunday only

# ── COMPLETE ROBUST ETL PATTERN ───────────────────────────────────────────────
from datetime import datetime, timedelta
from pyspark.sql.functions import col, sum

def run_etl(spark, date_str):
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.dynamicAllocation.enabled", "true")
    spark.conf.set("spark.dynamicAllocation.maxExecutors", "200")
    spark.conf.set("spark.sql.shuffle.partitions", "2000")  # AQE coalesces down

    df = spark.read.parquet(f"s3://bucket/data/date={date_str}/") \
        .filter(col("is_valid") == True) \
        .select("id", "amount", "category")

    result = df.groupBy("category") \
               .agg(sum("amount").alias("total"))

    result.write.mode("overwrite") \
          .parquet(f"s3://bucket/output/date={date_str}/")
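The value of Solution 3 shows up in the numbers: the same formula yields a small partition count on a 10 GB weekday and a large one on a 500 GB Monday. A plain-Python check of the math (the floor of 200 mirrors `estimate_partitions` above):

```python
def partitions_for(size_gb, target_mb=128, floor=200):
    """Data-driven shuffle partition count with a minimum floor."""
    return max(floor, int((size_gb * 1024) / target_mb))

print(partitions_for(10))   # 200  — weekday: 10 GB needs only 80, floor applies
print(partitions_for(500))  # 4000 — Monday
```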
36 Schema evolution — new column added to source
Problem: Your daily Parquet data source added a new column "discount_pct"
starting 2024-03-01. Older files don't have it. Reading all
historical data fails with schema mismatch. How to handle?
python — editable
# ── PROBLEM: Reading old + new files fails with schema mismatch ───────────────
# old files: (id, amount, category)
# new files: (id, amount, category, discount_pct)   ← new column
# spark.read.parquet("all/") → AnalysisException: schema mismatch

# ── SOLUTION 1: mergeSchema option (Parquet + Delta native) ───────────────────
df = spark.read \
    .option("mergeSchema", "true") \
    .parquet("s3://bucket/data/")
# mergeSchema unions all file schemas; columns missing in a file read as null
# Old file rows: discount_pct = null
# New file rows: discount_pct = actual value

# ── SOLUTION 2: unionByName with allowMissingColumns ──────────────────────────
old_df = spark.read.parquet("s3://bucket/data/before-2024-03-01/")
new_df = spark.read.parquet("s3://bucket/data/from-2024-03-01/")

combined = old_df.unionByName(new_df, allowMissingColumns=True)
# old_df rows: discount_pct = null (added automatically)
# new_df rows: discount_pct = actual value

# ── SOLUTION 3: Define explicit schema that includes all columns ───────────────
from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType

full_schema = StructType([
    StructField("id",            LongType(),   nullable=True),
    StructField("amount",        DoubleType(), nullable=True),
    StructField("category",      StringType(), nullable=True),
    StructField("discount_pct",  DoubleType(), nullable=True)  # new col, nullable
])

# Read old files with explicit schema → missing columns filled with null
old_df = spark.read.schema(full_schema).parquet("s3://bucket/data/before-2024-03-01/")
# No error — schema is applied, discount_pct = null for all old rows

# ── SOLUTION 4: Delta Lake (best for production schema evolution) ─────────────
# When writing:
new_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://bucket/delta-table/")
# mergeSchema on write lets the append add new columns to the table schema

# Or set globally:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# ── HANDLE NULL in new column after merge ─────────────────────────────────────
from pyspark.sql.functions import coalesce, lit, col

combined = combined.withColumn(
    "discount_pct",
    coalesce(col("discount_pct"), lit(0.0))   # treat missing as 0% discount
)
37 How to estimate the right number of partitions for a job
Problem: You're given a 500 GB dataset to aggregate. How do you decide
how many partitions to use? Walk through the math.
Company: Asked at Databricks, Google
python — editable
# ── THE FORMULA ───────────────────────────────────────────────────────────────
# Target partition size: 100-200 MB per partition (sweet spot)
# Too small: task scheduling overhead + small file problem
# Too large: OOM risk, low parallelism

# FORMULA: n_partitions = ceil(data_size_MB / target_partition_MB)
# For 500 GB: 500,000 MB / 128 MB = ~3,900 partitions → set to 4000

data_size_gb    = 500
target_part_mb  = 128
n_partitions    = int((data_size_gb * 1024) / target_part_mb)   # 4000

spark.conf.set("spark.sql.shuffle.partitions", str(n_partitions))

# ── ALSO: SET INPUT PARTITION SIZE ────────────────────────────────────────────
# Controls how Spark splits input files into tasks
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB

# ── CHECK CURRENT PARTITION COUNT ─────────────────────────────────────────────
df = spark.read.parquet("s3://bucket/500gb-data/")
print(f"Input partitions: {df.rdd.getNumPartitions()}")
# Should be ~3900-4000 for 500 GB at 128 MB per partition

# ── VERIFY PARTITION BALANCE ──────────────────────────────────────────────────
from pyspark.sql.functions import spark_partition_id, count

df.withColumn("pid", spark_partition_id()) \
  .groupBy("pid") \
  .agg(count("*").alias("rows_in_partition")) \
  .describe("rows_in_partition") \
  .show()
# describe() reports mean, stddev, min, max of rows per partition
# Healthy: stddev/mean < 0.3 (relatively even)
# Skewed:  max >> mean (one partition has many more rows)

# ── CORES vs PARTITIONS RELATIONSHIP ─────────────────────────────────────────
# Rule: partitions should be a MULTIPLE of total cores
# 50 executors × 4 cores = 200 total cores
# Good partition counts: 200, 400, 800, 1600, 4000 (multiples of 200)
# This ensures all cores stay busy (no idle cores waiting)

total_cores = 50 * 4   # 200
# Round partition count to nearest multiple of total_cores
import math
n_partitions_adjusted = math.ceil(n_partitions / total_cores) * total_cores  # 4000
spark.conf.set("spark.sql.shuffle.partitions", str(n_partitions_adjusted))

# ── LET AQE DO FINAL TUNING ───────────────────────────────────────────────────
# Set partitions to a safe HIGH number → AQE coalesces down to right size
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "134217728")  # 128 MB
spark.conf.set("spark.sql.shuffle.partitions", "8000")  # AQE will reduce this

# INTERVIEW SUMMARY:
# 1. Formula: data_MB / 128 MB = partition count
# 2. Round to multiple of total executor cores
# 3. Set input maxPartitionBytes = same target (128 MB)
# 4. Enable AQE to auto-tune at runtime
# 5. Verify with describe() on partition sizes
38 Avoid recomputing an expensive DataFrame 3 times
Problem: You build a DataFrame with an expensive join + aggregation.
You then use it for: (1) write to Parquet, (2) send to a report,
(3) compute anomalies. How to avoid computing it 3 times?
python — editable
# ── PROBLEM: Without caching ─────────────────────────────────────────────────
# expensive join + aggregation:
expensive_df = raw.join(reference, "id") \
                  .groupBy("region") \
                  .agg(sum("amount").alias("total"))

# Each action below triggers a FULL recomputation of expensive_df:
expensive_df.write.parquet("s3://output/")                # compute #1
expensive_df.filter(col("total") > 1e6).show()            # compute #2
anomalies = expensive_df.filter(col("total") > avg_total) # compute #3

# ── SOLUTION: Cache before first use ──────────────────────────────────────────
expensive_df = raw.join(reference, "id") \
                  .groupBy("region") \
                  .agg(sum("amount").alias("total"))

# Cache (materialize into executor memory/disk)
expensive_df.cache()
expensive_df.count()   # TRIGGER the cache — forces computation NOW, stores result

# Now all 3 uses hit the cache, not recompute:
expensive_df.write.parquet("s3://output/")                # reads from cache
expensive_df.filter(col("total") > 1e6).show()            # reads from cache
anomalies = expensive_df.filter(col("total") > 1000000)   # reads from cache

# Release cache when done
expensive_df.unpersist()

# ── CHOOSE STORAGE LEVEL ──────────────────────────────────────────────────────
from pyspark.storagelevel import StorageLevel

# If expensive_df fits in memory:
expensive_df.persist(StorageLevel.MEMORY_ONLY)        # fastest reads

# If it's large or memory is limited:
expensive_df.persist(StorageLevel.MEMORY_AND_DISK)    # spills to disk if needed

# If you want to be safe on a shared cluster:
expensive_df.persist(StorageLevel.DISK_ONLY)          # always on disk (slowest)

# ── ALTERNATIVE: CHECKPOINT (if lineage is long) ──────────────────────────────
spark.sparkContext.setCheckpointDir("s3://bucket/checkpoints/")

expensive_df.cache()
expensive_df.count()       # trigger cache
expensive_df.checkpoint()  # write to S3, cut lineage (extra safety)
expensive_df.unpersist()   # release cache (checkpoint is the source now)

# Now: expensive_df reads from S3 checkpoint on reuse
# Advantage: survives executor failures (no recompute from original data)

# ── WRITE ONCE, READ MULTIPLE TIMES PATTERN (most reliable) ────────────────────
# For production: materialize to Parquet, read it back multiple times
expensive_df.write.mode("overwrite").parquet("s3://tmp/expensive-result/")
materialized = spark.read.parquet("s3://tmp/expensive-result/")

materialized.write.parquet("s3://output/")
materialized.filter(col("total") > 1e6).show()
anomalies = materialized.filter(col("total") > 1000000)
# Trade-off: extra write cost → but most resilient, works across sessions
39 Read only specific file types from a mixed directory
Problem: A directory contains .csv, .parquet, .json and .tmp files mixed.
Read ONLY the .csv files recursively without reading the others.
python — editable
# ── METHOD 1: pathGlobFilter (Spark 3.0+) ─────────────────────────────────────
df = spark.read \
    .option("pathGlobFilter", "*.csv") \
    .option("recursiveFileLookup", "true") \
    .option("header", "true") \
    .csv("s3://bucket/mixed-directory/")
# pathGlobFilter keeps only .csv files; recursiveFileLookup walks all subdirectories

# Multiple extensions (Hadoop glob syntax):
df = spark.read \
    .option("pathGlobFilter", "*.{csv,tsv}") \
    .option("recursiveFileLookup", "true") \
    .csv("s3://bucket/mixed-directory/")

# ── METHOD 2: Glob pattern in path ────────────────────────────────────────────
# All csv files any depth:
df = spark.read.csv("s3://bucket/mixed-directory/**/*.csv",
                    header=True)

# All csvs matching date pattern:
df = spark.read.csv("s3://bucket/data/2024-01-*.csv", header=True)

# ── METHOD 3: Manually list + filter using boto3 (S3) ────────────────────────
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "mixed-directory/"

response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
csv_paths = [
    f"s3://{bucket}/{obj['Key']}"
    for obj in response.get("Contents", [])
    if obj["Key"].endswith(".csv") and not obj["Key"].endswith(".tmp")
]

print(f"Found {len(csv_paths)} CSV files")
df = spark.read.option("header", True).csv(csv_paths)   # csv() takes a list of paths

# ── METHOD 4: Exclude patterns using Python filter ────────────────────────────
import glob, os

all_files = glob.glob("/data/**/*", recursive=True)
# Keep only .parquet files, exclude hidden files and .tmp
parquet_files = [
    f for f in all_files
    if f.endswith(".parquet")
    and not os.path.basename(f).startswith("_")   # exclude _SUCCESS, _metadata
    and not os.path.basename(f).startswith(".")    # exclude hidden files
]

df = spark.read.parquet(*parquet_files)

# ── VERIFY WHAT WAS READ ──────────────────────────────────────────────────────
from pyspark.sql.functions import input_file_name
df_with_source = df.withColumn("src", input_file_name())
df_with_source.select("src").distinct().show(20, truncate=False)
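The exclusion rules in Method 4 are easy to get subtly wrong, so it helps to factor them into one predicate that can be unit-tested (a sketch mirroring the comments above):

```python
import os

def keep_parquet(path):
    """Keep real .parquet data files; drop _SUCCESS/_metadata, hidden files, other types."""
    name = os.path.basename(path)
    return (path.endswith(".parquet")
            and not name.startswith("_")
            and not name.startswith("."))

files = [
    "/data/a/part-0.parquet",   # keep
    "/data/a/_SUCCESS",         # drop: metadata marker
    "/data/a/.part-0.parquet",  # drop: hidden
    "/data/a/log.json",         # drop: wrong type
]
print([f for f in files if keep_parquet(f)])  # ['/data/a/part-0.parquet']
```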
40 Handle late-arriving data in daily ETL
Problem: Your daily ETL runs at midnight. Occasionally, events from
the previous day arrive 2-3 hours late (after midnight).
These late records are missed by the daily partition load.
How do you handle this in a batch ETL pipeline?
python — editable
# ── STRATEGY 1: Reprocess last N days (simple, reliable) ──────────────────────
# Instead of processing only "today", always reprocess last 3 days
# Late data from yesterday will be picked up in today's run of "yesterday"

from datetime import datetime, timedelta
from pyspark.sql.functions import col

def run_etl_with_late_data(spark, run_date, lookback_days=3):
    """Process the last N days so late arrivals are picked up."""
    dates = [
        (run_date - timedelta(days=i)).strftime("%Y-%m-%d")
        for i in range(lookback_days)
    ]

    df = spark.read.parquet("s3://bucket/raw/") \
        .filter(col("event_date").isin(dates))   # read only the last N days

    # Overwrite ONLY the touched partitions so re-runs are idempotent.
    # Without dynamic partition overwrite, mode("overwrite") replaces the
    # whole table, not just these partitions:
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df.write \
        .partitionBy("event_date") \
        .mode("overwrite") \
        .parquet("s3://bucket/processed/")
    # e.g. Monday's run rewrites Sat+Sun+Mon partitions, fixing late Sat/Sun data
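To make the lookback window concrete, here is what the date list works out to for a hypothetical run date (pure datetime, no Spark needed):

```python
from datetime import date, timedelta

run_date = date(2024, 1, 15)   # hypothetical run date
lookback_days = 3

# Same comprehension as in the ETL function above
dates = [
    (run_date - timedelta(days=i)).strftime("%Y-%m-%d")
    for i in range(lookback_days)
]
print(dates)   # ['2024-01-15', '2024-01-14', '2024-01-13']
```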

# ── STRATEGY 2: Delta Lake MERGE (upsert late records) ────────────────────────
from delta.tables import DeltaTable

def upsert_late_records(spark, new_records_df):
    """Upsert new/late records into existing Delta table."""
    delta_table = DeltaTable.forPath(spark, "s3://bucket/delta/events/")

    delta_table.alias("target").merge(
        new_records_df.alias("source"),
        "target.event_id = source.event_id"   # match on unique event ID
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()
    # whenMatchedUpdateAll: update if the event already exists (idempotent)
    # whenNotMatchedInsertAll: insert if it is new

# ── STRATEGY 3: Separate late arrival table ────────────────────────────────────
# Detect late records at write time
from pyspark.sql.functions import current_timestamp, datediff, to_date, col

df = spark.read.parquet("s3://bucket/raw/today/") \
    .withColumn("processing_ts", current_timestamp()) \
    .withColumn("days_late",
        datediff(col("processing_ts"), col("event_date"))
    )

# Separate on-time vs late
on_time = df.filter(col("days_late") == 0)
late     = df.filter(col("days_late") > 0)

# Write on-time to regular partition
on_time.write.partitionBy("event_date").mode("append") \
       .parquet("s3://bucket/events/")

# Write late records to separate location for auditing + reprocessing
late.write.mode("append") \
    .parquet(f"s3://bucket/late-arrivals/{datetime.today().strftime('%Y-%m-%d')}/")

# ── STRATEGY 4: Watermark filter — ignore very old late data ──────────────────
from pyspark.sql.functions import current_date, date_sub

MAX_LATE_DAYS = 7   # SLA: data older than 7 days is rejected

df_filtered = df.filter(
    col("event_date") >= date_sub(current_date(), MAX_LATE_DAYS)
)

# Log how much data was rejected (cache first; each count() triggers a full job)
df.cache()
total    = df.count()
accepted = df_filtered.count()
rejected = total - accepted
print(f"Rejected {rejected} records older than {MAX_LATE_DAYS} days")

SECTION 6: AZURE CLOUD STORAGE + REAL-WORLD READING

💡 Interview Tip
Interviewers ask this to verify you've actually used PySpark in production on Azure. They don't expect you to memorize SAS tokens — they want to see you know the concepts.
41 Read from Azure Blob Storage (old storage / wasbs://)
Problem: You have data in Azure Blob Storage (classic, NOT Data Lake).
Container: "raw-data", Storage Account: "mystorageacct"
File: parquet/orders/2024/*.parquet
How do you read it in PySpark?
Company: Any company using Azure
python — editable
# Azure Blob Storage uses the wasbs:// (secure) or wasb:// protocol
# Format: wasbs://<container>@<storage-account>.blob.core.windows.net/<path>

# ── METHOD 1: Account Key (simplest, not for production) ─────────────────────
storage_account = "mystorageacct"
account_key     = "your_account_key_here"   # from Azure Portal → Access Keys

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    account_key
)

df = spark.read.parquet(
    "wasbs://raw-data@mystorageacct.blob.core.windows.net/parquet/orders/2024/"
)
df.show()

# ── METHOD 2: SAS Token (scoped access, time-limited) ────────────────────────
sas_token = "sv=2023-01-01&ss=b&srt=sco&sp=rl&..."   # from Azure Portal

spark.conf.set(
    f"fs.azure.sas.raw-data.{storage_account}.blob.core.windows.net",
    sas_token
)

df = spark.read.csv(
    "wasbs://raw-data@mystorageacct.blob.core.windows.net/csv/customers/",
    header=True,
    inferSchema=True
)

# ── METHOD 3: Service Principal / App Registration (production standard) ──────
# Used in Databricks / ADF / Synapse pipelines.
# NOTE: OAuth service-principal auth is implemented by the ABFS driver
# (abfss:// against the dfs endpoint; see Q42). The legacy wasb driver only
# supports account keys and SAS tokens. Modern Hadoop builds can read
# classic (non-hierarchical) accounts through the dfs endpoint too:
tenant_id     = "your-tenant-id"
client_id     = "your-app-client-id"
client_secret = "your-app-client-secret"

spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    client_id
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    client_secret
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
)

df = spark.read.parquet(
    "abfss://raw-data@mystorageacct.dfs.core.windows.net/parquet/orders/"
)
🧠 Memory Map
WASBS URL FORMAT
wasbs://<container-name>@<storage-account>.blob.core.windows.net/<path>
wasb:// = non-SSL (avoid in production)
wasbs:// = SSL encrypted (always use this)
AUTH OPTIONS (worst → best for production)
Account Key:       full access; a leaked key = disaster
SAS Token:         scoped + time-limited; better
Service Principal: App Registration in AAD; production standard
Managed Identity:  no secrets at all (Databricks/Synapse only)
42 Read from Azure Data Lake Storage Gen2 (ADLS Gen2 / abfss://)
Problem: Your company moved from Blob Storage to ADLS Gen2.
Storage Account: "datalakeprod", Container (filesystem): "silver"
Path: /processed/transactions/year=2024/
How do you read it? What changed from Blob Storage?
Company: Any company on Azure with modern data lake setup
python — editable
# ADLS Gen2 uses abfss:// (hierarchical namespace enabled on storage account)
# Format: abfss://<filesystem>@<storage-account>.dfs.core.windows.net/<path>
#
# KEY DIFFERENCE from Blob Storage:
#   Blob:  wasbs://<container>@<account>.blob.core.windows.net/
#   ADLS2: abfss://<filesystem>@<account>.dfs.core.windows.net/
#   ↑ "dfs" endpoint vs "blob" endpoint
#   ↑ abfss:// vs wasbs://

storage_account = "datalakeprod"
filesystem      = "silver"          # like a container in Blob, but hierarchical

# ── METHOD 1: Account Key ─────────────────────────────────────────────────────
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    "your-account-key"
)

df = spark.read.parquet(
    f"abfss://{filesystem}@{storage_account}.dfs.core.windows.net/processed/transactions/year=2024/"
)
df.show()

# ── METHOD 2: Service Principal (production) ──────────────────────────────────
tenant_id     = "your-tenant-id"
client_id     = "your-service-principal-client-id"
client_secret = "your-service-principal-secret"

spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    client_id
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    client_secret
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
)

# Read Parquet from ADLS Gen2
orders = spark.read.parquet(
    "abfss://silver@datalakeprod.dfs.core.windows.net/processed/transactions/"
)

# Read Delta table from ADLS Gen2
delta_df = spark.read.format("delta").load(
    "abfss://gold@datalakeprod.dfs.core.windows.net/curated/orders_final/"
)

# ── METHOD 3: Managed Identity (Databricks / Synapse — NO SECRETS) ─────────────
# With managed-identity access configured at the workspace/cluster level
# (by an admin), no spark.conf is needed in the notebook; Azure handles auth:

df = spark.read.parquet(
    "abfss://silver@datalakeprod.dfs.core.windows.net/processed/transactions/"
)
# Works automatically when Managed Identity is assigned to Databricks workspace

# ── READ MULTIPLE FORMATS FROM ADLS GEN2 ─────────────────────────────────────
base = "abfss://silver@datalakeprod.dfs.core.windows.net"

# CSV
customers = spark.read.option("header", True).csv(f"{base}/raw/customers/")

# JSON
events = spark.read.option("multiLine", True).json(f"{base}/raw/events/")

# Parquet (partitioned)
transactions = spark.read.parquet(f"{base}/processed/transactions/year=2024/month=*/")

# Delta
gold_layer = spark.read.format("delta").load(f"{base}/gold/fact_orders/")

# ORC
hive_data = spark.read.orc(f"{base}/hive-warehouse/sales/")
Feature                | Blob Storage           | ADLS Gen2
Protocol               | wasbs://               | abfss://
Endpoint               | .blob.core.windows.net | .dfs.core.windows.net
Hierarchical namespace | No (flat)              | Yes (folder structure)
ACLs (fine-grained)    | No                     | Yes (POSIX-style ACLs)
Big data performance   | Lower                  | Higher (optimized I/O)
Use for                | General blob store     | Data Lake / Delta Lake
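The two URL schemes in the table differ only in scheme and endpoint, so a small helper (hypothetical, for illustration) can build either:

```python
def azure_path(account, container, path, gen2=True):
    """Build a wasbs:// (classic Blob) or abfss:// (ADLS Gen2) URL."""
    scheme, endpoint = ("abfss", "dfs") if gen2 else ("wasbs", "blob")
    return (f"{scheme}://{container}@{account}"
            f".{endpoint}.core.windows.net/{path.lstrip('/')}")

print(azure_path("datalakeprod", "silver", "/processed/transactions/"))
# abfss://silver@datalakeprod.dfs.core.windows.net/processed/transactions/
```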
43 Databricks: Mount ADLS Gen2 and read without long paths
Problem: Your team keeps repeating the full abfss:// path everywhere.
How do you mount Azure storage in Databricks so notebooks can
use simple paths like /mnt/silver/processed/ instead?
Company: Any Databricks on Azure shop
python — editable
# ── STEP 1: MOUNT THE STORAGE (run once, persists across cluster restarts) ────
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id":     dbutils.secrets.get("kv-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-scope", "sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{dbutils.secrets.get('kv-scope','tenant-id')}/oauth2/token"
}

# Mount the "silver" filesystem to /mnt/silver
dbutils.fs.mount(
    source      = "abfss://silver@datalakeprod.dfs.core.windows.net/",
    mount_point = "/mnt/silver",
    extra_configs = configs
)

# Mount the "gold" filesystem to /mnt/gold
dbutils.fs.mount(
    source      = "abfss://gold@datalakeprod.dfs.core.windows.net/",
    mount_point = "/mnt/gold",
    extra_configs = configs
)

# ── STEP 2: READ USING SIMPLE /mnt/ PATHS (in any notebook) ──────────────────
# Instead of: abfss://silver@datalakeprod.dfs.core.windows.net/processed/orders/
# Just use:
df = spark.read.parquet("/mnt/silver/processed/orders/")
customers = spark.read.option("header", True).csv("/mnt/silver/raw/customers/")
gold_df   = spark.read.format("delta").load("/mnt/gold/curated/fact_sales/")

# ── STEP 3: LIST AND MANAGE MOUNTS ───────────────────────────────────────────
# List all current mounts
display(dbutils.fs.mounts())

# Check if mount exists before mounting (avoid error on duplicate)
mounted = any(m.mountPoint == "/mnt/silver" for m in dbutils.fs.mounts())
if not mounted:
    dbutils.fs.mount(source="abfss://silver@...", mount_point="/mnt/silver",
                     extra_configs=configs)

# List files in mounted path
dbutils.fs.ls("/mnt/silver/processed/orders/")

# Unmount when needed
dbutils.fs.unmount("/mnt/silver")
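The check-before-mount pattern above can be factored into a small helper (hypothetical name; it takes the result of `dbutils.fs.mounts()` plus a zero-argument mount function, so it stays testable outside Databricks):

```python
def ensure_mounted(mount_point, mounts, mount_fn):
    """Run mount_fn() only if mount_point is not already mounted."""
    if any(m.mountPoint == mount_point for m in mounts):
        return False          # already mounted, nothing to do
    mount_fn()
    return True
```

In a notebook you would call it as `ensure_mounted("/mnt/silver", dbutils.fs.mounts(), mount_fn)` where `mount_fn` is a lambda wrapping the `dbutils.fs.mount` call.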

# ── USE SECRETS (NEVER HARDCODE CREDENTIALS) ──────────────────────────────────
# In Databricks, always use dbutils.secrets — never paste keys in notebooks
# Create secret scope linked to Azure Key Vault:
# dbutils.secrets.get(scope="my-kv-scope", key="storage-account-key")

# ── ALTERNATIVE: Unity Catalog External Location (modern Databricks) ──────────
# Instead of mounts, Unity Catalog manages storage access centrally:
# CREATE EXTERNAL LOCATION silver_lake
#   URL 'abfss://silver@datalakeprod.dfs.core.windows.net/'
#   WITH (STORAGE CREDENTIAL my_credential);
#
# Then read directly:
df = spark.read.parquet("abfss://silver@datalakeprod.dfs.core.windows.net/data/")
# Access controlled by Unity Catalog permissions, not per-notebook config
44 Read from Azure Synapse Analytics (SQL Pool) via PySpark
Problem: You need to join Spark data with a large table sitting in
Azure Synapse dedicated SQL pool. How do you read it into
a Spark DataFrame efficiently?
python — editable
# ── METHOD 1: JDBC (simple but slow — single thread, no parallelism) ──────────
synapse_url = (
    "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
    "database=mydb;encrypt=true;trustServerCertificate=false;"
    "hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30"
)

df = spark.read.format("jdbc") \
    .option("url", synapse_url) \
    .option("dbtable", "dbo.FactSales") \
    .option("user", "sqladminuser") \
    .option("password", dbutils.secrets.get("scope", "synapse-password")) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

# ⚠️ TRAP: Default JDBC = 1 partition = 1 thread = very slow for large tables
# Fix: Add parallel read config:
df = (spark.read.format("jdbc")
      .option("url", synapse_url)
      .option("dbtable", "dbo.FactSales")
      .option("user", "sqladminuser")
      .option("password", dbutils.secrets.get("scope", "synapse-password"))
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("numPartitions", "20")           # parallel reads
      .option("partitionColumn", "SaleID")     # split by this column
      .option("lowerBound", "1")
      .option("upperBound", "10000000")        # estimated max SaleID
      .load())
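What numPartitions/lowerBound/upperBound actually do: Spark turns the bounds into one WHERE clause per partition and runs each as a separate query. A simplified sketch (Spark's real column-partitioning logic also handles stride rounding; this is illustrative only):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Simplified version of Spark's JDBC range partitioning:
    stride-based WHERE clauses, with open-ended first/last partitions
    so rows outside [lower, upper) and NULLs are not lost."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            preds.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
    return preds

for p in jdbc_partition_predicates("SaleID", 1, 9, 4):
    print(p)
```

A bad upperBound does not drop rows; it only skews the partition sizes, because the edge partitions are open-ended.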

# ── METHOD 2: Synapse Connector for Databricks (PolyBase — fast!) ──────────────
# Azure Synapse Analytics connector uses PolyBase/COPY INTO via ADLS staging
# Much faster than JDBC for large tables (parallel bulk export)

df = spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", synapse_url) \
    .option("tempDir", "abfss://temp@datalakeprod.dfs.core.windows.net/staging/") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "dbo.FactSales") \
    .load()

# ── METHOD 3: Push query down (read only what you need) ───────────────────────
query = """
    (SELECT SaleID, CustomerID, Amount, SaleDate
     FROM dbo.FactSales
     WHERE SaleDate >= '2024-01-01'
     AND Region = 'APAC') AS filtered_sales
"""

df = (spark.read.format("jdbc")
      .option("url", synapse_url)
      .option("dbtable", query)     # push the filter to Synapse, don't pull all data
      .option("user", "sqladminuser")
      .option("password", dbutils.secrets.get("scope", "synapse-password"))
      .load())
45 Read from Azure Event Hubs / Kafka into PySpark (Structured Streaming)
Problem: You need to read real-time events from Azure Event Hubs
into a PySpark DataFrame. How do you set this up?
(They ask this to check if you know Event Hubs ≈ Kafka protocol)
python — editable
# Azure Event Hubs is Kafka-compatible → use Spark Kafka connector
# Event Hubs Kafka endpoint: <namespace>.servicebus.windows.net:9093

# ── CONNECTION STRING from Azure Portal → Event Hubs → Shared Access Policies
connection_string = dbutils.secrets.get("scope", "eventhub-connection-string")
# Format: "Endpoint=sb://ns.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."

# Build Kafka SASL config from Event Hubs connection string
SASL_CONFIG = (
    f'org.apache.kafka.common.security.plain.PlainLoginModule required '
    f'username="$ConnectionString" '
    f'password="{connection_string}";'
)
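The connection string is just semicolon-separated Key=Value pairs; a tiny (hypothetical) parser makes the pieces explicit, e.g. to pull the namespace out for the bootstrap server:

```python
def parse_eh_conn_string(conn):
    """Split an Event Hubs connection string into its Key=Value parts."""
    return dict(part.split("=", 1) for part in conn.rstrip(";").split(";") if part)

sample = ("Endpoint=sb://mynamespace.servicebus.windows.net/;"
          "SharedAccessKeyName=RootManageSharedAccessKey;"
          "SharedAccessKey=abc123=")
parts = parse_eh_conn_string(sample)
print(parts["Endpoint"])   # sb://mynamespace.servicebus.windows.net/
```

Note the `split("=", 1)`: access keys often end in `=` padding, so only the first `=` separates key from value.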

# ── READ AS STREAMING DATAFRAME ───────────────────────────────────────────────
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers",
            "mynamespace.servicebus.windows.net:9093")
    .option("subscribe", "my-event-hub-name")      # topic = event hub name
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", SASL_CONFIG)
    .option("startingOffsets", "latest")           # or "earliest"
    .load())

# ── PARSE THE MESSAGE (value is binary → deserialize) ────────────────────────
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

event_schema = StructType([
    StructField("event_id",   StringType(),    True),
    StructField("user_id",    StringType(),    True),
    StructField("amount",     DoubleType(),    True),
    StructField("event_time", TimestampType(), True)
])

parsed_df = stream_df \
    .select(
        col("key").cast("string").alias("partition_key"),
        from_json(col("value").cast("string"), event_schema).alias("data"),
        col("timestamp").alias("ingest_time")
    ) \
    .select("partition_key", "data.*", "ingest_time")

# ── WRITE STREAM TO ADLS GEN2 / DELTA ────────────────────────────────────────
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation",
            "abfss://checkpoints@datalake.dfs.core.windows.net/eventhub-stream/") \
    .start("abfss://silver@datalakeprod.dfs.core.windows.net/streaming/events/")

query.awaitTermination()

# ── READ AS BATCH (for one-time historical read) ──────────────────────────────
batch_df = (spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers",
            "mynamespace.servicebus.windows.net:9093")
    .option("subscribe", "my-event-hub-name")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", SASL_CONFIG)
    .option("startingOffsets", "earliest")    # read ALL historical events
    .option("endingOffsets", "latest")
    .load())
46 Full Azure Data Lake Architecture + PySpark read pattern (Medallion)
Problem: Describe how data flows in a typical Azure data lake setup
and how PySpark reads from each layer.
(High-level design question — tests real-world Azure experience)
📐 Architecture Diagram
MEDALLION ARCHITECTURE ON AZURE:

  External Sources
  (Blob, RDBMS, APIs, Event Hubs)
         ↓  [ADF / Databricks ingest]
  ┌─────────────────────────────────────┐
  │  BRONZE Layer (raw, immutable)      │  ← abfss://bronze@datalake.dfs...
  │  Format: raw CSV/JSON/Parquet       │  ← partition by ingest_date
  │  No transformation, as-is           │
  └─────────────────────────────────────┘
         ↓  [Databricks / Synapse Spark]
  ┌─────────────────────────────────────┐
  │  SILVER Layer (cleaned, conformed)  │  ← abfss://silver@datalake.dfs...
  │  Format: Delta Lake                 │  ← partition by event_date
  │  Deduped, nulls handled, typed      │
  └─────────────────────────────────────┘
         ↓  [Databricks / Synapse Spark]
  ┌─────────────────────────────────────┐
  │  GOLD Layer (business aggregates)   │  ← abfss://gold@datalake.dfs...
  │  Format: Delta Lake                 │  ← partition by region/product
  │  Fact + Dimension tables, KPIs      │
  └─────────────────────────────────────┘
         ↓  [Synapse SQL Pool / Power BI]
  Dashboards / Reports / ML Models
python — editable
# ── READ PATTERN FOR EACH LAYER ───────────────────────────────────────────────

# Setup (done once per session in production via cluster config or Managed Identity)
spark.conf.set(
    "fs.azure.account.auth.type.datalakeprod.dfs.core.windows.net", "OAuth"
)
spark.conf.set(
    "fs.azure.account.oauth.provider.type.datalakeprod.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    "fs.azure.account.oauth2.client.id.datalakeprod.dfs.core.windows.net",
    dbutils.secrets.get("kv-scope", "sp-client-id")
)
spark.conf.set(
    "fs.azure.account.oauth2.client.secret.datalakeprod.dfs.core.windows.net",
    dbutils.secrets.get("kv-scope", "sp-secret")
)
spark.conf.set(
    "fs.azure.account.oauth2.client.endpoint.datalakeprod.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)

BASE = "abfss://{layer}@datalakeprod.dfs.core.windows.net"

# ── BRONZE: read raw files ─────────────────────────────────────────────────────
# PERMISSIVE mode keeps malformed rows; note the _corrupt column is only
# populated when you pass an explicit schema that includes that field
bronze_csv = spark.read \
    .option("header", True) \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt") \
    .csv(f"{BASE.format(layer='bronze')}/raw/orders/ingest_date=2024-01-15/")

# ── SILVER: read cleaned Delta ─────────────────────────────────────────────────
from pyspark.sql.functions import col

silver_orders = spark.read \
    .format("delta") \
    .load(f"{BASE.format(layer='silver')}/orders/") \
    .filter(col("event_date") >= "2024-01-01")  # partition pruning

# Time travel: audit/debugging
silver_yesterday = spark.read.format("delta") \
    .option("versionAsOf", 10) \
    .load(f"{BASE.format(layer='silver')}/orders/")

# ── GOLD: read aggregated Delta for reporting ──────────────────────────────────
gold_kpis = spark.read \
    .format("delta") \
    .load(f"{BASE.format(layer='gold')}/fact_daily_revenue/") \
    .filter(col("region") == "APAC")

# ── WRITE BACK TO SILVER (after transformation) ────────────────────────────────
from delta.tables import DeltaTable

DeltaTable.forPath(
    spark,
    f"{BASE.format(layer='silver')}/orders/"
).alias("target").merge(
    bronze_csv.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
Storage Type          | Protocol  | Endpoint suffix
Azure Blob (classic)  | wasbs://  | .blob.core.windows.net
ADLS Gen1             | adl://    | .azuredatalakestore.net
ADLS Gen2 (modern)    | abfss://  | .dfs.core.windows.net
Azure Files           | (SMB/NFS) | .file.core.windows.net
47 Add to tracking table

Update the master tracking table at the top of this file:

| Q41 | Read from Azure Blob Storage (wasbs://) | Azure | S | wasbs/account-key/SAS/SP | ✓ |
| Q42 | Read from ADLS Gen2 (abfss://) | Azure | S | abfss/OAuth/SP/MI | ✓ |
| Q43 | Mount ADLS Gen2 in Databricks | Databricks | S | dbutils.fs.mount/secrets | ✓ |
| Q44 | Read from Azure Synapse SQL Pool | Azure | S | JDBC/Synapse connector | ✓ |
| Q45 | Read from Azure Event Hubs (Kafka) | Azure | S | Kafka/Structured Streaming | ✓ |
| Q46 | Medallion architecture + full read pattern | Databricks | S | Bronze/Silver/Gold layers | ✓ |

HOW TO ADD YOUR OWN QUESTIONS

1. Add a row to the master tracking table (increment Q#, never reuse)
2. Add solved question in the relevant section
3. Tag with: company (if known), level B/M/H/S, topic
4. Share new questions; I'll classify and add full solutions
QUESTIONS TO CLASSIFY (add below, I'll solve + categorize)
→ (paste your questions here)