File 05: Performance Tuning & Production Systems
Level: Senior/Lead (10+ years) — Production debugging and system design
Focus: Spark UI debugging, Unity Catalog, Photon, CI/CD, cost management, governance
SECTION 1: SPARK UI DEBUGGING & EXECUTION PLANS
Q1: How do you read a Spark execution plan? Walk through a real example.
Answer:
from pyspark.sql.functions import col, sum as sum_  # alias avoids shadowing builtin sum

df = (
    spark.table("orders")
    .filter(col("date") == "2025-01-15")
    .join(spark.table("customers"), "customer_id")
    .groupBy("region")
    .agg(sum_("amount").alias("total_amount"))
)
df.explain(True)  # Prints Parsed, Analyzed, Optimized, and Physical plans
Reading the plan (BOTTOM UP): execution starts at the leaf FileScan nodes and flows upward, so read from the bottom to follow the data.
Key things to look for:
| Element | Meaning |
|---|---|
| Exchange | SHUFFLE — stage boundary, expensive |
| BroadcastHashJoin | Small table broadcast — good |
| SortMergeJoin | Both sides shuffled — check if broadcast is possible |
| PushedFilters | Predicate pushdown is working |
| FileScan ... PartitionFilters | Partition pruning is working |
| WholeStageCodegen (*) | Codegen active — good |
| HashAggregate (partial_sum) | Map-side combine — reduces shuffle data |
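The checklist above can be turned into a quick triage helper. A minimal sketch (plain Python; `audit_plan` is a hypothetical helper you feed the plan text captured from `explain()`):

```python
def audit_plan(plan_text: str) -> list[str]:
    # Flag the operators from the table above in a captured plan string
    findings = []
    if "Exchange" in plan_text:
        findings.append("shuffle present (stage boundary)")
    if "SortMergeJoin" in plan_text:
        findings.append("sort-merge join: check whether a broadcast is possible")
    if "BroadcastHashJoin" in plan_text:
        findings.append("broadcast hash join (good)")
    if "PushedFilters" not in plan_text:
        findings.append("no pushed filters: predicate pushdown may not be working")
    return findings
```

Running it against the plan of the query above immediately tells you whether the join shuffled both sides or broadcast the small one.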
Q2: What are the key metrics to check in Spark UI for performance issues?
Answer:
Q3: Scenario — You see a stage with 200 tasks, 199 complete in 30 seconds, 1 takes 40 minutes. Diagnose and fix.
Answer: Diagnosis: Data Skew
Step 1: Confirm skew
- Spark UI → Stage detail → Tasks tab
- Sort by Shuffle Read Size → the slow task reads 50 GB while others read 100 MB
- Or sort by Duration → one task at 40 min, median at 30 sec
Step 2: Identify the skewed key
# Find hot keys
df.groupBy("join_key").count().orderBy(col("count").desc()).show(20)
# If one key has 50M rows and others have 100 rows → that's the culprit
Step 3: Fix (in order of preference)
# Option 1: Enable AQE skew handling (easiest)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Option 2: Broadcast if the other side is small
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
# Option 3: Salt the skewed key
# (see File 02, Q5 for full implementation)
# Option 4: Isolate hot keys
hot_keys = ["key_with_50M_rows"]
hot = large_df.filter(col("key").isin(hot_keys)).join(broadcast(small_df), "key")
normal = large_df.filter(~col("key").isin(hot_keys)).join(small_df, "key")
result = hot.union(normal)
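Option 3 defers to File 02 for the full Spark implementation, but the core idea can be sketched in plain Python (helper names here are illustrative): the large side appends a random suffix to the hot key so its rows spread across several buckets, while the small side is replicated once per suffix so every salted key still finds a match:

```python
import random

def salt_key(key: str, n_salts: int = 8) -> str:
    # Large side: append a random suffix so one hot key maps to n_salts buckets
    return f"{key}_{random.randrange(n_salts)}"

def explode_small_side(rows, n_salts: int = 8):
    # Small side: replicate each (key, value) once per suffix so joins still match
    return [(f"{key}_{i}", value) for key, value in rows for i in range(n_salts)]
```

After the salted join, drop the suffix (or aggregate) to recover the original key.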
Q4: How do you determine the optimal number of partitions?
Answer:
Q5: How do you handle the small file problem?
Answer:
-- Detect: Check file count and sizes
DESCRIBE DETAIL my_table; -- numFiles, sizeInBytes
-- Fix 1: Compact existing files
OPTIMIZE my_table;
-- Fix 2: Prevent future small files
ALTER TABLE my_table SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true', -- Coalesce on write
'delta.autoOptimize.autoCompact' = 'true' -- Auto-compact after writes
);
# Fix 3: Repartition before write (batch, PySpark)
df.repartition(10).write.format("delta").mode("append").saveAsTable("my_table")
# Fix 4: For streaming, increase the trigger interval (PySpark)
.trigger(processingTime="5 minutes")  # instead of "10 seconds"
-- Fix 5: Use Liquid Clustering for new tables
CREATE TABLE my_table (...) CLUSTER BY (col1, col2);
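To decide when Fix 1 is worth running, the numFiles and sizeInBytes values from DESCRIBE DETAIL can feed a simple heuristic. A hedged sketch (plain Python; the 128 MB target and the one-quarter threshold are illustrative choices, not Databricks defaults):

```python
def needs_compaction(num_files: int, size_in_bytes: int,
                     target_file_bytes: int = 128 * 1024 * 1024) -> bool:
    # Flag the table when the average file is well below the target file size
    avg = size_in_bytes / max(num_files, 1)
    return avg < target_file_bytes / 4
```

For example, 10,000 files totalling 1 GB average about 105 KB each, which flags the table for OPTIMIZE.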
Q6: What is the difference between spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes?
Answer:
| Config | Default | Purpose |
|---|---|---|
| maxPartitionBytes | 128 MB | Max size of a partition when reading files |
| openCostInBytes | 4 MB | Estimated "cost" of opening a file |
Key insight: openCostInBytes inflates the apparent size of small files. With 10,000 files of 1 KB each:
- With openCostInBytes=4 MB: each file is padded to ~4 MB during bin-packing → hundreds of partitions for ~10 MB of data (too many)
- With openCostInBytes=0: files are combined aggressively → far fewer partitions
spark.conf.set("spark.sql.files.openCostInBytes", "0")
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456")  # 256 MB
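The packing behaviour can be modelled to predict partition counts before touching the configs. A simplified sketch of Spark's bin-packing (plain Python; it ignores the bytes-per-core adjustment Spark also applies, so treat the result as an estimate):

```python
def estimate_partitions(num_files: int, file_bytes: int,
                        max_partition_bytes: int = 128 * 1024 * 1024,
                        open_cost_bytes: int = 4 * 1024 * 1024) -> int:
    # Each file is "padded" by open_cost_bytes, then files are packed into
    # bins of at most max_partition_bytes
    padded = file_bytes + open_cost_bytes
    files_per_partition = max(1, max_partition_bytes // padded)
    return -(-num_files // files_per_partition)  # ceiling division
```

With the defaults, 10,000 files of 1 KB land in a few hundred partitions; setting open_cost_bytes to 0 collapses them into one.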
SECTION 2: UNITY CATALOG
Q7: What is Unity Catalog? Explain the object hierarchy.
Answer: Unity Catalog = centralized governance for all data and AI assets in Databricks.
Three-level namespace:
SELECT * FROM catalog_name.schema_name.table_name;
Q8: How do you implement row-level and column-level security in Unity Catalog?
Answer:
-- COLUMN MASKING: Different users see different values
CREATE FUNCTION mask_ssn(ssn STRING)
RETURNS STRING
RETURN CASE
WHEN is_member('hr_team') THEN ssn
WHEN is_member('manager') THEN 'XXX-XX-' || right(ssn, 4)
ELSE '***-**-****'
END;
ALTER TABLE employees ALTER COLUMN ssn SET MASK mask_ssn;
-- ROW-LEVEL SECURITY: Users only see their own data
CREATE FUNCTION region_access(region STRING)
RETURNS BOOLEAN
RETURN CASE
WHEN is_member('admin') THEN true
WHEN is_member('us_team') THEN region = 'US'
WHEN is_member('eu_team') THEN region = 'EU'
ELSE false
END;
ALTER TABLE sales SET ROW FILTER region_access ON (region);
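Masking branches like these are easy to get subtly wrong, so it helps to mirror the CASE logic in plain Python for unit testing before deploying the SQL function (a sketch only: the groups set stands in for Databricks' is_member() checks):

```python
def mask_ssn(ssn: str, groups: set[str]) -> str:
    # Mirrors the SQL CASE: full value for HR, last four digits for
    # managers, fully masked for everyone else
    if "hr_team" in groups:
        return ssn
    if "manager" in groups:
        return "XXX-XX-" + ssn[-4:]
    return "***-**-****"
```

Note the branch order matters: a user in both hr_team and manager gets the unmasked value, exactly as the SQL CASE evaluates.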
Q9: How does data lineage work in Unity Catalog?
Answer: Unity Catalog automatically tracks lineage at the column level:
- Which tables feed into which tables
- Which columns are derived from which source columns
- Lineage captured from notebooks, jobs, DLT pipelines
- Viewable in the Catalog Explorer UI
Use cases:
- Impact analysis: "If I change column X, what downstream tables are affected?"
- Root cause analysis: "This gold table has wrong values — where does this column come from?"
- Compliance: "Show me all tables that process PII data"
- Data quality debugging: "Trace the source of these null values"
Q10: What are Storage Credentials and External Locations?
Answer:
-- Create storage credential (admin)
CREATE STORAGE CREDENTIAL my_s3_cred
WITH IAM_ROLE = 'arn:aws:iam::123456789:role/databricks-access';
-- Create external location
CREATE EXTERNAL LOCATION my_data_location
URL 's3://my-bucket/data/'
WITH (STORAGE CREDENTIAL my_s3_cred);
-- Create external table at that location
CREATE TABLE my_catalog.my_schema.orders
LOCATION 's3://my-bucket/data/orders/';
Q11: How do you migrate from Hive Metastore to Unity Catalog?
Answer:
SECTION 3: PHOTON ENGINE
Q12: What is the Photon engine? When does it help and when doesn't it?
Answer: Photon is Databricks' native C++ vectorized query engine that replaces Spark's JVM-based execution.
When it helps (significantly):
| Operation | Improvement |
|---|---|
| Table scans | Up to 12x faster |
| Aggregations | 3-10x faster |
| Joins (hash, sort-merge) | 3-8x faster |
| Filters | 5-10x faster |
| MERGE/UPDATE/DELETE | 3-5x faster |
| Delta Lake writes | 2-4x faster |
When it does NOT help:
| Operation | Why |
|---|---|
| Python UDFs | UDFs run in Python, bypassing Photon entirely |
| RDD-based operations | Photon only optimizes DataFrame/SQL |
| Very small data | Photon overhead not worth it for < 100 MB |
| ML training (MLlib) | Use GPU clusters instead |
| Unsupported operations | Falls back to Spark gracefully (no error) |
How to enable:
# Select Photon runtime when creating cluster (e.g., 14.3 LTS Photon)
# Or set:
spark.conf.set("spark.databricks.photon.enabled", "true")
# No code changes needed!
SECTION 4: DATABRICKS WORKFLOWS & ORCHESTRATION
Q13: Compare Databricks Workflows vs Apache Airflow. When would you use each?
Answer:
| Aspect | Databricks Workflows | Apache Airflow |
|---|---|---|
| Integration | Native to Databricks | External orchestrator |
| Setup | Zero setup | Requires deployment/maintenance |
| Task types | Notebook, Python, SQL, DLT, dbt, JAR | Any operator (cloud APIs, Spark, Python, etc.) |
| Repair/Retry | Built-in (re-run only failed tasks) | Task-level retry, no native repair |
| Multi-platform | Databricks only | Any cloud/tool (heterogeneous) |
| Cost | Included in Databricks | Separate infrastructure |
| Complexity | Simple DAGs | Complex DAGs, branching, sensors |
| Versioning | Databricks Asset Bundles | Git-synced DAGs |
Use Workflows when: the pipeline is Databricks-centric.
Use Airflow when: orchestrating across multiple platforms (Databricks + Snowflake + APIs + dbt).
Q14: What is the difference between Job Cluster and All-Purpose Cluster?
Answer:
| Aspect | Job Cluster | All-Purpose Cluster |
|---|---|---|
| Lifecycle | Created per job run, destroyed after | Long-running, manually managed |
| Cost | Cheaper (job compute pricing) | More expensive (all-purpose pricing) |
| Sharing | Single job | Multiple users/notebooks |
| Startup | Slower (cluster creation) | Instant (already running) |
| Use case | Production jobs | Development, ad-hoc queries |
| Auto-termination | After job completes | Configurable timeout |
Best practice: Always use Job Clusters for production. Use All-Purpose only for development.
Q15: How do you pass parameters between tasks in a Workflow?
Answer:
# Task 1: Set task values
dbutils.jobs.taskValues.set(key="record_count", value=15000)
dbutils.jobs.taskValues.set(key="max_date", value="2025-06-15")
dbutils.jobs.taskValues.set(key="status", value="success")
# Task 2: Get values from Task 1 (default= and debugValue= help in interactive runs)
count = dbutils.jobs.taskValues.get(taskKey="task_1", key="record_count", default=0)
max_date = dbutils.jobs.taskValues.get(taskKey="task_1", key="max_date")
status = dbutils.jobs.taskValues.get(taskKey="task_1", key="status")
# Conditional logic based on the upstream task
if count > 10000:
    # Process the large batch differently (run_large_batch_processing is illustrative)
    run_large_batch_processing()
Q16: What are Databricks Asset Bundles (DABs)? How do they enable CI/CD?
Answer: DABs define Databricks resources as YAML configuration that can be version-controlled and deployed across environments.
# databricks.yml
bundle:
  name: etl_pipeline

workspace:
  host: https://my-workspace.databricks.com

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 4
        - task_key: silver_transformation
          depends_on:
            - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/silver_transformation.py

targets:
  dev:
    workspace:
      host: https://dev-workspace.databricks.com
  staging:
    workspace:
      host: https://staging-workspace.databricks.com
  prod:
    workspace:
      host: https://prod-workspace.databricks.com
# Deploy to different environments
databricks bundle deploy --target dev
databricks bundle deploy --target staging
databricks bundle deploy --target prod
SECTION 5: COST MANAGEMENT
Q17: How do you manage costs in Databricks? What strategies have you used?
Answer:
Q18: How do you implement CI/CD for Databricks pipelines?
Answer:
Example GitHub Actions:
name: Deploy Databricks Pipeline

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate --target staging
      - run: databricks bundle deploy --target staging
      - run: databricks bundle run daily_etl --target staging  # Integration test
      - run: databricks bundle deploy --target prod
        if: github.ref == 'refs/heads/main'
SECTION 6: DATA GOVERNANCE AT SCALE
Q19: How do you handle data governance for a large organization?
Answer:
Q20: How do you handle GDPR "right to be forgotten" in Delta Lake?
Answer:
# Challenge: Delta Lake retains history. GDPR requires permanent deletion.
# Step 1: Delete the customer's data from all tables
customer_id_to_delete = "CUST-12345"
# Delete from all relevant tables
spark.sql(f"DELETE FROM dim_customer WHERE customer_id = '{customer_id_to_delete}'")
spark.sql(f"DELETE FROM fact_orders WHERE customer_id = '{customer_id_to_delete}'")
spark.sql(f"DELETE FROM silver_interactions WHERE customer_id = '{customer_id_to_delete}'")
# Step 2: VACUUM to physically remove old files (including pre-delete versions)
# Must disable the retention safety check for RETAIN 0 (re-enable it afterwards)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM dim_customer RETAIN 0 HOURS")
spark.sql("VACUUM fact_orders RETAIN 0 HOURS")
spark.sql("VACUUM silver_interactions RETAIN 0 HOURS")
# Step 3: Log the deletion for compliance
spark.sql(f"""
INSERT INTO gdpr_deletion_log VALUES
('{customer_id_to_delete}', current_timestamp(), 'completed', 'tables: dim_customer, fact_orders, silver_interactions')
""")
# Alternative: Use pseudonymization instead of deletion
# Replace PII with hashed/random values — preserves analytics while removing identity
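The pseudonymization alternative can be sketched as a keyed hash (plain Python; the secret key and 16-character truncation are illustrative choices). Without the key the mapping from pseudonym back to identity is not recoverable, while joins and aggregates on the pseudonymized column keep working:

```python
import hashlib
import hmac

def pseudonymize(value: str, secret: bytes) -> str:
    # Keyed hash (HMAC-SHA256): deterministic for a given key, so joins
    # still line up, but irreversible without the secret
    return hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()[:16]
```

In practice this would run inside the table rewrite (e.g. via a UDF in the UPDATE/MERGE), and destroying the key later is equivalent to erasure.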
SECTION 7: SYSTEM DESIGN (LEADERSHIP-LEVEL)
Q21: How would you design a data platform on Databricks for a company with 200+ data engineers?
Answer:
Q22: Scenario — You're brought in to rescue a failing Databricks migration. The team has been working for 6 months but pipelines are slow, data quality is poor, and costs are 3x budget. What do you do?
Answer:
Q23: How do you approach data modeling in a lakehouse?
Answer:
Q24: Scenario — A production pipeline that was running fine for 6 months suddenly takes 5x longer. Nothing changed in the code. What happened?
Answer:
Q25: What is Serverless in Databricks? How does it differ from standard compute?
Answer:
| Aspect | Standard Clusters | Serverless |
|---|---|---|
| Management | You manage cluster config | Databricks manages everything |
| Startup time | 3-10 minutes | <10 seconds |
| Scaling | Manual or autoscale config | Automatic |
| Billing | Running time (even idle) | Query/task execution time only |
| Cost control | Cluster policies | Automatic |
| Available for | Notebooks, jobs | SQL Warehouses, notebooks, jobs |
When to use Serverless:
- SQL analytics (SQL Warehouses)
- Bursty workloads (intermittent jobs)
- Development (fast startup, no idle costs)
- Teams without Spark expertise (no tuning needed)
When NOT to use:
- Heavy streaming (24/7 processing — predictable load)
- GPU workloads (ML training)
- Very large clusters with specific hardware requirements
Q26: Quick-fire: Common production issues and their solutions.
| Issue | Cause | Solution |
|---|---|---|
| ConcurrentModificationException | Concurrent writes to same table | Partition writes by time, use WAP pattern |
| Concurrent readers fail | VACUUM retention too short | Keep the default 7-day retention |
| Pipeline silently produces wrong data | No data quality checks | Add DLT expectations, row count validation |
| Costs spike on weekends | Developers leave clusters running | Enforce auto-termination in cluster policies |
| Query suddenly slow | Statistics outdated | Run ANALYZE TABLE ... COMPUTE STATISTICS |
| Streaming job keeps restarting | State store corruption | Delete checkpoint, reprocess from source |
| SchemaEvolutionException | Source schema changed | Use Auto Loader rescue mode (schemaEvolutionMode) |
| FileNotFoundException during query | VACUUM ran while query executing | Increase VACUUM retention |
| Cluster OOM on startup | Too many libraries | Use init scripts, minimize dependencies |
| Job works in dev, fails in prod | Different cluster config | Use Databricks Asset Bundles for consistency |