Databricks · Section 6 of 17

File 05: Performance Tuning & Production Systems

Level: Senior/Lead (10+ years) — Production debugging and system design
Focus: Spark UI debugging, Unity Catalog, Photon, CI/CD, cost management, governance

SECTION 1: SPARK UI DEBUGGING & EXECUTION PLANS

Q1: How do you read a Spark execution plan? Walk through a real example.

Answer:

python
from pyspark.sql.functions import col, sum  # note: shadows Python's built-in sum

df = spark.table("orders") \
    .filter(col("date") == "2025-01-15") \
    .join(spark.table("customers"), "customer_id") \
    .groupBy("region").agg(sum("amount"))

df.explain(True)

Reading the plan (BOTTOM UP):

== Physical Plan ==
*(3) HashAggregate(keys=[region], functions=[sum(amount)])                  ← Final aggregation
+- Exchange hashpartitioning(region, 200)                                   ← SHUFFLE (stage boundary)
   +- *(2) HashAggregate(keys=[region], functions=[partial_sum(amount)])    ← Partial (map-side) aggregation
      +- *(2) Project [region, amount]                                      ← Column pruning
         +- *(2) BroadcastHashJoin [customer_id], [customer_id]             ← JOIN (broadcast = no shuffle!)
            :- *(2) Filter (date = 2025-01-15)                              ← Predicate
            :  +- *(2) FileScan delta [customer_id, amount, date]           ← Table scan
            :       PushedFilters: [EqualTo(date, 2025-01-15)]              ← PREDICATE PUSHDOWN ✓
            +- BroadcastExchange                                            ← Customer table broadcast
               +- *(1) FileScan delta [customer_id, region]                 ← Customer table scan

Key things to look for:

Element | Meaning
Exchange | SHUFFLE — stage boundary, expensive
BroadcastHashJoin | Small table broadcast — good
SortMergeJoin | Both sides shuffled — check if broadcast possible
PushedFilters | Predicate pushdown is working
FileScan ... PartitionFilters | Partition pruning is working
WholeStageCodegen (*) | Codegen active — good
HashAggregate (partial_sum) | Map-side combine — reduces shuffle data

Q2: What are the key metrics to check in Spark UI for performance issues?

Answer:

🗂️Spark UI Debugging Checklist:
1. JOBS TAB:
Which job is slow? (match to your code)
2. STAGES TAB:
Sort by Duration — find the bottleneck stage
Shuffle Read/Write Size — large = expensive
Check Input/Output sizes — data amplification?
Spill (Memory/Disk) — if > 0, memory pressure
3. TASKS TAB (within slow stage):
Sort by Duration — look for outliers
Max Task Duration vs Median — if max >> median = DATA SKEW
GC Time — if > 10% of task time = memory pressure
Shuffle Read Size per task — should be roughly equal
Locality Level — PROCESS_LOCAL is best
4. SQL TAB:
Check physical plan — join strategies correct?
Rows output at each node — data amplification?
Time spent per operator
5. STORAGE TAB:
Cached RDDs — are they persisted as expected?
6. EXECUTORS TAB:
Active/Dead executors
Memory usage
Task counts — balanced distribution?

Q3: Scenario — You see a stage with 200 tasks, 199 complete in 30 seconds, 1 takes 40 minutes. Diagnose and fix.

Answer: Diagnosis: Data Skew

Step 1: Confirm skew

  • Spark UI → Stage detail → Tasks tab
  • Sort by Shuffle Read Size → the slow task reads 50 GB while others read 100 MB
  • Or sort by Duration → one task at 40 min, median at 30 sec
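The max-vs-median check can be scripted once task durations are exported (for example via the Spark UI's REST API). A minimal sketch in plain Python, assuming you already have the per-task durations as a list (function name and threshold are illustrative):

```python
from statistics import median

def skew_ratio(task_durations_sec):
    """Return max/median task duration; a ratio far above ~4-5x suggests data skew."""
    if not task_durations_sec:
        raise ValueError("no task durations")
    return max(task_durations_sec) / median(task_durations_sec)

# 199 tasks finish in ~30s, one straggler takes 40 minutes
durations = [30.0] * 199 + [2400.0]
print(f"skew ratio: {skew_ratio(durations):.0f}x")  # 80x -> almost certainly skew
```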

Step 2: Identify the skewed key

python
from pyspark.sql.functions import col

# Find hot keys
df.groupBy("join_key").count().orderBy(col("count").desc()).show(20)
# If one key has 50M rows and others have 100 rows → that's the culprit

Step 3: Fix (in order of preference)

python
from pyspark.sql.functions import broadcast, col

# Option 1: Enable AQE (easiest)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Option 2: Broadcast if other side is small
result = large_df.join(broadcast(small_df), "key")

# Option 3: Salt the skewed key
# (see File 02, Q5 for full implementation)

# Option 4: Isolate hot keys, broadcast-join them separately, then union
hot_keys = ["key_with_50M_rows"]
hot = large_df.filter(col("key").isin(hot_keys)).join(broadcast(small_df), "key")
normal = large_df.filter(~col("key").isin(hot_keys)).join(small_df, "key")
result = hot.union(normal)
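Option 3 (salting) can be illustrated without Spark: the hot key is split into N pseudo-keys so its rows spread across N tasks instead of one. A toy sketch in plain Python, where SALT_FACTOR and the key names are illustrative:

```python
import random
from collections import Counter

SALT_FACTOR = 8

def salt_key(key, hot_keys):
    """Append a random salt to hot keys so their rows hash to different partitions."""
    if key in hot_keys:
        return f"{key}_{random.randrange(SALT_FACTOR)}"
    return key  # normal keys stay unchanged

random.seed(42)
rows = ["hot"] * 80_000 + ["cold"] * 100
salted = Counter(salt_key(k, {"hot"}) for k in rows)
# "hot" is now spread over 8 salted variants of roughly 10K rows each;
# the small side of the join must be duplicated once per salt value.
print(salted.most_common(3))
```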

Q4: How do you determine the optimal number of partitions?

Answer:

🗂️Rule of thumb:
Each partition should be 128 MB - 200 MB
Number of partitions = total_data_size / target_partition_size
Minimum: 2x total cores (keep all cores busy)
Maximum: 100K partitions (scheduler overhead)
Examples:
10 GB data → 10 GB / 200 MB = 50 partitions
1 TB data → 1 TB / 200 MB = 5,000 partitions
100 GB shuffle → 100 GB / 200 MB = 500 partitions
Configuration:
spark.sql.shuffle.partitions = 200 (default, often too low)
With AQE: set high (e.g., 2000), let AQE coalesce
For file reads: spark.sql.files.maxPartitionBytes = 128 MB
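The rules above reduce to a small formula. A sketch in plain Python, assuming the 200 MB target, the 2x-cores floor, and the 100K cap from the checklist (helper name is mine, decimal units to match the examples):

```python
def recommended_partitions(total_bytes, total_cores, target_bytes=200_000_000):
    """Partitions = data size / target size, floored at 2x cores, capped at 100K."""
    by_size = max(1, total_bytes // target_bytes)
    return int(min(max(by_size, 2 * total_cores), 100_000))

GB = 1_000_000_000
print(recommended_partitions(10 * GB, 16))     # 50   (size-driven)
print(recommended_partitions(1_000 * GB, 64))  # 5000 (size-driven)
print(recommended_partitions(1 * GB, 32))      # 64   (floor: keep all cores busy)
```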

Q5: How do you handle the small file problem?

Answer:

sql
-- Detect: Check file count and sizes
DESCRIBE DETAIL my_table;  -- numFiles, sizeInBytes

-- Fix 1: Compact existing files
OPTIMIZE my_table;

-- Fix 2: Prevent future small files
ALTER TABLE my_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',  -- Coalesce on write
    'delta.autoOptimize.autoCompact' = 'true'     -- Auto-compact after writes
);
python
# Fix 3: Repartition before write (batch)
df.repartition(10).write.format("delta").mode("append").saveAsTable("my_table")

# Fix 4: For streaming, increase the trigger interval
#   .trigger(processingTime="5 minutes")  instead of "10 seconds"
sql
-- Fix 5: Use Liquid Clustering for new tables
CREATE TABLE my_table (...) CLUSTER BY (col1, col2);

Q6: What is the difference between spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes?

Answer:

Config | Default | Purpose
maxPartitionBytes | 128 MB | Max size of a partition when reading files
openCostInBytes | 4 MB | Estimated "cost" of opening a file

Key insight: openCostInBytes inflates the apparent size of small files. With 10,000 files of 1 KB each:

  • With openCostInBytes=4 MB: Each file appears as 4 MB → 10,000 partitions (too many!)
  • With openCostInBytes=0: Files combined aggressively → fewer partitions
python
spark.conf.set("spark.sql.files.openCostInBytes", "0")
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456")  # 256 MB
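A simplified model of why the padding matters. Real Spark's file packing also factors in parallelism, so treat this as a sketch that only bin-packs padded file sizes:

```python
def estimate_read_partitions(file_sizes, max_partition_bytes=128 * 1024**2,
                             open_cost_bytes=4 * 1024**2):
    """Greedy bin-packing of files into read partitions:
    each file is padded by open_cost_bytes before packing (Spark-style)."""
    partitions, current = 1, 0
    for size in file_sizes:
        padded = size + open_cost_bytes
        if current + padded > max_partition_bytes and current > 0:
            partitions += 1
            current = 0
        current += padded
    return partitions

small_files = [1024] * 10_000  # 10,000 files of 1 KB
print(estimate_read_partitions(small_files))                     # 323 with 4 MB open cost
print(estimate_read_partitions(small_files, open_cost_bytes=0))  # 1 with zero open cost
```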

SECTION 2: UNITY CATALOG

Q7: What is Unity Catalog? Explain the object hierarchy.

Answer: Unity Catalog = centralized governance for all data and AI assets in Databricks.

🗂️Metastore (top level — one per region)
Catalog (logical grouping)
Schema (like a database schema)
Table (managed or external)
View
Function
Model (ML models)
Volume (unstructured data — files, images, etc.)

Three-level namespace:

sql
SELECT * FROM catalog_name.schema_name.table_name;

Q8: How do you implement row-level and column-level security in Unity Catalog?

Answer:

sql
-- COLUMN MASKING: Different users see different values
CREATE FUNCTION mask_ssn(ssn STRING)
RETURN CASE
    WHEN is_member('hr_team') THEN ssn
    WHEN is_member('manager') THEN 'XXX-XX-' || right(ssn, 4)
    ELSE '***-**-****'
END;

ALTER TABLE employees ALTER COLUMN ssn SET MASK mask_ssn;

-- ROW-LEVEL SECURITY: Users only see their own data
CREATE FUNCTION region_access(region STRING)
RETURN CASE
    WHEN is_member('admin') THEN true
    WHEN is_member('us_team') THEN region = 'US'
    WHEN is_member('eu_team') THEN region = 'EU'
    ELSE false
END;

ALTER TABLE sales SET ROW FILTER region_access ON (region);
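The CASE logic inside mask_ssn can be unit-tested outside Databricks by mirroring it in plain Python. Here is_member is replaced by a simple group set, so this is a test double, not the runtime path:

```python
def mask_ssn(ssn: str, groups: set) -> str:
    """Python mirror of the SQL mask_ssn CASE expression, for unit testing."""
    if "hr_team" in groups:
        return ssn
    if "manager" in groups:
        return "XXX-XX-" + ssn[-4:]
    return "***-**-****"

print(mask_ssn("123-45-6789", {"hr_team"}))  # 123-45-6789
print(mask_ssn("123-45-6789", {"manager"}))  # XXX-XX-6789
print(mask_ssn("123-45-6789", {"analyst"}))  # ***-**-****
```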

Q9: How does data lineage work in Unity Catalog?

Answer: Unity Catalog automatically tracks lineage at the column level:

  • Which tables feed into which tables
  • Which columns are derived from which source columns
  • Lineage captured from notebooks, jobs, DLT pipelines
  • Viewable in the Catalog Explorer UI

Use cases:

  1. Impact analysis: "If I change column X, what downstream tables are affected?"
  2. Root cause analysis: "This gold table has wrong values — where does this column come from?"
  3. Compliance: "Show me all tables that process PII data"
  4. Data quality debugging: "Trace the source of these null values"

Q10: What are Storage Credentials and External Locations?

Answer:

🗂️Storage Credential → External Location → External Table
Storage Credential:
Encapsulates a cloud credential (IAM role, service principal)
Grants Databricks access to cloud storage
Created by account admin
External Location:
Maps a cloud storage PATH to a storage credential
Example: s3://my-bucket/data/ → uses credential X
Enables Unity Catalog to govern access to that path
Created by metastore admin
External Table:
Table stored at an external location
Unity Catalog governs access
Data survives table drop
sql
-- Create storage credential (admin)
CREATE STORAGE CREDENTIAL my_s3_cred
WITH IAM_ROLE = 'arn:aws:iam::123456789:role/databricks-access';

-- Create external location
CREATE EXTERNAL LOCATION my_data_location
URL 's3://my-bucket/data/'
WITH (STORAGE CREDENTIAL my_s3_cred);

-- Create external table at that location
CREATE TABLE my_catalog.my_schema.orders
LOCATION 's3://my-bucket/data/orders/';

Q11: How do you migrate from Hive Metastore to Unity Catalog?

Answer:

🗂️Migration Steps:
1. PREPARATION
Audit existing Hive tables (managed vs external)
Map Hive databases → UC catalogs/schemas
Review existing permissions
Plan naming conventions
2. INFRASTRUCTURE
Create metastore and assign to workspace
Create storage credentials for cloud storage
Create external locations
Create catalogs and schemas
3. TABLE MIGRATION
For managed tables: CREATE TABLE ... AS SELECT * FROM hive_metastore.db.table
For external tables: CREATE TABLE ... LOCATION 'existing_path'
Or use SYNC command (Databricks utility)
Validate row counts and checksums
4. PERMISSION MIGRATION
Map Hive privileges → UC GRANT statements
Apply row/column security where needed
Test with each user group
5. CODE UPDATES
Update all notebooks: db.table → catalog.schema.table
Update all jobs and pipelines
Update BI tool connections
6. VALIDATION & CUTOVER
Run parallel for 1-2 weeks
Compare query results
Deprecate Hive metastore references
Tool: Databricks UCX (Unity Catalog Migration) automates much of this.

SECTION 3: PHOTON ENGINE

Q12: What is the Photon engine? When does it help and when doesn't it?

Answer: Photon is Databricks' native C++ vectorized query engine that replaces Spark's JVM-based execution.

When it helps (significantly):

Operation | Improvement
Table scans | Up to 12x faster
Aggregations | 3-10x faster
Joins (hash, sort-merge) | 3-8x faster
Filters | 5-10x faster
MERGE/UPDATE/DELETE | 3-5x faster
Delta Lake writes | 2-4x faster

When it does NOT help:

Operation | Why
Python UDFs | UDFs run in Python, bypassing Photon entirely
RDD-based operations | Photon only optimizes DataFrame/SQL
Very small data | Photon overhead not worth it for < 100 MB
ML training (MLlib) | Use GPU clusters instead
Unsupported operations | Falls back to Spark gracefully (no error)

How to enable:

python
# Photon is selected at cluster creation: choose a Photon runtime
# (e.g., 14.3 LTS Photon) in the UI, or set runtime_engine = "PHOTON"
# in the Clusters API cluster spec.
# No code changes needed!

SECTION 4: DATABRICKS WORKFLOWS & ORCHESTRATION

Q13: Compare Databricks Workflows vs Apache Airflow. When would you use each?

Answer:

Aspect | Databricks Workflows | Apache Airflow
Integration | Native to Databricks | External orchestrator
Setup | Zero setup | Requires deployment/maintenance
Task types | Notebook, Python, SQL, DLT, dbt, JAR | Any operator (cloud APIs, Spark, Python, etc.)
Repair/Retry | Built-in (re-run only failed tasks) | Task-level retry, no native repair
Multi-platform | Databricks only | Any cloud/tool (heterogeneous)
Cost | Included in Databricks | Separate infrastructure
Complexity | Simple DAGs | Complex DAGs, branching, sensors
Versioning | Databricks Asset Bundles | Git-synced DAGs

Use Workflows when: the pipeline is Databricks-centric.
Use Airflow when: orchestrating across multiple platforms (Databricks + Snowflake + APIs + dbt).

Q14: What is the difference between Job Cluster and All-Purpose Cluster?

Answer:

Aspect | Job Cluster | All-Purpose Cluster
Lifecycle | Created per job run, destroyed after | Long-running, manually managed
Cost | Cheaper (job compute pricing) | More expensive (all-purpose pricing)
Sharing | Single job | Multiple users/notebooks
Startup | Slower (cluster creation) | Instant (already running)
Use case | Production jobs | Development, ad-hoc queries
Auto-termination | After job completes | Configurable timeout

Best practice: Always use Job Clusters for production. Use All-Purpose only for development.

Q15: How do you pass parameters between tasks in a Workflow?

Answer:

python
# Task 1: Set task values
dbutils.jobs.taskValues.set(key="record_count", value=15000)
dbutils.jobs.taskValues.set(key="max_date", value="2025-06-15")
dbutils.jobs.taskValues.set(key="status", value="success")

# Task 2: Get values from Task 1
count = dbutils.jobs.taskValues.get(taskKey="task_1", key="record_count")
max_date = dbutils.jobs.taskValues.get(taskKey="task_1", key="max_date")
status = dbutils.jobs.taskValues.get(taskKey="task_1", key="status")

# Conditional logic based on upstream task
if count > 10000:
    # Process large batch differently
    run_large_batch_processing()

Q16: What are Databricks Asset Bundles (DABs)? How do they enable CI/CD?

Answer: DABs define Databricks resources as YAML configuration that can be version-controlled and deployed across environments.

yaml
# databricks.yml
bundle:
  name: etl_pipeline

workspace:
  host: https://my-workspace.databricks.com

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 4
        - task_key: silver_transformation
          depends_on:
            - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/silver_transformation.py

targets:
  dev:
    workspace:
      host: https://dev-workspace.databricks.com
  staging:
    workspace:
      host: https://staging-workspace.databricks.com
  prod:
    workspace:
      host: https://prod-workspace.databricks.com
bash
# Deploy to different environments
databricks bundle deploy --target dev
databricks bundle deploy --target staging
databricks bundle deploy --target prod

SECTION 5: COST MANAGEMENT

Q17: How do you manage costs in Databricks? What strategies have you used?

Answer:

🗂️Cost Management Strategies:
1. COMPUTE OPTIMIZATION
Use Job Clusters (not All-Purpose) for production
Enable auto-termination (default: 120 min idle)
Use Spot/Preemptible instances (up to 90% savings)
Right-size clusters (analyze utilization metrics)
Use Serverless for bursty workloads (no idle cost)
Instance pools for faster startup (shared warm instances)
2. CLUSTER POLICIES (enforce governance)
Max nodes per cluster
Allowed instance types
Auto-termination minimum
Required tags (cost allocation)
Restrict to Photon runtimes (more efficient)
3. STORAGE OPTIMIZATION
VACUUM unused files (free storage)
OPTIMIZE (compact files, reduce I/O)
Use appropriate compression (ZSTD for Delta)
Lifecycle policies on raw data
Delete unnecessary table versions
4. QUERY OPTIMIZATION
Enable Photon (more queries per DBU)
Use SQL Warehouse for BI (auto-scale, auto-suspend)
Partition pruning + Z-ORDER (reduce scanned data)
Cache frequently accessed Gold tables
5. MONITORING & CHARGEBACK
Tag clusters and jobs by team/project
Use Databricks Account Console for cost analysis
Set up budget alerts
Implement showback/chargeback per team
Review weekly: unused clusters, oversized instances

Q18: How do you implement CI/CD for Databricks pipelines?

Answer:

🗂️CI/CD Pipeline:
1. SOURCE CONTROL
Databricks Repos (Git integration)
Notebooks, Python modules, SQL scripts in Git
Databricks Asset Bundles (YAML configs) in Git
2. CONTINUOUS INTEGRATION (on PR)
Run unit tests (pytest on local Spark or connect-mode)
Run linting (ruff, black, mypy)
Validate DAB configs: databricks bundle validate
Run integration tests on dev workspace
Code review
3. CONTINUOUS DEPLOYMENT
Dev: Auto-deploy on merge to develop branch
Staging: Deploy on merge to main, run integration tests
Prod: Manual approval gate, then deploy
Rollback: Redeploy previous DAB version
4. TESTING STRATEGIES
Unit tests: Mock Spark, test transformations
Integration tests: Nutter framework or pytest with real Spark
Data quality tests: Compare output vs expected
Performance tests: Benchmark on staging data

Example GitHub Actions:

yaml
name: Deploy Databricks Pipeline
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate --target staging
      - run: databricks bundle deploy --target staging
      - run: databricks bundle run daily_etl --target staging  # Integration test
      - run: databricks bundle deploy --target prod
        if: github.ref == 'refs/heads/main'

SECTION 6: DATA GOVERNANCE AT SCALE

Q19: How do you handle data governance for a large organization?

Answer:

🗂️Governance Framework:
1. ACCESS CONTROL (Unity Catalog)
Principle of least privilege
Three-level namespace for organization
Row-level and column-level security
Dynamic data masking for PII
Grant/Revoke at table, schema, catalog levels
2. DATA CLASSIFICATION
Tag tables and columns with sensitivity levels
PII, PHI, financial, public
Use tags to drive access policies
Auto-classification tools (e.g., Privacera)
3. AUDIT & COMPLIANCE
Unity Catalog audit logs (who accessed what, when)
Delta Lake history (DESCRIBE HISTORY for data changes)
Regulatory compliance (GDPR right to deletion, CCPA)
Retention policies (auto-delete after N years)
4. DATA QUALITY
DLT expectations at every layer
Data contracts between teams
Quality dashboards and SLA monitoring
Quarantine + alerting for bad data
5. DATA DISCOVERY
Unity Catalog search and browse
Column descriptions and table comments
Lineage visualization
Data dictionary / business glossary

Q20: How do you handle GDPR "right to be forgotten" in Delta Lake?

Answer:

python
# Challenge: Delta Lake retains history. GDPR requires permanent deletion.

# Step 1: Delete the customer's data from all tables
customer_id_to_delete = "CUST-12345"

# Delete from all relevant tables
spark.sql(f"DELETE FROM dim_customer WHERE customer_id = '{customer_id_to_delete}'")
spark.sql(f"DELETE FROM fact_orders WHERE customer_id = '{customer_id_to_delete}'")
spark.sql(f"DELETE FROM silver_interactions WHERE customer_id = '{customer_id_to_delete}'")

# Step 2: VACUUM to physically remove old files (including pre-delete versions)
# Must set retention to 0 for GDPR compliance
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM dim_customer RETAIN 0 HOURS")
spark.sql("VACUUM fact_orders RETAIN 0 HOURS")
spark.sql("VACUUM silver_interactions RETAIN 0 HOURS")

# Step 3: Log the deletion for compliance
spark.sql(f"""
    INSERT INTO gdpr_deletion_log VALUES
    ('{customer_id_to_delete}', current_timestamp(), 'completed', 'tables: dim_customer, fact_orders, silver_interactions')
""")

# Alternative: Use pseudonymization instead of deletion
# Replace PII with hashed/random values — preserves analytics while removing identity
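The pseudonymization alternative above can be sketched with a keyed hash: the same customer_id always maps to the same token (so joins and counts still work), but it cannot be reversed without the secret key. All names here are illustrative; in practice the key would live in a secret scope:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-store-in-a-secret-scope"  # placeholder secret

def pseudonymize(customer_id: str) -> str:
    """Deterministic keyed hash: stable for analytics, unlinkable without the key."""
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()[:16]

token = pseudonymize("CUST-12345")
assert pseudonymize("CUST-12345") == token  # deterministic -> joins still work
assert pseudonymize("CUST-99999") != token  # distinct customers stay distinct
print(token)
```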

SECTION 7: SYSTEM DESIGN (LEADERSHIP-LEVEL)

Q21: How would you design a data platform on Databricks for a company with 200+ data engineers?

Answer:

🗂️Platform Architecture:
1. WORKSPACE STRATEGY
Separate workspaces: Dev, Staging, Prod
Shared workspace for ad-hoc analytics
Each workspace connected to same Unity Catalog metastore
Network isolation between environments
2. CATALOG STRUCTURE (Unity Catalog)
catalog: raw (bronze layer, all domains)
catalog: curated (silver layer, per domain)
catalog: analytics (gold layer, cross-domain)
catalog: ml (feature store, models)
catalog: sandbox (dev experimentation)
3. COMPUTE GOVERNANCE
Cluster policies per team (max size, instance types)
Job clusters for production (enforced)
Serverless SQL Warehouses for BI
Instance pools for faster startup
Cost tags for chargeback
4. CI/CD
Git integration (Databricks Repos)
Databricks Asset Bundles for IaC
GitHub Actions / Azure DevOps pipelines
Automated testing framework (pytest + nutter)
Environment promotion: dev → staging → prod
5. DATA GOVERNANCE
Unity Catalog for all access control
Data classification and tagging
Automated PII detection
Audit logging and compliance reporting
Data quality framework (DLT expectations)
6. STANDARDS & BEST PRACTICES
Medallion architecture (enforced naming convention)
Shared Python libraries (company data framework)
Coding standards (linting, testing requirements)
Documentation templates
On-call rotation for production pipelines
7. COST MANAGEMENT
Budget alerts per team
Monthly cost reviews
Auto-termination policies
Spot instance usage targets (>70%)
Quarterly optimization sprints

Q22: Scenario — You're brought in to rescue a failing Databricks migration. The team has been working for 6 months but pipelines are slow, data quality is poor, and costs are 3x budget. What do you do?

Answer:

🗂️Week 1: ASSESS
Inventory all pipelines (notebook? workflow? DLT?)
Map data flows: source → bronze → silver → gold
Measure: pipeline durations, failure rates, data quality scores
Analyze costs: which clusters, which jobs, which teams
Interview team: what are the blockers?
Common findings:
No medallion architecture (random file locations)
All-purpose clusters for everything (3x cost)
No OPTIMIZE/VACUUM (small file explosion)
Python UDFs everywhere (10x slower than built-in)
No monitoring (failures go unnoticed)
Week 2-3: QUICK WINS (80/20 rule)
Switch production to Job Clusters (immediate cost savings)
Run OPTIMIZE + VACUUM on all tables
Enable auto-compaction and optimized writes
Replace top 10 slowest Python UDFs with built-in functions
Enable AQE cluster-wide
Set up Photon for heavy workloads
Set up basic monitoring (row counts, freshness)
Week 4-8: STRUCTURAL FIXES
Implement medallion architecture (standardize layers)
Move to Unity Catalog (centralize governance)
Implement DLT for critical pipelines (quality + lineage)
Set up CI/CD with Asset Bundles
Implement data quality framework
Create cluster policies (cost control)
Week 8-12: OPTIMIZATION
Performance tune top 20 pipelines
Implement incremental processing (replace full refreshes)
Implement SCD Type 2 where needed
Set up cost dashboards and chargeback
Document standards and train the team
Metrics to track:
Pipeline duration (target: <1 hour for daily pipelines)
Failure rate (target: <5%)
Data freshness SLA (target: 99.5%)
Monthly cost (target: back to budget)
Data quality score (target: >99% passing expectations)

Q23: How do you approach data modeling in a lakehouse?

Answer:

🗂️Lakehouse Data Modeling Strategy:
BRONZE: No modeling — raw as-is
Preserve source schema
Add metadata columns
Append-only
SILVER: Light modeling — conformed
Apply business key relationships
Standardize naming conventions
Deduplicate and validate
SCD Type 2 for dimensions
Keep relatively normalized
GOLD: Heavy modeling — optimized for use case
Star/snowflake schema for BI
Fact tables (measures, metrics)
Dimension tables (descriptive)
Bridge tables (many-to-many)
Wide/denormalized tables for ML
Feature tables (all attributes pre-joined)
Aggregation tables for dashboards
Pre-computed daily/hourly metrics
CLUSTER BY (most queried columns)
Key Principles:
1. Model for the consumer, not the source
2. Gold tables should be self-service (no complex joins)
3. Use naming conventions: fact_, dim_, agg_, feature_
4. Document business logic in table comments
5. Maintain backward compatibility (add columns, don't remove)

Q24: Scenario — A production pipeline that was running fine for 6 months suddenly takes 5x longer. Nothing changed in the code. What happened?

Answer:

🗂️Investigation Checklist:
1. DATA VOLUME
Did the source data size increase dramatically?
(Seasonal spike, backfill, duplicate events)
2. DATA SKEW
Did a new customer/product become very popular?
(One partition now 100x larger than others)
3. SMALL FILES ACCUMULATION
6 months of streaming writes without OPTIMIZE?
(10,000 small files → slow reads)
4. STATISTICS STALE
Spark using wrong join strategy (SortMerge instead of Broadcast)
because table grew past the broadcast threshold
5. CLUSTER CHANGES
Cloud provider changed instance types?
Cluster policy updated? Fewer nodes?
6. INFRASTRUCTURE
S3/ADLS throttling? Network issues?
Noisy neighbor on shared infrastructure?
7. UPSTREAM CHANGES
Source system changed data format?
New columns causing schema evolution?
Change in data distribution pattern?
8. DELTA TABLE HEALTH
Run: DESCRIBE DETAIL my_table
Check: numFiles (too many?), sizeInBytes (unexpected growth?)
Quick Fixes:
OPTIMIZE all affected tables
ANALYZE TABLE ... COMPUTE STATISTICS
Check and increase spark.sql.shuffle.partitions
Enable AQE if not already
Consider Liquid Clustering for affected tables
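Point 8's DESCRIBE DETAIL check reduces to simple arithmetic once you have numFiles and sizeInBytes. A sketch in plain Python; the flag threshold is my assumption, not a Databricks default:

```python
def small_file_report(num_files, size_in_bytes, target_file_mb=128):
    """Flag tables whose average file size is far below the OPTIMIZE target."""
    avg_mb = size_in_bytes / num_files / (1024**2)
    needs_optimize = avg_mb < target_file_mb / 8  # flag below ~16 MB average
    return round(avg_mb, 2), needs_optimize

# 50 GB spread over 40,000 files -> ~1.28 MB average: compact it
print(small_file_report(40_000, 50 * 1024**3))  # (1.28, True)
print(small_file_report(400, 50 * 1024**3))     # (128.0, False)
```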

Q25: What is Serverless in Databricks? How does it differ from standard compute?

Answer:

Aspect | Standard Clusters | Serverless
Management | You manage cluster config | Databricks manages everything
Startup time | 3-10 minutes | <10 seconds
Scaling | Manual or autoscale config | Automatic
Billing | Running time (even idle) | Query/task execution time only
Cost control | Cluster policies | Automatic
Available for | Notebooks, jobs | SQL Warehouses, notebooks, jobs

When to use Serverless:

  • SQL analytics (SQL Warehouses)
  • Bursty workloads (intermittent jobs)
  • Development (fast startup, no idle costs)
  • Teams without Spark expertise (no tuning needed)

When NOT to use:

  • Heavy streaming (24/7 processing — predictable load)
  • GPU workloads (ML training)
  • Very large clusters with specific hardware requirements

Q26: Quick-fire: Common production issues and their solutions.

Issue | Cause | Solution
ConcurrentModificationException | Concurrent writes to same table | Partition writes by time, use WAP pattern
Concurrent readers fail (time travel broken) | VACUUM retention too short | Keep default 7-day retention
Pipeline silently produces wrong data | No data quality checks | Add DLT expectations, row count validation
Costs spike on weekends | Developers leave clusters running | Enforce auto-termination in cluster policies
Query suddenly slow | Statistics outdated | Run ANALYZE TABLE COMPUTE STATISTICS
Streaming job keeps restarting | State store corruption | Delete checkpoint, reprocess from source
SchemaEvolutionException | Source schema changed | Use Auto Loader rescue mode
FileNotFoundException during query | VACUUM ran while query executing | Increase VACUUM retention
Cluster OOM on startup | Too many libraries | Use init scripts, minimize dependencies
Job works in dev, fails in prod | Different cluster config | Use Databricks Asset Bundles for consistency