Azure Databricks Interview Question Bank — All 3 Levels
Databricks · Section 17 of 17

TOPIC 1: DELTA LAKE (Internals, Transaction Log, ACID, MERGE, OPTIMIZE, VACUUM, Z-ORDER, Liquid Clustering)

L1 — Direct / Simple Questions

  1. What is Delta Lake and why was it created?
  2. What file format does Delta Lake use under the hood?
  3. What is the _delta_log directory and what does it contain?
  4. What are the four ACID properties and how does Delta Lake guarantee them?
  5. What is a checkpoint file in the Delta transaction log?
  6. What is schema enforcement in Delta Lake?
  7. What is schema evolution and how do you enable it?
  8. What is Time Travel in Delta Lake? How do you query an older version?
  9. What is the VACUUM command and what does it do?
  10. What is the default retention period for VACUUM?
  11. What does the OPTIMIZE command do?
  12. What is Z-ORDER and what problem does it solve?
  13. What are Deletion Vectors in Delta Lake?
  14. What is the difference between Delta Lake and Apache Parquet?
  15. What is the DESCRIBE HISTORY command used for?
  16. What is Change Data Feed (CDF) in Delta Lake?
  17. What is Liquid Clustering in Delta Lake?
  18. What is Predictive Optimization in Databricks?
  19. What are table constraints in Delta Lake (CHECK, NOT NULL)?
  20. What is the RESTORE command in Delta Lake?
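
Many of these L1 answers (transaction log, checkpoints, Time Travel, RESTORE) reduce to one idea: a Delta table's state at version N is whatever you get by replaying the _delta_log actions from version 0 through N. The following is a toy plain-Python model of that replay, for illustration only; it is not real Delta code, and the class and file names are made up.

```python
# Toy model of the Delta transaction log. Each commit is an atomic list of
# "add"/"remove" data-file actions; the table state at any version is
# reconstructed by replaying the log, which is also how Time Travel works.

class ToyDeltaLog:
    def __init__(self):
        self.commits = []  # commits[v] = list of actions for version v

    def commit(self, actions):
        """Append one atomic commit; returns the new version number."""
        self.commits.append(actions)
        return len(self.commits) - 1

    def files_at(self, version):
        """Replay the log up to `version` to get the live data files."""
        live = set()
        for actions in self.commits[: version + 1]:
            for op, path in actions:
                if op == "add":
                    live.add(path)
                elif op == "remove":
                    live.discard(path)
        return live


log = ToyDeltaLog()
log.commit([("add", "part-000.parquet")])            # version 0
log.commit([("add", "part-001.parquet")])            # version 1
log.commit([("remove", "part-000.parquet"),
            ("add", "part-002.parquet")])            # version 2 (file rewrite)

print(sorted(log.files_at(1)))  # ['part-000.parquet', 'part-001.parquet']
print(sorted(log.files_at(2)))  # ['part-001.parquet', 'part-002.parquet']
```

VACUUM fits the same picture: it physically deletes files that no retained version references (part-000.parquet above, once version 1 falls out of retention), which is exactly why Time Travel to versions older than the retention window fails after a VACUUM.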

L2 — Mid-Level Questions

  1. Explain the anatomy of a Delta Lake transaction — what happens when you write to a Delta table?
  2. How does optimistic concurrency control work in Delta Lake? What happens during write conflicts?
  3. Compare Z-ORDER vs Liquid Clustering — when would you use each?
  4. Explain data skipping in Delta Lake. How does it use min/max statistics?
  5. What is the difference between OPTIMIZE and VACUUM? Can you run them together?
  6. Explain how MERGE INTO works internally. What are the performance implications of a full table scan in MERGE?
  7. How does the Delta transaction log handle concurrent writes from multiple clusters?
  8. Compare schema enforcement vs schema evolution — give an example where each is appropriate.
  9. What happens if you run VACUUM with a retention of 0 hours? What are the risks?
  10. Explain the difference between OPTIMIZE WHERE and partition-level OPTIMIZE.
  11. How does Delta Lake handle small file compaction? What is the "small file problem"?
  12. Explain the difference between Copy-on-Write and Merge-on-Read in Delta Lake.
  13. How do Deletion Vectors improve UPDATE/DELETE performance compared to the traditional approach?
  14. What is the relationship between file statistics, data skipping, and Z-ORDER?
  15. Explain how Time Travel works internally — what is stored in each JSON commit file?
  16. Compare Change Data Feed (CDF) vs reading the transaction log directly for CDC.
  17. What is the difference between managed and external Delta tables?
  18. How does Liquid Clustering cluster data incrementally, unlike Z-ORDER, which requires rewriting the affected files?
  19. What are the trade-offs of over-partitioning a Delta table?
  20. Explain UniForm (Universal Format) in recent Delta Lake releases. Why does it matter for interoperability?
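
Questions 4 and 14 above share one mechanism worth being able to sketch on a whiteboard: every data file carries per-column min/max statistics, and the engine skips any file whose range provably cannot match the filter. Z-ORDER (and Liquid Clustering) earn their keep by making those ranges narrow. A toy plain-Python illustration, with invented file names and a single stats column:

```python
# Toy data skipping: keep only files whose [min, max] range for the filter
# column overlaps the query predicate. Files with disjoint ranges are skipped
# without being read -- the core of Delta's file-pruning.

files = [
    {"path": "f1.parquet", "min_id": 0,   "max_id": 99},
    {"path": "f2.parquet", "min_id": 100, "max_id": 199},
    {"path": "f3.parquet", "min_id": 200, "max_id": 299},
]

def files_to_scan(files, lo, hi):
    """Return the files whose stats range overlaps the filter [lo, hi]."""
    return [f["path"] for f in files
            if not (f["max_id"] < lo or f["min_id"] > hi)]

print(files_to_scan(files, 150, 160))  # ['f2.parquet'] -- 2 of 3 files skipped
```

The same sketch explains why clustering matters: if the id values were scattered randomly across files, every file's range would span 0-299 and nothing could be skipped.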

L3 — Scenario-Based Questions

  1. MERGE Optimization: Your MERGE INTO statement takes 45 minutes on a 2 TB Delta table. Walk me through how you would diagnose and optimize this.
  2. Transaction Log Corruption: A developer accidentally ran VACUUM with 0-hour retention and now Time Travel queries fail. What happened and how do you recover?
  3. Small File Problem: Your Bronze table has 50,000 small Parquet files (avg 2 MB each). How do you fix this and prevent it from recurring?
  4. Concurrent Writes: Two Databricks jobs write to the same Delta table simultaneously and one fails with a ConcurrentAppendException. Explain why and how you fix it.
  5. Z-ORDER Strategy: You have a 10 TB Delta table queried by country, date, and customer_id. Design the partitioning and Z-ORDER strategy.
  6. Liquid Clustering Migration: Your team wants to migrate from Z-ORDER to Liquid Clustering on a production table. What is your migration plan? Any risks?
  7. Schema Evolution Crisis: A source system added 5 new columns overnight and your streaming pipeline failed. How do you design for schema evolution in Auto Loader + Delta Lake?
  8. Time Travel for Audit: Your compliance team needs to prove what data looked like on a specific date 30 days ago. How do you implement this with Delta Lake Time Travel? What are the limitations?
  9. VACUUM vs Storage Costs: Your Delta table consumes 5x the expected storage due to retained old versions. Design a VACUUM strategy that balances cost vs Time Travel needs.
  10. CDC with Delta CDF: Design a pipeline where downstream consumers only process changed records from a Silver Delta table. How do you use Change Data Feed?
  11. Table Restore Scenario: A bad ETL job corrupted your Gold table at 3 AM. It is now 9 AM and 6 versions have been written since. Walk through the recovery process.
  12. Partition Evolution: Your table was partitioned by year/month/day but queries now filter primarily by region. How do you restructure without downtime?
  13. MERGE with SCD Type 2: Implement a MERGE strategy for SCD Type 2 on a customer dimension table where you need to close old records and insert new ones atomically.
  14. Delta Sharing: An external partner needs read access to a subset of your Delta table. How do you implement this securely using Delta Sharing?
  15. Deletion Vectors in Production: After enabling Deletion Vectors, read performance on certain queries degraded. Explain why and how you would resolve this.
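
For scenario 13, it helps to be able to state the SCD Type 2 semantics independently of syntax. In Databricks this is one atomic MERGE INTO; the sketch below spells out the close-old/insert-new logic in plain Python. The column names (is_current, valid_from, valid_to) are conventional but illustrative.

```python
# Plain-Python sketch of SCD Type 2: for each changed key, close the current
# row (set is_current=False and valid_to) and append a new current row.

def scd2_merge(dim, updates, as_of):
    changed = {u["key"]: u for u in updates}
    out = []
    for row in dim:
        if (row["is_current"] and row["key"] in changed
                and row["value"] != changed[row["key"]]["value"]):
            out.append({**row, "is_current": False, "valid_to": as_of})  # close old
        else:
            out.append(row)
    for key, u in changed.items():
        current = [r for r in dim if r["key"] == key and r["is_current"]]
        if not current or current[0]["value"] != u["value"]:
            out.append({"key": key, "value": u["value"], "is_current": True,
                        "valid_from": as_of, "valid_to": None})  # insert new
    return out

dim = [{"key": 1, "value": "Paris", "is_current": True,
        "valid_from": "2024-01-01", "valid_to": None}]
dim = scd2_merge(dim, [{"key": 1, "value": "Madrid"}], "2025-06-01")
print([(r["value"], r["is_current"]) for r in dim])
# [('Paris', False), ('Madrid', True)]
```

The interview follow-up is usually atomicity: in Delta Lake both operations commit as a single transaction, so readers never see a key with zero or two current rows.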

TOPIC 2: ETL PIPELINES (Medallion Architecture, SCD Type 2, CDC, Auto Loader, DLT/Lakeflow)

L1 — Direct / Simple Questions

  1. What is the Medallion Architecture (Bronze/Silver/Gold)?
  2. What is Auto Loader in Databricks?
  3. What is the difference between Auto Loader and COPY INTO?
  4. What is Delta Live Tables (DLT)?
  5. What is Lakeflow and how does it relate to DLT?
  6. What is SCD Type 1 vs SCD Type 2?
  7. What is Change Data Capture (CDC)?
  8. What is a streaming table vs a materialized view in DLT?
  9. What are DLT expectations (data quality constraints)?
  10. What is the difference between cloudFiles and spark.readStream on Delta?
  11. What is structured streaming in Databricks?
  12. What is a checkpoint in Spark Structured Streaming?
  13. What is the trigger mode availableNow vs processingTime?
  14. What is idempotency and why is it important in ETL pipelines?
  15. What is the difference between batch and streaming ETL?
  16. What is an ETL pipeline vs an ELT pipeline?
  17. What are the three DLT expectation actions: warn, drop, fail?
  18. What is the foreachBatch sink in Structured Streaming?
  19. What is event-time processing vs processing-time in streaming?
  20. What is watermarking in Spark Structured Streaming?
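
Questions 19 and 20 go together: the watermark trails the maximum event time seen so far by a configured delay, and events older than the watermark are treated as too late. The following is a deliberately simplified plain-Python model of that dropping behavior (real Spark applies watermarks to stateful operators, not to plain appends); the 10-minute delay is an arbitrary example.

```python
# Toy event-time watermark: watermark = max_event_time_seen - delay.
# Events whose event time falls behind the watermark are dropped as late.

WATERMARK_DELAY = 10 * 60  # seconds, illustrative

def process(events):
    """events: (event_time_seconds, payload) tuples, in arrival order."""
    max_event_time, accepted, dropped = 0, [], []
    for t, payload in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - WATERMARK_DELAY
        (accepted if t >= watermark else dropped).append(payload)
    return accepted, dropped

accepted, dropped = process([
    (1000, "a"), (1900, "b"),
    (100,  "late"),   # arrives after the watermark has advanced past it
    (2000, "c"),
])
print(accepted, dropped)  # ['a', 'b', 'c'] ['late']
```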

L2 — Mid-Level Questions

  1. Explain how Auto Loader's file notification mode works vs directory listing mode. When do you use each?
  2. How do you handle schema evolution with Auto Loader (cloudFiles.schemaEvolutionMode)?
  3. Compare Delta Live Tables (DLT) vs hand-coded Structured Streaming pipelines — trade-offs?
  4. Explain how to implement SCD Type 2 using MERGE INTO with Delta Lake. What are the key columns?
  5. How does DLT handle pipeline failures and retries? What is the concept of "idempotent recomputation"?
  6. Compare trigger(availableNow=True) vs trigger(processingTime='5 minutes') — when to use each?
  7. Explain the role of Bronze, Silver, and Gold layers in terms of data quality, latency, and consumers.
  8. How does watermarking work in Structured Streaming? What happens to late-arriving data?
  9. Compare CDC patterns: log-based CDC (Debezium/Kafka) vs query-based CDC vs timestamp-based CDC.
  10. How do you handle exactly-once semantics in a Databricks streaming pipeline?
  11. Explain foreachBatch — when would you use it over a standard Delta sink?
  12. How do DLT expectations compare to Great Expectations or other data quality frameworks?
  13. What are the different ways to orchestrate dependent DLT pipelines?
  14. Explain how Auto Loader handles file deduplication. What is the RocksDB state store?
  15. How do you test ETL pipelines in Databricks? What frameworks do you use?
  16. Explain the difference between a complete output mode, append mode, and update mode in streaming.
  17. How do you monitor and alert on streaming pipeline lag in Databricks?
  18. What is the APPLY CHANGES INTO syntax in DLT and when do you use it?
  19. How do you handle out-of-order events in a Medallion Architecture?
  20. Explain incremental data loading patterns: append-only vs upsert vs full refresh.
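
Questions 10 and 14 both come down to idempotency via persisted state: Auto Loader tracks already-ingested files in a RocksDB-backed state store inside the checkpoint, and exactly-once sinks deduplicate on a stable key so that replays after failure are harmless. A plain-Python sketch of the pattern, with a set standing in for the state store:

```python
# Toy idempotent sink: a persistent "seen keys" store makes reprocessing safe.
# In Databricks the store lives in the streaming checkpoint (RocksDB); here a
# set is a stand-in, and the booking_id key is an illustrative example.

class IdempotentSink:
    def __init__(self):
        self.seen = set()   # stands in for the checkpointed state store
        self.output = []

    def write(self, records, key):
        """Write each record at most once, keyed by a stable identifier."""
        for r in records:
            k = key(r)
            if k not in self.seen:
                self.seen.add(k)
                self.output.append(r)

sink = IdempotentSink()
batch = [{"booking_id": 1}, {"booking_id": 2}]
sink.write(batch, key=lambda r: r["booking_id"])
sink.write(batch, key=lambda r: r["booking_id"])   # replay after a failure
print(len(sink.output))  # 2 -- the retried duplicates were skipped
```

The interview nuance: "exactly-once" in Spark is really at-least-once delivery plus an idempotent or transactional sink, which is what this state makes possible.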

L3 — Scenario-Based Questions

  1. Oracle CDC Pipeline: Design a CDC pipeline from Oracle to Delta Lake for Amadeus booking data. Oracle does not support log-based CDC natively. What approach do you take?
  2. Late-Arriving Data: Flight booking amendments arrive 48 hours after the original booking. Design a pipeline that correctly handles these late-arriving events in the Medallion Architecture.
  3. SCD Type 2 at Scale: You need to maintain SCD Type 2 on a customer dimension table with 500M rows, receiving 2M updates daily. Design the MERGE strategy for performance.
  4. DLT Pipeline Failure: Your DLT pipeline fails at the Silver layer due to a data quality expectation violation at 2 AM. 50,000 records were dropped. How do you investigate, recover, and prevent recurrence?
  5. Auto Loader Schema Drift: A source system renames columns from camelCase to snake_case overnight. Your Auto Loader pipeline breaks. Design a resilient schema evolution strategy.
  6. Multi-Source Medallion: You have 20 source systems feeding Bronze. Some are batch (daily files), some are streaming (Kafka). Design the Medallion Architecture to unify them.
  7. Streaming Backpressure: Your streaming pipeline is processing 100K events/sec but the source is producing 500K events/sec. The lag keeps growing. How do you diagnose and fix this?
  8. GDPR Delete Pipeline: A GDPR deletion request arrives for a passenger. You need to delete their data across Bronze, Silver, and Gold layers in a Lakehouse. Design the process.
  9. Deduplication Strategy: Your Kafka source sends duplicate booking events. Design a deduplication strategy at the Bronze and Silver layers that guarantees exactly-once processing.
  10. Testing & Validation: How would you set up automated testing for a Databricks DLT pipeline? Include unit tests, integration tests, and data quality assertions.
  11. Hybrid Batch-Streaming: You need near-real-time dashboards (5-minute latency) but also end-of-day reconciliation reports. Design a single pipeline architecture.
  12. Multi-Hop Streaming: Design a streaming pipeline with three hops (Bronze->Silver->Gold) where each layer applies different transformations. How do you manage checkpoints and failure recovery?
  13. Slowly Changing Dimension with Deletes: Your source system sends hard deletes (records simply disappear). How do you detect and handle these in an SCD Type 2 pipeline?
  14. Cost-Efficient Ingestion: You ingest 10 TB/day of raw JSON from ADLS Gen2. Design an ingestion pipeline that minimizes compute cost while maintaining <15 min latency.
  15. Pipeline Dependency Management: You have 50 DLT pipelines with complex dependencies. Some must run sequentially, others can be parallel. How do you orchestrate this?
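
For scenario 2 (late-arriving amendments), the usual answer is a keyed upsert into Silver where the latest event timestamp wins, so out-of-order arrival cannot regress state. In Databricks this is a MERGE INTO with a timestamp comparison in the matched clause; below is a plain-Python sketch of just that rule, with invented field names.

```python
# Toy last-writer-wins upsert: keep the newest record per booking_id by event
# timestamp, so a late-but-newer amendment wins and a late-but-older duplicate
# is ignored. Not Spark code -- an illustration of the MERGE condition.

def upsert(table, batch):
    for rec in batch:
        cur = table.get(rec["booking_id"])
        if cur is None or rec["event_ts"] > cur["event_ts"]:
            table[rec["booking_id"]] = rec
    return table

silver = {}
upsert(silver, [{"booking_id": "B1", "event_ts": 100, "status": "booked"}])
upsert(silver, [{"booking_id": "B1", "event_ts": 50,  "status": "stale"}])    # older: ignored
upsert(silver, [{"booking_id": "B1", "event_ts": 300, "status": "amended"}])  # newer: wins
print(silver["B1"]["status"])  # amended
```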

TOPIC 3: AZURE PLATFORM & GOVERNANCE (Unity Catalog, Photon, Serverless, ADLS Gen2, GDPR)

L1 — Direct / Simple Questions

  1. What is Unity Catalog in Databricks?
  2. What is the three-level namespace in Unity Catalog (catalog.schema.table)?
  3. What is a metastore in Unity Catalog?
  4. What is the difference between managed and external tables in Unity Catalog?
  5. What is a storage credential in Unity Catalog?
  6. What is an external location in Unity Catalog?
  7. What is data lineage in Unity Catalog?
  8. What is Photon engine in Databricks?
  9. What is Serverless compute in Databricks?
  10. What is ADLS Gen2 and how does Databricks connect to it?
  11. What is the difference between a Databricks workspace and a metastore?
  12. What is row-level security in Unity Catalog?
  13. What is column masking in Unity Catalog?
  14. What is a service principal in Databricks on Azure?
  15. What is the difference between instance profiles and storage credentials?
  16. What are tags and labels in Unity Catalog for data classification?
  17. What is the system tables feature in Unity Catalog?
  18. What is Databricks SQL (DBSQL)?
  19. What is a SQL Warehouse (Serverless vs Pro vs Classic)?
  20. What is GDPR and what does it mean for data engineering?
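
Questions 12 and 13 share a mechanism: in Unity Catalog, row filters and column masks are SQL UDFs attached to a table, evaluated per query against the caller's group membership. The sketch below shows the masking logic in plain Python; the group name and redaction format are invented for the example.

```python
# Toy column mask: return the raw value only to a privileged group, a redacted
# form to everyone else. In Unity Catalog this logic would live in a SQL UDF
# attached to the column; "pii_auditors" is a hypothetical group name.

def mask_passport(value, user_groups):
    """Show the real value to auditors, a redacted form to everyone else."""
    if "pii_auditors" in user_groups:
        return value
    return "***" + value[-2:]  # keep last two characters for troubleshooting

print(mask_passport("X1234567", {"pii_auditors"}))  # X1234567
print(mask_passport("X1234567", {"analysts"}))      # ***67
```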

L2 — Mid-Level Questions

  1. Explain the Unity Catalog hierarchy: metastore -> catalog -> schema -> table/view/function. How do permissions cascade?
  2. Compare Unity Catalog vs legacy Hive Metastore — what are the key differences and migration challenges?
  3. How does Photon engine accelerate queries? What workloads benefit most from Photon?
  4. Compare Serverless SQL Warehouses vs Classic SQL Warehouses — cost, startup time, scaling.
  5. Explain how ADLS Gen2 integrates with Databricks — authentication methods (OAuth, service principals, access keys).
  6. How do you implement GDPR "Right to be Forgotten" in a Lakehouse architecture?
  7. Explain dynamic views in Unity Catalog for row-level and column-level security.
  8. How does Unity Catalog handle cross-workspace data sharing?
  9. Compare Azure Databricks vs Azure Synapse Analytics — when would you recommend each?
  10. Explain how audit logging works in Unity Catalog. What events are captured?
  11. How do you implement data classification (PII tagging) using Unity Catalog?
  12. What are the networking options for Databricks on Azure (VNet injection, Private Link, NSGs)?
  13. Explain the difference between account-level and workspace-level identity in Databricks.
  14. How do Unity Catalog's system tables help with cost monitoring and query auditing?
  15. Compare managed identity vs service principal vs access key for ADLS Gen2 access — pros/cons.
  16. Explain how Databricks handles encryption at rest and in transit on Azure.
  17. How do you design a multi-region Databricks deployment on Azure?
  18. What is the role of Azure Key Vault in Databricks? How do you manage secrets?
  19. Explain the difference between GRANT, DENY, and REVOKE in Unity Catalog's permission model.
  20. How does Unity Catalog's data lineage differ from tools like Apache Atlas or Purview?
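
The cascade in question 1 is simple to state: a privilege granted on a securable is inherited by every securable beneath it, so SELECT on a schema implies SELECT on all its tables. A plain-Python sketch of that lookup (illustration only; real Unity Catalog checks also involve ownership and USE privileges on parents):

```python
# Toy cascading-grant check: a privilege on any ancestor in the
# catalog.schema.table hierarchy covers the child securable.

def has_privilege(grants, principal, privilege, securable):
    """Check the securable itself and every dotted ancestor."""
    parts = securable.split(".")
    ancestors = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any((principal, privilege, a) in grants for a in ancestors)

grants = {("data_engineers", "SELECT", "main.sales")}  # schema-level grant

print(has_privilege(grants, "data_engineers", "SELECT", "main.sales.orders"))  # True
print(has_privilege(grants, "data_engineers", "SELECT", "main.hr.payroll"))    # False
```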

L3 — Scenario-Based Questions

  1. Unity Catalog Migration: Your organization has 500 tables in Hive Metastore across 3 workspaces. Design a migration plan to Unity Catalog with zero downtime.
  2. GDPR Compliance Pipeline: Amadeus handles passenger PII (names, passport numbers, emails) across 100 countries. Design a GDPR-compliant data architecture using Unity Catalog, column masking, and deletion pipelines.
  3. Multi-Team Governance: You have Data Engineering, Data Science, and BI teams sharing a single Databricks deployment. Design the Unity Catalog structure (catalogs, schemas, permissions) for proper isolation and collaboration.
  4. Cost Optimization: Your Azure Databricks bill is $150K/month. 60% is compute. Design a cost reduction strategy using Serverless, autoscaling, spot instances, and cluster policies.
  5. Secure External Sharing: A partner airline needs read access to specific Gold tables but must NOT see PII columns. Design this using Unity Catalog, Delta Sharing, and dynamic views.
  6. Photon Decision: Your team is deciding whether to enable Photon on all clusters. Some workloads are Python UDF-heavy, others are SQL-heavy. How do you evaluate and decide?
  7. Network Security: Your security team requires all Databricks traffic to stay within the Azure virtual network and never traverse the public internet. Design the network architecture.
  8. Disaster Recovery: Design a DR strategy for Databricks on Azure. RTO = 4 hours, RPO = 1 hour. Consider metastore, Delta tables, notebooks, and cluster configurations.
  9. Audit & Compliance: The compliance team needs a report showing who accessed PII data in the last 90 days, what queries they ran, and what data they exported. Design this using system tables.
  10. ADLS Gen2 Organization: You have 50 data products across 5 business domains. Design the ADLS Gen2 storage layout (containers, folders) and the corresponding Unity Catalog structure.
  11. Data Mesh on Databricks: Leadership wants to adopt a Data Mesh approach. How would you structure Unity Catalog catalogs, schemas, and ownership to enable domain-oriented data products?
  12. Cross-Cloud Access: A team in AWS needs to read data from your Azure Databricks Lakehouse. Design the architecture using Delta Sharing.
  13. PII Detection & Tagging: You inherit 2,000 tables with no documentation. Design an automated PII detection and tagging pipeline using Unity Catalog tags and Databricks notebooks.
  14. Serverless Migration: Your team runs 200 interactive clusters daily. The CFO wants to move to Serverless. What is your evaluation and migration plan?
  15. Regulatory Audit: A regulator asks you to prove data lineage from source (Oracle) to final report (Power BI). How do you use Unity Catalog lineage to demonstrate this end-to-end?
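
For scenario 13, a common first pass is rule-based: sample each column's values against regex patterns and tag columns whose match rate clears a threshold, then apply the tags via Unity Catalog. The sketch below shows only the detection step; the patterns and the 50% threshold are deliberately simple placeholders, and production pipelines add dictionaries, checksums, and human review.

```python
import re

# Toy PII detector: tag a column if enough sampled values match a PII pattern.
# Patterns here are illustrative, not production-grade.

PII_PATTERNS = {
    "email":    re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.I),
    "passport": re.compile(r"\b[A-Z]\d{7,8}\b"),
}

def detect_pii(column_samples, threshold=0.5):
    """Return tags whose pattern matches >= threshold of the sampled values."""
    tags = []
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(bool(pattern.search(v)) for v in column_samples)
        if column_samples and hits / len(column_samples) >= threshold:
            tags.append(tag)
    return tags

print(detect_pii(["a@example.com", "b@example.org", "n/a"]))  # ['email']
print(detect_pii(["12.50", "99.00"]))                         # []
```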

TOPIC 4: PRODUCTION & CI/CD (Workflows, Asset Bundles, Cost Management, Debugging)

L1 — Direct / Simple Questions

  1. What is Databricks Workflows (formerly Jobs)?
  2. What is the difference between a Task and a Job in Databricks Workflows?
  3. What are Databricks Asset Bundles (DABs)?
  4. What is the Databricks CLI?
  5. What is a job cluster vs an all-purpose (interactive) cluster?
  6. What is cluster autoscaling and how does it work?
  7. What is a cluster policy in Databricks?
  8. What are spot instances and how do they reduce cost?
  9. What is the Databricks REST API used for?
  10. What is a Databricks repo (Git integration)?
  11. What are Databricks Notebooks vs IDE-based development?
  12. What is the difference between a wheel file and a notebook task in a Workflow?
  13. What is the Ganglia UI / Spark UI used for in debugging?
  14. What is a driver log vs an executor log?
  15. What is the Databricks DBU (Databricks Unit) and how is pricing calculated?
  16. What are init scripts and when would you use them?
  17. What is a multi-task workflow (DAG) in Databricks?
  18. What is the dbutils library and what are its key modules?
  19. What are widgets in Databricks notebooks?
  20. What is the difference between %run and dbutils.notebook.run()?
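
For question 15, the formula matters more than the numbers: a Databricks bill has two components, DBU consumption (billed by Databricks, rate depends on compute type and tier) and the underlying Azure VM cost. The rates below are placeholders, not current list prices.

```python
# Back-of-the-envelope cost model:
# cost = nodes x hours x (DBU rate per node-hour x $/DBU) + nodes x hours x VM $/hour

def job_cost(nodes, hours, dbu_per_node_hour, usd_per_dbu, vm_usd_per_hour):
    dbu_cost = nodes * hours * dbu_per_node_hour * usd_per_dbu
    vm_cost = nodes * hours * vm_usd_per_hour
    return round(dbu_cost + vm_cost, 2)

# e.g. 8 workers for 2 hours at 1.5 DBU/node-hour, $0.30/DBU, $0.50/hr VMs
# (all rates illustrative):
print(job_cost(8, 2, 1.5, 0.30, 0.50))  # 15.2
```

This split is also why job clusters and spot instances cut costs on both axes at once, which sets up several of the L2 questions below.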

L2 — Mid-Level Questions

  1. Explain Databricks Asset Bundles (DABs) — how do they enable CI/CD for Databricks projects?
  2. Compare DABs vs Terraform for Databricks infrastructure management — when to use each?
  3. How do you implement a CI/CD pipeline for Databricks using Azure DevOps?
  4. Explain how Databricks Workflows handles task dependencies, retries, and conditional execution.
  5. Compare job clusters vs all-purpose clusters — cost, startup time, use cases.
  6. How do you debug an OOM (Out of Memory) error in a Databricks Spark job?
  7. Explain how to read and interpret the Spark UI: stages, tasks, shuffle read/write, spill.
  8. How do you implement blue-green or canary deployments for Databricks ETL pipelines?
  9. Explain cluster pool strategy — how do pools reduce cluster startup time and cost?
  10. How do you manage secrets and environment-specific configurations across dev/staging/prod?
  11. Compare Databricks Repos (Git integration) vs external CI/CD tools for version control.
  12. How do you implement data pipeline monitoring and alerting in Databricks?
  13. Explain the cost implications of spot instances vs on-demand for different workload types.
  14. How do you diagnose data skew in a Spark job using the Spark UI?
  15. What is the recommended project structure for a Databricks DABs project?
  16. How do you implement parameterized jobs with dynamic values in Workflows?
  17. Explain the difference between task values (dbutils.jobs.taskValues) and widget parameters.
  18. How do you handle failing tasks in a DAG — retry policies, timeout, conditional logic?
  19. Compare Serverless jobs vs provisioned clusters for job execution — cost breakeven analysis.
  20. How do you implement logging and observability for production Databricks pipelines?
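
Question 18's retry policies are configured declaratively on Workflow tasks, but interviewers often ask you to sketch the equivalent logic, e.g. for code inside foreachBatch. A minimal exponential-backoff sketch (the helper name and delays are illustrative):

```python
import time

# Minimal retry with exponential backoff: retry a transient failure up to
# max_retries times, doubling the delay each attempt, then re-raise.

def run_with_retries(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky, sleep=lambda s: None))  # ok (after 2 retries)
```

The injectable `sleep` parameter is a testability choice: unit tests can pass a no-op instead of actually waiting.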

L3 — Scenario-Based Questions

  1. CI/CD Pipeline Design: Design an end-to-end CI/CD pipeline for a Databricks project. Include Git branching, testing, deployment to dev/staging/prod, and rollback strategy. Use DABs + Azure DevOps.
  2. Production Incident: Your nightly ETL job has been failing intermittently for 3 nights with SparkException: Job aborted due to stage failure. Walk through your debugging process step by step.
  3. Cost Reduction: Your team's Databricks spend increased 3x in 3 months. You need to reduce it by 40% without impacting SLAs. What do you analyze and what changes do you make?
  4. Cluster Strategy: You have 50 data engineers, 20 data scientists, and 10 BI analysts. Design the cluster strategy: interactive clusters, job clusters, pools, and policies.
  5. Job Orchestration: You have 30 ETL jobs. 10 run hourly, 15 run daily, 5 run weekly. Some have dependencies. Design the orchestration using Databricks Workflows.
  6. OOM Debugging: A Spark job processing a 5 TB dataset fails with OOM after running for 3 hours. You have 30 minutes to fix it before the business deadline. Walk through your approach.
  7. Migration from Airflow: Your team currently uses Apache Airflow for orchestration. Management wants to migrate to Databricks Workflows. Design the migration plan and address the gaps.
  8. Multi-Environment Deployment: Design a deployment strategy where the same code deploys to dev (small data, small clusters), staging (prod-like), and prod (full scale) using DABs.
  9. Data Pipeline SLA: Your Gold table must be refreshed by 6 AM every day. The pipeline takes 2-4 hours depending on data volume. Design the reliability strategy: monitoring, alerting, retry, fallback.
  10. Runaway Costs: A data scientist launched an interactive cluster with 100 nodes and forgot to terminate it. It ran for 72 hours. How do you prevent this from happening again?
  11. Spark Debugging: A join between two large tables is taking 6 hours instead of the expected 30 minutes. The Spark UI shows massive shuffle spill to disk. Diagnose and fix.
  12. Notebook to Production: A data scientist built a prototype in a notebook. You need to productionize it. Describe the steps: refactoring, testing, CI/CD, monitoring.
  13. DABs Project Setup: You are starting a new project with 3 DLT pipelines, 10 Workflows, and shared libraries. Design the DABs project structure, bundle configuration, and deployment targets.
  14. Rollback Strategy: A production deployment introduced a bug that corrupted the Silver layer. Design the rollback process: code rollback, data recovery, and communication plan.
  15. Monitoring Dashboard: Design a production monitoring dashboard for 50 Databricks pipelines. What metrics do you track? What alerting thresholds do you set? What tools do you use?
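
Scenarios 6 and 11 usually begin with the same Spark UI reading: compare the longest task in a stage against the median. A plain-Python version of that check (the ~2x rule of thumb is a heuristic, not a hard threshold):

```python
from statistics import median

# Toy skew check: max/median task duration for a stage. A ratio far above ~2x
# usually means one partition (often one hot join key) dominates the stage.

def skew_ratio(task_durations_s):
    return max(task_durations_s) / median(task_durations_s)

balanced = [30, 31, 29, 32, 30]
skewed   = [30, 31, 29, 32, 1800]   # one straggler doing most of the work

print(round(skew_ratio(balanced), 1))  # 1.1
print(round(skew_ratio(skewed), 1))    # 58.1
```

Once skew is confirmed, the fixes to discuss are adaptive query execution's skew-join handling, salting the hot key, or broadcasting the smaller side.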

BONUS: CROSS-CUTTING / FREQUENTLY ASKED "REAL INTERVIEW" QUESTIONS

These are questions that appeared repeatedly across Glassdoor reports, Medium articles, and interview forums for 2025-2026 Senior Data Engineer roles.

Top 20 Most Frequently Asked

  1. What is the Lakehouse architecture and how does it differ from a Data Lake and a Data Warehouse? (L1)
  2. Explain the Delta Lake transaction log and how it provides ACID guarantees. (L2)
  3. Design a Medallion Architecture for [specific domain]. Walk through Bronze, Silver, Gold. (L3)
  4. How does MERGE INTO work in Delta Lake? What are its performance pitfalls? (L2)
  5. Implement SCD Type 2 using MERGE INTO. (L3)
  6. What is Unity Catalog and how does it improve governance over Hive Metastore? (L2)
  7. How do you handle CDC from a legacy database (Oracle/SQL Server) into Delta Lake? (L3)
  8. Compare Auto Loader vs COPY INTO — when do you use each? (L2)
  9. How do you optimize a slow-running Spark job? Walk through your debugging steps. (L3)
  10. What is Z-ORDER and when would you use it? How does Liquid Clustering improve on it? (L2)
  11. Design a GDPR-compliant data deletion pipeline in a Lakehouse. (L3)
  12. How do you implement CI/CD for Databricks? (L2)
  13. What is Photon and when should you enable it? (L1)
  14. Explain the small file problem and how to solve it in Delta Lake. (L2)
  15. How do you handle schema evolution in a streaming pipeline? (L2)
  16. Your pipeline is failing intermittently in production. Walk through your debugging process. (L3)
  17. How do you manage costs in Databricks? What strategies have you used? (L2)
  18. What is Delta Live Tables and how does it compare to manual Structured Streaming? (L2)
  19. Design a real-time analytics pipeline on Databricks. (L3)
  20. How do you handle data quality in a Lakehouse architecture? (L2)
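
For question 14, it helps to describe what OPTIMIZE actually does to small files: bin-pack them into fewer files near a target size. A toy plain-Python version of that packing (real compaction targets are on the order of 1 GB; sizes here are in MB for readability):

```python
# Toy small-file compaction: greedily group small files into bins of roughly
# target_mb each, the way OPTIMIZE rewrites many tiny files into few large ones.

def compact(file_sizes_mb, target_mb=128):
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current_size + size > target_mb and current:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [2] * 100                    # 100 x 2 MB files
bins = compact(small_files, target_mb=128)
print(len(bins), [sum(b) for b in bins])   # 2 [128, 72]
```

The prevention half of the answer is keeping files large at write time: auto optimize / optimized writes, fewer tiny trigger intervals for streaming, and Predictive Optimization to schedule compaction automatically.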

Emerging 2025-2026 Topics (Newer Questions)

  1. What is Lakeflow Connect and how does it simplify ingestion? (L1)
  2. Explain Databricks Apps — what are they and when would you use them? (L1)
  3. What is Genie (natural language to SQL) and how does it fit into the Databricks ecosystem? (L1)
  4. How do you build and deploy an LLM-powered application using Databricks? (L3)
  5. What is the Databricks Marketplace and how do you publish/consume data products? (L2)
  6. How does Mosaic AI integrate with the Lakehouse for ML/AI workflows? (L2)
  7. What is UniForm in Delta Lake and why does it matter for interoperability? (L2)
  8. How do you use Databricks system tables for cost monitoring and optimization? (L3)
  9. Explain Serverless compute for jobs — how does it differ from provisioned clusters? (L2)
  10. What is Predictive Optimization and how does it automate OPTIMIZE/VACUUM/ANALYZE? (L2)

QUESTION COUNT SUMMARY

Topic                         L1 (Direct)   L2 (Mid-Level)   L3 (Scenario)   Total
Delta Lake                    20            20               15              55
ETL Pipelines                 20            20               15              55
Azure Platform & Governance   20            20               15              55
Production & CI/CD            20            20               15              55
Bonus (Cross-Cutting)         -             -                -               30
TOTAL                         80            80               60              250
