Databricks Interview Questions
All 250 questions
Q01 What is Delta Lake and why was it created?
Q02 What file format does Delta Lake use under the hood?
Q03 What is the `_delta_log` directory and what does it contain?
Q04 What are the four ACID properties and how does Delta Lake guarantee them?
Q05 What is a checkpoint file in the Delta transaction log?
Q06 What is schema enforcement in Delta Lake?
Q07 What is schema evolution and how do you enable it?
Q08 What is Time Travel in Delta Lake? How do you query an older version?
Q09 What is the VACUUM command and what does it do?
Q10 What is the default retention period for VACUUM?
Q11 What does the OPTIMIZE command do?
Q12 What is Z-ORDER and what problem does it solve?
Q13 What are Deletion Vectors in Delta Lake?
Q14 What is the difference between Delta Lake and Apache Parquet?
Q15 What is the DESCRIBE HISTORY command used for?
Q16 What is Change Data Feed (CDF) in Delta Lake?
Q17 What is Liquid Clustering in Delta Lake?
Q18 What is Predictive Optimization in Databricks?
Q19 What are table constraints in Delta Lake (CHECK, NOT NULL)?
Q20 What is the RESTORE command in Delta Lake?
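The core maintenance and recovery commands behind Q08 to Q20 fit in a few lines of Databricks SQL. A minimal sketch, assuming a hypothetical Delta table named `events`:

```sql
-- Inspect the commit history of the table (Q15)
DESCRIBE HISTORY events;

-- Time Travel: query an older snapshot by version or timestamp (Q08)
SELECT * FROM events VERSION AS OF 5;
SELECT * FROM events TIMESTAMP AS OF '2024-01-01';

-- Roll the table back to a previous version (Q20)
RESTORE TABLE events TO VERSION AS OF 5;

-- Compact small files and co-locate rows for data skipping (Q11, Q12)
OPTIMIZE events ZORDER BY (customer_id);

-- Remove files no longer referenced by any version inside the
-- retention window; the default is 7 days, i.e. 168 hours (Q09, Q10)
VACUUM events RETAIN 168 HOURS;
```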
Q21 Explain the anatomy of a Delta Lake transaction — what happens when you write to a Delta table?
Q22 How does optimistic concurrency control work in Delta Lake? What happens during write conflicts?
Q23 Compare Z-ORDER vs Liquid Clustering — when would you use each?
Q24 Explain data skipping in Delta Lake. How does it use min/max statistics?
Q25 What is the difference between OPTIMIZE and VACUUM? Can you run them together?
Q26 Explain how MERGE INTO works internally. What are the performance implications of a full table scan in MERGE?
Q27 How does the Delta transaction log handle concurrent writes from multiple clusters?
Q28 Compare schema enforcement vs schema evolution — give an example where each is appropriate.
Q29 What happens if you run VACUUM with a retention of 0 hours? What are the risks?
Q30 Explain the difference between `OPTIMIZE WHERE` and partition-level OPTIMIZE.
Q31 How does Delta Lake handle small file compaction? What is the "small file problem"?
Q32 Explain the difference between Copy-on-Write and Merge-on-Read in Delta Lake.
Q33 How do Deletion Vectors improve UPDATE/DELETE performance compared to the traditional approach?
Q34 What is the relationship between file statistics, data skipping, and Z-ORDER?
Q35 Explain how Time Travel works internally — what is stored in each JSON commit file?
Q36 Compare Change Data Feed (CDF) vs reading the transaction log directly for CDC.
Q37 What is the difference between managed and external Delta tables?
Q38 How does Liquid Clustering handle incremental clustering vs Z-ORDER which requires full rewrite?
Q39 What are the trade-offs of over-partitioning a Delta table?
Q40 Explain Delta Lake UniForm (Universal Format). Why does it matter?
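Change Data Feed (Q36) and Liquid Clustering (Q38) each come down to a table property or clause. A hedged sketch; the table and column names are illustrative, not from the source:

```sql
-- Enable Change Data Feed on an existing table
ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Read row-level changes committed since version 2
SELECT * FROM table_changes('events', 2);

-- Liquid Clustering: declared at creation time, no partition columns needed
CREATE TABLE events_clustered CLUSTER BY (country, event_date)
AS SELECT * FROM events;

-- Re-clustering is incremental via OPTIMIZE (no full rewrite, unlike Z-ORDER)
OPTIMIZE events_clustered;
```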
Q41 **MERGE Optimization**: Your MERGE INTO statement takes 45 minutes on a 2TB Delta table. Walk me through how you would diagnose and optimize this.
Q42 **Transaction Log Corruption**: A developer accidentally ran `VACUUM` with 0-hour retention and now Time Travel queries fail. What happened and how do you recover?
Q43 **Small File Problem**: Your Bronze table has 50,000 small Parquet files (avg 2MB each). How do you fix this and prevent it from recurring?
Q44 **Concurrent Writes**: Two Databricks jobs write to the same Delta table simultaneously and one fails with a `ConcurrentAppendException`. Explain why and how you fix it.
Q45 **Z-ORDER Strategy**: You have a 10TB Delta table queried by `country`, `date`, and `customer_id`. Design the partitioning and Z-ORDER strategy.
Q46 **Liquid Clustering Migration**: Your team wants to migrate from Z-ORDER to Liquid Clustering on a production table. What is your migration plan? Any risks?
Q47 **Schema Evolution Crisis**: A source system added 5 new columns overnight and your streaming pipeline failed. How do you design for schema evolution in Auto Loader + Delta Lake?
Q48 **Time Travel for Audit**: Your compliance team needs to prove what data looked like on a specific date 30 days ago. How do you implement this with Delta Lake Time Travel? What are the limitations?
Q49 **VACUUM vs Storage Costs**: Your Delta table consumes 5x the expected storage due to retained old versions. Design a VACUUM strategy that balances cost vs Time Travel needs.
Q50 **CDC with Delta CDF**: Design a pipeline where downstream consumers only process changed records from a Silver Delta table. How do you use Change Data Feed?
Q51 **Table Restore Scenario**: A bad ETL job corrupted your Gold table at 3 AM. It is now 9 AM and 6 versions have been written since. Walk through the recovery process.
Q52 **Partition Evolution**: Your table was partitioned by `year/month/day` but queries now filter primarily by `region`. How do you restructure without downtime?
Q53 **MERGE with SCD Type 2**: Implement a MERGE strategy for SCD Type 2 on a customer dimension table where you need to close old records and insert new ones atomically.
Q54 **Delta Sharing**: An external partner needs read access to a subset of your Delta table. How do you implement this securely using Delta Sharing?
Q55 **Deletion Vectors in Production**: After enabling Deletion Vectors, read performance on certain queries degraded. Explain why and how you would resolve this.
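Several of the scenarios above, Q53 in particular, hinge on one MERGE pattern for SCD Type 2: stage each changed key twice, once with the real merge key to close the current row and once with a NULL key so the insert branch fires. A sketch with hypothetical table and column names:

```sql
-- staged_updates contains each changed customer twice:
--   merge_key = customer_id  -> matches, closes the current row
--   merge_key = NULL         -> never matches, inserts the new row
-- Both rows commit in one atomic transaction.
MERGE INTO dim_customer AS t
USING staged_updates AS s
  ON t.customer_id = s.merge_key AND t.is_current = true
WHEN MATCHED AND t.email <> s.email THEN
  UPDATE SET is_current = false, end_date = s.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, is_current, start_date, end_date)
  VALUES (s.customer_id, s.email, true, s.effective_date, NULL);
```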
Q56 What is the Medallion Architecture (Bronze/Silver/Gold)?
Q57 What is Auto Loader in Databricks?
Q58 What is the difference between Auto Loader and COPY INTO?
Q59 What is Delta Live Tables (DLT)?
Q60 What is Lakeflow and how does it relate to DLT?
Q61 What is SCD Type 1 vs SCD Type 2?
Q62 What is Change Data Capture (CDC)?
Q63 What is a streaming table vs a materialized view in DLT?
Q64 What are DLT expectations (data quality constraints)?
Q65 What is the difference between `cloudFiles` and `spark.readStream` on Delta?
Q66 What is structured streaming in Databricks?
Q67 What is a checkpoint in Spark Structured Streaming?
Q68 What is the trigger mode `availableNow` vs `processingTime`?
Q69 What is idempotency and why is it important in ETL pipelines?
Q70 What is the difference between batch and streaming ETL?
Q71 What is an ETL pipeline vs an ELT pipeline?
Q72 What are the three DLT expectation actions: `warn`, `drop`, `fail`?
Q73 What is the `foreachBatch` sink in Structured Streaming?
Q74 What is event-time processing vs processing-time in streaming?
Q75 What is watermarking in Spark Structured Streaming?
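DLT expectations (Q64, Q72) are declared inline on the table definition. A hedged DLT SQL sketch; `bronze_events`, `silver_events`, and the constraint names are assumptions:

```sql
-- A DLT streaming table with data quality expectations.
-- No ON VIOLATION clause  -> warn: violations are recorded, rows kept.
-- ON VIOLATION DROP ROW   -> offending rows are dropped.
-- ON VIOLATION FAIL UPDATE-> the pipeline update fails.
CREATE OR REFRESH STREAMING TABLE silver_events (
  CONSTRAINT valid_id EXPECT (event_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_ts EXPECT (event_ts IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(bronze_events);
```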
Q76 Explain how Auto Loader's file notification mode works vs directory listing mode. When do you use each?
Q77 How do you handle schema evolution with Auto Loader (`cloudFiles.schemaEvolutionMode`)?
Q78 Compare Delta Live Tables (DLT) vs hand-coded Structured Streaming pipelines — trade-offs?
Q79 Explain how to implement SCD Type 2 using MERGE INTO with Delta Lake. What are the key columns?
Q80 How does DLT handle pipeline failures and retries? What is the concept of "idempotent recomputation"?
Q81 Compare `trigger(availableNow=True)` vs `trigger(processingTime='5 minutes')` — when to use each?
Q82 Explain the role of Bronze, Silver, and Gold layers in terms of data quality, latency, and consumers.
Q83 How does watermarking work in Structured Streaming? What happens to late-arriving data?
Q84 Compare CDC patterns: log-based CDC (Debezium/Kafka) vs query-based CDC vs timestamp-based CDC.
Q85 How do you handle exactly-once semantics in a Databricks streaming pipeline?
Q86 Explain `foreachBatch` — when would you use it over a standard Delta sink?
Q87 How do DLT expectations compare to Great Expectations or other data quality frameworks?
Q88 What are the different ways to orchestrate dependent DLT pipelines?
Q89 Explain how Auto Loader handles file deduplication. What is the `RocksDB` state store?
Q90 How do you test ETL pipelines in Databricks? What frameworks do you use?
Q91 Explain the difference between a complete output mode, append mode, and update mode in streaming.
Q92 How do you monitor and alert on streaming pipeline lag in Databricks?
Q93 What is the `APPLY CHANGES INTO` syntax in DLT and when do you use it?
Q94 How do you handle out-of-order events in a Medallion Architecture?
Q95 Explain incremental data loading patterns: append-only vs upsert vs full refresh.
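The `APPLY CHANGES INTO` syntax from Q93 looks roughly like this for an SCD Type 2 target; all table and column names here are illustrative:

```sql
-- DLT CDC ingestion: apply a change feed to a target as SCD Type 2.
-- KEYS identifies the row, SEQUENCE BY orders out-of-order events,
-- STORED AS SCD TYPE 2 keeps history instead of overwriting.
APPLY CHANGES INTO target_customers
FROM STREAM(cdc_customers)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY sequence_ts
STORED AS SCD TYPE 2;
```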
Q96 **Oracle CDC Pipeline**: Design a CDC pipeline from Oracle to Delta Lake for Amadeus booking data. Oracle does not support log-based CDC natively. What approach do you take?
Q97 **Late-Arriving Data**: Flight booking amendments arrive 48 hours after the original booking. Design a pipeline that correctly handles these late-arriving events in the Medallion Architecture.
Q98 **SCD Type 2 at Scale**: You need to maintain SCD Type 2 on a customer dimension table with 500M rows, receiving 2M updates daily. Design the MERGE strategy for performance.
Q99 **DLT Pipeline Failure**: Your DLT pipeline fails at the Silver layer due to a data quality expectation violation at 2 AM. 50,000 records were dropped. How do you investigate, recover, and prevent recurrence?
Q100 **Auto Loader Schema Drift**: A source system renames columns from `camelCase` to `snake_case` overnight. Your Auto Loader pipeline breaks. Design a resilient schema evolution strategy.
Q101 **Multi-Source Medallion**: You have 20 source systems feeding Bronze. Some are batch (daily files), some are streaming (Kafka). Design the Medallion Architecture to unify them.
Q102 **Streaming Backpressure**: Your streaming pipeline is processing 100K events/sec but the source is producing 500K events/sec. The lag keeps growing. How do you diagnose and fix this?
Q103 **GDPR Delete Pipeline**: A GDPR deletion request arrives for a passenger. You need to delete their data across Bronze, Silver, and Gold layers in a Lakehouse. Design the process.
Q104 **Deduplication Strategy**: Your Kafka source sends duplicate booking events. Design a deduplication strategy at the Bronze and Silver layers that guarantees exactly-once processing.
Q105 **Testing & Validation**: How would you set up automated testing for a Databricks DLT pipeline? Include unit tests, integration tests, and data quality assertions.
Q106 **Hybrid Batch-Streaming**: You need near-real-time dashboards (5-minute latency) but also end-of-day reconciliation reports. Design a single pipeline architecture.
Q107 **Multi-Hop Streaming**: Design a streaming pipeline with three hops (Bronze->Silver->Gold) where each layer applies different transformations. How do you manage checkpoints and failure recovery?
Q108 **Slowly Changing Dimension with Deletes**: Your source system sends hard deletes (records simply disappear). How do you detect and handle these in an SCD Type 2 pipeline?
Q109 **Cost-Efficient Ingestion**: You ingest 10TB/day of raw JSON from ADLS Gen2. Design an ingestion pipeline that minimizes compute cost while maintaining <15 min latency.
Q110 **Pipeline Dependency Management**: You have 50 DLT pipelines with complex dependencies. Some must run sequentially, others can be parallel. How do you orchestrate this?
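A common answer to the deduplication scenario (Q104) at the Silver layer is a window function plus `QUALIFY`, keeping the latest copy of each key; the table and column names below are hypothetical:

```sql
-- Keep only the most recently ingested record per booking_id.
-- QUALIFY filters on the window-function result, so no subquery is needed.
SELECT *
FROM bronze_bookings
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY booking_id
  ORDER BY ingest_ts DESC
) = 1;
```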
Q111 What is Unity Catalog in Databricks?
Q112 What is the three-level namespace in Unity Catalog (catalog.schema.table)?
Q113 What is a metastore in Unity Catalog?
Q114 What is the difference between managed and external tables in Unity Catalog?
Q115 What is a storage credential in Unity Catalog?
Q116 What is an external location in Unity Catalog?
Q117 What is data lineage in Unity Catalog?
Q118 What is Photon engine in Databricks?
Q119 What is Serverless compute in Databricks?
Q120 What is ADLS Gen2 and how does Databricks connect to it?
Q121 What is the difference between a Databricks workspace and a metastore?
Q122 What is row-level security in Unity Catalog?
Q123 What is column masking in Unity Catalog?
Q124 What is a service principal in Databricks on Azure?
Q125 What is the difference between instance profiles and storage credentials?
Q126 What are tags and labels in Unity Catalog for data classification?
Q127 What is the system tables feature in Unity Catalog?
Q128 What is Databricks SQL (DBSQL)?
Q129 What is a SQL Warehouse (Serverless vs Pro vs Classic)?
Q130 What is GDPR and what does it mean for data engineering?
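The three-level namespace and cascading permissions (Q112) can be shown in a few statements; the catalog, schema, and group names are made up for illustration:

```sql
-- Three-level namespace: catalog.schema.table
CREATE CATALOG IF NOT EXISTS sales;
CREATE SCHEMA IF NOT EXISTS sales.core;
CREATE TABLE IF NOT EXISTS sales.core.orders (id BIGINT, amount DOUBLE);

-- A principal needs privileges at every level of the hierarchy:
GRANT USE CATALOG ON CATALOG sales TO `analysts`;
GRANT USE SCHEMA ON SCHEMA sales.core TO `analysts`;
GRANT SELECT ON TABLE sales.core.orders TO `analysts`;
```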
Q131 Explain the Unity Catalog hierarchy: metastore -> catalog -> schema -> table/view/function. How do permissions cascade?
Q132 Compare Unity Catalog vs legacy Hive Metastore — what are the key differences and migration challenges?
Q133 How does Photon engine accelerate queries? What workloads benefit most from Photon?
Q134 Compare Serverless SQL Warehouses vs Classic SQL Warehouses — cost, startup time, scaling.
Q135 Explain how ADLS Gen2 integrates with Databricks — authentication methods (OAuth, service principals, access keys).
Q136 How do you implement GDPR "Right to be Forgotten" in a Lakehouse architecture?
Q137 Explain dynamic views in Unity Catalog for row-level and column-level security.
Q138 How does Unity Catalog handle cross-workspace data sharing?
Q139 Compare Azure Databricks vs Azure Synapse Analytics — when would you recommend each?
Q140 Explain how audit logging works in Unity Catalog. What events are captured?
Q141 How do you implement data classification (PII tagging) using Unity Catalog?
Q142 What are the networking options for Databricks on Azure (VNet injection, Private Link, NSGs)?
Q143 Explain the difference between account-level and workspace-level identity in Databricks.
Q144 How does Unity Catalog system tables help with cost monitoring and query auditing?
Q145 Compare managed identity vs service principal vs access key for ADLS Gen2 access — pros/cons.
Q146 Explain how Databricks handles encryption at rest and in transit on Azure.
Q147 How do you design a multi-region Databricks deployment on Azure?
Q148 What is the role of Azure Key Vault in Databricks? How do you manage secrets?
Q149 Explain the difference between GRANT, DENY, and REVOKE in Unity Catalog's permission model.
Q150 How does Unity Catalog's data lineage differ from tools like Apache Atlas or Purview?
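Dynamic views (Q137) combine column masking and row filtering in ordinary SQL; the group, table, and column names here are assumptions:

```sql
-- Column masking: non-members of pii_readers see a masked email.
-- Row filtering: non-admins only see the EMEA region.
CREATE OR REPLACE VIEW sales.core.orders_secure AS
SELECT
  id,
  CASE WHEN is_account_group_member('pii_readers')
       THEN customer_email
       ELSE '***MASKED***'
  END AS customer_email,
  region,
  amount
FROM sales.core.orders
WHERE is_account_group_member('admins') OR region = 'EMEA';
```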
Q151 **Unity Catalog Migration**: Your organization has 500 tables in Hive Metastore across 3 workspaces. Design a migration plan to Unity Catalog with zero downtime.
Q152 **GDPR Compliance Pipeline**: Amadeus handles passenger PII (names, passport numbers, emails) across 100 countries. Design a GDPR-compliant data architecture using Unity Catalog, column masking, and deletion pipelines.
Q153 **Multi-Team Governance**: You have Data Engineering, Data Science, and BI teams sharing a single Databricks deployment. Design the Unity Catalog structure (catalogs, schemas, permissions) for proper isolation and collaboration.
Q154 **Cost Optimization**: Your Azure Databricks bill is $150K/month. 60% is compute. Design a cost reduction strategy using Serverless, autoscaling, spot instances, and cluster policies.
Q155 **Secure External Sharing**: A partner airline needs read access to specific Gold tables but must NOT see PII columns. Design this using Unity Catalog, Delta Sharing, and dynamic views.
Q156 **Photon Decision**: Your team is deciding whether to enable Photon on all clusters. Some workloads are Python UDF-heavy, others are SQL-heavy. How do you evaluate and decide?
Q157 **Network Security**: Your security team requires all Databricks traffic to stay within the Azure virtual network and never traverse the public internet. Design the network architecture.
Q158 **Disaster Recovery**: Design a DR strategy for Databricks on Azure. RTO = 4 hours, RPO = 1 hour. Consider metastore, Delta tables, notebooks, and cluster configurations.
Q159 **Audit & Compliance**: The compliance team needs a report showing who accessed PII data in the last 90 days, what queries they ran, and what data they exported. Design this using system tables.
Q160 **ADLS Gen2 Organization**: You have 50 data products across 5 business domains. Design the ADLS Gen2 storage layout (containers, folders) and the corresponding Unity Catalog structure.
Q161 **Data Mesh on Databricks**: Leadership wants to adopt a Data Mesh approach. How would you structure Unity Catalog catalogs, schemas, and ownership to enable domain-oriented data products?
Q162 **Cross-Cloud Access**: A team in AWS needs to read data from your Azure Databricks Lakehouse. Design the architecture using Delta Sharing.
Q163 **PII Detection & Tagging**: You inherit 2,000 tables with no documentation. Design an automated PII detection and tagging pipeline using Unity Catalog tags and Databricks notebooks.
Q164 **Serverless Migration**: Your team runs 200 interactive clusters daily. The CFO wants to move to Serverless. What is your evaluation and migration plan?
Q165 **Regulatory Audit**: A regulator asks you to prove data lineage from source (Oracle) to final report (Power BI). How do you use Unity Catalog lineage to demonstrate this end-to-end?
Q166 What is Databricks Workflows (formerly Jobs)?
Q167 What is the difference between a Task and a Job in Databricks Workflows?
Q168 What are Databricks Asset Bundles (DABs)?
Q169 What is the Databricks CLI?
Q170 What is a job cluster vs an all-purpose (interactive) cluster?
Q171 What is cluster autoscaling and how does it work?
Q172 What is a cluster policy in Databricks?
Q173 What are spot instances and how do they reduce cost?
Q174 What is the Databricks REST API used for?
Q175 What is a Databricks repo (Git integration)?
Q176 What are Databricks Notebooks vs IDE-based development?
Q177 What is the difference between a wheel file and a notebook task in a Workflow?
Q178 What is the Ganglia UI / Spark UI used for in debugging?
Q179 What is a driver log vs an executor log?
Q180 What is the Databricks DBU (Databricks Unit) and how is pricing calculated?
Q181 What are init scripts and when would you use them?
Q182 What is a multi-task workflow (DAG) in Databricks?
Q183 What is the `dbutils` library and what are its key modules?
Q184 What are widgets in Databricks notebooks?
Q185 What is the difference between `%run` and `dbutils.notebook.run()`?
Q186 Explain Databricks Asset Bundles (DABs) — how do they enable CI/CD for Databricks projects?
Q187 Compare DABs vs Terraform for Databricks infrastructure management — when to use each?
Q188 How do you implement a CI/CD pipeline for Databricks using Azure DevOps?
Q189 Explain how Databricks Workflows handles task dependencies, retries, and conditional execution.
Q190 Compare job clusters vs all-purpose clusters — cost, startup time, use cases.
Q191 How do you debug an OOM (Out of Memory) error in a Databricks Spark job?
Q192 Explain how to read and interpret the Spark UI: stages, tasks, shuffle read/write, spill.
Q193 How do you implement blue-green or canary deployments for Databricks ETL pipelines?
Q194 Explain cluster pool strategy — how do pools reduce cluster startup time and cost?
Q195 How do you manage secrets and environment-specific configurations across dev/staging/prod?
Q196 Compare Databricks Repos (Git integration) vs external CI/CD tools for version control.
Q197 How do you implement data pipeline monitoring and alerting in Databricks?
Q198 Explain the cost implications of spot instances vs on-demand for different workload types.
Q199 How do you diagnose data skew in a Spark job using the Spark UI?
Q200 What is the recommended project structure for a Databricks DABs project?
Q201 How do you implement parameterized jobs with dynamic values in Workflows?
Q202 Explain the difference between task values (`dbutils.jobs.taskValues`) and widget parameters.
Q203 How do you handle failing tasks in a DAG — retry policies, timeout, conditional logic?
Q204 Compare Serverless jobs vs provisioned clusters for job execution — cost breakeven analysis.
Q205 How do you implement logging and observability for production Databricks pipelines?
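Several of the cost and audit questions in this section (Q144, Q205) lead to Unity Catalog system tables. A sketch, assuming system tables are enabled in the workspace; the 30-day window is an arbitrary choice:

```sql
-- Daily DBU consumption by SKU from the billing system table
SELECT usage_date,
       sku_name,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, sku_name
ORDER BY usage_date DESC, dbus DESC;
```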
206 Q206 **CI/CD Pipeline Design**: Design an end-to-end CI/CD pipeline for a Databricks project. Include Git branching, testing, deployment to dev/staging/prod, and rollback strategy. Use DABs + Azure DevOps. ▼
See the study guide for the detailed answer →
207 Q207 **Production Incident**: Your nightly ETL job has been failing intermittently for 3 nights with `SparkException: Job aborted due to stage failure`. Walk through your debugging process step by step. ▼
See the study guide for the detailed answer →
208 Q208 **Cost Reduction**: Your team's Databricks spend increased 3x in 3 months. You need to reduce it by 40% without impacting SLAs. What do you analyze and what changes do you make? ▼
See the study guide for the detailed answer →
209 Q209 **Cluster Strategy**: You have 50 data engineers, 20 data scientists, and 10 BI analysts. Design the cluster strategy: interactive clusters, job clusters, pools, and policies. ▼
See the study guide for the detailed answer →
210 Q210 **Job Orchestration**: You have 30 ETL jobs. 10 run hourly, 15 run daily, 5 run weekly. Some have dependencies. Design the orchestration using Databricks Workflows. ▼
See the study guide for the detailed answer →
211 Q211 **OOM Debugging**: A Spark job processing a 5TB dataset fails with OOM after running for 3 hours. You have 30 minutes to fix it before the business deadline. Walk through your approach. ▼
See the study guide for the detailed answer →
212 Q212 **Migration from Airflow**: Your team currently uses Apache Airflow for orchestration. Management wants to migrate to Databricks Workflows. Design the migration plan and address the gaps. ▼
See the study guide for the detailed answer →
213 Q213 **Multi-Environment Deployment**: Design a deployment strategy where the same code deploys to dev (small data, small clusters), staging (prod-like), and prod (full scale) using DABs. ▼
See the study guide for the detailed answer →
214 Q214 **Data Pipeline SLA**: Your Gold table must be refreshed by 6 AM every day. The pipeline takes 2-4 hours depending on data volume. Design the reliability strategy: monitoring, alerting, retry, fallback. ▼
See the study guide for the detailed answer →
215 Q215 **Runaway Costs**: A data scientist launched an interactive cluster with 100 nodes and forgot to terminate it. It ran for 72 hours. How do you prevent this from happening again? ▼
See the study guide for the detailed answer →
216 Q216 **Spark Debugging**: A join between two large tables is taking 6 hours instead of the expected 30 minutes. The Spark UI shows massive shuffle spill to disk. Diagnose and fix. ▼
See the study guide for the detailed answer →
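One fix the answer should reach quickly: if the smaller side fits in executor memory, a broadcast hint replaces the shuffle (sort-merge) join with a broadcast hash join and eliminates the spill entirely. A sketch with illustrative table names:

```sql
-- Broadcast the small dimension table so the large fact table is never shuffled
SELECT /*+ BROADCAST(d) */ f.*, d.region
FROM fact_sales f
JOIN dim_store d
  ON f.store_id = d.store_id;
```

If both sides are genuinely large, the usual levers are instead raising shuffle partition counts, checking for skewed join keys, and relying on AQE's skew-join handling.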
217 Q217 **Notebook to Production**: A data scientist built a prototype in a notebook. You need to productionize it. Describe the steps: refactoring, testing, CI/CD, monitoring. ▼
See the study guide for the detailed answer →
218 Q218 **DABs Project Setup**: You are starting a new project with 3 DLT pipelines, 10 Workflows, and shared libraries. Design the DABs project structure, bundle configuration, and deployment targets. ▼
See the study guide for the detailed answer →
219 Q219 **Rollback Strategy**: A production deployment introduced a bug that corrupted the Silver layer. Design the rollback process: code rollback, data recovery, and communication plan. ▼
See the study guide for the detailed answer →
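For the data-recovery half of this answer, Delta's RESTORE gives a one-statement rollback to the last good table version. A sketch (table name and version number are illustrative):

```sql
DESCRIBE HISTORY silver.orders;  -- identify the last version before the bad deploy

-- Roll the table back by version, or alternatively by timestamp
RESTORE TABLE silver.orders TO VERSION AS OF 112;
-- RESTORE TABLE silver.orders TO TIMESTAMP AS OF '2024-06-01T05:00:00';
```

Note RESTORE only works while the old files still exist, which is one reason VACUUM retention policy belongs in the rollback design.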
220 Q220 **Monitoring Dashboard**: Design a production monitoring dashboard for 50 Databricks pipelines. What metrics do you track? What alerting thresholds do you set? What tools do you use? ▼
See the study guide for the detailed answer →
221 Q221 **What is the Lakehouse architecture and how does it differ from a Data Lake and a Data Warehouse?** (L1) ▼
See the study guide for the detailed answer →
222 Q222 **Explain the Delta Lake transaction log and how it provides ACID guarantees.** (L2) ▼
See the study guide for the detailed answer →
223 Q223 **Design a Medallion Architecture for [specific domain]. Walk through Bronze, Silver, Gold.** (L3) ▼
See the study guide for the detailed answer →
224 Q224 **How does MERGE INTO work in Delta Lake? What are its performance pitfalls?** (L2) ▼
See the study guide for the detailed answer →
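A minimal upsert to anchor the answer (table and column names are illustrative). The classic performance pitfall: an `ON` clause that cannot prune files forces MERGE to scan and rewrite far more data than the update actually touches.

```sql
MERGE INTO silver.customers AS t
USING staging.customer_updates AS s
  ON t.customer_id = s.customer_id  -- align with clustering/partition keys for file pruning
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```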
225 Q225 **Implement SCD Type 2 using MERGE INTO.** (L3) ▼
See the study guide for the detailed answer →
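The widely used pattern stages each change twice: once to close the current row, once to insert the new version. A hedged sketch assuming a `dim_customer` table with `is_current`/`start_date`/`end_date` columns and an `updates` source (all names illustrative):

```sql
MERGE INTO dim_customer AS t
USING (
  -- Changed rows with their real key: matched, so they hit the UPDATE branch
  SELECT u.customer_id AS merge_key, u.* FROM updates u
  UNION ALL
  -- The same rows again with a NULL key: unmatched, so they hit the INSERT branch
  SELECT NULL AS merge_key, u.*
  FROM updates u
  JOIN dim_customer c
    ON u.customer_id = c.customer_id
  WHERE c.is_current = true AND u.address <> c.address
) AS s
ON t.customer_id = s.merge_key AND t.is_current = true
WHEN MATCHED AND t.address <> s.address THEN
  UPDATE SET t.is_current = false, t.end_date = s.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, is_current, start_date, end_date)
  VALUES (s.customer_id, s.address, true, s.effective_date, NULL);
```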
226 Q226 **What is Unity Catalog and how does it improve governance over Hive Metastore?** (L2) ▼
See the study guide for the detailed answer →
227 Q227 **How do you handle CDC from a legacy database (Oracle/SQL Server) into Delta Lake?** (L3) ▼
See the study guide for the detailed answer →
228 Q228 **Compare Auto Loader vs COPY INTO — when do you use each?** (L2) ▼
See the study guide for the detailed answer →
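It is worth being able to write the COPY INTO side from memory: it is idempotent (already-loaded files are skipped on re-run), which makes it the simple choice for up to thousands of files, while Auto Loader scales to millions via incremental listing or file notifications. A sketch with illustrative paths:

```sql
COPY INTO bronze.events
FROM 's3://my-bucket/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```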
229 Q229 **How do you optimize a slow-running Spark job? Walk through your debugging steps.** (L3) ▼
See the study guide for the detailed answer →
230 Q230 **What is Z-ORDER and when would you use it? How does Liquid Clustering improve on it?** (L2) ▼
See the study guide for the detailed answer →
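Useful to contrast the two in code: Z-ORDER is a maintenance command you must re-run as data arrives, while Liquid Clustering is a table property whose keys can be changed without rewriting existing data. Illustrative names:

```sql
-- Legacy approach: compact files and co-locate data on the queried columns
OPTIMIZE events ZORDER BY (user_id, event_date);

-- Liquid Clustering: declare keys once; maintenance is incremental
CREATE TABLE events_lc (user_id BIGINT, event_date DATE, payload STRING)
CLUSTER BY (user_id, event_date);

ALTER TABLE events_lc CLUSTER BY (user_id);  -- change keys without a table rewrite
```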
231 Q231 **Design a GDPR-compliant data deletion pipeline in a Lakehouse.** (L3) ▼
See the study guide for the detailed answer →
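The core of this design is that DELETE alone is not erasure: old files survive for Time Travel until vacuumed, and rows soft-deleted via deletion vectors persist until purged. A hedged sketch of the final deletion step (names and the `:requested_id` parameter marker are illustrative):

```sql
DELETE FROM silver.users WHERE user_id = :requested_id;

-- Rewrite files that still hold soft-deleted rows behind deletion vectors
REORG TABLE silver.users APPLY (PURGE);

-- Physically remove superseded file versions once the retention window allows
VACUUM silver.users RETAIN 168 HOURS;
```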
232 Q232 **How do you implement CI/CD for Databricks?** (L2) ▼
See the study guide for the detailed answer →
233 Q233 **What is Photon and when should you enable it?** (L1) ▼
See the study guide for the detailed answer →
234 Q234 **Explain the small file problem and how to solve it in Delta Lake.** (L2) ▼
See the study guide for the detailed answer →
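Two levers an answer should name: a one-off OPTIMIZE to compact what already exists, and write-time table properties so the problem stops recurring. Illustrative table name:

```sql
OPTIMIZE bronze.events;  -- bin-pack small files into larger ones (~1 GB target by default)

ALTER TABLE bronze.events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',  -- coalesce data before writing
  'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writes
);
```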
235 Q235 **How do you handle schema evolution in a streaming pipeline?** (L2) ▼
See the study guide for the detailed answer →
236 Q236 **Your pipeline is failing intermittently in production. Walk through your debugging process.** (L3) ▼
See the study guide for the detailed answer →
237 Q237 **How do you manage costs in Databricks? What strategies have you used?** (L2) ▼
See the study guide for the detailed answer →
238 Q238 **What is Delta Live Tables and how does it compare to manual Structured Streaming?** (L2) ▼
See the study guide for the detailed answer →
239 Q239 **Design a real-time analytics pipeline on Databricks.** (L3) ▼
See the study guide for the detailed answer →
240 Q240 **How do you handle data quality in a Lakehouse architecture?** (L2) ▼
See the study guide for the detailed answer →
241 Q241 What is Lakeflow Connect and how does it simplify ingestion? (L1) ▼
See the study guide for the detailed answer →
242 Q242 Explain Databricks Apps — what are they and when would you use them? (L1) ▼
See the study guide for the detailed answer →
243 Q243 What is Genie (natural language to SQL) and how does it fit into the Databricks ecosystem? (L1) ▼
See the study guide for the detailed answer →
244 Q244 How do you build and deploy an LLM-powered application using Databricks? (L3) ▼
See the study guide for the detailed answer →
245 Q245 What is the Databricks Marketplace and how do you publish/consume data products? (L2) ▼
See the study guide for the detailed answer →
246 Q246 How does Mosaic AI integrate with the Lakehouse for ML/AI workflows? (L2) ▼
See the study guide for the detailed answer →
247 Q247 What is UniForm in Delta Lake and why does it matter for interoperability? (L2) ▼
See the study guide for the detailed answer →
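A short sketch of enabling UniForm so Iceberg clients can read the Delta table. The property names below follow current Delta/Databricks documentation; verify them against your runtime version:

```sql
CREATE TABLE sales_uniform (id BIGINT, amount DOUBLE)
TBLPROPERTIES (
  'delta.enableIcebergCompatV2'          = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

UniForm writes Iceberg metadata alongside the Delta log, so the same Parquet data files serve both ecosystems without copying.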
248 Q248 How do you use Databricks system tables for cost monitoring and optimization? (L3) ▼
See the study guide for the detailed answer →
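A typical starting query against the billing system table, assuming the documented `system.billing.usage` schema (`usage_date`, `sku_name`, `usage_quantity`):

```sql
-- DBU consumption per SKU over the last 30 days
SELECT usage_date,
       sku_name,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, sku_name
ORDER BY dbus DESC;
```

Joining on `usage_metadata` fields (job ID, cluster ID) is the usual next step to attribute spend to specific pipelines or teams.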
249 Q249 Explain Serverless compute for jobs — how does it differ from provisioned clusters? (L2) ▼
See the study guide for the detailed answer →
250 Q250 What is Predictive Optimization and how does it automate OPTIMIZE/VACUUM/ANALYZE? (L2) ▼
See the study guide for the detailed answer →