Databricks · Section 16 of 17

Day 4: Production, CI/CD & Cost — Quick Recall Guide

🗺️ Memory Map
How to use this file:
  • ⚡ = Must remember (95% chance of being asked)
  • 🔑 = Key concept (core understanding needed)
  • ⚠️ = Common trap (interviewers love to test this)
  • 🧠 = Memory Map (mnemonic/acronym — memorize this!)
  • 📝 = One-liner (flash-card style — cover answer, test yourself)
Reading strategy: Read Memory Maps FIRST → then Direct Questions → then Mid-Level.

🧠 MASTER MEMORY MAP — Day 4

🧠 PRODUCTION DATABRICKS = "WCCDS"
W = Workflows (job scheduling + orchestration)
C = CI/CD (Declarative Automation Bundles + Azure DevOps)
C = Cost Management (5 areas to optimize)
D = Debugging (Spark UI + execution plans)
S = System Design (end-to-end platform architecture)

COST MANAGEMENT = "CCSQM" (think: Cost Control Saves Quite Much)
C = Compute (right cluster type, auto-terminate)
C = Cluster Policies (enforce limits on team spending)
S = Storage (OPTIMIZE + VACUUM + lifecycle rules)
Q = Query Optimization (Photon, caching, predicate pushdown)
M = Monitoring + Chargeback (tag everything, show cost per team)

CI/CD = "DAB → Git → Azure DevOps → Deploy"
DAB = Declarative Automation Bundles (YAML config files in Git)
Was called "Asset Bundles" — renamed 2025

SECTION 1: DATABRICKS WORKFLOWS

🧠 Memory Map: Workflows

🧠 WORKFLOW = "Scheduled pipeline = series of tasks with dependencies"
Example daily pipeline:
Task 1: Bronze ingestion (6:00 AM)
↓ (depends on Task 1)
Task 2: Silver transformation (after Task 1)
↓ (depends on Task 2)
Task 3: Gold aggregation (after Task 2)
↓ (depends on Task 3)
Task 4: Data quality check (after Task 3)
TASK TYPES: Notebook, Python script, SQL, Lakeflow pipeline, dbt, JAR
KEY FEATURES:
Repair Run = re-run ONLY failed tasks (don't restart everything!)
Task Values = pass data between tasks (dbutils.jobs.taskValues)
Table-Triggered = start job when a Delta table has new data
Backfill = reprocess historical data (GA 2025)
WORKFLOWS vs AIRFLOW:
Workflows = Databricks only, free, simple
Airflow = multi-platform, requires infra, more flexible
ADF = Azure native, good for cross-platform orchestration

⚡ MUST KNOW DIRECT QUESTIONS

Q1: What is Databricks Workflows?

Built-in job scheduler in Databricks. Define tasks, set dependencies, schedule runs. Free (included in Databricks).

Q2: What task types are supported?

Notebook, Python script, SQL query, Lakeflow pipeline, dbt task, JAR, Spark submit.
Q3: What is a Repair Run?

Re-runs ONLY failed tasks in a workflow — doesn't restart the whole pipeline. Saves time and compute.

Q4: How do tasks pass data to each other?

Using dbutils.jobs.taskValues:

python
# Task 1: Save a value
dbutils.jobs.taskValues.set(key="row_count", value=1500000)
# Task 2: Read the value
count = dbutils.jobs.taskValues.get(taskKey="task1", key="row_count")

Q5: What is a table-triggered job?

A job that starts automatically when new data arrives in a Delta table (not on a schedule). Uses CDF (Change Data Feed) to detect new rows.
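In a bundle, this can be declared as a table-update trigger on the job instead of a cron schedule. A rough sketch (the `table_update` trigger exists in the Jobs API, but treat the exact field names and the table name here as assumptions to verify against current docs):

```yaml
resources:
  jobs:
    silver_refresh:
      name: "Silver refresh on new Bronze data"
      trigger:
        table_update:
          table_names:
            - main.bookings.bronze_events   # hypothetical Delta table
      tasks:
        - task_key: silver_transform
          notebook_task:
            notebook_path: ./notebooks/silver_transform.py
```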

⚠️ Q6: Workflows vs Airflow — when to use which?
Pro Tip
  • Workflows: Pipeline is 100% inside Databricks
  • Airflow: Orchestrate across multiple platforms (Databricks + Oracle + APIs)
  • Azure Data Factory: Azure-centric alternative to Airflow
  • ⚠️ Don't say "Airflow is better" — say "it depends on the orchestration scope"

SECTION 2: CI/CD

🧠 Memory Map: CI/CD Pipeline

🗂️ CI/CD = "Code in Git → Test → Deploy to Databricks automatically"
DECLARATIVE AUTOMATION BUNDLES (DAB):
Old name: "Asset Bundles" → renamed 2025
= YAML configuration files that define your Databricks resources
databricks.yml = "Recipe book for your entire project"
workspace settings
job definitions (workflows)
cluster configs
pipeline definitions
permissions
DEPLOYMENT FLOW:
Developer writes code + databricks.yml
Git push → triggers Azure DevOps pipeline
Azure DevOps runs:
1. databricks bundle validate (check YAML syntax)
2. databricks bundle deploy -t staging (deploy to staging)
3. Run tests on staging
4. databricks bundle deploy -t production (deploy to prod)
ENVIRONMENTS:
databricks.yml defines targets:
dev → developer workspace
staging → QA/testing workspace
prod → production workspace
Remember: "DAB-VDP" = Declarative Automation Bundles → Validate → Deploy → Production

⚡ MUST KNOW DIRECT QUESTIONS

Q7: What are Declarative Automation Bundles?

YAML configuration files (in Git) that define ALL Databricks resources — jobs, clusters, pipelines, permissions. Deploy with databricks bundle deploy. Formerly called "Asset Bundles."

Q8: What is the main config file?

databricks.yml — defines workspace, targets (dev/staging/prod), jobs, clusters, and pipeline configurations.

Q9: What CLI commands are used?

bash
databricks bundle validate     # Check YAML is correct
databricks bundle deploy -t staging  # Deploy to staging
databricks bundle deploy -t prod     # Deploy to production
databricks bundle destroy -t staging # Remove from staging

Q10: How does CI/CD work with Azure DevOps?

  1. Developer pushes code to Git (Azure Repos)
  2. Azure DevOps pipeline triggers automatically
  3. Pipeline runs: validate → deploy to staging → test → deploy to prod
  4. Uses databricks bundle CLI commands in pipeline YAML

🔑 MID-LEVEL QUESTIONS

Q11: Show a basic databricks.yml example
📝 Note
yaml
bundle:
  name: amadeus_booking_pipeline    # Project name

targets:
  staging:                           # Staging environment
    workspace:
      host: https://adb-staging.azuredatabricks.net
    default: true                    # Default target
  production:                        # Production environment
    workspace:
      host: https://adb-prod.azuredatabricks.net
    run_as:
      service_principal_name: sp-booking-pipeline  # Not personal account!

resources:
  jobs:
    daily_booking_etl:               # Job definition
      name: "Daily Booking ETL"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"  # 6 AM daily
      tasks:
        - task_key: bronze_ingest
          notebook_task:
            notebook_path: ./notebooks/bronze_ingest.py
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest     # Runs AFTER bronze
          notebook_task:
            notebook_path: ./notebooks/silver_transform.py
Q12: What testing strategies exist for Databricks?

4 levels:

  1. Unit tests — Test Python functions (no Spark needed, use pytest)
  2. Integration tests — Test with real Spark session (use staging workspace)
  3. Data quality tests — Lakeflow Expectations on Silver/Gold tables
  4. End-to-end tests — Run full pipeline on sample data, validate output
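Level 1 can be made concrete with a pytest-style unit test on a pure Python helper. `dedupe_bookings` below is a hypothetical function invented for illustration — the point is that because it takes plain dicts, it needs no Spark session and runs in milliseconds in CI:

```python
# Hypothetical helper: keep only the latest record per booking_id.
# Pure Python, so it is unit-testable without a cluster.
def dedupe_bookings(rows):
    latest = {}
    for row in rows:
        key = row["booking_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

def test_dedupe_keeps_latest_version():
    rows = [
        {"booking_id": "B1", "updated_at": "2025-01-01", "status": "PENDING"},
        {"booking_id": "B1", "updated_at": "2025-01-02", "status": "CONFIRMED"},
        {"booking_id": "B2", "updated_at": "2025-01-01", "status": "PENDING"},
    ]
    result = {r["booking_id"]: r["status"] for r in dedupe_bookings(rows)}
    assert result == {"B1": "CONFIRMED", "B2": "PENDING"}
```

Levels 2-4 then exercise the same logic wrapped in DataFrame code on the staging workspace.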

Q13: Azure DevOps pipeline YAML example

yaml
trigger:
  branches:
    include: [main]              # Trigger on push to main

stages:
  - stage: Deploy_Staging
    jobs:
      - job: deploy
        steps:
          - script: pip install databricks-cli
          - script: databricks bundle validate
          - script: databricks bundle deploy -t staging

  - stage: Deploy_Production
    dependsOn: Deploy_Staging     # Only after staging succeeds
    condition: succeeded()
    jobs:
      - job: deploy
        environment: production    # Requires approval gate
        steps:
          - script: databricks bundle deploy -t production

SECTION 3: COST MANAGEMENT

🧠 Memory Map: Cost

5 COST AREAS = "CCSQM"
1. COMPUTE (biggest cost — 60-70% of Databricks bill)
→ Use Job Clusters for production (auto-terminate)
→ Use Spot VMs (60-90% cheaper, but can be reclaimed)
→ Auto-terminate idle clusters (set to 10-15 minutes)
→ Right-size: start small, scale up if needed
2. CLUSTER POLICIES (prevent overspending)
→ Admins set max nodes, allowed VM types, auto-terminate rules
→ Teams can't create 100-node clusters accidentally
3. STORAGE (usually 10-20% of cost)
→ OPTIMIZE (compact files → fewer API calls)
→ VACUUM (delete unused files → save storage)
→ Lifecycle rules on ADLS (move old Bronze to Cool/Archive tier)
4. QUERY OPTIMIZATION (save compute time = save money)
→ Photon (faster queries → less compute time)
→ Liquid Clustering (skip irrelevant files)
→ Caching (spark.databricks.io.cache.enabled = true)
5. MONITORING + CHARGEBACK
→ Tag clusters with team/project (cost allocation)
→ Databricks account console = cost by workspace/cluster
→ Set budget alerts (notify when 80% budget used)
SPOT VMs:
What: Azure VMs at 60-90% discount
Risk: Azure can reclaim them with 30-sec notice
Use for: Workers (not driver!) in non-critical jobs
⚠️ Never use Spot for the driver node — the job dies if it's reclaimed

⚡ MUST KNOW DIRECT QUESTIONS

Q14: What is the biggest cost in Databricks?

Compute (60-70%). Choose right cluster type: Job Cluster for prod, Serverless for SQL, Spot VMs for non-critical batch jobs.

Q15: What are Spot VMs?

Azure VMs at 60-90% discount. Azure can reclaim them anytime (30-sec notice). Use for worker nodes only, never for the driver.

Q16: What are Cluster Policies?

Admin-defined rules that restrict what clusters teams can create. Example: max 10 nodes, only Standard_D4s_v3 VMs, auto-terminate after 15 min.
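A cluster policy is a JSON document of attribute rules. A sketch of the example above might look like this (the attribute-rule types `fixed`, `range`, and `allowlist` follow the Databricks policy definition format, but verify the exact keys against current docs before relying on them):

```json
{
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_D4s_v3"]
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 10
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 15
  },
  "custom_tags.team": {
    "type": "unlimited",
    "isOptional": false
  }
}
```

The last rule makes the `team` tag mandatory, which is how policies also enforce the tagging needed for chargeback.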

Q17: How do you implement chargeback?

Tag every cluster/job with team name and project. Use Databricks account console to see cost per tag. Set budget alerts.
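In a bundle, tags can be attached to a job cluster so every run is cost-attributable. A minimal sketch (the tag names and values here are hypothetical; `custom_tags` is a standard cluster attribute):

```yaml
resources:
  jobs:
    daily_booking_etl:
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: "15.4.x-scala2.12"   # assumed runtime version
            node_type_id: Standard_D4s_v3
            num_workers: 4
            custom_tags:
              team: data-platform               # hypothetical tag values
              project: booking-etl
              cost_center: cc-1234
```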

⚠️ Q18: What is the #1 cost mistake?

Using All-Purpose clusters for production jobs. They stay running 24/7 even when idle. Switch to Job Clusters (auto-terminate) — saves 60-80%.

🔑 MID-LEVEL QUESTIONS

Q19: Design a cost governance strategy for 200+ engineers
1. CLUSTER POLICIES (prevent)
→ Dev: max 4 nodes, auto-terminate 15 min
→ Prod: max 20 nodes, Spot workers, Job Cluster only
→ Data Science: GPU clusters with 2-hour timeout
2. TAGGING (track)
→ Mandatory tags: team, project, cost_center
→ Automated via cluster policies (users can't skip)
3. MONITORING (alert)
→ Weekly cost reports per team
→ Budget alerts at 80% threshold
→ Anomaly detection (team X cost jumped 300%)
4. OPTIMIZATION (reduce)
→ Predictive Optimization (auto OPTIMIZE + VACUUM)
→ Serverless SQL Warehouses (no idle cost)
→ Reserved capacity for stable workloads (1-year commitment = 30-50% off)

SECTION 4: SPARK DEBUGGING

🧠 Memory Map: Debugging

🧠 DEBUGGING = "Read the Spark UI like a doctor reads an X-ray"
SPARK EXECUTION PLAN
Read BOTTOM to TOP! (physical plan starts at the bottom)
== Physical Plan ==
*(5) HashAggregate ← Step 5: Final aggregation (TOP = last step)
+- *(4) Exchange ← Step 4: Shuffle (data moves between nodes)
+- *(3) HashAggregate ← Step 3: Partial aggregation
+- *(2) Project ← Step 2: Column selection
+- *(1) Scan ← Step 1: Read table (BOTTOM = first step)
RED FLAGS in execution plan:
Exchange = SHUFFLE (expensive! data crosses network)
BroadcastExchange = small table broadcast (GOOD — no shuffle)
SortMergeJoin = both tables are large (expensive)
BroadcastHashJoin = one table is small (cheap — GOOD)
CartesianProduct = CROSS JOIN (⚠️ usually a mistake!)
SPARK UI — 5 TABS TO CHECK:
1. Jobs = overall progress (how many stages/tasks)
2. Stages = where is time spent? (longest stage = bottleneck)
3. SQL = query plan visualization
4. Storage = cached data
5. Environment = Spark config values
Remember: "JSSEE" = Jobs, Stages, SQL, Storage, Environment
COMMON ISSUES = "SSOS"
S = Skew (one partition has 100x more data than others)
S = Spill (not enough memory → data written to disk → slow)
O = OOM (Out of Memory — increase executor memory)
S = Small files (too many tiny files → slow reads → OPTIMIZE)

⚡ MUST KNOW DIRECT QUESTIONS

Q20: How do you read a Spark execution plan?

Read bottom to top. Bottom = first step (scan table). Top = last step (output). Look for Exchange (shuffle — expensive) and BroadcastHashJoin (good — no shuffle).

Q21: What is a shuffle?

Data movement across network between Spark executors. Happens during GROUP BY, JOIN, DISTINCT, REPARTITION. Expensive — minimize shuffles for better performance.

Q22: What is data skew?

One partition has much more data than others. Example: 99% of bookings are for "Delhi" airport → one executor does all the work while others sit idle.

Q23: How do you fix data skew?

  1. Salting: Add random number to skewed key → spreads data across partitions
  2. Broadcast join: If one side is small (<10 MB), broadcast it
  3. Adaptive Query Execution (AQE): Spark auto-detects and fixes skew (enabled by default)
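The salting idea in (1) can be sketched without Spark: appending a small random suffix to the hot key spreads its rows over several partitions. This is a pure-Python simulation (bucket = stand-in for a shuffle partition; a deterministic toy partitioner replaces Spark's hash partitioner):

```python
import random
from collections import Counter

random.seed(0)       # reproducible demo
NUM_BUCKETS = 8      # stand-in for shuffle partitions
SALTS = 4            # spread each key over up to 4 sub-keys

def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Toy deterministic partitioner standing in for Spark's hash partitioner.
    return sum(key.encode()) % num_buckets

# 1,000 bookings, 97% for the skewed key "DEL" (Delhi).
keys = ["DEL"] * 970 + ["BOM", "MAA", "BLR"] * 10

# Without salting: every "DEL" row lands in the same bucket.
plain = Counter(bucket_for(k) for k in keys)

# With salting: "DEL" becomes "DEL_0".."DEL_3", spread over several buckets.
salted = Counter(bucket_for(f"{k}_{random.randrange(SALTS)}") for k in keys)

print("max rows in one bucket, plain :", max(plain.values()))
print("max rows in one bucket, salted:", max(salted.values()))
```

In real PySpark you would add a salt column (e.g. with `F.rand()`), replicate the small side of the join once per salt value, and join on (key, salt) so salted keys still match.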

Q24: What is spill?

When Spark runs out of executor memory, it writes data to disk (SSD). Much slower than in-memory processing. Fix: increase spark.executor.memory or reduce partition size.

Q25: What causes OOM (Out of Memory)?

  1. Too much data in one partition (skew)
  2. Collecting large dataset to driver (df.collect() on millions of rows)
  3. Too many broadcast joins (broadcasting large tables)

🔑 MID-LEVEL QUESTIONS

Q26: Pipeline is suddenly 5x slower, no code change. What do you check?

Systematic debugging approach (in this order):

  1. Data volume — Did input data suddenly 10x? (check Bronze row counts)
  2. Data skew — One partition has disproportionate data? (check Spark UI → Stages → task duration variance)
  3. Cluster issues — Spot VMs reclaimed? Fewer nodes? (check cluster events log)
  4. Upstream changes — Source schema changed? New columns? (check Auto Loader _rescued_data)
  5. Concurrent jobs — Other jobs competing for resources? (check workspace activity)
  6. Storage — Too many small files? (check table metrics, run OPTIMIZE)
  7. Spark config — Someone changed spark configs? (check Environment tab)

Q27: How do you use the Spark UI to find bottlenecks?

Step 1: Jobs tab → find the slowest job
Step 2: Click the job → Stages tab → find the slowest stage
Step 3: Click the stage → task list → look at:
Task duration: all similar? (good) or one task 100x longer? (SKEW)
Shuffle read/write: very large? (shuffle problem)
Spill (Memory/Disk): any spill? (need more memory)
Step 4: SQL tab → execution plan → look for Exchange (shuffle)

SECTION 5: SYSTEM DESIGN

🔑 MID-LEVEL QUESTIONS

Q28: Design a data platform for Amadeus on Azure Databricks
```
┌─────────────────────────────────────────────────────────────┐
│ AMADEUS DATA PLATFORM │
├─────────────────────────────────────────────────────────────┤
│ │
│ SOURCES: │
│ Oracle (bookings) → Debezium → Kafka → ADLS landing zone │
│ Flight APIs (JSON) → Event Hubs → ADLS landing zone │
│ Partner files (CSV) → SFTP → ADLS landing zone │
│ │
│ INGESTION: │
│ Auto Loader (cloudFiles) → Bronze Delta tables │
│ Mode: File Notification (scale: billions/day) │
│ │
│ PROCESSING: │
│ Lakeflow Pipelines: │
│ Bronze → Silver (MERGE + dedup + SCD2 + PII tagging) │
│ Silver → Gold (aggregations + star schema) │
│ Quality: Expectations (Warn/Drop/Fail per layer) │
│ │
│ SERVING: │
│ Serverless SQL Warehouse → Power BI dashboards │
│ Delta Sharing → Partner airline data access │
│ Feature Store → ML model training │
│ │
│ GOVERNANCE: │
│ Unity Catalog (3-level namespace) │
│ Row-level security (airline isolation) │
│ Column masking (PII/GDPR) │
│ ABAC tags for automated policy enforcement │
│ │
│ OPERATIONS: │
│ CI/CD: Declarative Automation Bundles + Azure DevOps │
│ Monitoring: Databricks Workflows + budget alerts │
│ Compute: Job Clusters (prod) + Spot workers + auto-term │
│ Cost: Cluster policies + tagging + chargeback │
│ │
└─────────────────────────────────────────────────────────────┘
```
Q29: What makes a good answer to system design questions?

Follow the "RIGS" framework:

  1. Requirements — Clarify scale, latency, users, compliance
  2. Ingestion — How data enters (Auto Loader, Kafka, APIs)
  3. Governance — Security, PII, GDPR, access control
  4. Serving — How consumers access (SQL Warehouse, Delta Sharing, APIs)
Always mention: Medallion layers, Unity Catalog, cost strategy, monitoring

🧠 FINAL REVISION — Day 4 Summary Card

📐 Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│               DAY 4: PRODUCTION & CI/CD                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Workflows = built-in scheduler (free, Databricks-only)     │
│  Repair Run = re-run ONLY failed tasks                      │
│  Task Values = pass data between tasks (dbutils)            │
│  Table-triggered = start when new data arrives              │
│                                                             │
│  CI/CD = Declarative Automation Bundles (was Asset Bundles)  │
│  Config: databricks.yml (YAML in Git)                       │
│  Flow: validate → deploy staging → test → deploy prod       │
│  Azure DevOps pipeline triggers on Git push                 │
│  ⚠️ Always use Service Principal for prod (not personal!)   │
│                                                             │
│  Cost = "CCSQM"                                             │
│  Compute: Job Cluster + Spot VMs + auto-terminate           │
│  Cluster Policies: enforce limits per team                  │
│  Storage: OPTIMIZE + VACUUM + ADLS lifecycle                │
│  Query: Photon + Liquid Clustering + caching                │
│  Monitoring: tagging + budget alerts + chargeback           │
│  ⚠️ #1 mistake: All-Purpose clusters for production!       │
│                                                             │
│  Debugging = "SSOS" (Skew, Spill, OOM, Small files)         │
│  Read execution plan BOTTOM to TOP                          │
│  Exchange = shuffle (expensive!)                            │
│  BroadcastHashJoin = good (no shuffle)                      │
│  Spark UI: Jobs → Stages → Tasks (find bottleneck)          │
│                                                             │
│  System Design: "RIGS" framework                            │
│  Requirements → Ingestion → Governance → Serving            │
│                                                             │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                          │
│  1. "Workflows with Repair Run for failed task recovery"    │
│  2. "DAB + Azure DevOps for automated CI/CD"                │
│  3. "Job Clusters + Spot VMs + auto-terminate for cost"     │
│  4. "Spark UI: check stages for skew + spill"               │
│  5. "Unity Catalog + ABAC + row security for governance"    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Study tip: Read this file TWICE:
  1. First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
  2. Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
  3. Before interview (15 min): Read ONLY the Final Revision Summary Card

🧠 INTERVIEW DAY: ULTRA-QUICK CHEAT SHEET (Read 10 min before interview)

📐 Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│              ALL 4 DAYS — LAST-MINUTE RECALL                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DELTA = Parquet + _delta_log (ACID on data lake)               │
│  MERGE = Match→Update, NotMatch→Insert (deduplicate source!)    │
│  OPTIMIZE = compact files | VACUUM = delete old files (7 days)  │
│  Liquid Clustering > Z-ORDER > Partitioning (new→old)           │
│                                                                 │
│  MEDALLION = Bronze(raw) → Silver(clean) → Gold(business)       │
│  SCD2 = merge_key trick OR apply_changes(scd_type=2)            │
│  CDC = Oracle → Debezium → Kafka → Auto Loader → Delta          │
│  Auto Loader = cloudFiles (Directory or Notification mode)      │
│  Lakeflow = Declarative pipelines + Expectations (Warn/Drop/Fail)│
│                                                                 │
│  Unity Catalog = catalog.schema.table (3 levels)                │
│  Security: Row filter + Column mask + ABAC tags                 │
│  Photon = C++ SQL engine (not for Python UDFs/ML)               │
│  Compute: All-Purpose(dev) → Job Cluster(prod) → Serverless    │
│  GDPR: DELETE + VACUUM 0 HOURS                                  │
│                                                                 │
│  Workflows = built-in scheduler + Repair Run                    │
│  CI/CD = Declarative Automation Bundles (databricks.yml in Git)  │
│  Cost: Job Clusters + Spot VMs + Cluster Policies + Tagging     │
│  Debug: Spark UI bottom→top, look for Exchange(shuffle)         │
│                                                                 │
│  ALWAYS FRAME WITH AMADEUS:                                     │
│  "In our travel booking pipeline with billions of daily events..."│
│  "For GDPR with 200+ airline partners and passenger PII..."     │
│  "At Amadeus scale with Oracle legacy migration to Delta Lake..."│
│                                                                 │
└─────────────────────────────────────────────────────────────────┘