Day 4: Production, CI/CD & Cost — Quick Recall Guide
- ⚡ = Must remember (95% chance of being asked)
- 🔑 = Key concept (core understanding needed)
- ⚠️ = Common trap (interviewers love to test this)
- 🧠 = Memory Map (mnemonic/acronym — memorize this!)
- 📝 = One-liner (flash-card style — cover answer, test yourself)
🧠 MASTER MEMORY MAP — Day 4
SECTION 1: DATABRICKS WORKFLOWS
🧠 Memory Map: Workflows
⚡ MUST KNOW DIRECT QUESTIONS
Built-in job scheduler in Databricks. Define tasks, set dependencies, schedule runs. Free (included in Databricks).
Re-runs ONLY failed tasks in a workflow — doesn't restart the whole pipeline. Saves time and compute.
Using `dbutils.jobs.taskValues`:
```python
# Task 1: save a value for downstream tasks
dbutils.jobs.taskValues.set(key="row_count", value=1500000)

# Task 2: read the value (debugValue is returned when run outside a job)
count = dbutils.jobs.taskValues.get(taskKey="task1", key="row_count", debugValue=0)
```
A job that starts automatically when new data arrives in a Delta table (not on a schedule). A table-update trigger fires on new commits to the table; CDF (Change Data Feed) can then tell you exactly which rows changed.
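As a hedged sketch, a table-update trigger in a bundle job definition might look like this (the job and table names are illustrative, not from the original notes):

```yaml
# Illustrative DAB fragment: run the job when the watched table gets new commits
resources:
  jobs:
    booking_refresh:
      trigger:
        table_update:
          table_names:
            - main.silver.bookings   # watched Delta table (hypothetical name)
          condition: ANY_UPDATED     # fire when any watched table is updated
```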
- Workflows: Pipeline is 100% inside Databricks
- Airflow: Orchestrate across multiple platforms (Databricks + Oracle + APIs)
- Azure Data Factory: Azure-centric alternative to Airflow
- ⚠️ Don't say "Airflow is better" — say "it depends on the orchestration scope"
SECTION 2: CI/CD
🧠 Memory Map: CI/CD Pipeline
⚡ MUST KNOW DIRECT QUESTIONS
YAML configuration files (in Git) that define ALL Databricks resources — jobs, clusters, pipelines, permissions. Deploy with `databricks bundle deploy`. Official name: Databricks Asset Bundles (DABs).
databricks.yml — defines workspace, targets (dev/staging/prod), jobs, clusters, and pipeline configurations.
```shell
databricks bundle validate              # Check the YAML is correct
databricks bundle deploy -t staging     # Deploy to staging
databricks bundle deploy -t prod        # Deploy to production
databricks bundle destroy -t staging    # Remove from staging
```
- Developer pushes code to Git (Azure Repos)
- Azure DevOps pipeline triggers automatically
- Pipeline runs: validate → deploy to staging → test → deploy to prod
- Uses `databricks bundle` CLI commands in the pipeline YAML
🔑 MID-LEVEL QUESTIONS
databricks.yml example:
```yaml
bundle:
  name: amadeus_booking_pipeline        # Project name

targets:
  staging:                              # Staging environment
    default: true                       # Default target
    workspace:
      host: https://adb-staging.azuredatabricks.net
  production:                           # Production environment
    workspace:
      host: https://adb-prod.azuredatabricks.net
    run_as:
      service_principal_name: sp-booking-pipeline  # Not a personal account!

resources:
  jobs:
    daily_booking_etl:                  # Job definition
      name: "Daily Booking ETL"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"      # 6 AM daily
        timezone_id: UTC
      tasks:
        - task_key: bronze_ingest
          notebook_task:
            notebook_path: ./notebooks/bronze_ingest.py
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest   # Runs AFTER bronze
          notebook_task:
            notebook_path: ./notebooks/silver_transform.py
```
Testing a pipeline — 4 levels:
- Unit tests — Test Python functions (no Spark needed, use pytest)
- Integration tests — Test with real Spark session (use staging workspace)
- Data quality tests — Lakeflow Expectations on Silver/Gold tables
- End-to-end tests — Run full pipeline on sample data, validate output
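Level 1 can be sketched as a plain pytest unit test — the function and file names below are illustrative, not from the original notes; the point is that no Spark session or cluster is needed:

```python
# Hypothetical pure-Python helper as it might live in src/transforms.py
def normalize_airport_code(code: str) -> str:
    """Upper-case and strip an IATA airport code; reject bad input."""
    cleaned = code.strip().upper()
    if len(cleaned) != 3 or not cleaned.isalpha():
        raise ValueError(f"invalid IATA code: {code!r}")
    return cleaned


# test_transforms.py — runs with plain `pytest`, no cluster required
def test_normalize_airport_code():
    assert normalize_airport_code(" del ") == "DEL"


def test_rejects_bad_code():
    import pytest
    with pytest.raises(ValueError):
        normalize_airport_code("DELHI")
```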
Azure DevOps pipeline (`azure-pipelines.yml`):
```yaml
trigger:
  branches:
    include: [main]                  # Trigger on push to main

stages:
  - stage: Deploy_Staging
    jobs:
      - job: deploy
        steps:
          # The legacy `pip install databricks-cli` package has no `bundle`
          # command — install the current Databricks CLI instead
          - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
          - script: databricks bundle validate
          - script: databricks bundle deploy -t staging

  - stage: Deploy_Production
    dependsOn: Deploy_Staging        # Only after staging succeeds
    condition: succeeded()
    jobs:
      - deployment: deploy           # `environment` requires a deployment job
        environment: production      # Requires approval gate
        strategy:
          runOnce:
            deploy:
              steps:
                - script: databricks bundle deploy -t production
```
SECTION 3: COST MANAGEMENT
🧠 Memory Map: Cost
⚡ MUST KNOW DIRECT QUESTIONS
Compute (60-70%). Choose right cluster type: Job Cluster for prod, Serverless for SQL, Spot VMs for non-critical batch jobs.
Azure VMs at 60-90% discount. Azure can reclaim them anytime (30-sec notice). Use for worker nodes only, never for the driver.
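The "workers on Spot, driver on-demand" split can be sketched in a bundle job-cluster definition — cluster key, node type and versions below are illustrative:

```yaml
# Illustrative DAB fragment: job cluster with Spot workers, on-demand driver
job_clusters:
  - job_cluster_key: etl_cluster
    new_cluster:
      spark_version: 15.4.x-scala2.12
      node_type_id: Standard_D4s_v3
      num_workers: 8
      azure_attributes:
        first_on_demand: 1                      # first node (the driver) stays on-demand
        availability: SPOT_WITH_FALLBACK_AZURE  # workers on Spot, fall back if reclaimed
```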
Admin-defined rules that restrict what clusters teams can create. Example: max 10 nodes, only Standard_D4s_v3 VMs, auto-terminate after 15 min.
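A cluster policy is defined as JSON; a sketch matching the example limits above (attribute paths follow the cluster-policy definition language, and the tag value is illustrative):

```json
{
  "node_type_id":            { "type": "allowlist", "values": ["Standard_D4s_v3"] },
  "autoscale.max_workers":   { "type": "range", "maxValue": 10 },
  "autotermination_minutes": { "type": "range", "maxValue": 15, "defaultValue": 15 },
  "custom_tags.team":        { "type": "fixed", "value": "data-eng" }
}
```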
Tag every cluster/job with team name and project. Use Databricks account console to see cost per tag. Set budget alerts.
Using All-Purpose clusters for production jobs. They stay running 24/7 even when idle. Switch to Job Clusters (auto-terminate) — saves 60-80%.
SECTION 4: SPARK DEBUGGING
🧠 Memory Map: Debugging
⚡ MUST KNOW DIRECT QUESTIONS
Read bottom to top. Bottom = first step (scan table). Top = last step (output). Look for Exchange (shuffle — expensive) and BroadcastHashJoin (good — no shuffle).
Data movement across network between Spark executors. Happens during GROUP BY, JOIN, DISTINCT, REPARTITION. Expensive — minimize shuffles for better performance.
One partition has much more data than others. Example: 99% of bookings are for "Delhi" airport → one executor does all the work while others sit idle.
- Salting: Add random number to skewed key → spreads data across partitions
- Broadcast join: If one side is small (<10 MB), broadcast it
- Adaptive Query Execution (AQE): Spark auto-detects and fixes skew (enabled by default)
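The salting trick above can be illustrated in plain Python (no cluster needed; names and counts are illustrative): a hot key is split into N sub-keys, so its rows hash to N partitions instead of one.

```python
import random
from collections import Counter

N_SALTS = 8  # illustrative salt count

def salted_key(key: str) -> str:
    # e.g. "DEL" -> "DEL_3"; aggregate per sub-key first, then combine
    # the partial results per original key
    return f"{key}_{random.randrange(N_SALTS)}"

bookings = ["DEL"] * 1000 + ["BOM"] * 10   # 99% of rows share one key
by_salt = Counter(salted_key(k) for k in bookings)

# "DEL" now spreads across up to 8 sub-keys instead of one giant partition
assert max(v for k, v in by_salt.items() if k.startswith("DEL_")) < 1000
```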
When Spark runs out of executor memory, it writes data to disk (SSD). Much slower than in-memory processing. Fix: increase spark.executor.memory or reduce partition size.
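Two commonly adjusted knobs, as a sketch (the values are illustrative — tune them against your own workload):

```yaml
# Cluster spark_conf fragment (illustrative values)
spark_conf:
  spark.sql.adaptive.enabled: "true"    # AQE coalesces/splits shuffle partitions
  spark.sql.shuffle.partitions: "400"   # more, smaller partitions -> less per-task spill
```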
- Too much data in one partition (skew)
- Collecting a large dataset to the driver (`df.collect()` on millions of rows)
- Too many broadcast joins (broadcasting large tables)
🔑 MID-LEVEL QUESTIONS
Systematic debugging approach (in this order):
- Data volume — Did input data suddenly 10x? (check Bronze row counts)
- Data skew — One partition has disproportionate data? (check Spark UI → Stages → task duration variance)
- Cluster issues — Spot VMs reclaimed? Fewer nodes? (check cluster events log)
- Upstream changes — Source schema changed? New columns? (check Auto Loader _rescued_data)
- Concurrent jobs — Other jobs competing for resources? (check workspace activity)
- Storage — Too many small files? (check table metrics, run OPTIMIZE)
- Spark config — Someone changed spark configs? (check Environment tab)
SECTION 5: SYSTEM DESIGN
🔑 MID-LEVEL QUESTIONS
Follow the "RIGS" framework:
- Requirements — Clarify scale, latency, users, compliance
- Ingestion — How data enters (Auto Loader, Kafka, APIs)
- Governance — Security, PII, GDPR, access control
- Serving — How consumers access (SQL Warehouse, Delta Sharing, APIs)
🧠 FINAL REVISION — Day 4 Summary Card
┌─────────────────────────────────────────────────────────────┐
│ DAY 4: PRODUCTION & CI/CD                                   │
├─────────────────────────────────────────────────────────────┤
│ Workflows = built-in scheduler (free, Databricks-only)      │
│ Repair Run = re-run ONLY failed tasks                       │
│ Task Values = pass data between tasks (dbutils)             │
│ Table-triggered = start when new data arrives               │
│                                                             │
│ CI/CD = Databricks Asset Bundles (DABs)                     │
│ Config: databricks.yml (YAML in Git)                        │
│ Flow: validate → deploy staging → test → deploy prod        │
│ Azure DevOps pipeline triggers on Git push                  │
│ ⚠️ Always use Service Principal for prod (not personal!)    │
│                                                             │
│ Cost = "CCSQM"                                              │
│ Compute: Job Cluster + Spot VMs + auto-terminate            │
│ Cluster Policies: enforce limits per team                   │
│ Storage: OPTIMIZE + VACUUM + ADLS lifecycle                 │
│ Query: Photon + Liquid Clustering + caching                 │
│ Monitoring: tagging + budget alerts + chargeback            │
│ ⚠️ #1 mistake: All-Purpose clusters for production!         │
│                                                             │
│ Debugging = "SSOS" (Skew, Spill, OOM, Small files)          │
│ Read execution plan BOTTOM to TOP                           │
│ Exchange = shuffle (expensive!)                             │
│ BroadcastHashJoin = good (no shuffle)                       │
│ Spark UI: Jobs → Stages → Tasks (find bottleneck)           │
│                                                             │
│ System Design: "RIGS" framework                             │
│ Requirements → Ingestion → Governance → Serving             │
│                                                             │
│ TOP 5 THINGS TO SAY IN INTERVIEW:                           │
│ 1. "Workflows with Repair Run for failed task recovery"     │
│ 2. "DAB + Azure DevOps for automated CI/CD"                 │
│ 3. "Job Clusters + Spot VMs + auto-terminate for cost"      │
│ 4. "Spark UI: check stages for skew + spill"                │
│ 5. "Unity Catalog + ABAC + row security for governance"     │
└─────────────────────────────────────────────────────────────┘
- First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
- Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
- Before interview (15 min): Read ONLY the Final Revision Summary Card
🧠 INTERVIEW DAY: ULTRA-QUICK CHEAT SHEET (Read 10 min before interview)
┌─────────────────────────────────────────────────────────────────┐
│ ALL 4 DAYS — LAST-MINUTE RECALL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DELTA = Parquet + _delta_log (ACID on data lake) │
│ MERGE = Match→Update, NotMatch→Insert (deduplicate source!) │
│ OPTIMIZE = compact files | VACUUM = delete old files (7 days) │
│ Liquid Clustering > Z-ORDER > Partitioning (new→old) │
│ │
│ MEDALLION = Bronze(raw) → Silver(clean) → Gold(business) │
│ SCD2 = merge_key trick OR apply_changes(scd_type=2) │
│ CDC = Oracle → Debezium → Kafka → Auto Loader → Delta │
│ Auto Loader = cloudFiles (Directory or Notification mode) │
│ Lakeflow = Declarative pipelines + Expectations (Warn/Drop/Fail)│
│ │
│ Unity Catalog = catalog.schema.table (3 levels) │
│ Security: Row filter + Column mask + ABAC tags │
│ Photon = C++ SQL engine (not for Python UDFs/ML) │
│ Compute: All-Purpose(dev) → Job Cluster(prod) → Serverless │
│ GDPR: DELETE + VACUUM 0 HOURS │
│ │
│ Workflows = built-in scheduler + Repair Run │
│ CI/CD = Databricks Asset Bundles (DABs, databricks.yml in Git) │
│ Cost: Job Clusters + Spot VMs + Cluster Policies + Tagging │
│ Debug: Spark UI bottom→top, look for Exchange(shuffle) │
│ │
│ ALWAYS FRAME WITH AMADEUS: │
│ "In our travel booking pipeline with billions of daily events..."│
│ "For GDPR with 200+ airline partners and passenger PII..." │
│ "At Amadeus scale with Oracle legacy migration to Delta Lake..."│
│ │
└─────────────────────────────────────────────────────────────────┘