Day 4: Production, CI/CD, Cost Management & Mock Interview
SECTION 1: DATABRICKS WORKFLOWS & ORCHESTRATION (1 hour)
Q1: What is Databricks Workflows? How does it compare to Apache Airflow?
Simple Explanation: A workflow is a scheduled pipeline — a series of tasks that run in order. For example: "Every day at 6am, run bronze ingestion → then silver transformation → then gold aggregation."
Databricks Workflows is Databricks' built-in job scheduler. You define tasks (notebooks, SQL, Python scripts), set dependencies (task B runs after task A), and schedule them.
Apache Airflow is a separate open-source tool that does the same thing but works across ANY platform (not just Databricks). It's more flexible but requires more setup and maintenance.
Real-world analogy:
- Databricks Workflows = Your company's internal task management tool (simple, built-in, works only within your company)
- Apache Airflow = A universal project management tool (powerful, works everywhere, but you need to install and maintain it yourself)
| Aspect | Databricks Workflows | Apache Airflow |
|---|---|---|
| Setup | Zero — already built into Databricks | Requires deployment, servers, maintenance |
| Task types | Notebook, Python, SQL, Lakeflow, dbt, JAR | Any cloud/tool (Databricks, Snowflake, APIs, anything) |
| Repair/Retry | Built-in — re-run ONLY failed tasks (smart!) | Task retries plus manual "clear" to re-run failed tasks — less one-click than Repair |
| Multi-platform | Databricks only — can't orchestrate Snowflake or external APIs | Works across ANY platform (biggest advantage) |
| Cost | Free (included in Databricks) | Separate infrastructure cost |
| Versioning | Declarative Automation Bundles (YAML in Git) | Git-synced Python DAGs |
| Job backfills | Built-in GA (2025) — reprocess historical data | Built-in backfill support |
When to use which:
- Workflows: When your pipeline is 100% inside Databricks
- Airflow: When you orchestrate across multiple platforms (Databricks + Oracle + external APIs)
- Azure Data Factory: Alternative to Airflow for Azure-centric orchestration
Amadeus answer: "We use Databricks Workflows for our Databricks-native ETL pipelines. For cross-platform orchestration (Oracle → Event Hubs → Databricks → Power BI), we use Azure Data Factory."
Q2: How do tasks pass data to each other in a Workflow?
Simple Explanation: In a workflow with multiple tasks (bronze → silver → gold), sometimes Task 2 needs information from Task 1. For example, Task 1 ingests data and counts 1.5 million records. Task 2 needs to know that count to decide how to process the data.
Databricks provides dbutils.jobs.taskValues for this — Task 1 can SET values, and Task 2 can GET those values.
# ============================================
# TASK 1: Bronze Ingestion
# ============================================
# After ingesting data, save useful info for downstream tasks
dbutils.jobs.taskValues.set(key="record_count", value=1500000)
# "I ingested 1.5 million records" → save for Task 2
dbutils.jobs.taskValues.set(key="max_booking_date", value="2026-03-15")
# "The latest booking date in this batch is March 15" → save for Task 2
dbutils.jobs.taskValues.set(key="status", value="success")
# "I completed successfully" → save for Task 2
# ============================================
# TASK 2: Silver Transformation (runs AFTER Task 1)
# ============================================
# Read values that Task 1 saved
count = dbutils.jobs.taskValues.get(
    taskKey="bronze_ingestion",   # Name of the upstream task
    key="record_count"            # Which value to get
)
max_date = dbutils.jobs.taskValues.get(
    taskKey="bronze_ingestion",
    key="max_booking_date"
)
# Use the values for conditional logic
if count > 1000000:
    # Large batch → use a different processing strategy
    process_large_batch()
else:
    process_normal_batch()
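Note that dbutils only exists inside a Databricks runtime. For local unit tests, the taskValues API can be faked with a small in-memory stub — a sketch for testing notebook logic on a laptop, not anything Databricks ships:

```python
# Minimal in-memory stand-in for dbutils.jobs.taskValues, useful for
# unit-testing notebook logic outside Databricks. The class name and
# storage scheme are illustrative, not part of any Databricks API.
class FakeTaskValues:
    def __init__(self):
        self._store = {}  # (task_key, key) -> value

    def set(self, key, value, task_key="bronze_ingestion"):
        # Real task values are scoped to the task that sets them; here the
        # setting task's name defaults to "bronze_ingestion" for simplicity.
        self._store[(task_key, key)] = value

    def get(self, taskKey, key, default=None):
        return self._store.get((taskKey, key), default)


task_values = FakeTaskValues()
task_values.set(key="record_count", value=1_500_000)
task_values.set(key="status", value="success")

count = task_values.get(taskKey="bronze_ingestion", key="record_count")
strategy = "large_batch" if count > 1_000_000 else "normal_batch"
```

In a real notebook you would inject either dbutils.jobs.taskValues or this stub, so the same conditional logic runs in both environments.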
Q3: What are table-triggered jobs?
Simple Explanation: Normally, jobs run on a fixed schedule (e.g., "every day at 6am"). But what if bronze data arrives at unpredictable times? You'd either run too often (wasting compute) or too rarely (stale data).
Table-triggered jobs (new October 2025) solve this: the job runs AUTOMATICALLY when the source table gets new data. No cron schedule needed.
Real-world analogy: Instead of checking your mailbox every hour, you set up a notification: "Alert me when new mail arrives." That's table-triggered.
# In Declarative Automation Bundle config (databricks.yml):
resources:
  jobs:
    silver_bookings:
      trigger:
        table:
          condition: ANY_UPDATED   # Run when ANY of these tables change
          table_names:
            - travel_prod.bookings.bronze_bookings
            - travel_prod.bookings.bronze_cancellations
      # When bronze_bookings OR bronze_cancellations gets new data,
      # this job automatically runs to update the silver layer!
      tasks:
        - task_key: transform_silver
          notebook_task:
            notebook_path: ./notebooks/silver_bookings.py
Amadeus use case: "Our Silver layer automatically refreshes whenever new Bronze data lands — no fixed schedule, no unnecessary runs, no stale data."
SECTION 2: CI/CD WITH DECLARATIVE AUTOMATION BUNDLES (1 hour)
Q4: What are Declarative Automation Bundles?
Simple Explanation: Imagine you need to deploy the same Databricks pipeline to 3 environments: dev, staging, and prod. Without CI/CD, you'd manually create jobs, configure clusters, and set schedules in each environment — error-prone and tedious.
Declarative Automation Bundles (formerly called "Asset Bundles", renamed March 2026) let you define your entire Databricks setup as YAML files in Git. Same code, same config, deployed to any environment with one command.
Real-world analogy: Think of it like a recipe book. Instead of cooking from memory (manual setup), you follow the recipe (YAML file). Same recipe, same dish, whether you cook in the dev kitchen or the prod kitchen.
Why "Declarative"? Because you DECLARE what you want (jobs, schedules, clusters) and Databricks creates everything for you. You don't write imperative code saying "create cluster, wait, attach notebook, run..."
# databricks.yml — THE RECIPE for your entire pipeline setup
bundle:
  name: amadeus_booking_pipeline  # Name of this bundle

workspace:
  host: https://adb-1234567890.azuredatabricks.net  # Default workspace

resources:
  jobs:
    daily_booking_etl:  # Job definition
      name: "Daily Booking ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"  # Run at 6:00 AM daily
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingestion  # First task
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion.py  # Which notebook to run
          new_cluster:  # Cluster config for this task
            spark_version: "18.1.x-scala2.13"  # Databricks Runtime version
            node_type_id: "Standard_DS3_v2"    # Azure VM type
            num_workers: 4                     # 4 worker VMs
            spark_conf:
              spark.databricks.photon.enabled: "true"  # Enable Photon for speed
        - task_key: silver_transform  # Second task
          depends_on:
            - task_key: bronze_ingestion  # Runs AFTER bronze completes
          notebook_task:
            notebook_path: ./notebooks/silver_transform.py
        - task_key: gold_aggregate  # Third task
          depends_on:
            - task_key: silver_transform  # Runs AFTER silver completes
          notebook_task:
            notebook_path: ./notebooks/gold_aggregate.py

# ENVIRONMENTS — same pipeline, different workspaces
targets:
  dev:  # Development environment
    workspace:
      host: https://dev-adb.azuredatabricks.net
  staging:  # Staging/testing environment
    workspace:
      host: https://staging-adb.azuredatabricks.net
  prod:  # Production environment
    workspace:
      host: https://prod-adb.azuredatabricks.net
    run_as:
      service_principal_name: "amadeus-prod-sp"  # Prod runs as Service Principal (not a person)
# DEPLOY with one command per environment:
databricks bundle validate --target dev # Check config is valid
databricks bundle deploy --target dev # Deploy to dev
databricks bundle deploy --target staging # Deploy to staging (same code!)
databricks bundle deploy --target prod # Deploy to production (same code!)
# Same pipeline definition → consistent across ALL environments
Q5: How does CI/CD work with Azure DevOps?
Simple Explanation: CI/CD means:
- CI (Continuous Integration): Every time code is pushed to Git, automatically run tests and checks
- CD (Continuous Deployment): After tests pass, automatically deploy to staging/production
# azure-pipelines.yml — The CI/CD pipeline definition
trigger:
  branches:
    include: [main, develop]  # Run this pipeline when code is pushed to main or develop

pool:
  vmImage: 'ubuntu-latest'  # Run on Ubuntu (Azure-hosted build agent)

stages:
  # ============================================
  # STAGE 1: VALIDATE (run on every push)
  # ============================================
  - stage: Validate
    jobs:
      - job: ValidateAndTest
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.10'  # Use Python 3.10
          - script: |
              pip install databricks-cli ruff pytest
              ruff check src/   # Lint: check code style and errors
              pytest tests/ -v  # Unit tests: test transformation logic
            displayName: 'Lint and Unit Tests'
          - script: |
              databricks bundle validate --target staging
              # Validate: check that the YAML config is correct
            displayName: 'Validate Bundle Config'

  # ============================================
  # STAGE 2: DEPLOY TO STAGING (only on main branch)
  # ============================================
  - stage: DeployStaging
    dependsOn: Validate  # Only runs after Validate passes
    condition: eq(variables['Build.SourceBranch'], 'refs/heads/main')
    # Only deploy when pushing to main branch (not feature branches)
    jobs:
      - job: DeployToStaging
        steps:
          - script: |
              databricks bundle deploy --target staging
              # Deploy the pipeline to the staging environment
              databricks bundle run daily_booking_etl --target staging
              # Run the pipeline on staging data (integration test)
            displayName: 'Deploy and Integration Test on Staging'

  # ============================================
  # STAGE 3: DEPLOY TO PRODUCTION (with manual approval!)
  # ============================================
  - stage: DeployProd
    dependsOn: DeployStaging  # Only runs after staging succeeds
    jobs:
      - deployment: ProductionDeploy
        environment: 'production'  # REQUIRES manual approval by a reviewer!
        # Someone must click "Approve" in Azure DevOps before this runs
        strategy:
          runOnce:
            deploy:
              steps:
                - script: |
                    databricks bundle deploy --target prod
                  displayName: 'Deploy to Production'
Flow: Push code → Auto-lint + test → Auto-deploy staging → Manual approval → Deploy prod
Q6: What testing strategies exist for Databricks pipelines?
Simple Explanation: Testing ensures your pipeline works correctly before deploying to production. There are 4 levels:
- Unit tests — transformation logic factored into plain functions and tested with pytest (no cluster needed)
- Bundle validation — databricks bundle validate catches broken config before any deploy
- Integration tests — run the deployed pipeline against staging data (e.g., with Nutter)
- Data quality tests — Lakeflow expectations validate the data itself at every layer, continuously in production
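At the unit level, the trick is to pull transformation logic out of notebooks into plain functions so pytest can exercise it without a cluster. A minimal sketch — validate_booking and its rules are illustrative, not from any real pipeline:

```python
# Unit-test sketch: transformation logic as a plain function, testable
# locally with pytest. Function name and validation rules are illustrative.
def validate_booking(row: dict) -> bool:
    """Silver-layer rule: a booking needs an ID and a non-negative fare."""
    return bool(row.get("booking_id")) and row.get("fare_amount", -1) >= 0


def test_validate_booking():
    assert validate_booking({"booking_id": "B1", "fare_amount": 120.0})
    assert not validate_booking({"booking_id": "", "fare_amount": 120.0})   # missing ID
    assert not validate_booking({"booking_id": "B2", "fare_amount": -5})    # negative fare


test_validate_booking()  # pytest would discover and run this automatically
```

The same rule would then be expressed as a Lakeflow expectation in production, so the logic is checked both before and after deploy.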
SECTION 3: COST MANAGEMENT (45 min)
Q7: How do you manage costs for a large Databricks deployment?
Simple Explanation: Databricks can get expensive fast — especially with 200+ engineers and hundreds of pipelines. Cost management is about: using the right compute for each workload, preventing waste, and tracking who spends what.
5 areas of cost management:
- Compute choice: Job Clusters for production (not All-Purpose!), Serverless for bursty workloads, Spot VMs for workers
- Cluster policies: enforce max cluster size, auto-termination, and required tags
- Storage: OPTIMIZE + VACUUM regularly, or enable Predictive Optimization
- Query efficiency: Photon + Liquid Clustering so queries scan less data
- Monitoring: tags on everything, cost dashboards, budget alerts, chargeback per team
Q8: What are Spot VMs? How do they save money?
Simple Explanation: Azure has spare VMs that nobody is using at the moment. Instead of letting them sit idle, Azure offers them at a huge discount (up to 90% off) — these are called Spot VMs (or Low-Priority VMs). The catch: Azure can take them back with 30 seconds notice if someone else needs them.
Databricks handles this gracefully — if a Spot worker is reclaimed, Spark retries the task on another worker. Your job doesn't fail.
Amadeus answer: "Our batch ETL jobs use Spot VMs for 80% of workers, saving 60-70% on compute costs. The driver is always on-demand to ensure job reliability."
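The quoted 60-70% figure can be sanity-checked with quick arithmetic — a sketch assuming an illustrative 80% Spot discount and identical per-VM prices (not Azure price-list values):

```python
# Back-of-envelope check on the quoted 60-70% worker-cost savings.
# Discount, worker count, and prices are illustrative assumptions.
on_demand_price = 1.00   # normalized hourly cost of one on-demand VM
spot_discount = 0.80     # assume Spot runs at an 80% discount
workers = 10

spot_workers = int(workers * 0.8)           # 80% of workers on Spot
on_demand_workers = workers - spot_workers  # the rest stay on-demand

full_cost = workers * on_demand_price
mixed_cost = (on_demand_workers * on_demand_price
              + spot_workers * on_demand_price * (1 - spot_discount))
savings = 1 - mixed_cost / full_cost
print(f"worker-cost savings: {savings:.0%}")  # → 64%
```

Note this covers workers only — the always-on-demand driver adds a bit of cost on top, which is why the realistic range lands at 60-70% rather than the raw discount.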
SECTION 4: PRODUCTION DEBUGGING (45 min)
Q9: How do you read a Spark execution plan?
Simple Explanation: When you run a Spark query, Spark creates a plan — a step-by-step recipe for how to process the data. Reading this plan tells you: Is the query efficient? Are there unnecessary shuffles? Is Spark using the right join strategy?
Key rule: Read execution plans BOTTOM UP. The bottom is where data reading starts, and the top is the final result.
# Example query: total fare by airport for a specific date
from pyspark.sql.functions import col, sum as sum_

df = (
    spark.table("bookings")
    .filter(col("booking_date") == "2026-03-15")            # Filter by date
    .join(spark.table("passengers"), "passenger_id")        # Join with passengers
    .groupBy("departure_airport").agg(sum_("fare_amount"))  # Sum fares by airport
)

df.explain(True)  # Show the full execution plan (parsed → analyzed → optimized → physical)
What to look for (red flags vs green flags):
| Element | Good or Bad? | What It Means |
|---|---|---|
| BroadcastHashJoin | ✅ Good | Small table sent to all workers — no shuffle needed |
| SortMergeJoin | ⚠️ Check | Both tables shuffled — could you broadcast instead if one side is small? |
| Exchange | ⚠️ Expensive | SHUFFLE = data moving between workers. Fewer = better |
| PushedFilters | ✅ Good | Filter pushed to file level — Delta skips irrelevant files |
| No PushedFilters | ❌ Bad | Filter NOT pushed — reading more data than needed |
| WholeStageCodegen (*) | ✅ Good | Spark compiled the query to fast native code |
| Missing WholeStageCodegen | ⚠️ Check | May be using Python UDFs that break codegen |
Q10: How do you debug slow jobs using Spark UI?
Simple Explanation: Spark UI is a web interface that shows you exactly what happened during your job — how long each task took, how much data was shuffled, whether any tasks were much slower than others (data skew).
Interview tip: Walk through this checklist when asked "How would you debug a slow job?" — it shows systematic thinking, not guessing:
- Jobs/Stages tabs → which stage dominates the runtime?
- Task metrics → is one task far slower than the rest? (data skew)
- Shuffle Read/Write → how much data is moving between workers?
- Spill (memory/disk) → are tasks running out of memory?
- SQL tab → does the plan show unexpected Exchanges or missing PushedFilters?
- GC time → is garbage collection eating task time?
Q11: Scenario — Pipeline suddenly 5x slower, no code change. What happened?
Simple Explanation: This is one of the most common interview scenarios. The key: DON'T just say "run OPTIMIZE". Show a systematic investigation process:
- Check data volume — did it spike? (seasonal peak, a backfill?)
- Check Spark UI → Tasks → one task much slower than the others? → data skew
- Check table health: DESCRIBE DETAIL → too many small files? Has OPTIMIZE been running regularly?
- Check for upstream schema changes
- Check for cluster config or runtime changes
- Quick fixes: OPTIMIZE, ANALYZE TABLE, enable AQE, tune shuffle partitions
Q12: Common production issues — quick reference
| Issue | What You See | Root Cause | How to Fix |
|---|---|---|---|
| ConcurrentModificationException | Two pipelines write to same table at same time | Concurrent writes touch same files | Partition writes by time, or use WAP (Write-Audit-Publish) pattern |
| Pipeline silently produces wrong data | Reports show wrong numbers | No data quality checks | Add Lakeflow expectations at every layer |
| Costs spike on weekends | Bill jumps 2x on Saturday/Sunday | Developers leave clusters running | Enforce auto-termination in cluster policies |
| Query suddenly slow | Query that took 30 sec now takes 10 min | Table statistics are outdated | ANALYZE TABLE bookings COMPUTE STATISTICS |
| Streaming job keeps restarting | Job restarts every few minutes | Checkpoint/state store corrupted | Delete checkpoint, reprocess from source |
| SchemaEvolutionException | Pipeline fails with schema error | Source added new columns | Use Auto Loader rescue mode |
| FileNotFoundException | Query fails with "file not found" | VACUUM deleted a file while query was reading it | Increase VACUUM retention period (default 7 days) |
| Job works in dev, fails in prod | Same code, different results | Different cluster config/runtime | Use Declarative Automation Bundles (same config everywhere) |
SECTION 5: SYSTEM DESIGN (45 min)
Q13: Design a data platform for Amadeus on Azure Databricks
Simple Explanation: This is a "big picture" architecture question. They want to see: Can you design an entire data platform? Think about workspaces, catalogs, compute, governance, CI/CD, cost, and monitoring.
SECTION 6: MOCK INTERVIEW — 10 Most Likely Amadeus Questions (1 hour)
Practice answering each in 3-5 minutes OUT LOUD. Time yourself.
MOCK Q1: "Tell me about your experience with Databricks and Delta Lake."
How to answer (structure):
- Start with your role and the scale you work at
- Mention specific Databricks features you've used
- Share ONE challenge you solved (with results)
- Keep it to 2-3 minutes
Example: "In my current role, I work with Delta Lake tables processing X million records daily. I've implemented Medallion architecture using Lakeflow Pipelines with quality expectations at every layer. A key challenge was optimizing a MERGE operation on our fact table that took 3 hours — by adding partition pruning, OPTIMIZE with Z-ORDER, and switching to Photon, we brought it down to 40 minutes."
MOCK Q2: "Design a CDC pipeline from Oracle to Delta Lake for our booking system."
Key points to cover (in order):
- Source: Debezium Oracle connector reads redo logs (no Oracle performance impact)
- Message broker: Azure Event Hubs (Kafka-compatible, managed service)
- Bronze: Auto Loader ingests from Event Hubs, appends raw CDC events as-is
- Silver: MERGE for fact tables (SCD Type 1), apply_changes for dimensions (SCD Type 2)
- Gold: Materialized views for business aggregations
- Quality: Lakeflow expectations at every layer
- Governance: Unity Catalog from day 1, PII masking on passenger data
- GDPR: Deletion pipeline for "right to be forgotten"
MOCK Q3: "How would you implement SCD Type 2 for our passenger dimension?"
Cover BOTH approaches:
- SQL MERGE with merge_key trick — show you can write the code manually (Day 2, Q5)
- Lakeflow apply_changes — show you know the modern, simpler approach (Day 2, Q6)
- Mention hash-based change detection (MD5 of tracked columns)
- Mention deduplication of source before MERGE
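The hash-based change detection in the third bullet can be sketched in plain Python — in Spark SQL the equivalent is md5(concat_ws(...)). The tracked columns and separator below are illustrative assumptions:

```python
import hashlib

# Sketch of hash-based change detection for SCD Type 2: hash the tracked
# columns of each row; a different hash means the dimension row changed,
# so the old version gets closed out and a new version inserted.
TRACKED_COLUMNS = ["email", "phone", "loyalty_tier"]  # illustrative list


def row_hash(row: dict) -> str:
    # Use a separator unlikely to appear in the data, so ("ab","c") and
    # ("a","bc") don't collide; "||" is an assumption, pick per your data.
    payload = "||".join(str(row.get(c, "")) for c in TRACKED_COLUMNS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


old = {"email": "a@x.com", "phone": "123", "loyalty_tier": "gold"}
new = {"email": "a@x.com", "phone": "123", "loyalty_tier": "platinum"}

changed = row_hash(old) != row_hash(new)  # True → close old row, insert new
```

Comparing one hash column instead of N tracked columns keeps the MERGE condition short and cheap, which matters on wide dimensions.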
MOCK Q4: "Our booking fact table MERGE takes 3 hours. How would you optimize?"
Systematic approach (don't jump to solutions — investigate first):
- DESCRIBE DETAIL → check numFiles (small file problem?) and sizeInBytes
- OPTIMIZE → compact small files into ~1 GB files
- Add partition column to MERGE ON clause → partition pruning
- Z-ORDER on merge key → better data skipping
- Filter source → only MERGE rows that actually changed
- Enable Photon → 3-5x faster MERGE
- Consider Liquid Clustering → automatic, incremental optimization
- Enable autoOptimize for the future
MOCK Q5: "How do you handle data governance and GDPR for passenger data?"
Cover these areas:
- Unity Catalog → centralized governance, 3-level namespace
- Column masking → PII (email, phone) masked for non-authorized users
- Row-level security → each airline partner sees only their own bookings
- ABAC → tag-based policies (new!) — one policy governs all PII tables
- GDPR deletion → DELETE + VACUUM RETAIN 0 HOURS + compliance logging
- Pseudonymization → alternative to full deletion (preserves analytics)
- Audit → Unity Catalog audit logs → Azure Monitor
- Auto-classification → auto-detect PII columns (new 2025)
MOCK Q6: "Explain your CI/CD approach for Databricks pipelines."
Cover:
- Declarative Automation Bundles → YAML config in Git (same config for all environments)
- Azure DevOps pipeline → lint (ruff) → unit test (pytest) → validate bundle → deploy staging → integration test → manual approval → deploy prod
- Three environments: dev, staging, prod (same Unity Catalog metastore)
- Service Principals → production runs under robot accounts, not human accounts
- Testing: unit (pandas), integration (Nutter), quality (Lakeflow expectations)
- Rollback: redeploy previous bundle version from Git
MOCK Q7: "How would you manage costs for a large Databricks deployment?"
Cover all 5 areas (Day 4, Q7):
- Compute: Job Clusters for prod (not All-Purpose!), Serverless for bursty, Spot VMs for workers
- Cluster policies: enforce max size, auto-termination, required tags
- Storage: OPTIMIZE + VACUUM + Predictive Optimization
- Query: Photon + Liquid Clustering (scan less data)
- Monitoring: tags on everything, dashboards, budget alerts, chargeback per team
MOCK Q8: "What is Unity Catalog and how would you structure it for our organization?"
Cover:
- Centralized governance for ALL data assets (tables, models, files)
- Three-level namespace: catalog.schema.table
- Catalog strategy: travel_prod, travel_dev, travel_staging, travel_sandbox
- Schema per domain: bookings, passengers, flights, analytics
- Access: row-level security per airline, column masking for PII, ABAC for scale
- Migration from Hive Metastore using UCX tool
- Delta Sharing for airline partner data access
MOCK Q9: "A pipeline that was fine for 6 months is suddenly 5x slower. Walk me through your debugging."
Show systematic investigation (Day 4, Q11):
- Check data volume — did it spike? (seasonal, backfill?)
- Check Spark UI → Tasks tab → one task much slower than others? → DATA SKEW
- Check table health: DESCRIBE DETAIL → too many small files?
- Check if OPTIMIZE has been running regularly
- Check for upstream schema changes
- Check cluster config changes
- Quick fixes: OPTIMIZE, ANALYZE, enable AQE, increase shuffle partitions
MOCK Q10: "What's new in Databricks in 2025-2026 that excites you?"
Pick 3-4 features you can speak about confidently:
- Lakeflow Declarative Pipelines — DLT rebranded, contributed to open-source Apache Spark. Shows Databricks' commitment to open source.
- ABAC — tag-based governance at scale. One policy instead of 500 GRANTs.
- Serverless Workspaces — instant startup, pay-per-use. Changes how teams get started.
- Predictive Optimization — automatic OPTIMIZE/VACUUM. No more manual maintenance.
- Multi-table Transactions — atomic operations across tables. Solves real consistency problems.
- Lakebase — serverless Postgres for low-latency app serving inside Databricks.
FINAL PREP CHECKLIST
Day 1 Review (Delta Lake)
- Delta transaction log, ACID, checkpoints
- MERGE (syntax, duplicates, 6 optimization techniques, schema evolution)
- OPTIMIZE / VACUUM / Z-ORDER / Liquid Clustering — what each does and when to use
- Time travel recovery (VERSION AS OF, RESTORE, selective MERGE)
- Lakehouse architecture (vs Lake vs Warehouse)
- Delta 4.x features (Variant, Type Widening), Predictive Optimization
Day 2 Review (ETL Pipelines)
- Medallion Architecture with Amadeus examples (Bronze/Silver/Gold decisions)
- SCD Type 2 — merge_key trick in SQL + apply_changes in Lakeflow
- CDC pipeline design (Oracle → Debezium → Kafka → Bronze → Silver → Gold)
- CDF (Change Data Feed) vs CDC — what's the difference?
- Auto Loader (modes, schema evolution, rescue mode, vs COPY INTO)
- Lakeflow Declarative Pipelines (expectations 3 levels, MV vs ST)
Day 3 Review (Platform & Governance)
- Unity Catalog (hierarchy, 6 pillars, 3-level namespace)
- Row-level security + Column masking (write the SQL)
- ABAC (tag-based policies — write the SQL)
- Photon (when to use / NOT use)
- Serverless vs Job Cluster vs All-Purpose
- Azure: 3-plane architecture, Key Vault, Service Principals
- GDPR: deletion pipeline + pseudonymization
- Delta Sharing for airline partners
Day 4 Review (Production & CI/CD)
- Workflows vs Airflow — when to use each
- Declarative Automation Bundles + Azure DevOps CI/CD
- Cost management (5 areas)
- Spark execution plans — read bottom up
- Spark UI debugging methodology
- System design for Amadeus (workspaces, catalogs, compute, governance)
- All 10 mock questions practiced OUT LOUD
INTERVIEW DAY TIPS
- Frame every answer with Amadeus context — "In a travel booking system with 200+ airline partners..."
- Show trade-offs, not just answers — "Z-ORDER is good but Liquid Clustering is better for new tables because it's incremental and automatic..."
- Mention scale — "At Amadeus' scale of billions of daily transactions..."
- Know the new names — DLT → Lakeflow Declarative Pipelines, Asset Bundles → Declarative Automation Bundles, Databricks Assistant → Genie Code
- Be honest about what you don't know — "I haven't used Lakebase in production yet, but I understand it's a serverless Postgres for low-latency serving use cases..."
- Ask clarifying questions — shows you think before jumping to solutions
- Think out loud — senior interviews value your thought process, not just the answer
- Use the STAR method for experience questions — Situation, Task, Action, Result