Databricks · Section 15 of 17

Day 4: Production, CI/CD, Cost Management & Mock Interview

💡 Interview Tip
Time: 5-6 hours | Priority: HIGH — Production experience separates senior from mid-level
Amadeus Context: Azure DevOps, cost governance, 200+ engineers, regulatory compliance
Approach: Every topic starts with a simple explanation → then interview-level depth

SECTION 1: DATABRICKS WORKFLOWS & ORCHESTRATION (1 hour)

Q1: What is Databricks Workflows? How does it compare to Apache Airflow?

Simple Explanation: A workflow is a scheduled pipeline — a series of tasks that run in order. For example: "Every day at 6am, run bronze ingestion → then silver transformation → then gold aggregation."

Databricks Workflows is Databricks' built-in job scheduler. You define tasks (notebooks, SQL, Python scripts), set dependencies (task B runs after task A), and schedule them.

Apache Airflow is a separate open-source tool that does the same thing but works across ANY platform (not just Databricks). It's more flexible but requires more setup and maintenance.

Real-world analogy:

  • Databricks Workflows = Your company's internal task management tool (simple, built-in, works only within your company)
  • Apache Airflow = A universal project management tool (powerful, works everywhere, but you need to install and maintain it yourself)
| Aspect | Databricks Workflows | Apache Airflow |
| --- | --- | --- |
| Setup | Zero — already built into Databricks | Requires deployment, servers, maintenance |
| Task types | Notebook, Python, SQL, Lakeflow, dbt, JAR | Any cloud/tool (Databricks, Snowflake, APIs, anything) |
| Repair/Retry | Built-in — re-run ONLY failed tasks (smart!) | Task retry, but no native "repair" feature |
| Multi-platform | Databricks only — can't orchestrate Snowflake or external APIs | Works across ANY platform (biggest advantage) |
| Cost | Free (included in Databricks) | Separate infrastructure cost |
| Versioning | Declarative Automation Bundles (YAML in Git) | Git-synced Python DAGs |
| Job backfills | Built-in GA (2025) — reprocess historical data | Built-in backfill support |

When to use which:

  • Workflows: When your pipeline is 100% inside Databricks
  • Airflow: When you orchestrate across multiple platforms (Databricks + Oracle + external APIs)
  • Azure Data Factory: Alternative to Airflow for Azure-centric orchestration

Amadeus answer: "We use Databricks Workflows for our Databricks-native ETL pipelines. For cross-platform orchestration (Oracle → Event Hubs → Databricks → Power BI), we use Azure Data Factory."

Q2: How do tasks pass data to each other in a Workflow?

Simple Explanation: In a workflow with multiple tasks (bronze → silver → gold), sometimes Task 2 needs information from Task 1. For example, Task 1 ingests data and counts 1.5 million records. Task 2 needs to know that count to decide how to process the data.

Databricks provides dbutils.jobs.taskValues for this — Task 1 can SET values, and Task 2 can GET those values.

python
# ============================================
# TASK 1: Bronze Ingestion
# ============================================
# After ingesting data, save useful info for downstream tasks
dbutils.jobs.taskValues.set(key="record_count", value=1500000)
# "I ingested 1.5 million records" → save for Task 2

dbutils.jobs.taskValues.set(key="max_booking_date", value="2026-03-15")
# "The latest booking date in this batch is March 15" → save for Task 2

dbutils.jobs.taskValues.set(key="status", value="success")
# "I completed successfully" → save for Task 2


# ============================================
# TASK 2: Silver Transformation (runs AFTER Task 1)
# ============================================
# Read values that Task 1 saved
count = dbutils.jobs.taskValues.get(
    taskKey="bronze_ingestion",      # Name of the upstream task
    key="record_count"               # Which value to get
)
max_date = dbutils.jobs.taskValues.get(
    taskKey="bronze_ingestion",
    key="max_booking_date"
)

# Use the values for conditional logic
if count > 1000000:
    # Large batch → use a different processing strategy
    process_large_batch()
else:
    process_normal_batch()

Q3: What are table-triggered jobs?

Simple Explanation: Normally, jobs run on a fixed schedule (e.g., "every day at 6am"). But what if bronze data arrives at unpredictable times? You'd either run too often (wasting compute) or too rarely (stale data).

Table-triggered jobs (new October 2025) solve this: the job runs AUTOMATICALLY when the source table gets new data. No cron schedule needed.

Real-world analogy: Instead of checking your mailbox every hour, you set up a notification: "Alert me when new mail arrives." That's table-triggered.

yaml
# In Declarative Automation Bundle config (databricks.yml):
resources:
  jobs:
    silver_bookings:
      trigger:
        table:
          condition: ANY_UPDATED          # Run when ANY of these tables change
          table_names:
            - travel_prod.bookings.bronze_bookings
            - travel_prod.bookings.bronze_cancellations
      # When bronze_bookings OR bronze_cancellations gets new data,
      # this job automatically runs to update the silver layer!
      tasks:
        - task_key: transform_silver
          notebook_task:
            notebook_path: ./notebooks/silver_bookings.py

Amadeus use case: "Our Silver layer automatically refreshes whenever new Bronze data lands — no fixed schedule, no unnecessary runs, no stale data."

SECTION 2: CI/CD WITH DECLARATIVE AUTOMATION BUNDLES (1 hour)

Q4: What are Declarative Automation Bundles?

Simple Explanation: Imagine you need to deploy the same Databricks pipeline to 3 environments: dev, staging, and prod. Without CI/CD, you'd manually create jobs, configure clusters, and set schedules in each environment — error-prone and tedious.

Declarative Automation Bundles (formerly called "Asset Bundles", renamed March 2026) let you define your entire Databricks setup as YAML files in Git. Same code, same config, deployed to any environment with one command.

Real-world analogy: Think of it like a recipe book. Instead of cooking from memory (manual setup), you follow the recipe (YAML file). Same recipe, same dish, whether you cook in the dev kitchen or the prod kitchen.

Why "Declarative"? Because you DECLARE what you want (jobs, schedules, clusters) and Databricks creates everything for you. You don't write imperative code saying "create cluster, wait, attach notebook, run..."

yaml
# databricks.yml — THE RECIPE for your entire pipeline setup
bundle:
  name: amadeus_booking_pipeline     # Name of this bundle

workspace:
  host: https://adb-1234567890.azuredatabricks.net  # Default workspace

resources:
  jobs:
    daily_booking_etl:               # Job definition
      name: "Daily Booking ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # Run at 6:00 AM daily
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingestion           # First task
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion.py  # Which notebook to run
          new_cluster:                          # Cluster config for this task
            spark_version: "18.1.x-scala2.13"   # Databricks Runtime version
            node_type_id: "Standard_DS3_v2"     # Azure VM type
            num_workers: 4                       # 4 worker VMs
            spark_conf:
              spark.databricks.photon.enabled: "true"  # Enable Photon for speed

        - task_key: silver_transform           # Second task
          depends_on:
            - task_key: bronze_ingestion       # Runs AFTER bronze completes
          notebook_task:
            notebook_path: ./notebooks/silver_transform.py

        - task_key: gold_aggregate             # Third task
          depends_on:
            - task_key: silver_transform       # Runs AFTER silver completes
          notebook_task:
            notebook_path: ./notebooks/gold_aggregate.py

# ENVIRONMENTS — same pipeline, different workspaces
targets:
  dev:                                          # Development environment
    workspace:
      host: https://dev-adb.azuredatabricks.net
  staging:                                      # Staging/testing environment
    workspace:
      host: https://staging-adb.azuredatabricks.net
  prod:                                         # Production environment
    workspace:
      host: https://prod-adb.azuredatabricks.net
    run_as:
      service_principal_name: "amadeus-prod-sp"  # Prod runs as Service Principal (not a person)
bash
# DEPLOY with one command per environment:
databricks bundle validate --target dev      # Check config is valid
databricks bundle deploy --target dev        # Deploy to dev
databricks bundle deploy --target staging    # Deploy to staging (same code!)
databricks bundle deploy --target prod       # Deploy to production (same code!)
# Same pipeline definition → consistent across ALL environments

Q5: How does CI/CD work with Azure DevOps?

Simple Explanation: CI/CD means:

  • CI (Continuous Integration): Every time code is pushed to Git, automatically run tests and checks
  • CD (Continuous Deployment): After tests pass, automatically deploy to staging/production
yaml
# azure-pipelines.yml — The CI/CD pipeline definition

trigger:
  branches:
    include: [main, develop]    # Run this pipeline when code is pushed to main or develop

pool:
  vmImage: 'ubuntu-latest'     # Run on Ubuntu (Azure-hosted build agent)

stages:
  # ============================================
  # STAGE 1: VALIDATE (run on every push)
  # ============================================
  - stage: Validate
    jobs:
      - job: ValidateAndTest
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: '3.10'    # Use Python 3.10

          - script: |
              pip install databricks-cli ruff pytest
              ruff check src/              # Lint: check code style and errors
              pytest tests/ -v             # Unit tests: test transformation logic
            displayName: 'Lint and Unit Tests'

          - script: |
              databricks bundle validate --target staging
              # Validate: check that the YAML config is correct
            displayName: 'Validate Bundle Config'

  # ============================================
  # STAGE 2: DEPLOY TO STAGING (only on main branch)
  # ============================================
  - stage: DeployStaging
    dependsOn: Validate                  # Only runs after Validate passes
    condition: eq(variables['Build.SourceBranch'], 'refs/heads/main')
    # Only deploy when pushing to main branch (not feature branches)
    jobs:
      - job: DeployToStaging
        steps:
          - script: |
              databricks bundle deploy --target staging
              # Deploy the pipeline to staging environment
              databricks bundle run daily_booking_etl --target staging
              # Run the pipeline on staging data (integration test)
            displayName: 'Deploy and Integration Test on Staging'

  # ============================================
  # STAGE 3: DEPLOY TO PRODUCTION (with manual approval!)
  # ============================================
  - stage: DeployProd
    dependsOn: DeployStaging              # Only runs after staging succeeds
    jobs:
      - deployment: ProductionDeploy
        environment: 'production'         # REQUIRES manual approval by a reviewer!
        # Someone must click "Approve" in Azure DevOps before this runs
        strategy:
          runOnce:
            deploy:
              steps:
                - script: |
                    databricks bundle deploy --target prod
                  displayName: 'Deploy to Production'

Flow: Push code → Auto-lint + test → Auto-deploy staging → Manual approval → Deploy prod

Q6: What testing strategies exist for Databricks pipelines?

Simple Explanation: Testing ensures your pipeline works correctly before deploying to production. There are 4 levels:

1. UNIT TESTS (fastest — run in CI, no Spark needed)
What: Test your transformation logic with small sample data
How: pytest with pandas DataFrames (no Spark cluster needed!)
Example: "Does my dedup function keep only the latest record?"
Where: Runs in Azure DevOps build agent (free, fast)
New: pytest now integrated in Databricks workspace (Feb 2026)
2. INTEGRATION TESTS (medium — run on staging cluster)
What: Run the full pipeline end-to-end on sample data
How: Deploy to staging → run → check output matches expected results
Example: "Does Bronze → Silver → Gold produce the right output?"
Tool: Nutter framework (Databricks' own testing framework)
Check: Schema correct? Row counts match? Checksums match?
3. DATA QUALITY TESTS (continuous — run as part of pipeline)
What: Quality checks built into the pipeline itself
How: Lakeflow expectations (@dlt.expect, @dlt.expect_or_fail)
Example: "Are there any null booking_ids? Any negative fares?"
New: Quality monitoring anomaly detection (auto-alerts, Feb 2026)
4. PERFORMANCE TESTS (periodic — run before major releases)
What: Benchmark pipeline speed on production-scale data
How: Run on staging with a copy of production data
Check: Is the pipeline >20% slower than the previous version?
Alert: If performance regresses, block the release
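A minimal sketch of level 1, assuming a hypothetical `dedup_latest` helper that keeps the newest record per `booking_id` — the point is that the logic runs under pytest with plain pandas, no Spark cluster needed:

```python
import pandas as pd

def dedup_latest(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the most recent row per booking_id, ordered by updated_at
    return (
        df.sort_values("updated_at")
          .drop_duplicates(subset="booking_id", keep="last")
          .reset_index(drop=True)
    )

def test_dedup_keeps_latest():
    src = pd.DataFrame({
        "booking_id": [1, 1, 2],
        "fare":       [100, 120, 80],
        "updated_at": ["2026-03-01", "2026-03-02", "2026-03-01"],
    })
    out = dedup_latest(src)
    assert len(out) == 2                                       # one row per booking
    assert out.loc[out.booking_id == 1, "fare"].item() == 120  # latest fare kept
```

In CI this runs on the Azure DevOps build agent in seconds; the same transformation function can then be applied to a Spark DataFrame via pandas-compatible logic or a thin wrapper.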

SECTION 3: COST MANAGEMENT (45 min)

Q7: How do you manage costs for a large Databricks deployment?

Simple Explanation: Databricks can get expensive fast — especially with 200+ engineers and hundreds of pipelines. Cost management is about: using the right compute for each workload, preventing waste, and tracking who spends what.

5 areas of cost management:

1. COMPUTE OPTIMIZATION (biggest cost driver!)
Production jobs → Job Clusters (NOT All-Purpose — saves 30-50%)
│ Job Clusters are created for the job and destroyed after.
│ All-Purpose stay running and cost money even when idle.
BI queries → Serverless SQL Warehouses
│ Auto-suspend when idle (zero cost at night).
│ Auto-scale with query load.
Bursty/intermittent workloads → Serverless compute
│ Pay ONLY for execution time — up to 70% savings.
Long-running streaming → Standard clusters + Spot VMs
│ Use Spot VMs for workers (up to 90% cheaper! See Q8 below).
Instance Pools → Shared pool of warm VMs
│ Clusters start faster (VMs already running in the pool).
Right-size → Check actual utilization
If your cluster uses only 20% CPU, downsize it!
2. CLUSTER POLICIES (enforce rules so people can't waste money)
Max nodes: "No cluster can have more than 20 workers" (prevents runaway costs)
Allowed VM sizes: "Only Standard_DS3 to DS5" (no accidentally picking expensive GPUs)
Auto-termination: "Must auto-stop after 60 min idle" (no overnight idle clusters)
Required tags: "Must tag with team + project" (for cost tracking)
Photon required for > 4 workers (more efficient = less cost)
Block All-Purpose in prod workspace (force Job Clusters)
3. STORAGE OPTIMIZATION
VACUUM regularly → delete old files → free ADLS storage
OPTIMIZE → fewer files → less I/O → less compute cost for reads
Predictive Optimization → does both automatically (UC managed tables)
Lifecycle policies → archive raw Bronze data after 90 days
Zstd compression → default since Runtime 16.0 (smaller files)
4. QUERY OPTIMIZATION (process less data = less cost)
Photon → more queries per DBU (faster = less time = less cost)
Liquid Clustering → queries scan less data (skips irrelevant files)
SQL Warehouse auto-suspend → zero cost when not queried
Cache hot Gold tables → avoid re-reading from storage
5. MONITORING & CHARGEBACK (track and allocate costs)
Tag everything: clusters, jobs, warehouses → team, project, cost_center
Azure usage dashboards (new Mar 2026) → see costs per team
Query tags for SQL warehouses (new Feb 2026) → see costs per query group
Budget alerts → "Alert when team X exceeds $5000/month"
Weekly review → find unused clusters, oversized VMs
Chargeback → each team/airline partner pays for their usage
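Area 2 can be made concrete with a cluster policy definition. Cluster policies are written in the Databricks policy JSON schema (`range`, `allowlist`, and similar attribute types); the specific limits and tag names below are illustrative, not a recommendation:

```json
{
  "num_workers": { "type": "range", "maxValue": 20 },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS3_v2", "Standard_DS4_v2", "Standard_DS5_v2"]
  },
  "autotermination_minutes": { "type": "range", "maxValue": 60, "defaultValue": 30 },
  "custom_tags.team": { "type": "unlimited", "isOptional": false },
  "custom_tags.project": { "type": "unlimited", "isOptional": false }
}
```

Any cluster created under this policy cannot exceed 20 workers or pick a VM outside the allowlist, must auto-terminate within 60 minutes of idling, and must carry `team` and `project` tags for chargeback.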

Q8: What are Spot VMs? How do they save money?

Simple Explanation: Azure has spare VMs that nobody is using at the moment. Instead of letting them sit idle, Azure offers them at a huge discount (up to 90% off) — these are called Spot VMs (or Low-Priority VMs). The catch: Azure can take them back with 30 seconds notice if someone else needs them.

Databricks handles this gracefully — if a Spot worker is reclaimed, Spark retries the task on another worker. Your job doesn't fail.

On-Demand (Standard) VMs:
Guaranteed — won't be taken away
Full price
Use for: driver node (must be reliable), critical streaming jobs
Spot (Low-Priority) VMs:
Up to 90% cheaper!
Can be evicted with 30-second notice
Databricks auto-retries tasks on remaining workers
Use for: worker nodes, batch processing, development
Best practice for production:
Driver: ALWAYS On-Demand (driver failing = entire job fails)
Workers: Mix of On-Demand (minimum) + Spot (auto-scale extra)
Example: min_workers=2 (on-demand), max_workers=20 (spot)
→ Guaranteed baseline of 2 workers + up to 18 cheap spot workers

Amadeus answer: "Our batch ETL jobs use Spot VMs for 80% of workers, saving 60-70% on compute costs. The driver is always on-demand to ensure job reliability."
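In a bundle's cluster definition, this on-demand/spot mix is expressed through `azure_attributes`. A sketch — VM size, worker counts, and runtime version are illustrative:

```yaml
new_cluster:
  spark_version: "18.1.x-scala2.13"
  node_type_id: "Standard_DS3_v2"
  autoscale:
    min_workers: 2                 # guaranteed baseline
    max_workers: 20
  azure_attributes:
    first_on_demand: 3             # driver + first 2 workers on reliable on-demand VMs
    availability: SPOT_WITH_FALLBACK_AZURE   # extra workers on spot; fall back to on-demand if unavailable
    spot_bid_max_price: -1         # -1 = pay up to the current on-demand price
```

`first_on_demand` counts the driver, so this config keeps the driver and a small worker baseline safe while the auto-scaled capacity rides the spot market.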

SECTION 4: PRODUCTION DEBUGGING (45 min)

Q9: How do you read a Spark execution plan?

Simple Explanation: When you run a Spark query, Spark creates a plan — a step-by-step recipe for how to process the data. Reading this plan tells you: Is the query efficient? Are there unnecessary shuffles? Is Spark using the right join strategy?

Key rule: Read execution plans BOTTOM UP. The bottom is where data reading starts, and the top is the final result.

python
# Example query: Total fare by airport for a specific date
from pyspark.sql.functions import col, sum

df = (
    spark.table("bookings")
    .filter(col("booking_date") == "2026-03-15")           # Filter by date
    .join(spark.table("passengers"), "passenger_id")       # Join with passengers
    .groupBy("departure_airport").agg(sum("fare_amount"))  # Sum fares by airport
)

df.explain(True)  # Show the execution plan
== Physical Plan == (Read BOTTOM UP ↑)
*(3) HashAggregate(keys=[departure_airport], functions=[sum(fare_amount)])
│ ↑ STEP 5: Final aggregation — compute the final SUM per airport
+- Exchange hashpartitioning(departure_airport, 200)
│ ↑ STEP 4: SHUFFLE — redistribute data by airport (expensive! stage boundary)
+- *(2) HashAggregate(keys=[departure_airport], partial_sum)
│ ↑ STEP 3: Partial aggregation — each worker computes a local SUM first
│ (this reduces data before the shuffle — smart optimization)
+- *(2) Project [departure_airport, fare_amount]
│ ↑ STEP 2: Column pruning — keep only the columns we need
+- *(2) BroadcastHashJoin [passenger_id]
│ ↑ STEP 1b: JOIN — passengers table was BROADCAST (sent to all workers)
│ BroadcastHashJoin = GOOD (no shuffle needed for this join)
│ SortMergeJoin would be BAD (requires shuffle on both sides)
:- *(2) Filter (booking_date = 2026-03-15)
: +- FileScan delta PushedFilters: [EqualTo(booking_date)]
: ↑ STEP 1a: Read bookings table with PREDICATE PUSHDOWN
: PushedFilters means the filter was pushed into file reading
: Delta skips files that don't contain March 15 data
+- BroadcastExchange
+- FileScan delta [passenger_id, departure_airport]
↑ STEP 0: Read passengers table (broadcast to all workers)

What to look for (red flags vs green flags):

| Element | Good or Bad? | What It Means |
| --- | --- | --- |
| BroadcastHashJoin | ✅ Good | Small table sent to all workers — no shuffle needed |
| SortMergeJoin | ⚠️ Check | Both tables shuffled — can broadcast instead if one side is small? |
| Exchange | ⚠️ Expensive | SHUFFLE = data moving between workers. Fewer = better |
| PushedFilters | ✅ Good | Filter pushed to file level — Delta skips irrelevant files |
| No PushedFilters | ❌ Bad | Filter NOT pushed — reading more data than needed |
| WholeStageCodegen (*) | ✅ Good | Spark compiled the query to fast native code |
| Missing WholeStageCodegen | ⚠️ Check | May be using Python UDFs that break codegen |
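If the plan shows a SortMergeJoin where one side is clearly small, you can nudge Spark with a broadcast hint. A sketch using the table names from the running example:

```sql
-- Force the small passengers table (aliased p) to be broadcast to every worker
SELECT /*+ BROADCAST(p) */
       b.departure_airport,
       SUM(b.fare_amount) AS total_fare
FROM bookings b
JOIN passengers p
  ON b.passenger_id = p.passenger_id
WHERE b.booking_date = '2026-03-15'
GROUP BY b.departure_airport
```

Re-run `df.explain(True)` afterwards and confirm the plan now shows BroadcastHashJoin instead of SortMergeJoin.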

Q10: How do you debug slow jobs using Spark UI?

Simple Explanation: Spark UI is a web interface that shows you exactly what happened during your job — how long each task took, how much data was shuffled, whether any tasks were much slower than others (data skew).

SPARK UI DEBUGGING CHECKLIST:
(Follow this step-by-step when a job is slow)
1. JOBS TAB → Which job is slow?
Find the job with the longest duration
Click on it to see its stages
2. STAGES TAB → Find the bottleneck stage
Sort by Duration → which stage takes the longest?
Check Shuffle Read/Write Size → large shuffle = expensive
Check Spill (Memory) → if > 0, workers ran out of memory
and had to write to disk (very slow!)
3. TASKS TAB → Find slow tasks (within the bottleneck stage)
Sort by Duration → any outliers?
Max Task Duration vs Median Duration:
│ If max = 40 minutes, median = 30 seconds → DATA SKEW!
│ One task is processing way more data than others
Shuffle Read per task → should be roughly EQUAL
│ If one task reads 50 GB and others read 100 MB → SKEW
GC Time → if > 10% of task time → memory pressure (need more RAM)
4. SQL TAB → Check the execution plan
Is the join strategy correct? (Broadcast vs SortMerge)
Rows output at each node → data amplification?
Time per operator → which step is slowest?
5. EXECUTORS TAB → Check worker health
Any dead executors? (might have OOM'd)
Task count per executor → balanced?
Memory usage → anyone maxed out?

Interview tip: Walk through this checklist when asked "How would you debug a slow job?" — it shows systematic thinking, not guessing.

Q11: Scenario — Pipeline suddenly 5x slower, no code change. What happened?

Simple Explanation: This is one of the most common interview scenarios. The key: DON'T just say "run OPTIMIZE". Show a systematic investigation process.

INVESTIGATION CHECKLIST (go through these one by one):
1. DATA VOLUME — Did the source data suddenly get bigger?
Holiday season? Black Friday? End of month?
Check: Compare input sizes today vs last week
Fix: If data doubled, you may need more workers or partitions
2. DATA SKEW — Is one partition much bigger than others?
New popular flight route? One airline sent 10x more data?
Check: Spark UI → Tasks tab → one task taking 100x longer?
Fix: Salting, broadcast join, AQE skew join handling
3. SMALL FILES ACCUMULATED — 6 months without OPTIMIZE?
Streaming writes create many small files over time
Check: DESCRIBE DETAIL my_table → numFiles (if > 10,000 = problem!)
Fix: OPTIMIZE my_table; + enable auto-compaction
4. STALE STATISTICS — Spark using wrong join strategy?
Table grew past broadcast threshold → Spark switched to slow SortMergeJoin
Check: df.explain(True) → is it using SortMergeJoin when BroadcastHashJoin is better?
Fix: ANALYZE TABLE my_table COMPUTE STATISTICS
5. CLUSTER CHANGES — Did someone change the cluster config?
Fewer workers? Different VM type? Autoscale range changed?
Check: Cluster config history in workspace
6. INFRASTRUCTURE — Cloud platform issues?
ADLS Gen2 throttling? Network latency? Azure outage?
Check: Azure Monitor for storage latency spikes
7. UPSTREAM CHANGES — Did the source data structure change?
New columns causing schema evolution? Different data distribution?
Check: Compare source data schema and patterns
8. DELTA TABLE HEALTH:
DESCRIBE DETAIL bookings;
-- numFiles: if > 10,000 → small file problem → OPTIMIZE
-- sizeInBytes: unexpected growth? → data volume spike
QUICK FIXES (after diagnosing the root cause):
OPTIMIZE all affected tables (compacts small files)
ANALYZE TABLE bookings COMPUTE STATISTICS (refresh stats)
Increase spark.sql.shuffle.partitions (if data grew)
Enable AQE if not already (handles skew and partitions automatically)
Consider Liquid Clustering (for long-term health)

Q12: Common production issues — quick reference

| Issue | What You See | Root Cause | How to Fix |
| --- | --- | --- | --- |
| ConcurrentModificationException | Two pipelines write to same table at same time | Concurrent writes touch same files | Partition writes by time, or use WAP (Write-Audit-Publish) pattern |
| Pipeline silently produces wrong data | Reports show wrong numbers | No data quality checks | Add Lakeflow expectations at every layer |
| Costs spike on weekends | Bill jumps 2x on Saturday/Sunday | Developers leave clusters running | Enforce auto-termination in cluster policies |
| Query suddenly slow | Query that took 30 sec now takes 10 min | Table statistics are outdated | ANALYZE TABLE bookings COMPUTE STATISTICS |
| Streaming job keeps restarting | Job restarts every few minutes | Checkpoint/state store corrupted | Delete checkpoint, reprocess from source |
| SchemaEvolutionException | Pipeline fails with schema error | Source added new columns | Use Auto Loader rescue mode |
| FileNotFoundException | Query fails with "file not found" | VACUUM deleted a file while query was reading it | Increase VACUUM retention period (default 7 days) |
| Job works in dev, fails in prod | Same code, different results | Different cluster config/runtime | Use Declarative Automation Bundles (same config everywhere) |

SECTION 5: SYSTEM DESIGN (45 min)

Q13: Design a data platform for Amadeus on Azure Databricks

Simple Explanation: This is a "big picture" architecture question. They want to see: Can you design an entire data platform? Think about workspaces, catalogs, compute, governance, CI/CD, cost, and monitoring.

============================================================
AMADEUS DATA PLATFORM — AZURE DATABRICKS ARCHITECTURE
============================================================

WORKSPACE STRATEGY (separate environments):
travel-dev → Development & experimentation
│ All engineers have full access. Sandbox for testing.
travel-staging → Testing & validation
│ Integration tests run here before promoting to prod.
travel-prod → Production workloads
│ Only Service Principals write. Humans have read-only access.
travel-analytics → Ad-hoc queries & BI
│ Business analysts connect Power BI here.
ALL workspaces → Connected to the SAME Unity Catalog metastore
So table definitions, permissions, and lineage are shared.
CATALOG STRUCTURE (how data is organized in Unity Catalog):
travel_raw → Bronze layer (all domains)
│ Raw data from all sources, append-only
travel_curated → Silver layer (organized by domain)
bookings schema → clean booking facts + dimensions
passengers schema → passenger SCD Type 2 dimensions
flights schema → flight schedules, routes
travel_analytics → Gold layer (cross-domain aggregations)
│ Daily revenue, route performance, passenger 360
travel_ml → ML assets
│ Feature tables, registered models, experiments
travel_sandbox → Developer experimentation
Auto-cleaned after 30 days. No production data.
COMPUTE STRATEGY (right compute for each workload):
File ingestion: Serverless (auto-scales with file volume)
Silver MERGE: Job Clusters + Photon (heavy transformations)
Gold aggregations: Serverless SQL Warehouse (BI queries)
ML training: GPU clusters (separate from ETL)
Development: Serverless notebooks (instant start, no idle cost)
24/7 streaming: Standard clusters + Spot workers (cost-efficient)
DATA GOVERNANCE (who can access what):
Unity Catalog → centralized access control
ABAC → tag PII tables → one policy governs all
Row-level security → each airline sees only their own data
Column masking → passenger PII hidden from non-authorized teams
GDPR deletion pipeline → automated with compliance logging
Auto-classification → auto-detect PII columns
Delta Sharing → share data with airline partners (no copying)
CI/CD (how code gets to production):
Azure DevOps pipelines → automate lint, test, deploy
Declarative Automation Bundles → YAML configs in Git
Environment promotion: dev → staging → prod
Testing: unit (pytest) → integration (Nutter) → quality (expectations)
Service Principals for prod → no human credentials in production
COST GOVERNANCE (prevent waste):
Cluster policies → enforce max size, auto-termination, required tags
Tags → team, project, airline_partner, cost_center
Serverless for bursty → Job Clusters for scheduled → Spot for batch
Azure cost dashboards + budget alerts per team
Monthly review → unused clusters, oversized VMs
MONITORING (is everything working?):
Lakeflow expectations → data quality at every layer
Quality monitoring anomaly detection → auto-alerts
Pipeline SLAs: Bronze < 5 min, Silver < 30 min, Gold < 1 hour
Azure Monitor → infrastructure health
PagerDuty → critical failure alerts
Weekly data health review → team syncs on data quality

SECTION 6: MOCK INTERVIEW — 10 Most Likely Amadeus Questions (1 hour)

Practice answering each in 3-5 minutes OUT LOUD. Time yourself.

MOCK Q1: "Tell me about your experience with Databricks and Delta Lake."

How to answer (structure):

  1. Start with your role and the scale you work at
  2. Mention specific Databricks features you've used
  3. Share ONE challenge you solved (with results)
  4. Keep it to 2-3 minutes

Example: "In my current role, I work with Delta Lake tables processing X million records daily. I've implemented Medallion architecture using Lakeflow Pipelines with quality expectations at every layer. A key challenge was optimizing a MERGE operation on our fact table that took 3 hours — by adding partition pruning, OPTIMIZE with Z-ORDER, and switching to Photon, we brought it down to 40 minutes."

MOCK Q2: "Design a CDC pipeline from Oracle to Delta Lake for our booking system."

Key points to cover (in order):

  1. Source: Debezium Oracle connector reads redo logs (no Oracle performance impact)
  2. Message broker: Azure Event Hubs (Kafka-compatible, managed service)
  3. Bronze: Auto Loader ingests from Event Hubs, appends raw CDC events as-is
  4. Silver: MERGE for fact tables (SCD Type 1), apply_changes for dimensions (SCD Type 2)
  5. Gold: Materialized views for business aggregations
  6. Quality: Lakeflow expectations at every layer
  7. Governance: Unity Catalog from day 1, PII masking on passenger data
  8. GDPR: Deletion pipeline for "right to be forgotten"

MOCK Q3: "How would you implement SCD Type 2 for our passenger dimension?"

Cover BOTH approaches:

  1. SQL MERGE with merge_key trick — show you can write the code manually (Day 2, Q5)
  2. Lakeflow apply_changes — show you know the modern, simpler approach (Day 2, Q6)
  3. Mention hash-based change detection (MD5 of tracked columns)
  4. Mention deduplication of source before MERGE
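Point 3 (hash-based change detection) can be sketched in plain Python — the tracked column names below are illustrative:

```python
import hashlib

# Columns whose changes should trigger a new SCD Type 2 version
TRACKED = ["email", "phone", "loyalty_tier"]

def row_hash(row: dict) -> str:
    # MD5 over the tracked columns in a fixed order; str() keeps it NULL-safe
    payload = "||".join(str(row.get(col)) for col in TRACKED)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

current  = {"passenger_id": 42, "email": "a@ex.com", "phone": "123", "loyalty_tier": "gold"}
incoming = {"passenger_id": 42, "email": "a@ex.com", "phone": "123", "loyalty_tier": "platinum"}

# Hashes differ → expire the current dimension row and insert a new version
needs_new_version = row_hash(current) != row_hash(incoming)
```

In Spark the same idea is typically expressed with `md5(concat_ws(...))` over the tracked columns, so the MERGE condition compares one hash instead of every column.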

MOCK Q4: "Our booking fact table MERGE takes 3 hours. How would you optimize?"

Systematic approach (don't jump to solutions — investigate first):

  1. DESCRIBE DETAIL → Check numFiles (small file problem?) and sizeInBytes
  2. OPTIMIZE → Compact small files into ~1 GB files
  3. Add partition column to MERGE ON clause → partition pruning
  4. Z-ORDER on merge key → better data skipping
  5. Filter source → only MERGE rows that actually changed
  6. Enable Photon → 3-5x faster MERGE
  7. Consider Liquid Clustering → automatic, incremental optimization
  8. Enable autoOptimize for the future
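Points 3 and 5 combined, as a SQL sketch — table names, the partition column, and the 7-day window are illustrative:

```sql
MERGE INTO gold.fact_bookings AS t
USING (
    -- Point 5: feed only rows that actually changed, not the full source
    SELECT * FROM silver.booking_changes
    WHERE change_date >= current_date() - INTERVAL 7 DAYS
) AS s
-- Point 3: partition column in the ON clause enables partition pruning
ON  t.booking_date = s.booking_date
AND t.booking_id   = s.booking_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

Because the ON clause constrains `booking_date`, Delta only rewrites files in the touched partitions instead of scanning the whole 3-hour fact table.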

MOCK Q5: "How do you handle data governance and GDPR for passenger data?"

Cover these areas:

  1. Unity Catalog → centralized governance, 3-level namespace
  2. Column masking → PII (email, phone) masked for non-authorized users
  3. Row-level security → each airline partner sees only their own bookings
  4. ABAC → tag-based policies (new!) — one policy governs all PII tables
  5. GDPR deletion → DELETE + VACUUM RETAIN 0 HOURS + compliance logging
  6. Pseudonymization → alternative to full deletion (preserves analytics)
  7. Audit → Unity Catalog audit logs → Azure Monitor
  8. Auto-classification → auto-detect PII columns (new 2025)

MOCK Q6: "Explain your CI/CD approach for Databricks pipelines."

Cover:

  1. Declarative Automation Bundles → YAML config in Git (same config for all environments)
  2. Azure DevOps pipeline → lint (ruff) → unit test (pytest) → validate bundle → deploy staging → integration test → manual approval → deploy prod
  3. Three environments: dev, staging, prod (same Unity Catalog metastore)
  4. Service Principals → production runs under robot accounts, not human accounts
  5. Testing: unit (pandas), integration (Nutter), quality (Lakeflow expectations)
  6. Rollback: redeploy previous bundle version from Git

MOCK Q7: "How would you manage costs for a large Databricks deployment?"

Cover all 5 areas (Day 4, Q7):

  1. Compute: Job Clusters for prod (not All-Purpose!), Serverless for bursty, Spot VMs for workers
  2. Cluster policies: enforce max size, auto-termination, required tags
  3. Storage: OPTIMIZE + VACUUM + Predictive Optimization
  4. Query: Photon + Liquid Clustering (scan less data)
  5. Monitoring: tags on everything, dashboards, budget alerts, chargeback per team
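For area 5, the system billing tables make chargeback a simple query. A sketch — the 'team' tag key is an assumption (whatever your cluster policies enforce):

```sql
-- DBU consumption per team over the last 30 days (chargeback view)
SELECT
  custom_tags['team']  AS team,
  sku_name,
  SUM(usage_quantity)  AS dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY ALL
ORDER BY dbus DESC;
```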

MOCK Q8: "What is Unity Catalog and how would you structure it for our organization?"

Cover:

  1. Centralized governance for ALL data assets (tables, models, files)
  2. Three-level namespace: catalog.schema.table
  3. Catalog strategy: travel_prod, travel_dev, travel_staging, travel_sandbox
  4. Schema per domain: bookings, passengers, flights, analytics
  5. Access: row-level security per airline, column masking for PII, ABAC for scale
  6. Migration from Hive Metastore using UCX tool
  7. Delta Sharing for airline partner data access
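Points 3-5 above can be demonstrated in a few statements. All names (travel_prod, bookings, carrier_code, the group names) are illustrative assumptions:

```sql
-- 3) + 4) Catalog per environment, schema per domain
CREATE CATALOG IF NOT EXISTS travel_prod;
CREATE SCHEMA  IF NOT EXISTS travel_prod.bookings;

-- Least-privilege grants
GRANT USE CATALOG ON CATALOG travel_prod TO `data_engineers`;
GRANT SELECT ON SCHEMA travel_prod.bookings TO `analysts`;

-- 5) Row-level security: each airline partner group sees only its own rows
CREATE OR REPLACE FUNCTION airline_filter(carrier STRING)
RETURNS BOOLEAN
RETURN is_account_group_member(concat('airline_', carrier));

ALTER TABLE travel_prod.bookings.fact_bookings
  SET ROW FILTER airline_filter ON (carrier_code);
```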

MOCK Q9: "A pipeline that was fine for 6 months is suddenly 5x slower. Walk me through your debugging."

Show systematic investigation (Day 4, Q11):

  1. Check data volume — did it spike? (seasonal, backfill?)
  2. Check Spark UI → Tasks tab → one task much slower than others? → DATA SKEW
  3. Check table health: DESCRIBE DETAIL → too many small files?
  4. Check if OPTIMIZE has been running regularly
  5. Check for upstream schema changes
  6. Check cluster config changes
  7. Quick fixes: OPTIMIZE, ANALYZE, enable AQE, increase shuffle partitions
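Steps 3, 4, and 7 above boil down to a short health-check script. Table name is an assumption:

```sql
-- 3) Table health: many small files is the most common silent regression
DESCRIBE DETAIL gold_daily_bookings;    -- numFiles vs sizeInBytes

-- 4) Has OPTIMIZE been running? Scan recent operations
DESCRIBE HISTORY gold_daily_bookings;   -- look for OPTIMIZE entries

-- 7) Quick remediation
OPTIMIZE gold_daily_bookings;
ANALYZE TABLE gold_daily_bookings COMPUTE STATISTICS;
SET spark.sql.adaptive.enabled = true;  -- AQE (default on in recent runtimes)
```

If DESCRIBE HISTORY shows no OPTIMIZE for months while numFiles climbed into the tens of thousands, you've usually found your 5x.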

MOCK Q10: "What's new in Databricks in 2025-2026 that excites you?"

Pick 3-4 features you can speak about confidently:

  1. Lakeflow Declarative Pipelines — DLT rebranded, contributed to open-source Apache Spark. Shows Databricks' commitment to open source.
  2. ABAC — tag-based governance at scale. One policy instead of 500 GRANTs.
  3. Serverless Workspaces — instant startup, pay-per-use. Changes how teams get started.
  4. Predictive Optimization — automatic OPTIMIZE/VACUUM. No more manual maintenance.
  5. Multi-table Transactions — atomic operations across tables. Solves real consistency problems.
  6. Lakebase — serverless Postgres for low-latency app serving inside Databricks.

FINAL PREP CHECKLIST

Day 1 Review (Delta Lake)

  • Delta transaction log, ACID, checkpoints
  • MERGE (syntax, duplicates, 6 optimization techniques, schema evolution)
  • OPTIMIZE / VACUUM / Z-ORDER / Liquid Clustering — what each does and when to use
  • Time travel recovery (VERSION AS OF, RESTORE, selective MERGE)
  • Lakehouse architecture (vs Lake vs Warehouse)
  • Delta 4.x features (Variant, Type Widening), Predictive Optimization

Day 2 Review (ETL Pipelines)

  • Medallion Architecture with Amadeus examples (Bronze/Silver/Gold decisions)
  • SCD Type 2 — merge_key trick in SQL + apply_changes in Lakeflow
  • CDC pipeline design (Oracle → Debezium → Kafka → Bronze → Silver → Gold)
  • CDF (Change Data Feed) vs CDC — what's the difference?
  • Auto Loader (modes, schema evolution, rescue mode, vs COPY INTO)
  • Lakeflow Declarative Pipelines (expectations 3 levels, MV vs ST)

Day 3 Review (Platform & Governance)

  • Unity Catalog (hierarchy, 6 pillars, 3-level namespace)
  • Row-level security + Column masking (write the SQL)
  • ABAC (tag-based policies — write the SQL)
  • Photon (when to use / NOT use)
  • Serverless vs Job Cluster vs All-Purpose
  • Azure: 3-plane architecture, Key Vault, Service Principals
  • GDPR: deletion pipeline + pseudonymization
  • Delta Sharing for airline partners

Day 4 Review (Production & CI/CD)

  • Workflows vs Airflow — when to use each
  • Declarative Automation Bundles + Azure DevOps CI/CD
  • Cost management (5 areas)
  • Spark execution plans — read bottom up
  • Spark UI debugging methodology
  • System design for Amadeus (workspaces, catalogs, compute, governance)
  • All 10 mock questions practiced OUT LOUD

INTERVIEW DAY TIPS

  1. Frame every answer with Amadeus context — "In a travel booking system with 200+ airline partners..."
  2. Show trade-offs, not just answers — "Z-ORDER is good but Liquid Clustering is better for new tables because it's incremental and automatic..."
  3. Mention scale — "At Amadeus' scale of billions of daily transactions..."
  4. Know the new names — DLT → Lakeflow Declarative Pipelines, Asset Bundles → Declarative Automation Bundles, Databricks Assistant → Genie Code
  5. Be honest about what you don't know — "I haven't used Lakebase in production yet, but I understand it's a serverless Postgres for low-latency serving use cases..."
  6. Ask clarifying questions — shows you think before jumping to solutions
  7. Think out loud — senior interviews value your thought process, not just the answer
  8. Use the STAR method for experience questions — Situation, Task, Action, Result