🧱 Databricks · Section 13 of 17

Day 3: Azure Databricks Platform & Governance

💡 Interview Tip
Time: 6-7 hours | Priority: HIGH — Platform knowledge differentiates senior from mid-level
Amadeus Context: Azure cloud, GDPR compliance, multi-tenant data sharing, PII governance
Approach: Every topic starts with a simple explanation → then interview-level depth

SECTION 1: UNITY CATALOG (1.5 hours)

Q1: What is Unity Catalog? What is the object hierarchy?

Simple Explanation: In Databricks, you have hundreds of tables, ML models, files, and functions. Unity Catalog is the single place that manages ALL of them — who can access what, where data came from, how it's organized, and what changed.

Think of Unity Catalog as the security guard + librarian + receptionist of your entire data platform:

  • Security guard: Controls who can access which tables (access control)
  • Librarian: Organizes all data into catalogs/schemas so you can find it easily (discovery)
  • Receptionist: Keeps a log of who accessed what and when (auditing)

Why do we need it? Without Unity Catalog: Each team creates tables in random locations, no one knows who has access to what, PII data leaks because there's no control, and you can't trace where data came from. With Unity Catalog: One central place to govern everything — tables, ML models, files, permissions.

The hierarchy (how data is organized):

🗂️ Metastore (top level — ONE per Azure region)
│  Think of this as: "The entire data universe for your region"
└─ Catalog (logical grouping — like separate departments)
   │  Examples: travel_prod, travel_dev, travel_staging
   │  Think of this as: "A database server" or "an environment"
   └─ Schema (like a database inside the catalog)
      │  Examples: bookings, passengers, flights, analytics
      │  Think of this as: "A database" or "a topic area"
      ├─ Table (your data — managed or external)
      ├─ View (a saved query on top of tables)
      ├─ Function (reusable SQL/Python functions)
      ├─ Model (ML models registered here)
      └─ Volume (unstructured files — CSVs, images, PDFs)

Three-level namespace — how you reference any table:

sql
SELECT * FROM travel_prod.bookings.fact_flights;
--           ^^^^^^^^^^^  ^^^^^^^^  ^^^^^^^^^^^^
--           catalog      schema    table
-- Read as: "In the travel_prod catalog, bookings schema, fact_flights table"

Why 3 levels? Before Unity Catalog, Databricks used Hive Metastore with only 2 levels (database.table). With 3 levels, you can separate environments (prod/dev), domains (bookings/flights), AND tables — much better organization.

Six pillars of Unity Catalog (what it does):

  1. Access Control — GRANT/REVOKE permissions at any level (catalog, schema, table, column)
  2. Discovery — Search, browse, preview all data assets in one place
  3. Lineage — Automatically tracks which table feeds into which table, column by column
  4. Auditing — Logs every access: who read what table, when, from which notebook
  5. Quality Monitoring — Detects anomalies in data quality automatically (new 2026)
  6. Data Sharing — Share data with external partners via Delta Sharing protocol

Interview tip: When asked about Unity Catalog, don't just describe the hierarchy. Mention the 6 pillars — it shows you understand it's not just a "table organizer" but a full governance platform.

Q2: What is row-level and column-level security?

Simple Explanation:

Column-level security (column masking): Different users see different values for the SAME column. For example, the HR team sees full email addresses, but the analytics team sees masked emails (kri***@gmail.com).

Row-level security (row filtering): Different users see different ROWS from the same table. For example, the Lufthansa team sees only Lufthansa bookings, the Air India team sees only Air India bookings.

Why do we need this at Amadeus? Amadeus has 200+ airline partners. Each airline should ONLY see their own bookings. And passenger PII (email, phone) should be hidden from teams that don't need it (GDPR compliance).

sql
-- ============================================
-- COLUMN MASKING: Hide passenger email based on user's team
-- ============================================

-- Step 1: Create a masking function
-- This function decides what each user sees for the "email" column
CREATE FUNCTION mask_email(email STRING)
RETURNS STRING
RETURN CASE
    WHEN is_member('gdpr_admins') THEN email
    -- GDPR admins see the full email: "krishna@gmail.com"

    WHEN is_member('booking_agents') THEN CONCAT(LEFT(email, 3), '***@', SPLIT(email, '@')[1])
    -- Booking agents see partial: "kri***@gmail.com"

    ELSE '***@***.***'
    -- Everyone else sees fully masked: "***@***.***"
END;

-- Step 2: Apply the mask to the column
ALTER TABLE passengers ALTER COLUMN email SET MASK mask_email;
-- Now every query on passengers.email automatically applies the mask!
-- No code changes needed in any notebook or report.


-- ============================================
-- ROW-LEVEL SECURITY: Airlines see only their own bookings
-- ============================================

-- Step 1: Create a filter function
-- This function decides which ROWS each user can see
CREATE FUNCTION airline_access(airline_code STRING)
RETURNS BOOLEAN
RETURN CASE
    WHEN is_member('amadeus_admins') THEN TRUE
    -- Admins see ALL rows (all airlines)

    WHEN is_member('lufthansa_team') THEN airline_code = 'LH'
    -- Lufthansa team sees only rows where airline_code = 'LH'

    WHEN is_member('air_india_team') THEN airline_code = 'AI'
    -- Air India team sees only rows where airline_code = 'AI'

    ELSE FALSE
    -- Everyone else sees NO rows
END;

-- Step 2: Apply the filter to the table
ALTER TABLE bookings SET ROW FILTER airline_access ON (airline_code);
-- Now when Lufthansa user queries bookings, they automatically see only LH rows!
-- The filter is applied at the database level — can't be bypassed.

Amadeus use case: "200+ airline partners each access only their own booking data via row-level security, while passenger PII is masked for non-authorized teams. This is all configured in Unity Catalog — no application-level code needed."

Interview tip: Mention that these policies are enforced by Unity Catalog itself, at the platform level — so even if someone writes new SQL or connects via Power BI, the rules still apply. They can't be bypassed by client-side code.
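The two CASE functions above reduce to simple conditional logic. Here is a plain-Python sketch of that logic (group membership is passed in explicitly; in Databricks, is_member() resolves it from the current session, and the group names are taken from the SQL example — this only illustrates the behavior, it is not the Unity Catalog runtime):

```python
def mask_email(email: str, groups: set) -> str:
    """Mirror of the mask_email SQL function's CASE logic."""
    if "gdpr_admins" in groups:
        return email                          # full email
    if "booking_agents" in groups:
        local, domain = email.split("@", 1)
        return local[:3] + "***@" + domain    # partial mask
    return "***@***.***"                      # fully masked

def airline_access(airline_code: str, groups: set) -> bool:
    """Mirror of the airline_access row filter's CASE logic."""
    if "amadeus_admins" in groups:
        return True
    if "lufthansa_team" in groups:
        return airline_code == "LH"
    if "air_india_team" in groups:
        return airline_code == "AI"
    return False

print(mask_email("krishna@gmail.com", {"booking_agents"}))  # kri***@gmail.com

bookings = [("LH", "B-1"), ("AI", "B-2"), ("LH", "B-3")]
visible = [b for b in bookings if airline_access(b[0], {"lufthansa_team"})]
print(visible)  # [('LH', 'B-1'), ('LH', 'B-3')]
```

The difference in Databricks is that this evaluation happens inside the engine on every query, so no caller can skip it.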

Q3: What is data lineage? How does Unity Catalog track it?

Simple Explanation: Data lineage answers the question: "Where did this data come from?" and "What depends on this data?"

Unity Catalog automatically tracks lineage at the column level. You don't need to set anything up — it just works. Every time a notebook, job, or Lakeflow pipeline reads from table A and writes to table B, Unity Catalog records that connection.

Real-world analogy: Think of lineage like a family tree for your data. You can trace any piece of data back to its source (parents) and see what it feeds into (children).

Example:

Oracle bookings → Bronze.raw_bookings → Silver.clean_bookings → Gold.daily_revenue
                                                              └→ Gold.passenger_360

If someone asks "Where does the daily_revenue.total_fare column come from?", lineage shows: daily_revenue.total_fare ← Silver.clean_bookings.fare_amount ← Bronze.raw_bookings.fare ← Oracle.BOOKING_AMT

Use cases for Amadeus:

  1. Impact analysis: "If we rename the fare_amount column in Silver, which Gold tables will break?"
  2. Root cause debugging: "Gold revenue report shows wrong numbers — let me trace back to see where the data comes from"
  3. GDPR compliance: "Show me ALL tables that contain passenger email — I need to delete a passenger's data"
  4. Data quality: "Null values appearing in Gold — lineage shows they come from a specific Bronze source file"

How to view: Open Catalog Explorer in Databricks UI → click on any table → click "Lineage" tab.
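The trace described above is just a walk over an upstream-edge graph. A minimal sketch, with the edge map hand-built from the daily_revenue example (Unity Catalog records these edges automatically whenever a job writes a table; this only illustrates the traversal):

```python
# Upstream lineage as a parent map: column -> its direct source column.
upstream = {
    "Gold.daily_revenue.total_fare": "Silver.clean_bookings.fare_amount",
    "Silver.clean_bookings.fare_amount": "Bronze.raw_bookings.fare",
    "Bronze.raw_bookings.fare": "Oracle.BOOKING_AMT",
}

def trace_to_source(column: str) -> list:
    """Follow upstream edges until the original source column is reached."""
    path = [column]
    while path[-1] in upstream:
        path.append(upstream[path[-1]])
    return path

print(" <- ".join(trace_to_source("Gold.daily_revenue.total_fare")))
```

Impact analysis ("what breaks downstream?") is the same walk in the opposite direction, over child edges instead of parent edges.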

Q4: What are Storage Credentials and External Locations?

Simple Explanation: When you create an external table (data stored in YOUR ADLS Gen2 account, not managed by Databricks), Unity Catalog needs a way to access that storage. This is done through a chain:

  1. Storage Credential = "Here are the keys to my cloud storage" (Azure Service Principal or Managed Identity)
  2. External Location = "Use those keys to access THIS specific folder" (maps a storage path to the credential)
  3. External Table = "Create a table pointing to data at that location"

Real-world analogy:

  • Storage Credential = Master key to the building (access to Azure storage account)
  • External Location = Permission to enter a specific room (access to a specific folder in ADLS)
  • External Table = A desk in that room (a table pointing to files in that folder)
🗂️ Chain: Storage Credential → External Location → External Table

Storage Credential:
  • What: Azure Service Principal or Managed Identity credentials
  • Why: Gives Databricks permission to access your ADLS Gen2 storage
  • Who creates: Account admin (one-time setup)
  • Scope: Can be used by multiple External Locations

External Location:
  • What: Maps a specific ADLS path to a Storage Credential
  • Why: Controls which folders Databricks can access
  • Example: "abfss://raw@amadeusstorage.dfs.core.windows.net/" → uses credential X
  • Who creates: Metastore admin
sql
-- Step 1: Create a storage credential (admin task — done once)
CREATE STORAGE CREDENTIAL amadeus_adls_cred
WITH AZURE_MANAGED_IDENTITY (
    access_connector_id = '/subscriptions/.../accessConnectors/databricks-connector'
);
-- This tells Databricks: "Use this Azure Managed Identity to access ADLS"

-- Step 2: Create an external location (map a folder to the credential)
CREATE EXTERNAL LOCATION amadeus_raw_data
URL 'abfss://raw@amadeusstorage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL amadeus_adls_cred);
-- This tells Databricks: "You can access this specific ADLS folder using that credential"

-- Step 3: Create an external table at that location
CREATE TABLE travel_prod.bookings.legacy_flights
LOCATION 'abfss://raw@amadeusstorage.dfs.core.windows.net/legacy/flights/';
-- This creates a table pointing to existing data files in ADLS
-- Data lives in YOUR storage — Databricks just reads it

When do you need this? When migrating from existing data lakes, when data is shared with other systems, or when your company's policy requires data to stay in their own storage accounts.
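Conceptually, the chain means a table path is only usable if it falls under a registered external location. A toy prefix-check sketch of that idea (paths taken from the SQL example above; simplified — real authorization also checks your privileges on the location and credential):

```python
# Each external location maps a storage path prefix to a storage credential.
external_locations = {
    "abfss://raw@amadeusstorage.dfs.core.windows.net/": "amadeus_adls_cred",
}

def resolve_credential(table_path: str):
    """Return the credential of the external location covering this path,
    or None if no registered location covers it (creation would be denied)."""
    for prefix, credential in external_locations.items():
        if table_path.startswith(prefix):
            return credential
    return None

print(resolve_credential(
    "abfss://raw@amadeusstorage.dfs.core.windows.net/legacy/flights/"))
# -> amadeus_adls_cred
```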

Q5: How do you migrate from Hive Metastore to Unity Catalog?

Simple Explanation: Before Unity Catalog existed, Databricks used Hive Metastore (the old system) to manage tables. It had only 2 levels (database.table), limited security, and no lineage. Many companies — including Amadeus — need to migrate from Hive to Unity Catalog.

Why migrate? Hive Metastore has no row-level security, no column masking, no lineage, no ABAC, no data sharing. Unity Catalog has all of these.

🗂️ Migration Steps for Amadeus (500+ tables):

1. PREPARATION (Week 1)
   • Audit all existing Hive tables: how many? managed or external?
   • Plan the naming: hive_metastore.bookings.* → travel_prod.bookings.*
     Old: SELECT * FROM bookings.fact_flights
     New: SELECT * FROM travel_prod.bookings.fact_flights (3-level namespace)
   • Review existing permissions (who has access to what)
   • Plan catalogs: travel_prod, travel_dev, travel_staging

2. INFRASTRUCTURE (Week 2)
   • Create Unity Catalog metastore → assign to Azure workspace
   • Create storage credentials for ADLS Gen2 (Service Principal)
   • Create external locations (for external table data)
   • Create catalogs and schemas matching your naming plan

3. TABLE MIGRATION (Weeks 3-5)
   • Managed tables: Copy data using CTAS (Create Table As Select)
     CREATE TABLE travel_prod.bookings.fact_flights
     AS SELECT * FROM hive_metastore.bookings.fact_flights;
   • External tables: Just re-point to existing ADLS path (no data copy!)
     CREATE TABLE travel_prod.bookings.legacy_data
     LOCATION 'abfss://existing/path/in/adls/';
   • Tool: Databricks UCX (Unity Catalog migration toolkit) automates most of this
   • Validate: Compare row counts and checksums between old and new

4. CODE UPDATES (Weeks 5-6)
   • Update ALL notebooks: db.table → catalog.schema.table
   • Update all jobs and Lakeflow pipelines
   • Update BI tool connections (Power BI, Tableau)

5. VALIDATION & CUTOVER (Weeks 7-8)
   • Run BOTH old Hive and new UC tables in parallel for 2 weeks
   • Compare query results — they must match exactly
   • After validation, deprecate all hive_metastore references

Interview tip: Mention the UCX tool — it shows you know the practical migration path, not just theory. Also mention "parallel run for 2 weeks" — it shows you understand production safety.
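The "compare row counts and checksums" validation in step 3 can be sketched in plain Python (on Databricks you would compute the same thing per table with COUNT(*) and a hash aggregate; the XOR-fold below is just one way to get an order-independent checksum):

```python
import hashlib

def table_fingerprint(rows):
    """Row count plus an order-independent checksum of all rows."""
    rows = list(rows)
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest ^= int(h, 16)   # XOR-fold: result doesn't depend on row order
    return len(rows), digest

# Same logical content, different physical order (CTAS gives no order guarantee):
hive_table = [("PAX-1", 120.0), ("PAX-2", 80.5)]
uc_table   = [("PAX-2", 80.5), ("PAX-1", 120.0)]

assert table_fingerprint(hive_table) == table_fingerprint(uc_table)
print("migration validated: row counts and checksums match")
```

Order independence matters because the migrated table will almost never store rows in the same physical order as the original.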

SECTION 2: NEW — ABAC (Attribute-Based Access Control) (45 min)

Q6: What is ABAC? How is it different from RBAC?

Simple Explanation:

RBAC (Role-Based Access Control) is the traditional way: you create roles and GRANT permissions on specific tables to each role. If you have 500 tables, you need 500 GRANT statements for each role. When a new table is added, you must manually GRANT access to it.

ABAC (Attribute-Based Access Control) is the new way: you TAG tables with attributes (like "pii=true") and create ONE policy that says "only gdpr_authorized users can access tables tagged pii=true." When a new PII table is added and tagged, the policy automatically applies — no manual GRANT needed.

Real-world analogy:

  • RBAC = Security badges with specific room numbers: "Badge allows Room 101, 102, 103". New room added? Update every badge.
  • ABAC = Security badges with rules: "Badge allows any room tagged 'public'". New public room added? Already accessible — no badge update needed.

Public Preview since November 2025 — knowing this in your interview shows you're up to date!

| Aspect | RBAC (Traditional) | ABAC (New) |
|---|---|---|
| How it works | GRANT permissions on specific tables | Create policies based on tags |
| Scale | 500 tables = 500 GRANT statements per role | 1 policy covers ALL tables with matching tag |
| New table added | Must manually GRANT access | Just TAG the table → policy auto-applies |
| Maintenance | High — every new table needs manual grants | Low — tag-based, automatic |
| Dynamic | No — static grants | Yes — add/remove tags at any time |
sql
-- ============================================
-- RBAC way (old — lots of manual work):
-- ============================================
GRANT SELECT ON TABLE passengers TO gdpr_team;
GRANT SELECT ON TABLE passenger_contacts TO gdpr_team;
GRANT SELECT ON TABLE passenger_addresses TO gdpr_team;
-- ... repeat for every PII table (500+ statements!)
-- New PII table added? Must remember to add another GRANT!

-- ============================================
-- ABAC way (new — one policy covers everything):
-- ============================================

-- Step 1: Tag tables with attributes
ALTER TABLE passengers SET TAGS ('pii' = 'true', 'gdpr_scope' = 'eu');
-- This table contains PII data and falls under EU GDPR

ALTER TABLE bookings SET TAGS ('pii' = 'false');
-- This table doesn't contain PII

ALTER TABLE passenger_contacts SET TAGS ('pii' = 'true', 'gdpr_scope' = 'eu');
-- Another PII table — just tag it!

-- Step 2: Create ONE policy that covers ALL PII tables
CREATE FUNCTION pii_access_policy()
RETURNS BOOLEAN
RETURN CASE
    WHEN has_tag('pii', 'true') AND NOT is_member('gdpr_authorized') THEN FALSE
    -- If table is tagged pii=true AND user is NOT in gdpr_authorized group → DENY access
    ELSE TRUE
    -- Otherwise → ALLOW access
END;
-- has_tag() checks if the table being accessed has a specific tag
-- is_member() checks if the current user belongs to a specific group

-- Step 3: Apply policy to entire catalog (all tables in it)
ALTER CATALOG travel_prod SET ACCESS CONTROL USING pii_access_policy;
-- Now EVERY table in travel_prod with tag pii=true is automatically protected!
-- New PII table added? Just tag it pii=true → protection is automatic!

Note: ABAC uses snake_case functions: has_tag, has_tag_value, is_member. CamelCase (hasTag) is being deprecated.

Amadeus use case: "Instead of managing GRANT statements for 500+ tables, we tag all PII tables with pii=true and apply ONE ABAC policy. When any new table is created with that tag, it's automatically governed. This saves hours of manual access management."
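The tag-then-policy flow can be modeled in a few lines of plain Python (has_tag/is_member stand-ins, with table names from the example above). The point it demonstrates: ONE policy function is evaluated against every table's tags, so newly tagged tables are covered with no extra grants:

```python
# Tags as set by the ALTER TABLE ... SET TAGS statements above.
table_tags = {
    "passengers":         {"pii": "true", "gdpr_scope": "eu"},
    "bookings":           {"pii": "false"},
    "passenger_contacts": {"pii": "true", "gdpr_scope": "eu"},
}

def pii_access_policy(table: str, user_groups: set) -> bool:
    """ONE policy for all tables: deny PII-tagged tables to non-authorized users."""
    tags = table_tags.get(table, {})
    if tags.get("pii") == "true" and "gdpr_authorized" not in user_groups:
        return False
    return True

# A newly created table only needs a tag to be governed -- no new GRANTs:
table_tags["passenger_addresses"] = {"pii": "true"}

print(pii_access_policy("passengers", {"analytics"}))           # False
print(pii_access_policy("passenger_addresses", {"analytics"}))  # False (auto-covered)
print(pii_access_policy("bookings", {"analytics"}))             # True
```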

SECTION 3: PHOTON ENGINE (30 min)

Q7: What is Photon Engine?

Simple Explanation: Normally, Spark runs on the JVM (Java Virtual Machine). It works, but it's not the fastest. Photon is Databricks' own query engine written in C++ (a much faster language for data processing). It replaces Spark's JVM execution with native C++ code — making queries 3-12x faster.

The best part: You don't change ANY code. Same SQL, same PySpark, same notebooks. You just pick "Photon" runtime when creating your cluster, and everything runs faster automatically.

Real-world analogy: Imagine your car (Spark on JVM) gets a turbocharged engine swap (Photon in C++). Same car, same dashboard, same steering wheel — but much faster.

When Photon helps A LOT:

| Operation | Speed Improvement | Why |
|---|---|---|
| Table scans (reading data) | Up to 12x faster | C++ reads data more efficiently than JVM |
| Aggregations (SUM, COUNT, AVG) | 3-10x faster | Vectorized processing (many rows at once) |
| Joins | 3-8x faster | Better memory management |
| Filters (WHERE clauses) | 5-10x faster | Native code skips rows faster |
| MERGE/UPDATE/DELETE | 3-5x faster | Important for the Silver layer at Amadeus! |

When Photon does NOT help:

| Operation | Why It Doesn't Help |
|---|---|
| Python UDFs | UDFs run in a Python process, completely bypassing Photon |
| RDD-based code | Photon only works with the DataFrame API and SQL, not old-style RDDs |
| Very small data (<100 MB) | Photon's startup overhead isn't worth it for tiny datasets |
| ML training (MLlib) | ML uses a different execution path — use GPU clusters instead |
| Unsupported operations | Photon silently falls back to Spark (no error — just runs slower) |
python — editable
# HOW TO ENABLE PHOTON:

# Option 1: Select Photon runtime when creating cluster (recommended)
# In Databricks UI: Cluster → Databricks Runtime → select "18.1 LTS Photon"
# That's it! No code changes needed.

# Option 2: Set in the cluster's Spark config (Advanced options → Spark):
#   spark.databricks.photon.enabled true
# Note: Photon is chosen when the cluster starts — it can't be toggled
# at runtime from a notebook with spark.conf.set().

# Your existing SQL and PySpark code runs 3-12x faster — no modifications!

Amadeus answer: "We use Photon for all our Silver-layer MERGE operations on booking tables. The 3-5x speedup on MERGE reduced our daily pipeline from 3 hours to 45 minutes. No code changes were needed — we just switched the runtime."

Interview tip: Don't just say "Photon makes things faster." Say WHEN to use it and WHEN NOT to use it. Mention "falls back gracefully" — if Photon can't handle an operation, it silently uses regular Spark instead of crashing.
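The "falls back gracefully" behavior can be pictured as a per-operator engine choice. A toy model (the supported-operator set below is illustrative, NOT Photon's actual coverage list):

```python
# Toy model of Photon's graceful fallback: operators Photon supports run
# natively; anything else silently drops back to JVM Spark execution.
PHOTON_SUPPORTED = {"scan", "filter", "aggregate", "join", "merge"}

def execution_engine(plan):
    """Label each operator in a query plan with the engine that would run it."""
    return [(op, "photon" if op in PHOTON_SUPPORTED else "spark") for op in plan]

plan = ["scan", "filter", "python_udf", "aggregate"]
for op, engine in execution_engine(plan):
    print(f"{op:<12} -> {engine}")
# The python_udf stage falls back to Spark; the query still succeeds,
# it just runs slower at that stage.
```

This is also why a single Python UDF in the middle of a pipeline can erase much of Photon's benefit: everything routed through it leaves the native engine.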

SECTION 4: SERVERLESS COMPUTE (45 min)

Q8: What is Serverless compute in Databricks?

Simple Explanation: Normally in Databricks, you create a cluster (a group of VMs/machines) to run your code. You pick the machine type, the number of machines, configure auto-scaling, etc. If the cluster sits idle for hours, you're still paying. If it takes 10 minutes to start up, you wait.

Serverless means: Databricks manages the machines FOR you. You just submit your query/job, and Databricks instantly provides compute resources. When you're done, resources are released. You only pay for the time your code is actually running.

Real-world analogy:

  • Standard cluster = Owning a car. You pay for insurance, parking, gas even when it's parked.
  • Serverless = Uber. You pay only when you're riding. No car maintenance, no parking fees.
| Aspect | Standard Clusters | Serverless |
|---|---|---|
| Who manages it? | You configure machine type, count, etc. | Databricks manages everything |
| Startup time | 3-10 minutes (slow!) | <10 seconds (instant!) |
| Scaling | You configure min/max workers | Automatic — scales up/down as needed |
| Billing | Pay for running time (EVEN when idle) | Pay ONLY for execution time |
| Cost savings | Baseline | Up to 70% savings for bursty workloads |
| Compliance | Standard | HITRUST, PCI-DSS, TISAX (important for travel!) |

What can run on Serverless?

  • SQL Warehouses (most mature — been around longest)
  • Notebooks (interactive development)
  • Jobs/Workflows (scheduled production pipelines)
  • Lakeflow Pipelines
  • JAR tasks (new — February 2026)
  • AI Runtime with GPU (Public Preview — March 2026)

Q9: What are Serverless Workspaces?

Simple Explanation: A regular Databricks workspace requires you to set up compute configurations, storage, networking, etc. A Serverless Workspace (GA January 2026) is a pre-configured workspace where EVERYTHING is managed — compute, storage, networking. You just log in and start working.

Think of it as: "Databricks out of the box — no setup needed."

Key features:

  • Managed compute — no cluster configuration needed
  • Managed storage — Databricks handles where data is stored
  • Instant startup — everything is ready in seconds
  • Auto-scaling — resources scale with your workload
  • Network boundary isolation — secure by default

When to use Serverless vs Standard:

| Workload | Recommendation | Why |
|---|---|---|
| SQL analytics (BI dashboards) | Serverless SQL Warehouse | Auto-scale, auto-suspend, no idle cost |
| Bursty/intermittent jobs | Serverless | Pay only when running — big cost savings |
| Development/ad-hoc exploration | Serverless | Instant startup, no idle cost |
| 24/7 streaming (always running) | Standard | Predictable load → predictable cost is cheaper |
| GPU workloads (ML training) | Standard | GPU serverless is still in preview |
| Very specific hardware needs | Standard | Serverless doesn't let you pick machine types |

Q10: What is the difference between Job Cluster, All-Purpose Cluster, and Serverless?

Simple Explanation: Databricks has 3 types of compute. Knowing which to use WHERE is a common interview question:

All-Purpose Cluster = A long-running cluster shared by multiple users. Like a shared office computer — always on, multiple people use it. MOST EXPENSIVE.

Job Cluster = A cluster created just for one job and destroyed after. Like a rental car — rent it for the trip, return it after. CHEAPER than All-Purpose.

Serverless = No cluster at all — Databricks provides compute on demand. Like Uber — just request a ride, pay per minute. CHEAPEST for bursty work.

| Aspect | All-Purpose | Job Cluster | Serverless |
|---|---|---|---|
| Lifecycle | Runs until you stop it | Created per job, destroyed after | On-demand per query/task |
| Cost | Most expensive (all-purpose pricing) | Cheaper (job compute pricing) | Cheapest for bursty work |
| Sharing | Multiple users share it | Single job only | Single job/query |
| Startup time | Instant (already running) | 3-10 minutes (creating VMs) | <10 seconds |
| When to use | Development ONLY | Production scheduled jobs | BI queries, bursty workloads |

Golden rule for production: NEVER use All-Purpose clusters for production pipelines. They're too expensive. Use Job Clusters or Serverless.

Amadeus answer: "We use All-Purpose clusters only for development notebooks. Production ETL pipelines run on Job Clusters with Photon for cost efficiency. BI dashboards connect via Serverless SQL Warehouses that auto-suspend when idle — zero cost overnight."
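The cost ranking can be made concrete with back-of-envelope arithmetic. All rates and timings below are illustrative assumptions (real DBU prices vary by SKU, region, and the underlying VM cost), but they show why bursty workloads favor serverless:

```python
# Back-of-envelope daily cost for one bursty workload: 20 short jobs a day.
# All rates are HYPOTHETICAL illustration values, not real Databricks prices.
ALL_PURPOSE_RATE = 0.55   # $/hour, all-purpose compute (assumed)
JOB_RATE         = 0.30   # $/hour, job compute (assumed)
SERVERLESS_RATE  = 0.35   # $/hour, serverless (assumed: pricier per hour,
                          # but billed only while actually running)

runs_per_day    = 20      # bursty: many short jobs
run_minutes     = 3       # actual processing time per run
startup_minutes = 5       # VM provisioning, billed per run on job clusters

busy_hours = runs_per_day * run_minutes / 60                      # real work
job_hours  = runs_per_day * (run_minutes + startup_minutes) / 60  # startup billed too

all_purpose_cost = ALL_PURPOSE_RATE * 24         # cluster left running all day
job_cost         = JOB_RATE * job_hours          # per-run cluster, startup included
serverless_cost  = SERVERLESS_RATE * busy_hours  # execution time only

for name, cost in [("all-purpose (24/7)", all_purpose_cost),
                   ("job clusters", job_cost),
                   ("serverless", serverless_cost)]:
    print(f"{name:<20} ${cost:6.2f}/day")
```

Flip the workload to a steady 24/7 stream and the ranking changes: there is no idle time to save, so a standard cluster's lower hourly rate wins, matching the table above.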

SECTION 5: AZURE-SPECIFIC INTEGRATION (45 min)

Q11: How does Azure Databricks architecture work? (Three-plane model)

Simple Explanation: Azure Databricks has 3 main components (called "planes") that work together. Understanding this helps you answer architecture questions and troubleshoot issues.

📐 Architecture Diagram
┌──────────────────────────────────────────────────┐
│              CONTROL PLANE                       │
│        (Managed by Databricks in Azure)          │
│                                                  │
│  What lives here:                                │
│  ├── Databricks Workspace UI (the web interface) │
│  ├── REST APIs (for automation)                  │
│  ├── Job Scheduler (runs your scheduled jobs)    │
│  ├── Unity Catalog Metastore (table metadata)    │
│  └── Cluster Manager (creates/destroys VMs)      │
│                                                  │
│  Key point: You don't manage this. Databricks    │
│  runs it in THEIR Azure subscription.            │
└──────────────────┬───────────────────────────────┘
                   │ Secure encrypted connection
┌──────────────────▼───────────────────────────────┐
│              COMPUTE PLANE                       │
│        (In YOUR Azure subscription)              │
│                                                  │
│  What lives here:                                │
│  ├── VMs (cluster worker/driver nodes)           │
│  ├── VNet (Virtual Network for isolation)        │
│  ├── NSG (Network Security Groups — firewall)    │
│  └── Managed Resource Group (Databricks creates) │
│                                                  │
│  Key point: VMs run in YOUR subscription,        │
│  so your data never leaves your network.         │
└──────────────────┬───────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────┐
│              STORAGE PLANE                       │
│        (In YOUR Azure subscription)              │
│                                                  │
│  What lives here:                                │
│  ├── ADLS Gen2 (your main data lake storage)     │
│  ├── Azure Blob Storage                          │
│  └── DBFS root storage (Databricks internal)     │
│                                                  │
│  Key point: Your data stays in YOUR storage.     │
│  Databricks never stores your data.              │
└──────────────────────────────────────────────────┘

Why this matters for Amadeus: "Our passenger PII data never leaves our Azure subscription. Databricks Control Plane manages orchestration, but the actual data processing (Compute) and storage (ADLS Gen2) happen inside our own subscription — important for GDPR compliance."

Key Azure services that integrate with Databricks:

| Azure Service | What It Does With Databricks |
|---|---|
| ADLS Gen2 | Primary storage for all Delta tables (the data lake) |
| Azure Key Vault | Stores secrets (passwords, API keys) — Databricks reads them securely |
| Microsoft Entra ID (formerly Azure Active Directory) | Single sign-on (SSO), user provisioning, Service Principals |
| Azure Event Hubs | Kafka-compatible messaging — used for CDC streaming |
| Azure DevOps | CI/CD pipelines to deploy Databricks jobs |
| Azure Data Factory | Can orchestrate Databricks jobs (alternative to Databricks Workflows) |
| Power BI | Connects directly to Databricks SQL Warehouse for BI reports |

Q12: What is Azure Key Vault integration? Why is it important?

Simple Explanation: You should NEVER put passwords, API keys, or connection strings directly in your code or notebooks. Anyone who sees the notebook can see the password. Instead, store secrets in Azure Key Vault (a secure vault in Azure), and Databricks reads them at runtime.

Real-world analogy: Instead of writing your house key code on a Post-it note stuck to the door, you keep it in a safe (Key Vault) and only authorized people can open the safe.

python — editable
# BAD — NEVER do this! Password visible in notebook:
# oracle_password = "MySecretP@ssw0rd"

# GOOD — Read password from Azure Key Vault at runtime:

# Step 1: (Done ONCE by admin) Create a "secret scope" in Databricks
# that connects to your Azure Key Vault:
# databricks secrets create-scope \
#   --scope amadeus-secrets \
#   --scope-backend-type AZURE_KEYVAULT \
#   --resource-id /subscriptions/.../vaults/amadeus-kv \
#   --dns-name https://amadeus-kv.vault.azure.net/

# Step 2: In your notebook, read secrets using dbutils:
oracle_password = dbutils.secrets.get(
    scope="amadeus-secrets",         # Which vault to read from
    key="oracle-prod-password"       # Which secret to get
)
# The password is now in the variable — but if you try to print it,
# Databricks shows [REDACTED] in the output (security feature!)

kafka_connection = dbutils.secrets.get(
    scope="amadeus-secrets",
    key="eventhubs-connection-string"  # Connection string for Azure Event Hubs
)
# Use these variables in your JDBC connections, Kafka configs, etc.

Q13: What are Service Principals? Why use them for production?

Simple Explanation: A Service Principal is like a "robot user" account in Azure. Instead of using a real person's login to run production jobs, you create a Service Principal (non-human identity) that represents your application.

Why not use a real person's account?

  • What if the person leaves the company? All jobs break.
  • What if they change their password? All jobs break.
  • What if MFA (multi-factor authentication) pops up? Job can't enter the code.
  • Audit logs show "Krishna ran this job" — was it Krishna manually, or the scheduled job?
🗂️ Why Service Principals for production:
   • No dependency on individual employees (person leaves → jobs still work)
   • No MFA interruptions (robots don't need phone verification)
   • Scoped permissions (give it ONLY what it needs — principle of least privilege)
   • Clear audit trail ("amadeus-etl-sp ran this job" vs "krishna ran this job")
   • Required for Unity Catalog external access in production

How to set up:
   1. Create a Service Principal in Microsoft Entra ID (done by an Azure admin)
   2. Add the Service Principal to your Databricks workspace
   3. Grant Unity Catalog permissions: GRANT SELECT ON TABLE ... TO `amadeus-etl-sp`
   4. Configure Job Clusters to run as this Service Principal
   5. Store the SP's secret in Azure Key Vault (rotate every 90 days)

Amadeus answer: "All our production pipelines run under Service Principals, not user accounts. This ensures jobs aren't affected when employees change teams, and audit logs clearly show which application accessed what data."

SECTION 6: DATA GOVERNANCE AT SCALE (1 hour)

Q14: How do you handle GDPR "Right to Be Forgotten" in Delta Lake?

Simple Explanation: GDPR (EU data protection law) gives every person the right to say: "Delete ALL my data from your systems." For Amadeus, this means: when a passenger requests deletion, we must remove their data from EVERY table — and prove it's really gone.

The challenge with Delta Lake: Delta keeps history (time travel). Even after you DELETE a row, the old data is still in the old Parquet files. You must also VACUUM to physically remove those old files.

Two approaches:

Approach 1: Hard Delete (complete removal)

sql
-- Step 1: Delete the passenger from ALL tables that contain their data
DELETE FROM dim_passenger WHERE passenger_id = 'PAX-12345';
-- Removes this passenger's row from the dimension table

DELETE FROM fact_bookings WHERE passenger_id = 'PAX-12345';
-- Removes all their bookings

DELETE FROM silver_interactions WHERE passenger_id = 'PAX-12345';
-- Removes all their interaction records

-- Step 2: VACUUM with 0 hours retention to physically remove old files
-- WARNING: This breaks time travel! But GDPR requires it.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
-- Must disable the safety check (Delta normally prevents 0-hour vacuum)

VACUUM dim_passenger RETAIN 0 HOURS;
-- Physically deletes ALL old files — data is truly gone
VACUUM fact_bookings RETAIN 0 HOURS;
VACUUM silver_interactions RETAIN 0 HOURS;

-- Step 3: Log the deletion for compliance audit
INSERT INTO gdpr_deletion_log VALUES (
    'PAX-12345',                     -- Which passenger was deleted
    current_timestamp(),             -- When it was deleted
    'completed',                     -- Status
    'tables: dim_passenger, fact_bookings, silver_interactions'  -- Which tables
);
-- This log PROVES to regulators that we deleted the data

Approach 2: Pseudonymization (preferred — keeps analytics working)

Instead of deleting the entire row, replace PII with random/hashed values. The row stays for analytics, but the person can't be identified.

sql
UPDATE dim_passenger SET
    full_name = 'REDACTED',                           -- Can't identify the person
    email = CONCAT(MD5(email), '@redacted.com'),      -- Hashed (one-way, can't reverse)
    phone = 'REDACTED',
    address = 'REDACTED'
WHERE passenger_id = 'PAX-12345';
-- The booking data still exists for analytics (revenue, route stats)
-- But the person's identity is completely removed

Interview tip: Mention BOTH approaches and when to use each. Hard delete = when regulation requires complete removal. Pseudonymization = when you need to keep analytics data but remove identity.
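The pseudonymization UPDATE maps to a simple record transformation. A Python sketch using hashlib.md5 to mirror the SQL's MD5 (illustrative; in practice a salted or keyed hash is stronger, since plain MD5 of an email can be reversed by hashing candidate addresses):

```python
import hashlib

def pseudonymize(passenger: dict) -> dict:
    """Replace identity fields with fixed/hashed values; keep analytics fields."""
    out = dict(passenger)
    out["full_name"] = "REDACTED"
    out["email"] = hashlib.md5(passenger["email"].encode()).hexdigest() + "@redacted.com"
    out["phone"] = "REDACTED"
    out["address"] = "REDACTED"
    return out

p = {"passenger_id": "PAX-12345", "full_name": "Krishna",
     "email": "krishna@gmail.com", "phone": "+49-123", "address": "Berlin",
     "total_fare": 420.0}
anon = pseudonymize(p)
print(anon["full_name"], anon["total_fare"])  # REDACTED 420.0
```

Revenue and route analytics over total_fare still work; nothing in the row identifies the person anymore.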

Q15: How would you design a governance framework for 200+ data engineers?

Simple Explanation: With 200+ engineers, you can't rely on "everyone will follow the rules." You need automated governance — policies that are enforced at the platform level, not just documented in a wiki.

1. ACCESS CONTROL (Unity Catalog)

  • 4 catalogs: travel_prod, travel_dev, travel_staging, travel_sandbox
  • Developers get full access to dev and sandbox; only pipelines (Service Principals) write to prod
  • Row-level security: each airline sees only their own data
  • Column masking: PII hidden from teams that don't need it
  • ABAC: tag PII tables → one policy governs all automatically
  • Principle of least privilege: start with no access, grant only what's needed

2. DATA CLASSIFICATION (know what data you have)

  • Auto-tagging (new Oct 2025 feature): Databricks auto-detects PII columns
  • Manual tags: pii=true, gdpr_scope=eu, sensitivity=high
  • Tags drive ABAC policies (tagged pii=true → access restricted)
  • Discover page (new Feb 2026): browse data organized by business domain

3. AUDIT & COMPLIANCE (prove you're following the rules)

  • Unity Catalog audit logs → Azure Monitor (who accessed what, when)
  • Delta Lake DESCRIBE HISTORY: track every change to every table
  • GDPR deletion pipeline: automated process with compliance logging
  • Data retention policies: auto-delete data older than X years

4. DATA QUALITY (ensure data is correct)

  • Lakeflow expectations at every layer (Bronze/Silver/Gold)
  • Quality monitoring anomaly detection (new Feb 2026): auto-alerts
  • Data contracts: teams agree on the schema/quality of data they share
  • Quarantine: bad data goes to a separate table + alert sent

5. DATA DISCOVERY (find what you need)

  • Unity Catalog search: find any table, view, model
  • Column descriptions and table comments: document what data means
  • Lineage visualization: see data flow from source to report
  • Domains (new Feb 2026): organize by business area (Bookings, Flights, Loyalty)
  • Certification tags (GA Oct 2025): mark trusted tables as "certified"

Q16: What is Delta Sharing?

Simple Explanation: Delta Sharing is a way to share data with external partners (other companies) without copying the data. The partner reads directly from YOUR storage. You control what they can access, and you see audit logs of what they read.

Why is this useful for Amadeus? Amadeus works with 200+ airlines. Each airline needs access to their own booking data. Instead of creating copies for each airline (expensive, hard to keep in sync), you share the data directly using Delta Sharing.

Real-world analogy: Instead of printing copies of a report for each airline (copies get outdated), you give each airline a "viewing link" to the live report. They always see the latest data, and you control who can view what.

How it works:

1. Amadeus (Provider) shares a Delta table via Delta Sharing
2. Amadeus creates a "share" with only Lufthansa's data (filtered)
3. Amadeus gives Lufthansa a sharing profile (JSON file with credentials)
4. Lufthansa (Recipient) uses ANY tool to read the shared data:
Databricks, Spark, pandas, Power BI, Snowflake, Trino...
5. Data is read DIRECTLY from Amadeus's ADLS — no copying!
6. Amadeus sees audit logs: "Lufthansa read fact_bookings at 3pm"
sql
-- Step 1: Create a share (a named collection of shared tables)
CREATE SHARE airline_bookings_share;

-- Step 2: Add a table to the share — with FILTERING (only Lufthansa data!)
ALTER SHARE airline_bookings_share ADD TABLE travel_prod.bookings.fact_bookings
    PARTITION (airline_code = 'LH');
-- Only rows where airline_code = 'LH' are shared. Lufthansa can't see other airlines.

-- Step 3: Create a recipient and grant access
CREATE RECIPIENT lufthansa USING ID 'azure:<region>:<metastore-uuid>';
-- The sharing identifier comes from the recipient's own Databricks metastore
GRANT SELECT ON SHARE airline_bookings_share TO RECIPIENT lufthansa;
-- Lufthansa now has read-only access to their filtered booking data
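The sharing profile from step 3 applies to open (non-Databricks) recipients and follows the open Delta Sharing protocol. A minimal sketch, with placeholder endpoint and token values:

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://<sharing-server-host>/delta-sharing/",
  "bearerToken": "<recipient-token>"
}
```

Databricks-to-Databricks recipients skip the file entirely: access is wired up through the sharing identifier instead.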

New December 2025: Delta Sharing now supports Iceberg clients — recipients can use Snowflake, Trino, or Flink to read shared Delta data. The recipient doesn't even need Databricks!

SECTION 7: NEW 2026 FEATURES (30 min)

Q17: What are Multi-table Transactions?

Simple Explanation: Before March 2026, Delta Lake transactions were per-table. If you needed to update 3 tables atomically (all succeed or all fail), you had to write complex error-handling code.

Multi-table Transactions (Public Preview March 2026) let you wrap multiple table operations in a single atomic block using BEGIN ATOMIC ... END;. Either ALL operations succeed, or NONE of them apply.

Real-world analogy: Transferring money between bank accounts. You want to debit Account A AND credit Account B atomically. If the credit fails, the debit should also be rolled back. Multi-table transactions = this guarantee for your data tables.
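Before BEGIN ATOMIC existed, applications had to fake this guarantee with compensation logic. A minimal in-memory sketch (plain Python, hypothetical table names; dicts stand in for Delta tables):

```python
# Pre-2026 workaround: manual compensation. Either all writes go through,
# or every table is restored from a snapshot taken up front.
def process_booking(tables: dict, booking: dict, fail: bool = False) -> bool:
    snapshot = {name: list(rows) for name, rows in tables.items()}
    try:
        tables["fact_bookings"].append(booking)
        tables["audit_log"].append({"event": "booking", "id": booking["id"]})
        if fail:
            raise RuntimeError("simulated audit-log failure")
        return True
    except RuntimeError:
        for name, rows in snapshot.items():
            tables[name][:] = rows  # roll every table back to its snapshot
        return False
```

BEGIN ATOMIC moves exactly this burden into the engine, and unlike this hand-rolled sketch it stays correct under concurrent writers.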

sql
-- Process a new booking: update 3 tables atomically
BEGIN ATOMIC
    -- 1. Insert the booking into the fact table
    INSERT INTO fact_bookings VALUES (...);

    -- 2. Update the passenger's loyalty tier
    UPDATE dim_passenger SET loyalty_tier = 'Gold'
    WHERE passenger_id = 'PAX-123';

    -- 3. Record in the audit log
    INSERT INTO audit_log VALUES (...);
END;
-- If ANY of these 3 operations fails, ALL are rolled back
-- No half-completed bookings in the system!

Requirement: All tables must be Unity Catalog managed tables (not external tables).

Amadeus use case: "When processing a booking, we update the fact table, the passenger dimension, and the audit log atomically. Multi-table transactions guarantee we never have a booking without a corresponding audit entry."

Q18: What is Compatibility Mode?

Simple Explanation: Delta Lake uses a specific format that only Delta-compatible engines can read. But what if your partner uses Snowflake or Amazon Athena (which don't natively support Delta)? Compatibility Mode makes your Delta table readable by non-Delta engines — they see it as plain Parquet.

sql
-- Enable compatibility mode
ALTER TABLE bookings SET TBLPROPERTIES ('delta.compatibility.mode' = 'read');
-- Now Athena, Snowflake, and Microsoft Fabric can read this table
-- They see it as regular Parquet files — no Delta SDK needed

When to use: When you need to share data with teams/partners who use non-Databricks tools.

Q19: What is Lakebase?

Simple Explanation: Delta tables are great for analytics (batch queries, BI). But what about when an application needs to look up a passenger's details in 10 milliseconds? Delta tables can't do that — they're designed for scanning large amounts of data, not fast single-row lookups.

Lakebase (GA on Azure March 2026) is a serverless PostgreSQL-compatible database built into Databricks. It fills the gap between analytics (Delta) and applications (need fast lookups).

Think of it as: "A real database inside Databricks for application use cases."

Key features:

  • Scale-to-zero: When nobody is querying, it costs $0 (auto-shuts down)
  • Database branching: Create an instant copy for testing (like git branch for databases!)
  • Instant restore: Go back to any point in time
  • Auto-failover HA: If one server fails, another takes over automatically
sql
-- Create a Lakebase database (uses PostgreSQL syntax!)
CREATE DATABASE amadeus_app_db ENGINE = 'LAKEBASE';

-- Create a lookup table (standard PostgreSQL)
CREATE TABLE lookup_airports (
    iata_code VARCHAR(3) PRIMARY KEY,   -- Airport code: BLR, DEL, JFK
    airport_name VARCHAR(100),           -- Full name
    city VARCHAR(50),
    country VARCHAR(50)
);
-- This can serve API requests in milliseconds!

When to use Lakebase vs Delta tables:

| Use Case | Lakebase | Delta Tables |
| --- | --- | --- |
| API backend (fast single-row lookups) | ✅ Perfect | ❌ Too slow for single rows |
| Batch analytics & BI reporting | ❌ Not designed for this | ✅ Perfect |
| Feature serving for ML (low-latency) | ✅ Good | ⚠️ OK for batch features |
| Application CRUD operations | ✅ Perfect | ❌ Not designed for this |
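Because Lakebase speaks standard PostgreSQL, the application side is an ordinary DB-API primary-key lookup. A runnable sketch, with stdlib sqlite3 standing in for a Postgres driver and a made-up airport row:

```python
import sqlite3

# sqlite3 is used here purely so the sketch runs anywhere; against Lakebase
# you would open a PostgreSQL connection instead — the query is the same.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lookup_airports (
    iata_code TEXT PRIMARY KEY, airport_name TEXT, city TEXT, country TEXT)""")
conn.execute("INSERT INTO lookup_airports VALUES "
             "('BLR', 'Kempegowda Intl', 'Bengaluru', 'India')")

def get_airport(code: str):
    # Single-row lookup via the primary-key index: the millisecond path
    return conn.execute(
        "SELECT airport_name, city FROM lookup_airports WHERE iata_code = ?",
        (code,)).fetchone()

print(get_airport("BLR"))
```

Only the connection changes when pointing this at Lakebase (a PostgreSQL driver and connection string); the lookup pattern is identical.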

Q20: What is Predictive Optimization?

Simple Explanation: In Day 1, we learned that you need to run OPTIMIZE (compact files), VACUUM (clean up old files), and ANALYZE TABLE (refresh statistics) regularly to keep tables healthy. But scheduling these manually for 500+ tables is tedious.

Predictive Optimization does this AUTOMATICALLY. Databricks watches how your tables are used, learns patterns, and runs these maintenance commands at the right time. No manual scheduling needed.

Real-world analogy: Like a Roomba robot vacuum. Instead of manually vacuuming every room on a schedule, the Roomba learns your floor plan and cleans automatically when needed.

What it does:

  • Auto-OPTIMIZE: Compacts small files into large ones when needed
  • Auto-VACUUM: Cleans up old unused files to free storage
  • Auto-ANALYZE: Refreshes table statistics for better query plans
  • Learns patterns: If a table gets heavy writes at midnight, it optimizes at 2am
  • Enabled by default on all new Unity Catalog managed tables (since 2025)
  • No configuration needed — just use managed tables!

Amadeus answer: "For our 500+ Delta tables, Predictive Optimization eliminates manual maintenance scheduling. The platform learns each table's access patterns and optimizes automatically — saving our team hours of DevOps work every week."

QUICK REVISION CHECKLIST — DAY 3

Test yourself — can you answer each in 2-3 minutes?

  • What is Unity Catalog? What are the 6 pillars? Explain the 3-level namespace. (Q1)
  • How do you implement column masking and row-level security? Write the SQL. (Q2)
  • What is data lineage? How does Unity Catalog track it automatically? (Q3)
  • What are Storage Credentials → External Locations → External Tables? (Q4)
  • How do you migrate from Hive Metastore to Unity Catalog? (Q5)
  • What is ABAC? How is it different from RBAC? Write a tag-based policy. (Q6)
  • What is Photon? When does it help? When does it NOT help? (Q7)
  • What is Serverless compute? When to use vs standard clusters? (Q8)
  • Job Cluster vs All-Purpose vs Serverless — when to use each? (Q10)
  • Explain the Azure Databricks 3-plane architecture. (Q11)
  • How does Key Vault integration work? Why never hardcode secrets? (Q12)
  • What are Service Principals? Why use them for production? (Q13)
  • How do you handle GDPR "right to be forgotten" in Delta Lake? (Q14)
  • How would you design a governance framework for 200+ data engineers? (Q15)
  • What is Delta Sharing? How would Amadeus share data with airline partners? (Q16)
  • What are Multi-table Transactions? When to use? (Q17)
  • What is Compatibility Mode? When do you need it? (Q18)
  • What is Lakebase? When to use vs Delta tables? (Q19)
  • What is Predictive Optimization? (Q20)