Day 3: Azure Databricks Platform & Governance
SECTION 1: UNITY CATALOG (1.5 hours)
Q1: What is Unity Catalog? What is the object hierarchy?
Simple Explanation: In Databricks, you have hundreds of tables, ML models, files, and functions. Unity Catalog is the single place that manages ALL of them — who can access what, where data came from, how it's organized, and what changed.
Think of Unity Catalog as the security guard + librarian + receptionist of your entire data platform:
- Security guard: Controls who can access which tables (access control)
- Librarian: Organizes all data into catalogs/schemas so you can find it easily (discovery)
- Receptionist: Keeps a log of who accessed what and when (auditing)
Why do we need it? Without Unity Catalog: Each team creates tables in random locations, no one knows who has access to what, PII data leaks because there's no control, and you can't trace where data came from. With Unity Catalog: One central place to govern everything — tables, ML models, files, permissions.
The hierarchy (how data is organized):
Three-level namespace — how you reference any table:
SELECT * FROM travel_prod.bookings.fact_flights;
-- ^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^
-- catalog schema table
-- Read as: "In the travel_prod catalog, bookings schema, fact_flights table"
Why 3 levels? Before Unity Catalog, Databricks used Hive Metastore with only 2 levels (database.table). With 3 levels, you can separate environments (prod/dev), domains (bookings/flights), AND tables — much better organization.
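The resolution of a three-level name can be sketched in plain Python (illustrative only, not a Databricks API):

```python
def parse_table_ref(ref: str) -> dict:
    """Split a Unity Catalog three-level name into its parts."""
    parts = ref.split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected catalog.schema.table, got: {ref}")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}

ref = parse_table_ref("travel_prod.bookings.fact_flights")
# catalog = environment/domain, schema = subject area, table = the data itself
```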
Six pillars of Unity Catalog (what it does):
- Access Control — GRANT/REVOKE permissions at any level (catalog, schema, table, column)
- Discovery — Search, browse, preview all data assets in one place
- Lineage — Automatically tracks which table feeds into which table, column by column
- Auditing — Logs every access: who read what table, when, from which notebook
- Quality Monitoring — Detects anomalies in data quality automatically (new 2026)
- Data Sharing — Share data with external partners via Delta Sharing protocol
Interview tip: When asked about Unity Catalog, don't just describe the hierarchy. Mention the 6 pillars — it shows you understand it's not just a "table organizer" but a full governance platform.
Q2: What is row-level and column-level security?
Simple Explanation:
Column-level security (column masking): Different users see different values for the SAME column. For example, the HR team sees full email addresses, but the analytics team sees masked emails (kri***@gmail.com).
Row-level security (row filtering): Different users see different ROWS from the same table. For example, the Lufthansa team sees only Lufthansa bookings, the Air India team sees only Air India bookings.
Why do we need this at Amadeus? Amadeus has 200+ airline partners. Each airline should ONLY see their own bookings. And passenger PII (email, phone) should be hidden from teams that don't need it (GDPR compliance).
-- ============================================
-- COLUMN MASKING: Hide passenger email based on user's team
-- ============================================
-- Step 1: Create a masking function
-- This function decides what each user sees for the "email" column
CREATE FUNCTION mask_email(email STRING)
RETURN CASE
WHEN is_member('gdpr_admins') THEN email
-- GDPR admins see the full email: "krishna@gmail.com"
WHEN is_member('booking_agents') THEN CONCAT(LEFT(email, 3), '***@', SPLIT(email, '@')[1])
-- Booking agents see partial: "kri***@gmail.com"
ELSE '***@***.***'
-- Everyone else sees fully masked: "***@***.***"
END;
-- Step 2: Apply the mask to the column
ALTER TABLE passengers ALTER COLUMN email SET MASK mask_email;
-- Now every query on passengers.email automatically applies the mask!
-- No code changes needed in any notebook or report.
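The masking logic above can be mirrored in plain Python to see exactly what each group gets (a simulation of the rule, not how Unity Catalog evaluates it; group names come from the SQL example):

```python
def mask_email(email: str, groups: set) -> str:
    """Mirror of the SQL mask_email function: what each group sees."""
    if "gdpr_admins" in groups:
        return email                          # full email
    if "booking_agents" in groups:
        local, domain = email.split("@", 1)
        return local[:3] + "***@" + domain    # partial mask
    return "***@***.***"                      # fully masked
```

Same input, three different outputs depending on group membership, which is exactly why no per-team views or report changes are needed.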
-- ============================================
-- ROW-LEVEL SECURITY: Airlines see only their own bookings
-- ============================================
-- Step 1: Create a filter function
-- This function decides which ROWS each user can see
CREATE FUNCTION airline_access(airline_code STRING)
RETURN CASE
WHEN is_member('amadeus_admins') THEN TRUE
-- Admins see ALL rows (all airlines)
WHEN is_member('lufthansa_team') THEN airline_code = 'LH'
-- Lufthansa team sees only rows where airline_code = 'LH'
WHEN is_member('air_india_team') THEN airline_code = 'AI'
-- Air India team sees only rows where airline_code = 'AI'
ELSE FALSE
-- Everyone else sees NO rows
END;
-- Step 2: Apply the filter to the table
ALTER TABLE bookings SET ROW FILTER airline_access ON (airline_code);
-- Now when a Lufthansa user queries bookings, they automatically see only LH rows!
-- The filter is enforced by Unity Catalog at query time, so it can't be bypassed.
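The row-filter behavior can be simulated the same way (plain Python; groups and airline codes come from the example above):

```python
def airline_filter(groups: set):
    """Mirror of the SQL airline_access function: returns a row predicate."""
    if "amadeus_admins" in groups:
        return lambda row: True                            # admins see all rows
    if "lufthansa_team" in groups:
        return lambda row: row["airline_code"] == "LH"
    if "air_india_team" in groups:
        return lambda row: row["airline_code"] == "AI"
    return lambda row: False                               # everyone else: no rows

bookings = [{"id": 1, "airline_code": "LH"}, {"id": 2, "airline_code": "AI"}]
visible = [r for r in bookings if airline_filter({"lufthansa_team"})(r)]
# Lufthansa sees only the LH row
```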
Amadeus use case: "200+ airline partners each access only their own booking data via row-level security, while passenger PII is masked for non-authorized teams. This is all configured in Unity Catalog — no application-level code needed."
Interview tip: Mention that these policies are enforced by Unity Catalog at query time on the tables themselves, so even if someone writes new SQL or connects via Power BI, the rules still apply. They can't be bypassed at the application level.
Q3: What is data lineage? How does Unity Catalog track it?
Simple Explanation: Data lineage answers the question: "Where did this data come from?" and "What depends on this data?"
Unity Catalog automatically tracks lineage at the column level. You don't need to set anything up — it just works. Every time a notebook, job, or Lakeflow pipeline reads from table A and writes to table B, Unity Catalog records that connection.
Real-world analogy: Think of lineage like a family tree for your data. You can trace any piece of data back to its source (parents) and see what it feeds into (children).
Example:
If someone asks "Where does the daily_revenue.total_fare column come from?", lineage shows:
daily_revenue.total_fare ← Silver.clean_bookings.fare_amount ← Bronze.raw_bookings.fare ← Oracle.BOOKING_AMT
Use cases for Amadeus:
- Impact analysis: "If we rename the fare_amount column in Silver, which Gold tables will break?"
- Root cause debugging: "Gold revenue report shows wrong numbers — let me trace back to see where the data comes from"
- GDPR compliance: "Show me ALL tables that contain passenger email — I need to delete a passenger's data"
- Data quality: "Null values appearing in Gold — lineage shows they come from a specific Bronze source file"
How to view: Open Catalog Explorer in Databricks UI → click on any table → click "Lineage" tab.
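Lineage is just a directed graph. A small Python sketch of the upstream walk, with edges taken from the example chain above:

```python
# Upstream edges: each column -> its direct source column
upstream = {
    "daily_revenue.total_fare": "Silver.clean_bookings.fare_amount",
    "Silver.clean_bookings.fare_amount": "Bronze.raw_bookings.fare",
    "Bronze.raw_bookings.fare": "Oracle.BOOKING_AMT",
}

def trace_to_source(column: str) -> list:
    """Walk upstream edges back to the original source column."""
    path = [column]
    while path[-1] in upstream:
        path.append(upstream[path[-1]])
    return path
```

Impact analysis is the same walk in the opposite direction: invert the edges and ask which downstream columns depend on a given source.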
Q4: What are Storage Credentials and External Locations?
Simple Explanation: When you create an external table (data stored in YOUR ADLS Gen2 account, not managed by Databricks), Unity Catalog needs a way to access that storage. This is done through a chain:
- Storage Credential = "Here are the keys to my cloud storage" (Azure Service Principal or Managed Identity)
- External Location = "Use those keys to access THIS specific folder" (maps a storage path to the credential)
- External Table = "Create a table pointing to data at that location"
Real-world analogy:
- Storage Credential = Master key to the building (access to Azure storage account)
- External Location = Permission to enter a specific room (access to a specific folder in ADLS)
- External Table = A desk in that room (a table pointing to files in that folder)
-- Step 1: Create a storage credential (admin task — done once)
CREATE STORAGE CREDENTIAL amadeus_adls_cred
WITH AZURE_MANAGED_IDENTITY (
access_connector_id = '/subscriptions/.../accessConnectors/databricks-connector'
);
-- This tells Databricks: "Use this Azure Managed Identity to access ADLS"
-- Step 2: Create an external location (map a folder to the credential)
CREATE EXTERNAL LOCATION amadeus_raw_data
URL 'abfss://raw@amadeusstorage.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL amadeus_adls_cred);
-- This tells Databricks: "You can access this specific ADLS folder using that credential"
-- Step 3: Create an external table at that location
CREATE TABLE travel_prod.bookings.legacy_flights
LOCATION 'abfss://raw@amadeusstorage.dfs.core.windows.net/legacy/flights/';
-- This creates a table pointing to existing data files in ADLS
-- Data lives in YOUR storage — Databricks just reads it
When do you need this? When migrating from existing data lakes, when data is shared with other systems, or when your company's policy requires data to stay in their own storage accounts.
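Conceptually, Unity Catalog resolves a storage path to a credential by prefix-matching it against registered external locations. A simplified Python sketch (the real resolution logic is internal to Databricks; names come from the example above):

```python
# External location name -> (URL prefix, storage credential), from the example
external_locations = {
    "amadeus_raw_data": (
        "abfss://raw@amadeusstorage.dfs.core.windows.net/",
        "amadeus_adls_cred",
    ),
}

def credential_for_path(path: str):
    """Find which external location (and thus credential) covers a path."""
    for name, (prefix, cred) in external_locations.items():
        if path.startswith(prefix):
            return name, cred
    return None  # no external location covers this path: access denied
```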
Q5: How do you migrate from Hive Metastore to Unity Catalog?
Simple Explanation: Before Unity Catalog existed, Databricks used Hive Metastore (the old system) to manage tables. It had only 2 levels (database.table), limited security, and no lineage. Many companies — including Amadeus — need to migrate from Hive to Unity Catalog.
Why migrate? Hive Metastore has no row-level security, no column masking, no lineage, no ABAC, no data sharing. Unity Catalog has all of these.
Typical migration path: (1) run UCX, the Databricks Labs migration toolkit, to assess tables, jobs, and incompatibilities; (2) create the Unity Catalog metastore and map Hive databases to catalogs and schemas; (3) upgrade external tables with the SYNC command and clone managed tables; (4) update notebooks and jobs to use three-level names; (5) run Hive and Unity Catalog in parallel (e.g., for 2 weeks) to validate results, then cut over.
Interview tip: Mention the UCX tool — it shows you know the practical migration path, not just theory. Also mention "parallel run for 2 weeks" — it shows you understand production safety.
SECTION 2: NEW — ABAC (Attribute-Based Access Control) (45 min)
Q6: What is ABAC? How is it different from RBAC?
Simple Explanation:
RBAC (Role-Based Access Control) is the traditional way: you create roles and GRANT permissions on specific tables to each role. If you have 500 tables, you need 500 GRANT statements for each role. When a new table is added, you must manually GRANT access to it.
ABAC (Attribute-Based Access Control) is the new way: you TAG tables with attributes (like "pii=true") and create ONE policy that says "only gdpr_authorized users can access tables tagged pii=true." When a new PII table is added and tagged, the policy automatically applies — no manual GRANT needed.
Real-world analogy:
- RBAC = Security badges with specific room numbers: "Badge allows Room 101, 102, 103". New room added? Update every badge.
- ABAC = Security badges with rules: "Badge allows any room tagged 'public'". New public room added? Already accessible — no badge update needed.
Public Preview since November 2025 — knowing this in your interview shows you're up to date!
| Aspect | RBAC (Traditional) | ABAC (New) |
|---|---|---|
| How it works | GRANT permissions on specific tables | Create policies based on tags |
| Scale | 500 tables = 500 GRANT statements per role | 1 policy covers ALL tables with matching tag |
| New table added | Must manually GRANT access | Just TAG the table → policy auto-applies |
| Maintenance | High — every new table needs manual grants | Low — tag-based, automatic |
| Dynamic | No — static grants | Yes — add/remove tags at any time |
-- ============================================
-- RBAC way (old — lots of manual work):
-- ============================================
GRANT SELECT ON TABLE passengers TO gdpr_team;
GRANT SELECT ON TABLE passenger_contacts TO gdpr_team;
GRANT SELECT ON TABLE passenger_addresses TO gdpr_team;
-- ... repeat for every PII table (500+ statements!)
-- New PII table added? Must remember to add another GRANT!
-- ============================================
-- ABAC way (new — one policy covers everything):
-- ============================================
-- Step 1: Tag tables with attributes
ALTER TABLE passengers SET TAGS ('pii' = 'true', 'gdpr_scope' = 'eu');
-- This table contains PII data and falls under EU GDPR
ALTER TABLE bookings SET TAGS ('pii' = 'false');
-- This table doesn't contain PII
ALTER TABLE passenger_contacts SET TAGS ('pii' = 'true', 'gdpr_scope' = 'eu');
-- Another PII table — just tag it!
-- Step 2: Create ONE policy that covers ALL PII tables
CREATE FUNCTION pii_access_policy()
RETURN CASE
WHEN has_tag('pii', 'true') AND NOT is_member('gdpr_authorized') THEN FALSE
-- If table is tagged pii=true AND user is NOT in gdpr_authorized group → DENY access
ELSE TRUE
-- Otherwise → ALLOW access
END;
-- has_tag() checks if the table being accessed has a specific tag
-- is_member() checks if the current user belongs to a specific group
-- Step 3: Apply policy to entire catalog (all tables in it)
ALTER CATALOG travel_prod SET ACCESS CONTROL USING pii_access_policy;
-- Now EVERY table in travel_prod with tag pii=true is automatically protected!
-- New PII table added? Just tag it pii=true → protection is automatic!
Note: ABAC uses snake_case functions: has_tag, has_tag_value, is_member. CamelCase (hasTag) is being deprecated.
Amadeus use case: "Instead of managing GRANT statements for 500+ tables, we tag all PII tables with pii=true and apply ONE ABAC policy. When any new table is created with that tag, it's automatically governed. This saves hours of manual access management."
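The tag-plus-policy evaluation can be simulated in a few lines of Python (tags and group names come from the example above; the real evaluation happens inside Unity Catalog):

```python
# Table -> tags, mirroring the ALTER TABLE ... SET TAGS statements above
table_tags = {
    "passengers": {"pii": "true", "gdpr_scope": "eu"},
    "bookings": {"pii": "false"},
    "passenger_contacts": {"pii": "true", "gdpr_scope": "eu"},
}

def pii_access_policy(table: str, groups: set) -> bool:
    """ONE policy: deny PII-tagged tables to anyone outside gdpr_authorized."""
    tags = table_tags.get(table, {})
    if tags.get("pii") == "true" and "gdpr_authorized" not in groups:
        return False
    return True
```

Note that adding a new table only requires adding a tag entry; the policy function never changes, which is the whole point of ABAC.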
SECTION 3: PHOTON ENGINE (30 min)
Q7: What is Photon Engine?
Simple Explanation: Normally, Spark runs on the JVM (Java Virtual Machine). It works, but it's not the fastest. Photon is Databricks' own query engine written in C++ (a much faster language for data processing). It replaces Spark's JVM execution with native C++ code — making queries 3-12x faster.
The best part: You don't change ANY code. Same SQL, same PySpark, same notebooks. You just pick "Photon" runtime when creating your cluster, and everything runs faster automatically.
Real-world analogy: Imagine your car (Spark on JVM) gets a turbocharged engine swap (Photon in C++). Same car, same dashboard, same steering wheel — but much faster.
When Photon helps A LOT:
| Operation | Speed Improvement | Why |
|---|---|---|
| Table scans (reading data) | Up to 12x faster | C++ reads data more efficiently than JVM |
| Aggregations (SUM, COUNT, AVG) | 3-10x faster | Vectorized processing (processes many rows at once) |
| Joins | 3-8x faster | Better memory management |
| Filters (WHERE clauses) | 5-10x faster | Native code skips rows faster |
| MERGE/UPDATE/DELETE | 3-5x faster | Important for Silver layer at Amadeus! |
When Photon does NOT help:
| Operation | Why It Doesn't Help |
|---|---|
| Python UDFs | UDFs run in Python process, completely bypasses Photon |
| RDD-based code | Photon only works with DataFrame API and SQL, not old-style RDDs |
| Very small data (<100 MB) | The overhead of Photon startup isn't worth it for tiny datasets |
| ML training (MLlib) | ML uses different execution path — use GPU clusters instead |
| Unsupported operations | Photon silently falls back to Spark (no error — just runs slower) |
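Because unsupported operations fall back to plain Spark, the end-to-end gain follows Amdahl's law: only the Photon-covered fraction of runtime gets faster. A quick sketch:

```python
def overall_speedup(photon_fraction: float, photon_factor: float) -> float:
    """Amdahl's law: photon_fraction of runtime is sped up by photon_factor;
    the rest (Python UDFs, RDDs, unsupported ops) runs at normal Spark speed."""
    return 1.0 / ((1.0 - photon_fraction) + photon_fraction / photon_factor)

# A job spending 90% of its time in Photon-supported operations,
# each 5x faster, finishes only about 3.6x faster overall.
round(overall_speedup(0.9, 5.0), 1)
```

This is why a single slow Python UDF in the middle of a pipeline can quietly cap the benefit of the whole Photon runtime.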
# HOW TO ENABLE PHOTON:
# Option 1: Select Photon runtime when creating cluster (recommended)
# In Databricks UI: Cluster → Databricks Runtime → select "18.1 LTS Photon"
# That's it! No code changes needed.
# Option 2: For clusters defined as JSON (Jobs/Clusters API), set the
# runtime engine field in the cluster spec:
#   "runtime_engine": "PHOTON"
# Note: Photon is a cluster-level choice — it cannot be switched on from
# inside a running notebook with spark.conf.set().
# Your existing SQL and PySpark code runs 3-12x faster — no modifications!
Amadeus answer: "We use Photon for all our Silver-layer MERGE operations on booking tables. The 3-5x speedup on MERGE reduced our daily pipeline from 3 hours to 45 minutes. No code changes were needed — we just switched the runtime."
Interview tip: Don't just say "Photon makes things faster." Say WHEN to use it and WHEN NOT to use it. Mention "falls back gracefully" — if Photon can't handle an operation, it silently uses regular Spark instead of crashing.
SECTION 4: SERVERLESS COMPUTE (45 min)
Q8: What is Serverless compute in Databricks?
Simple Explanation: Normally in Databricks, you create a cluster (a group of VMs/machines) to run your code. You pick the machine type, the number of machines, configure auto-scaling, etc. If the cluster sits idle for hours, you're still paying. If it takes 10 minutes to start up, you wait.
Serverless means: Databricks manages the machines FOR you. You just submit your query/job, and Databricks instantly provides compute resources. When you're done, resources are released. You only pay for the time your code is actually running.
Real-world analogy:
- Standard cluster = Owning a car. You pay for insurance, parking, gas even when it's parked.
- Serverless = Uber. You pay only when you're riding. No car maintenance, no parking fees.
| Aspect | Standard Clusters | Serverless |
|---|---|---|
| Who manages it? | You configure machine type, count, etc. | Databricks manages everything |
| Startup time | 3-10 minutes (slow!) | <10 seconds (instant!) |
| Scaling | You configure min/max workers | Automatic — scales up/down as needed |
| Billing | Pay for running time (EVEN when idle) | Pay ONLY for execution time |
| Cost savings | — | Up to 70% savings for bursty workloads |
| Compliance | Standard | HITRUST, PCI-DSS, TISAX (important for travel!) |
What can run on Serverless?
- SQL Warehouses (most mature — been around longest)
- Notebooks (interactive development)
- Jobs/Workflows (scheduled production pipelines)
- Lakeflow Pipelines
- JAR tasks (new — February 2026)
- AI Runtime with GPU (Public Preview — March 2026)
Q9: What are Serverless Workspaces?
Simple Explanation: A regular Databricks workspace requires you to set up compute configurations, storage, networking, etc. A Serverless Workspace (GA January 2026) is a pre-configured workspace where EVERYTHING is managed — compute, storage, networking. You just log in and start working.
Think of it as: "Databricks out of the box — no setup needed."
Key features:
- Managed compute — no cluster configuration needed
- Managed storage — Databricks handles where data is stored
- Instant startup — everything is ready in seconds
- Auto-scaling — resources scale with your workload
- Network boundary isolation — secure by default
When to use Serverless vs Standard:
| Workload | Recommendation | Why |
|---|---|---|
| SQL analytics (BI dashboards) | Serverless SQL Warehouse | Auto-scale, auto-suspend, no idle cost |
| Bursty/intermittent jobs | Serverless | Pay only when running — big cost savings |
| Development/ad-hoc exploration | Serverless | Instant startup, no idle cost |
| 24/7 streaming (always running) | Standard | Predictable load → predictable cost is cheaper |
| GPU workloads (ML training) | Standard | GPU serverless is still in preview |
| Very specific hardware needs | Standard | Serverless doesn't let you pick machine types |
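The cost difference for bursty workloads is easy to make concrete. The hourly rates below are made up for illustration; the point is that paying for idle time dominates:

```python
def monthly_cost(hours_running: float, hours_idle: float,
                 rate_per_hour: float, pay_for_idle: bool) -> float:
    """Standard clusters bill for idle time too; serverless bills only execution."""
    billed = hours_running + (hours_idle if pay_for_idle else 0.0)
    return billed * rate_per_hour

# Bursty workload: 2h of real work per day, cluster otherwise idle 8h/day, 30 days.
# Hypothetical rates: standard $4/h, serverless $6/h (serverless rates are higher).
standard = monthly_cost(2 * 30, 8 * 30, rate_per_hour=4.0, pay_for_idle=True)
serverless = monthly_cost(2 * 30, 8 * 30, rate_per_hour=6.0, pay_for_idle=False)
```

With these numbers, serverless costs 360 vs 1200 for standard, a 70% saving, matching the "up to 70% for bursty workloads" row above. For a 24/7 workload (zero idle hours), the higher serverless rate makes standard cheaper, which is exactly the table's recommendation.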
Q10: What is the difference between Job Cluster, All-Purpose Cluster, and Serverless?
Simple Explanation: Databricks has 3 types of compute. Knowing which to use WHERE is a common interview question:
All-Purpose Cluster = A long-running cluster shared by multiple users. Like a shared office computer — always on, multiple people use it. MOST EXPENSIVE.
Job Cluster = A cluster created just for one job and destroyed after. Like a rental car — rent it for the trip, return it after. CHEAPER than All-Purpose.
Serverless = No cluster at all — Databricks provides compute on demand. Like Uber — just request a ride, pay per minute. CHEAPEST for bursty work.
| Aspect | All-Purpose | Job Cluster | Serverless |
|---|---|---|---|
| Lifecycle | Runs until you stop it | Created per job, destroyed after | On-demand per query/task |
| Cost | Most expensive (all-purpose pricing) | Cheaper (job compute pricing) | Cheapest for bursty |
| Sharing | Multiple users share it | Single job only | Single job/query |
| Startup time | Instant (already running) | 3-10 minutes (creating VMs) | <10 seconds |
| When to use | Development ONLY | Production scheduled jobs | BI queries, bursty workloads |
Golden rule for production: NEVER use All-Purpose clusters for production pipelines. They're too expensive. Use Job Clusters or Serverless.
Amadeus answer: "We use All-Purpose clusters only for development notebooks. Production ETL pipelines run on Job Clusters with Photon for cost efficiency. BI dashboards connect via Serverless SQL Warehouses that auto-suspend when idle — zero cost overnight."
SECTION 5: AZURE-SPECIFIC INTEGRATION (45 min)
Q11: How does Azure Databricks architecture work? (Three-plane model)
Simple Explanation: Azure Databricks has 3 main components (called "planes") that work together. Understanding this helps you answer architecture questions and troubleshoot issues.
┌──────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ (Managed by Databricks in Azure) │
│ │
│ What lives here: │
│ ├── Databricks Workspace UI (the web interface) │
│ ├── REST APIs (for automation) │
│ ├── Job Scheduler (runs your scheduled jobs) │
│ ├── Unity Catalog Metastore (table metadata) │
│ └── Cluster Manager (creates/destroys VMs) │
│ │
│ Key point: You don't manage this. Databricks │
│ runs it in THEIR Azure subscription. │
└──────────────────┬───────────────────────────────┘
│ Secure encrypted connection
┌──────────────────▼───────────────────────────────┐
│ COMPUTE PLANE │
│ (In YOUR Azure subscription) │
│ │
│ What lives here: │
│ ├── VMs (cluster worker/driver nodes) │
│ ├── VNet (Virtual Network for isolation) │
│ ├── NSG (Network Security Groups — firewall) │
│ └── Managed Resource Group (Databricks creates) │
│ │
│ Key point: VMs run in YOUR subscription, │
│ so your data never leaves your network. │
└──────────────────┬───────────────────────────────┘
│
┌──────────────────▼───────────────────────────────┐
│ STORAGE PLANE │
│ (In YOUR Azure subscription) │
│ │
│ What lives here: │
│ ├── ADLS Gen2 (your main data lake storage) │
│ ├── Azure Blob Storage │
│ └── DBFS root storage (Databricks internal) │
│ │
│ Key point: Your data stays in YOUR storage. │
│ Databricks never stores your data. │
└──────────────────────────────────────────────────┘
Why this matters for Amadeus: "Our passenger PII data never leaves our Azure subscription. Databricks Control Plane manages orchestration, but the actual data processing (Compute) and storage (ADLS Gen2) happen inside our own subscription — important for GDPR compliance."
Key Azure services that integrate with Databricks:
| Azure Service | What It Does With Databricks |
|---|---|
| ADLS Gen2 | Primary storage for all Delta tables (the data lake) |
| Azure Key Vault | Stores secrets (passwords, API keys) — Databricks reads them securely |
| Microsoft Entra ID (formerly Azure Active Directory) | Single sign-on (SSO), user provisioning, Service Principals |
| Azure Event Hubs | Kafka-compatible messaging — used for CDC streaming |
| Azure DevOps | CI/CD pipelines to deploy Databricks jobs |
| Azure Data Factory | Can orchestrate Databricks jobs (alternative to Databricks Workflows) |
| Power BI | Connects directly to Databricks SQL Warehouse for BI reports |
Q12: What is Azure Key Vault integration? Why is it important?
Simple Explanation: You should NEVER put passwords, API keys, or connection strings directly in your code or notebooks. Anyone who sees the notebook can see the password. Instead, store secrets in Azure Key Vault (a secure vault in Azure), and Databricks reads them at runtime.
Real-world analogy: Instead of writing your house key code on a Post-it note stuck to the door, you keep it in a safe (Key Vault) and only authorized people can open the safe.
# BAD — NEVER do this! Password visible in notebook:
# oracle_password = "MySecretP@ssw0rd"
# GOOD — Read password from Azure Key Vault at runtime:
# Step 1: (Done ONCE by admin) Create a "secret scope" in Databricks
# that connects to your Azure Key Vault:
# databricks secrets create-scope \
# --scope amadeus-secrets \
# --scope-backend-type AZURE_KEYVAULT \
# --resource-id /subscriptions/.../vaults/amadeus-kv \
# --dns-name https://amadeus-kv.vault.azure.net/
# Step 2: In your notebook, read secrets using dbutils:
oracle_password = dbutils.secrets.get(
scope="amadeus-secrets", # Which vault to read from
key="oracle-prod-password" # Which secret to get
)
# The password is now in the variable — but if you try to print it,
# Databricks shows [REDACTED] in the output (security feature!)
kafka_connection = dbutils.secrets.get(
scope="amadeus-secrets",
key="eventhubs-connection-string" # Connection string for Azure Event Hubs
)
# Use these variables in your JDBC connections, Kafka configs, etc.
Q13: What are Service Principals? Why use them for production?
Simple Explanation: A Service Principal is like a "robot user" account in Azure. Instead of using a real person's login to run production jobs, you create a Service Principal (non-human identity) that represents your application.
Why not use a real person's account?
- What if the person leaves the company? All jobs break.
- What if they change their password? All jobs break.
- What if MFA (multi-factor authentication) pops up? Job can't enter the code.
- Audit logs show "Krishna ran this job" — was it Krishna manually, or the scheduled job?
Amadeus answer: "All our production pipelines run under Service Principals, not user accounts. This ensures jobs aren't affected when employees change teams, and audit logs clearly show which application accessed what data."
SECTION 6: DATA GOVERNANCE AT SCALE (1 hour)
Q14: How do you handle GDPR "Right to Be Forgotten" in Delta Lake?
Simple Explanation: GDPR (EU data protection law) gives every person the right to say: "Delete ALL my data from your systems." For Amadeus, this means: when a passenger requests deletion, we must remove their data from EVERY table — and prove it's really gone.
The challenge with Delta Lake: Delta keeps history (time travel). Even after you DELETE a row, the old data is still in the old Parquet files. You must also VACUUM to physically remove those old files.
Two approaches:
Approach 1: Hard Delete (complete removal)
-- Step 1: Delete the passenger from ALL tables that contain their data
DELETE FROM dim_passenger WHERE passenger_id = 'PAX-12345';
-- Removes this passenger's row from the dimension table
DELETE FROM fact_bookings WHERE passenger_id = 'PAX-12345';
-- Removes all their bookings
DELETE FROM silver_interactions WHERE passenger_id = 'PAX-12345';
-- Removes all their interaction records
-- Step 2: VACUUM with 0 hours retention to physically remove old files
-- WARNING: This breaks time travel! But GDPR requires it.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
-- Must disable the safety check (Delta normally prevents 0-hour vacuum)
VACUUM dim_passenger RETAIN 0 HOURS;
-- Physically deletes ALL old files — data is truly gone
VACUUM fact_bookings RETAIN 0 HOURS;
VACUUM silver_interactions RETAIN 0 HOURS;
-- Step 3: Log the deletion for compliance audit
INSERT INTO gdpr_deletion_log VALUES (
'PAX-12345', -- Which passenger was deleted
current_timestamp(), -- When it was deleted
'completed', -- Status
'tables: dim_passenger, fact_bookings, silver_interactions' -- Which tables
);
-- This log PROVES to regulators that we deleted the data
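For 500+ tables you would script this rather than hand-write SQL per request. A minimal Python sketch that generates the statement sequence (table names from the example; in production you would validate the passenger id rather than interpolate it directly):

```python
def gdpr_delete_statements(passenger_id: str, tables: list) -> list:
    """Generate the DELETE + VACUUM sequence for a right-to-be-forgotten request."""
    stmts = [f"DELETE FROM {t} WHERE passenger_id = '{passenger_id}';"
             for t in tables]
    # Safety check must be disabled before a 0-hour vacuum (breaks time travel!)
    stmts.append("SET spark.databricks.delta.retentionDurationCheck.enabled = false;")
    stmts += [f"VACUUM {t} RETAIN 0 HOURS;" for t in tables]
    return stmts
```

Pairing this with the lineage question (Q3) is the complete story: lineage tells you WHICH tables hold the passenger's data; this generates the deletions for all of them.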
Approach 2: Pseudonymization (preferred — keeps analytics working)
Instead of deleting the entire row, replace PII with random/hashed values. The row stays for analytics, but the person can't be identified.
UPDATE dim_passenger SET
full_name = 'REDACTED', -- Can't identify the person
email = CONCAT(MD5(email), '@redacted.com'), -- Hashed (one-way, can't reverse)
phone = 'REDACTED',
address = 'REDACTED'
WHERE passenger_id = 'PAX-12345';
-- The booking data still exists for analytics (revenue, route stats)
-- But the person's identity is completely removed
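The hashing step can be reproduced with Python's stdlib to show it is deterministic and one-way (mirroring the SQL MD5 approach above; in practice a salted or keyed hash such as HMAC-SHA256 is stronger than plain MD5):

```python
import hashlib

def pseudonymize_email(email: str) -> str:
    """One-way hash of the email, as in the SQL UPDATE above.
    Deterministic: the same email always maps to the same token,
    so joins on the pseudonymized value still work for analytics."""
    digest = hashlib.md5(email.encode("utf-8")).hexdigest()
    return digest + "@redacted.com"
```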
Interview tip: Mention BOTH approaches and when to use each. Hard delete = when regulation requires complete removal. Pseudonymization = when you need to keep analytics data but remove identity.
Q15: How would you design a governance framework for 200+ data engineers?
Simple Explanation: With 200+ engineers, you can't rely on "everyone will follow the rules." You need automated governance — policies enforced at the platform level, not just documented in a wiki. The building blocks are the features covered in this guide: one Unity Catalog metastore as the single source of truth; tag-based ABAC policies so new tables are governed automatically (Q6); row filters and column masks for PII (Q2); Service Principals instead of personal accounts for every production job (Q13); lineage and audit logs for impact analysis and compliance evidence (Q3); and Predictive Optimization so table maintenance doesn't depend on 200 people remembering to run OPTIMIZE (Q20).
Q16: What is Delta Sharing?
Simple Explanation: Delta Sharing is a way to share data with external partners (other companies) without copying the data. The partner reads directly from YOUR storage. You control what they can access, and you see audit logs of what they read.
Why is this useful for Amadeus? Amadeus works with 200+ airlines. Each airline needs access to their own booking data. Instead of creating copies for each airline (expensive, hard to keep in sync), you share the data directly using Delta Sharing.
Real-world analogy: Instead of printing copies of a report for each airline (copies get outdated), you give each airline a "viewing link" to the live report. They always see the latest data, and you control who can view what.
How it works:
-- Step 1: Create a share (a named collection of shared tables)
CREATE SHARE airline_bookings_share;
-- Step 2: Add a table to the share — with FILTERING (only Lufthansa data!)
ALTER SHARE airline_bookings_share ADD TABLE travel_prod.bookings.fact_bookings
PARTITION (airline_code = 'LH');
-- Only rows where airline_code = 'LH' are shared. Lufthansa can't see other airlines.
-- Step 3: Create a recipient and grant access
CREATE RECIPIENT lufthansa USING ID 'azure:westeurope:<metastore-uuid>';
-- The sharing identifier (cloud:region:uuid) comes from the recipient's own metastore
GRANT SELECT ON SHARE airline_bookings_share TO RECIPIENT lufthansa;
-- Lufthansa now has read-only access to their filtered booking data
New December 2025: Delta Sharing now supports Iceberg clients — recipients can use Snowflake, Trino, or Flink to read shared Delta data. The recipient doesn't even need Databricks!
SECTION 7: NEW 2026 FEATURES (30 min)
Q17: What are Multi-table Transactions?
Simple Explanation: Before March 2026, Delta Lake transactions were per-table. If you needed to update 3 tables atomically (all succeed or all fail), you had to write complex error-handling code.
Multi-table Transactions (Public Preview March 2026) let you wrap multiple table operations in a single atomic block using BEGIN ATOMIC ... END;. Either ALL operations succeed, or NONE of them apply.
Real-world analogy: Transferring money between bank accounts. You want to debit Account A AND credit Account B atomically. If the credit fails, the debit should also be rolled back. Multi-table transactions = this guarantee for your data tables.
-- Process a new booking: update 3 tables atomically
BEGIN ATOMIC
-- 1. Insert the booking into the fact table
INSERT INTO fact_bookings VALUES (...);
-- 2. Update the passenger's loyalty tier
UPDATE dim_passenger SET loyalty_tier = 'Gold'
WHERE passenger_id = 'PAX-123';
-- 3. Record in the audit log
INSERT INTO audit_log VALUES (...);
END;
-- If ANY of these 3 operations fails, ALL are rolled back
-- No half-completed bookings in the system!
Requirement: All tables must be Unity Catalog managed tables (not external tables).
Amadeus use case: "When processing a booking, we update the fact table, the passenger dimension, and the audit log atomically. Multi-table transactions guarantee we never have a booking without a corresponding audit entry."
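The all-or-nothing semantics can be illustrated in plain Python by staging changes and committing only if every step succeeds (a toy model of BEGIN ATOMIC, not how Delta actually implements it):

```python
import copy

def run_atomic(tables: dict, operations: list) -> dict:
    """Apply all operations to a staging copy; commit only if every one succeeds."""
    staged = copy.deepcopy(tables)     # work on a staging copy, not live state
    try:
        for op in operations:
            op(staged)                 # each op mutates the staged state
    except Exception:
        return tables                  # any failure: nothing is applied
    return staged                      # all succeeded: commit the staged state

tables = {"fact_bookings": [], "audit_log": []}

ok = run_atomic(tables, [
    lambda t: t["fact_bookings"].append("BK-1"),
    lambda t: t["audit_log"].append("BK-1 inserted"),
])

def failing_audit_write(t):
    raise RuntimeError("audit write failed")

failed = run_atomic(tables, [
    lambda t: t["fact_bookings"].append("BK-2"),
    failing_audit_write,
])
# `failed` still equals the original state: the BK-2 insert was rolled back too
```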
Q18: What is Compatibility Mode?
Simple Explanation: Delta Lake uses a specific format that only Delta-compatible engines can read. But what if your partner uses Snowflake or AWS Athena (which don't natively support Delta)? Compatibility Mode makes your Delta table readable by non-Delta engines — they see it as plain Parquet.
-- Enable compatibility mode
ALTER TABLE bookings SET TBLPROPERTIES ('delta.compatibility.mode' = 'read');
-- Now Athena, Snowflake, Azure Fabric can read this table
-- They see it as regular Parquet files — no Delta SDK needed
When to use: When you need to share data with teams/partners who use non-Databricks tools.
Q19: What is Lakebase?
Simple Explanation: Delta tables are great for analytics (batch queries, BI). But what about when an application needs to look up a passenger's details in 10 milliseconds? Delta tables can't do that — they're designed for scanning large amounts of data, not fast single-row lookups.
Lakebase (GA on Azure March 2026) is a serverless PostgreSQL-compatible database built into Databricks. It fills the gap between analytics (Delta) and applications (need fast lookups).
Think of it as: "A real database inside Databricks for application use cases."
Key features:
- Scale-to-zero: When nobody is querying, it costs $0 (auto-shuts down)
- Database branching: Create an instant copy for testing (like git branch for databases!)
- Instant restore: Go back to any point in time
- Auto-failover HA: If one server fails, another takes over automatically
-- Create a Lakebase database (uses PostgreSQL syntax!)
CREATE DATABASE amadeus_app_db ENGINE = 'LAKEBASE';
-- Create a lookup table (standard PostgreSQL)
CREATE TABLE lookup_airports (
iata_code VARCHAR(3) PRIMARY KEY, -- Airport code: BLR, DEL, JFK
airport_name VARCHAR(100), -- Full name
city VARCHAR(50),
country VARCHAR(50)
);
-- This can serve API requests in milliseconds!
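The point-lookup pattern Lakebase is built for can be sketched locally with Python's stdlib sqlite3 (a stand-in for illustration only; Lakebase itself speaks PostgreSQL). The key is the PRIMARY KEY index, which makes single-row lookups fast instead of scanning files:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lookup_airports (
        iata_code TEXT PRIMARY KEY,   -- indexed: point lookups, no table scan
        airport_name TEXT,
        city TEXT,
        country TEXT
    )
""")
conn.executemany(
    "INSERT INTO lookup_airports VALUES (?, ?, ?, ?)",
    [("BLR", "Kempegowda International", "Bengaluru", "India"),
     ("JFK", "John F. Kennedy International", "New York", "USA")],
)

def lookup(iata: str):
    """Single-row lookup by primary key, as an API backend would do."""
    return conn.execute(
        "SELECT airport_name, city FROM lookup_airports WHERE iata_code = ?",
        (iata,),
    ).fetchone()
```

A Delta table answering the same query would scan Parquet files; an OLTP engine answers via the index, which is the whole Lakebase-vs-Delta distinction in the table below.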
When to use Lakebase vs Delta tables:
| Use Case | Lakebase | Delta Tables |
|---|---|---|
| API backend (fast single-row lookups) | ✅ Perfect | ❌ Too slow for single rows |
| Batch analytics & BI reporting | ❌ Not designed for this | ✅ Perfect |
| Feature serving for ML (low-latency) | ✅ Good | ⚠️ OK for batch features |
| Application CRUD operations | ✅ Perfect | ❌ Not designed for this |
Q20: What is Predictive Optimization?
Simple Explanation: In Day 1, we learned that you need to run OPTIMIZE (compact files), VACUUM (clean up old files), and ANALYZE TABLE (refresh statistics) regularly to keep tables healthy. But scheduling these manually for 500+ tables is tedious.
Predictive Optimization does this AUTOMATICALLY. Databricks watches how your tables are used, learns patterns, and runs these maintenance commands at the right time. No manual scheduling needed.
Real-world analogy: Like a Roomba robot vacuum. Instead of manually vacuuming every room on a schedule, the Roomba learns your floor plan and cleans automatically when needed.
What it does:
- Auto-OPTIMIZE: Compacts small files into large ones when needed
- Auto-VACUUM: Cleans up old unused files to free storage
- Auto-ANALYZE: Refreshes table statistics for better query plans
- Learns patterns: If a table gets heavy writes at midnight, it optimizes at 2am
- Enabled by default on all new Unity Catalog managed tables (since 2025)
- No configuration needed — just use managed tables!
Amadeus answer: "For our 500+ Delta tables, Predictive Optimization eliminates manual maintenance scheduling. The platform learns each table's access patterns and optimizes automatically — saving our team hours of DevOps work every week."
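The core decision being automated is "does this table have too many small files?". A toy heuristic shows the idea (thresholds are made up; Predictive Optimization's actual model is internal to Databricks):

```python
def needs_optimize(file_sizes_mb: list, target_mb: int = 128,
                   min_small_files: int = 10) -> bool:
    """Flag a table for compaction when it has many files well below target size."""
    small = [s for s in file_sizes_mb if s < target_mb / 2]
    return len(small) >= min_small_files

# 50 tiny 2 MB files from streaming writes: compact them
assert needs_optimize([2] * 50)
# A few healthy ~128 MB files: leave the table alone
assert not needs_optimize([120, 130, 125])
```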
QUICK REVISION CHECKLIST — DAY 3
Test yourself — can you answer each in 2-3 minutes?
- What is Unity Catalog? What are the 6 pillars? Explain the 3-level namespace. (Q1)
- How do you implement column masking and row-level security? Write the SQL. (Q2)
- What is data lineage? How does Unity Catalog track it automatically? (Q3)
- What are Storage Credentials → External Locations → External Tables? (Q4)
- How do you migrate from Hive Metastore to Unity Catalog? (Q5)
- What is ABAC? How is it different from RBAC? Write a tag-based policy. (Q6)
- What is Photon? When does it help? When does it NOT help? (Q7)
- What is Serverless compute? When to use vs standard clusters? (Q8)
- Job Cluster vs All-Purpose vs Serverless — when to use each? (Q10)
- Explain the Azure Databricks 3-plane architecture. (Q11)
- How does Key Vault integration work? Why never hardcode secrets? (Q12)
- What are Service Principals? Why use them for production? (Q13)
- How do you handle GDPR "right to be forgotten" in Delta Lake? (Q14)
- What is Delta Sharing? How would Amadeus share data with airline partners? (Q16)
- What are Multi-table Transactions? When to use? (Q17)
- What is Lakebase? When to use vs Delta tables? (Q19)
- What is Predictive Optimization? (Q20)