Databricks · Section 14 of 17

Day 3: Azure Platform & Governance — Quick Recall Guide

🗺️ Memory Map
How to use this file:
  • ⚡ = Must remember (95% chance of being asked)
  • 🔑 = Key concept (core understanding needed)
  • ⚠️ = Common trap (interviewers love to test this)
  • 🧠 = Memory Map (mnemonic/acronym — memorize this!)
  • 📝 = One-liner (flash-card style — cover answer, test yourself)
Reading strategy: Read Memory Maps FIRST → then Direct Questions → then Mid-Level.

🧠 MASTER MEMORY MAP — Day 3

🧠 AZURE DATABRICKS PLATFORM = "UPGDS"
U = Unity Catalog (governance, security, lineage)
P = Photon Engine (fast C++ query engine)
G = GDPR/Governance (PII masking, right to be forgotten)
D = Delta Sharing (share data across organizations)
S = Serverless (no cluster management, instant start)
UNITY CATALOG HIERARCHY = "M-C-S-T" (like McDonald's Chicken Sandwich is Tasty)
M = Metastore (one per Azure region)
C = Catalog (environment: prod, dev, staging)
S = Schema (domain: bookings, flights, passengers)
T = Table/View/Function/Model/Volume
AZURE ARCHITECTURE = "3 Planes"
Control Plane → Databricks-managed (in Databricks' Azure subscription)
Data Plane → Customer-managed (YOUR Azure subscription — ADLS, VMs)
Web App → Databricks UI (notebooks, SQL editor)

SECTION 1: UNITY CATALOG

🧠 Memory Map: Unity Catalog

🧠 UNITY CATALOG = "Security Guard + Librarian + Receptionist"
Security Guard → Access control (GRANT/REVOKE)
Librarian → Discovery (search, browse all data assets)
Receptionist → Audit log (who accessed what, when)
6 PILLARS = "ADLAQS" (sounds like "AD-LACS")
A = Access Control (GRANT/REVOKE at any level)
D = Discovery (search + browse all tables)
L = Lineage (column-level data flow tracking)
A = Auditing (every query logged)
Q = Quality Monitoring (anomaly detection)
S = Sharing (Delta Sharing protocol)
3-LEVEL NAMESPACE:
catalog.schema.table
travel_prod.bookings.fact_flights
^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^
environment domain object

⚡ MUST KNOW DIRECT QUESTIONS

Q1: What is Unity Catalog?

Centralized governance solution for all Databricks data assets — tables, views, ML models, files, functions. Manages access control, lineage, auditing, and data discovery across all workspaces.

Q2: What is the 3-level namespace?

catalog.schema.table — Example: travel_prod.bookings.fact_flights. Replaced the old 2-level Hive Metastore (database.table).
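Because the fully qualified name is always dot-separated, it unpacks cleanly when scripting against many tables. A quick illustration (the names are the example ones above):

```python
# Unpacking the Unity Catalog 3-level namespace.
full_name = "travel_prod.bookings.fact_flights"
catalog, schema, table = full_name.split(".")

print(catalog)  # environment -> travel_prod
print(schema)   # domain      -> bookings
print(table)    # object      -> fact_flights
```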

Q3: What is a Metastore?

Top-level container in Unity Catalog — one per Azure region. Holds all catalogs. Multiple workspaces can share one metastore.
Q4: What objects does Unity Catalog manage?

Tables, Views, Functions, ML Models, Volumes (files like CSVs, images), Connections (external systems), Shares (Delta Sharing).

Q5: What is row-level security?

Different users see different ROWS from the same table. Example: Lufthansa team sees only Lufthansa bookings, Air India team sees only Air India bookings.

Q6: What is column masking?

Different users see different VALUES for the same column. Example: HR sees full email, analytics team sees kri***@gmail.com.
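The transformation itself is simple. In Databricks it would be defined as a SQL masking function attached to the column; the Python sketch below is illustration only, just to show the rule that produces kri***@gmail.com:

```python
# Illustrative sketch of the email-masking rule described above.
# (In Unity Catalog this logic lives in a SQL masking function,
#  not in Python — this just demonstrates the transformation.)
def mask_email(email: str) -> str:
    """Keep the first 3 characters of the local part, mask the rest."""
    local, _, domain = email.partition("@")
    return local[:3] + "***@" + domain

print(mask_email("krishna@gmail.com"))  # kri***@gmail.com
```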

Q7: How do you implement row-level security?

sql
CREATE FUNCTION airline_filter(airline STRING)
  RETURN IF(IS_MEMBER('admin_group'), true, airline = CURRENT_USER_AIRLINE());
ALTER TABLE bookings SET ROW FILTER airline_filter ON (airline_code);

Q8: What is data lineage?

Automatic tracking of data flow — which table feeds into which table, column by column. Unity Catalog captures this from every query. No manual setup needed.

⚠️ Q9: What is the difference between Managed and External tables in Unity Catalog?

             Managed Table                    External Table
Data stored  Unity Catalog's managed storage  Your ADLS path (you control)
DROP TABLE   Deletes data + metadata          Deletes metadata ONLY
Use when     Default — simpler                Data shared across platforms (Synapse, etc.)
Needs        Nothing extra                    Storage Credential + External Location

Q10: What is a Storage Credential?

Authentication to access external storage (ADLS Gen2). Uses Azure Service Principal or Managed Identity. Created by admin, referenced by External Locations.

Q11: What is an External Location?

Maps a Storage Credential to a specific ADLS path. Example: "Service Principal X can access abfss://container@storage.dfs.core.windows.net/path/"

🔑 MID-LEVEL QUESTIONS

Q12: How to migrate from Hive Metastore to Unity Catalog?

4 steps:
  1. Create Unity Catalog metastore + catalogs + schemas
  2. Create External Locations for existing ADLS paths
  3. CREATE TABLE uc_catalog.schema.table CLONE hive_metastore.db.table (DEEP or SHALLOW clone)
  4. Update all notebooks/jobs to use the 3-level namespace
⚠️ Don't do a big-bang migration — migrate one schema at a time and run old + new in parallel.
Q13: RBAC vs ABAC — what's the difference?

              RBAC                                  ABAC
Full name     Role-Based Access Control             Attribute-Based Access Control
How it works  GRANT to groups/roles                 GRANT based on tags/attributes
Example       GRANT SELECT ON bookings TO analysts  If table has tag pii=true, only data_stewards can access
Scale         Need many GRANT statements            One policy covers all PII tables automatically
When to use   Simple org (<50 tables)               Complex org (1000s of tables with varying sensitivity)

ABAC is new (Nov 2025) — mention it to show you're up to date!

SECTION 2: PHOTON ENGINE

🧠 Memory Map: Photon

PHOTON = "C++ turbo engine for SQL queries"
Think: NORMAL CAR vs SPORTS CAR
Normal (JVM Spark) = Works fine, fuel-efficient
Sports (Photon) = 2-8x faster, costs more (higher DBU rate)
WHEN PHOTON HELPS:
✓ SQL-heavy workloads (aggregations, joins, GROUP BY)
✓ Large table scans (full-table analytics)
✓ ETL with heavy transformations
WHEN PHOTON DOESN'T HELP:
✗ Python UDFs (Photon is C++, can't accelerate Python code)
✗ ML training (Photon is for SQL, not MLlib)
✗ Small tables (overhead > benefit)
✗ Streaming with low latency (Photon optimizes throughput, not latency)
Remember: "Photon = SQL turbo, not Python/ML turbo"

⚡ MUST KNOW DIRECT QUESTIONS

Q14: What is Photon?

A C++ vectorized query engine built into Databricks. Runs SQL queries 2-8x faster than standard Spark. Replaces the JVM-based Spark SQL engine for supported operations.

Q15: When should you NOT use Photon?

When using Python UDFs, ML training (MLlib), or streaming with latency requirements. Photon only accelerates SQL-style operations (scans, joins, aggregations).

⚠️ Q16: Does Photon cost more?

Yes — Photon-enabled clusters use a higher DBU rate. But if queries finish 3x faster, total cost may be LOWER. Always benchmark both.
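The arithmetic behind "faster can be cheaper" is worth having at your fingertips. The rates and speedup below are made-up illustration numbers (real DBU rates and speedups vary by workload — always benchmark):

```python
# Hypothetical numbers: Photon at an assumed 2x DBU rate but a 3x speedup.
standard_rate = 1.0   # relative DBU cost per hour, standard cluster
photon_rate   = 2.0   # relative DBU cost per hour, Photon-enabled (assumed)

standard_hours = 3.0  # query runtime on standard Spark
photon_hours   = 1.0  # same query, assuming a 3x Photon speedup

standard_cost = standard_rate * standard_hours   # 3.0
photon_cost   = photon_rate * photon_hours       # 2.0
print(photon_cost < standard_cost)  # True -> Photon is cheaper in this scenario
```

Flip the assumptions (small speedup, same 2x rate) and Photon costs more — which is exactly why the answer says "always benchmark both".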

Q17: Is Photon enabled by default?

Yes, on Databricks SQL Warehouses and Jobs clusters (since 2024). For interactive clusters, you choose photon-enabled runtime.

SECTION 3: SERVERLESS COMPUTE

🧠 Memory Map: Compute Types

📐 Architecture Diagram
3 COMPUTE TYPES = "JAS" (Job, All-Purpose, Serverless)
    J — Job Clusters (start for a job, auto-terminate after)
    A — All-Purpose Clusters (always on, for development)
    S — Serverless (no cluster management, instant start)

    ┌───────────────────────────────────────────────────────┐
    │ COST COMPARISON:                                      │
    │                                                       │
    │   All-Purpose  $$$$  (most expensive — always running)│
    │   Job Cluster  $$    (cheaper — runs only when needed)│
    │   Serverless   $-$$  (no idle cost, but higher DBU)   │
    │                                                       │
    │   Rule: Development → All-Purpose                     │
    │         Production  → Job Cluster or Serverless       │
    │         SQL queries → Serverless SQL Warehouse        │
    └───────────────────────────────────────────────────────┘

SERVERLESS WORKSPACES (new Jan 2026):
    = Entire workspace where ALL compute is serverless
    No cluster configuration AT ALL — just write code and run

⚡ MUST KNOW DIRECT QUESTIONS

Q18: What is Serverless compute?

Databricks manages the infrastructure — no cluster configuration, instant start (~10 seconds), auto-scales, auto-terminates. You pay per query/job, no idle cost.

Q19: What are the 3 compute types?

  1. All-Purpose Cluster — interactive development, always on, most expensive
  2. Job Cluster — starts for a job, auto-terminates after, production workloads
  3. Serverless — instant start, no config, pay-per-use, newest option

Q20: What are Serverless Workspaces?

A workspace where ALL compute is serverless (GA January 2026). No cluster management at all. Engineers just write code and run — Databricks handles everything.

⚠️ Q21: Job Cluster vs All-Purpose — when to use which?

  • All-Purpose: Development, exploration, ad-hoc analysis (need clusters always ready)
  • Job Cluster: Production ETL (spin up → run job → auto-terminate → save money)
  • ⚠️ NEVER use All-Purpose for production — wastes money even when idle
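In practice "use a Job Cluster" means the cluster is declared inside the job itself, so it exists only for the run. A minimal sketch of such a job payload, with field names following the Databricks Jobs API (the job name, notebook path, and node type are placeholder values):

```python
# Sketch: a job definition with an ephemeral Job Cluster.
# The "new_cluster" block is created for this run and terminated afterwards,
# which is what makes Job Clusters cheaper than an always-on All-Purpose cluster.
job_payload = {
    "name": "nightly_bookings_etl",                         # placeholder name
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/bookings"},  # placeholder
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",        # example runtime
                "node_type_id": "Standard_D4ds_v5",         # example Azure VM size
                "num_workers": 4,
            },
        }
    ],
}
print("new_cluster" in job_payload["tasks"][0])  # True
```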

SECTION 4: AZURE-SPECIFIC INTEGRATION

🧠 Memory Map: Azure Integration

📐 Architecture Diagram
AZURE DATABRICKS ARCHITECTURE = "3 PLANES"

    ┌─────────────────────────────────┐
    │    CONTROL PLANE                │ ← Databricks' Azure subscription
    │    (Databricks manages this)    │    Notebook server, Web UI,
    │                                 │    Cluster manager, Job scheduler
    └────────────┬────────────────────┘
                 │ Secure connection
    ┌────────────┴────────────────────┐
    │    DATA PLANE                   │ ← YOUR Azure subscription
    │    (You manage this)            │    VMs (workers), ADLS Gen2,
    │                                 │    VNET, Key Vault, NSGs
    └─────────────────────────────────┘

KEY AZURE SERVICES:
    ADLS Gen2    → Storage (where Delta tables live)
    Key Vault    → Secrets (passwords, API keys, connection strings)
    Service Principal → Identity for automated jobs (no personal login)
    Azure DevOps → CI/CD pipelines (deploy notebooks, jobs)
    Event Grid   → Auto Loader notification mode trigger
    VNET         → Network isolation (compliance requirement)

Remember: "AKSEV" = ADLS, Key Vault, Service Principal, Event Grid, VNET

⚡ MUST KNOW DIRECT QUESTIONS

Q22: What is ADLS Gen2?

Azure Data Lake Storage Gen2 — hierarchical cloud storage where all Delta tables, raw files, and landing zones live. Combines blob storage performance with file system semantics.

Q23: What is Azure Key Vault?

Secure secret management service. Store database passwords, API keys, storage account keys. Databricks reads secrets at runtime — no secrets in code.

python
password = dbutils.secrets.get(scope="key-vault-scope", key="oracle-password")

Q24: What is a Service Principal?

An Azure identity for applications (not humans). Used for production jobs instead of personal logins. Has its own client ID + secret. Better for automation + auditing.

⚠️ Q25: Why use Service Principal instead of personal login for production?

  1. Personal accounts can be disabled when employees leave → jobs break
  2. Service Principals can have minimal permissions (least privilege)
  3. Better audit trail — you know exactly which application accessed data
  4. No interactive login required — works in CI/CD pipelines
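Wiring a Service Principal to ADLS Gen2 typically means setting the hadoop-azure ABFS OAuth properties on the cluster or session. A sketch, with the storage account, tenant, and client values as placeholders (the secret should come from a Key Vault-backed scope, never a literal):

```python
# Sketch: Spark conf for a Service Principal reading ADLS Gen2 via OAuth.
# Keys follow the hadoop-azure ABFS client-credentials settings;
# all concrete values below are placeholders.
storage_account = "mystorageaccount"   # placeholder
tenant_id = "<tenant-id>"              # placeholder
suffix = f"{storage_account}.dfs.core.windows.net"

sp_conf = {
    f"fs.azure.account.auth.type.{suffix}": "OAuth",
    f"fs.azure.account.oauth.provider.type.{suffix}":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    f"fs.azure.account.oauth2.client.id.{suffix}": "<client-id>",
    f"fs.azure.account.oauth2.client.secret.{suffix}": "<from-key-vault-scope>",
    f"fs.azure.account.oauth2.client.endpoint.{suffix}":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}
# In a notebook, each pair would be applied with spark.conf.set(key, value).
```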

Q26: What are the 3 planes of Azure Databricks?

  1. Control Plane (Databricks' subscription) — Web UI, notebook service, cluster manager
  2. Data Plane (YOUR subscription) — VMs, ADLS storage, VNET
  3. Web Application — The Databricks UI you interact with

SECTION 5: GDPR & DATA GOVERNANCE

🧠 Memory Map: GDPR

📋 Overview
GDPR = "European law: users own their data"
KEY REQUIREMENT: "Right to be Forgotten"
= User requests: "Delete ALL my personal data"
= You MUST delete it from EVERY table, backup, and file
HOW TO DO IT IN DELTA LAKE:
Step 1: DELETE FROM bookings WHERE passenger_id = 'P123';
Step 2: DELETE FROM passengers WHERE id = 'P123';
Step 3: VACUUM bookings RETAIN 0 HOURS; ← removes old files too!
⚠️ Needs: SET spark.databricks.delta.retentionDurationCheck.enabled = false
TWO APPROACHES:
Hard Delete → Actually remove data (VACUUM 0 HOURS) — simple but breaks time travel
Pseudonymization → Replace real name with fake (keep data, remove identity) — preserves analytics
PII HANDLING in Unity Catalog:
Tag columns: ALTER TABLE passengers ALTER COLUMN email SET TAGS ('pii' = 'true');
Mask columns: Column masking functions (analysts see masked emails)
Audit: Who accessed PII columns? → Unity Catalog audit logs

⚡ MUST KNOW DIRECT QUESTIONS

Q27: What is GDPR "Right to be Forgotten"?

A user can request complete deletion of their personal data. You must delete from ALL tables + run VACUUM to remove from physical files.

Q28: How do you implement Right to be Forgotten in Delta Lake?

  1. DELETE the user's rows from ALL tables
  2. Run VACUUM table RETAIN 0 HOURS to physically remove old files
  3. ⚠️ This breaks time travel — old versions with that user's data are gone
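A forget-me job is usually a small loop over the affected tables. The sketch below just builds the SQL statements the steps above describe, assuming (for illustration) that every table keys the user by a passenger_id column; in a notebook each statement would be run with spark.sql(...):

```python
# Sketch: generate the DELETE + VACUUM statements for a forget-me request.
# Table names are examples; assumes a uniform passenger_id column.
def forget_user_sql(tables: list[str], passenger_id: str) -> list[str]:
    stmts = [f"DELETE FROM {t} WHERE passenger_id = '{passenger_id}'"
             for t in tables]
    # retentionDurationCheck must be disabled before a 0-hour VACUUM (see above)
    stmts += [f"VACUUM {t} RETAIN 0 HOURS" for t in tables]
    return stmts

for stmt in forget_user_sql(["bookings", "passengers"], "P123"):
    print(stmt)
```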

Q29: What is pseudonymization?

Replace real PII with fake/hashed values. Example: "Krishna Yadav" → "User_A7B3C". Preserves data for analytics while removing identity. Alternative to hard delete.
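One common way to get stable pseudonyms is a salted hash: the same person always maps to the same token (so joins and counts still work), but the identity is gone as long as the salt stays secret. A minimal sketch (the salt value is a placeholder that belongs in Key Vault):

```python
import hashlib

# Sketch: deterministic pseudonymization via a salted SHA-256 hash.
# If the salt leaks, names can be brute-forced back, so treat it as a secret.
SALT = "keep-me-in-key-vault"  # placeholder


def pseudonymize(name: str) -> str:
    digest = hashlib.sha256((SALT + name).encode()).hexdigest()
    return "User_" + digest[:5].upper()


token = pseudonymize("Krishna Yadav")
print(token == pseudonymize("Krishna Yadav"))  # True: same input, same token
```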

Q30: How do you tag PII columns in Unity Catalog?

sql
ALTER TABLE passengers ALTER COLUMN email SET TAGS ('pii' = 'true');
ALTER TABLE passengers ALTER COLUMN phone SET TAGS ('pii' = 'true');
Tags can be used with ABAC policies to auto-restrict access.

SECTION 6: DELTA SHARING & NEW FEATURES

🧠 Memory Map: Delta Sharing

DELTA SHARING = "Share data WITHOUT copying"
Think: Google Docs SHARING (view link — they don't download a copy)
Provider: Amadeus (shares booking data)
Recipient: Partner airline (reads shared data)
Share: A named collection of tables to share
How it works:
1. Provider creates a SHARE with specific tables
2. Provider adds RECIPIENT (partner airline)
3. Recipient gets an activation link
4. Recipient reads data using open protocol (Spark, pandas, Power BI)
⚠️ Data is NOT copied — recipient reads from provider's storage
Remember: "PSR" = Provider, Share, Recipient

⚡ MUST KNOW DIRECT QUESTIONS

Q31: What is Delta Sharing?

Open protocol for sharing data across organizations WITHOUT copying. Provider grants access to specific tables, recipient reads live data. Works with Spark, pandas, Power BI.
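On the recipient side, access arrives as a small profile file defined by the open Delta Sharing protocol. A sketch of its shape (the endpoint and token are placeholders, and the exact table-address string shown in the comment may differ by client version):

```python
import json

# Sketch of a Delta Sharing profile as a recipient would store it.
# Field names follow the open protocol; concrete values are placeholders.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",  # placeholder
    "bearerToken": "<token-from-activation-link>",             # placeholder
}
print(json.dumps(profile, indent=2))
# A client such as the delta-sharing Python library then reads a table
# addressed roughly as "profile-file#<share>.<schema>.<table>".
```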

Q32: Why not just copy data to share it?

Copying = stale data, storage cost, security risk (multiple copies). Delta Sharing = always fresh, no duplication, centralized access control.
Q33: What is Compatibility Mode?

Allows external tools (that only understand Iceberg/Hive) to read your Delta tables. Unity Catalog translates metadata on-the-fly. No table conversion needed.

Q34: What are Multi-table Transactions?

sql
BEGIN ATOMIC
  INSERT INTO bookings VALUES (...);
  UPDATE passenger_counts SET count = count + 1;
  INSERT INTO audit_log VALUES (...);
END
Either ALL 3 statements succeed, or NONE do. New in Databricks 2025.

🧠 FINAL REVISION — Day 3 Summary Card

📐 Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│               DAY 3: PLATFORM & GOVERNANCE                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Unity Catalog = Governance for ALL data assets             │
│  Namespace: catalog.schema.table (3 levels)                 │
│  6 pillars: Access, Discovery, Lineage, Audit, Quality,    │
│             Sharing ("ADLAQS")                              │
│  Row-level security = different users see different rows    │
│  Column masking = different users see masked values         │
│  ABAC (new) = tag-based access control (pii=true → block)  │
│                                                             │
│  Managed Table: DROP deletes data + metadata                │
│  External Table: DROP deletes metadata ONLY                 │
│  External needs: Storage Credential + External Location     │
│                                                             │
│  Photon = C++ SQL engine (2-8x faster)                      │
│  ⚠️ Doesn't help Python UDFs or ML training                 │
│                                                             │
│  Compute: All-Purpose (dev) → Job Cluster (prod) →          │
│           Serverless (newest, instant start)                 │
│  ⚠️ NEVER use All-Purpose for production!                   │
│                                                             │
│  Azure: ADLS Gen2 (storage), Key Vault (secrets),           │
│         Service Principal (production identity)             │
│  3 Planes: Control (Databricks) + Data (yours) + Web App   │
│                                                             │
│  GDPR: DELETE + VACUUM 0 HOURS = right to be forgotten     │
│  Alternative: Pseudonymization (hash PII, keep data)        │
│  PII tagging: ALTER COLUMN SET TAGS ('pii' = 'true')       │
│                                                             │
│  Delta Sharing: Share without copying (PSR pattern)         │
│  Multi-table Tx: BEGIN ATOMIC...END (all or nothing)        │
│  Compatibility Mode: Iceberg/Hive clients read Delta        │
│                                                             │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                          │
│  1. "Unity Catalog: single governance for all data assets"  │
│  2. "Row/column security for multi-tenant airline data"     │
│  3. "Service Principals for production, Key Vault for secrets"│
│  4. "GDPR: DELETE + VACUUM 0 HOURS for right to forget"    │
│  5. "Delta Sharing for partner airlines without copying"    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Study tip: Read this file TWICE:
  1. First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
  2. Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
  3. Before interview (15 min): Read ONLY the Final Revision Summary Card