Day 3: Azure Platform & Governance — Quick Recall Guide
- ⚡ = Must remember (95% chance of being asked)
- 🔑 = Key concept (core understanding needed)
- ⚠️ = Common trap (interviewers love to test this)
- 🧠 = Memory Map (mnemonic/acronym — memorize this!)
- 📝 = One-liner (flash-card style — cover answer, test yourself)
🧠 MASTER MEMORY MAP — Day 3
SECTION 1: UNITY CATALOG
🧠 Memory Map: Unity Catalog
⚡ MUST KNOW DIRECT QUESTIONS
📝 What is Unity Catalog?
Centralized governance solution for all Databricks data assets — tables, views, ML models, files, functions. Manages access control, lineage, auditing, and data discovery across all workspaces.
📝 What is the 3-level namespace?
catalog.schema.table — Example: travel_prod.bookings.fact_flights. Replaced the old 2-level Hive Metastore namespace (database.table).
📝 What asset types does Unity Catalog govern?
Tables, Views, Functions, ML Models, Volumes (files like CSVs, images), Connections (external systems), Shares (Delta Sharing).
📝 What is row-level security?
Different users see different ROWS from the same table. Example: the Lufthansa team sees only Lufthansa bookings, the Air India team sees only Air India bookings.
📝 What is column masking?
Different users see different VALUES for the same column. Example: HR sees the full email, the analytics team sees kri***@gmail.com.
📝 How do you implement a row filter in SQL?
CREATE FUNCTION airline_filter(airline STRING)
RETURN IF(IS_MEMBER('admin_group'), true, airline = CURRENT_USER_AIRLINE());
ALTER TABLE bookings SET ROW FILTER airline_filter ON (airline_code);
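The masking rule described above (kri***@gmail.com) can be sketched in plain Python to test yourself on the transform. Illustration only: in Unity Catalog the real mechanism is a SQL mask function attached to the column, not application code.

```python
# Minimal sketch of the column-masking rule (illustration, not the production UDF):
# keep the first 3 characters of the email's local part, mask the rest, keep the domain.
def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    return f"{local[:3]}***@{domain}"

print(mask_email("krishna@gmail.com"))  # kri***@gmail.com
```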
📝 What is data lineage in Unity Catalog?
Automatic tracking of data flow — which table feeds into which table, column by column. Unity Catalog captures this from every query. No manual setup needed.
| Aspect | Managed Table | External Table |
|---|---|---|
| Data stored | Unity Catalog's managed storage | Your ADLS path (you control) |
| DROP TABLE | Deletes data + metadata | Deletes metadata ONLY |
| Use when | Default — simpler | Data shared across platforms (Synapse, etc.) |
| Needs | Nothing extra | Storage Credential + External Location |
📝 What is a Storage Credential?
Authentication to access external storage (ADLS Gen2). Uses an Azure Service Principal or Managed Identity. Created by an admin, referenced by External Locations.
📝 What is an External Location?
Maps a Storage Credential to a specific ADLS path. Example: "Service Principal X can access abfss://container@storage.dfs.core.windows.net/path/"
🔑 MID-LEVEL QUESTIONS
📝 How do you migrate from Hive Metastore to Unity Catalog?
- Create Unity Catalog metastore + catalogs + schemas
- Create External Locations for existing ADLS paths
- Copy tables with CREATE TABLE uc_catalog.schema.table DEEP CLONE hive_metastore.db.table (or SHALLOW CLONE)
- Update all notebooks/jobs to use the 3-level namespace
- ⚠️ Don't do a big-bang migration — migrate one schema at a time, run old + new in parallel
| Aspect | RBAC | ABAC |
|---|---|---|
| Full name | Role-Based Access Control | Attribute-Based Access Control |
| How it works | GRANT to groups/roles | GRANT based on tags/attributes |
| Example | GRANT SELECT ON bookings TO analysts | If table has tag pii=true, only data_stewards can access |
| Scale | Need many GRANT statements | One policy covers all PII tables automatically |
| When to use | Simple org (<50 tables) | Complex org (1000s of tables with varying sensitivity) |
SECTION 2: PHOTON ENGINE
🧠 Memory Map: Photon
⚡ MUST KNOW DIRECT QUESTIONS
📝 What is Photon?
A C++ vectorized query engine built into Databricks. Runs SQL queries 2-8x faster than standard Spark. Replaces the JVM-based Spark SQL engine for supported operations.
📝 When does Photon NOT help?
When using Python UDFs, ML training (MLlib), or streaming with latency requirements. Photon only accelerates SQL-style operations (scans, joins, aggregations).
📝 Does Photon cost more?
Yes — Photon-enabled clusters use a higher DBU rate. But if queries finish 3x faster, total cost may be LOWER. Always benchmark both.
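The "higher DBU rate but lower total cost" point is simple arithmetic. A hedged sketch with made-up numbers (the rates and the 3x speedup are assumptions, not quoted Azure Databricks prices):

```python
# Illustrative only: DBU rates and the 3x speedup are assumptions,
# not real Azure Databricks prices. Always benchmark your own workload.
def job_cost(runtime_hours, dbu_per_hour, price_per_dbu):
    # total cost = runtime * DBU consumption rate * price per DBU
    return runtime_hours * dbu_per_hour * price_per_dbu

standard = job_cost(runtime_hours=3.0, dbu_per_hour=2.0, price_per_dbu=0.30)  # 1.80
photon   = job_cost(runtime_hours=1.0, dbu_per_hour=4.0, price_per_dbu=0.30)  # 1.20

# Photon bills 2x the DBU rate here, yet the 3x speedup makes it cheaper overall.
```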
📝 Is Photon enabled by default?
Yes — on Databricks SQL Warehouses and Jobs clusters (since 2024). For interactive clusters, you choose a Photon-enabled runtime.
SECTION 3: SERVERLESS COMPUTE
🧠 Memory Map: Compute Types
3 COMPUTE TYPES = "JAS" (Job, All-Purpose, Serverless)
J — Job Clusters (start for a job, auto-terminate after)
A — All-Purpose Clusters (always on, for development)
S — Serverless (no cluster management, instant start)
┌───────────────────────────────────────────────────────┐
│ COST COMPARISON: │
│ │
│ All-Purpose $$$$ (most expensive — always running)│
│ Job Cluster $$ (cheaper — runs only when needed)│
│ Serverless $-$$ (no idle cost, but higher DBU) │
│ │
│ Rule: Development → All-Purpose │
│ Production → Job Cluster or Serverless │
│ SQL queries → Serverless SQL Warehouse │
└───────────────────────────────────────────────────────┘
SERVERLESS WORKSPACES (new Jan 2026):
= Entire workspace where ALL compute is serverless
No cluster configuration AT ALL — just write code and run
⚡ MUST KNOW DIRECT QUESTIONS
📝 What is serverless compute?
Databricks manages the infrastructure — no cluster configuration, instant start (~10 seconds), auto-scales, auto-terminates. You pay per query/job, no idle cost.
📝 What are the 3 compute types?
- All-Purpose Cluster — interactive development, always on, most expensive
- Job Cluster — starts for a job, auto-terminates after, production workloads
- Serverless — instant start, no config, pay-per-use, newest option
📝 What is a serverless workspace?
A workspace where ALL compute is serverless (GA January 2026). No cluster management at all. Engineers just write code and run — Databricks handles everything.
📝 When do you use each compute type?
- All-Purpose: Development, exploration, ad-hoc analysis (need clusters always ready)
- Job Cluster: Production ETL (spin up → run job → auto-terminate → save money)
- ⚠️ NEVER use All-Purpose for production — wastes money even when idle
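The "never All-Purpose for production" rule is also just arithmetic. A sketch with assumed hourly rates (illustration only, real costs depend on VM size and DBU SKU):

```python
# Assumed rates for illustration only; not real Azure Databricks prices.
HOURS_PER_DAY = 24
JOB_RUNTIME_HOURS = 2            # the nightly ETL actually runs ~2h

ALL_PURPOSE_RATE = 0.55          # $/hour, assumed
JOB_CLUSTER_RATE = 0.30          # $/hour, assumed (job compute is cheaper per DBU)

all_purpose_daily = HOURS_PER_DAY * ALL_PURPOSE_RATE      # billed even while idle
job_cluster_daily = JOB_RUNTIME_HOURS * JOB_CLUSTER_RATE  # billed only while running

# 22 idle hours on an always-on cluster dominate the bill.
```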
SECTION 4: AZURE-SPECIFIC INTEGRATION
🧠 Memory Map: Azure Integration
AZURE DATABRICKS ARCHITECTURE = "3 PLANES"
┌─────────────────────────────────┐
│ CONTROL PLANE │ ← Databricks' Azure subscription
│ (Databricks manages this) │ Notebook server, Web UI,
│ │ Cluster manager, Job scheduler
└────────────┬────────────────────┘
│ Secure connection
┌────────────┴────────────────────┐
│ DATA PLANE │ ← YOUR Azure subscription
│ (You manage this) │ VMs (workers), ADLS Gen2,
│ │ VNET, Key Vault, NSGs
└─────────────────────────────────┘
KEY AZURE SERVICES:
ADLS Gen2 → Storage (where Delta tables live)
Key Vault → Secrets (passwords, API keys, connection strings)
Service Principal → Identity for automated jobs (no personal login)
Azure DevOps → CI/CD pipelines (deploy notebooks, jobs)
Event Grid → Auto Loader notification mode trigger
VNET → Network isolation (compliance requirement)
Remember: "AKSEV" = ADLS, Key Vault, Service Principal, Event Grid, VNET
⚡ MUST KNOW DIRECT QUESTIONS
📝 What is ADLS Gen2?
Azure Data Lake Storage Gen2 — hierarchical cloud storage where all Delta tables, raw files, and landing zones live. Combines blob storage performance with file system semantics.
📝 What is Azure Key Vault?
Secure secret management service. Store database passwords, API keys, storage account keys. Databricks reads secrets at runtime — no secrets in code.
# Read a secret at runtime from a Key Vault-backed secret scope
password = dbutils.secrets.get(scope="key-vault-scope", key="oracle-password")
📝 What is a Service Principal?
An Azure identity for applications (not humans). Used for production jobs instead of personal logins. Has its own client ID + secret. Better for automation + auditing.
📝 Why use a Service Principal instead of a personal account?
- Personal accounts can be disabled when employees leave → jobs break
- Service Principals can have minimal permissions (least privilege)
- Better audit trail — you know exactly which application accessed data
- No interactive login required — works in CI/CD pipelines
📝 What are the 3 planes of the Azure Databricks architecture?
- Control Plane (Databricks' subscription) — Web UI, notebook service, cluster manager
- Data Plane (YOUR subscription) — VMs, ADLS storage, VNET
- Web Application — The Databricks UI you interact with
SECTION 5: GDPR & DATA GOVERNANCE
🧠 Memory Map: GDPR
⚡ MUST KNOW DIRECT QUESTIONS
📝 What is the right to be forgotten?
A user can request complete deletion of their personal data. You must delete from ALL tables + run VACUUM to remove it from the physical files.
📝 How do you implement it in Delta Lake?
- DELETE the user's rows from ALL tables
- Run VACUUM table RETAIN 0 HOURS to physically remove the old files (you must first disable Delta's retention-duration check, which blocks retention under 7 days)
- ⚠️ This breaks time travel — old versions with that user's data are gone
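Why DELETE alone isn't enough can be shown with a toy model (pure illustration, not real Delta internals): DELETE writes a new table version, but the old files survive for time travel until VACUUM drops them.

```python
# Toy model of Delta versions (illustration only, not how Delta stores data).
# v1 still contains the user's rows; v2 is the new version written by DELETE.
versions = {
    "v1": ["part-000 (contains user rows)"],
    "v2": ["part-001 (user rows deleted)"],
}

def vacuum(versions, retain_hours=0):
    # with RETAIN 0 HOURS, only files of the latest version survive
    latest = max(versions)
    return {latest: versions[latest]}

after = vacuum(versions)
# time travel to v1 is now impossible: its files are physically gone
```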
📝 What is pseudonymization?
Replace real PII with fake/hashed values. Example: "Krishna Yadav" → "User_A7B3C". Preserves data for analytics while removing identity. An alternative to hard delete.
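A minimal Python sketch of pseudonymization via salted hashing, matching the "User_A7B3C" format above. The salt value and the 5-character token length are assumptions for illustration.

```python
import hashlib

# Assumption for illustration: in production the salt would live in Key Vault
# and be access-controlled, because salt + hash together can re-identify users.
SALT = "store-me-in-key-vault"

def pseudonymize(name: str) -> str:
    # same input always yields the same token, so joins/aggregations still work
    digest = hashlib.sha256((SALT + name).encode()).hexdigest()
    return f"User_{digest[:5].upper()}"
```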
📝 How do you tag PII columns?
ALTER TABLE passengers ALTER COLUMN email SET TAGS ('pii' = 'true');
ALTER TABLE passengers ALTER COLUMN phone SET TAGS ('pii' = 'true');
SECTION 6: DELTA SHARING & NEW FEATURES
🧠 Memory Map: Delta Sharing
⚡ MUST KNOW DIRECT QUESTIONS
📝 What is Delta Sharing?
Open protocol for sharing data across organizations WITHOUT copying. The provider grants access to specific tables, the recipient reads live data. Works with Spark, pandas, Power BI.
📝 What is compatibility mode?
Allows external tools (that only understand Iceberg/Hive) to read your Delta tables. Unity Catalog translates the metadata on-the-fly. No table conversion needed.
📝 How do you run an all-or-nothing multi-table transaction?
BEGIN ATOMIC
INSERT INTO bookings VALUES (...);
UPDATE passenger_counts SET count = count + 1;
INSERT INTO audit_log VALUES (...);
END
🧠 FINAL REVISION — Day 3 Summary Card
┌─────────────────────────────────────────────────────────────┐
│ DAY 3: PLATFORM & GOVERNANCE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Unity Catalog = Governance for ALL data assets │
│ Namespace: catalog.schema.table (3 levels) │
│ 6 pillars: Access, Discovery, Lineage, Audit, Quality, │
│ Sharing ("ADLAQS") │
│ Row-level security = different users see different rows │
│ Column masking = different users see masked values │
│ ABAC (new) = tag-based access control (pii=true → block) │
│ │
│ Managed Table: DROP deletes data + metadata │
│ External Table: DROP deletes metadata ONLY │
│ External needs: Storage Credential + External Location │
│ │
│ Photon = C++ SQL engine (2-8x faster) │
│ ⚠️ Doesn't help Python UDFs or ML training │
│ │
│ Compute: All-Purpose (dev) → Job Cluster (prod) → │
│ Serverless (newest, instant start) │
│ ⚠️ NEVER use All-Purpose for production! │
│ │
│ Azure: ADLS Gen2 (storage), Key Vault (secrets), │
│ Service Principal (production identity) │
│ 3 Planes: Control (Databricks) + Data (yours) + Web App │
│ │
│ GDPR: DELETE + VACUUM 0 HOURS = right to be forgotten │
│ Alternative: Pseudonymization (hash PII, keep data) │
│ PII tagging: ALTER COLUMN SET TAGS ('pii' = 'true') │
│ │
│ Delta Sharing: Share without copying (PSR pattern) │
│ Multi-table Tx: BEGIN ATOMIC...END (all or nothing) │
│ Compatibility Mode: Iceberg/Hive clients read Delta │
│ │
│ TOP 5 THINGS TO SAY IN INTERVIEW: │
│ 1. "Unity Catalog: single governance for all data assets" │
│ 2. "Row/column security for multi-tenant airline data" │
│ 3. "Service Principals for production, Key Vault for secrets"│
│ 4. "GDPR: DELETE + VACUUM 0 HOURS for right to forget" │
│ 5. "Delta Sharing for partner airlines without copying" │
│ │
└─────────────────────────────────────────────────────────────┘
- First pass (30 min): Read only 🧠 Memory Maps + ⚡ Direct Questions
- Second pass (30 min): Read 🔑 Mid-Level Questions + ⚠️ Traps
- Before interview (15 min): Read ONLY the Final Revision Summary Card