ETL Patterns & Pipeline Design
💡 Interview Tip
Time: 6-7 hours | Priority: HIGH — ETL design is 25-30% of Databricks interviews
Context: Oracle CDC → Kafka → Delta Lake, flight booking pipelines, fare pricing updates
Approach: Every topic starts with simple explanation → then interview-level depth
SECTION 1: MEDALLION ARCHITECTURE (1 hour)
Q1: What is Medallion Architecture?
Simple Explanation: Medallion Architecture is a way to organize your data into 3 layers (like 3 levels of cleaning). Raw data comes in, gets cleaned step by step, and ends up ready for business reports and ML.
Think of it like cooking:
- Bronze = Raw ingredients from the market (dirty, uncut, as-is)
- Silver = Washed, cut, measured ingredients (clean, validated, ready to cook)
- Gold = Final dish served to the customer (aggregated, formatted for business use)
Why do we need it? Without layers, you'd have a mess — raw data mixed with cleaned data, no one knows what's trustworthy. Medallion gives you a clear pipeline: raw → clean → business-ready.
BRONZE (Raw Layer) SILVER (Clean Layer) GOLD (Business Layer)
────────────────── ───────────────── ─────────────────
What goes in: What goes in: What goes in:
• Raw data exactly as • Deduplicated data • Aggregated metrics
received from source • Validated (nulls removed) • Star schema for BI
• No transformations • Data types fixed • Pre-computed KPIs
• Just append, never update • Business keys applied • Wide tables for ML
Who reads it: Who reads it: Who reads it:
• Only data engineers • Data engineers + • Business analysts
(for debugging) data scientists • BI tools (Power BI)
• ML models
Metadata added: Transformations: Optimize