Day 1: Snowflake Architecture & Core Concepts
SECTION 1: SNOWFLAKE ARCHITECTURE (1.5 hours)
Q1: What is Snowflake? Explain its architecture.
Simple Explanation: Snowflake is a cloud data warehouse — a platform where you store and query massive amounts of data using SQL. Unlike traditional databases (Oracle, SQL Server), Snowflake was built from scratch for the cloud. Its biggest innovation: separate storage from compute — the disk where data lives and the computers that process it are completely independent.
Real-world analogy: Think of a library (storage) and reading desks (compute). In a traditional database, reading desks are built INTO the library — only 10 desks, and they're always there (wasting space when empty). Snowflake is like: the library is in one building, and you can bring in 1 desk or 1000 desks as needed. When no one's reading, send all desks home (auto-suspend). The books (data) are always there regardless.
Why do we need it?
- Traditional databases: scale storage = must scale compute too (expensive, wasteful)
- Snowflake: scale each independently — run 10 warehouses on the same data simultaneously
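This independence can be sketched in SQL; the warehouse and table names below are made-up examples, not anything defined elsewhere in this guide:

```sql
-- Two independent warehouses of different sizes; both can query the
-- same table at the same time with zero resource contention.
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

USE WAREHOUSE etl_wh;              -- session 1 runs on etl_wh
SELECT COUNT(*) FROM bookings;     -- reads from shared storage

USE WAREHOUSE bi_wh;               -- session 2 runs on bi_wh
SELECT SUM(amount) FROM bookings;  -- same data, separate compute
```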
The 3-Layer Architecture:
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 3: CLOUD SERVICES │
│ The "brain" — always running, managed by Snowflake │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Query │ │ Metadata │ │ Auth & │ │ Optimizer │ │
│ │ Compiler │ │ Manager │ │ Security │ │ & Planner │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────────┘ │
│ • Parses SQL → builds execution plan │
│ • Manages metadata (table schemas, clustering info) │
│ • Handles login, RBAC, encryption │
│ • Optimizes queries (pruning, caching decisions) │
└───────────────────────────┬──────────────────────────────────────┘
│
┌───────────────────────────▼──────────────────────────────────────┐
│ LAYER 2: COMPUTE (Virtual Warehouses) │
│ The "muscles" — you turn them on when needed │
│ │
│ Warehouse A (XS) Warehouse B (L) Warehouse C (XL) │
│ [ETL pipelines] [BI dashboards] [ML training] │
│ ↕ ↕ ↕ │
│ Reads from storage Reads from storage Reads from storage │
│ │
│ • Independent from each other — no resource contention │
│ • Auto-suspend (stop billing when idle) │
│ • Auto-resume (start automatically when query arrives) │
└───────────────────────────┬──────────────────────────────────────┘
│
┌───────────────────────────▼──────────────────────────────────────┐
│ LAYER 1: STORAGE │
│ The "filing cabinet" — always on, centralized │
│ │
│ Data stored in Snowflake's proprietary compressed format │
│ (micro-partitions — columnar, compressed Parquet-like files) │
│ On AWS S3 / Azure ADLS Gen2 / Google Cloud Storage │
│ │
│ • Charged separately from compute (per TB/month) │
│ • Stores table data + metadata + Time Travel versions │
│ • Encrypted at rest (AES-256) │
└──────────────────────────────────────────────────────────────────┘
Key properties:
- Storage: Compressed columnar files in cloud object storage (S3/ADLS/GCS)
- Compute: Virtual warehouses — separate clusters, independent scaling
- Cloud Services: Query compilation, metadata, security — always on; free up to 10% of daily compute usage, billed only beyond that
Interview tip: "Snowflake's key differentiator is true separation of storage and compute. Multiple virtual warehouses can query the same data simultaneously without competing for resources — something classic MPP designs like original Redshift or Synapse struggle to match."
Q2: What are Micro-Partitions? How are they different from Hive partitions?
Simple Explanation: All data in Snowflake is automatically split into small pieces called micro-partitions. You don't define them — Snowflake creates them automatically as data is loaded. Each micro-partition is like a small drawer in a filing cabinet, holding 50-500 MB of data (measured before compression; stored compressed, in columnar format).
Real-world analogy: Imagine a huge bookshelf of flight booking records. Instead of one massive file, Snowflake automatically splits records into drawers of roughly equal size. On the label of each drawer: "min date = Jan 1, max date = Jan 15, airlines: AI, LH, EK". When you ask for March bookings, Snowflake reads only the labels and skips drawers that can't possibly contain March data.
Why do we need them? Without micro-partitions: read the ENTIRE dataset even for one day's data. With micro-partitions: Snowflake reads labels (metadata) first, skips irrelevant micro-partitions → only reads 1% of data for a filtered query.
Technical details:
-- Query: Get all Lufthansa bookings in March 2026
SELECT * FROM bookings
WHERE booking_date BETWEEN '2026-03-01' AND '2026-03-31'
AND airline_code = 'LH';
-- What Snowflake does (PARTITION PRUNING):
-- 1. Reads metadata of ALL micro-partitions (from Cloud Services layer — fast!)
-- 2. Skips micro-partitions where max_date < March 1 OR min_date > March 31
-- 3. Of remaining, skips where 'LH' not in airlines list
-- 4. Reads ONLY relevant micro-partitions
-- Result: Instead of reading 10B rows, reads maybe 50M rows → 200x faster!
Micro-partitions vs Hive/Databricks partitions:
| Feature | Snowflake Micro-partitions | Hive/Databricks Partitions |
|---|---|---|
| Created by | Snowflake automatically | You must define manually |
| Granularity | 50-500 MB (uncompressed) each (fine-grained) | Entire folder per partition value |
| Too many? | Never — Snowflake manages automatically | Yes — too many partitions = metadata overload |
| Metadata | Min/max per column per micro-partition | Just folder structure |
| Overlap | Can overlap (managed by clustering) | No overlap by design |
| Maintenance | Automatic (reclustering if needed) | Manual (must re-partition) |
⚠️ Key Interview Point: In Snowflake, you do NOT manually partition tables like in Hive. Micro-partitions are automatic. You may define clustering keys to optimize micro-partition layout — but this is separate from partitioning.
Interview tip: "Unlike Hive where you define partition columns and manage folder structure yourself, Snowflake automatically creates micro-partitions for every table. When data layout degrades over time (high overlap), I define a clustering key which triggers automatic reclustering — no manual maintenance needed."
Q3: What are Clustering Keys? When should you use them?
Simple Explanation: When data is inserted into Snowflake over time, micro-partitions can become mixed up — a partition might contain data from January, March, and July all mixed together. When you query for March data, Snowflake has to read ALL partitions. This is called high overlap or poor clustering.
A clustering key tells Snowflake: "When organizing micro-partitions, try to group rows with similar values for this column together." Snowflake then automatically re-clusters the table in the background.
Real-world analogy: Your filing cabinet started organized by date (Jan → Feb → March). Then you added 1 million new records randomly, and now each drawer has records from all months mixed in. Clustering key = hiring an assistant to re-sort and re-organize the drawers back to chronological order.
-- Create a table WITH clustering key (good for large tables queried by date)
CREATE TABLE bookings (
booking_id VARCHAR,
booking_date DATE,
airline_code VARCHAR,
passenger_id VARCHAR,
amount NUMBER
)
CLUSTER BY (booking_date, airline_code);
-- This tells Snowflake: "Group micro-partitions by booking_date first, then airline_code"
-- Snowflake will keep data with same booking_date in the same or nearby micro-partitions
-- Check clustering health of a table
SELECT SYSTEM$CLUSTERING_INFORMATION('bookings', '(booking_date, airline_code)');
-- Returns: average_depth (lower = better), average_overlaps (lower = better)
-- Manually trigger reclustering (usually not needed — it's automatic)
ALTER TABLE bookings RECLUSTER;
When to use clustering keys:
- Very large tables (roughly 1 TB and up); small tables rarely benefit
- Queries consistently filter or join on the same few columns
- SYSTEM$CLUSTERING_INFORMATION reports high average_depth / average_overlaps
- Pruning is poor: queries scan a large percentage of total micro-partitions
Interview tip: "For our Amadeus booking table with 10 billion rows, I'd cluster by (booking_date, airline_code) since 90% of queries filter by date range and specific airline. But first I'd check SYSTEM$CLUSTERING_INFORMATION to confirm the current overlap is actually high — clustering costs credits, so don't add it unless needed."
Q4: What is a Virtual Warehouse? Explain sizing and multi-cluster.
Simple Explanation: A Virtual Warehouse is the compute in Snowflake — a cluster of cloud VMs (servers) that execute your queries. You size it (XS, S, M, L, XL, 2XL, 3XL, 4XL), it auto-starts when a query arrives, runs the query, and auto-suspends when idle. You pay only when it's running.
Real-world analogy: A virtual warehouse is like a team of analysts in an office. XS = 1 analyst, XL = 32 analysts. They all work on the same data files (the library). When no queries come in, they go home (auto-suspend). When a query arrives, they come back (auto-resume). You pay only for the hours they work.
Sizing:
- Sizes run XS, S, M, L, XL, 2XL, 3XL, 4XL. Each step up doubles the compute and the credits/hour (XS = 1 credit/hr, S = 2, M = 4, L = 8, XL = 16, ...)
- A bigger warehouse runs a parallelizable query roughly twice as fast, so cost can stay about the same while latency halves
- Start small, measure, and resize; billing is per-second with a 60-second minimum
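Resizing an existing warehouse is a single statement; this sketch uses an assumed warehouse name:

```sql
-- Scale UP before a heavy job; queries already running finish on the
-- current size, new queries use the new size.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale back down when the heavy job is done
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL';
```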
Multi-Cluster Warehouse:
-- Create a multi-cluster warehouse for BI dashboards
-- (handles many users running queries simultaneously)
CREATE WAREHOUSE bi_dashboard_wh
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1 -- At minimum, 1 cluster running
MAX_CLUSTER_COUNT = 5 -- Scale up to 5 clusters during peak load
SCALING_POLICY = 'STANDARD'; -- Add clusters when there's queuing
-- 'ECONOMY' policy = adds a cluster only when Snowflake estimates enough
-- queued load to keep the new cluster busy for at least 6 minutes
-- When 50 BI users hit the dashboard at 9 AM:
-- Cluster 1 handles users 1-10
-- Cluster 2 auto-added for users 11-20 (scale-out!)
-- Cluster 3 auto-added for users 21-30
-- etc.
-- At 3 PM, usage drops → clusters 2-5 auto-removed
-- Auto-suspend (stop billing after 5 minutes idle)
ALTER WAREHOUSE bi_dashboard_wh SET AUTO_SUSPEND = 300; -- 300 seconds = 5 minutes
-- Auto-resume (start automatically when query arrives)
ALTER WAREHOUSE bi_dashboard_wh SET AUTO_RESUME = TRUE;
Scaling UP vs Scaling OUT:
| | Scale UP | Scale OUT |
|---|---|---|
| What | Bigger size (M → L → XL) | More clusters (1 → 2 → 3) |
| Fixes | Slow queries (one complex query) | Queuing (many users waiting) |
| Example | Slow JOIN on 10B rows | 100 analysts hitting dashboard at once |
| Cost | Higher credits/hour | More clusters × credits/hour |
Interview tip: "For Amadeus's BI dashboards with 200+ analysts, I'd use a multi-cluster warehouse (MIN=1, MAX=5) with ECONOMY scaling. For overnight ETL jobs, I'd use a single Large or XL warehouse since it's one complex pipeline, not many concurrent users. Different workloads → different warehouses."
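The ETL warehouse described in the tip might be created like this sketch; the name and settings are illustrative choices, not Amadeus specifics:

```sql
-- Single XL warehouse for the overnight ETL pipeline
CREATE WAREHOUSE etl_nightly_wh
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND = 60           -- ETL runs in bursts; suspend after 1 min idle
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE; -- don't start billing until the first job runs
```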
Q5: Explain the 3 levels of Caching in Snowflake.
Simple Explanation: Snowflake has 3 types of cache — each one faster than the previous, and free. When Snowflake runs your query, it checks cache first at each level before doing actual work.
Real-world analogy:
- Level 1 (Result Cache): Your professor asks the same exam question twice. You just say the same answer — no need to re-think.
- Level 2 (Local Disk Cache): You already have the book open at the right page — read from there.
- Level 3 (Remote Disk Cache): You know which shelf the book is on — much faster than searching the whole library.
Technical details:
- Level 1: Result Cache. Lives in the Cloud Services layer. If the exact same query text runs again within 24 hours and the underlying data hasn't changed, Snowflake returns the stored result instantly: no warehouse needed, no credits spent.
- Level 2: Local Disk Cache (warehouse cache). SSDs on the warehouse nodes hold recently read micro-partitions. It persists across queries while the warehouse stays running and is cleared when the warehouse suspends.
- Level 3: Remote Disk Cache. Metadata in Cloud Services plus the cloud object storage itself (S3/ADLS/GCS): pruning means Snowflake fetches only the micro-partitions a query actually needs.
⚠️ Common trap: "If you suspend and resume a warehouse, the Level 2 local cache is CLEARED. For dashboards with repeated queries, keep the warehouse running with a longer auto-suspend (15-30 min) so the local cache stays warm."
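One way to see the result cache in action, and to switch it off when benchmarking (USE_CACHED_RESULT is a real session parameter; the table name is from the running example):

```sql
-- Run the identical query twice: the second run is typically served
-- from the result cache (instant, no warehouse credits).
SELECT airline_code, COUNT(*) FROM bookings GROUP BY airline_code;
SELECT airline_code, COUNT(*) FROM bookings GROUP BY airline_code;

-- When benchmarking warehouse performance, disable the result cache
-- so every run does real work:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```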
Q6: What is Time Travel? What is Fail-safe?
Simple Explanation: Time Travel = ability to query your data as it was in the past (up to 90 days). "Show me the bookings table as it was yesterday at 3 PM." Fail-safe = an emergency recovery window AFTER time travel ends. Only Snowflake support can use it.
Real-world analogy:
- Time Travel = a "rewind" button. You can rewind your table to any point in the past (within the retention window) and read it.
- Fail-safe = the data recovery team at your cloud provider. After your rewind window closes, they still have the data locked away for 7 more days (but only THEY can access it for disaster recovery).
-- TIME TRAVEL EXAMPLES:
-- Query as of exact timestamp (useful for debugging)
SELECT * FROM bookings
AT (TIMESTAMP => '2026-03-20 09:00:00'::TIMESTAMP_LTZ);
-- Query as of N minutes ago
SELECT * FROM bookings AT (OFFSET => -60 * 30); -- 30 minutes ago
-- Query as of a specific statement ID (query ID)
SELECT * FROM bookings
BEFORE (STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');
-- Restore an accidentally dropped table
UNDROP TABLE bookings; -- Works within time travel window
-- Restore to a previous version (overwrite current with past version)
CREATE OR REPLACE TABLE bookings CLONE bookings
AT (TIMESTAMP => '2026-03-24 00:00:00'::TIMESTAMP_LTZ);
-- See what changed between two points in time (audit)
SELECT a.booking_id, a.amount as old_amount, b.amount as new_amount
FROM bookings AT (TIMESTAMP => '2026-03-24 00:00:00') a
JOIN bookings AT (TIMESTAMP => '2026-03-25 00:00:00') b
ON a.booking_id = b.booking_id
WHERE a.amount != b.amount; -- Find what changed!
Time Travel vs Fail-safe:
┌────────────────────────────────────────────────────────────────┐
│ Timeline of a table after DELETION:                            │
│                                                                │
│ [Data Deleted]─────────────────────────────────────────────→   │
│ │                                                              │
│ ├── 0-90 days: TIME TRAVEL (YOU can access)                    │
│ │     Default: 1 day; up to 90 days on Enterprise edition      │
│ │     You can: UNDROP, query old data, clone old version       │
│ │                                                              │
│ └── 90-97 days: FAIL-SAFE (SNOWFLAKE SUPPORT only)             │
│       7 days fixed — you cannot access directly                │
│       Only for catastrophic disasters (entire account wipe)    │
└────────────────────────────────────────────────────────────────┘
-- SET TIME TRAVEL on a table:
ALTER TABLE bookings SET DATA_RETENTION_TIME_IN_DAYS = 30;
-- 30 days for production tables (costs more storage)
-- 0 days for staging/temp tables (saves storage)
Interview tip: "For Amadeus booking tables, I'd set time travel to 30 days for production tables — covers month-end reporting cycles. For temp/staging tables, I'd set it to 0 to save storage costs. And I'd document that fail-safe exists but only Snowflake support can use it — it's NOT a self-service feature."
Q7: What is Zero-Copy Cloning?
Simple Explanation: Zero-Copy Clone = instantly create an exact copy of a table, schema, or database — without copying ANY data. The clone points to the same micro-partitions as the original. Storage is only used when data in the clone DIVERGES from the original.
Real-world analogy: Imagine making a "copy" of a 10 TB hard drive, but instead of actually copying 10 TB of data, you just create a new label that says "copy of drive A". When someone writes new data to the copy, THAT new data is stored separately. But unchanged data is shared — no extra storage.
-- Clone a table (instant, no data copied!)
CREATE TABLE bookings_dev CLONE bookings;
-- Creates bookings_dev that shares ALL micro-partitions with bookings
-- Takes < 1 second regardless of table size (10 GB or 10 PB)!
-- Clone a schema (all tables in the schema, instantly)
CREATE SCHEMA analytics_dev CLONE analytics_prod;
-- Clone a database (all schemas + tables, instantly!)
CREATE DATABASE prod_backup CLONE prod_db;
-- Clone at a specific point in time (combine with time travel)
CREATE TABLE bookings_march_snapshot CLONE bookings
AT (TIMESTAMP => '2026-03-01 00:00:00'::TIMESTAMP_LTZ);
-- Creates a copy as it was on March 1 — perfect for month-end snapshots
When to use:
- Dev/test environments: Clone prod database for developers — no data copying, instant
- Data experiments: Clone table before a risky transformation — easy rollback
- Monthly snapshots: Clone at month-end for point-in-time reporting
- Parallel pipelines: Two teams working on different transformations of same data
⚠️ Common trap: "Zero-copy means zero INITIAL copy. But over time, as data in the clone changes, it diverges and uses its own storage for the changed micro-partitions. If you write a lot to the clone, storage cost grows."
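One way to watch that divergence, assuming the SNOWFLAKE.ACCOUNT_USAGE share is accessible to your role: TABLE_STORAGE_METRICS groups clones by CLONE_GROUP_ID and shows the bytes each table owns itself.

```sql
-- Tables in the same CLONE_GROUP_ID share micro-partitions;
-- ACTIVE_BYTES shows storage each table owns after divergence.
SELECT table_name, clone_group_id, active_bytes
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
WHERE table_name IN ('BOOKINGS', 'BOOKINGS_DEV')
ORDER BY clone_group_id;
```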
Q8: What is VARIANT? How does Snowflake handle semi-structured data?
Simple Explanation: VARIANT is Snowflake's special data type for storing ANY semi-structured data — JSON, Avro, Parquet, ORC, XML — in a single column. Instead of flattening JSON into separate columns before loading (which is hard), you load it as-is into a VARIANT column and query it with dot notation.
Real-world analogy: VARIANT is like a "whatever" drawer. Instead of sorting everything before putting it in the drawer (JSON key1 → column1, key2 → column2), you just throw the whole JSON document in. Then when you need something, you reach in and say "give me the .passenger.email" and Snowflake finds it.
-- Create table with VARIANT column (for raw JSON data)
CREATE TABLE raw_bookings (
load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
source VARCHAR(50),
raw_data VARIANT -- Can hold ANY JSON/Avro/Parquet structure
);
-- Load JSON data into VARIANT column
COPY INTO raw_bookings (source, raw_data)
FROM (
SELECT 'booking_api', PARSE_JSON($1) -- $1 = first column from file
FROM @booking_stage/bookings.json
);
-- Query VARIANT data with dot notation (: or . both work)
SELECT
raw_data:booking_id::VARCHAR AS booking_id,
-- ::VARCHAR casts the JSON value to the right SQL type
raw_data:passenger.first_name::VARCHAR AS passenger_name,
-- Nested object: passenger.first_name
raw_data:segments[0]:origin::VARCHAR AS departure_airport,
-- Array access: segments[0] = first segment
raw_data:segments[0]:destination::VARCHAR AS arrival_airport,
raw_data:fare.amount::NUMBER(10,2) AS fare_amount
FROM raw_bookings
WHERE raw_data:status::VARCHAR = 'CONFIRMED';
-- Filter inside VARIANT with dot notation + cast
-- FLATTEN: Convert array inside JSON into rows
SELECT
b.raw_data:booking_id::VARCHAR AS booking_id,
f.value:origin::VARCHAR AS segment_origin,
f.value:destination::VARCHAR AS segment_destination,
f.index AS segment_number
FROM raw_bookings b,
LATERAL FLATTEN(input => b.raw_data:segments) f;
-- LATERAL FLATTEN expands the "segments" array into separate rows
-- Each segment becomes its own row with booking_id repeated
-- Result:
-- booking_id segment_origin segment_destination segment_number
-- BK001 DEL DXB 0
-- BK001 DXB LHR 1
-- BK002 BOM SIN 0
Newer feature: Structured ARRAY, OBJECT, MAP types
-- Instead of untyped VARIANT, Snowflake now supports typed structured types
-- These give better performance and compile-time schema checking
CREATE TABLE bookings_v2 (
booking_id VARCHAR,
segments ARRAY(OBJECT(origin VARCHAR, destination VARCHAR, duration INT)),
-- Typed array of objects — Snowflake validates schema at insert time!
tags MAP(VARCHAR, VARCHAR)
-- Key-value pairs with typed keys and values
);
-- Note: VARIANT still works and is more flexible. Typed arrays/objects are faster.
Interview tip: "I'd load raw JSON into VARIANT at Bronze layer — no schema definition needed. Then at Silver layer, use FLATTEN and dot notation to extract and cast into proper typed columns. This way Bronze handles any schema change gracefully, and Silver catches quality issues."
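The Bronze-to-Silver pattern from the tip might look like this sketch, reusing the raw_bookings layout from the examples above (the Silver table name and the CONFIRMED filter are illustrative choices):

```sql
-- Silver layer: extract and cast typed columns out of the raw VARIANT
CREATE OR REPLACE TABLE silver_bookings AS
SELECT
    raw_data:booking_id::VARCHAR       AS booking_id,
    raw_data:booking_date::DATE        AS booking_date,
    raw_data:fare.amount::NUMBER(10,2) AS fare_amount,
    load_time
FROM raw_bookings
WHERE raw_data:status::VARCHAR = 'CONFIRMED';
```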
Q9: What are Stages? Internal vs External?
Simple Explanation: A Stage in Snowflake is a pointer to a location where files live before or after loading. Think of it as a "loading dock" — data files sit in the dock, then COPY INTO loads them into Snowflake tables. Without a stage, you can't load data into Snowflake.
Real-world analogy: A stage is like the receiving dock of a warehouse. Trucks (data sources) drop files at the dock (stage). The warehouse workers (COPY INTO) pick up files from the dock and put them in the shelves (tables). The dock can be Snowflake's own parking lot (internal stage) or an external parking lot you own (external stage like ADLS/S3).
-- ══════════════════════════════════════════════════════
-- TYPE 1: USER STAGE (per-user, automatic, always exists)
-- ══════════════════════════════════════════════════════
-- Each user gets their own private stage: @~
PUT file:///local/bookings.csv @~; -- Upload local file to your user stage
LIST @~; -- See your files
COPY INTO bookings FROM @~/bookings.csv; -- Load into table
-- ══════════════════════════════════════════════════════
-- TYPE 2: TABLE STAGE (per-table, automatic, always exists)
-- ══════════════════════════════════════════════════════
-- Every table has a built-in stage: @%table_name
PUT file:///local/bookings.csv @%bookings; -- Upload to bookings table's stage
COPY INTO bookings FROM @%bookings; -- Load from table's stage
-- ══════════════════════════════════════════════════════
-- TYPE 3: NAMED STAGE (create your own — most flexible)
-- ══════════════════════════════════════════════════════
-- Internal Named Stage (files stored IN Snowflake's storage)
CREATE STAGE internal_booking_stage
COMMENT = 'Internal stage for booking files';
PUT file:///local/bookings_*.csv @internal_booking_stage; -- Upload CSV files
COPY INTO bookings FROM @internal_booking_stage PATTERN='.*bookings.*\\.csv';
-- External Named Stage (files in YOUR Azure ADLS / S3 / GCS)
CREATE STAGE azure_booking_stage
URL = 'azure://mystorageaccount.blob.core.windows.net/bookings-container/landing/'
CREDENTIALS = (AZURE_SAS_TOKEN = 'sv=2020-08...')
FILE_FORMAT = (TYPE = 'PARQUET');
-- Points to ADLS container where your pipeline drops files
COPY INTO bookings FROM @azure_booking_stage; -- Load from ADLS
Internal vs External Stage comparison:
| | Internal Stage | External Stage |
|---|---|---|
| Data stored | Snowflake's managed storage | Your S3/ADLS/GCS |
| Control | Snowflake manages it | You manage it |
| Egress cost | No egress to load | May incur egress if cross-region |
| Sharing | Only inside Snowflake | External tools can also access it |
| Best for | Small/simple loads, testing | Production ETL where pipeline writes to ADLS |
Q10: [DIRECT] Common Basic Questions
📝 What is Snowflake edition difference (Standard, Enterprise, Business Critical)?
Standard → Core features, Time Travel up to 1 day. Enterprise → Time Travel up to 90 days, multi-cluster warehouses, materialized views, column-level security (masking). Business Critical → HIPAA/PCI compliance, private connectivity (PrivateLink), Tri-Secret Secure.
📝 What cloud providers does Snowflake support?
AWS, Azure, and GCP. When you create a Snowflake account, you choose the cloud + region. Data stored in that cloud's object storage (S3/ADLS/GCS).
📝 What is the difference between a database, schema, and table in Snowflake?
Database → top-level container (like a folder). Schema → sub-container inside a database. Table → data object inside a schema. Full path: database.schema.table.
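A quick illustration of the hierarchy (all names made up):

```sql
CREATE DATABASE analytics;
CREATE SCHEMA analytics.sales;
CREATE TABLE analytics.sales.daily_revenue (day DATE, amount NUMBER);

-- Fully qualified name works from any session context:
SELECT * FROM analytics.sales.daily_revenue;
```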
📝 What is an Account Identifier?
Unique identifier for your Snowflake account — format: orgname-accountname (e.g., amadeus-prod). Used in JDBC URLs and Snowpark connections.
📝 What is SnowSQL?
Snowflake's command-line SQL client. Connect to Snowflake, run queries, load data — all from terminal. Alternative to the Snowflake web UI.
SECTION 2: SNOWFLAKE vs DATABRICKS (Critical Comparison)
Q11: Snowflake vs Databricks — When to use which?
Simple Explanation: This is the #1 comparison question you'll get since you're preparing for both. They solve DIFFERENT problems — they're often used TOGETHER in modern data platforms.
When to use Snowflake:
- BI dashboards with 100+ concurrent users
- SQL analysts without engineering skills
- Data sharing with external partners
- Mixed JSON + structured data analytics
- When you want zero infrastructure management
When to use Databricks:
- Complex Python ETL (pandas, PySpark)
- Machine learning pipeline + MLflow
- Streaming data from Kafka
- You need open-source portability
- Large-scale data transformation (not just querying)
Real-world at Amadeus: "Databricks handles the heavy ingestion and transformation (Oracle CDC → Kafka → Bronze → Silver layers). The clean Silver data is then shared with Snowflake for BI analysts to query with SQL. Both tools, one pipeline."
Interview tip: "Snowflake and Databricks are complementary, not competing. I'd use Databricks for the data engineering heavy lifting and Snowflake for SQL-heavy analytics and partner data sharing. Many enterprise platforms today use both."
SECTION 3: SCENARIO-BASED QUESTIONS
Q12: Scenario — Design a Snowflake architecture for Amadeus booking analytics
Context: 10 billion bookings/year, 200+ airline partners, 500 BI analysts, GDPR compliance, real-time pricing queries, historical reporting up to 5 years.
Answer:
┌─────────────────────────────────────────────────────────────────┐
│ AMADEUS SNOWFLAKE ANALYTICS PLATFORM                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ INGESTION:                                                      │
│   Databricks Silver tables → Snowflake via Snowpipe Streaming   │
│   Partner files (CSV/JSON) → External Stage (ADLS) → COPY INTO  │
│   Real-time prices (Kafka) → Snowpipe (auto-ingest)             │
│                                                                 │
│ DATABASES:                                                      │
│   RAW_DB       → Bronze (raw data, 90-day time travel)          │
│   ANALYTICS_DB → Silver/Gold (clean data, 30-day time travel)   │
│   SHARE_DB     → Partner-specific views (data sharing)          │
│                                                                 │
│ VIRTUAL WAREHOUSES (workload isolation):                        │
│   ETL_WH     → XL, loading & transformation (overnight)         │
│   BI_WH      → M, multi-cluster (1-10), BI analysts (9-6 PM)    │
│   PRICING_WH → L, real-time fare queries (always on)            │
│   DS_WH      → XL, data science queries (on demand)             │
│                                                                 │
│ CLUSTERING KEYS:                                                │
│   bookings_fact: CLUSTER BY (booking_date, airline_code)        │
│   passengers:    CLUSTER BY (passenger_id)                      │
│                                                                 │
│ SECURITY:                                                       │
│   Role hierarchy: ACCOUNTADMIN > SYSADMIN > team roles          │
│   Column masking: email, phone, passport_no (PII/GDPR)          │
│   Row access policies: each airline sees only their bookings    │
│                                                                 │
│ SHARING:                                                        │
│   Secure Share per airline → each airline's secure portal       │
│   Data Clean Room: cross-airline analysis without raw PII       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
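The security block above could be sketched like this; role, table, and column names are assumptions for illustration:

```sql
-- Column masking: non-privileged roles see masked emails (PII/GDPR)
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '*** MASKED ***' END;

ALTER TABLE passengers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row access: each airline role sees only its own bookings
CREATE ROW ACCESS POLICY airline_rows AS (airline_code VARCHAR) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'SYSADMIN' OR airline_code = CURRENT_ROLE();

ALTER TABLE bookings_fact ADD ROW ACCESS POLICY airline_rows ON (airline_code);
```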
Q13: Scenario — Why is a specific query slow? How to diagnose?
-- Step 1: Find the slow query in query history
SELECT query_id, query_text, total_elapsed_time/1000 as seconds,
bytes_scanned/1024/1024/1024 as gb_scanned,
partitions_scanned, partitions_total,
-- Low ratio = bad pruning (scanning too many partitions)
ROUND(partitions_scanned / NULLIF(partitions_total, 0) * 100, 1) as pct_scanned
-- NULLIF avoids divide-by-zero for metadata-only queries
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE user_name = 'ANALYST_USER'
ORDER BY total_elapsed_time DESC
LIMIT 10;
-- Step 2: Check if clustering is helping
SELECT SYSTEM$CLUSTERING_INFORMATION('bookings', '(booking_date)');
-- Look at: average_depth (>6 = needs reclustering)
-- average_overlaps (>5 = partitions heavily mixed)
-- Step 3: Check execution details
SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('query-id-here'));
-- Shows each operator: TableScan, Join, Aggregate
-- Find which step takes the most time
-- Step 4: Check whether caches helped
SELECT bytes_scanned, percentage_scanned_from_cache
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_id = 'your-query-id';
-- bytes_scanned = 0 with near-instant elapsed time → served from the result cache (free!)
-- percentage_scanned_from_cache → share read from the warehouse's local disk cache