Day 1: Snowflake Architecture & Core Concepts
SECTION 1: SNOWFLAKE ARCHITECTURE (1.5 hours)
Q1: What is Snowflake? Explain its architecture.
Simple Explanation: Snowflake is a cloud data warehouse — a platform where you store and query massive amounts of data using SQL. Unlike traditional databases (Oracle, SQL Server), Snowflake was built from scratch for the cloud. Its biggest innovation: separate storage from compute — the disk where data lives and the computers that process it are completely independent.
Real-world analogy: Think of a library (storage) and reading desks (compute). In a traditional database, reading desks are built INTO the library — only 10 desks, and they're always there (wasting space when empty). Snowflake is like: the library is in one building, and you can bring in 1 desk or 1000 desks as needed. When no one's reading, send all desks home (auto-suspend). The books (data) are always there regardless.
Why do we need it?
- Traditional databases: scale storage = must scale compute too (expensive, wasteful)
- Snowflake: scale each independently — run 10 warehouses on the same data simultaneously
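This independence can be sketched in SQL; the warehouse and table names below are made-up examples, not anything defined elsewhere in this guide:

```sql
-- Two independent warehouses of different sizes; both can query the
-- same table at the same time with zero resource contention.
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

USE WAREHOUSE etl_wh;              -- session 1 runs on etl_wh
SELECT COUNT(*) FROM bookings;     -- reads from shared storage

USE WAREHOUSE bi_wh;               -- session 2 runs on bi_wh
SELECT SUM(amount) FROM bookings;  -- same data, separate compute
```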
The 3-Layer Architecture:
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 3: CLOUD SERVICES │
│ The "brain" — always running, managed by Snowflake │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Query │ │ Metadata │ │ Auth & │ │ Optimizer │ │
│ │ Compiler │ │ Manager │ │ Security │ │ & Planner │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────────┘ │
│ • Parses SQL → builds execution plan │
│ • Manages metadata (table schemas, clustering info) │
│ • Handles login, RBAC, encryption │
│ • Optimizes queries (pruning, caching decisions) │
└───────────────────────────┬──────────────────────────────────────┘
│
┌───────────────────────────▼──────────────────────────────────────┐
│ LAYER 2: COMPUTE (Virtual Warehouses) │
│ The "muscles" — you turn them on when needed │
│ │
│ Warehouse A (XS) Warehouse B (L) Warehouse C (XL) │
│ [ETL pipelines] [BI dashboards] [ML training] │
│ ↕ ↕ ↕ │
│ Reads from storage Reads from storage Reads from storage │
│ │
│ • Independent from each other — no resource contention │
│ • Auto-suspend (stop billing when idle) │
│ • Auto-resume (start automatically when query arrives) │
└───────────────────────────┬──────────────────────────────────────┘
│
┌───────────────────────────▼──────────────────────────────────────┐
│ LAYER 1: STORAGE │
│ The "filing cabinet" — always on, centralized │
│ │
│ Data stored in Snowflake's proprietary compressed format │
│ (micro-partitions — columnar, compressed Parquet-like files) │
│ On AWS S3 / Azure ADLS Gen2 / Google Cloud Storage │
│ │
│ • Charged separately from compute (per TB/month) │
│ • Stores table data + metadata + Time Travel versions │
│ • Encrypted at rest (AES-256) │
└──────────────────────────────────────────────────────────────────┘
Key properties:
- Storage: Compressed columnar files in cloud object storage (S3/ADLS/GCS)
- Compute: Virtual warehouses — separate clusters, independent scaling
- Cloud Services: Query compilation, metadata, security — always on; free up to 10% of daily compute usage, billed only beyond that
Interview tip: "Snowflake's key differentiator is true separation of storage and compute. Multiple virtual warehouses can query the same data simultaneously without competing for resources — something classic MPP designs like original Redshift or Synapse struggle to match."
Q2: What are Micro-Partitions? How are they different from Hive partitions?
Simple Explanation: All data in Snowflake is automatically split into small pieces called micro-partitions. You don't define them — Snowflake creates them automatically as data is loaded. Each micro-partition is like a small drawer in a filing cabinet, holding 50-500 MB of data (measured before compression; stored compressed, in columnar format).
Real-world analogy: Imagine a huge bookshelf of flight booking records. Instead of one massive file, Snowflake automatically splits records into drawers of roughly equal size. On the label of each drawer: "min date = Jan 1, max date = Jan 15, airlines: AI, LH, EK". When you ask for March bookings, Snowflake reads only the labels and skips drawers that can't possibly contain March data.
Why do we need them? Without micro-partitions: read the ENTIRE dataset even for one day's data. With micro-partitions: Snowflake reads labels (metadata) first, skips irrelevant micro-partitions → only reads 1% of data for a filtered query.
Technical details:
-- Query: Get all Lufthansa bookings in March 2026
SELECT * FROM bookings
WHERE booking_date BETWEEN '2026-03-01' AND '2026-03-31'
AND airline_code = 'LH';
-- What Snowflake does (PARTITION PRUNING):
-- 1. Reads metadata of ALL micro-partitions (from Cloud Services layer — fast!)
-- 2. Skips micro-partitions where max_date < March 1 OR min_date > March 31
-- 3. Of remaining, skips where 'LH' not in airlines list
-- 4. Reads ONLY relevant micro-partitions
-- Result: Instead of reading 10B rows, reads maybe 50M rows → 200x faster!
Micro-partitions vs Hive/Databricks partitions:
| Feature | Snowflake Micro-partitions | Hive/Databricks Partitions |
|---|---|---|
| Created by | Snowflake automatically | You must define manually |
| Granularity | 50-500 MB (uncompressed) each (fine-grained) | Entire folder per partition value |
| Too many? | Never — Snowflake manages automatically | Yes — too many partitions = metadata overload |
| Metadata | Min/max per column per micro-partition | Just folder structure |
| Overlap | Can overlap (managed by clustering) | No overlap by design |
| Maintenance | Automatic (reclustering if needed) | Manual (must re-partition) |
⚠️ Key Interview Point: In Snowflake, you do NOT manually partition tables like in Hive. Micro-partitions are automatic. You may define clustering keys to optimize micro-partition layout — but this is separate from partitioning.
Interview tip: "Unlike Hive where you define partition columns and manage folder structure yourself, Snowflake automatically creates micro-partitions for every table. When data layout degrades over time (high overlap), I define a clustering key which triggers automatic reclustering — no manual maintenance needed."
Q3: What are Clustering Keys? When should you use them?
Simple Explanation: When data is inserted into Snowflake over time, micro-partitions can become mixed up — a partition might contain data from January, March, and July all mixed together. When you query for March data, Snowflake has to read ALL partitions. This is called high overlap or poor clustering.
A clustering key tells Snowflake: "When organizing micro-partitions, try to group rows with similar values for this column together." Snowflake then automatically re-clusters the table in the background.
Real-world analogy: Your filing cabinet started organized by date (Jan → Feb → March). Then you added 1 million new records randomly, and now each drawer has records from all months mixed in. Clustering key = hiring an assistant to re-sort and re-organize the drawers back to chronological order.
-- Create a table WITH clustering key (good for large tables queried by date)
CREATE TABLE bookings (
booking_id VARCHAR,
booking_date DATE,
airline_code VARCHAR,
passenger_id VARCHAR,
amount NUMBER
)
CLUSTER BY (booking_date, airline_code);
-- This tells Snowflake: "Group micro-partitions by booking_date first, then airline_code"
-- Snowflake will keep data with same booking_date in the same or nearby micro-partitions
-- Check clustering health of a table
SELECT SYSTEM$CLUSTERING_INFORMATION('bookings', '(booking_date, airline_code)');
-- Returns: average_depth (lower = better), average_overlaps (lower = better)
-- Manually trigger reclustering (usually not needed — it's automatic)
ALTER TABLE bookings RECLUSTER;
When to use clustering keys:
- Very large tables (roughly 1 TB and up); small tables rarely benefit
- Queries consistently filter or join on the same few columns
- SYSTEM$CLUSTERING_INFORMATION reports high average_depth / average_overlaps
- Pruning is poor: queries scan a large percentage of total micro-partitions
Interview tip: "For our Amadeus booking table with 10 billion rows, I'd cluster by (booking_date, airline_code) since 90% of queries filter by date range and specific airline. But first I'd check SYSTEM$CLUSTERING_INFORMATION to confirm the current overlap is actually high — clustering costs credits, so don't add it unless needed."
Q4: What is a Virtual Warehouse? Explain sizing and multi-cluster.
Simple Explanation: A Virtual Warehouse is the compute in Snowflake — a cluster of cloud VMs (servers) that execute your queries. You size it (XS, S, M, L, XL, 2XL, 3XL, 4XL), it auto-starts when a query arrives, runs the query, and auto-suspends when idle. You pay only when it's running.
Real-world analogy: A virtual warehouse is like a team of analysts in an office. XS = 1 analyst, XL = 32 analysts. They all work on the same data files (the library). When no queries come in, they go home (auto-suspend). When a query arrives, they come back (auto-resume). You pay only for the hours they work.
Sizing:
- Sizes run XS, S, M, L, XL, 2XL, 3XL, 4XL. Each step up doubles the compute and the credits/hour (XS = 1 credit/hr, S = 2, M = 4, L = 8, XL = 16, ...)
- A bigger warehouse runs a parallelizable query roughly twice as fast, so cost can stay about the same while latency halves
- Start small, measure, and resize; billing is per-second with a 60-second minimum
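Resizing an existing warehouse is a single statement; this sketch uses an assumed warehouse name:

```sql
-- Scale UP before a heavy job; queries already running finish on the
-- current size, new queries use the new size.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale back down when the heavy job is done
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL';
```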
Multi-Cluster Warehouse:
-- Create a multi-cluster warehouse for BI dashboards
-- (handles many users running queries simultaneously)
CREATE WAREHOUSE bi_dashboard_wh
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1 -- At minimum, 1 cluster running
MAX_CLUSTER_COUNT = 5 -- Scale up to 5 clusters during peak load
SCALING_POLICY = 'STANDARD'; -- Add clusters when there's queuing
-- 'ECONOMY' policy = adds a cluster only when Snowflake estimates enough
-- queued load to keep the new cluster busy for at least 6 minutes
-- When 50 BI users hit the dashboard at 9 AM:
-- Cluster 1 handles users 1-10
-- Cluster 2 auto-added for users 11-20 (scale-out!)
-- Cluster 3 auto-added for users 21-30
-- etc.
-- At 3 PM, usage drops → clusters 2-5 auto-removed
-- Auto-suspend (stop billing after 5 minutes idle)
ALTER WAREHOUSE bi_dashboard_wh SET AUTO_SUSPEND = 300; -- 300 seconds = 5 minutes
-- Auto-resume (start automatically when query arrives)
ALTER WAREHOUSE bi_dashboard_wh SET AUTO_RESUME = TRUE;
Scaling UP vs Scaling OUT:
| | Scale UP | Scale OUT |
|---|---|---|
| What | Bigger size (M → L → XL) | More clusters (1 → 2 → 3) |
| Fixes | Slow queries (one complex query) | Queuing (many users waiting) |
| Example | Slow JOIN on 10B rows | 100 analysts hitting dashboard at once |
| Cost | Higher credits/hour | More clusters × credits/hour |
Interview tip: "For Amadeus's BI dashboards with 200+ analysts, I'd use a multi-cluster warehouse (MIN=1, MAX=5) with ECONOMY scaling. For overnight ETL jobs, I'd use a single Large or XL warehouse since it's one complex pipeline, not many concurrent users. Different workloads → different warehouses."
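The ETL warehouse described in the tip might be created like this sketch; the name and settings are illustrative choices, not Amadeus specifics:

```sql
-- Single XL warehouse for the overnight ETL pipeline
CREATE WAREHOUSE etl_nightly_wh
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND = 60           -- ETL runs in bursts; suspend after 1 min idle
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE; -- don't start billing until the first job runs
```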
Q5: Explain the 3 levels of Caching in Snowflake.
Simple Explanation: Snowflake has 3 types of cache — each one faster than the previous, and free. When Snowflake runs your query, it checks cache first at each level before doing actual work.
Real-world analogy:
- Level 1 (Result Cache): Your professor asks the same exam question twice. You just say the same answer — no need to re-think.
- Level 2 (Local Disk Cache): You already have the book open at the right page — read from there.
- Level 3 (Remote Disk Cache): You know which shelf the book is on — much faster than searching the whole library.
Technical details:
- Level 1: Result Cache. Lives in the Cloud Services layer. If the exact same query text runs again within 24 hours and the underlying data hasn't changed, Snowflake returns the stored result instantly: no warehouse needed, no credits spent.
- Level 2: Local Disk Cache (warehouse cache). SSDs on the warehouse nodes hold recently read micro-partitions. It persists across queries while the warehouse stays running and is cleared when the warehouse suspends.
- Level 3: Remote Disk Cache. Metadata in Cloud Services plus the cloud object storage itself (S3/ADLS/GCS): pruning means Snowflake fetches only the micro-partitions a query actually needs.
⚠️ Common trap: "If you suspend and resume a warehouse, the Level 2 local cache is CLEARED. For dashboards with repeated queries, keep the warehouse running with a longer auto-suspend (15-30 min) so the local cache stays warm."
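One way to see the result cache in action, and to switch it off when benchmarking (USE_CACHED_RESULT is a real session parameter; the table name is from the running example):

```sql
-- Run the identical query twice: the second run is typically served
-- from the result cache (instant, no warehouse credits).
SELECT airline_code, COUNT(*) FROM bookings GROUP BY airline_code;
SELECT airline_code, COUNT(*) FROM bookings GROUP BY airline_code;

-- When benchmarking warehouse performance, disable the result cache
-- so every run does real work:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```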
Q6: What is Time Travel? What is Fail-safe?
Simple Explanation: Time Travel = ability to query your data as it was in the past (up to 90 days). "Show me the bookings table as it was yesterday at 3 PM." Fail-safe = an emergency recovery window AFTER time travel ends. Only Snowflake support can use it.
Real-world analogy:
- Time Travel = a "rewind" button. You can rewind your table to any point in the past (within the retention window) and read it.
- Fail-safe = the data recovery team at your cloud provider. After your rewind window closes, they still have the data locked away for 7 more days (but only THEY can access it for disaster recovery).
-- TIME TRAVEL EXAMPLES:
-- Query as of exact timestamp (useful for debugging)
SELECT * FROM bookings
AT (TIMESTAMP => '2026-03-20 09:00:00'::TIMESTAMP_LTZ);
-- Query as of N minutes ago
SELECT * FROM bookings AT (OFFSET => -60 * 30); -- 30 minutes ago
-- Query as of a specific statement ID (query ID)
SELECT * FROM bookings
BEFORE (STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');
-- Restore an accidentally dropped table
UNDROP TABLE bookings; -- Works within time travel window
-- Restore to a previous version (overwrite current with past version)
CREATE OR REPLACE TABLE bookings CLONE bookings
AT (TIMESTAMP => '2026-03-24 00:00:00'::TIMESTAMP_LTZ);
-- See what changed between two points in time (audit)
SELECT a.booking_id, a.amount as old_amount, b.amount as new_amount
FROM bookings AT (TIMESTAMP => '2026-03-24 00:00:00') a
JOIN bookings AT (TIMESTAMP => '2026-03-25 00:00:00') b
ON a.booking_id = b.booking_id
WHERE a.amount != b.amount; -- Find what changed!
Time Travel vs Fail-safe:
┌────────────────────────────────────────────────────────────────┐
│ Timeline of a table after DELETION:                            │
│                                                                │
│ [Data Deleted]─────────────────────────────────────────────→   │
│ │                                                              │
│ ├── 0-90 days: TIME TRAVEL (YOU can access)                    │
│ │     Default: 1 day; up to 90 days on Enterprise edition      │
│ │     You can: UNDROP, query old data, clone old version       │
│ │                                                              │
│ └── 90-97 days: FAIL-SAFE (SNOWFLAKE SUPPORT only)             │
│       7 days fixed — you cannot access directly                │
│       Only for catastrophic disasters (entire account wipe)    │
└────────────────────────────────────────────────────────────────┘
-- SET TIME TRAVEL on a table:
ALTER TABLE bookings SET DATA_RETENTION_TIME_IN_DAYS = 30;
-- 30 days for production tables (costs more storage)
-- 0 days for staging/temp tables (saves storage)
Interview tip: "For Amadeus booking tables, I'd set time travel to 30 days for production tables — covers month-end reporting cycles. For temp/staging tables, I'd set it to 0 to save storage costs. And I'd document that fail-safe exists but only Snowflake support can use it — it's NOT a self-service feature."
Q7: What is Zero-Copy Cloning?
Simple Explanation: Zero-Copy Clone = instantly create an exact copy of a table, schema, or database — without copying ANY data. The clone points to the same micro-partitions as the original. Storage is only used when data in the clone DIVERGES from the original.
Real-world analogy: Imagine making a "copy" of a 10 TB hard drive, but instead of actually copying 10 TB of data, you just create a new label that says "copy of drive A". When someone writes new data to the copy, THAT new data is stored separately. But unchanged data is shared — no extra storage.
-- Clone a table (instant, no data copied!)
CREATE TABLE bookings_dev CLONE bookings;
-- Creates bookings_dev that shares ALL micro-partitions with bookings
-- Takes < 1 second regardless of table size (10 GB or 10 PB)!
-- Clone a schema (all tables in the schema, instantly)
CREATE SCHEMA analytics_dev CLONE analytics_prod;
-- Clone a database (all schemas + tables, instantly!)
CREATE DATABASE prod_backup CLONE prod_db;
-- Clone at a specific point in time (combine with time travel)
CREATE TABLE bookings_march_snapshot CLONE bookings
AT (TIMESTAMP => '2026-03-01 00:00:00'::TIMESTAMP_LTZ);
-- Creates a copy as it was on March 1 — perfect for month-end snapshots
When to use:
- Dev/test environments: Clone prod database for developers — no data copying, instant
- Data experiments: Clone table before a risky transformation — easy rollback
- Monthly snapshots: Clone at month-end for point-in-time reporting
- Parallel pipelines: Two teams working on different transformations of same data
⚠️ Common trap: "Zero-copy means zero INITIAL copy. But over time, as data in the clone changes, it diverges and uses its own storage for the changed micro-partitions. If you write a lot to the clone, storage cost grows."
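One way to watch that divergence, assuming the SNOWFLAKE.ACCOUNT_USAGE share is accessible to your role: TABLE_STORAGE_METRICS groups clones by CLONE_GROUP_ID and shows the bytes each table owns itself.

```sql
-- Tables in the same CLONE_GROUP_ID share micro-partitions;
-- ACTIVE_BYTES shows storage each table owns after divergence.
SELECT table_name, clone_group_id, active_bytes
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
WHERE table_name IN ('BOOKINGS', 'BOOKINGS_DEV')
ORDER BY clone_group_id;
```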
Q8: What is VARIANT? How does Snowflake handle semi-structured data?
Simple Explanation: VARIANT is Snowflake's special data type for storing ANY semi-structured data — JSON, Avro, Parquet, ORC, XML — in a single column. Instead of flattening JSON into separate columns before loading (which is hard), you load it as-is into a VARIANT column and query it with dot notation.
Real-world analogy: VARIANT is like a "whatever" drawer. Instead of sorting everything before putting it in the drawer (JSON key1 → column1, key2 → column2), you just throw the whole JSON document in. Then when you need something, you reach in and say "give me the .passenger.email" and Snowflake finds it.
-- Create table with VARIANT column (for raw JSON data)
CREATE TABLE raw_bookings (
load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
source VARCHAR(50),
raw_data VARIANT -- Can hold ANY JSON/Avro/Parquet structure
);
-- Load JSON data into VARIANT column
COPY INTO raw_bookings (source, raw_data)
FROM (
SELECT 'booking_api', PARSE_JSON($1) -- $1 = first column from file
FROM @booking_stage/bookings.json
);
-- Query VARIANT data with dot notation (: or . both work)
SELECT
raw_data:booking_id::VARCHAR AS booking_id,
-- ::VARCHAR casts the JSON value to the right SQL type
raw_data:passenger.first_name::VARCHAR AS passenger_name,
-- Nested object: passenger.first_name
raw_data:segments[0]:origin::VARCHAR AS departure_airport,
-- Array access: segments[0] = first segment
raw_data:segments[0]:destination::VARCHAR AS arrival_airport,
raw_data:fare.amount::NUMBER(10,2) AS fare_amount
FROM raw_bookings
WHERE raw_data:status::VARCHAR = 'CONFIRMED';
-- Filter inside VARIANT with dot notation + cast
-- FLATTEN: Convert array inside JSON into rows
SELECT
b.raw_data:booking_id::VARCHAR AS booking_id,
f.value:origin::VARCHAR AS segment_origin,
f.value:destination::VARCHAR AS segment_destination,
f.index AS segment_number
FROM raw_bookings b,
LATERAL FLATTEN(input => b.raw_data:segments) f;
-- LATERAL FLATTEN expands the "segments" array into separate rows
-- Each segment becomes its own row with booking_id repeated
-- Result:
-- booking_id segment_origin segment_destination segment_number
-- BK001 DEL DXB 0
-- BK001 DXB LHR 1
-- BK002 BOM SIN 0
Newer feature: Structured ARRAY, OBJECT, MAP types
-- Instead of untyped VARIANT, Snowflake now supports typed structured types
-- These give better performance and compile-time schema checking
CREATE TABLE bookings_v2 (
booking_id VARCHAR,
segments ARRAY(OBJECT(origin VARCHAR, destination VARCHAR, duration INT)),
-- Typed array of objects — Snowflake validates schema at insert time!
tags MAP(VARCHAR, VARCHAR)
-- Key-value pairs with typed keys and values
);
-- Note: VARIANT still works and is more flexible. Typed arrays/objects are faster.
Interview tip: "I'd load raw JSON into VARIANT at Bronze layer — no schema definition needed. Then at Silver layer, use FLATTEN and dot notation to extract and cast into proper typed columns. This way Bronze handles any schema change gracefully, and Silver catches quality issues."
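The Bronze-to-Silver pattern from the tip might look like this sketch, reusing the raw_bookings layout from the examples above (the Silver table name and the CONFIRMED filter are illustrative choices):

```sql
-- Silver layer: extract and cast typed columns out of the raw VARIANT
CREATE OR REPLACE TABLE silver_bookings AS
SELECT
    raw_data:booking_id::VARCHAR       AS booking_id,
    raw_data:booking_date::DATE        AS booking_date,
    raw_data:fare.amount::NUMBER(10,2) AS fare_amount,
    load_time
FROM raw_bookings
WHERE raw_data:status::VARCHAR = 'CONFIRMED';
```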
Q9: What are Stages? Internal vs External?
Simple Explanation: A Stage in Snowflake is a pointer to a location where files live before or after loading. Think of it as a "loading dock" — data files sit in the dock, then COPY INTO loads them into Snowflake tables. Without a stage, you can't load data into Snowflake.
Real-world analogy: A stage is like the receiving dock of a warehouse. Trucks (data sources) drop files at the dock (stage). The warehouse workers (COPY INTO) pick up files from the dock and put them in the shelves (tables). The dock can be Snowflake's own parking lot (internal stage) or an external parking lot you own (external stage like ADLS/S3).
-- ══════════════════════════════════════════════════════
-- TYPE 1: USER STAGE (per-user, automatic, always exists)
-- ══════════════════════════════════════════════════════
-- Each user gets their own private stage: @~
PUT file:///local/bookings.csv @~; -- Upload local file to your user stage
LIST @~; -- See your files
COPY INTO bookings FROM @~/bookings.csv; -- Load into table
-- ══════════════════════════════════════════════════════
-- TYPE 2: TABLE STAGE (per-table, automatic, always exists)
-- ══════════════════════════════════════════════════════
-- Every table has a built-in stage: @%table_name
PUT file:///local/bookings.csv @%bookings; -- Upload to bookings table's stage
COPY INTO bookings FROM @%bookings; -- Load from table's stage
-- ══════════════════════════════════════════════════════
-- TYPE 3: NAMED STAGE (create your own — most flexible)
-- ══════════════════════════════════════════════════════
-- Internal Named Stage (files stored IN Snowflake's storage)
CREATE STAGE internal_booking_stage
COMMENT = 'Internal stage for booking files';
PUT file:///local/bookings_*.csv @internal_booking_stage; -- Upload CSV files
COPY INTO bookings FROM @internal_booking_stage PATTERN='.*bookings.*\\.csv';
-- External Named Stage (files in YOUR Azure ADLS / S3 / GCS)
CREATE STAGE azure_booking_stage
URL = 'azure://mystorageaccount.blob.core.windows.net/bookings-container/landing/'
CREDENTIALS = (AZURE_SAS_TOKEN = 'sv=2020-08...')
FILE_FORMAT = (TYPE = 'PARQUET');
-- Points to ADLS container where your pipeline drops files
COPY INTO bookings FROM @azure_booking_stage; -- Load from ADLS
Internal vs External Stage comparison:
| | Internal Stage | External Stage |
|---|---|---|
| Data stored | Snowflake's managed storage | Your S3/ADLS/GCS |
| Control | Snowflake manages it | You manage it |
| Egress cost | No egress to load | May incur egress if cross-region |
| Sharing | Only inside Snowflake | External tools can also access it |
| Best for | Small/simple loads, testing | Production ETL where pipeline writes to ADLS |
Q10: [DIRECT] Common Basic Questions
📝 What is Snowflake edition difference (Standard, Enterprise, Business Critical)?
Standard → Core features, Time Travel up to 1 day. Enterprise → Time Travel up to 90 days, multi-cluster warehouses, materialized views, column-level security (masking). Business Critical → HIPAA/PCI compliance, private connectivity (PrivateLink), Tri-Secret Secure.
📝 What cloud providers does Snowflake support?
AWS, Azure, and GCP. When you create a Snowflake account, you choose the cloud + region. Data stored in that cloud's object storage (S3/ADLS/GCS).
📝 What is the difference between a database, schema, and table in Snowflake?
Database → top-level container (like a folder). Schema → sub-container inside a database. Table → data object inside a schema. Full path: database.schema.table.
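A quick illustration of the hierarchy (all names made up):

```sql
CREATE DATABASE analytics;
CREATE SCHEMA analytics.sales;
CREATE TABLE analytics.sales.daily_revenue (day DATE, amount NUMBER);

-- Fully qualified name works from any session context:
SELECT * FROM analytics.sales.daily_revenue;
```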
📝 What is an Account Identifier?
Unique identifier for your Snowflake account — format: orgname-accountname (e.g., amadeus-prod). Used in JDBC URLs and Snowpark connections.
📝 What is SnowSQL?
Snowflake's command-line SQL client. Connect to Snowflake, run queries, load data — all from terminal. Alternative to the Snowflake web UI.
SECTION 2: SNOWFLAKE vs DATABRICKS (Critical Comparison)
Q11: Snowflake vs Databricks — When to use which?
Simple Explanation: This is the #1 comparison question you'll get since you're preparing for both. They solve DIFFERENT problems — they're often used TOGETHER in modern data platforms.
When to use Snowflake:
- BI dashboards with 100+ concurrent users
- SQL analysts without engineering skills
- Data sharing with external partners
- Mixed JSON + structured data analytics
- When you want zero infrastructure management
When to use Databricks:
- Complex Python ETL (pandas, PySpark)
- Machine learning pipeline + MLflow
- Streaming data from Kafka
- You need open-source portability
- Large-scale data transformation (not just querying)
Real-world at Amadeus: "Databricks handles the heavy ingestion and transformation (Oracle CDC → Kafka → Bronze → Silver layers). The clean Silver data is then shared with Snowflake for BI analysts to query with SQL. Both tools, one pipeline."
Interview tip: "Snowflake and Databricks are complementary, not competing. I'd use Databricks for the data engineering heavy lifting and Snowflake for SQL-heavy analytics and partner data sharing. Many enterprise platforms today use both."
SECTION 3: SCENARIO-BASED QUESTIONS
Q12: Scenario — Design a Snowflake architecture for Amadeus booking analytics
Context: 10 billion bookings/year, 200+ airline partners, 500 BI analysts, GDPR compliance, real-time pricing queries, historical reporting up to 5 years.
Answer:
┌─────────────────────────────────────────────────────────────────┐
│ AMADEUS SNOWFLAKE ANALYTICS PLATFORM                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ INGESTION:                                                      │
│   Databricks Silver tables → Snowflake via Snowpipe Streaming   │
│   Partner files (CSV/JSON) → External Stage (ADLS) → COPY INTO  │
│   Real-time prices (Kafka) → Snowpipe (auto-ingest)             │
│                                                                 │
│ DATABASES:                                                      │
│   RAW_DB       → Bronze (raw data, 90-day time travel)          │
│   ANALYTICS_DB → Silver/Gold (clean data, 30-day time travel)   │
│   SHARE_DB     → Partner-specific views (data sharing)          │
│                                                                 │
│ VIRTUAL WAREHOUSES (workload isolation):                        │
│   ETL_WH     → XL, loading & transformation (overnight)         │
│   BI_WH      → M, multi-cluster (1-10), BI analysts (9-6 PM)    │
│   PRICING_WH → L, real-time fare queries (always on)            │
│   DS_WH      → XL, data science queries (on demand)             │
│                                                                 │
│ CLUSTERING KEYS:                                                │
│   bookings_fact: CLUSTER BY (booking_date, airline_code)        │
│   passengers:    CLUSTER BY (passenger_id)                      │
│                                                                 │
│ SECURITY:                                                       │
│   Role hierarchy: ACCOUNTADMIN > SYSADMIN > team roles          │
│   Column masking: email, phone, passport_no (PII/GDPR)          │
│   Row access policies: each airline sees only their bookings    │
│                                                                 │
│ SHARING:                                                        │
│   Secure Share per airline → each airline's secure portal       │
│   Data Clean Room: cross-airline analysis without raw PII       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
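The security block above could be sketched like this; role, table, and column names are assumptions for illustration:

```sql
-- Column masking: non-privileged roles see masked emails (PII/GDPR)
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '*** MASKED ***' END;

ALTER TABLE passengers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row access: each airline role sees only its own bookings
CREATE ROW ACCESS POLICY airline_rows AS (airline_code VARCHAR) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'SYSADMIN' OR airline_code = CURRENT_ROLE();

ALTER TABLE bookings_fact ADD ROW ACCESS POLICY airline_rows ON (airline_code);
```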
Q13: Scenario — Why is a specific query slow? How to diagnose?
-- Step 1: Find the slow query in query history
SELECT query_id, query_text, total_elapsed_time/1000 as seconds,
bytes_scanned/1024/1024/1024 as gb_scanned,
partitions_scanned, partitions_total,
-- Low ratio = bad pruning (scanning too many partitions)
ROUND(partitions_scanned / NULLIF(partitions_total, 0) * 100, 1) as pct_scanned
-- NULLIF avoids divide-by-zero for metadata-only queries
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE user_name = 'ANALYST_USER'
ORDER BY total_elapsed_time DESC
LIMIT 10;
-- Step 2: Check if clustering is helping
SELECT SYSTEM$CLUSTERING_INFORMATION('bookings', '(booking_date)');
-- Look at: average_depth (>6 = needs reclustering)
-- average_overlaps (>5 = partitions heavily mixed)
-- Step 3: Check execution details
SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('query-id-here'));
-- Shows each operator: TableScan, Join, Aggregate
-- Find which step takes the most time
-- Step 4: Check whether caches helped
SELECT bytes_scanned, percentage_scanned_from_cache
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_id = 'your-query-id';
-- bytes_scanned = 0 with near-instant elapsed time → served from the result cache (free!)
-- percentage_scanned_from_cache → share read from the warehouse's local disk cache