❄️ Snowflake · Section 3 of 7

Day 1: Snowflake Architecture & Core — Quick Recall Guide
🗺️ Memory Map
How to use this file:
  • ⚡ = Must remember (95% chance of being asked)
  • 🔑 = Key concept (core understanding needed)
  • ⚠️ = Common trap (interviewers love to test this)
  • 🧠 = Memory Map (mnemonic/acronym — memorize this!)
  • 📝 = One-liner (flash-card style — cover answer, test yourself)
Reading strategy: Memory Maps FIRST → Direct Questions → Mid-Level

🧠 MASTER MEMORY MAP — Day 1

🧠 SNOWFLAKE ARCHITECTURE = "3 layers — CSC" (top to bottom)
  C → Cloud Services (the BRAIN — always on, free up to ~10% of daily compute)
  C → Compute / Virtual Warehouses (the MUSCLES — pay when running)
  S → Storage (the FILING CABINET — always on, pay per TB)
Remember: "BMS" = Brain, Muscles, Storage (top → bottom)
🧠 VIRTUAL WAREHOUSE = "VW" (like the car brand — sizes matter!)
  XS → S → M → L → XL → 2XL → 3XL → 4XL
  Each step UP = 2x credits, ~2x faster queries
  Each step OUT (multi-cluster) = handle more CONCURRENT users
🧠 CACHING = "RLS" (Result, Local-disk, Storage)
  R → Result cache (24h, free, exact same query)
  L → Local disk cache (SSD on warehouse, cleared on suspend)
  S → Storage (remote, always available, slowest)
Remember: "RLS" = Result → Local → Storage (fast → slow)
🧠 MICRO-PARTITIONS = "Smart automatic drawers"
  50-500 MB each (uncompressed)
  Columnar format
  Automatic (NOT manual like Hive partitions)
  Metadata: min/max per column → pruning!
🧠 TIME TRAVEL + FAIL-SAFE = "TF-97"
  T → Time Travel (YOU access: 0-90 days, default 1 day)
  F → Fail-safe (SNOWFLAKE access: 7 days after TT ends)
  97 = 90 days TT + 7 days FS = max 97 days total protection

SECTION 1: ARCHITECTURE

🧠 Memory Map: 3 Layers

LAYER 3 (TOP) = CLOUD SERVICES — "The Brain"
✓ Query parsing, optimization, execution planning
✓ Metadata management (table schemas, clustering info)
✓ Authentication, security (RBAC)
✓ Result Cache lives here
✓ Always ON, ~10% of Snowflake cost, no warehouse needed
LAYER 2 (MIDDLE) = COMPUTE — "The Muscles"
✓ Virtual Warehouses — your clusters of VMs
✓ Execute SQL queries, load data
✓ Auto-suspend (stop paying when idle)
✓ Auto-resume (start on next query)
✓ Multiple warehouses share the SAME storage
LAYER 1 (BOTTOM) = STORAGE — "The Filing Cabinet"
✓ Micro-partitions (50-500 MB, columnar, compressed)
✓ Stores all table data + time travel versions
✓ Always ON (data always accessible)
✓ AWS S3 / Azure ADLS Gen2 / GCS underneath
✓ Charged per TB/month
KEY BENEFIT: Scale COMPUTE independently from STORAGE!
10 warehouses all read the same storage simultaneously → NO conflict
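As a sketch in SQL (warehouse and table names here, etl_wh, reporting_wh, sales, are hypothetical):

```sql
-- Two independently sized warehouses; both can query the same table
-- at the same time, because storage is shared rather than owned by either.
CREATE WAREHOUSE IF NOT EXISTS etl_wh       WAREHOUSE_SIZE = 'LARGE';
CREATE WAREHOUSE IF NOT EXISTS reporting_wh WAREHOUSE_SIZE = 'SMALL';

USE WAREHOUSE reporting_wh;
-- Reads the same micro-partitions as any ETL job running on etl_wh.
SELECT COUNT(*) FROM sales;
```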

⚡ MUST KNOW DIRECT QUESTIONS

Q1. What is Snowflake?

A fully-managed cloud data warehouse with complete separation of storage and compute. Runs on AWS, Azure, or GCP. Uses SQL for all operations.

Q2. What are the 3 layers of Snowflake?

  1. Cloud Services (brain — query parsing, auth, metadata, result cache)
  2. Compute (virtual warehouses — run queries, auto-suspend/resume)
  3. Storage (micro-partitioned columnar data on S3/ADLS/GCS)

Q3. What makes Snowflake different from traditional databases?

Separation of compute and storage. Multiple virtual warehouses can query the same data at the same time without competing. Scale each independently. Pay only for compute when running.
Q4. What are micro-partitions?

Automatically created columnar storage units, each holding 50-500 MB of uncompressed data (stored compressed). Snowflake stores min/max metadata per column per partition — used for partition pruning to skip irrelevant data.

⚠️ Q5. Are Snowflake micro-partitions the same as Hive partitions?

NO! Hive partitions = you manually define them, separate folders per value, you manage them. Micro-partitions = Snowflake creates automatically, fine-grained, stores min/max metadata, you never touch them. Much more efficient.

Q6. What is a clustering key?

An optional column (or expression) that tells Snowflake how to organize micro-partitions. When data is clustered, similar values are in the same partition → better pruning → faster queries.
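A minimal sketch, assuming a hypothetical bookings table:

```sql
-- Co-locate rows with similar booking_date values in the same
-- micro-partitions, improving pruning on date filters.
ALTER TABLE bookings CLUSTER BY (booking_date);

-- A clustering key can also be an expression:
ALTER TABLE bookings CLUSTER BY (TO_DATE(event_ts));
```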

Q7. What is partition pruning?

Snowflake skips micro-partitions that cannot contain matching rows based on their min/max metadata. Example: WHERE order_date >= '2026-03-01' → skip every partition whose max order_date is before March 2026.

Q8. What is a Virtual Warehouse?

A cluster of compute VMs that execute queries. You size it (XS to 4XL), it auto-starts when needed and auto-suspends when idle. Pay only when running.
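An illustrative definition (the name analytics_wh and the settings are assumptions, not a prescription):

```sql
-- Medium warehouse that suspends after 60s of inactivity
-- and wakes automatically on the next query.
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 60      -- seconds idle before suspending (stop billing)
  AUTO_RESUME    = TRUE;   -- start again on the next incoming query
```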

Q9. What is a multi-cluster warehouse?

A warehouse with MIN and MAX cluster counts. When many users query simultaneously, extra clusters are added automatically. When load drops, extra clusters removed. Solves concurrency (not speed).
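An illustrative definition (name and counts are assumptions; multi-cluster warehouses require Enterprise edition or above):

```sql
-- One cluster at quiet times, up to four when many users queue.
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'STANDARD';  -- favor starting clusters over queuing
```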

⚠️ Q10. Scale UP vs Scale OUT — what's the difference?

  • Scale UP (bigger size: M→L→XL) = fix SLOW queries (one complex query needs more power)
  • Scale OUT (multi-cluster: 1→3→5 clusters) = fix QUEUE (many users waiting, not one slow query)

🔑 MID-LEVEL QUESTIONS

Q11. How does Snowflake read a query? (End-to-end flow)

  1. SQL arrives → Cloud Services layer parses + compiles it
  2. Optimizer checks Result Cache → if hit, return immediately (free!)
  3. Optimizer builds physical plan (which micro-partitions to read, which to skip)
  4. Virtual warehouse is woken up (auto-resume if needed)
  5. Warehouse checks Local Disk Cache (SSD) for needed micro-partitions
  6. Missing micro-partitions fetched from Storage (S3/ADLS)
  7. Query executes, result returned to user

Q12. What happens when a warehouse is auto-suspended?

The warehouse nodes (VMs) are stopped. The Local Disk Cache (SSD) is CLEARED. The next query triggers auto-resume (~5-15 seconds startup). Micro-partitions still in Storage — no data loss. Only the cache is lost.

Q13. How do you decide virtual warehouse size?

Start with Medium for most workloads. If query is slow, scale UP (L, XL). If many users are queuing, add clusters (multi-cluster). Use Query Profile to find if bottleneck is compute (scale up) or concurrency (scale out).

Q14. Explain clustering depth and overlap

  • Clustering depth = how many micro-partitions are stacked containing the same key value (lower = better, ideal = 1)
  • Clustering overlap = how many partitions contain overlapping value ranges
  • Check with: SYSTEM$CLUSTERING_INFORMATION('table', '(col)')
  • High depth/overlap = poor clustering → queries scan too many partitions
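The check above might look like this, assuming a hypothetical bookings table clustered on booking_date:

```sql
-- Returns a JSON summary including average clustering depth and
-- overlap for the listed column(s).
SELECT SYSTEM$CLUSTERING_INFORMATION('bookings', '(booking_date)');
```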

SECTION 2: CACHING

🧠 Memory Map: 3 Cache Levels

SPEED: Result Cache >>>>>> Local Disk Cache >>> Storage
(milliseconds) (seconds) (minutes for TB)
RESULT CACHE (Level 1 — Cloud Services Layer)
Duration: 24 hours
Cleared when: underlying data changes OR 24h pass
Cost: FREE (no warehouse needed!)
⚠️TRAP: Exact same SQL required. One extra space = cache miss!
⚠️TRAP: If table has any INSERT after the query → cache invalidated!
LOCAL DISK CACHE (Level 2 — Virtual Warehouse SSD)
Duration: As long as warehouse is RUNNING
Cleared when: warehouse SUSPENDS (⚠️ important!)
Cost: Free (warehouse is already running)
⚠️TRAP: Keep warehouse running longer (higher AUTO_SUSPEND)
for frequently queried "hot" tables
STORAGE (Level 3 — S3/ADLS/GCS)
Duration: Always there
Cost: Storage cost per TB/month
Speed: Slowest, but micro-partition pruning reduces data read

⚡ MUST KNOW DIRECT QUESTIONS

Q15. What are the 3 caching levels in Snowflake?

  1. Result Cache (Cloud Services — 24h, free, same query returns instantly)
  2. Local Disk Cache (warehouse SSD — cleared on suspend)
  3. Storage (S3/ADLS — always available, slowest)

Q16. When is Result Cache used?

When the exact same SQL is run again within 24 hours AND the underlying data hasn't changed. Returns instantly with ZERO compute credits.

⚠️ Q17. Why might Result Cache NOT be used even for the same query?

  1. The underlying table data changed (INSERT/UPDATE/DELETE since last run)
  2. The SQL text is different (even one space, different case, different parameter)
  3. 24 hours have passed
  4. Non-deterministic functions used (CURRENT_TIMESTAMP, RANDOM())
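To illustrate reasons 2 and 4, assuming a hypothetical sales table:

```sql
-- These two statements are NOT result-cache hits for each other:
SELECT COUNT(*) FROM sales;    -- cached for 24h if data is unchanged
SELECT COUNT(*)  FROM sales;   -- one extra space → different text → miss

-- Never served from the result cache (non-deterministic function):
SELECT CURRENT_TIMESTAMP(), COUNT(*) FROM sales;
```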

Q18. What happens to the cache when a warehouse is suspended?

Local Disk Cache (Level 2) is completely cleared. Next query after resume reads from Storage again. Result Cache (Level 1) is unaffected — it's in Cloud Services layer.

SECTION 3: TIME TRAVEL & CLONING

🧠 Memory Map: Time Travel

🧠 TIME TRAVEL = "3 ways to go back in time"
AT (TIMESTAMP => ...) — go to exact time
AT (OFFSET => -3600) — go back N seconds
BEFORE (STATEMENT => 'query-id') — go to just BEFORE that query ran
TIME TRAVEL DURATION
Standard edition: 0-1 days
Enterprise+: 0-90 days (default 1 day)
Temporary/transient tables: 0 or 1 day max (and no Fail-safe)
AFTER TIME TRAVEL: FAIL-SAFE (7 days, Snowflake only, not self-service)
COMMANDS
UNDROP TABLE name; → recover a dropped table (within TT window)
RESTORE → clone from a past snapshot: CREATE TABLE t_restored CLONE t AT (...)
ZERO-COPY CLONE = "Instant copy, no data duplicated"
CREATE TABLE dev_bookings CLONE prod_bookings;
→ Takes < 1 second regardless of size
→ Shares micro-partitions with original
→ Storage used only when clone DIVERGES from original

⚡ MUST KNOW DIRECT QUESTIONS

Q19. What is Time Travel?

Ability to query, clone, or restore Snowflake data as it was at any point within the retention period (0-90 days). Works because Snowflake keeps old micro-partitions.
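Illustrative Time Travel queries against a hypothetical bookings table:

```sql
-- The table as it was one hour ago:
SELECT * FROM bookings AT (OFFSET => -3600);

-- As of an exact timestamp:
SELECT * FROM bookings
  AT (TIMESTAMP => '2026-03-01 09:00:00'::TIMESTAMP_TZ);

-- Recover an accidentally dropped table (within the retention window):
UNDROP TABLE bookings;
```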

Q20. What is the default time travel period?

1 day. Can be set 0-90 days with Enterprise+ edition. Set per-table: ALTER TABLE t SET DATA_RETENTION_TIME_IN_DAYS = 30.

⚠️ Q21. What is the difference between Time Travel and Fail-safe?

  • Time Travel (0-90 days) → YOU can access, query, restore yourself
  • Fail-safe (7 days after TT ends) → ONLY Snowflake support can recover, not self-service, for disasters only

Q22. What is Zero-Copy Cloning?

CREATE TABLE clone CLONE original — instantly creates a copy that shares all micro-partitions with the original. No data is duplicated at creation. Storage only grows when clone data diverges.
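A sketch, with hypothetical table names:

```sql
-- Instant clone of a prod table (shares its micro-partitions):
CREATE TABLE dev_bookings CLONE prod_bookings;

-- Combine with Time Travel: clone the table as of 24 hours ago.
CREATE TABLE bookings_snapshot CLONE prod_bookings AT (OFFSET => -86400);
```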

Q23. What are the use cases for Zero-Copy Clone?

  1. Instant dev/test environments from prod (no wait, no cost)
  2. Point-in-time snapshots for month-end reporting
  3. Safe experimentation — clone before risky transformation, easy rollback
  4. Parallel development — two teams work on same data independently

SECTION 4: SEMI-STRUCTURED DATA

🧠 Memory Map: VARIANT

🧠 VARIANT = "The 'whatever' column — accepts any JSON/Avro/Parquet"
QUERY SYNTAX
  col:key → access JSON key (colon notation)
  col:nested.key → nested access
  col:array[0] → array element (0-indexed)
  col:key::VARCHAR → cast to SQL type (always needed!)
FLATTEN = "Explode arrays into rows" (like explode() in Spark)
  LATERAL FLATTEN(input => col:array_field) f
  f.value → the element, f.index → position, f.key → key (for objects)
2025-2026 NEW: Typed structured columns
  ARRAY(OBJECT(col1 TYPE, col2 TYPE)) → validated schema + better perf
  MAP(VARCHAR, VARCHAR) → key-value typed pairs
  Still use VARIANT for fully flexible/unknown schemas

⚡ MUST KNOW DIRECT QUESTIONS

Q24. What is the VARIANT data type?

Snowflake's data type for semi-structured data (JSON, Avro, Parquet, ORC). Stores any structure in one column — no schema definition needed. Max 128 MB per value (expanded in 2025).

Q25. How do you access a JSON field from a VARIANT column?

Using colon notation + cast: raw_data:passenger.email::VARCHAR — reads the email field inside passenger object and casts to VARCHAR.
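A fuller sketch, assuming a hypothetical raw_bookings table with a raw_data VARIANT column:

```sql
-- Colon notation reaches into the JSON; ::TYPE casts to a SQL type.
SELECT
  raw_data:passenger.email::VARCHAR       AS email,
  raw_data:booking.amount::NUMBER(10, 2)  AS amount,
  raw_data:stops[0]::VARCHAR              AS first_stop
FROM raw_bookings;
```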

Q26What is FLATTEN?
Pro Tip
A Snowflake table function that converts an array or object inside a VARIANT column into multiple rows — one row per array element. Used with LATERAL FLATTEN(input => col:array_field).
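An illustrative FLATTEN query, assuming a hypothetical orders table whose raw_data VARIANT column contains an items array:

```sql
-- One output row per element of the items array.
SELECT
  o.order_id,
  f.index               AS item_position,  -- 0-based position in the array
  f.value:sku::VARCHAR  AS sku,
  f.value:qty::NUMBER   AS qty
FROM orders o,
  LATERAL FLATTEN(input => o.raw_data:items) f;
```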
⚠️ Q27. Why must you cast VARIANT values?

VARIANT returns data in Snowflake's internal format. Without ::VARCHAR / ::NUMBER / ::DATE, comparisons may fail and values display oddly. Always cast: col:key::VARCHAR.

🧠 FINAL REVISION — Day 1 Summary Card

📐 Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│               DAY 1: SNOWFLAKE ARCHITECTURE                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  3 LAYERS = "BMS" (Brain, Muscles, Storage)                      │
│  Cloud Services (brain) → Compute (muscles) → Storage (disk)     │
│  Key insight: storage & compute are SEPARATE → scale each alone  │
│                                                                  │
│  MICRO-PARTITIONS = automatic, columnar, 50-500 MB                │
│  NOT like Hive partitions (those are manual, folder-based)       │
│  Metadata: min/max per column → partition PRUNING                │
│  Clustering key: optional hint to organize partitions            │
│                                                                  │
│  VIRTUAL WAREHOUSE = compute cluster (XS to 4XL)                │
│  Scale UP (M→XL) = fix slow queries                             │
│  Scale OUT (multi-cluster) = fix concurrency (many users)        │
│  Auto-suspend = stop billing when idle                          │
│  ⚠️ Suspend = Local Disk Cache CLEARED                          │
│                                                                  │
│  CACHING = "RLS" (Result, Local-disk, Storage)                   │
│  Result Cache: 24h, free, exact SQL match required              │
│  Local Cache: cleared on suspend!                               │
│                                                                  │
│  TIME TRAVEL: 0-90 days (default 1 day)                         │
│  FAIL-SAFE: 7 days after TT, Snowflake only (not self-service)  │
│  CLONE: instant copy, no data duplicated, shared micro-partitions│
│                                                                  │
│  VARIANT: store any JSON/semi-structured data                    │
│  Access: col:key::VARCHAR (always cast!)                        │
│  FLATTEN: array → rows (like Spark explode())                   │
│                                                                  │
│  SNOWFLAKE vs DATABRICKS:                                        │
│  Snowflake = SQL warehouse, BI analytics, data sharing          │
│  Databricks = Python ETL, ML, streaming, open formats           │
│  Use BOTH: Databricks for ETL → Snowflake for analytics         │
│                                                                  │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                               │
│  1. "Separate compute & storage — multiple WH on same data"     │
│  2. "Micro-partitions auto-created, pruning skips irrelevant"   │
│  3. "Multi-cluster warehouse for 200+ concurrent BI analysts"   │
│  4. "Result cache returns exact same query free in 24h"         │
│  5. "Zero-copy clone: instant dev environment from prod"        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Study tip: Read Memory Maps + Direct Questions first (30 min), then Mid-Level (20 min). Before interview: Summary Card only (10 min).