Day 1: Snowflake Architecture & Core — Quick Recall Guide
- ⚡ = Must remember (95% chance of being asked)
- 🔑 = Key concept (core understanding needed)
- ⚠️ = Common trap (interviewers love to test this)
- 🧠 = Memory Map (mnemonic/acronym — memorize this!)
- 📝 = One-liner (flash-card style — cover answer, test yourself)
🧠 MASTER MEMORY MAP — Day 1
SECTION 1: ARCHITECTURE
🧠 Memory Map: 3 Layers
⚡ MUST KNOW DIRECT QUESTIONS
A fully-managed cloud data warehouse with complete separation of storage and compute. Runs on AWS, Azure, or GCP. Uses SQL for all operations.
- Cloud Services (brain — query parsing, auth, metadata, result cache)
- Compute (virtual warehouses — run queries, auto-suspend/resume)
- Storage (micro-partitioned columnar data on S3/ADLS/GCS)
Automatically created columnar storage units, each holding 50-500 MB of uncompressed data (considerably smaller once compressed). Snowflake stores min/max metadata per column per partition — used for partition pruning to skip irrelevant data.
NO! Hive partitions = you manually define them, separate folders per value, you manage them. Micro-partitions = Snowflake creates automatically, fine-grained, stores min/max metadata, you never touch them. Much more efficient.
An optional column (or expression) that tells Snowflake how to organize micro-partitions. When data is clustered, similar values are in the same partition → better pruning → faster queries.
Snowflake skips micro-partitions that cannot contain matching rows based on their min/max metadata. Example: WHERE order_date >= '2026-03-01' → skip every partition whose max(order_date) falls before March 2026.
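A minimal sketch of both ideas together (table and column names are illustrative, not from the source):

```sql
-- Hypothetical table with a clustering key on the date column
CREATE TABLE trips (trip_date DATE, city VARCHAR, fare NUMBER)
  CLUSTER BY (trip_date);

-- Pruning: Snowflake skips micro-partitions whose min/max
-- metadata for trip_date cannot contain March 2026
SELECT city, SUM(fare)
FROM trips
WHERE trip_date >= '2026-03-01' AND trip_date < '2026-04-01'
GROUP BY city;
```

Because the table is clustered on trip_date, rows with similar dates land in the same micro-partitions, so the range filter prunes most of them.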
A cluster of compute VMs that execute queries. You size it (XS to 4XL), it auto-starts when needed and auto-suspends when idle. Pay only when running.
A warehouse with MIN and MAX cluster counts. When many users query simultaneously, extra clusters are added automatically. When load drops, extra clusters removed. Solves concurrency (not speed).
- Scale UP (bigger size: M→L→XL) = fix SLOW queries (one complex query needs more power)
- Scale OUT (multi-cluster: 1→3→5 clusters) = fix QUEUE (many users waiting, not one slow query)
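The two scaling directions show up as different parameters at warehouse creation time. A sketch with illustrative names and settings:

```sql
-- Scale UP: a bigger single cluster for heavy individual queries
CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND   = 60       -- seconds idle before suspending
  AUTO_RESUME    = TRUE;

-- Scale OUT: a multi-cluster warehouse for many concurrent users
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1     -- extra clusters absorb concurrency,
  MAX_CLUSTER_COUNT = 5     -- they do not make one query faster
  SCALING_POLICY    = 'STANDARD';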
🔑 MID-LEVEL QUESTIONS
- SQL arrives → Cloud Services layer parses + compiles it
- Optimizer checks Result Cache → if hit, return immediately (free!)
- Optimizer builds physical plan (which micro-partitions to read, which to skip)
- Virtual warehouse is woken up (auto-resume if needed)
- Warehouse checks Local Disk Cache (SSD) for needed micro-partitions
- Missing micro-partitions fetched from Storage (S3/ADLS)
- Query executes, result returned to user
The warehouse nodes (VMs) are stopped. The Local Disk Cache (SSD) is CLEARED. The next query triggers auto-resume (typically a few seconds of startup). Micro-partitions are still in Storage — no data loss. Only the cache is lost.
Start with Medium for most workloads. If query is slow, scale UP (L, XL). If many users are queuing, add clusters (multi-cluster). Use Query Profile to find if bottleneck is compute (scale up) or concurrency (scale out).
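The two remedies map to two ALTER statements (warehouse name illustrative):

```sql
-- Bottleneck is compute (one slow query): scale UP
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Bottleneck is concurrency (queries queuing): scale OUT
ALTER WAREHOUSE my_wh SET MAX_CLUSTER_COUNT = 8;
```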
- Clustering depth = the average number of overlapping micro-partitions that contain the same key value (lower = better, ideal = 1)
- Clustering overlap = how many partitions contain overlapping value ranges
- Check with SYSTEM$CLUSTERING_INFORMATION('table', '(col)')
- High depth/overlap = poor clustering → queries scan too many partitions
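A sketch of the check on a hypothetical table:

```sql
-- Inspect clustering quality on the trip_date key
SELECT SYSTEM$CLUSTERING_INFORMATION('trips', '(trip_date)');
-- Returns JSON including average_depth, average_overlaps, and
-- a partition_depth_histogram; average_depth creeping up over
-- time is the signal that the table needs reclustering
```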
SECTION 2: CACHING
🧠 Memory Map: 3 Cache Levels
⚡ MUST KNOW DIRECT QUESTIONS
- Result Cache (Cloud Services — 24h, free, same query returns instantly)
- Local Disk Cache (warehouse SSD — cleared on suspend)
- Storage (S3/ADLS — always available, slowest)
When the exact same SQL is run again within 24 hours AND the underlying data hasn't changed. Returns instantly with ZERO compute credits.
- The underlying table data changed (INSERT/UPDATE/DELETE since last run)
- The SQL text is different (even one space, different case, different parameter)
- 24 hours have passed
- Non-deterministic functions used (CURRENT_TIMESTAMP, RANDOM())
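Result-cache behavior can be demonstrated (and disabled for benchmarking) per session; table name illustrative:

```sql
ALTER SESSION SET USE_CACHED_RESULT = TRUE;   -- the default

SELECT COUNT(*) FROM trips;   -- first run: uses warehouse compute
SELECT COUNT(*) FROM trips;   -- identical text, data unchanged:
                              -- served from Result Cache, zero credits

-- Disable when timing queries, so reruns do real work
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```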
Local Disk Cache (Level 2) is completely cleared. Next query after resume reads from Storage again. Result Cache (Level 1) is unaffected — it's in Cloud Services layer.
SECTION 3: TIME TRAVEL & CLONING
🧠 Memory Map: Time Travel
⚡ MUST KNOW DIRECT QUESTIONS
Ability to query, clone, or restore Snowflake data as it was at any point within the retention period (0-90 days). Works because Snowflake keeps old micro-partitions.
1 day. Can be set 0-90 days with Enterprise+ edition. Set per-table: ALTER TABLE t SET DATA_RETENTION_TIME_IN_DAYS = 30.
- Time Travel (0-90 days) → YOU can access, query, restore yourself
- Fail-safe (7 days after TT ends) → ONLY Snowflake support can recover, not self-service, for disasters only
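The main Time Travel forms, sketched with an illustrative table (the query ID is a placeholder):

```sql
-- State as of a relative offset (seconds in the past)
SELECT * FROM trips AT (OFFSET => -60*60);          -- 1 hour ago

-- State as of an absolute timestamp
SELECT * FROM trips
  AT (TIMESTAMP => '2026-03-01 00:00:00'::TIMESTAMP_LTZ);

-- State just before a specific statement ran
SELECT * FROM trips BEFORE (STATEMENT => '<query_id>');

-- Restore a dropped table within the retention period
UNDROP TABLE trips;
```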
CREATE TABLE clone CLONE original — instantly creates a copy that shares all micro-partitions with the original. No data is duplicated at creation. Storage only grows when clone data diverges.
- Instant dev/test environments from prod (no wait, no cost)
- Point-in-time snapshots for month-end reporting
- Safe experimentation — clone before risky transformation, easy rollback
- Parallel development — two teams work on same data independently
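Cloning works at database, schema, and table level, and combines with Time Travel; names are illustrative:

```sql
-- Instant dev environment: zero-copy clone of prod
CREATE DATABASE dev_db CLONE prod_db;

-- Point-in-time snapshot: clone a table as it was 2 hours ago
CREATE TABLE trips_backup CLONE trips
  AT (TIMESTAMP => DATEADD('hour', -2, CURRENT_TIMESTAMP()));
```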
SECTION 4: SEMI-STRUCTURED DATA
🧠 Memory Map: VARIANT
- ARRAY(OBJECT(col1 TYPE, col2 TYPE)) → validated schema + better perf
- MAP(VARCHAR, VARCHAR) → typed key-value pairs
⚡ MUST KNOW DIRECT QUESTIONS
Snowflake's data type for semi-structured data (JSON, Avro, Parquet, ORC). Stores any structure in one column — no schema definition needed. Max 128 MB per value (expanded in 2025).
Using colon notation + cast: raw_data:passenger.email::VARCHAR — reads the email field inside passenger object and casts to VARCHAR.
LATERAL FLATTEN(input => col:array_field) — turns each array element into its own row (like Spark's explode()).
VARIANT returns data in Snowflake's internal format. Without ::VARCHAR / ::NUMBER / ::DATE, comparisons may fail and string values display wrapped in double quotes. Always cast: col:key::VARCHAR.
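Both access patterns in one sketch (table, column, and JSON fields are hypothetical):

```sql
CREATE TABLE raw_events (raw_data VARIANT);

INSERT INTO raw_events
  SELECT PARSE_JSON('{"passenger": {"email": "a@b.com"}, "stops": ["JFK", "LGA"]}');

-- Colon/dot navigation + cast
SELECT raw_data:passenger.email::VARCHAR AS email
FROM raw_events;

-- FLATTEN: one output row per array element
SELECT f.value::VARCHAR AS stop
FROM raw_events,
     LATERAL FLATTEN(input => raw_data:stops) f;
```

Note the INSERT ... SELECT form: PARSE_JSON cannot appear directly in a VALUES clause.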
🧠 FINAL REVISION — Day 1 Summary Card
┌──────────────────────────────────────────────────────────────────┐
│ DAY 1: SNOWFLAKE ARCHITECTURE │
├──────────────────────────────────────────────────────────────────┤
│ │
│ 3 LAYERS = "BMS" (Brain, Muscles, Storage) │
│ Cloud Services (brain) → Compute (muscles) → Storage (disk) │
│ Key insight: storage & compute are SEPARATE → scale each alone │
│ │
│ MICRO-PARTITIONS = automatic, columnar, 50-500 MB │
│ NOT like Hive partitions (those are manual, folder-based) │
│ Metadata: min/max per column → partition PRUNING │
│ Clustering key: optional hint to organize partitions │
│ │
│ VIRTUAL WAREHOUSE = compute cluster (XS to 4XL) │
│ Scale UP (M→XL) = fix slow queries │
│ Scale OUT (multi-cluster) = fix concurrency (many users) │
│ Auto-suspend = stop billing when idle │
│ ⚠️ Suspend = Local Disk Cache CLEARED │
│ │
│ CACHING = "RLS" (Result, Local-disk, Storage) │
│ Result Cache: 24h, free, exact SQL match required │
│ Local Cache: cleared on suspend! │
│ │
│ TIME TRAVEL: 0-90 days (default 1 day) │
│ FAIL-SAFE: 7 days after TT, Snowflake only (not self-service) │
│ CLONE: instant copy, no data duplicated, shared micro-partitions│
│ │
│ VARIANT: store any JSON/semi-structured data │
│ Access: col:key::VARCHAR (always cast!) │
│ FLATTEN: array → rows (like Spark explode()) │
│ │
│ SNOWFLAKE vs DATABRICKS: │
│ Snowflake = SQL warehouse, BI analytics, data sharing │
│ Databricks = Python ETL, ML, streaming, open formats │
│ Use BOTH: Databricks for ETL → Snowflake for analytics │
│ │
│ TOP 5 THINGS TO SAY IN INTERVIEW: │
│ 1. "Separate compute & storage — multiple WH on same data" │
│ 2. "Micro-partitions auto-created, pruning skips irrelevant" │
│ 3. "Multi-cluster warehouse for 200+ concurrent BI analysts" │
│ 4. "Result cache returns exact same query free in 24h" │
│ 5. "Zero-copy clone: instant dev environment from prod" │
│ │
└──────────────────────────────────────────────────────────────────┘