❄️ Snowflake · Section 5 of 7

Day 2: Pipelines & Performance — Quick Recall Guide

Legend: 🔑 Key concept (must remember) · ⚠️ Common trap · 🧠 Memory Map · 📝 One-liner

🧠 MASTER MEMORY MAP — Day 2

🧠 DATA LOADING = "CSP" (Cheap → Smart → Push)
  C = COPY INTO (batch, cheapest, manual trigger)
  S = Snowpipe (auto-ingest, ~1 min latency, event-driven)
  P = Snowpipe Streaming (push via API, sub-second, Kafka)

CDC INSIDE SNOWFLAKE = "Streams + Tasks" OR "Dynamic Tables"
  Streams = track what CHANGED in a table (INSERT/UPDATE/DELETE)
  Tasks = scheduled SQL (run stream processor every N min)
  Dynamic Tables = newer, declarative, replaces most Streams+Tasks

STREAMS = "A camera watching your table"
  3 metadata columns:
    METADATA$ACTION = 'INSERT' or 'DELETE'
    METADATA$ISUPDATE = TRUE if row is part of an UPDATE
    METADATA$ROW_ID = unique row identifier
  ⚠️ UPDATE = DELETE old + INSERT new (two rows in stream!)

TASKS = "Cron jobs inside Snowflake"
  Root task: has SCHEDULE
  Child tasks: have AFTER parent_task
  Tasks are created SUSPENDED → resume children, then the root
  (or SYSTEM$TASK_DEPENDENTS_ENABLE to enable the whole DAG)
  Serverless tasks: no WAREHOUSE needed (~10% cheaper)

DYNAMIC TABLES = "Declarative ELT — you say WHAT, Snowflake does HOW"
  CREATE DYNAMIC TABLE ... TARGET_LAG = '5 minutes' AS SELECT ...
  Min lag = 1 minute
  INCREMENTAL refresh by default (only processes changes)
  ⚠️ Cannot do SCD Type 2 (use Streams+Tasks for that)

PERFORMANCE = "CQMWS"
  C = Clustering key (organize micro-partitions by filter columns)
  Q = Query Profile (find bottleneck: pruning? memory? compute?)
  M = Materialized Views (pre-compute heavy aggregations)
  W = Warehouse sizing (scale UP for slow, scale OUT for concurrency)
  S = Search Optimization (point lookups: WHERE id = 'X')

SECTION 1: DATA LOADING

⚡ MUST KNOW DIRECT QUESTIONS

Q1: What is COPY INTO?

Snowflake's bulk loading command — reads files from a stage and loads into a table in parallel. Much faster than INSERT for large volumes. Idempotent (tracks loaded files, safe to re-run).
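
A minimal sketch, assuming a hypothetical raw_bookings table and @bookings_stage stage:

```sql
-- Load all matching CSVs from the stage in parallel;
-- files already loaded are skipped automatically (idempotent)
COPY INTO raw_bookings
FROM @bookings_stage/daily/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
PATTERN = '.*\.csv';
```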

Q2: What is a Stage?

A pointer to a file location (loading dock). Internal stage = Snowflake's storage. External stage = your ADLS/S3/GCS. Files land in stage, COPY INTO loads them into tables.
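
A sketch of both stage types (bucket and integration names are hypothetical):

```sql
-- Internal stage: Snowflake-managed storage
CREATE STAGE my_int_stage;

-- External stage: points at your own cloud storage
CREATE STAGE my_ext_stage
  URL = 's3://my-bucket/landing/'
  STORAGE_INTEGRATION = my_s3_int;

-- See which files are sitting in the stage
LIST @my_ext_stage;
```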

Q3: What is Snowpipe?

Auto-ingest loading — monitors a stage and automatically runs COPY INTO when new files arrive. Uses event notifications (Azure Event Grid / AWS SQS). Latency: ~30 seconds to 1 minute.
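
A sketch of the pattern — a pipe is essentially a saved COPY INTO that fires on file-arrival events (pipe, table, and stage names are assumptions):

```sql
-- AUTO_INGEST = TRUE subscribes the pipe to cloud event
-- notifications (AWS SQS / Azure Event Grid) for new files
CREATE PIPE bookings_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO raw_bookings
FROM @bookings_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Check pipe health and pending file count
SELECT SYSTEM$PIPE_STATUS('bookings_pipe');
```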

Q4: What is Snowpipe Streaming?

Row-level push API (Python/Java SDK) for very low latency ingestion. Data pushed directly to Snowflake rows. Used with Kafka Connect, Flink. Latency: sub-second.

Q5: COPY INTO vs Snowpipe vs Snowpipe Streaming — when to use which?

          | COPY INTO                 | Snowpipe              | Snowpipe Streaming
Trigger   | Manual / Task             | File event            | API push
Latency   | Schedule interval         | ~1 min                | Sub-second
Cost      | Cheapest                  | Mid                   | Highest
Use for   | Batch ETL (hourly, daily) | Near-real-time files  | Real-time rows (Kafka)

⚠️ Q6: Is COPY INTO safe to re-run?

Yes! It's idempotent — Snowflake tracks which files were already loaded in its metadata. Re-running the same COPY INTO skips already-loaded files. To force reload: FORCE = TRUE.

Q7: What happens with bad rows in COPY INTO?

Depends on ON_ERROR setting:

  • ABORT_STATEMENT (default) — stops everything on first error
  • CONTINUE — skips bad rows, continues loading rest
  • SKIP_FILE — skips entire file on any error
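
Two sketches using the same hypothetical names as above:

```sql
-- Skip bad rows but keep loading the rest of each file
COPY INTO raw_bookings
FROM @bookings_stage
ON_ERROR = 'CONTINUE';

-- Dry run: report errors without loading anything
COPY INTO raw_bookings
FROM @bookings_stage
VALIDATION_MODE = 'RETURN_ERRORS';
```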

SECTION 2: STREAMS

🧠 Memory Map: Streams

🧠 STREAM = "Change tracking on a table"
Think: a Stream is like a CHANGELOG — every INSERT/UPDATE/DELETE
is recorded with metadata. When you consume it, it advances.
STREAM ON TABLE bookings → captures all changes to bookings
Stream TYPES:
  Standard: INSERT + UPDATE + DELETE (full CDC)
  Append-only: INSERT only (lighter, for append-only tables)
  Insert-only: for external tables (only new files)
METADATA COLUMNS (always present):
  METADATA$ACTION = 'INSERT' or 'DELETE'
  METADATA$ISUPDATE = TRUE if part of an UPDATE operation
  METADATA$ROW_ID = unique Snowflake row identifier
HOW AN UPDATE APPEARS IN A STREAM
  UPDATE row → TWO records in stream:
  Row 1: METADATA$ACTION='DELETE', METADATA$ISUPDATE=TRUE ← old value
  Row 2: METADATA$ACTION='INSERT', METADATA$ISUPDATE=TRUE ← new value
  Filter for new value: WHERE METADATA$ACTION='INSERT' AND METADATA$ISUPDATE=TRUE
  Filter for old value: WHERE METADATA$ACTION='DELETE' AND METADATA$ISUPDATE=TRUE
STREAM "ADVANCES" after consumption:
  Read stream in a Task → Task commits → stream advances
  Stream now shows ONLY newer changes (old consumed ones gone)
  ⚠️ Must consume the stream in a DML transaction to advance it

⚡ MUST KNOW DIRECT QUESTIONS

Q8: What is a Stream?

A Snowflake object that tracks row-level changes (INSERT/UPDATE/DELETE) on a table. Returns only the changes since last consumption. Used for incremental ELT pipelines.

Q9: What are the 3 metadata columns in a Stream?

METADATA$ACTION (INSERT/DELETE), METADATA$ISUPDATE (TRUE if part of UPDATE), METADATA$ROW_ID (unique row identifier).

⚠️ Q10: How does an UPDATE appear in a Standard Stream?

As TWO rows: one DELETE (the old value, ISUPDATE=TRUE) and one INSERT (the new value, ISUPDATE=TRUE). To get current value: filter METADATA$ACTION='INSERT' AND METADATA$ISUPDATE=TRUE.
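
A sketch, assuming a hypothetical bookings table:

```sql
-- Standard stream: tracks INSERT/UPDATE/DELETE on bookings
CREATE STREAM bookings_stream ON TABLE bookings;

-- After UPDATEs, read only the NEW image of each changed row
SELECT *
FROM bookings_stream
WHERE METADATA$ACTION = 'INSERT'
  AND METADATA$ISUPDATE = TRUE;
```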

Q11: What is an Append-Only Stream?

Tracks only INSERTs, not UPDATEs or DELETEs. Lighter weight (less metadata overhead). Use when your source table is append-only (like Bronze raw landing tables).

Q12: When does a Stream "advance"?

After a successful DML transaction that consumes the stream (usually inside a Task). The consumed rows are no longer in the stream — new changes accumulate from that point forward.

SECTION 3: TASKS

🧠 Memory Map: Task DAG

🧠 TASK DAG = "Pipeline of SQL steps, like Databricks Workflows"
ROOT TASK
  Has SCHEDULE (or trigger)
  Is the starting point of the DAG
CHILD TASKS
  Have AFTER parent_task (run when parent succeeds)
  Multiple children allowed (fan-out)
  No schedule (inherit from root)
IMPORTANT RULE
  Tasks are created SUSPENDED → resume children, then the root
  (or SYSTEM$TASK_DEPENDENTS_ENABLE('root_task') to enable the whole DAG)
SERVERLESS TASKS
  No WAREHOUSE clause → Snowflake manages compute
  ~10% cheaper than warehouse-based tasks
  Good for: lightweight processing, simple SQL
TASK HISTORY CHECK
  SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'MY_TASK'));

⚡ MUST KNOW DIRECT QUESTIONS

Q13: What is a Task?

A scheduled Snowflake object that runs a SQL statement (or stored procedure). Like a cron job inside Snowflake. Tasks can be chained into DAGs for multi-step pipelines.

Q14: How do you chain Tasks into a pipeline?

Use AFTER parent_task_name when creating child tasks. Tasks are created suspended: resume the child tasks, then the root (or call SYSTEM$TASK_DEPENDENTS_ENABLE on the root to enable the whole DAG). Once enabled, children start automatically after the parent succeeds.
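
A minimal DAG sketch; etl_wh, staging, bookings_stream, and transform_staging are all hypothetical names:

```sql
-- Root task: runs on a schedule, only when the stream has data
CREATE TASK load_root
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('BOOKINGS_STREAM')
AS
  INSERT INTO staging SELECT * FROM bookings_stream;  -- consuming advances the stream

-- Child task: no schedule, runs after the root succeeds
CREATE TASK transform_child
  WAREHOUSE = etl_wh
  AFTER load_root
AS
  CALL transform_staging();

-- Tasks are created suspended: enable the whole DAG from the root
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('LOAD_ROOT');
```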

⚠️ Q15: What is the difference between warehouse-based and serverless tasks?

  • Warehouse task: uses your virtual warehouse, same credit rate as queries
  • Serverless task: Snowflake manages compute, ~10% cheaper, no warehouse needed
  • Use serverless for lightweight, frequent tasks

Q16: How do you monitor task failures?

SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'task_name')) — shows status (SUCCEEDED/FAILED), start/end time, error messages.

SECTION 4: DYNAMIC TABLES

🧠 Memory Map: Dynamic Tables

🧠 DYNAMIC TABLE = "Self-maintaining ELT table"
One statement replaces: stream + task + merge logic!
CREATE DYNAMIC TABLE silver_bookings
  TARGET_LAG = '5 minutes'   ← How stale is OK?
  WAREHOUSE = ETL_WH
AS SELECT ... FROM raw_bookings;   ← What should it look like?
TARGET_LAG values:
  '5 minutes' → refresh if data is 5+ min stale
  '1 hour' → refresh hourly
  'DOWNSTREAM' → refresh only when a downstream consumer needs it
REFRESH MODES
  INCREMENTAL → processes only changes (most efficient, default when possible)
  FULL → reprocesses everything (fallback for complex queries)
  Check: SHOW DYNAMIC TABLES; (refresh_mode column)
  If FULL → simplify the query → might get INCREMENTAL
CHAIN THEM: DT_Silver reads raw → DT_Gold reads DT_Silver
  Snowflake auto-detects the dependency and refreshes in order!
⚠️ WHEN NOT TO USE:
  × SCD Type 2 (need before/after UPDATE image → use Streams+Tasks)
  × Latency < 1 minute (min lag = 60 seconds)
  × External functions or stored procedures in the logic

⚡ MUST KNOW DIRECT QUESTIONS

Q17: What is a Dynamic Table?

A Snowflake table defined by a SELECT query with a TARGET_LAG. Snowflake automatically refreshes it when source data changes, usually incrementally. Replaces most Streams + Tasks patterns.
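
A minimal sketch, assuming hypothetical raw_bookings and etl_wh objects:

```sql
-- Declarative pipeline step: Snowflake keeps this table
-- within 5 minutes of the source, incrementally when it can
CREATE OR REPLACE DYNAMIC TABLE silver_bookings
  TARGET_LAG = '5 minutes'
  WAREHOUSE = etl_wh
AS
SELECT booking_id, customer_id, amount
FROM raw_bookings
WHERE status = 'CONFIRMED';
```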

Q18: What is TARGET_LAG?

How stale the Dynamic Table is allowed to be. '5 minutes' means Snowflake refreshes within 5 minutes of source changes. Minimum = 1 minute.

Q19: Dynamic Tables vs Streams + Tasks — when to use which?

  • Dynamic Tables: 95% of ELT transformations — simpler, declarative, auto-managed
  • Streams + Tasks: SCD Type 2, latency < 1 min, complex stored procedures, external functions

Q20: What is INCREMENTAL refresh vs FULL refresh?

  • INCREMENTAL (preferred): Snowflake processes only changed rows → faster, cheaper
  • FULL: reprocesses entire source table → slower, more expensive, fallback only

⚠️ Q21: How does a Dynamic Table handle an UPDATE in source?

Automatically — you don't write MERGE logic. Snowflake detects which rows changed and updates the Dynamic Table accordingly. Much simpler than writing a Stream + MERGE task.

SECTION 5: PERFORMANCE

🧠 Memory Map: Performance Tuning

🧠 PERFORMANCE CHECKLIST = "CQMWS"
1. CLUSTERING (C):
  Problem: queries scan many partitions (bad pruning)
  Fix: ALTER TABLE t CLUSTER BY (filter_column)
  Check: SYSTEM$CLUSTERING_INFORMATION → depth > 6 = recluster needed
2. QUERY PROFILE (Q):
  Snowflake UI → Query History → click query → Query Profile
  Look for: "Spilling to disk" → more memory (bigger WH)
  "Partitions scanned >> total" → need clustering
  "Bytes from cache" low → cache cold → don't suspend WH
3. MATERIALIZED VIEWS (M):
  Problem: the same heavy aggregation query runs 1000 times/day
  Fix: CREATE MATERIALIZED VIEW mv AS SELECT ... GROUP BY ...
  Result: queries hit the MV instead of the raw table → much faster
4. WAREHOUSE SIZE (W):
  Scale UP: queries individually slow → M → L → XL
  Scale OUT: many users queuing → add clusters (multi-cluster)
  ⚠️ Spilling to disk = DEFINITELY scale up!
5. SEARCH OPTIMIZATION (S):
  Problem: WHERE id = 'specific_value' on a massive unordered table
  Fix: ALTER TABLE t ADD SEARCH OPTIMIZATION;
  Use for: point lookups, equality on high-cardinality columns
  NOT for: range queries (use clustering instead)

⚡ MUST KNOW DIRECT QUESTIONS

Q22: What is Query Profile?

Visual execution tree in Snowflake UI showing how a query ran — which operators ran, how long each took, bytes scanned, spill to disk. Used to identify bottlenecks.

Q23: What does "spilling to disk" mean?

The virtual warehouse ran out of memory → data was written to SSD. Much slower than in-memory. Fix: scale UP to a larger warehouse size.
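
Two sketches (warehouse names are hypothetical; multi-cluster warehouses require Enterprise edition):

```sql
-- Scale UP: memory-bound queries (spilling) need a bigger size
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale OUT: queued concurrent users need more clusters
ALTER WAREHOUSE bi_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;
```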

Q24: What is a Materialized View?

A pre-computed table based on a SELECT query. Snowflake auto-refreshes it when source data changes. Best for: heavy aggregations queried repeatedly by many users.
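
A sketch with illustrative names; note that Snowflake materialized views are limited to a single source table (no joins):

```sql
-- Pre-compute an aggregation that many dashboards query repeatedly;
-- Snowflake keeps it in sync with the base table automatically
CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT booking_date, SUM(amount) AS revenue
FROM bookings
GROUP BY booking_date;
```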

Q25: What is Search Optimization Service?

A feature that builds an extra index structure for point-lookup queries (WHERE id = 'X', WHERE email = 'user@co.com'). Great for high-cardinality column equality searches.

⚠️ Q26: Clustering key vs Search Optimization — when to use which?

  • Clustering key → range queries (WHERE date BETWEEN ... AND ...), sorted scans
  • Search Optimization → exact lookups (WHERE id = 'X'), unordered high-cardinality columns
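
The two fixes side by side, on a hypothetical bookings table:

```sql
-- Range filters and sorted scans → clustering key
ALTER TABLE bookings CLUSTER BY (booking_date);

-- Exact-match lookups on a high-cardinality column → search optimization
ALTER TABLE bookings ADD SEARCH OPTIMIZATION ON EQUALITY(customer_email);
```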

🧠 FINAL REVISION — Day 2 Summary Card

📐 Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│               DAY 2: PIPELINES & PERFORMANCE                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  LOADING = "CSP": COPY INTO → Snowpipe → Snowpipe Streaming      │
│  COPY INTO: batch, idempotent, cheapest                         │
│  Snowpipe: auto on file arrival, ~1 min latency                 │
│  Snowpipe Streaming: push API, sub-second, Kafka                │
│                                                                  │
│  STREAMS: track INSERT/UPDATE/DELETE on tables                  │
│  UPDATE = TWO rows: DELETE(old) + INSERT(new)                   │
│  Filter new value: METADATA$ACTION='INSERT' AND ISUPDATE=TRUE   │
│  Types: Standard (all) / Append-only (insert only)              │
│                                                                  │
│  TASKS: scheduled SQL + AFTER chain = pipeline DAG              │
│  Enable DAG: resume children then root (DEPENDENTS_ENABLE)      │
│  Serverless tasks = 10% cheaper, no warehouse                  │
│                                                                  │
│  DYNAMIC TABLES: declarative, replaces Streams+Tasks (90%)      │
│  TARGET_LAG = how stale is OK (min 1 minute)                   │
│  INCREMENTAL refresh = only processes changes (preferred)       │
│  ⚠️ Cannot do SCD Type 2 → use Streams+Tasks for that          │
│                                                                  │
│  SNOWPARK: Python/Java/Scala DataFrame API in Snowflake         │
│  Pushes computation to Snowflake — no data leaves               │
│  UDFs: Python functions run inside Snowflake                    │
│  2026: Intelligent Hybrid Execution (auto local vs cloud)       │
│                                                                  │
│  PERFORMANCE = "CQMWS":                                         │
│  Clustering → fix partition pruning (range queries)             │
│  Query Profile → find bottleneck (spill? pruning? concurrency?) │
│  Materialized Views → pre-compute aggregations                  │
│  Warehouse → scale UP (slow) or OUT (concurrent users)         │
│  Search Optimization → fix point lookups (= equality)          │
│                                                                  │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                               │
│  1. "Snowpipe for near-real-time, COPY INTO for batch"          │
│  2. "Streams track CDC: UPDATE = DELETE+INSERT pair"            │
│  3. "Dynamic Tables replace most Streams+Tasks — declarative"   │
│  4. "Snowpark: Python that runs INSIDE Snowflake compute"       │
│  5. "CQMWS checklist: Clustering, Query Profile, MV, WH, Search"│
│                                                                  │
└──────────────────────────────────────────────────────────────────┘