Day 2: Pipelines & Performance — Quick Recall Guide
🧠 MASTER MEMORY MAP — Day 2
SECTION 1: DATA LOADING
⚡ MUST KNOW DIRECT QUESTIONS
**What is COPY INTO?**
Snowflake's bulk loading command — reads files from a stage and loads them into a table in parallel. Much faster than INSERT for large volumes. Idempotent (tracks loaded files, safe to re-run).
**What is a stage?**
A pointer to a file location (think of a loading dock). Internal stage = Snowflake's storage. External stage = your ADLS/S3/GCS. Files land in the stage; COPY INTO loads them into tables.
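A minimal sketch tying the two together (the stage and table names are hypothetical, and the file format is an assumption):

```sql
-- Hypothetical names: my_stage, raw_orders
CREATE STAGE my_stage;            -- internal stage (Snowflake-managed storage)

-- After PUT-ing files to @my_stage (e.g. via SnowSQL), load them in parallel:
COPY INTO raw_orders
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```

Re-running the same COPY INTO is safe — already-loaded files are skipped via load metadata.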
**What is Snowpipe?**
Auto-ingest loading — monitors a stage and automatically runs COPY INTO when new files arrive. Uses cloud event notifications (Azure Event Grid / AWS SQS). Latency: ~30 seconds to 1 minute.
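A Snowpipe definition is just a COPY INTO wrapped in a pipe (names here are hypothetical; the cloud-side event notification setup is omitted):

```sql
-- Hypothetical names: my_pipe, landing_stage, raw_events
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE              -- fire on file-arrival event notifications
AS
  COPY INTO raw_events
  FROM @landing_stage
  FILE_FORMAT = (TYPE = 'JSON');
```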
**What is Snowpipe Streaming?**
Row-level push API (Python/Java SDK) for very low-latency ingestion. Rows are pushed directly into Snowflake tables. Used with Kafka Connect, Flink. Latency: sub-second.
| | COPY INTO | Snowpipe | Snowpipe Streaming |
|---|---|---|---|
| Trigger | Manual/Task | File event | API push |
| Latency | Schedule interval | ~1 min | Sub-second |
| Cost | Cheapest | Mid | Highest |
| Use for | Batch ETL (hourly, daily) | Near-real-time files | Real-time rows (Kafka) |
**Is it safe to re-run the same COPY INTO?**
Yes! It's idempotent — Snowflake tracks which files were already loaded in its metadata. Re-running the same COPY INTO skips already-loaded files. To force a reload: FORCE = TRUE.
**What happens if a file contains bad rows?**
Depends on the ON_ERROR setting:
- ABORT_STATEMENT (default) — stops everything on the first error
- CONTINUE — skips bad rows, keeps loading the rest
- SKIP_FILE — skips the entire file on any error
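Both options in a sketch (stage and table names are hypothetical):

```sql
-- Skip bad rows but keep loading the rest of each file
COPY INTO raw_orders FROM @my_stage ON_ERROR = 'CONTINUE';

-- Force reload of files Snowflake already recorded as loaded
COPY INTO raw_orders FROM @my_stage FORCE = TRUE;
```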
SECTION 2: STREAMS
🧠 Memory Map: Streams
⚡ MUST KNOW DIRECT QUESTIONS
**What is a Stream?**
A Snowflake object that tracks row-level changes (INSERT/UPDATE/DELETE) on a table. Returns only the changes since the last consumption. Used for incremental ELT pipelines.
**What metadata columns does a Stream expose?**
METADATA$ACTION (INSERT/DELETE), METADATA$ISUPDATE (TRUE if part of an UPDATE), METADATA$ROW_ID (unique row identifier).
**How does a Stream record an UPDATE?**
As TWO rows: one DELETE (the old value, ISUPDATE=TRUE) and one INSERT (the new value, ISUPDATE=TRUE). To get the new value: filter METADATA$ACTION='INSERT' AND METADATA$ISUPDATE=TRUE.
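That filter in a sketch (stream and table names are hypothetical):

```sql
-- Hypothetical: orders_stream tracks changes on table orders
CREATE STREAM orders_stream ON TABLE orders;

-- New values of updated rows only
SELECT *
FROM orders_stream
WHERE METADATA$ACTION = 'INSERT'
  AND METADATA$ISUPDATE = TRUE;
```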
**What is an append-only Stream?**
Tracks only INSERTs, not UPDATEs or DELETEs. Lighter weight (less metadata overhead). Use when your source table is append-only (like Bronze raw landing tables).
**When is a Stream consumed (offset advanced)?**
After a successful DML transaction that reads the stream (usually inside a Task). The consumed rows are no longer in the stream — new changes accumulate from that point forward.
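A common consumption pattern is a MERGE that reads the stream; once the transaction commits, the offset advances. A simplified sketch (table, stream, and column names are hypothetical; the update-old-value DELETE rows are ignored here):

```sql
-- Hypothetical names: dim_orders, orders_stream
MERGE INTO dim_orders d
USING orders_stream s
  ON d.order_id = s.order_id
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' AND s.METADATA$ISUPDATE
  THEN UPDATE SET d.amount = s.amount          -- new value of an updated row
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
  THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);
-- After this transaction commits, orders_stream is empty until new changes arrive
```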
SECTION 3: TASKS
🧠 Memory Map: Task DAG
⚡ MUST KNOW DIRECT QUESTIONS
**What is a Task?**
A scheduled Snowflake object that runs a SQL statement (or stored procedure). Like a cron job inside Snowflake. Tasks can be chained into DAGs for multi-step pipelines.
**How do you chain Tasks into a DAG?**
Use AFTER parent_task_name when creating child tasks. All tasks are created suspended — resume the children (or call SYSTEM$TASK_DEPENDENTS_ENABLE on the root), then resume the root. Only the root is scheduled; children fire automatically after the parent succeeds.
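A two-step DAG sketch (task, warehouse, stage, and table names are all hypothetical):

```sql
-- Root: scheduled load
CREATE TASK load_task
  WAREHOUSE = etl_wh
  SCHEDULE = '60 MINUTE'
AS
  COPY INTO raw_orders FROM @my_stage;

-- Child: runs only after the parent succeeds
CREATE TASK transform_task
  WAREHOUSE = etl_wh
  AFTER load_task
AS
  INSERT INTO clean_orders SELECT * FROM raw_orders;

-- Tasks are created suspended: resume children first, then the root
ALTER TASK transform_task RESUME;
ALTER TASK load_task RESUME;
```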
**Warehouse vs serverless tasks?**
- Warehouse task: uses your virtual warehouse, same credit rate as queries
- Serverless task: Snowflake manages compute, ~10% cheaper per compute-hour, no warehouse needed
- Use serverless for lightweight, frequent tasks
**How do you check whether a Task ran?**
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'task_name')) — shows status (SUCCEEDED/FAILED), start/end times, and error messages.
SECTION 4: DYNAMIC TABLES
🧠 Memory Map: Dynamic Tables
⚡ MUST KNOW DIRECT QUESTIONS
**What is a Dynamic Table?**
A Snowflake table defined by a SELECT query with a TARGET_LAG. Snowflake automatically refreshes it when source data changes, usually incrementally. Replaces most Streams + Tasks patterns.
**What is TARGET_LAG?**
How stale the Dynamic Table is allowed to be. '5 minutes' means Snowflake refreshes within 5 minutes of source changes. Minimum = 1 minute.
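The whole Streams + Tasks pattern collapses into one declarative statement (table, warehouse, and column names are hypothetical):

```sql
-- Hypothetical dynamic table over raw_orders
CREATE DYNAMIC TABLE daily_revenue
  TARGET_LAG = '5 minutes'        -- allowed staleness
  WAREHOUSE = etl_wh
AS
  SELECT order_date, SUM(amount) AS revenue
  FROM raw_orders
  GROUP BY order_date;
```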
**Dynamic Tables vs Streams + Tasks?**
- Dynamic Tables: most ELT transformations — simpler, declarative, auto-managed
- Streams + Tasks: SCD Type 2, latency under 1 minute, complex stored procedures, external functions
**INCREMENTAL vs FULL refresh?**
- INCREMENTAL (preferred): Snowflake processes only changed rows → faster, cheaper
- FULL: reprocesses the entire source table → slower, more expensive, fallback only
**How does a Dynamic Table apply source changes?**
Automatically — you don't write MERGE logic. Snowflake detects which rows changed and updates the Dynamic Table accordingly. Much simpler than writing a Stream + MERGE task.
SECTION 5: PERFORMANCE
🧠 Memory Map: Performance Tuning
⚡ MUST KNOW DIRECT QUESTIONS
**What is the Query Profile?**
Visual execution tree in the Snowflake UI showing how a query ran — which operators executed, how long each took, bytes scanned, spill to disk. Used to identify bottlenecks.
**What does "spilling to disk" mean?**
The virtual warehouse ran out of memory → intermediate data was written to local SSD (or, worse, remote storage). Much slower than in-memory. Fix: scale UP to a larger warehouse, or reduce the data the query processes.
**What is a Materialized View?**
A pre-computed table based on a SELECT query. Snowflake auto-refreshes it when source data changes. Best for: heavy aggregations queried repeatedly by many users.
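A sketch of pre-computing a hot aggregation (the names are hypothetical; Snowflake materialized views are limited to single-table queries):

```sql
CREATE MATERIALIZED VIEW mv_daily_revenue AS
  SELECT order_date, SUM(amount) AS revenue
  FROM raw_orders
  GROUP BY order_date;
```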
**What is the Search Optimization Service?**
A feature that builds an extra index-like structure for point-lookup queries (WHERE id = 'X', WHERE email = 'user@co.com'). Great for equality searches on high-cardinality columns.
**Clustering key vs Search Optimization?**
- Clustering key → range queries (WHERE date BETWEEN ... AND ...), sorted scans
- Search Optimization → exact lookups (WHERE id = 'X'), unordered high-cardinality columns
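Both are one-line ALTERs (table and column names are hypothetical):

```sql
ALTER TABLE orders CLUSTER BY (order_date);    -- helps range scans / pruning
ALTER TABLE orders ADD SEARCH OPTIMIZATION;    -- helps point lookups (=)

-- Optionally scope search optimization to specific columns:
ALTER TABLE orders ADD SEARCH OPTIMIZATION ON EQUALITY(customer_email);
```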
🧠 FINAL REVISION — Day 2 Summary Card
┌──────────────────────────────────────────────────────────────────┐
│ DAY 2: PIPELINES & PERFORMANCE │
├──────────────────────────────────────────────────────────────────┤
│ │
│ LOADING = "CSP": COPY INTO → Snowpipe → Snowpipe Streaming │
│ COPY INTO: batch, idempotent, cheapest │
│ Snowpipe: auto on file arrival, ~1 min latency │
│ Snowpipe Streaming: push API, sub-second, Kafka │
│ │
│ STREAMS: track INSERT/UPDATE/DELETE on tables │
│ UPDATE = TWO rows: DELETE(old) + INSERT(new) │
│ Filter new value: METADATA$ACTION='INSERT' AND ISUPDATE=TRUE │
│ Types: Standard (all) / Append-only (insert only) │
│ │
│ TASKS: scheduled SQL + AFTER chain = pipeline DAG │
│     Resume children, then ROOT (TASK_DEPENDENTS_ENABLE)        │
│ Serverless tasks = 10% cheaper, no warehouse │
│ │
│ DYNAMIC TABLES: declarative, replaces Streams+Tasks (90%) │
│ TARGET_LAG = how stale is OK (min 1 minute) │
│ INCREMENTAL refresh = only processes changes (preferred) │
│ ⚠️ Cannot do SCD Type 2 → use Streams+Tasks for that │
│ │
│ SNOWPARK: Python/Java/Scala DataFrame API in Snowflake │
│ Pushes computation to Snowflake — no data leaves │
│ UDFs: Python functions run inside Snowflake │
│ 2026: Intelligent Hybrid Execution (auto local vs cloud) │
│ │
│ PERFORMANCE = "CQMWS": │
│ Clustering → fix partition pruning (range queries) │
│ Query Profile → find bottleneck (spill? pruning? concurrency?) │
│ Materialized Views → pre-compute aggregations │
│ Warehouse → scale UP (slow) or OUT (concurrent users) │
│ Search Optimization → fix point lookups (= equality) │
│ │
│ TOP 5 THINGS TO SAY IN INTERVIEW: │
│ 1. "Snowpipe for near-real-time, COPY INTO for batch" │
│ 2. "Streams track CDC: UPDATE = DELETE+INSERT pair" │
│ 3. "Dynamic Tables replace most Streams+Tasks — declarative" │
│ 4. "Snowpark: Python that runs INSIDE Snowflake compute" │
│ 5. "CQMWS checklist: Clustering, Query Profile, MV, WH, Search"│
│ │
└──────────────────────────────────────────────────────────────────┘