Day 2: Data Loading, Pipelines & Performance
SECTION 1: DATA LOADING (1 hour)
Q1: What is COPY INTO? How is it different from INSERT?
Simple Explanation: COPY INTO is Snowflake's bulk loading command — it reads files from a stage (ADLS, S3, internal) and loads them into a table. It is MUCH faster than INSERT for large volumes because it runs in parallel across the warehouse.
Why do we need it? INSERT processes rows one by one — fine for 100 rows, terrible for 100 million rows. COPY INTO parallelizes across the warehouse and processes files in parallel batches.
Real-world analogy: INSERT = one person typing data manually. COPY INTO = 100 workers simultaneously moving pallets from a loading dock into the warehouse.
-- Basic COPY INTO from external stage (Azure ADLS)
COPY INTO raw_bookings
FROM @azure_booking_stage/2026/03/25/ -- Read from this ADLS path
FILE_FORMAT = (
TYPE = 'PARQUET' -- File type
)
ON_ERROR = 'CONTINUE'; -- Skip bad rows, continue loading
-- Options: ABORT_STATEMENT (default), CONTINUE, SKIP_FILE
-- COPY INTO with transformation (load + clean in one step)
COPY INTO silver_bookings (booking_id, booking_date, amount, airline_code)
FROM (
SELECT
$1:booking_id::VARCHAR, -- Cast from VARIANT
$1:booking_date::DATE,
$1:fare.amount::NUMBER(10,2),
UPPER($1:airline_code::VARCHAR) -- Transform during load
FROM @azure_booking_stage/bookings.json
)
FILE_FORMAT = (TYPE = 'JSON');
-- COPY INTO with error handling — see what failed
COPY INTO raw_bookings FROM @stage/
ON_ERROR = 'CONTINUE';
-- After loading, check what errors occurred:
SELECT * FROM TABLE(VALIDATE(raw_bookings, JOB_ID => '_last'));
-- Shows exactly which files + rows failed and why
-- COPY INTO is idempotent (safe to re-run!)
-- Snowflake tracks which files were already loaded (load metadata, retained 64 days)
-- Re-running the same COPY INTO skips already-loaded files automatically
-- To force reload: COPY INTO ... FORCE = TRUE;
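The skip-already-loaded behavior can be sketched as a tiny Python model (illustrative only — `copy_into` and its in-memory set are stand-ins for Snowflake's internal load metadata, not real internals):

```python
# Toy model of COPY INTO's idempotency (NOT Snowflake internals):
# a loader that remembers which files it has ingested and skips repeats,
# unless force=True — mirroring FORCE = TRUE.

def copy_into(load_history: set, staged_files: list, force: bool = False) -> list:
    """Return the files that would actually be loaded this run."""
    to_load = [f for f in staged_files if force or f not in load_history]
    load_history.update(to_load)
    return to_load

history = set()
first = copy_into(history, ["2026/03/25/a.parquet", "2026/03/25/b.parquet"])
rerun = copy_into(history, ["2026/03/25/a.parquet", "2026/03/25/b.parquet"])
forced = copy_into(history, ["2026/03/25/a.parquet"], force=True)
# first loads both files, rerun loads nothing, forced re-loads a.parquet
```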
Key COPY INTO options:
| Option | Values | Use Case |
|---|---|---|
| ON_ERROR | ABORT_STATEMENT / CONTINUE / SKIP_FILE | Handle bad rows or files |
| PURGE = TRUE | — | Auto-delete files from stage after loading |
| FORCE = TRUE | — | Re-load already-loaded files |
| PATTERN = '.*\.csv' | Regex | Load only matching file names |
| FILES = ('f1.csv','f2.csv') | List | Load specific files |
Q2: What is Snowpipe? How does it differ from COPY INTO?
Simple Explanation: COPY INTO = you manually trigger the load (or schedule with a Task). Snowpipe = load happens AUTOMATICALLY when new files arrive. Snowpipe monitors your stage and runs COPY INTO for you the moment a new file appears.
Real-world analogy: COPY INTO = calling a taxi each time you need a ride (you trigger it). Snowpipe = an airport shuttle that departs automatically whenever passengers (files) arrive — no call needed.
-- Create a pipe that auto-loads from a stage
CREATE PIPE booking_pipe
AUTO_INGEST = TRUE -- Enabled for cloud event notifications
AS
COPY INTO raw_bookings -- What to run when files arrive
FROM @azure_booking_stage
FILE_FORMAT = (TYPE = 'PARQUET');
-- Check pipe status
SELECT SYSTEM$PIPE_STATUS('booking_pipe');
-- Shows: pendingFileCount, lastIngestedTimestamp, status
-- See load history (what files loaded, errors)
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'raw_bookings',
START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP)
));
Snowpipe vs COPY INTO vs Snowpipe Streaming:
| | Snowpipe | COPY INTO | Snowpipe Streaming |
|---|---|---|---|
| Trigger | Automatic on file arrival (cloud event notification) | Manual, or scheduled via a Task | Continuous — application pushes rows via SDK |
| Latency | ~1 minute after a file lands | Minutes (batch) | Seconds (row-level) |
| Compute | Serverless (per-file overhead + compute) | Your warehouse | Serverless |
| Input | Files in a stage | Files in a stage | Rows (Java/Python SDK, Kafka connector) |
| Best for | Near-real-time file drops | Scheduled bulk/batch loads | True streaming feeds |
SECTION 2: STREAMS & TASKS (CDC in Snowflake)
Q3: What Are Snowflake Streams? How Do They Enable CDC?
Simple Explanation: A Stream is Snowflake's native CDC (Change Data Capture) mechanism. It tracks every INSERT, UPDATE, and DELETE on a table — without modifying the original table or storing a separate copy. When you query the stream, you see "what changed since I last consumed it?" Once you consume the changes inside a DML transaction, the stream automatically advances its offset — the consumed rows disappear from the stream view.
Real-world analogy: Think of a hotel check-in/check-out log. The front desk doesn't copy the entire guest database every hour — it just records new arrivals (INSERT), room changes (UPDATE), and departures (DELETE). The housekeeping team reads the log, services the rooms, and tears off the page. Tomorrow there's a fresh page with only new changes.
Why do we need it? Without streams, you'd have to either (a) reload the entire table every time (expensive, slow), or (b) build custom CDC logic using timestamps or audit columns (fragile, error-prone). Streams give you a zero-maintenance, guaranteed-consistent way to see exactly what changed — perfect for incremental ETL pipelines.
How Streams Track Changes (The Offset Model):
TABLE: raw_bookings
┌────────────────────────────────────────────┐
│ Row 1 │ Row 2 │ Row 3 │ Row 4 │
│ (old) │ (old) │ (new!) │ (new!) │
└────────────────────────────────────────────┘
▲
│ Stream offset points here
│ "I've seen everything before Row 3"
│
STREAM VIEW: shows Row 3 and Row 4 only (the delta)
After you consume Row 3 & 4 in a DML (INSERT/MERGE):
→ Offset advances to Row 4
→ Stream view is now empty (until new changes arrive)
Streams use Snowflake's internal versioning (Time Travel metadata) to track the offset. No data is copied — the stream is essentially a pointer into the table's change history.
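The offset mechanics above can be mimicked with a toy Python class (purely illustrative — Snowflake implements this with Time Travel metadata, not list slicing):

```python
# Toy model of a stream offset (NOT Snowflake internals):
# reading is non-destructive; only a "DML consume" advances the offset.

class Stream:
    def __init__(self, table: list):
        self.table = table
        self.offset = len(table)      # at creation the stream sees nothing

    def peek(self) -> list:
        """SELECT from the stream — shows the delta, does NOT consume it."""
        return self.table[self.offset:]

    def consume(self) -> list:
        """A committed DML (INSERT/MERGE) using the stream — advances offset."""
        delta = self.table[self.offset:]
        self.offset = len(self.table)
        return delta

tbl = ["row1", "row2"]
s = Stream(tbl)
tbl += ["row3", "row4"]               # new DML on the base table
before = s.peek()                     # delta visible, still unconsumed
consumed = s.consume()                # delta used in a DML -> offset advances
after = s.peek()                      # empty until new changes arrive
```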
Stream Metadata Columns:
Every stream row includes three hidden metadata columns:
| Column | Type | Meaning |
|---|---|---|
| METADATA$ACTION | VARCHAR | 'INSERT' or 'DELETE' |
| METADATA$ISUPDATE | BOOLEAN | TRUE if this row is part of an UPDATE |
| METADATA$ROW_ID | VARCHAR | Unique internal row identifier |
An UPDATE appears as TWO rows in the stream: a DELETE of the old value + an INSERT of the new value, both with METADATA$ISUPDATE = TRUE.
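A hedged sketch of how a consumer interprets those two columns (the `classify` helper is illustrative, not a Snowflake function):

```python
# How the METADATA$ACTION / METADATA$ISUPDATE pair decodes (illustrative):
# an UPDATE surfaces as a DELETE of the old value + an INSERT of the new one.

def classify(action: str, is_update: bool) -> str:
    if action == "INSERT" and not is_update:
        return "new row"
    if action == "INSERT" and is_update:
        return "updated row (new value)"
    if action == "DELETE" and is_update:
        return "updated row (old value)"
    return "deleted row"

classify("INSERT", False)   # new row
classify("DELETE", True)    # updated row (old value)
```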
Stream Types:
| Type | Tracks | Use Case |
|---|---|---|
| Standard (default) | INSERT + UPDATE + DELETE | Full CDC — SCD Type 2, MERGE pipelines |
| Append-only | INSERT only | Log/event tables where rows are never updated |
| Insert-only | INSERT only (external tables) | Tracking new files arriving in external stages |
-- ========================================
-- CREATING STREAMS
-- ========================================
-- Standard stream (tracks all DML changes)
CREATE STREAM booking_changes_stream
ON TABLE raw_bookings
SHOW_INITIAL_ROWS = FALSE; -- Only track NEW changes from this point
-- SHOW_INITIAL_ROWS = TRUE → backfill: include all existing rows as INSERTs
-- Append-only stream (lighter weight — only new inserts)
CREATE STREAM new_bookings_stream
ON TABLE raw_bookings
APPEND_ONLY = TRUE;
-- Use for event/log tables where rows are never updated or deleted
-- Insert-only stream on an external table
CREATE STREAM new_files_stream
ON EXTERNAL TABLE ext_booking_files
INSERT_ONLY = TRUE;
-- Fires when new files appear in the external stage
-- ========================================
-- QUERYING A STREAM (peek at changes)
-- ========================================
SELECT *,
METADATA$ACTION, -- 'INSERT' or 'DELETE'
METADATA$ISUPDATE, -- TRUE if part of an UPDATE
METADATA$ROW_ID -- Internal row identifier
FROM booking_changes_stream;
-- This is a READ — it does NOT consume the stream
-- Stream is only consumed when changes are used in a DML statement
-- ========================================
-- CONSUMING A STREAM (the MERGE pattern — #1 interview pattern!)
-- ========================================
-- The Stream + MERGE pattern: upsert changes into a target table
MERGE INTO silver_bookings AS target
USING (
SELECT booking_id, booking_date, amount, airline_code,
METADATA$ACTION AS action,
METADATA$ISUPDATE AS is_update
FROM booking_changes_stream
) AS src
ON target.booking_id = src.booking_id
-- Handle inserts (new rows)
WHEN NOT MATCHED AND src.action = 'INSERT'
THEN INSERT (booking_id, booking_date, amount, airline_code)
VALUES (src.booking_id, src.booking_date, src.amount, src.airline_code)
-- Handle updates (new value of updated rows)
WHEN MATCHED AND src.action = 'INSERT' AND src.is_update = TRUE
THEN UPDATE SET
target.amount = src.amount,
target.booking_date = src.booking_date,
target.airline_code = src.airline_code
-- Handle deletes
WHEN MATCHED AND src.action = 'DELETE' AND src.is_update = FALSE
THEN DELETE;
-- After MERGE commits → stream advances → consumed rows disappear
-- Next query to stream shows only changes AFTER this MERGE
-- ========================================
-- STALE STREAMS — how to avoid them
-- ========================================
-- A stream becomes STALE if you don't consume it within the table's
-- DATA_RETENTION_TIME_IN_DAYS (Time Travel period, default 1 day).
-- Once stale, the stream is BROKEN — you must recreate it.
-- Check if a stream is stale:
SHOW STREAMS LIKE 'booking_changes_stream';
-- Look at "stale_after" column — if past, the stream is stale
-- Prevention: consume streams regularly OR extend retention:
ALTER TABLE raw_bookings SET DATA_RETENTION_TIME_IN_DAYS = 14;
-- Now stream has 14 days before going stale
-- Trade-off: longer retention = more storage cost (Time Travel storage)
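The staleness rule reduces to a one-line check (illustrative helper, not a Snowflake API — in practice you compare the `stale_after` column from SHOW STREAMS against the current time):

```python
from datetime import datetime, timedelta

# A stream breaks if not consumed within the table's Time Travel retention.
def is_stale(last_consumed: datetime, retention_days: int, now: datetime) -> bool:
    return now - last_consumed > timedelta(days=retention_days)

now = datetime(2026, 3, 25)
stale_default = is_stale(now - timedelta(days=2), 1, now)    # default 1-day retention
stale_extended = is_stale(now - timedelta(days=2), 14, now)  # after extending retention
```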
Interview Tip: The Stream + MERGE pattern is the single most asked Snowflake pipeline question. Know it cold. Also explain that streams are "transactional consumers" — they only advance when the consuming DML commits. If the MERGE fails, the stream stays where it was (no data loss). Mention stale streams and retention as a follow-up — it shows production experience.
Q4: What Are Snowflake Tasks? How Do You Schedule SQL?
Simple Explanation: A Task is Snowflake's built-in scheduler — it runs a SQL statement or stored procedure on a defined schedule (cron or interval). Tasks can be chained into a DAG (Directed Acyclic Graph) where child tasks run only after their parent succeeds. Combined with Streams, Tasks form the classic Snowflake incremental pipeline.
Real-world analogy: Think of a factory assembly line with a foreman. The foreman (Task scheduler) blows a whistle every 5 minutes. Station 1 (bronze) starts work. When Station 1 finishes, Station 2 (silver) automatically begins. Station 3 (gold) waits for Station 2. If the foreman sees no raw materials on the conveyor belt (SYSTEM$STREAM_HAS_DATA() = FALSE), he skips that cycle — no wasted work.
Why do we need it? Without Tasks, you'd need an external scheduler (Airflow, cron, Azure Data Factory) just to run SQL on a timer. Tasks keep everything inside Snowflake — no external dependencies, no credentials to manage, no network calls. For simple-to-medium pipelines, Tasks eliminate an entire orchestration layer.
-- ========================================
-- CREATING TASKS
-- ========================================
-- Simple task: run SQL every 5 minutes using a warehouse
CREATE TASK load_bronze_task
WAREHOUSE = 'ETL_WH' -- Uses this warehouse's credits
SCHEDULE = '5 MINUTE' -- Interval schedule
AS
COPY INTO raw_bookings
FROM @azure_booking_stage;
-- Cron-based schedule
CREATE TASK nightly_aggregate_task
WAREHOUSE = 'ETL_WH'
SCHEDULE = 'USING CRON 0 2 * * * UTC' -- 2:00 AM UTC daily
-- Cron: minute hour day month weekday timezone
AS
CALL refresh_daily_aggregates();
-- !! CRITICAL: Tasks are SUSPENDED by default !!
-- You MUST resume them or they will never run
ALTER TASK load_bronze_task RESUME;
ALTER TASK load_bronze_task SUSPEND; -- Pause it
-- ========================================
-- SERVERLESS TASKS vs WAREHOUSE-BASED TASKS
-- ========================================
-- Warehouse-based: uses a specific warehouse (you control size & cost)
CREATE TASK warehouse_task
WAREHOUSE = 'ETL_WH'
SCHEDULE = '10 MINUTE'
AS SELECT 1;
-- Serverless: Snowflake manages compute — no warehouse needed
CREATE TASK serverless_task
USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'MEDIUM'
-- Omit WAREHOUSE = ... to make it serverless
SCHEDULE = '5 MINUTE'
AS
INSERT INTO audit_log VALUES (CURRENT_TIMESTAMP, 'heartbeat');
-- Serverless tasks:
-- ✓ ~10% cheaper than warehouse-based for equivalent work
-- ✓ No warehouse to manage — auto-scales to workload
-- ✓ Best for lightweight, frequent tasks
-- ✗ Less control over compute resources
-- ✗ Compute size is approximate (Snowflake decides)
-- ========================================
-- TASK TREES (DAG — parent → child dependencies)
-- ========================================
-- Root task: has a SCHEDULE (the only task that needs one)
CREATE TASK bronze_task
WAREHOUSE = 'ETL_WH'
SCHEDULE = 'USING CRON 0 2 * * * UTC' -- 2 AM daily
AS
CALL load_bronze_data();
-- Child task: uses AFTER clause (no SCHEDULE — triggered by parent)
CREATE TASK silver_task
WAREHOUSE = 'ETL_WH'
AFTER bronze_task -- Runs ONLY after bronze_task succeeds
AS
CALL transform_silver_data();
-- Grandchild task: runs after silver
CREATE TASK gold_task
WAREHOUSE = 'ETL_WH'
AFTER silver_task
AS
CALL aggregate_gold_metrics();
-- A child can depend on MULTIPLE parents (fan-in):
CREATE TASK final_report_task
WAREHOUSE = 'ETL_WH'
AFTER silver_task, gold_task -- Waits for BOTH to finish
AS
CALL generate_report();
-- !! Resume ONLY the root task — children auto-activate !!
-- !! But you must RESUME children BEFORE resuming the root !!
ALTER TASK gold_task RESUME;
ALTER TASK silver_task RESUME;
ALTER TASK bronze_task RESUME; -- Root task last
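The resume order is just a post-order walk of the DAG — deepest children first, root last. A hypothetical helper (the dict-based DAG and task names are illustrative, not a Snowflake API):

```python
# Compute the ALTER TASK ... RESUME order for a task DAG (illustrative).
# dag maps each task to its child tasks; root is the only scheduled task.

def resume_order(dag: dict, root: str) -> list:
    order = []
    def visit(task: str):
        for child in dag.get(task, []):
            visit(child)
        order.append(task)            # post-order: children before parents
    visit(root)
    return order

dag = {"bronze_task": ["silver_task"], "silver_task": ["gold_task"]}
order = resume_order(dag, "bronze_task")
# ['gold_task', 'silver_task', 'bronze_task'] — root resumed last
```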
-- ========================================
-- CONDITIONAL EXECUTION: SYSTEM$STREAM_HAS_DATA()
-- ========================================
-- Only run the task if the stream has new data (skip empty cycles)
CREATE TASK process_booking_changes
WAREHOUSE = 'ETL_WH'
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('booking_changes_stream')
-- If stream is empty → task is SKIPPED (no warehouse credit used!)
AS
MERGE INTO silver_bookings AS target
USING booking_changes_stream AS src
ON target.booking_id = src.booking_id
WHEN NOT MATCHED AND src.METADATA$ACTION = 'INSERT'
THEN INSERT (booking_id, booking_date, amount, airline_code)
VALUES (src.booking_id, src.booking_date, src.amount, src.airline_code)
WHEN MATCHED AND src.METADATA$ACTION = 'INSERT' AND src.METADATA$ISUPDATE = TRUE
THEN UPDATE SET target.amount = src.amount
WHEN MATCHED AND src.METADATA$ACTION = 'DELETE' AND src.METADATA$ISUPDATE = FALSE
THEN DELETE;
ALTER TASK process_booking_changes RESUME;
-- ========================================
-- COMPLETE STREAM + TASK PIPELINE (end-to-end example)
-- ========================================
-- Step 1: Source table exists
-- CREATE TABLE raw_bookings (booking_id VARCHAR, booking_date DATE, ...);
-- Step 2: Create stream to track changes
CREATE STREAM booking_cdc_stream ON TABLE raw_bookings;
-- Step 3: Create task that consumes stream via MERGE
CREATE TASK cdc_pipeline_task
WAREHOUSE = 'ETL_WH'
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('booking_cdc_stream')
AS
MERGE INTO silver_bookings AS t
USING booking_cdc_stream AS s
ON t.booking_id = s.booking_id
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' AND s.METADATA$ISUPDATE = FALSE
THEN DELETE
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' AND s.METADATA$ISUPDATE = TRUE
THEN UPDATE SET t.amount = s.amount, t.booking_date = s.booking_date
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
THEN INSERT (booking_id, booking_date, amount)
VALUES (s.booking_id, s.booking_date, s.amount);
-- Step 4: Resume the task
ALTER TASK cdc_pipeline_task RESUME;
-- Now: any INSERT/UPDATE/DELETE on raw_bookings is automatically
-- captured by the stream and merged into silver_bookings every 5 minutes
-- ========================================
-- MONITORING TASKS
-- ========================================
-- View task run history (recent runs, status, errors)
SELECT name, state, scheduled_time, completed_time,
error_code, error_message,
DATEDIFF('second', query_start_time, completed_time) AS duration_sec
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
TASK_NAME => 'CDC_PIPELINE_TASK',
SCHEDULED_TIME_RANGE_START => DATEADD('day', -1, CURRENT_TIMESTAMP)
))
ORDER BY scheduled_time DESC;
-- state: 'SUCCEEDED', 'FAILED', 'SKIPPED' (WHEN condition was false)
-- Account-level task history (all tasks)
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TASK_HISTORY
WHERE scheduled_time > DATEADD('hour', -24, CURRENT_TIMESTAMP)
ORDER BY scheduled_time DESC;
Interview Tip: Always mention three things: (1) Tasks are suspended by default — forgetting ALTER TASK RESUME is the #1 production mistake. (2) Use WHEN SYSTEM$STREAM_HAS_DATA() to avoid wasting credits on empty runs. (3) In a Task DAG, resume children BEFORE the root task, and only the root task needs a SCHEDULE. If asked "Streams+Tasks vs Airflow?" — answer: Tasks for simple Snowflake-only pipelines; Airflow when you need cross-system orchestration (Snowflake + Spark + APIs).
Q5: What Are Dynamic Tables? How Are They Different from Streams + Tasks?
Simple Explanation: Dynamic Tables (GA 2024, recommended approach 2025-2026) are Snowflake's declarative ELT mechanism. You define WHAT the result should look like using a SELECT query, and Snowflake handles the WHEN (scheduling) and HOW (incremental refresh). You never write MERGE logic, manage stream offsets, or build Task DAGs — Snowflake does it all automatically.
Real-world analogy:
- Streams + Tasks = writing a recipe step by step, setting timers, and telling someone exactly how to cook each ingredient
- Dynamic Tables = showing someone a photo of the finished dish and saying "keep it looking like this, and make sure it's no more than 5 minutes stale"
Why do we need them? Streams + Tasks pipelines work but require 30+ lines of boilerplate per stage (create stream, create task, write MERGE, handle edge cases, resume tasks). For a 5-stage pipeline, that's 150+ lines of procedural code to maintain. Dynamic Tables reduce each stage to a single CREATE statement. Snowflake auto-detects dependencies, auto-schedules refreshes, and auto-computes incremental diffs.
The TARGET_LAG Parameter (freshness guarantee): TARGET_LAG declares the maximum staleness you will tolerate — TARGET_LAG = '5 minutes' means the dynamic table's contents should never lag its sources by more than roughly 5 minutes, and Snowflake chooses the refresh schedule to meet that SLA. You can also set TARGET_LAG = DOWNSTREAM, meaning "refresh only as often as the dynamic tables that read from me require."
-- ========================================
-- CREATING DYNAMIC TABLES
-- ========================================
-- Silver layer: clean + transform raw data
CREATE DYNAMIC TABLE silver_bookings
TARGET_LAG = '5 minutes' -- Freshness guarantee
WAREHOUSE = 'ETL_WH' -- Warehouse for refresh compute
AS
SELECT
booking_id,
booking_date,
UPPER(airline_code) AS airline_code,
amount,
CURRENT_TIMESTAMP AS processed_at
FROM raw_bookings
WHERE amount > 0
AND booking_id IS NOT NULL;
-- Snowflake AUTOMATICALLY:
-- 1. Detects when raw_bookings changes
-- 2. Computes ONLY the incremental diff (not full reprocess)
-- 3. Updates silver_bookings within 5 minutes of source change
-- Gold layer: aggregate (auto-chains to silver!)
CREATE DYNAMIC TABLE gold_daily_revenue
TARGET_LAG = '1 hour'
WAREHOUSE = 'ETL_WH'
AS
SELECT
booking_date,
airline_code,
SUM(amount) AS daily_revenue,
COUNT(*) AS booking_count
FROM silver_bookings -- Reads from dynamic table above!
GROUP BY booking_date, airline_code;
-- Snowflake auto-detects the silver → gold dependency
-- No AFTER clause needed — it just works
-- Manually refresh (force update now, useful for testing)
ALTER DYNAMIC TABLE silver_bookings REFRESH;
-- Suspend / resume
ALTER DYNAMIC TABLE silver_bookings SUSPEND;
ALTER DYNAMIC TABLE silver_bookings RESUME;
-- Check status of all dynamic tables
SHOW DYNAMIC TABLES;
-- Key columns: target_lag, refresh_mode, refresh_mode_reason, data_timestamp
-- refresh_mode: 'INCREMENTAL' (optimal) or 'FULL' (fallback if query is too complex)
-- data_timestamp: when the data was last refreshed to
Comparison: Dynamic Tables vs Streams + Tasks vs Materialized Views:
| | Dynamic Tables | Streams + Tasks | Materialized Views |
|---|---|---|---|
| Style | Declarative (WHAT) | Procedural (HOW) | Declarative (WHAT) |
| Setup complexity | Single CREATE | 30+ lines (stream + task + MERGE) | Single CREATE |
| Joins supported | Full SQL | Full SQL | No joins (single table only) |
| Aggregations | Yes | Yes | Yes |
| Incremental refresh | Automatic | Manual (you write MERGE) | Automatic |
| Dependency mgmt | Auto-detected | Manual (AFTER clause) | N/A (no chaining) |
| Min freshness | 1 minute (TARGET_LAG) | Seconds (task schedule) | Seconds (auto-refresh) |
| SCD Type 2 | Not supported | Supported (stream tracks before/after) | Not supported |
| Complex logic | SQL only | SQL + stored procedures + external functions | Simple aggregations only |
| Best for | Multi-step ELT pipelines | Complex CDC, SCD Type 2, sub-minute latency | BI dashboard acceleration |
When to use which:
- Dynamic Tables — Default choice for new ELT pipelines. Multi-stage bronze → silver → gold transformations.
- Streams + Tasks — When you need SCD Type 2, sub-minute latency, stored procedures, or external function calls.
- Materialized Views — Single-table aggregations for BI dashboards (no joins, no UDFs, read-heavy/write-light tables).
Interview Tip: Lead with Dynamic Tables as the modern approach, then say "but Streams + Tasks are still necessary for SCD Type 2 and sub-minute latency." Mention TARGET_LAG as the key differentiator — you set a freshness SLA, not a schedule. If asked about cost: Dynamic Tables can be slightly more expensive because Snowflake may refresh more aggressively than needed to meet the TARGET_LAG guarantee.
Q5b: Explain Materialized Views in Snowflake
Simple Explanation: A Materialized View (MV) is a pre-computed, physically stored result of a query. Unlike a regular view (which re-runs the query every time you SELECT from it), a Materialized View stores the results on disk and Snowflake automatically keeps them in sync with the base table. Queries that hit the MV read pre-computed results instead of scanning the full table — often 10-100x faster.
Real-world analogy: A regular view is like asking the librarian "how many books do we have by genre?" every time — she counts them all each time you ask. A Materialized View is like a summary card pinned to the wall that says "Fiction: 500, Science: 200, History: 150." The librarian updates the card whenever new books arrive. You just read the card — instant answer.
Why do we need it? BI dashboards often run the same expensive aggregation queries hundreds of times per day. Without MVs, each dashboard refresh scans the full fact table. With MVs, the aggregation is pre-computed and auto-refreshed — the dashboard reads the small, pre-aggregated result instead.
How auto-refresh works: Snowflake automatically detects when the base table changes and refreshes the MV in the background. You do not schedule refreshes — they happen transparently. The refresh is incremental (only processes changed micro-partitions), so it's efficient.
-- ========================================
-- CREATING A MATERIALIZED VIEW
-- ========================================
CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT
booking_date,
airline_code,
SUM(amount) AS total_revenue,
COUNT(*) AS booking_count,
AVG(amount) AS avg_booking_value
FROM fact_bookings
GROUP BY booking_date, airline_code;
-- Snowflake physically stores this result
-- Auto-refreshes when fact_bookings changes
-- Query the MV (same as querying a table — but pre-computed!)
SELECT * FROM daily_revenue_mv
WHERE booking_date = '2026-03-25' AND airline_code = 'LH';
-- Returns instantly — reads pre-aggregated result, not the full fact table
-- Snowflake's optimizer can AUTOMATICALLY route queries to the MV
-- Even if you query the BASE TABLE, Snowflake may use the MV if it
-- detects the query matches the MV's definition (query rewrite)
-- Drop a materialized view
DROP MATERIALIZED VIEW daily_revenue_mv;
-- Check MV refresh status
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.MATERIALIZED_VIEW_REFRESH_HISTORY
WHERE materialized_view_name = 'DAILY_REVENUE_MV'
ORDER BY start_time DESC;
Limitations (important for interviews!):
| Limitation | Detail |
|---|---|
| No joins | MV can reference only a SINGLE base table — no fact + dimension joins |
| No UDFs | Cannot use user-defined functions |
| No subqueries | Cannot use subqueries in the MV definition |
| No window functions | ROW_NUMBER(), RANK(), etc. not allowed |
| No UNION | Cannot combine multiple queries |
| Write-light tables only | Frequent writes = frequent (expensive) refreshes |
Cost implications:
- Storage cost: The MV physically stores data (additional storage credits).
- Compute cost: Auto-refresh consumes serverless compute credits (background maintenance). Snowflake charges for this automatically — you don't control when it runs.
- Rule of thumb: MVs save money when reads >> writes. If the base table changes every minute and the MV is queried once per day, the refresh cost exceeds the query savings — don't use an MV.
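That rule of thumb is a simple break-even comparison — a sketch with made-up credit numbers (none of these figures are Snowflake pricing):

```python
# MV break-even: credits saved on reads vs credits spent on refreshes.
# All numbers below are hypothetical, for illustration only.

def mv_worth_it(reads_per_day: int, credits_saved_per_read: float,
                refreshes_per_day: int, credits_per_refresh: float) -> bool:
    savings = reads_per_day * credits_saved_per_read
    cost = refreshes_per_day * credits_per_refresh
    return savings > cost

dashboard = mv_worth_it(500, 0.01, 1, 0.5)      # read-heavy: MV pays off
hot_table = mv_worth_it(1, 0.01, 1440, 0.05)    # write-heavy: MV loses
```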
When to use Materialized Views:
- Single-table aggregations queried far more often than the base table changes
- BI dashboards re-running the same GROUP BY hundreds of times a day
- Not for joins, UDFs, or multi-step transformations — use Dynamic Tables there
Interview Tip: The #1 interview follow-up is "MV vs Dynamic Table — when would you use each?" Answer: MVs for simple single-table aggregations on read-heavy tables (BI acceleration). Dynamic Tables for anything involving joins, multi-step transformations, or complex SQL. Also mention that MVs support automatic query rewrite — even if analysts query the base table, Snowflake can transparently use the MV if the query pattern matches. This is a feature Dynamic Tables do NOT have.
SECTION 3: SNOWPARK
Q6: What is Snowpark? When is it useful?
Simple Explanation: Snowpark is Snowflake's API for Python (also Java/Scala) that lets you write DataFrame-style code that runs INSIDE Snowflake's compute — like PySpark, but for Snowflake. Instead of extracting data to a Python environment and processing it there, Snowpark pushes the computation to Snowflake.
Real-world analogy: Normally, if you want to analyze data in Snowflake with Python, you'd pull it to your laptop (expensive network transfer), process it, then push back (another transfer). Snowpark = run Python code on Snowflake's servers directly. No data leaves Snowflake.
# Connect to Snowflake with Snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, upper, current_timestamp
session = Session.builder.configs({
"account": "amadeus-prod",
"user": "data_engineer",
"password": "***",
"warehouse": "ETL_WH",
"database": "ANALYTICS_DB",
"schema": "BOOKINGS"
}).create()
# Create a Snowpark DataFrame (lazy — nothing runs yet)
raw_df = session.table("RAW_BOOKINGS") # Points to Snowflake table
# Transform using DataFrame API (exactly like PySpark!)
silver_df = (raw_df
.filter(col("AMOUNT") > 0) # Filter bad rows
.filter(col("BOOKING_ID").isNotNull())
.select(
col("BOOKING_ID"),
col("BOOKING_DATE"),
upper(col("AIRLINE_CODE")).alias("AIRLINE_CODE"), # Transform
col("AMOUNT"),
current_timestamp().alias("PROCESSED_AT")
)
)
# Write to Snowflake table (this EXECUTES the computation — runs in Snowflake!)
silver_df.write.mode("append").save_as_table("SILVER_BOOKINGS")
# All processing happened INSIDE Snowflake — no data left Snowflake!
# Snowpark UDF: Write Python function, run in Snowflake
from snowflake.snowpark.functions import udf
import snowflake.snowpark.types as T
@udf(return_type=T.FloatType(), input_types=[T.FloatType(), T.StringType()])
def calculate_tax(amount: float, country: str) -> float:
"""Run Python code INSIDE Snowflake — no data transfer"""
tax_rates = {"IN": 0.18, "UK": 0.20, "DE": 0.19}
return amount * tax_rates.get(country, 0.0)
# Use the UDF in SQL or DataFrame
result = silver_df.select(
col("AMOUNT"),
calculate_tax(col("AMOUNT"), col("COUNTRY")).alias("TAX")
)
New 2026: Snowpark Intelligent Hybrid Execution
# Snowpark now AUTOMATICALLY decides:
# - Run small operations locally on your machine (faster for tiny data)
# - Push large operations to Snowflake (for big data)
# No code change needed — it detects which is optimal automatically
import modin.pandas as mpd # Drop-in pandas replacement
import snowflake.snowpark.modin.plugin # Registers the Snowflake backend for modin
df = mpd.read_snowflake("SELECT * FROM bookings WHERE date > '2026-01-01'")
# Snowpark decides: big table → run in Snowflake; result small → return locally
SECTION 4: PERFORMANCE TUNING
Q7: How do you optimize query performance in Snowflake?
The PERFORMANCE CHECKLIST ("CQMWS": Clustering, Query profile, Materialized views, Warehouse sizing, Search optimization):
-- STEP 1: Find slow queries
SELECT query_text, total_elapsed_time/1000 as seconds,
bytes_scanned/1024/1024/1024 as gb_scanned,
partitions_scanned, partitions_total,
ROUND(partitions_scanned*100.0/NULLIF(partitions_total, 0), 1) as pct_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'SUCCESS'
AND warehouse_name = 'BI_WH'
ORDER BY total_elapsed_time DESC
LIMIT 20;
-- HIGH pct_scanned (>80%) = bad pruning → add clustering key
-- LOW gb_scanned but slow = compute bottleneck → scale up warehouse
-- STEP 2: Check clustering health
SELECT SYSTEM$CLUSTERING_INFORMATION('bookings', '(booking_date, airline_code)');
-- average_depth > 6 → reclustering needed
-- average_overlaps > 5 → partitions heavily mixed
-- STEP 3: Use Materialized Views for repeated heavy aggregations
CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT booking_date, airline_code, SUM(amount) as revenue
FROM silver_bookings
GROUP BY booking_date, airline_code;
-- Snowflake pre-computes and auto-refreshes this view
-- BI queries that aggregate by date+airline now hit MV instead of raw table
-- 100x faster for common aggregation queries!
-- STEP 4: Search Optimization Service (for needle-in-haystack queries)
ALTER TABLE bookings ADD SEARCH OPTIMIZATION;
-- Optimizes: WHERE booking_id = 'BK123456' (point lookup)
-- Optimizes: WHERE passenger_email = 'user@example.com' (equality on high-cardinality col)
-- NOT for range queries (use clustering key instead)
-- STEP 5: Query Profile analysis
-- After running a query, in Snowflake UI: Query History → Click query → Query Profile
-- Shows execution tree, where time was spent
-- Key metrics to look for:
-- "Bytes scanned from cache" high = good (cache working)
-- "Partitions scanned" vs "Partitions total" → pruning efficiency
-- "Spilling to disk" = not enough memory → scale up warehouse
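The Step 1 heuristics can be captured as a small triage function (the thresholds are the rules of thumb above, not official Snowflake guidance):

```python
# Triage a slow query from QUERY_HISTORY numbers (illustrative thresholds).

def diagnose(pct_partitions_scanned: float, gb_scanned: float, seconds: float) -> str:
    if pct_partitions_scanned > 80:
        return "poor pruning -> add or repair the clustering key"
    if gb_scanned < 1 and seconds > 60:
        return "compute-bound -> scale up warehouse / check spilling"
    return "inspect Query Profile"

diagnose(95, 50, 120)   # poor pruning
diagnose(10, 0.5, 120)  # compute-bound
```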
Materialized Views vs Dynamic Tables:
| | Materialized View | Dynamic Table |
|---|---|---|
| Define with | CREATE MATERIALIZED VIEW | CREATE DYNAMIC TABLE |
| Auto-refresh | Yes (automatic) | Yes (on target lag) |
| Supports joins | No (single table only) | Full SQL (all joins) |
| Supports streams | No | Yes (incremental from streams) |
| Complexity | Simple aggregations | Complex ELT |
| Best for | BI aggregations, dashboards | Multi-step ELT pipelines |