❄️
Snowflake
Day 2: Data Loading, Pipelines & Performance
❄️
Snowflake · Section 4 of 7


Pro Tip
Time: 5-6 hours | Priority: HIGH — Pipelines & performance are 30-35% of interviews
Context: Real-time booking ingestion, CDC from Oracle, BI dashboard performance
Approach: Simple explanation → analogy → technical details → code → interview tip

SECTION 1: DATA LOADING (1 hour)

Q1: What is COPY INTO? How is it different from INSERT?

Simple Explanation: COPY INTO is Snowflake's bulk loading command — it reads files from a stage (ADLS, S3, internal) and loads them into a table. It is MUCH faster than INSERT for large volumes because it runs in parallel across the warehouse.

Why do we need it? INSERT processes rows one by one — fine for 100 rows, terrible for 100 million rows. COPY INTO parallelizes across the warehouse and processes files in parallel batches.

Real-world analogy: INSERT = one person typing data manually. COPY INTO = 100 workers simultaneously moving pallets from a loading dock into the warehouse.

sql
-- Basic COPY INTO from external stage (Azure ADLS)
COPY INTO raw_bookings
FROM @azure_booking_stage/2026/03/25/        -- Read from this ADLS path
FILE_FORMAT = (
    TYPE = 'PARQUET'                          -- File type
)
ON_ERROR = 'CONTINUE';                        -- Skip bad files, continue loading
-- Options: ABORT_STATEMENT (default), CONTINUE, SKIP_FILE

-- COPY INTO with transformation (load + clean in one step)
COPY INTO silver_bookings (booking_id, booking_date, amount, airline_code)
FROM (
    SELECT
        $1:booking_id::VARCHAR,               -- Cast from VARIANT
        $1:booking_date::DATE,
        $1:fare.amount::NUMBER(10,2),
        UPPER($1:airline_code::VARCHAR)        -- Transform during load
    FROM @azure_booking_stage/bookings.json
)
FILE_FORMAT = (TYPE = 'JSON');

-- COPY INTO with error handling — see what failed
COPY INTO raw_bookings FROM @stage/
ON_ERROR = 'CONTINUE';

-- After loading, check what errors occurred:
SELECT * FROM TABLE(VALIDATE(raw_bookings, JOB_ID => '_last'));
-- Shows exactly which files + rows failed and why

-- COPY INTO is idempotent (safe to re-run!)
-- Snowflake tracks which files were already loaded (load history table)
-- Re-running the same COPY INTO skips already-loaded files automatically
-- To force reload: COPY INTO ... FORCE = TRUE;

Key COPY INTO options:

| Option | Values | Use Case |
| --- | --- | --- |
| ON_ERROR | ABORT_STATEMENT / CONTINUE / SKIP_FILE | Handle bad rows |
| PURGE | TRUE / FALSE | Auto-delete files from stage after loading |
| FORCE | TRUE / FALSE | Re-load already-loaded files |
| PATTERN | Regex, e.g. '.*\.csv' | Load only matching file names |
| FILES | List, e.g. ('f1.csv','f2.csv') | Load specific files |
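COPY INTO's idempotency (files already recorded in load history are skipped unless FORCE = TRUE) can be sketched in plain Python. This is an illustrative model only, not the Snowflake API; the function and file names are made up:

```python
# Sketch of COPY INTO's idempotent load-history behavior (illustrative model,
# not Snowflake's API): already-loaded files are skipped unless force=True.
def copy_into(load_history: set, staged_files: list, force: bool = False) -> list:
    """Return the files that would actually be loaded on this run."""
    loaded = []
    for f in staged_files:
        if force or f not in load_history:
            loaded.append(f)
            load_history.add(f)   # Snowflake records this in load metadata
    return loaded

history = set()
copy_into(history, ["a.parquet", "b.parquet"])   # both files load
copy_into(history, ["a.parquet", "c.parquet"])   # only c.parquet loads
copy_into(history, ["a.parquet"], force=True)    # a.parquet re-loads
```

This is why re-running the same COPY INTO after a partial failure is safe: only files missing from the load history are picked up.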

Q2: What is Snowpipe? How does it differ from COPY INTO?

Simple Explanation: COPY INTO = you manually trigger the load (or schedule with a Task). Snowpipe = load happens AUTOMATICALLY when new files arrive. Snowpipe monitors your stage and runs COPY INTO for you the moment a new file appears.

Real-world analogy: COPY INTO = you call a taxi (trigger manually). Snowpipe = Uber pool running continuously — jumps in automatically whenever you're ready (file arrives).

FILE ARRIVES IN ADLS/S3
        ↓
SNOWPIPE NOTIFICATION (via Azure Event Grid / AWS SQS)
        ↓ (seconds later)
SNOWPIPE COPIES FILE INTO TABLE
        ↓
TABLE IS UPDATED
Total latency: ~30 seconds to 1 minute

vs COPY INTO (batch):
CRON JOB RUNS EVERY 5 MINUTES
        ↓ "are there new files?"
COPY INTO runs for all new files
Total latency: up to 5 minutes
sql
-- Create a pipe that auto-loads from a stage
CREATE PIPE booking_pipe
    AUTO_INGEST = TRUE          -- Enabled for cloud event notifications
    AS
    COPY INTO raw_bookings      -- What to run when files arrive
    FROM @azure_booking_stage
    FILE_FORMAT = (TYPE = 'PARQUET');

-- Check pipe status
SELECT SYSTEM$PIPE_STATUS('booking_pipe');
-- Shows: pendingFileCount, lastIngestedTimestamp, status

-- See load history (what files loaded, errors)
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'raw_bookings',
    START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP)
));

Snowpipe vs COPY INTO vs Snowpipe Streaming:

COPY INTO
✓ Batch: you trigger manually or via Task
✓ Cheapest option (uses your warehouse)
✓ Best for scheduled batch ETL (hourly, daily)
✗ Latency: as long as your schedule (5min, 1hr, daily)
SNOWPIPE (auto-ingest)
✓ Event-driven: loads within ~1 min of file arrival
✓ Serverless (no warehouse needed)
✓ Best for near-real-time ingestion from landing zones
✓ Idempotent: tracks loaded files
✗ Higher cost than COPY INTO for same volume
✗ Minimum latency ~30 seconds
SNOWPIPE STREAMING (API-based)
✓ Push rows directly via SDK (Python/Java)
✓ Sub-second to seconds latency
✓ Best for: Kafka Connect → Snowflake, real-time IoT, Flink
✗ Most expensive option (billed per ingested data volume)
✗ Small files accumulate → needs periodic compaction
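The trade-offs above can be condensed into a toy decision helper. The function name and latency thresholds are my own illustration, not anything Snowflake defines:

```python
# Toy decision helper encoding the ingestion trade-offs above.
# The thresholds and function name are illustrative, not a Snowflake API.
def choose_ingestion(latency_sla_seconds: float, scheduled_batch: bool) -> str:
    if scheduled_batch and latency_sla_seconds >= 300:
        return "COPY INTO (batch, cheapest)"
    if latency_sla_seconds >= 60:
        return "Snowpipe (event-driven, serverless)"
    return "Snowpipe Streaming (row-level API, sub-second)"

choose_ingestion(86400, True)   # daily batch -> COPY INTO
choose_ingestion(60, False)     # near-real-time -> Snowpipe
choose_ingestion(1, False)      # real-time -> Snowpipe Streaming
```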

SECTION 2: STREAMS & TASKS (CDC in Snowflake)

Q3: What Are Snowflake Streams? How Do They Enable CDC?

Simple Explanation: A Stream is Snowflake's native CDC (Change Data Capture) mechanism. It tracks every INSERT, UPDATE, and DELETE on a table — without modifying the original table or storing a separate copy. When you query the stream, you see "what changed since I last consumed it?" Once you consume the changes inside a DML transaction, the stream automatically advances its offset — the consumed rows disappear from the stream view.

Real-world analogy: Think of a hotel check-in/check-out log. The front desk doesn't copy the entire guest database every hour — it just records new arrivals (INSERT), room changes (UPDATE), and departures (DELETE). The housekeeping team reads the log, services the rooms, and tears off the page. Tomorrow there's a fresh page with only new changes.

Why do we need it? Without streams, you'd have to either (a) reload the entire table every time (expensive, slow), or (b) build custom CDC logic using timestamps or audit columns (fragile, error-prone). Streams give you a zero-maintenance, guaranteed-consistent way to see exactly what changed — perfect for incremental ETL pipelines.

How Streams Track Changes (The Offset Model):

📐 Architecture Diagram
TABLE: raw_bookings
┌────────────────────────────────────────────┐
│  Row 1  │  Row 2  │  Row 3  │  Row 4      │
│ (old)   │ (old)   │ (new!)  │ (new!)      │
└────────────────────────────────────────────┘
                       ▲
                       │ Stream offset points here
                       │ "I've seen everything before Row 3"
                       │
STREAM VIEW: shows Row 3 and Row 4 only (the delta)

After you consume Row 3 & 4 in a DML (INSERT/MERGE):
  → Offset advances to Row 4
  → Stream view is now empty (until new changes arrive)

Streams use Snowflake's internal versioning (Time Travel metadata) to track the offset. No data is copied — the stream is essentially a pointer into the table's change history.

Stream Metadata Columns:

Every stream row includes three hidden metadata columns:

| Column | Type | Meaning |
| --- | --- | --- |
| METADATA$ACTION | VARCHAR | 'INSERT' or 'DELETE' |
| METADATA$ISUPDATE | BOOLEAN | TRUE if this row is part of an UPDATE |
| METADATA$ROW_ID | VARCHAR | Unique internal row identifier |

An UPDATE appears as TWO rows in the stream: a DELETE of the old value + an INSERT of the new value, both with METADATA$ISUPDATE = TRUE.
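This DELETE + INSERT representation of an UPDATE can be simulated with a small diff function. It is an illustrative model of the stream's output, not Snowflake internals; table snapshots here are plain dicts keyed by primary key:

```python
# Simulate how a standard stream represents changes (illustrative model, not
# Snowflake internals): an UPDATE appears as a DELETE + INSERT pair, both
# flagged with METADATA$ISUPDATE = True.
def stream_delta(before: dict, after: dict) -> list:
    """Diff two table snapshots {pk: value} into (pk, ACTION, IS_UPDATE) rows."""
    rows = []
    for pk in after:
        if pk not in before:
            rows.append((pk, "INSERT", False))        # brand-new row
        elif before[pk] != after[pk]:
            rows.append((pk, "DELETE", True))         # old value of updated row
            rows.append((pk, "INSERT", True))         # new value of updated row
    for pk in before:
        if pk not in after:
            rows.append((pk, "DELETE", False))        # deleted row
    return rows

delta = stream_delta({"BK1": 100, "BK2": 200}, {"BK1": 150, "BK3": 300})
# BK1 updated -> DELETE+INSERT pair; BK3 inserted; BK2 deleted
```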

Stream Types:

| Type | Tracks | Use Case |
| --- | --- | --- |
| Standard (default) | INSERT + UPDATE + DELETE | Full CDC — SCD Type 2, MERGE pipelines |
| Append-only | INSERT only | Log/event tables where rows are never updated |
| Insert-only | INSERT only (external tables) | Tracking new files arriving in external stages |
sql
-- ========================================
-- CREATING STREAMS
-- ========================================

-- Standard stream (tracks all DML changes)
CREATE STREAM booking_changes_stream
    ON TABLE raw_bookings
    SHOW_INITIAL_ROWS = FALSE;   -- Only track NEW changes from this point
    -- SHOW_INITIAL_ROWS = TRUE → backfill: include all existing rows as INSERTs

-- Append-only stream (lighter weight — only new inserts)
CREATE STREAM new_bookings_stream
    ON TABLE raw_bookings
    APPEND_ONLY = TRUE;
-- Use for event/log tables where rows are never updated or deleted

-- Insert-only stream on an external table
CREATE STREAM new_files_stream
    ON EXTERNAL TABLE ext_booking_files
    INSERT_ONLY = TRUE;
-- Fires when new files appear in the external stage

-- ========================================
-- QUERYING A STREAM (peek at changes)
-- ========================================

SELECT *,
    METADATA$ACTION,         -- 'INSERT' or 'DELETE'
    METADATA$ISUPDATE,       -- TRUE if part of an UPDATE
    METADATA$ROW_ID          -- Internal row identifier
FROM booking_changes_stream;
-- This is a READ — it does NOT consume the stream
-- Stream is only consumed when changes are used in a DML statement

-- ========================================
-- CONSUMING A STREAM (the MERGE pattern — #1 interview pattern!)
-- ========================================

-- The Stream + MERGE pattern: upsert changes into a target table
MERGE INTO silver_bookings AS target
USING (
    SELECT booking_id, booking_date, amount, airline_code,
           METADATA$ACTION AS action,
           METADATA$ISUPDATE AS is_update
    FROM booking_changes_stream
) AS src
ON target.booking_id = src.booking_id
-- Handle inserts (new rows)
WHEN NOT MATCHED AND src.action = 'INSERT'
    THEN INSERT (booking_id, booking_date, amount, airline_code)
         VALUES (src.booking_id, src.booking_date, src.amount, src.airline_code)
-- Handle updates (new value of updated rows)
WHEN MATCHED AND src.action = 'INSERT' AND src.is_update = TRUE
    THEN UPDATE SET
        target.amount = src.amount,
        target.booking_date = src.booking_date,
        target.airline_code = src.airline_code
-- Handle deletes
WHEN MATCHED AND src.action = 'DELETE' AND src.is_update = FALSE
    THEN DELETE;

-- After MERGE commits → stream advances → consumed rows disappear
-- Next query to stream shows only changes AFTER this MERGE

-- ========================================
-- STALE STREAMS — how to avoid them
-- ========================================

-- A stream becomes STALE if you don't consume it within the table's
-- DATA_RETENTION_TIME_IN_DAYS (Time Travel period, default 1 day).
-- Once stale, the stream is BROKEN — you must recreate it.

-- Check if a stream is stale:
SHOW STREAMS LIKE 'booking_changes_stream';
-- Look at "stale_after" column — if past, the stream is stale

-- Prevention: consume streams regularly OR extend retention:
ALTER TABLE raw_bookings SET DATA_RETENTION_TIME_IN_DAYS = 14;
-- Now stream has 14 days before going stale
-- Trade-off: longer retention = more storage cost (Time Travel storage)

Interview Tip: The Stream + MERGE pattern is the single most asked Snowflake pipeline question. Know it cold. Also explain that streams are "transactional consumers" — they only advance when the consuming DML commits. If the MERGE fails, the stream stays where it was (no data loss). Mention stale streams and retention as a follow-up — it shows production experience.
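The staleness rule above reduces to simple date arithmetic. A minimal sketch, assuming the last-consumed timestamp is tracked externally (Snowflake surfaces this as the stale_after column, not via this hypothetical function):

```python
# Sketch of the stream-staleness rule (illustrative, not a Snowflake API):
# a stream goes stale if left unconsumed longer than the table's
# DATA_RETENTION_TIME_IN_DAYS (Time Travel retention).
from datetime import datetime, timedelta

def is_stale(last_consumed: datetime, retention_days: int, now: datetime) -> bool:
    return now - last_consumed > timedelta(days=retention_days)

now = datetime(2026, 3, 25)
is_stale(datetime(2026, 3, 23), 1, now)    # True: 2 days unconsumed, 1-day retention
is_stale(datetime(2026, 3, 23), 14, now)   # False: within 14-day retention
```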

Q4: What Are Snowflake Tasks? How Do You Schedule SQL?

Simple Explanation: A Task is Snowflake's built-in scheduler — it runs a SQL statement or stored procedure on a defined schedule (cron or interval). Tasks can be chained into a DAG (Directed Acyclic Graph) where child tasks run only after their parent succeeds. Combined with Streams, Tasks form the classic Snowflake incremental pipeline.

Real-world analogy: Think of a factory assembly line with a foreman. The foreman (Task scheduler) blows a whistle every 5 minutes. Station 1 (bronze) starts work. When Station 1 finishes, Station 2 (silver) automatically begins. Station 3 (gold) waits for Station 2. If the foreman sees no raw materials on the conveyor belt (SYSTEM$STREAM_HAS_DATA() = FALSE), he skips that cycle — no wasted work.

Why do we need it? Without Tasks, you'd need an external scheduler (Airflow, cron, Azure Data Factory) just to run SQL on a timer. Tasks keep everything inside Snowflake — no external dependencies, no credentials to manage, no network calls. For simple-to-medium pipelines, Tasks eliminate an entire orchestration layer.

sql
-- ========================================
-- CREATING TASKS
-- ========================================

-- Simple task: run SQL every 5 minutes using a warehouse
CREATE TASK load_bronze_task
    WAREHOUSE = 'ETL_WH'               -- Uses this warehouse's credits
    SCHEDULE = '5 MINUTE'              -- Interval schedule
AS
    COPY INTO raw_bookings
    FROM @azure_booking_stage;

-- Cron-based schedule
CREATE TASK nightly_aggregate_task
    WAREHOUSE = 'ETL_WH'
    SCHEDULE = 'USING CRON 0 2 * * * UTC'   -- 2:00 AM UTC daily
    -- Cron: minute hour day month weekday timezone
AS
    CALL refresh_daily_aggregates();

-- !! CRITICAL: Tasks are SUSPENDED by default !!
-- You MUST resume them or they will never run
ALTER TASK load_bronze_task RESUME;
ALTER TASK load_bronze_task SUSPEND;     -- Pause it

-- ========================================
-- SERVERLESS TASKS vs WAREHOUSE-BASED TASKS
-- ========================================

-- Warehouse-based: uses a specific warehouse (you control size & cost)
CREATE TASK warehouse_task
    WAREHOUSE = 'ETL_WH'
    SCHEDULE = '10 MINUTE'
AS SELECT 1;

-- Serverless: Snowflake manages compute — no warehouse needed
CREATE TASK serverless_task
    USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'MEDIUM'
    -- Omit WAREHOUSE = ... to make it serverless
    SCHEDULE = '5 MINUTE'
AS
    INSERT INTO audit_log VALUES (CURRENT_TIMESTAMP, 'heartbeat');

-- Serverless tasks:
-- ✓ ~10% cheaper than warehouse-based for equivalent work
-- ✓ No warehouse to manage — auto-scales to workload
-- ✓ Best for lightweight, frequent tasks
-- ✗ Less control over compute resources
-- ✗ Compute size is approximate (Snowflake decides)

-- ========================================
-- TASK TREES (DAG — parent → child dependencies)
-- ========================================

-- Root task: has a SCHEDULE (the only task that needs one)
CREATE TASK bronze_task
    WAREHOUSE = 'ETL_WH'
    SCHEDULE = 'USING CRON 0 2 * * * UTC'  -- 2 AM daily
AS
    CALL load_bronze_data();

-- Child task: uses AFTER clause (no SCHEDULE — triggered by parent)
CREATE TASK silver_task
    WAREHOUSE = 'ETL_WH'
    AFTER bronze_task               -- Runs ONLY after bronze_task succeeds
AS
    CALL transform_silver_data();

-- Grandchild task: runs after silver
CREATE TASK gold_task
    WAREHOUSE = 'ETL_WH'
    AFTER silver_task
AS
    CALL aggregate_gold_metrics();

-- A child can depend on MULTIPLE parents (fan-in):
CREATE TASK final_report_task
    WAREHOUSE = 'ETL_WH'
    AFTER silver_task, gold_task     -- Waits for BOTH to finish
AS
    CALL generate_report();

-- !! Resume ONLY the root task — children auto-activate !!
-- !! But you must RESUME children BEFORE resuming the root !!
ALTER TASK gold_task RESUME;
ALTER TASK silver_task RESUME;
ALTER TASK bronze_task RESUME;      -- Root task last

-- ========================================
-- CONDITIONAL EXECUTION: SYSTEM$STREAM_HAS_DATA()
-- ========================================

-- Only run the task if the stream has new data (skip empty cycles)
CREATE TASK process_booking_changes
    WAREHOUSE = 'ETL_WH'
    SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('booking_changes_stream')
    -- If stream is empty → task is SKIPPED (no warehouse credit used!)
AS
    MERGE INTO silver_bookings AS target
    USING booking_changes_stream AS src
    ON target.booking_id = src.booking_id
    WHEN NOT MATCHED AND src.METADATA$ACTION = 'INSERT'
        THEN INSERT (booking_id, booking_date, amount, airline_code)
             VALUES (src.booking_id, src.booking_date, src.amount, src.airline_code)
    WHEN MATCHED AND src.METADATA$ACTION = 'INSERT' AND src.METADATA$ISUPDATE = TRUE
        THEN UPDATE SET target.amount = src.amount
    WHEN MATCHED AND src.METADATA$ACTION = 'DELETE' AND src.METADATA$ISUPDATE = FALSE
        THEN DELETE;

ALTER TASK process_booking_changes RESUME;

-- ========================================
-- COMPLETE STREAM + TASK PIPELINE (end-to-end example)
-- ========================================

-- Step 1: Source table exists
-- CREATE TABLE raw_bookings (booking_id VARCHAR, booking_date DATE, ...);

-- Step 2: Create stream to track changes
CREATE STREAM booking_cdc_stream ON TABLE raw_bookings;

-- Step 3: Create task that consumes stream via MERGE
CREATE TASK cdc_pipeline_task
    WAREHOUSE = 'ETL_WH'
    SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('booking_cdc_stream')
AS
    MERGE INTO silver_bookings AS t
    USING booking_cdc_stream AS s
    ON t.booking_id = s.booking_id
    WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' AND s.METADATA$ISUPDATE = FALSE
        THEN DELETE
    WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' AND s.METADATA$ISUPDATE = TRUE
        THEN UPDATE SET t.amount = s.amount, t.booking_date = s.booking_date
    WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
        THEN INSERT (booking_id, booking_date, amount)
             VALUES (s.booking_id, s.booking_date, s.amount);

-- Step 4: Resume the task
ALTER TASK cdc_pipeline_task RESUME;

-- Now: any INSERT/UPDATE/DELETE on raw_bookings is automatically
-- captured by the stream and merged into silver_bookings every 5 minutes

-- ========================================
-- MONITORING TASKS
-- ========================================

-- View task run history (recent runs, status, errors)
SELECT name, state, scheduled_time, completed_time,
       error_code, error_message,
       DATEDIFF('second', query_start_time, completed_time) AS duration_sec
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    TASK_NAME => 'CDC_PIPELINE_TASK',
    SCHEDULED_TIME_RANGE_START => DATEADD('day', -1, CURRENT_TIMESTAMP)
))
ORDER BY scheduled_time DESC;
-- state: 'SUCCEEDED', 'FAILED', 'SKIPPED' (WHEN condition was false)

-- Account-level task history (all tasks)
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TASK_HISTORY
WHERE scheduled_time > DATEADD('hour', -24, CURRENT_TIMESTAMP)
ORDER BY scheduled_time DESC;

Interview Tip: Always mention three things: (1) Tasks are suspended by default — forgetting ALTER TASK RESUME is the #1 production mistake. (2) Use WHEN SYSTEM$STREAM_HAS_DATA() to avoid wasting credits on empty runs. (3) In a Task DAG, resume children BEFORE the root task, and only the root task needs a SCHEDULE. If asked "Streams+Tasks vs Airflow?" — answer: Tasks for simple Snowflake-only pipelines; Airflow when you need cross-system orchestration (Snowflake + Spark + APIs).
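The "resume children before the root" rule is just reverse topological order over the task DAG. A minimal sketch with made-up task names (not a Snowflake API; it only computes the order in which you would issue ALTER TASK ... RESUME):

```python
# Compute the resume order for a task DAG: every child must be resumed
# before its parent, so the root (the only scheduled task) is enabled last.
# Task names and the deps format are illustrative.
def resume_order(deps: dict) -> list:
    """deps maps task -> list of parent tasks; returns leaves-first order."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        # Resume everything that depends on `task` (its children) first
        for child, parents in deps.items():
            if task in parents:
                visit(child)
        order.append(task)

    for t in deps:
        visit(t)
    return order

deps = {"bronze_task": [], "silver_task": ["bronze_task"], "gold_task": ["silver_task"]}
resume_order(deps)   # gold_task first, root bronze_task last
```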

Q5: What Are Dynamic Tables? How Are They Different from Streams + Tasks?

Simple Explanation: Dynamic Tables (GA 2024, recommended approach 2025-2026) are Snowflake's declarative ELT mechanism. You define WHAT the result should look like using a SELECT query, and Snowflake handles the WHEN (scheduling) and HOW (incremental refresh). You never write MERGE logic, manage stream offsets, or build Task DAGs — Snowflake does it all automatically.

Real-world analogy:

  • Streams + Tasks = writing a recipe step by step, setting timers, and telling someone exactly how to cook each ingredient
  • Dynamic Tables = showing someone a photo of the finished dish and saying "keep it looking like this, and make sure it's no more than 5 minutes stale"

Why do we need them? Streams + Tasks pipelines work but require 30+ lines of boilerplate per stage (create stream, create task, write MERGE, handle edge cases, resume tasks). For a 5-stage pipeline, that's 150+ lines of procedural code to maintain. Dynamic Tables reduce each stage to a single CREATE statement. Snowflake auto-detects dependencies, auto-schedules refreshes, and auto-computes incremental diffs.

The TARGET_LAG Parameter (freshness guarantee):

🧠 Memory Map
TARGET_LAG = '5 minutes'
→ Snowflake guarantees the dynamic table is never more than 5 min stale
→ Snowflake decides WHEN to refresh (could be every 1 min if source changes fast)
→ You don't set a schedule — you set a freshness SLA
TARGET_LAG = '1 hour'
→ Cheaper (fewer refreshes), but data can be up to 1 hour behind
TARGET_LAG = 'DOWNSTREAM'
→ Only refresh when a DOWNSTREAM dynamic table or query needs fresh data
→ Most cost-efficient for intermediate tables in a chain
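The TARGET_LAG contract boils down to a freshness check that Snowflake evaluates for you. A minimal sketch of that decision (illustrative only; Snowflake's actual scheduler is internal):

```python
# Sketch of the TARGET_LAG freshness contract (illustrative, not Snowflake
# internals): refresh whenever the table would otherwise exceed its lag SLA.
from datetime import datetime, timedelta

def needs_refresh(data_timestamp: datetime, target_lag: timedelta,
                  now: datetime) -> bool:
    return now - data_timestamp >= target_lag

now = datetime(2026, 3, 25, 12, 0)
needs_refresh(datetime(2026, 3, 25, 11, 54), timedelta(minutes=5), now)  # True, 6 min stale
needs_refresh(datetime(2026, 3, 25, 11, 58), timedelta(minutes=5), now)  # False, 2 min stale
```

The key point for interviews: you declare the lag; when and how often to refresh is Snowflake's problem.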
sql
-- ========================================
-- CREATING DYNAMIC TABLES
-- ========================================

-- Silver layer: clean + transform raw data
CREATE DYNAMIC TABLE silver_bookings
    TARGET_LAG = '5 minutes'         -- Freshness guarantee
    WAREHOUSE = 'ETL_WH'             -- Warehouse for refresh compute
AS
SELECT
    booking_id,
    booking_date,
    UPPER(airline_code)     AS airline_code,
    amount,
    CURRENT_TIMESTAMP       AS processed_at
FROM raw_bookings
WHERE amount > 0
  AND booking_id IS NOT NULL;

-- Snowflake AUTOMATICALLY:
-- 1. Detects when raw_bookings changes
-- 2. Computes ONLY the incremental diff (not full reprocess)
-- 3. Updates silver_bookings within 5 minutes of source change

-- Gold layer: aggregate (auto-chains to silver!)
CREATE DYNAMIC TABLE gold_daily_revenue
    TARGET_LAG = '1 hour'
    WAREHOUSE = 'ETL_WH'
AS
SELECT
    booking_date,
    airline_code,
    SUM(amount)  AS daily_revenue,
    COUNT(*)     AS booking_count
FROM silver_bookings    -- Reads from dynamic table above!
GROUP BY booking_date, airline_code;
-- Snowflake auto-detects the silver → gold dependency
-- No AFTER clause needed — it just works

-- Manually refresh (force update now, useful for testing)
ALTER DYNAMIC TABLE silver_bookings REFRESH;

-- Suspend / resume
ALTER DYNAMIC TABLE silver_bookings SUSPEND;
ALTER DYNAMIC TABLE silver_bookings RESUME;

-- Check status of all dynamic tables
SELECT name, schema_name, target_lag, refresh_mode,
       refresh_mode_reason, data_timestamp
FROM INFORMATION_SCHEMA.DYNAMIC_TABLES;
-- refresh_mode: 'INCREMENTAL' (optimal) or 'FULL' (fallback if query is too complex)
-- data_timestamp: when the data was last refreshed to

Comparison: Dynamic Tables vs Streams + Tasks vs Materialized Views:

| | Dynamic Tables | Streams + Tasks | Materialized Views |
| --- | --- | --- | --- |
| Style | Declarative (WHAT) | Procedural (HOW) | Declarative (WHAT) |
| Setup complexity | Single CREATE | 30+ lines (stream + task + MERGE) | Single CREATE |
| Joins supported | Full SQL | Full SQL | No joins (single table only) |
| Aggregations | Yes | Yes | Yes |
| Incremental refresh | Automatic | Manual (you write MERGE) | Automatic |
| Dependency mgmt | Auto-detected | Manual (AFTER clause) | N/A (no chaining) |
| Min freshness | 1 minute (TARGET_LAG) | Seconds (task schedule) | Seconds (auto-refresh) |
| SCD Type 2 | Not supported | Supported (stream tracks before/after) | Not supported |
| Complex logic | SQL only | SQL + stored procedures + external functions | Simple aggregations only |
| Best for | Multi-step ELT pipelines | Complex CDC, SCD Type 2, sub-minute latency | BI dashboard acceleration |

When to use which:

  • Dynamic Tables — Default choice for new ELT pipelines. Multi-stage bronze → silver → gold transformations.
  • Streams + Tasks — When you need SCD Type 2, sub-minute latency, stored procedures, or external function calls.
  • Materialized Views — Single-table aggregations for BI dashboards (no joins, no UDFs, read-heavy/write-light tables).

Interview Tip: Lead with Dynamic Tables as the modern approach, then say "but Streams + Tasks are still necessary for SCD Type 2 and sub-minute latency." Mention TARGET_LAG as the key differentiator — you set a freshness SLA, not a schedule. If asked about cost: Dynamic Tables can be slightly more expensive because Snowflake may refresh more aggressively than needed to meet the TARGET_LAG guarantee.

Q5b: Explain Materialized Views in Snowflake

Simple Explanation: A Materialized View (MV) is a pre-computed, physically stored result of a query. Unlike a regular view (which re-runs the query every time you SELECT from it), a Materialized View stores the results on disk and Snowflake automatically keeps them in sync with the base table. Queries that hit the MV read pre-computed results instead of scanning the full table — often 10-100x faster.

Real-world analogy: A regular view is like asking the librarian "how many books do we have by genre?" every time — she counts them all each time you ask. A Materialized View is like a summary card pinned to the wall that says "Fiction: 500, Science: 200, History: 150." The librarian updates the card whenever new books arrive. You just read the card — instant answer.

Why do we need it? BI dashboards often run the same expensive aggregation queries hundreds of times per day. Without MVs, each dashboard refresh scans the full fact table. With MVs, the aggregation is pre-computed and auto-refreshed — the dashboard reads the small, pre-aggregated result instead.

How auto-refresh works: Snowflake automatically detects when the base table changes and refreshes the MV in the background. You do not schedule refreshes — they happen transparently. The refresh is incremental (only processes changed micro-partitions), so it's efficient.
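Incremental maintenance is the reason MV refreshes stay cheap: the stored aggregate is updated from changed rows only, never recomputed from the full table. A toy model of that idea (illustrative, not Snowflake internals):

```python
# Toy model of incremental MV maintenance (illustrative, not Snowflake
# internals): the stored aggregate absorbs only the new base rows.
from collections import defaultdict

mv = defaultdict(lambda: {"total_revenue": 0.0, "booking_count": 0})

def apply_inserts(mv, new_rows):
    """new_rows: (booking_date, airline_code, amount) tuples from changed partitions."""
    for date, airline, amount in new_rows:
        agg = mv[(date, airline)]
        agg["total_revenue"] += amount
        agg["booking_count"] += 1

apply_inserts(mv, [("2026-03-25", "LH", 100.0), ("2026-03-25", "LH", 50.0)])
# mv[("2026-03-25", "LH")] now holds total_revenue 150.0, booking_count 2
```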

sql
-- ========================================
-- CREATING A MATERIALIZED VIEW
-- ========================================

CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT
    booking_date,
    airline_code,
    SUM(amount)   AS total_revenue,
    COUNT(*)      AS booking_count,
    AVG(amount)   AS avg_booking_value
FROM fact_bookings
GROUP BY booking_date, airline_code;
-- Snowflake physically stores this result
-- Auto-refreshes when fact_bookings changes

-- Query the MV (same as querying a table — but pre-computed!)
SELECT * FROM daily_revenue_mv
WHERE booking_date = '2026-03-25' AND airline_code = 'LH';
-- Returns instantly — reads pre-aggregated result, not the full fact table

-- Snowflake's optimizer can AUTOMATICALLY route queries to the MV
-- Even if you query the BASE TABLE, Snowflake may use the MV if it
-- detects the query matches the MV's definition (query rewrite)

-- Drop a materialized view
DROP MATERIALIZED VIEW daily_revenue_mv;

-- Check MV refresh status
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.MATERIALIZED_VIEW_REFRESH_HISTORY
WHERE materialized_view_name = 'DAILY_REVENUE_MV'
ORDER BY start_time DESC;

Limitations (important for interviews!):

| Limitation | Detail |
| --- | --- |
| No joins | MV can only reference a SINGLE base table |
| No UDFs | Cannot use user-defined functions |
| No subqueries | Cannot use subqueries in the MV definition |
| No window functions | ROW_NUMBER(), RANK(), etc. not allowed |
| No UNION | Cannot combine multiple queries |
| Single table only | Cannot join fact + dimension tables in the MV |
| Write-light tables only | Frequent writes = frequent (expensive) refreshes |

Cost implications:

  • Storage cost: The MV physically stores data (additional storage credits).
  • Compute cost: Auto-refresh consumes serverless compute credits (background maintenance). Snowflake charges for this automatically — you don't control when it runs.
  • Rule of thumb: MVs save money when reads >> writes. If the base table changes every minute and the MV is queried once per day, the refresh cost exceeds the query savings — don't use an MV.
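The "reads >> writes" rule of thumb is simple arithmetic: compare credits spent on refreshes against credits saved by queries hitting the MV. All numbers below are illustrative:

```python
# Back-of-envelope check for the MV cost rule of thumb (numbers illustrative):
# an MV pays off only when query savings exceed refresh cost.
def mv_worth_it(refreshes_per_day: float, credits_per_refresh: float,
                queries_per_day: float, credits_saved_per_query: float) -> bool:
    refresh_cost = refreshes_per_day * credits_per_refresh
    query_savings = queries_per_day * credits_saved_per_query
    return query_savings > refresh_cost

mv_worth_it(24, 0.05, 1000, 0.01)    # True: 10.0 credits saved vs 1.2 spent
mv_worth_it(1440, 0.05, 1, 0.01)     # False: per-minute refreshes dominate
```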

When to use Materialized Views:

USE MVs when:
✓ Single-table aggregations (SUM, COUNT, AVG by common dimensions)
✓ Read-heavy, write-light tables (fact table updated daily, queried 1000x/day)
✓ BI dashboard acceleration (same aggregation run by many analysts)
✓ You want Snowflake to auto-route queries to pre-computed results
DON'T use MVs when:
✗ You need joins (use Dynamic Tables instead)
✗ Table changes frequently (high refresh cost)
✗ Complex logic (UDFs, window functions, subqueries)
✗ Multi-step transformations (use Dynamic Tables or Streams+Tasks)

Interview Tip: The #1 interview follow-up is "MV vs Dynamic Table — when would you use each?" Answer: MVs for simple single-table aggregations on read-heavy tables (BI acceleration). Dynamic Tables for anything involving joins, multi-step transformations, or complex SQL. Also mention that MVs support automatic query rewrite — even if analysts query the base table, Snowflake can transparently use the MV if the query pattern matches. This is a feature Dynamic Tables do NOT have.

SECTION 3: SNOWPARK

Q6: What is Snowpark? When is it useful?

Simple Explanation: Snowpark is Snowflake's API for Python (also Java/Scala) that lets you write DataFrame-style code that runs INSIDE Snowflake's compute — like PySpark, but for Snowflake. Instead of extracting data to a Python environment and processing it there, Snowpark pushes the computation to Snowflake.

Real-world analogy: Normally, if you want to analyze data in Snowflake with Python, you'd pull it to your laptop (expensive network transfer), process it, then push back (another transfer). Snowpark = run Python code on Snowflake's servers directly. No data leaves Snowflake.

python
# Connect to Snowflake with Snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum, upper, current_timestamp

session = Session.builder.configs({
    "account":   "amadeus-prod",
    "user":      "data_engineer",
    "password":  "***",
    "warehouse": "ETL_WH",
    "database":  "ANALYTICS_DB",
    "schema":    "BOOKINGS"
}).create()

# Create a Snowpark DataFrame (lazy — nothing runs yet)
raw_df = session.table("RAW_BOOKINGS")    # Points to Snowflake table

# Transform using DataFrame API (exactly like PySpark!)
silver_df = (raw_df
    .filter(col("AMOUNT") > 0)           # Filter bad rows
    .filter(col("BOOKING_ID").isNotNull())
    .select(
        col("BOOKING_ID"),
        col("BOOKING_DATE"),
        upper(col("AIRLINE_CODE")).alias("AIRLINE_CODE"),  # Transform
        col("AMOUNT"),
        current_timestamp().alias("PROCESSED_AT")
    )
)

# Write to Snowflake table (this EXECUTES the computation — runs in Snowflake!)
silver_df.write.mode("append").save_as_table("SILVER_BOOKINGS")
# All processing happened INSIDE Snowflake — no data left Snowflake!

# Snowpark UDF: Write Python function, run in Snowflake
from snowflake.snowpark.functions import udf
import snowflake.snowpark.types as T

@udf(return_type=T.FloatType(), input_types=[T.FloatType(), T.StringType()])
def calculate_tax(amount: float, country: str) -> float:
    """Run Python code INSIDE Snowflake — no data transfer"""
    tax_rates = {"IN": 0.18, "UK": 0.20, "DE": 0.19}
    return amount * tax_rates.get(country, 0.0)

# Use the UDF in SQL or DataFrame
result = silver_df.select(
    col("AMOUNT"),
    calculate_tax(col("AMOUNT"), col("COUNTRY")).alias("TAX")
)

New 2026: Snowpark Intelligent Hybrid Execution

python
# Snowpark now AUTOMATICALLY decides:
# - Run small operations locally on your machine (faster for tiny data)
# - Push large operations to Snowflake (for big data)
# No code change needed — it detects which is optimal automatically

import modin.pandas as mpd   # Drop-in pandas replacement
df = mpd.read_snowflake("SELECT * FROM bookings WHERE date > '2026-01-01'")
# Snowpark decides: big table → run in Snowflake; result small → return locally

SECTION 4: PERFORMANCE TUNING

Q7: How do you optimize query performance in Snowflake?

The PERFORMANCE CHECKLIST ("CQMWS"):

C = Clustering (is data well clustered for this query's filters?)
Q = Query Profile (find the bottleneck — where is time spent?)
M = Materialized Views (pre-compute expensive aggregations)
W = Warehouse size (is the warehouse right-sized for this workload?)
S = Search Optimization Service (for point-lookup queries on unordered data)
Remember: "CQMWS" = Clustering, Query profile, Materialized views, Warehouse, Search
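The pruning-efficiency rule of thumb (over ~80% of partitions scanned means pruning is not helping) can be wrapped in a tiny diagnostic helper. The function name and threshold are my own illustration:

```python
# Tiny diagnostic helper for partition pruning (the 80% threshold mirrors the
# rule of thumb in this section; the function name is illustrative).
def pruning_advice(partitions_scanned: int, partitions_total: int) -> str:
    pct = partitions_scanned * 100.0 / partitions_total
    if pct > 80:
        return "poor pruning -> consider a clustering key on filter columns"
    return "pruning OK ({:.1f}% of partitions scanned)".format(pct)

pruning_advice(950, 1000)   # poor pruning, clustering key suggested
pruning_advice(30, 1000)    # pruning OK, 3.0% scanned
```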
sql
-- STEP 1: Find slow queries
SELECT query_text, total_elapsed_time/1000 as seconds,
       bytes_scanned/1024/1024/1024 as gb_scanned,
       partitions_scanned, partitions_total,
       ROUND(partitions_scanned*100.0/partitions_total, 1) as pct_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'SUCCESS'
  AND warehouse_name = 'BI_WH'
ORDER BY total_elapsed_time DESC
LIMIT 20;
-- HIGH pct_scanned (>80%) = bad pruning → add clustering key
-- LOW gb_scanned but slow = compute bottleneck → scale up warehouse

-- STEP 2: Check clustering health
SELECT SYSTEM$CLUSTERING_INFORMATION('bookings', '(booking_date, airline_code)');
-- average_depth > 6 → reclustering needed
-- average_overlaps > 5 → partitions heavily mixed

-- STEP 3: Use Materialized Views for repeated heavy aggregations
CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT booking_date, airline_code, SUM(amount) as revenue
FROM silver_bookings
GROUP BY booking_date, airline_code;
-- Snowflake pre-computes and auto-refreshes this view
-- BI queries that aggregate by date+airline now hit MV instead of raw table
-- 100x faster for common aggregation queries!

-- STEP 4: Search Optimization Service (for needle-in-haystack queries)
ALTER TABLE bookings ADD SEARCH OPTIMIZATION;
-- Optimizes: WHERE booking_id = 'BK123456' (point lookup)
-- Optimizes: WHERE passenger_email = 'user@example.com' (equality on high-cardinality col)
-- NOT for range queries (use clustering key instead)

-- STEP 5: Query Profile analysis
-- After running a query, in Snowflake UI: Query History → Click query → Query Profile
-- Shows execution tree, where time was spent
-- Key metrics to look for:
--   "Bytes scanned from cache" high = good (cache working)
--   "Partitions scanned" vs "Partitions total" → pruning efficiency
--   "Spilling to disk" = not enough memory → scale up warehouse

Materialized Views vs Dynamic Tables:

| | Materialized View | Dynamic Table |
| --- | --- | --- |
| Define with | CREATE MATERIALIZED VIEW | CREATE DYNAMIC TABLE |
| Auto-refresh | Yes (automatic) | Yes (driven by TARGET_LAG) |
| Supports joins | No (single table only) | Full SQL (all joins) |
| Supports streams | No | Yes (incremental from streams) |
| Complexity | Simple aggregations | Complex ELT |
| Best for | BI aggregations, dashboards | Multi-step ELT pipelines |

Q8: Scenario — Dashboard queries are slow for 200 BI analysts. How to fix?

DIAGNOSIS STEPS
1. Check: Are queries queuing or slow individually?
   → Check QUERY_HISTORY for queries where queued_overload_time > 0
   → If queuing → concurrency problem → scale OUT (multi-cluster)
   → If not queuing but slow → computation problem → scale UP or fix clustering
2. Check: Are result cache hits low?
   → QUERY_HISTORY: result_from_cache = FALSE for most queries
   → If analysts run different queries: result cache can't help
   → If same queries: check why cache isn't hitting (data changing too often?)
3. Check: Partition pruning ratio
   → High partitions_scanned / partitions_total → add clustering key on filter columns

SOLUTION BLUEPRINT
sql
-- Separate BI warehouse from ETL
CREATE WAREHOUSE bi_wh
    SIZE = 'MEDIUM'
    MIN_CLUSTER_COUNT = 2        -- Always at least 2 clusters during BI hours
    MAX_CLUSTER_COUNT = 8        -- Scale out to 8 during peak
    AUTO_SUSPEND = 600           -- 10 min idle before suspend (keep cache warm!)
    SCALING_POLICY = 'STANDARD';

-- Pre-aggregate with Materialized Views
CREATE MATERIALIZED VIEW booking_summary_mv AS
SELECT DATE_TRUNC('month', booking_date) AS month,
       airline_code,
       SUM(amount) AS revenue,
       COUNT(*)    AS bookings
FROM fact_bookings
GROUP BY 1, 2;
-- BI dashboards now hit the MV = near-instant

-- Cluster fact tables by common filter columns
ALTER TABLE fact_bookings CLUSTER BY (booking_date, airline_code);

-- Route BI dashboards to the BI warehouse (not the ETL warehouse)
-- Tool: point the Power BI / Tableau connection string at the bi_wh warehouse