SQL Patterns 1–3: Window Functions — Ranking, Running Totals, LAG/LEAD
💡 Interview Tip
The most commonly tested patterns in data engineering interviews.
Master these 3 and you can answer ~40% of all medium SQL questions.
🧠 MEMORY MAP
WINDOW FUNCTIONS = "RRC"
R → ROW_NUMBER / RANK / DENSE_RANK (Pattern 1: Ranking)
R → Running SUM/AVG (Pattern 2: Cumulative)
C → Current vs Previous/Next (LAG/LEAD) (Pattern 3: Comparison)
RANK vs DENSE_RANK vs ROW_NUMBER:
Data: scores = [100, 90, 90, 80]
ROW_NUMBER: 1, 2, 3, 4 (always unique, no ties, just sequential)
RANK: 1, 2, 2, 4 (ties share rank, SKIPS next rank)
DENSE_RANK: 1, 2, 2, 3 (ties share rank, does NOT skip)
RULE: "Second highest with ties" → DENSE_RANK
"Deduplicate keeping one row" → ROW_NUMBER
"Leaderboard where ties exist" → RANK or DENSE_RANK
PATTERN 1: RANKING / TOP-N PER GROUP
What Is It?
Find the top N records within each partition (category, department, company). "Top 2 products per category", "Top 3 salaries per department", "2nd highest per group"
Recognize It When You See:
- "Top N per [group]"
- "Highest/lowest within each [group]"
- "Nth [something] in each [group]"
- "Rank [users/products/employees] by [metric] within [group]"
THE TEMPLATE
sql
-- TEMPLATE: Top N per group
WITH ranked AS (
SELECT
group_col,
value_col,
other_cols,
DENSE_RANK() OVER (
PARTITION BY group_col -- reset rank for each group
ORDER BY value_col DESC -- rank by this (DESC = highest first)
) AS rnk
FROM source_table
)
SELECT group_col, value_col, other_cols
FROM ranked
WHERE rnk <= N; -- change N to 1, 2, 3 etc.
When to Use Which Rank Function
🧠 Memory Map
ROW_NUMBER() → always unique (1,2,3,4)
Use for: DEDUPLICATION (keep exactly 1 row per group)
Use for: Pagination (rows 11-20 of a result set)
RANK() → ties share rank, next rank skips (1,2,2,4)
Use for: "What is this person's rank?" (true competition rank)
Use for: Sports leaderboards
DENSE_RANK() → ties share rank, no skipping (1,2,2,3)
Use for: "Top 2 per group" WITH ties (both tied 2nd place people should appear)
Use for: Most "top N" interview questions (safer default!)
TRAP: "Top 2 per category" with RANK() might return 3 rows per category
if two products are tied for 2nd! Use DENSE_RANK to handle this correctly.
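The deduplication use case deserves a worked sketch, since it is the one place where ROW_NUMBER is the only safe choice: it always breaks ties, so exactly one row per group survives. A minimal runnable demo (Python + SQLite as a harness; the `events` table and its columns are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "2024-01-01", "a"), (1, "2024-01-03", "b"),
     (2, "2024-01-02", "c"), (2, "2024-01-02", "d")],  # user 2: exact tie on date
)

# ROW_NUMBER always assigns distinct numbers, even to tied rows,
# so WHERE rn = 1 keeps exactly one row per user
deduped = conn.execute("""
    WITH ranked AS (
        SELECT user_id, event_date, payload,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY event_date DESC
               ) AS rn
        FROM events
    )
    SELECT user_id, event_date, payload FROM ranked WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(deduped)
```

For user 2 the tie means which row wins is arbitrary; if that matters, add a tiebreaker column to the ORDER BY. With RANK or DENSE_RANK, both tied rows would get rank 1 and the dedup would fail.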
SOLVED EXAMPLE — Q01 (Amazon): Top 2 highest-grossing products per category
Problem:
Table: product_spend (category, product, user_id, spend, transaction_date)
Find the top 2 highest-grossing products per category in 2022.
Return: category, product, total_spend
Step-by-step thinking:
1. Grain of output: one row per (category, product) with total spend
2. Partition: by category (reset rank for each category)
3. Order: by total spend DESC (highest grossing first)
4. Filter: keep only rank 1 and 2
sql
WITH category_spend AS (
-- Step 1: Aggregate total spend per category+product pair
SELECT
category,
product,
SUM(spend) AS total_spend
FROM product_spend
WHERE EXTRACT(YEAR FROM transaction_date) = 2022
GROUP BY category, product
),
ranked_products AS (
-- Step 2: Rank products within each category by spend
SELECT
category,
product,
total_spend,
DENSE_RANK() OVER (
PARTITION BY category
ORDER BY total_spend DESC
) AS spend_rank
FROM category_spend
)
-- Step 3: Keep only top 2
SELECT category, product, total_spend
FROM ranked_products
WHERE spend_rank <= 2
ORDER BY category, spend_rank;
SOLVED EXAMPLE — Q02: Top 3 salaries per department
sql
WITH dept_salary_rank AS (
SELECT
e.name AS employee_name,
d.name AS department_name,
e.salary,
DENSE_RANK() OVER (
PARTITION BY e.department_id
ORDER BY e.salary DESC
) AS salary_rank
FROM employees e
JOIN departments d ON e.department_id = d.id
)
SELECT department_name, employee_name, salary
FROM dept_salary_rank
WHERE salary_rank <= 3;
-- ⚠️ Edge case: if a dept has only 2 employees → returns 2 rows (correct!)
-- ⚠️ Ties: if 3rd and 4th salary are equal → DENSE_RANK returns BOTH at rank 3
-- Both are "Top 3" so this is CORRECT behavior for interview questions
PATTERN 1 QUESTIONS (from Question Bank)
| Q# | Question | Key Insight |
|---|---|---|
| Q01 | Top 2 highest-grossing products per category (Amazon) | Aggregate first, then DENSE_RANK |
| Q02 | Top 3 salaries per department | DENSE_RANK + PARTITION BY dept |
| Q03 | Email activity rank per user | DENSE_RANK on SUM of multiple columns |
| Q04 | Top 2 users per company with most calls (keep ties) | DENSE_RANK specifically to keep ties |
| Q05 | Most-used vehicle type past year | GROUP BY + ORDER BY + LIMIT 1 (no window needed!) |
| Q06 | Olympic swimmers with only gold medals | HAVING COUNT(medal) = COUNT(CASE WHEN medal = 'Gold' THEN 1 END) |
| Q07 | Nominee who won most Oscars | Simple GROUP BY + ORDER BY + LIMIT |
| Q08 | Top 10 users by total ride distance | SUM + JOIN + ORDER BY + LIMIT |
| Q09 | Top 5 product pairs (combined P1+P9) | See Market Basket pattern |
| Q10 | Top 3 departments by average salary | RANK on AVG(salary), no PARTITION BY (one global ranking) |
PATTERN 2: RUNNING TOTALS / CUMULATIVE AGGREGATES
What Is It?
A metric that accumulates over time within a partition. "Running total of sales", "Cumulative signups per month", "Moving 7-day average"
Recognize It When You See:
- "Running total / cumulative"
- "Month-to-date / year-to-date"
- "Rolling N-day average"
- "Balance at end of each day"
- "Cumulative X from the start"
THE TEMPLATE
sql
-- TEMPLATE: Running total / cumulative sum
SELECT
partition_col,
date_col,
daily_value,
-- Cumulative sum (reset per partition)
SUM(daily_value) OVER (
PARTITION BY partition_col -- reset for each partition (optional)
ORDER BY date_col -- accumulate in this order
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- from start to now
) AS running_total,
-- Rolling N-day average (e.g., 7-day)
AVG(daily_value) OVER (
PARTITION BY partition_col
ORDER BY date_col
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- current + 6 previous = 7 rows
) AS rolling_7day_avg
FROM daily_metrics;
Window Frame Syntax
🧠 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW → cumulative from start
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW → rolling 7-day window
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING → 3-row centered moving avg
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING → grand total on every row
TRAP: If you omit ROWS BETWEEN, the default frame is RANGE BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW. RANGE treats rows tied on the ORDER BY column as
peers and includes ALL of them in the frame, so tied rows share the same
running total. Use ROWS BETWEEN explicitly for true row-by-row accumulation.
TRAP: If you omit ORDER BY inside OVER() → window = ALL rows in partition
→ you get GRAND TOTAL on every row, not a running total!
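The no-ORDER-BY trap is easy to demonstrate side by side with a correct running total. A minimal sketch (Python + SQLite as a harness; the `sales` table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (d TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30)])

rows = conn.execute("""
    SELECT
        d,
        amount,
        -- explicit frame: accumulates row by row → 10, 30, 60
        SUM(amount) OVER (ORDER BY d
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
            AS running_total,
        -- no ORDER BY: frame = whole partition → 60 on EVERY row
        SUM(amount) OVER () AS grand_total
    FROM sales
    ORDER BY d
""").fetchall()
print(rows)
```

Same SUM, same data; only the window specification differs.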
SOLVED EXAMPLE — Q11 (Visa): Cumulative merchant balance, reset each month
Problem:
Table: transactions (transaction_date, merchant_id, amount)
Calculate monthly running balance per merchant.
Each month resets to zero. Show: date, amount, monthly_running_balance
sql
WITH monthly_txns AS (
SELECT
merchant_id,
transaction_date,
amount,
-- Extract year+month for the PARTITION (reset each month)
DATE_TRUNC('month', transaction_date) AS txn_month
FROM transactions
)
SELECT
merchant_id,
transaction_date,
amount,
SUM(amount) OVER (
PARTITION BY merchant_id, txn_month -- partition = merchant + month → resets monthly
ORDER BY transaction_date -- accumulate in date order within the month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS monthly_running_balance
FROM monthly_txns
ORDER BY merchant_id, transaction_date;
SOLVED EXAMPLE — Q19 (Twitter): 3-day rolling average of tweets per user
sql
WITH daily_tweets AS (
SELECT
user_id,
tweet_date,
COUNT(*) AS tweet_count
FROM tweets
GROUP BY user_id, tweet_date
)
SELECT
user_id,
tweet_date,
tweet_count,
ROUND(
AVG(tweet_count) OVER (
PARTITION BY user_id
ORDER BY tweet_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 3 rows: today + 2 before
), 2
) AS rolling_3day_avg
FROM daily_tweets
ORDER BY user_id, tweet_date;
PATTERN 2 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q11 | Cumulative merchant balance, reset each month | PARTITION BY merchant + month |
| Q12 | Cumulative users added daily, reset each month | DATE_TRUNC to create monthly partition |
| Q13 | Running total revenue by product category | PARTITION BY category, ORDER BY date |
| Q14 | Cumulative salary for 3 months excluding most recent | ROWS BETWEEN + date filter |
| Q15 | Total server uptime across overlapping windows | Merge intervals first (harder — LEAD approach) |
| Q16 | Month-over-month revenue change | SUM monthly → then LAG (combines P2+P3) |
PATTERN 3: LAG / LEAD — ROW-OVER-ROW COMPARISON
What Is It?
Compare each row to the row before it (LAG) or after it (LEAD). Used for: growth rates, detecting sequences, finding time between events.
Recognize It When You See:
- "Month-over-month / year-over-year change"
- "Previous [value]" or "next [event]"
- "Time between [event A] and [event B]"
- "Did [X happen immediately after Y]?"
- "Detect duplicate/repeated [events]"
THE TEMPLATE
sql
-- TEMPLATE: LAG / LEAD comparison
SELECT
partition_col,
date_col,
value_col,
-- Previous row's value
LAG(value_col, 1) OVER (
PARTITION BY partition_col
ORDER BY date_col
) AS prev_value,
-- Compute change
value_col - LAG(value_col, 1) OVER (
PARTITION BY partition_col
ORDER BY date_col
) AS change_from_prev,
-- % change
ROUND(
100.0 * (value_col - LAG(value_col,1) OVER (PARTITION BY partition_col ORDER BY date_col))
/ NULLIF(LAG(value_col,1) OVER (PARTITION BY partition_col ORDER BY date_col), 0),
2) AS pct_change,
-- Next row's value
LEAD(value_col, 1) OVER (
PARTITION BY partition_col
ORDER BY date_col
) AS next_value
FROM source_table;
-- ⚠️ NULLIF(..., 0) prevents division by zero when previous value = 0
-- ⚠️ First row: LAG returns NULL (no previous row) — handle with COALESCE if needed
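The NULLIF guard and the first-row NULL can both be seen in a minimal runnable sketch (Python + SQLite as a harness; the `monthly` table is invented, and month 1 deliberately has zero revenue to trigger the guard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (m TEXT, revenue REAL)")
conn.executemany("INSERT INTO monthly VALUES (?, ?)",
                 [("2024-01", 0.0), ("2024-02", 50.0), ("2024-03", 75.0)])

rows = conn.execute("""
    SELECT
        m,
        revenue,
        LAG(revenue) OVER (ORDER BY m) AS prev_revenue,  -- NULL on the first row
        ROUND(
            100.0 * (revenue - LAG(revenue) OVER (ORDER BY m))
                  / NULLIF(LAG(revenue) OVER (ORDER BY m), 0),
        2) AS pct_change  -- NULL when prev is NULL or 0 (division is guarded)
    FROM monthly
    ORDER BY m
""").fetchall()
print(rows)
```

Row 1 has no previous row (LAG → NULL), row 2 has prev = 0 (NULLIF → NULL, no division-by-zero error), and only row 3 yields a real percentage.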
SOLVED EXAMPLE — Q17 (Stripe): Detect duplicate payments within 10 minutes
Problem:
Table: transactions (transaction_id, merchant_id, credit_card_id, amount, transaction_timestamp)
Find duplicate payments: same merchant + same card + same amount, within 10 minutes.
Return: count of such duplicate pairs.
sql
WITH payment_gaps AS (
SELECT
merchant_id,
credit_card_id,
amount,
transaction_timestamp,
-- Get the PREVIOUS transaction timestamp for same merchant+card+amount
LAG(transaction_timestamp) OVER (
PARTITION BY merchant_id, credit_card_id, amount
ORDER BY transaction_timestamp
) AS prev_timestamp
FROM transactions
)
SELECT COUNT(*) AS duplicate_count
FROM payment_gaps
WHERE
prev_timestamp IS NOT NULL -- not the first transaction in this group
AND transaction_timestamp - prev_timestamp <= INTERVAL '10 minutes';
-- In Hive/Spark: (UNIX_TIMESTAMP(transaction_timestamp) - UNIX_TIMESTAMP(prev_timestamp)) <= 600
SOLVED EXAMPLE — Q20 (Apple): AirPod purchase directly after iPhone
Problem:
Table: purchases (user_id, product_name, purchase_date)
Find % of users who bought AirPods as their VERY NEXT purchase after buying iPhone.
Return: airpod_after_iphone_pct
sql
WITH purchase_sequence AS (
SELECT
user_id,
product_name,
-- Get the NEXT purchase for each user
LEAD(product_name, 1) OVER (
PARTITION BY user_id
ORDER BY purchase_date
) AS next_product
FROM purchases
),
iphone_buyers AS (
SELECT COUNT(DISTINCT user_id) AS total_iphone_buyers
FROM purchase_sequence
WHERE product_name = 'iPhone'
),
iphone_then_airpod AS (
SELECT COUNT(DISTINCT user_id) AS airpod_after_iphone
FROM purchase_sequence
WHERE product_name = 'iPhone'
AND next_product = 'AirPods'
)
SELECT
ROUND(
100.0 * a.airpod_after_iphone / i.total_iphone_buyers,
2) AS airpod_after_iphone_pct
FROM iphone_then_airpod a CROSS JOIN iphone_buyers i;
SOLVED EXAMPLE — Q24 (Ankit Bansal): Rows where amount grew vs previous transaction
sql
WITH ordered_transactions AS (
SELECT
user_id,
transaction_date,
amount,
LAG(amount) OVER (
PARTITION BY user_id
ORDER BY transaction_date
) AS prev_amount
FROM transactions
)
SELECT user_id, transaction_date, amount, prev_amount
FROM ordered_transactions
WHERE prev_amount IS NOT NULL -- skip first row per user
AND amount > prev_amount; -- only rows where amount increased
PATTERN 3 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q17 | Duplicate payments within 10 min (Stripe) | LAG with PARTITION BY merchant+card+amount |
| Q18 | Avg delay between sign-up and 2nd ride | ROW_NUMBER to find 2nd ride, then DATEDIFF |
| Q19 | Twitter 3-day rolling average | AVG OVER ROWS BETWEEN 2 PRECEDING AND CURRENT |
| Q20 | AirPod after iPhone purchase (Apple) | LEAD to find next product, filter iPhone→AirPods |
| Q21 | Countries moving up in comment ranking | Two CTEs (month1, month2) + compare ranks |
| Q22 | 3 largest MoM call declines by company | LAG on monthly call count, ORDER BY decline |
| Q23 | 2nd purchase within 48 hours of 1st | ROW_NUMBER filter rn IN (1,2) per user + DATEDIFF |
| Q24 | Rows where purchase amount grew vs prev | LAG(amount) per user, filter WHERE amount > prev |
⚠️ COMMON TRAPS ACROSS ALL WINDOW FUNCTIONS
🧠 Memory Map
TRAP 1: Forgetting PARTITION BY
Wrong: ROW_NUMBER() OVER (ORDER BY salary DESC)
→ ranks ALL employees globally, not per department!
Right: ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC)
TRAP 2: Aggregate then window (ORDER OF OPERATIONS)
Wrong: SELECT SUM(amount) OVER (ORDER BY date), SUM(amount) FROM t GROUP BY date
→ errors: window functions evaluate AFTER GROUP BY, so the bare column amount
inside OVER() is no longer visible; only grouped columns and aggregates are
Right: Use a CTE: aggregate in the inner CTE, apply the window in the outer query
TRAP 3: RANK vs DENSE_RANK for "Top N"
"Find top 2 products" with RANK: if two products tie for 1st, the next value gets rank 3 → WHERE rnk <= 2 EXCLUDES the true 2nd-highest product!
Use DENSE_RANK: ranks go 1, 1, 2 → the 2nd-highest product is included → CORRECT!
TRAP 4: NULL in LAG/LEAD first/last row
First row for each partition: LAG() returns NULL (no previous row)
Must handle: COALESCE(LAG(col), 0) OR add WHERE prev_val IS NOT NULL
TRAP 5: Window function in WHERE clause
Wrong: WHERE ROW_NUMBER() OVER (...) = 1 ← SYNTAX ERROR
Right: Wrap in CTE, then WHERE rn = 1 in outer query
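TRAP 5 can be verified directly: the database rejects a window function in WHERE at compile time, and the CTE wrapper fixes it. A minimal sketch (Python + SQLite as a harness; `emp` is an invented table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("eng", "a", 100), ("eng", "b", 90), ("ops", "c", 80)])

# TRAP 5 in action: window function directly in WHERE is rejected
try:
    conn.execute("""
        SELECT dept, name FROM emp
        WHERE ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) = 1
    """)
    failed = False
except sqlite3.OperationalError:
    failed = True  # SQLite reports misuse of the window function

# The fix: compute the window in a CTE, filter on its alias outside
top_earners = conn.execute("""
    WITH ranked AS (
        SELECT dept, name,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM emp
    )
    SELECT dept, name FROM ranked WHERE rn = 1 ORDER BY dept
""").fetchall()
print(failed, top_earners)
```

Some engines (DuckDB, Snowflake, BigQuery) also offer a QUALIFY clause as a shortcut for exactly this pattern, but the CTE works everywhere.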