🗃️ SQL · Section 2 of 9

SQL Patterns 1–3: Window Functions — Ranking, Running Totals, LAG/LEAD

💡 Interview Tip
These are the most commonly tested patterns in data engineering interviews. Master these three and you can answer roughly 40% of all medium-difficulty SQL questions.

🧠 MEMORY MAP

🧠 WINDOW FUNCTIONS = "RRC"
R → ROW_NUMBER / RANK / DENSE_RANK (Pattern 1: Ranking)
R → Running SUM/AVG (Pattern 2: Cumulative)
C → Current vs Previous/Next (LAG/LEAD) (Pattern 3: Comparison)

RANK vs DENSE_RANK vs ROW_NUMBER:
Data: scores = [100, 90, 90, 80]
ROW_NUMBER: 1, 2, 3, 4 (always unique — no ties, just sequential)
RANK:       1, 2, 2, 4 (ties share a rank, then SKIPS the next rank)
DENSE_RANK: 1, 2, 2, 3 (ties share a rank, does NOT skip)

RULE: "Second highest, with ties" → DENSE_RANK
      "Deduplicate, keeping one row" → ROW_NUMBER
      "Leaderboard where ties exist" → RANK or DENSE_RANK
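The memory map above can be verified directly. Here is a minimal sketch using Python's built-in sqlite3 (SQLite 3.25+ ships window functions); the `scores` table and its data are just the [100, 90, 90, 80] example from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(100,), (90,), (90,), (80,)])

rows = conn.execute("""
    SELECT
        score,
        ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
        RANK()       OVER (ORDER BY score DESC) AS rnk,
        DENSE_RANK() OVER (ORDER BY score DESC) AS drnk
    FROM scores
    ORDER BY rn
""").fetchall()

for score, rn, rnk, drnk in rows:
    print(score, rn, rnk, drnk)
# 100 → ROW_NUMBER 1, RANK 1, DENSE_RANK 1
#  90 → ROW_NUMBER 2, RANK 2, DENSE_RANK 2
#  90 → ROW_NUMBER 3, RANK 2, DENSE_RANK 2
#  80 → ROW_NUMBER 4, RANK 4, DENSE_RANK 3
```

Note that the two tied 90s get arbitrary but distinct ROW_NUMBERs, share RANK 2, and the 80 shows the skip: RANK jumps to 4, DENSE_RANK continues at 3.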

PATTERN 1: RANKING / TOP-N PER GROUP

What Is It?

Find the top N records within each partition (category, department, company). Examples: "Top 2 products per category", "Top 3 salaries per department", "2nd highest per group".

Recognize It When You See:

  • "Top N per [group]"
  • "Highest/lowest within each [group]"
  • "Nth [something] in each [group]"
  • "Rank [users/products/employees] by [metric] within [group]"

THE TEMPLATE

sql
-- TEMPLATE: Top N per group
WITH ranked AS (
    SELECT
        group_col,
        value_col,
        other_cols,
        DENSE_RANK() OVER (
            PARTITION BY group_col          -- reset rank for each group
            ORDER BY value_col DESC         -- rank by this (DESC = highest first)
        ) AS rnk
    FROM source_table
)
SELECT group_col, value_col, other_cols
FROM ranked
WHERE rnk <= N;   -- change N to 1, 2, 3 etc.

When to Use Which Rank Function

🧠 Memory Map
ROW_NUMBER() → always unique
  Use for: DEDUPLICATION (keep exactly 1 row per group)
  Use for: Pagination (rows 11-20 of a result set)
RANK() → ties share a rank, next rank skips (1, 2, 2, 4)
  Use for: "What is this person's rank?" (true competition rank)
  Use for: Sports leaderboards
DENSE_RANK() → ties share a rank, no skipping (1, 2, 2, 3)
  Use for: "Top 2 per group" WITH ties (both people tied for 2nd should appear)
  Use for: Most "top N" interview questions (safer default!)
⚠️ TRAP: With RANK(), a tie for 1st place makes the next row rank 3, so
"Top 2 per category" (rnk <= 2) silently drops the 2nd-highest distinct value.
DENSE_RANK gives that row rank 2 and keeps it — prefer DENSE_RANK for top-N.
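The trap is easy to reproduce. A minimal SQLite sketch (the `sales` table, product names, and spend values are made up for illustration) with two products tied for 1st:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, total_spend INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("A", 100), ("B", 100), ("C", 90)])  # A and B tied for 1st

def top2(rank_fn):
    # Rank with the given window function, then keep rnk <= 2
    return conn.execute(f"""
        WITH ranked AS (
            SELECT product,
                   {rank_fn}() OVER (ORDER BY total_spend DESC) AS rnk
            FROM sales
        )
        SELECT product FROM ranked WHERE rnk <= 2 ORDER BY product
    """).fetchall()

print(top2("RANK"))        # [('A',), ('B',)]  -- C (rank 3) silently dropped!
print(top2("DENSE_RANK"))  # [('A',), ('B',), ('C',)]  -- C kept at rank 2
```

With RANK, product C gets rank 3 (the skip) and vanishes from "top 2" even though it is the 2nd-highest distinct spend; DENSE_RANK keeps it.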

SOLVED EXAMPLE — Q01 (Amazon): Top 2 highest-grossing products per category

Problem:
Table: product_spend (category, product, user_id, spend, transaction_date)
Find the top 2 highest-grossing products per category in 2022.
Return: category, product, total_spend
Step-by-step thinking:
1. Grain of output: one row per (category, product) with total spend
2. Partition: by category (reset rank for each category)
3. Order: by total spend DESC (highest grossing first)
4. Filter: keep only rank 1 and 2
sql
WITH category_spend AS (
    -- Step 1: Aggregate total spend per category+product pair
    SELECT
        category,
        product,
        SUM(spend) AS total_spend
    FROM product_spend
    WHERE EXTRACT(YEAR FROM transaction_date) = 2022
    GROUP BY category, product
),

ranked_products AS (
    -- Step 2: Rank products within each category by spend
    SELECT
        category,
        product,
        total_spend,
        DENSE_RANK() OVER (
            PARTITION BY category
            ORDER BY total_spend DESC
        ) AS spend_rank
    FROM category_spend
)

-- Step 3: Keep only top 2
SELECT category, product, total_spend
FROM ranked_products
WHERE spend_rank <= 2
ORDER BY category, spend_rank;

SOLVED EXAMPLE — Q02: Top 3 salaries per department

sql
WITH dept_salary_rank AS (
    SELECT
        e.name AS employee_name,
        d.name AS department_name,
        e.salary,
        DENSE_RANK() OVER (
            PARTITION BY e.department_id
            ORDER BY e.salary DESC
        ) AS salary_rank
    FROM employees e
    JOIN departments d ON e.department_id = d.id
)

SELECT department_name, employee_name, salary
FROM dept_salary_rank
WHERE salary_rank <= 3;

-- ⚠️ Edge case: if a dept has only 2 employees → returns 2 rows (correct!)
-- ⚠️ Ties: if 3rd and 4th salary are equal → DENSE_RANK returns BOTH at rank 3
--    Both are "Top 3" so this is CORRECT behavior for interview questions

PATTERN 1 QUESTIONS (from Question Bank)

Q#  | Question                                              | Key Insight
Q01 | Top 2 highest-grossing products per category (Amazon) | Aggregate first, then DENSE_RANK
Q02 | Top 3 salaries per department                         | DENSE_RANK + PARTITION BY dept
Q03 | Email activity rank per user                          | DENSE_RANK on SUM of multiple columns
Q04 | Top 2 users per company with most calls (keep ties)   | DENSE_RANK specifically to keep ties
Q05 | Most-used vehicle type, past year                     | GROUP BY + ORDER BY + LIMIT 1 (no window needed!)
Q06 | Olympic swimmers with only gold medals                | HAVING COUNT(medal) = COUNT(CASE WHEN medal = 'Gold' THEN 1 END)
Q07 | Nominee who won most Oscars                           | Simple GROUP BY + ORDER BY + LIMIT
Q08 | Top 10 users by total ride distance                   | SUM + JOIN + ORDER BY + LIMIT
Q09 | Top 5 product pairs (combined P1+P9)                  | See Market Basket pattern
Q10 | Top 3 departments by average salary                   | RANK on AVG(salary), no PARTITION BY

PATTERN 2: RUNNING TOTALS / CUMULATIVE AGGREGATES

What Is It?

A metric that accumulates over time within a partition. Examples: "Running total of sales", "Cumulative signups per month", "Moving 7-day average".

Recognize It When You See:

  • "Running total / cumulative"
  • "Month-to-date / year-to-date"
  • "Rolling N-day average"
  • "Balance at end of each day"
  • "Cumulative X from the start"

THE TEMPLATE

sql
-- TEMPLATE: Running total / cumulative sum
SELECT
    partition_col,
    date_col,
    daily_value,

    -- Cumulative sum (reset per partition)
    SUM(daily_value) OVER (
        PARTITION BY partition_col      -- reset for each partition (optional)
        ORDER BY date_col               -- accumulate in this order
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- from start to now
    ) AS running_total,

    -- Rolling N-day average (e.g., 7-day)
    AVG(daily_value) OVER (
        PARTITION BY partition_col
        ORDER BY date_col
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- current + 6 previous = 7 rows
    ) AS rolling_7day_avg

FROM daily_metrics;

Window Frame Syntax

🧠 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW → cumulative from start
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW → rolling 7-day window
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING → 3-row centered moving avg
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING → grand total on every row
⚠️ TRAP: If you omit ROWS BETWEEN, the default frame is RANGE BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW. That usually works, but when the ORDER BY column has
ties, RANGE treats all rows with the same value as peers and includes them ALL
in the frame — inflating your "running total". Use ROWS BETWEEN explicitly.
⚠️ TRAP: If you omit ORDER BY inside OVER() → the frame is ALL rows in the
partition → you get the GRAND TOTAL on every row, not a running total!
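Both traps show up on a four-row table. A minimal SQLite sketch (table `daily` and the amounts are made up; note the deliberate tie on the last date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (d TEXT, amount INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?)",
                 [("2024-01-01", 10), ("2024-01-02", 20),
                  ("2024-01-03", 30), ("2024-01-03", 5)])  # tie on the last date

rows = conn.execute("""
    SELECT
        d,
        amount,
        -- Default frame (RANGE): rows with the SAME d are peers, so BOTH
        -- 2024-01-03 rows get the full 65 — not a row-by-row running total!
        SUM(amount) OVER (ORDER BY d) AS range_default,
        -- No ORDER BY at all: frame = entire partition → grand total everywhere
        SUM(amount) OVER () AS grand_total
    FROM daily
    ORDER BY d
""").fetchall()

print([r[2] for r in rows])  # [10, 30, 65, 65]
print([r[3] for r in rows])  # [65, 65, 65, 65]
```

With an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` frame, the two tied rows would instead accumulate one at a time (in an arbitrary order among the peers).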

SOLVED EXAMPLE — Q11 (Visa): Cumulative merchant balance, reset each month

Problem:
Table: transactions (transaction_date, merchant_id, amount)
Calculate monthly running balance per merchant.
Each month resets to zero. Show: date, amount, monthly_running_balance
sql
WITH monthly_txns AS (
    SELECT
        merchant_id,
        transaction_date,
        amount,
        -- Extract year+month for the PARTITION (reset each month)
        DATE_TRUNC('month', transaction_date) AS txn_month
    FROM transactions
)

SELECT
    merchant_id,
    transaction_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY merchant_id, txn_month  -- partition = merchant + month → resets monthly
        ORDER BY transaction_date            -- accumulate in date order within the month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS monthly_running_balance
FROM monthly_txns
ORDER BY merchant_id, transaction_date;

SOLVED EXAMPLE — Q19 (Twitter): 3-day rolling average of tweets per user

sql
WITH daily_tweets AS (
    SELECT
        user_id,
        tweet_date,
        COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY user_id, tweet_date
)

SELECT
    user_id,
    tweet_date,
    tweet_count,
    ROUND(
        AVG(tweet_count) OVER (
            PARTITION BY user_id
            ORDER BY tweet_date
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- 3 rows: today + 2 before
        ), 2
    ) AS rolling_3day_avg
FROM daily_tweets
ORDER BY user_id, tweet_date;

PATTERN 2 QUESTIONS

Q#  | Question                                              | Key Insight
Q11 | Cumulative merchant balance, reset each month         | PARTITION BY merchant + month
Q12 | Cumulative users added daily, reset each month        | DATE_TRUNC to create monthly partition
Q13 | Running total revenue by product category             | PARTITION BY category, ORDER BY date
Q14 | Cumulative salary for 3 months excluding most recent  | ROWS BETWEEN + date filter
Q15 | Total server uptime across overlapping windows        | Merge intervals first (harder — LEAD approach)
Q16 | Month-over-month revenue change                       | SUM monthly → then LAG (combines P2+P3)

PATTERN 3: LAG / LEAD — ROW-OVER-ROW COMPARISON

What Is It?

Compare each row to the row before it (LAG) or after it (LEAD). Used for: growth rates, detecting sequences, finding time between events.

Recognize It When You See:

  • "Month-over-month / year-over-year change"
  • "Previous [value]" or "next [event]"
  • "Time between [event A] and [event B]"
  • "Did [X happen immediately after Y]?"
  • "Detect duplicate/repeated [events]"

THE TEMPLATE

sql
-- TEMPLATE: LAG / LEAD comparison
SELECT
    partition_col,
    date_col,
    value_col,

    -- Previous row's value
    LAG(value_col, 1) OVER (
        PARTITION BY partition_col
        ORDER BY date_col
    ) AS prev_value,

    -- Compute change
    value_col - LAG(value_col, 1) OVER (
        PARTITION BY partition_col
        ORDER BY date_col
    ) AS change_from_prev,

    -- % change
    ROUND(
        100.0 * (value_col - LAG(value_col,1) OVER (PARTITION BY partition_col ORDER BY date_col))
        / NULLIF(LAG(value_col,1) OVER (PARTITION BY partition_col ORDER BY date_col), 0),
    2) AS pct_change,

    -- Next row's value
    LEAD(value_col, 1) OVER (
        PARTITION BY partition_col
        ORDER BY date_col
    ) AS next_value

FROM source_table;

-- ⚠️ NULLIF(..., 0) prevents division by zero when previous value = 0
-- ⚠️ First row: LAG returns NULL (no previous row) — handle with COALESCE if needed
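The two NULL hazards in the template can be exercised in SQLite (table `revenue` and the values are made up; note that Postgres would raise a division-by-zero error without NULLIF, while SQLite returns NULL either way, so NULLIF keeps the query portable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)",
                 [("2024-01", 100), ("2024-02", 0), ("2024-03", 150)])

rows = conn.execute("""
    SELECT
        month,
        amount,
        LAG(amount) OVER (ORDER BY month) AS prev_amount,
        -- COALESCE turns the first row's NULL into 0
        COALESCE(LAG(amount) OVER (ORDER BY month), 0) AS prev_or_zero,
        -- NULLIF(..., 0) makes the division yield NULL instead of failing
        ROUND(100.0 * (amount - LAG(amount) OVER (ORDER BY month))
              / NULLIF(LAG(amount) OVER (ORDER BY month), 0), 2) AS pct_change
    FROM revenue
    ORDER BY month
""").fetchall()

for r in rows:
    print(r)
# ('2024-01', 100, None, 0, None)    -- first row: no previous row → NULL
# ('2024-02', 0, 100, 100, -100.0)
# ('2024-03', 150, 0, 0, None)       -- prev = 0 → NULLIF avoids divide-by-zero
```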

SOLVED EXAMPLE — Q17 (Stripe): Detect duplicate payments within 10 minutes

Problem:
Table: transactions (transaction_id, merchant_id, credit_card_id, amount, transaction_timestamp)
Find duplicate payments: same merchant + same card + same amount, within 10 minutes.
Return: count of such duplicate pairs.
sql
WITH payment_gaps AS (
    SELECT
        merchant_id,
        credit_card_id,
        amount,
        transaction_timestamp,

        -- Get the PREVIOUS transaction timestamp for same merchant+card+amount
        LAG(transaction_timestamp) OVER (
            PARTITION BY merchant_id, credit_card_id, amount
            ORDER BY transaction_timestamp
        ) AS prev_timestamp
    FROM transactions
)

SELECT COUNT(*) AS duplicate_count
FROM payment_gaps
WHERE
    prev_timestamp IS NOT NULL  -- not the first transaction in this group
    AND transaction_timestamp - prev_timestamp <= INTERVAL '10 minutes';
    -- In Hive/Spark: (UNIX_TIMESTAMP(transaction_timestamp) - UNIX_TIMESTAMP(prev_timestamp)) <= 600
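The same logic can be prototyped in SQLite, where subtracting timestamps requires julianday(), which measures in days (so 10 minutes = 10.0/1440). The sample transactions below are made up to include exactly one duplicate pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
                (merchant_id INTEGER, credit_card_id INTEGER,
                 amount INTEGER, transaction_timestamp TEXT)""")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", [
    (1, 10, 50, "2024-01-01 10:00:00"),
    (1, 10, 50, "2024-01-01 10:05:00"),   # 5 min after the first → duplicate
    (1, 10, 50, "2024-01-01 12:00:00"),   # 115 min gap → not a duplicate
    (1, 10, 99, "2024-01-01 10:01:00"),   # different amount → own partition
])

(dup_count,) = conn.execute("""
    WITH payment_gaps AS (
        SELECT
            transaction_timestamp,
            LAG(transaction_timestamp) OVER (
                PARTITION BY merchant_id, credit_card_id, amount
                ORDER BY transaction_timestamp
            ) AS prev_timestamp
        FROM transactions
    )
    SELECT COUNT(*)
    FROM payment_gaps
    WHERE prev_timestamp IS NOT NULL
      -- julianday() returns days; 10 minutes = 10.0 / 1440 days
      AND julianday(transaction_timestamp) - julianday(prev_timestamp) <= 10.0 / 1440
""").fetchone()

print(dup_count)  # 1
```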

SOLVED EXAMPLE — Q20 (Apple): AirPod purchase directly after iPhone

Problem:
Table: purchases (user_id, product_name, purchase_date)
Find % of users who bought AirPods as their VERY NEXT purchase after buying iPhone.
Return: airpod_after_iphone_pct
sql
WITH purchase_sequence AS (
    SELECT
        user_id,
        product_name,
        -- Get the NEXT purchase for each user
        LEAD(product_name, 1) OVER (
            PARTITION BY user_id
            ORDER BY purchase_date
        ) AS next_product
    FROM purchases
),

iphone_buyers AS (
    SELECT COUNT(DISTINCT user_id) AS total_iphone_buyers
    FROM purchase_sequence
    WHERE product_name = 'iPhone'
),

iphone_then_airpod AS (
    SELECT COUNT(DISTINCT user_id) AS airpod_after_iphone
    FROM purchase_sequence
    WHERE product_name = 'iPhone'
      AND next_product = 'AirPods'
)

SELECT
    ROUND(
        100.0 * a.airpod_after_iphone / i.total_iphone_buyers,
    2) AS airpod_after_iphone_pct
FROM iphone_then_airpod a, iphone_buyers i;

SOLVED EXAMPLE — Q24 (Ankit Bansal): Rows where amount grew vs previous transaction

sql
WITH ordered_transactions AS (
    SELECT
        user_id,
        transaction_date,
        amount,
        LAG(amount) OVER (
            PARTITION BY user_id
            ORDER BY transaction_date
        ) AS prev_amount
    FROM transactions
)

SELECT user_id, transaction_date, amount, prev_amount
FROM ordered_transactions
WHERE prev_amount IS NOT NULL   -- skip first row per user
  AND amount > prev_amount;     -- only rows where amount increased

PATTERN 3 QUESTIONS

Q#  | Question                                        | Key Insight
Q17 | Duplicate payments within 10 min (Stripe)       | LAG with PARTITION BY merchant + card + amount
Q18 | Avg delay between sign-up and 2nd ride          | ROW_NUMBER to find 2nd ride, then DATEDIFF
Q19 | Twitter 3-day rolling average                   | AVG OVER ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
Q20 | AirPod after iPhone purchase (Apple)            | LEAD to find next product, filter iPhone → AirPods
Q21 | Countries moving up in comment ranking          | Two CTEs (month1, month2) + compare ranks
Q22 | 3 largest MoM call declines by company          | LAG on monthly call count, ORDER BY decline
Q23 | 2nd purchase within 48 hours of 1st             | ROW_NUMBER filter rn IN (1, 2) per user + DATEDIFF
Q24 | Rows where purchase amount grew vs prev         | LAG(amount) per user, filter WHERE amount > prev

⚠️ COMMON TRAPS ACROSS ALL WINDOW FUNCTIONS

🧠 Memory Map
TRAP 1: Forgetting PARTITION BY
Wrong: ROW_NUMBER() OVER (ORDER BY salary DESC)
→ ranks ALL employees globally, not per department!
Right: ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC)
TRAP 2: Aggregate then window (ORDER OF OPERATIONS)
Wrong: SELECT SUM(amount) OVER (ORDER BY date), SUM(amount) FROM t GROUP BY date
→ can't mix window + group in same SELECT level
Right: Use CTE — aggregate in inner CTE, apply window in outer query
TRAP 3: RANK vs DENSE_RANK for "Top N"
"Find top 2 products" with RANK: if two products tie for 1st, the next product
gets rank 3 (ranks skip after a tie) → it's excluded by rnk <= 2 even though
it's the 2nd-highest distinct value!
With DENSE_RANK that product gets rank 2 → included in "top 2" → CORRECT!
TRAP 4: NULL in LAG/LEAD first/last row
First row for each partition: LAG() returns NULL (no previous row)
Must handle: COALESCE(LAG(col) OVER (...), 0) OR filter with WHERE prev_val IS NOT NULL
TRAP 5: Window function in WHERE clause
Wrong: WHERE ROW_NUMBER() OVER (...) = 1 ← SYNTAX ERROR
Right: Wrap in CTE, then WHERE rn = 1 in outer query
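TRAP 5 can be demonstrated end to end. A minimal sketch (table `emp` and the data are made up): SQLite rejects a window function in WHERE, and the CTE version returns the top earner per department:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("eng", "ana", 120), ("eng", "bo", 110), ("hr", "cy", 90)])

# Window function directly in WHERE → rejected by the SQL engine
try:
    conn.execute("""
        SELECT name FROM emp
        WHERE ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) = 1
    """)
except sqlite3.OperationalError as e:
    print("rejected:", e)

# Right: compute the window in a CTE, filter in the outer query
rows = conn.execute("""
    WITH ranked AS (
        SELECT dept, name,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM emp
    )
    SELECT dept, name FROM ranked WHERE rn = 1 ORDER BY dept
""").fetchall()
print(rows)  # [('eng', 'ana'), ('hr', 'cy')]
```

The restriction exists because WHERE is evaluated before window functions; the CTE forces the window to be computed first, so the outer WHERE sees `rn` as a plain column.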