SQL Patterns 1–3: Window Functions — Ranking, Running Totals, LAG/LEAD
💡 Interview Tip
The most commonly tested patterns in data engineering interviews.
Master these 3 and you can answer ~40% of all medium SQL questions.
🧠 MEMORY MAP
WINDOW FUNCTIONS = "RRC"
R → ROW_NUMBER / RANK / DENSE_RANK (Pattern 1: Ranking)
R → Running SUM/AVG (Pattern 2: Cumulative)
C → Current vs Previous/Next (LAG/LEAD) (Pattern 3: Comparison)
RANK vs DENSE_RANK vs ROW_NUMBER:
Data: scores = [100, 90, 90, 80]
ROW_NUMBER: 1, 2, 3, 4 (always unique, no ties, just sequential)
RANK: 1, 2, 2, 4 (ties share rank, SKIPS next rank)
DENSE_RANK: 1, 2, 2, 3 (ties share rank, does NOT skip)
RULE: "Second highest with ties" → DENSE_RANK
"Deduplicate keeping one row" → ROW_NUMBER
"Leaderboard where ties exist" → RANK or DENSE_RANK
PATTERN 1: RANKING / TOP-N PER GROUP
What Is It?
Find the top N records within each partition (category, department, company). "Top 2 products per category", "Top 3 salaries per department", "2nd highest per group"
Recognize It When You See:
- "Top N per [group]"
- "Highest/lowest within each [group]"
- "Nth [something] in each [group]"
- "Rank [users/products/employees] by [metric] within [group]"
THE TEMPLATE
sql
-- TEMPLATE: Top N per group
WITH ranked AS (
SELECT
group_col,
value_col,
other_cols,
DENSE_RANK() OVER (
PARTITION BY group_col -- reset rank for each group
ORDER BY value_col DESC -- rank by this (DESC = highest first)
) AS rnk
FROM source_table
)
SELECT group_col, value_col, other_cols
FROM ranked
WHERE rnk <= N; -- change N to 1, 2, 3 etc.
When to Use Which Rank Function
🧠 Memory Map
ROW_NUMBER() → always unique (1,2,3,4)
Use for: DEDUPLICATION (keep exactly 1 row per group)
Use for: Pagination (rows 11-20 of a result set)
RANK() → ties share rank, next rank skips (1,2,2,4)
Use for: "What is this person's rank?" (true competition rank)
Use for: Sports leaderboards
DENSE_RANK() → ties share rank, no skipping (1,2,2,3)
Use for: "Top 2 per group" WITH ties (both tied 2nd place people should appear)
Use for: Most "top N" interview questions (safer default!)
TRAP: "Top 2 per category" with RANK() might return 3 rows per category
if two products are tied for 2nd! Use DENSE_RANK to handle this correctly.
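The deduplication use case deserves a worked sketch, since it is the one place where ROW_NUMBER is the only safe choice: it always breaks ties, so exactly one row per group survives. A minimal runnable demo (Python + SQLite as a harness; the `events` table and its columns are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "2024-01-01", "a"), (1, "2024-01-03", "b"),
     (2, "2024-01-02", "c"), (2, "2024-01-02", "d")],  # user 2: exact tie on date
)

# ROW_NUMBER always assigns distinct numbers, even to tied rows,
# so WHERE rn = 1 keeps exactly one row per user
deduped = conn.execute("""
    WITH ranked AS (
        SELECT user_id, event_date, payload,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY event_date DESC
               ) AS rn
        FROM events
    )
    SELECT user_id, event_date, payload FROM ranked WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(deduped)
```

For user 2 the tie means which row wins is arbitrary; if that matters, add a tiebreaker column to the ORDER BY. With RANK or DENSE_RANK, both tied rows would get rank 1 and the dedup would fail.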
SOLVED EXAMPLE — Q01 (Amazon): Top 2 highest-grossing products per category
Problem:
Table: product_spend (category, product, user_id, spend, transaction_date)
Find the top 2 highest-grossing products per category in 2022.
Return: category, product, total_spend
Step-by-step thinking:
1. Grain of output: one row per (category, product) with total spend
2. Partition: by category (reset rank for each category)
3. Order: by total spend DESC (highest grossing first)
4. Filter: keep only rank 1 and 2
sql
WITH category_spend AS (
-- Step 1: Aggregate total spend per category+product pair
SELECT
category,
product,
SUM(spend) AS total_spend
FROM product_spend
WHERE EXTRACT(YEAR FROM transaction_date) = 2022
GROUP BY category, product
),
ranked_products AS (
-- Step 2: Rank products within each category by spend
SELECT
category,
product,
total_spend,
DENSE_RANK() OVER (
PARTITION BY category
ORDER BY total_spend DESC
) AS spend_rank
FROM category_spend
)
-- Step 3: Keep only top 2
SELECT category, product, total_spend
FROM ranked_products
WHERE spend_rank <= 2
ORDER BY category, spend_rank;
SOLVED EXAMPLE — Q02: Top 3 salaries per department
sql
WITH dept_salary_rank AS (
SELECT
e.name AS employee_name,
d.name AS department_name,
e.salary,
DENSE_RANK() OVER (
PARTITION BY e.department_id
ORDER BY e.salary DESC
) AS salary_rank
FROM employees e
JOIN departments d ON e.department_id = d.id
)
SELECT department_name, employee_name, salary
FROM dept_salary_rank
WHERE salary_rank <= 3;
-- ⚠️ Edge case: if a dept has only 2 employees → returns 2 rows (correct!)
-- ⚠️ Ties: if 3rd and 4th salary are equal → DENSE_RANK returns BOTH at rank 3
-- Both are "Top 3" so this is CORRECT behavior for interview questions
PATTERN 1 QUESTIONS (from Question Bank)
| Q# | Question | Key Insight |
|---|---|---|
| Q01 | Top 2 highest-grossing products per category (Amazon) | Aggregate first, then DENSE_RANK |
| Q02 | Top 3 salaries per department | DENSE_RANK + PARTITION BY dept |
| Q03 | Email activity rank per user | DENSE_RANK on SUM of multiple columns |
| Q04 | Top 2 users per company with most calls (keep ties) | DENSE_RANK specifically to keep ties |
| Q05 | Most-used vehicle type past year | GROUP BY + ORDER BY + LIMIT 1 (no window needed!) |
| Q06 | Olympic swimmers with only gold medals | HAVING COUNT(medal) = COUNT(CASE WHEN medal = 'Gold' THEN 1 END) |
| Q07 | Nominee who won most Oscars | Simple GROUP BY + ORDER BY + LIMIT |
| Q08 | Top 10 users by total ride distance | SUM + JOIN + ORDER BY + LIMIT |
| Q09 | Top 5 product pairs (combined P1+P9) | See Market Basket pattern |
| Q10 | Top 3 departments by average salary | RANK on AVG(salary), no PARTITION BY (one global ranking) |
PATTERN 2: RUNNING TOTALS / CUMULATIVE AGGREGATES
What Is It?
A metric that accumulates over time within a partition. "Running total of sales", "Cumulative signups per month", "Moving 7-day average"
Recognize It When You See:
- "Running total / cumulative"
- "Month-to-date / year-to-date"
- "Rolling N-day average"
- "Balance at end of each day"
- "Cumulative X from the start"
THE TEMPLATE
sql
-- TEMPLATE: Running total / cumulative sum
SELECT
partition_col,
date_col,
daily_value,
-- Cumulative sum (reset per partition)
SUM(daily_value) OVER (
PARTITION BY partition_col -- reset for each partition (optional)
ORDER BY date_col -- accumulate in this order
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- from start to now
) AS running_total,
-- Rolling N-day average (e.g., 7-day)
AVG(daily_value) OVER (
PARTITION BY partition_col
ORDER BY date_col
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- current + 6 previous = 7 rows
) AS rolling_7day_avg
FROM daily_metrics;
Window Frame Syntax
🧠 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW → cumulative from start
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW → rolling 7-day window
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING → 3-row centered moving avg
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING → grand total on every row
TRAP: If you omit ROWS BETWEEN, the default frame is RANGE BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW. RANGE treats rows tied on the ORDER BY column as
peers and includes ALL of them in the frame, so tied rows share the same
running total. Use ROWS BETWEEN explicitly for true row-by-row accumulation.
TRAP: If you omit ORDER BY inside OVER() → window = ALL rows in partition
→ you get GRAND TOTAL on every row, not a running total!
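The no-ORDER-BY trap is easy to demonstrate side by side with a correct running total. A minimal sketch (Python + SQLite as a harness; the `sales` table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (d TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30)])

rows = conn.execute("""
    SELECT
        d,
        amount,
        -- explicit frame: accumulates row by row → 10, 30, 60
        SUM(amount) OVER (ORDER BY d
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
            AS running_total,
        -- no ORDER BY: frame = whole partition → 60 on EVERY row
        SUM(amount) OVER () AS grand_total
    FROM sales
    ORDER BY d
""").fetchall()
print(rows)
```

Same SUM, same data; only the window specification differs.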
SOLVED EXAMPLE — Q11 (Visa): Cumulative merchant balance, reset each month
Problem:
Table: transactions (transaction_date, merchant_id, amount)
Calculate monthly running balance per merchant.
Each month resets to zero. Show: date, amount, monthly_running_balance
sql
WITH monthly_txns AS (
SELECT
merchant_id,
transaction_date,
amount,
-- Extract year+month for the PARTITION (reset each month)
DATE_TRUNC('month', transaction_date) AS txn_month
FROM transactions
)
SELECT
merchant_id,
transaction_date,
amount,
SUM(amount) OVER (
PARTITION BY merchant_id, txn_month -- partition = merchant + month → resets monthly
ORDER BY transaction_date -- accumulate in date order within the month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS monthly_running_balance
FROM monthly_txns
ORDER BY merchant_id, transaction_date;
SOLVED EXAMPLE — Q19 (Twitter): 3-day rolling average of tweets per user
sql
WITH daily_tweets AS (
SELECT
user_id,
tweet_date,
COUNT(*) AS tweet_count
FROM tweets
GROUP BY user_id, tweet_date
)
SELECT
user_id,
tweet_date,
tweet_count,
ROUND(
AVG(tweet_count) OVER (
PARTITION BY user_id
ORDER BY tweet_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 3 rows: today + 2 before
), 2
) AS rolling_3day_avg
FROM daily_tweets
ORDER BY user_id, tweet_date;
PATTERN 2 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q11 | Cumulative merchant balance, reset each month | PARTITION BY merchant + month |
| Q12 | Cumulative users added daily, reset each month | DATE_TRUNC to create monthly partition |
| Q13 | Running total revenue by product category | PARTITION BY category, ORDER BY date |
| Q14 | Cumulative salary for 3 months excluding most recent | ROWS BETWEEN + date filter |
| Q15 | Total server uptime across overlapping windows | Merge intervals first (harder — LEAD approach) |
| Q16 | Month-over-month revenue change | SUM monthly → then LAG (combines P2+P3) |
PATTERN 3: LAG / LEAD — ROW-OVER-ROW COMPARISON
What Is It?
Compare each row to the row before it (LAG) or after it (LEAD). Used for: growth rates, detecting sequences, finding time between events.
Recognize It When You See:
- "Month-over-month / year-over-year change"
- "Previous [value]" or "next [event]"
- "Time between [event A] and [event B]"
- "Did [X happen immediately after Y]?"
- "Detect duplicate/repeated [events]"
THE TEMPLATE
sql
-- TEMPLATE: LAG / LEAD comparison
SELECT
partition_col,
date_col,
value_col,
-- Previous row's value
LAG(value_col, 1) OVER (
PARTITION BY partition_col
ORDER BY date_col
) AS prev_value,
-- Compute change
value_col - LAG(value_col, 1) OVER (
PARTITION BY partition_col
ORDER BY date_col
) AS change_from_prev,
-- % change
ROUND(
100.0 * (value_col - LAG(value_col,1) OVER (PARTITION BY partition_col ORDER BY date_col))
/ NULLIF(LAG(value_col,1) OVER (PARTITION BY partition_col ORDER BY date_col), 0),
2) AS pct_change,
-- Next row's value
LEAD(value_col, 1) OVER (
PARTITION BY partition_col
ORDER BY date_col
) AS next_value
FROM source_table;
-- ⚠️ NULLIF(..., 0) prevents division by zero when previous value = 0
-- ⚠️ First row: LAG returns NULL (no previous row) — handle with COALESCE if needed
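The NULLIF guard and the first-row NULL can both be seen in a minimal runnable sketch (Python + SQLite as a harness; the `monthly` table is invented, and month 1 deliberately has zero revenue to trigger the guard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (m TEXT, revenue REAL)")
conn.executemany("INSERT INTO monthly VALUES (?, ?)",
                 [("2024-01", 0.0), ("2024-02", 50.0), ("2024-03", 75.0)])

rows = conn.execute("""
    SELECT
        m,
        revenue,
        LAG(revenue) OVER (ORDER BY m) AS prev_revenue,  -- NULL on the first row
        ROUND(
            100.0 * (revenue - LAG(revenue) OVER (ORDER BY m))
                  / NULLIF(LAG(revenue) OVER (ORDER BY m), 0),
        2) AS pct_change  -- NULL when prev is NULL or 0 (division is guarded)
    FROM monthly
    ORDER BY m
""").fetchall()
print(rows)
```

Row 1 has no previous row (LAG → NULL), row 2 has prev = 0 (NULLIF → NULL, no division-by-zero error), and only row 3 yields a real percentage.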
SOLVED EXAMPLE — Q17 (Stripe): Detect duplicate payments within 10 minutes
Problem:
Table: transactions (transaction_id, merchant_id, credit_card_id, amount, transaction_timestamp)
Find duplicate payments: same merchant + same card + same amount, within 10 minutes.
Return: count of such duplicate pairs.
sql
WITH payment_gaps AS (
SELECT
merchant_id,
credit_card_id,
amount,
transaction_timestamp,
-- Get the PREVIOUS transaction timestamp for same merchant+card+amount
LAG(transaction_timestamp) OVER (
PARTITION BY merchant_id, credit_card_id, amount
ORDER BY transaction_timestamp
) AS prev_timestamp
FROM transactions
)
SELECT COUNT(*) AS duplicate_count
FROM payment_gaps
WHERE
prev_timestamp IS NOT NULL -- not the first transaction in this group
AND transaction_timestamp - prev_timestamp <= INTERVAL '10 minutes';
-- In Hive/Spark: (UNIX_TIMESTAMP(transaction_timestamp) - UNIX_TIMESTAMP(prev_timestamp)) <= 600
SOLVED EXAMPLE — Q20 (Apple): AirPod purchase directly after iPhone
Problem:
Table: purchases (user_id, product_name, purchase_date)
Find % of users who bought AirPods as their VERY NEXT purchase after buying iPhone.
Return: airpod_after_iphone_pct
sql
WITH purchase_sequence AS (
SELECT
user_id,
product_name,
-- Get the NEXT purchase for each user
LEAD(product_name, 1) OVER (
PARTITION BY user_id
ORDER BY purchase_date
) AS next_product
FROM purchases
),
iphone_buyers AS (
SELECT COUNT(DISTINCT user_id) AS total_iphone_buyers
FROM purchase_sequence
WHERE product_name = 'iPhone'
),
iphone_then_airpod AS (
SELECT COUNT(DISTINCT user_id) AS airpod_after_iphone
FROM purchase_sequence
WHERE product_name = 'iPhone'
AND next_product = 'AirPods'
)
SELECT
ROUND(
100.0 * a.airpod_after_iphone / i.total_iphone_buyers,
2) AS airpod_after_iphone_pct
FROM iphone_then_airpod a CROSS JOIN iphone_buyers i;
SOLVED EXAMPLE — Q24 (Ankit Bansal): Rows where amount grew vs previous transaction
sql
WITH ordered_transactions AS (
SELECT
user_id,
transaction_date,
amount,
LAG(amount) OVER (
PARTITION BY user_id
ORDER BY transaction_date
) AS prev_amount
FROM transactions
)
SELECT user_id, transaction_date, amount, prev_amount
FROM ordered_transactions
WHERE prev_amount IS NOT NULL -- skip first row per user
AND amount > prev_amount; -- only rows where amount increased
PATTERN 3 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q17 | Duplicate payments within 10 min (Stripe) | LAG with PARTITION BY merchant+card+amount |
| Q18 | Avg delay between sign-up and 2nd ride | ROW_NUMBER to find 2nd ride, then DATEDIFF |
| Q19 | Twitter 3-day rolling average | AVG OVER ROWS BETWEEN 2 PRECEDING AND CURRENT |
| Q20 | AirPod after iPhone purchase (Apple) | LEAD to find next product, filter iPhone→AirPods |
| Q21 | Countries moving up in comment ranking | Two CTEs (month1, month2) + compare ranks |
| Q22 | 3 largest MoM call declines by company | LAG on monthly call count, ORDER BY decline |
| Q23 | 2nd purchase within 48 hours of 1st | ROW_NUMBER filter rn IN (1,2) per user + DATEDIFF |
| Q24 | Rows where purchase amount grew vs prev | LAG(amount) per user, filter WHERE amount > prev |
⚠️ COMMON TRAPS ACROSS ALL WINDOW FUNCTIONS
🧠 Memory Map
TRAP 1: Forgetting PARTITION BY
Wrong: ROW_NUMBER() OVER (ORDER BY salary DESC)
→ ranks ALL employees globally, not per department!
Right: ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC)
TRAP 2: Aggregate then window (ORDER OF OPERATIONS)
Wrong: SELECT SUM(amount) OVER (ORDER BY date), SUM(amount) FROM t GROUP BY date
→ errors: window functions evaluate AFTER GROUP BY, so the bare column amount
inside OVER() is no longer visible; only grouped columns and aggregates are
Right: Use a CTE: aggregate in the inner CTE, apply the window in the outer query
TRAP 3: RANK vs DENSE_RANK for "Top N"
"Find top 2 products" with RANK: if two products tie for 1st, the next value gets rank 3 → WHERE rnk <= 2 EXCLUDES the true 2nd-highest product!
Use DENSE_RANK: ranks go 1, 1, 2 → the 2nd-highest product is included → CORRECT!
TRAP 4: NULL in LAG/LEAD first/last row
First row for each partition: LAG() returns NULL (no previous row)
Must handle: COALESCE(LAG(col), 0) OR add WHERE prev_val IS NOT NULL
TRAP 5: Window function in WHERE clause
Wrong: WHERE ROW_NUMBER() OVER (...) = 1 ← SYNTAX ERROR
Right: Wrap in CTE, then WHERE rn = 1 in outer query
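TRAP 5 can be verified directly: the database rejects a window function in WHERE at compile time, and the CTE wrapper fixes it. A minimal sketch (Python + SQLite as a harness; `emp` is an invented table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("eng", "a", 100), ("eng", "b", 90), ("ops", "c", 80)])

# TRAP 5 in action: window function directly in WHERE is rejected
try:
    conn.execute("""
        SELECT dept, name FROM emp
        WHERE ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) = 1
    """)
    failed = False
except sqlite3.OperationalError:
    failed = True  # SQLite reports misuse of the window function

# The fix: compute the window in a CTE, filter on its alias outside
top_earners = conn.execute("""
    WITH ranked AS (
        SELECT dept, name,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM emp
    )
    SELECT dept, name FROM ranked WHERE rn = 1 ORDER BY dept
""").fetchall()
print(failed, top_earners)
```

Some engines (DuckDB, Snowflake, BigQuery) also offer a QUALIFY clause as a shortcut for exactly this pattern, but the CTE works everywhere.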