SQL Interview — Quick Recall Guide
Must know · Common trap · Memory map · One-liner
🧠 MASTER DECISION FLOWCHART (Memory Map)
| Question says... | Use this pattern |
|---|---|
| "Top N per group / rank within group" | P1: DENSE_RANK PARTITION BY |
| "Running total / cumulative / rolling avg" | P2: SUM OVER (ORDER BY) |
| "Previous row / MoM / next event" | P3: LAG / LEAD |
| "Consecutive days / streak / period" | P4: GAPS & ISLANDS (date - rn trick) |
| "Group events into sessions" | P5: SESSIONIZATION (LAG + SUM flag) |
| "Remove duplicates / keep latest row" | P6: ROW_NUMBER PARTITION BY = 1 |
| "Compare rows in same table / hierarchy" | P7: SELF-JOIN (two aliases) |
| "All combinations / grid / every pair" | P8: CROSS JOIN |
| "Items bought in same order together" | P9: MARKET BASKET (self-join order_id) |
| "Count/sum by category in one row" | P10: CASE WHEN inside aggregate |
| "First users / retention / cohorts" | P11: COHORT (MIN date as cohort) |
| "Days between / overdue / past 90 days" | P12: DATE ARITHMETIC |
| "Hierarchy / tree / all levels" | P13: RECURSIVE CTE |
| "Median / percentile / quartile" | P14: PERCENTILE_CONT or ROW_NUMBER trick |
| "Funnel / conversion / drop-off rate" | P15: FUNNEL (COUNT DISTINCT per stage) |
SECTION 1: WINDOW FUNCTIONS (P1, P2, P3)
P1: Ranking Template
```sql
WITH ranked AS (
    SELECT *, DENSE_RANK() OVER (PARTITION BY grp ORDER BY val DESC) AS rnk
    FROM tbl
)
SELECT * FROM ranked WHERE rnk <= N;
-- ROW_NUMBER: unique, no ties | RANK: skips | DENSE_RANK: no skip ← use for Top-N
```
⚡ RANK vs DENSE_RANK: after a tie, RANK skips the next rank (1, 1, 3), so a Top-2 filter can miss the next-best value entirely; DENSE_RANK never skips (1, 1, 2), so all tied rows AND the next value appear (usually what Top-N questions want).
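A minimal sketch of the difference, using PostgreSQL-style inline sample data (the scores table and values are made up for illustration):

```sql
-- Made-up scores with a tie at 1st place
WITH scores(player, pts) AS (
    VALUES ('a', 100), ('b', 100), ('c', 90), ('d', 80)
)
SELECT player, pts,
       ROW_NUMBER() OVER (ORDER BY pts DESC) AS row_num,     -- 1, 2, 3, 4 (tie broken arbitrarily)
       RANK()       OVER (ORDER BY pts DESC) AS rnk,         -- 1, 1, 3, 4 (skips 2)
       DENSE_RANK() OVER (ORDER BY pts DESC) AS dense_rnk    -- 1, 1, 2, 3 (no gap)
FROM scores;
-- "Top 2": WHERE rnk <= 2 returns only a and b; WHERE dense_rnk <= 2 also keeps c (pts = 90).
```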
P2: Running Total Template
```sql
SUM(val) OVER (PARTITION BY grp ORDER BY date_col
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
AVG(val) OVER (PARTITION BY grp ORDER BY date_col
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
```
⚠️ TRAP: Omitting ORDER BY inside OVER gives the grand total, not a running total!
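A tiny sketch of that trap with made-up data (PostgreSQL-style inline VALUES); the only difference between the two columns is the ORDER BY inside OVER:

```sql
WITH daily(sale_date, amount) AS (
    VALUES (DATE '2024-01-01', 10), (DATE '2024-01-02', 20), (DATE '2024-01-03', 30)
)
SELECT sale_date, amount,
       SUM(amount) OVER ()                   AS grand_total,    -- 60, 60, 60
       SUM(amount) OVER (ORDER BY sale_date) AS running_total   -- 10, 30, 60
FROM daily;
```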
P3: LAG/LEAD Template
```sql
LAG(val, 1)  OVER (PARTITION BY grp ORDER BY date_col) AS prev_val
LEAD(val, 1) OVER (PARTITION BY grp ORDER BY date_col) AS next_val
-- First row LAG = NULL, last row LEAD = NULL
-- pct_change = (val - prev_val) / NULLIF(prev_val, 0) * 100
```
⚡ KEY USES: MoM/YoY change, detect duplicates within time window (Stripe!), "next purchase was AirPods" (Apple!)
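A hedged month-over-month example of the pattern; monthly_revenue(month, revenue) is an assumed table, not one defined in this guide:

```sql
-- Assumed table: monthly_revenue(month, revenue)
WITH with_prev AS (
    SELECT month, revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
SELECT month, revenue, prev_revenue,
       ROUND(100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 2) AS mom_pct_change
FROM with_prev;   -- first month: prev_revenue is NULL, so mom_pct_change is NULL
```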
SECTION 2: HARD PATTERNS (P4, P5, P6)
P4: Gaps & Islands — THE KEY TRICK
```sql
-- TRICK: consecutive dates - sequential row numbers = CONSTANT for each island!
WITH deduped AS (SELECT DISTINCT user_id, activity_date FROM events),
numbered AS (
    SELECT user_id, activity_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date) AS rn
    FROM deduped
),
islands AS (
    SELECT user_id, activity_date,
           DATE_SUB(activity_date, rn) AS island_key  -- Hive/Spark
           -- PostgreSQL: activity_date - rn AS island_key
    FROM numbered
)
SELECT user_id,
       MIN(activity_date) AS island_start,   -- avoid "AS end": END is a reserved word
       MAX(activity_date) AS island_end,
       COUNT(*)           AS streak_len
FROM islands
GROUP BY user_id, island_key;
```
⚠️ TRAP 1: Deduplicate FIRST (DISTINCT) or same-day events break the trick.
⚠️ TRAP 2: Hive = DATE_SUB(date, rn), PostgreSQL = date - rn.
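A common follow-up ("longest streak per user") drops straight out of the same CTEs; the sketch below replaces the final SELECT of the template above:

```sql
-- Longest consecutive-day streak per user (reuses the islands CTE above)
SELECT user_id, MAX(streak_len) AS longest_streak
FROM (
    SELECT user_id, island_key, COUNT(*) AS streak_len
    FROM islands
    GROUP BY user_id, island_key
) per_island
GROUP BY user_id;
```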
P5: Sessionization Template
```sql
WITH lagged AS (
    SELECT *, LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_t
    FROM events
),
flagged AS (
    -- prev_t is an alias, so it must be computed in a prior CTE before it can be referenced
    SELECT *, CASE WHEN prev_t IS NULL OR event_time - prev_t > INTERVAL '30 minutes'
                   THEN 1 ELSE 0 END AS new_session
    FROM lagged
)
SELECT *, SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM flagged;
```
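Follow-up sketch: once session_id exists, one more aggregation gives per-session stats; this replaces the final SELECT of the template above:

```sql
-- Per-session summary (wraps the sessionized output above in one more CTE)
WITH sessions AS (
    SELECT *, SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time
                                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
    FROM flagged
)
SELECT user_id, session_id,
       MIN(event_time) AS session_start,
       MAX(event_time) AS session_end,
       COUNT(*)        AS events_in_session
FROM sessions
GROUP BY user_id, session_id;
```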
P6: Deduplication Template
```sql
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY unique_key ORDER BY updated_at DESC) AS rn
    FROM table_with_dupes
)
SELECT * FROM ranked WHERE rn = 1;  -- always use ROW_NUMBER not RANK (RANK can have ties!)
```
SECTION 3: JOIN PATTERNS (P7, P8, P9)
P7: Self-Join Template
```sql
-- Hierarchy:
SELECT e.name, m.name AS manager
FROM employees e JOIN employees m ON e.manager_id = m.id;
-- Pairs (no duplicates):
SELECT t1.col, t2.col
FROM t t1 JOIN t t2 ON t1.shared = t2.shared AND t1.id < t2.id;
```
⚠️ TRAP: Forgetting t1.id < t2.id → each pair appears TWICE and self-pairs appear.
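A classic variant of the hierarchy self-join, sketched against an assumed employees(id, name, salary, manager_id) table:

```sql
-- Employees who earn more than their own manager
SELECT e.name   AS employee,
       e.salary AS employee_salary,
       m.name   AS manager,
       m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
```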
P8: Cross Join Template
```sql
-- All combinations:
SELECT a.col, b.col FROM table_a a CROSS JOIN table_b b;
-- Round-robin (no self-pairs):
SELECT t1.name, t2.name FROM teams t1 CROSS JOIN teams t2 WHERE t1.id < t2.id;
-- 3-way (pizza toppings):
SELECT t1.name, t2.name, t3.name
FROM t t1
JOIN t t2 ON t1.name < t2.name
JOIN t t3 ON t2.name < t3.name;
```
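Another common CROSS JOIN use: build a complete user × date grid so days with no activity still show up as zero. A PostgreSQL-style sketch; generate_series, the date range, and the users/events column names are assumptions:

```sql
WITH days AS (
    SELECT generate_series(DATE '2024-01-01', DATE '2024-01-07', INTERVAL '1 day')::date AS day
),
grid AS (
    SELECT u.user_id, d.day
    FROM users u CROSS JOIN days d          -- every (user, day) combination
)
SELECT g.user_id, g.day, COUNT(e.user_id) AS events_that_day
FROM grid g
LEFT JOIN events e ON e.user_id = g.user_id AND e.event_date = g.day
GROUP BY g.user_id, g.day
ORDER BY g.user_id, g.day;
```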
P9: Market Basket Template
```sql
SELECT o1.product_id AS product_a,
       o2.product_id AS product_b,
       COUNT(DISTINCT o1.order_id) AS co_purchases
FROM order_items o1
JOIN order_items o2 ON o1.order_id = o2.order_id AND o1.product_id < o2.product_id
GROUP BY o1.product_id, o2.product_id
ORDER BY co_purchases DESC
LIMIT 5;
```
SECTION 4: AGGREGATION & ANALYTICS (P10, P11, P12)
P10: Conditional Aggregation Template
```sql
SELECT date_col,
       SUM(CASE WHEN category = 'A' THEN 1 ELSE 0 END) AS count_a,
       SUM(CASE WHEN category = 'B' THEN amount ELSE 0 END) AS revenue_b,
       ROUND(100.0 * SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END)
             / NULLIF(SUM(CASE WHEN event = 'impression' THEN 1 ELSE 0 END), 0), 2) AS ctr
FROM events
GROUP BY date_col;
```
⚠️ TRAP: Always NULLIF(denominator, 0) for ratios/CTR to prevent division by zero.
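In PostgreSQL, FILTER is an equivalent and often more readable spelling of the same conditional aggregate (CASE WHEN works everywhere; FILTER is PostgreSQL-specific):

```sql
SELECT date_col,
       COUNT(*) FILTER (WHERE category = 'A')    AS count_a,
       SUM(amount) FILTER (WHERE category = 'B') AS revenue_b,   -- NULL (not 0) if no 'B' rows; COALESCE if needed
       ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'click')
             / NULLIF(COUNT(*) FILTER (WHERE event = 'impression'), 0), 2) AS ctr
FROM events
GROUP BY date_col;
```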
P11: Cohort Retention Template
```sql
WITH cohorts AS (
    SELECT user_id, MIN(event_date) AS cohort_date FROM events GROUP BY user_id
),
activity AS (
    SELECT e.user_id, c.cohort_date,
           DATEDIFF(e.event_date, c.cohort_date) AS days_since_cohort  -- Hive/Spark
    FROM events e JOIN cohorts c ON e.user_id = c.user_id
)
SELECT cohort_date,
       COUNT(DISTINCT CASE WHEN days_since_cohort = 0  THEN user_id END) AS day0,
       COUNT(DISTINCT CASE WHEN days_since_cohort = 7  THEN user_id END) AS day7,
       COUNT(DISTINCT CASE WHEN days_since_cohort = 30 THEN user_id END) AS day30
FROM activity
GROUP BY cohort_date;
```
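Interviewers usually want retention as a percentage of the cohort's day-0 size; a sketch that continues the template's WITH chain (the retention CTE below is just the final SELECT above, wrapped):

```sql
-- Continues the cohorts/activity CTEs of the template above
, retention AS (
    SELECT cohort_date,
           COUNT(DISTINCT CASE WHEN days_since_cohort = 0  THEN user_id END) AS day0,
           COUNT(DISTINCT CASE WHEN days_since_cohort = 7  THEN user_id END) AS day7,
           COUNT(DISTINCT CASE WHEN days_since_cohort = 30 THEN user_id END) AS day30
    FROM activity
    GROUP BY cohort_date
)
SELECT cohort_date, day0,
       ROUND(100.0 * day7  / NULLIF(day0, 0), 2) AS day7_retention_pct,
       ROUND(100.0 * day30 / NULLIF(day0, 0), 2) AS day30_retention_pct
FROM retention;
```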
P12: Date Arithmetic — Quick Reference
```sql
DATEDIFF(end, start)                             -- Hive/Spark: days between dates
DATE_ADD(date, N)                                -- Hive/Spark: add N days
DATE_TRUNC('month', date)                        -- Spark/PG: start of month
TRUNC(date, 'MM')                                -- Hive: start of month
WHERE date >= DATE_SUB(CURRENT_DATE(), 90)       -- past 90 days (Hive/Spark)
WHERE date >= CURRENT_DATE - INTERVAL '90 days'  -- PostgreSQL
```
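A hedged worked example of the "overdue / past N days" phrasing, assuming an invoices(invoice_id, due_date, paid_date) table that is not part of this guide:

```sql
-- Unpaid invoices more than 30 days overdue (PostgreSQL syntax)
SELECT invoice_id, due_date,
       CURRENT_DATE - due_date AS days_overdue
FROM invoices
WHERE paid_date IS NULL
  AND due_date < CURRENT_DATE - INTERVAL '30 days';
-- Hive/Spark filter equivalent: due_date < DATE_SUB(CURRENT_DATE(), 30)
```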
SECTION 5: ADVANCED (P13, P14, P15)
P13: Recursive CTE Template
```sql
WITH RECURSIVE cte AS (
    SELECT id, parent_id, name, 1 AS lvl
    FROM tree WHERE parent_id IS NULL            -- ANCHOR
    UNION ALL
    SELECT t.id, t.parent_id, t.name, c.lvl + 1
    FROM tree t JOIN cte c ON t.parent_id = c.id
    WHERE c.lvl < 20                             -- RECURSIVE + LIMIT
)
SELECT * FROM cte;
-- ⚠️ Always add a level limit! ⚠️ Use UNION ALL not UNION
-- ⚠️ Hive doesn't support WITH RECURSIVE → use Spark SQL
```
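A frequent follow-up: build the full path from root to each node by concatenating names in the recursive member. PostgreSQL-style sketch (the ' > ' separator and the CAST are illustrative choices):

```sql
WITH RECURSIVE paths AS (
    SELECT id, name, CAST(name AS TEXT) AS path, 1 AS lvl
    FROM tree
    WHERE parent_id IS NULL                       -- anchor: roots
    UNION ALL
    SELECT t.id, t.name, p.path || ' > ' || t.name, p.lvl + 1
    FROM tree t JOIN paths p ON t.parent_id = p.id
    WHERE p.lvl < 20                              -- same depth guard as the template
)
SELECT id, name, lvl, path FROM paths ORDER BY path;
```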
P14: Median Template
```sql
-- Option 1: Built-in (PostgreSQL/Snowflake/Spark)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) AS median FROM t;

-- Option 2: Manual (works everywhere including Hive)
WITH ranked AS (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS rn,
           COUNT(*) OVER () AS n
    FROM t
)
SELECT AVG(val) AS median FROM ranked WHERE rn IN ((n + 1) / 2, (n + 2) / 2);

-- Option 3: Quartiles with NTILE
SELECT user_id, spend, NTILE(4) OVER (ORDER BY spend) AS quartile FROM user_totals;
```
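Interviews often ask for the median per group rather than overall; a sketch of both spellings (built-in where supported, and the manual trick partitioned by group):

```sql
-- Built-in, per group (PostgreSQL/Snowflake)
SELECT grp, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) AS median_val
FROM t
GROUP BY grp;

-- Manual, per group: partition both the row number and the count
WITH ranked AS (
    SELECT grp, val,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val) AS rn,
           COUNT(*)     OVER (PARTITION BY grp)              AS n
    FROM t
)
SELECT grp, AVG(val) AS median_val
FROM ranked
WHERE rn IN ((n + 1) / 2, (n + 2) / 2)
GROUP BY grp;
```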
P15: Funnel Template
```sql
WITH stages AS (
    SELECT user_id,
           MAX(CASE WHEN event = 'signup'   THEN 1 ELSE 0 END) AS s1,
           MAX(CASE WHEN event = 'verify'   THEN 1 ELSE 0 END) AS s2,
           MAX(CASE WHEN event = 'purchase' THEN 1 ELSE 0 END) AS s3
    FROM events
    GROUP BY user_id
)
SELECT SUM(s1) AS signup, SUM(s2) AS verify, SUM(s3) AS purchase,
       ROUND(100.0 * SUM(s2) / NULLIF(SUM(s1), 0), 2) AS s1_to_s2,
       ROUND(100.0 * SUM(s3) / NULLIF(SUM(s2), 0), 2) AS s2_to_s3
FROM stages;
-- ⚠️ Use MAX(CASE WHEN) per user — counting unique users, not events!
```
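If the interviewer requires the stages to happen in order (verify after signup, purchase after verify), compare first timestamps per stage instead of 0/1 flags; a hedged sketch against the same assumed events table:

```sql
WITH stage_times AS (
    SELECT user_id,
           MIN(CASE WHEN event = 'signup'   THEN event_time END) AS t_signup,
           MIN(CASE WHEN event = 'verify'   THEN event_time END) AS t_verify,
           MIN(CASE WHEN event = 'purchase' THEN event_time END) AS t_purchase
    FROM events
    GROUP BY user_id
)
SELECT COUNT(t_signup)                                   AS signup,
       COUNT(CASE WHEN t_verify   > t_signup THEN 1 END) AS verify_after_signup,
       COUNT(CASE WHEN t_purchase > t_verify THEN 1 END) AS purchase_after_verify
FROM stage_times;
-- NULL timestamps (stage never reached) drop out of the ordered comparisons automatically
```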
THE SQL ULTRA CHEAT SHEET
📐 Architecture Diagram
```text
SQL INTERVIEW — ALL 15 PATTERNS

WINDOW FUNCTIONS (P1-P3):
  P1 RANK:     DENSE_RANK() OVER(PARTITION BY grp ORDER BY val DESC)
               → Top-N: WHERE rank <= N
  P2 RUNNING:  SUM() OVER(PARTITION BY grp ORDER BY date
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               Rolling: ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW
  P3 LAG/LEAD: LAG(col,1) OVER(PARTITION BY grp ORDER BY date)
               → MoM change, sequence detection, time between events

HARD PATTERNS (P4-P6):
  P4 GAPS&ISL: DISTINCT dates → ROW_NUMBER → date - rn = island_key
               GROUP BY island_key → MIN/MAX = streak start/end
  P5 SESSION:  LAG gap > threshold → new_session=1 → SUM as session_id
  P6 DEDUP:    ROW_NUMBER() OVER(PARTITION BY key ORDER BY ts DESC) = 1

JOIN PATTERNS (P7-P9):
  P7 SELF:     FROM t e1 JOIN t e2 ON e1.mgr = e2.id (hierarchy)
               ON e1.shared = e2.shared AND e1.id < e2.id (pairs)
  P8 CROSS:    CROSS JOIN for all combos; WHERE id < id2 for pairs
  P9 BASKET:   JOIN orders ON same order_id AND p1.id < p2.id
               GROUP BY pair → ORDER BY count DESC LIMIT 5

ANALYTICS (P10-P12):
  P10 COND-AGG: SUM(CASE WHEN cond THEN val ELSE 0 END)
                NULLIF(denominator, 0) for ratios
  P11 COHORT:   MIN(event_date) per user = cohort
                JOIN back on DATEDIFF = N for retention day
  P12 DATE:     DATEDIFF(end,start) | DATE_ADD(d,N) | DATE_TRUNC
                COALESCE to handle 0-day gaps

ADVANCED (P13-P15):
  P13 RECURSE: ANCHOR (roots) UNION ALL RECURSIVE (children)
               Always: WHERE level < 20 to prevent infinite loop
               Hive: no WITH RECURSIVE → use Spark SQL
  P14 MEDIAN:  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val)
               Manual: ROW_NUMBER + WHERE rn IN ((n+1)/2, (n+2)/2)
               NTILE(4) for quartiles
  P15 FUNNEL:  MAX(CASE WHEN event=stage) per user → SUM each stage
               Conversion = next_count/this_count × 100

TOP TRAPS TO AVOID:
  1. Window function in WHERE → must use CTE first
  2. RANK not DENSE_RANK for top-N with ties
  3. Omit ORDER BY in OVER → grand total, not running total
  4. Self-join without id < id2 → duplicate pairs
  5. No DISTINCT before gaps & islands → same-day events break trick
  6. Division without NULLIF(denom, 0) → division by zero error
  7. WITH RECURSIVE in Hive → not supported
  8. COUNT(CASE WHEN) for users → must be COUNT(DISTINCT user_id)

5-STEP SOLVING FRAMEWORK (say this in interview!):
  1. "What is the grain of the output?" (one row = one what?)
  2. "What is the partition/group?" (what resets the window/join?)
  3. "What pattern does this match?" (use the flowchart above)
  4. "Build with CTEs — one CTE per logical step"
  5. "Check edge cases: NULLs, ties, first/last row, empty groups"

HOW TO ADD YOUR OWN QUESTIONS:
  → Open SQL_QUESTION_BANK.md
  → Add new row: Q91 | question text | company | P? | difficulty |
  → Share with me and I'll identify the pattern + add solved solution
```
SYNTAX DIFF: Hive vs Spark vs PostgreSQL
| Operation | Hive SQL | Spark SQL / PostgreSQL |
|---|---|---|
| Days between dates | DATEDIFF(end, start) | end - start OR DATEDIFF(e,s) |
| Add N days | DATE_ADD(date, N) | date + INTERVAL 'N days' |
| Truncate to month | TRUNC(date, 'MM') | DATE_TRUNC('month', date) |
| Current date | CURRENT_DATE() | CURRENT_DATE |
| Recursive CTE | ❌ NOT SUPPORTED | WITH RECURSIVE cte AS (...) |
| Percentile | PERCENTILE_APPROX(col, 0.5) | PERCENTILE_CONT(0.5) WITHIN GROUP |
| String concat | CONCAT('a','b') | 'a' \|\| 'b' (CONCAT works here too) |
| Unix timestamp | UNIX_TIMESTAMP(datetime) | EXTRACT(EPOCH FROM datetime) |