SQL Interview — Quick Recall Guide
SQL · Section 7 of 9

Legend: Must know · ⚠️ Common trap · 🧠 Memory map · 📝 One-liner

🧠 MASTER DECISION FLOWCHART

🧠 Memory Map
QUESTION SAYS... → USE THIS PATTERN
──────────────────────────────────────────────────────────────────
"Top N per group / rank within group"P1: DENSE_RANK PARTITION BY
"Running total / cumulative / rolling avg"P2: SUM OVER (ORDER BY)
"Previous row / MoM / next event"P3: LAG / LEAD
"Consecutive days / streak / period"P4: GAPS & ISLANDS (date - rn trick)
"Group events into sessions"P5: SESSIONIZATION (LAG + SUM flag)
"Remove duplicates / keep latest row"P6: ROW_NUMBER PARTITION BY = 1
"Compare rows in same table / hierarchy"P7: SELF-JOIN (two aliases)
"All combinations / grid / every pair"P8: CROSS JOIN
"Items bought in same order together"P9: MARKET BASKET (self-join order_id)
"Count/sum by category in one row"P10: CASE WHEN inside aggregate
"First users / retention / cohorts"P11: COHORT (MIN date as cohort)
"Days between / overdue / past 90 days"P12: DATE ARITHMETIC
"Hierarchy / tree / all levels"P13: RECURSIVE CTE
"Median / percentile / quartile"P14: PERCENTILE_CONT or ROW_NUMBER trick
"Funnel / conversion / drop-off rate"P15: FUNNEL (COUNT DISTINCT per stage)

SECTION 1: WINDOW FUNCTIONS (P1, P2, P3)

P1: Ranking Template

sql
WITH ranked AS (
    SELECT *, DENSE_RANK() OVER (PARTITION BY grp ORDER BY val DESC) AS rnk
    FROM tbl
)
SELECT * FROM ranked WHERE rnk <= N;
-- ROW_NUMBER: unique, no ties | RANK: skips | DENSE_RANK: no skip ← use for Top-N

RANK vs DENSE_RANK for Top-2: RANK skips numbers after ties (1, 1, 3), so rows holding the 2nd-highest value can be excluded entirely; DENSE_RANK never skips (1, 1, 2), so ALL rows in the top 2 distinct values appear (usually what "Top N" means).
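
A minimal sketch of the difference, using a PostgreSQL-style inline VALUES table invented for the example:

sql
SELECT val,
       RANK()       OVER (ORDER BY val DESC) AS rnk,    -- 1, 1, 3, 4
       DENSE_RANK() OVER (ORDER BY val DESC) AS drnk    -- 1, 1, 2, 3
FROM (VALUES (100), (100), (90), (80)) AS t(val);
-- Filtering rnk <= 2 (in an outer CTE) drops the 90 row; drnk <= 2 keeps it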

P2: Running Total Template

sql
SUM(val) OVER (PARTITION BY grp ORDER BY date_col
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total

AVG(val) OVER (PARTITION BY grp ORDER BY date_col
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7day_avg

⚠️ TRAP: Omit ORDER BY inside OVER → you get a grand total on every row, not a running total!
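
Quick contrast of the two forms (snippet only; grp / date_col / val as in the template above):

sql
SUM(val) OVER (PARTITION BY grp)                    -- no ORDER BY: same grand total on every row
SUM(val) OVER (PARTITION BY grp ORDER BY date_col)  -- with ORDER BY: running total (default frame)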

P3: LAG/LEAD Template

sql
LAG(val, 1) OVER (PARTITION BY grp ORDER BY date_col) AS prev_val
LEAD(val, 1) OVER (PARTITION BY grp ORDER BY date_col) AS next_val
-- First row LAG = NULL, Last row LEAD = NULL
-- pct_change = (val - prev_val) / NULLIF(prev_val, 0) × 100

KEY USES: MoM/YoY change, detect duplicates within time window (Stripe!), "next purchase was AirPods" (Apple!)
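
For example, a MoM-change sketch built on this template (assumes a hypothetical monthly_revenue(month, revenue) table):

sql
WITH m AS (
    SELECT month, revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
SELECT month, revenue,
       ROUND(100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 2) AS mom_pct
FROM m;
-- First month has prev_revenue = NULL, so mom_pct is NULL there (no fake 0% change)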

SECTION 2: HARD PATTERNS (P4, P5, P6)

P4: Gaps & Islands — THE KEY TRICK

sql
-- TRICK: consecutive dates - sequential row numbers = CONSTANT for each island!
WITH deduped AS (SELECT DISTINCT user_id, activity_date FROM events),
numbered AS (
    SELECT user_id, activity_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date) AS rn
    FROM deduped
),
islands AS (
    SELECT user_id, activity_date,
           DATE_SUB(activity_date, rn) AS island_key   -- Hive/Spark
           -- PostgreSQL: activity_date - rn::int AS island_key (rn is bigint)
    FROM numbered
)
SELECT user_id, MIN(activity_date) AS start, MAX(activity_date) AS end, COUNT(*) AS len
FROM islands
GROUP BY user_id, island_key;

⚠️ TRAP 1: Deduplicate FIRST (DISTINCT) or same-day events break the trick. ⚠️ TRAP 2: Hive = DATE_SUB(date, rn), PostgreSQL = date - rn::int.
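
Hand-worked sketch of why the key is constant per streak (PostgreSQL syntax; the five dates are invented, and rn is cast because ROW_NUMBER() returns bigint):

sql
WITH d(activity_date) AS (
    VALUES (DATE '2024-01-01'), (DATE '2024-01-02'), (DATE '2024-01-03'),
           (DATE '2024-01-06'), (DATE '2024-01-07')
),
numbered AS (
    SELECT activity_date, ROW_NUMBER() OVER (ORDER BY activity_date) AS rn FROM d
)
SELECT activity_date, rn, activity_date - CAST(rn AS int) AS island_key
FROM numbered;
-- island_key = 2023-12-31 for the first three rows (one streak), 2024-01-02 for the last two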

P5: Sessionization Template

sql
WITH lagged AS (
    SELECT *, LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_t
    FROM events
),
flagged AS (
    SELECT *, CASE WHEN prev_t IS NULL OR (event_time - prev_t) > INTERVAL '30 minutes'
                   THEN 1 ELSE 0 END AS new_session
    FROM lagged
)
SELECT *, SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM flagged;
-- prev_t can't be referenced in the same SELECT that defines it, so the LAG lives in its own CTE

P6: Deduplication Template

sql
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY unique_key ORDER BY updated_at DESC) AS rn
    FROM table_with_dupes
)
SELECT * FROM ranked WHERE rn = 1;  -- always use ROW_NUMBER not RANK (RANK can have ties!)

SECTION 3: JOIN PATTERNS (P7, P8, P9)

P7: Self-Join Template

sql
-- Hierarchy:
SELECT e.name, m.name AS manager FROM employees e JOIN employees m ON e.manager_id = m.id
-- Pairs (no duplicates):
SELECT t1.col, t2.col FROM t t1 JOIN t t2 ON t1.shared = t2.shared AND t1.id < t2.id

⚠️ TRAP: Forget t1.id < t2.id → each pair appears TWICE and self-pairs appear.
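
Tiny sketch of what the inequality does (PostgreSQL-style VALUES; ids and cities invented):

sql
WITH u(id, city) AS (VALUES (1, 'NYC'), (2, 'NYC'), (3, 'NYC'))
SELECT t1.id AS a, t2.id AS b
FROM u t1 JOIN u t2 ON t1.city = t2.city AND t1.id < t2.id;
-- Returns (1,2), (1,3), (2,3). Without t1.id < t2.id you get 9 rows: reversed pairs plus self-pairs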

P8: Cross Join Template

sql
-- All combinations:
SELECT a.col, b.col FROM table_a a CROSS JOIN table_b b;
-- Round-robin (no self-pairs):
SELECT t1.name, t2.name FROM teams t1 CROSS JOIN teams t2 WHERE t1.id < t2.id;
-- 3-way (pizza toppings):
SELECT t1.name, t2.name, t3.name FROM t t1 JOIN t t2 ON t1.name < t2.name
       JOIN t t3 ON t2.name < t3.name;

P9: Market Basket Template

sql
SELECT o1.product_id, o2.product_id, COUNT(DISTINCT o1.order_id) AS co_purchases
FROM order_items o1
JOIN order_items o2 ON o1.order_id = o2.order_id AND o1.product_id < o2.product_id
GROUP BY o1.product_id, o2.product_id
ORDER BY co_purchases DESC LIMIT 5;

SECTION 4: AGGREGATION & ANALYTICS (P10, P11, P12)

P10: Conditional Aggregation Template

sql
SELECT date_col,
    SUM(CASE WHEN category = 'A' THEN 1 ELSE 0 END) AS count_a,
    SUM(CASE WHEN category = 'B' THEN amount ELSE 0 END) AS revenue_b,
    ROUND(100.0 * SUM(CASE WHEN event='click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event='impression' THEN 1 ELSE 0 END), 0), 2) AS ctr
FROM events GROUP BY date_col;

⚠️ TRAP: Always NULLIF(denominator, 0) for ratios/CTR to prevent division by zero.
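
Why that works (one-line sketch): NULLIF(x, 0) returns NULL when x = 0, and dividing by NULL yields NULL instead of an error.

sql
SELECT 5 / NULLIF(0, 0);   -- NULL, not a division-by-zero error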

P11: Cohort Retention Template

sql
WITH cohorts AS (SELECT user_id, MIN(event_date) AS cohort_date FROM events GROUP BY user_id),
activity AS (
    SELECT e.user_id, c.cohort_date,
           DATEDIFF(e.event_date, c.cohort_date) AS days_since_cohort  -- Hive/Spark
    FROM events e JOIN cohorts c ON e.user_id = c.user_id
)
SELECT cohort_date,
    COUNT(DISTINCT CASE WHEN days_since_cohort = 0 THEN user_id END) AS day0,
    COUNT(DISTINCT CASE WHEN days_since_cohort = 7 THEN user_id END) AS day7,
    COUNT(DISTINCT CASE WHEN days_since_cohort = 30 THEN user_id END) AS day30
FROM activity GROUP BY cohort_date;

P12: Date Arithmetic — Quick Reference

sql
DATEDIFF(end, start)                    -- Hive/Spark: days between dates
DATE_ADD(date, N)                       -- Hive/Spark: add N days
DATE_TRUNC('month', date)               -- Spark/PG: start of month
TRUNC(date, 'MM')                       -- Hive: start of month
WHERE date >= DATE_SUB(CURRENT_DATE(), 90)  -- past 90 days (Hive/Spark)
WHERE date >= CURRENT_DATE - INTERVAL '90 days'  -- PostgreSQL

SECTION 5: ADVANCED (P13, P14, P15)

P13: Recursive CTE Template

sql
WITH RECURSIVE cte AS (
    SELECT id, parent_id, name, 1 AS lvl FROM tree WHERE parent_id IS NULL  -- ANCHOR
    UNION ALL
    SELECT t.id, t.parent_id, t.name, c.lvl + 1
    FROM tree t JOIN cte c ON t.parent_id = c.id WHERE c.lvl < 20           -- RECURSIVE + LIMIT
)
SELECT * FROM cte;
-- ⚠️ Always add level limit! ⚠️ Use UNION ALL not UNION
-- ⚠️ Hive doesn't support WITH RECURSIVE → use Spark SQL

P14: Median Template

sql
-- Option 1: Built-in (PostgreSQL/Snowflake/Spark)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) AS median FROM t;

-- Option 2: Manual (works everywhere including Hive)
WITH ranked AS (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS rn,
           COUNT(*) OVER () AS n
    FROM t
)
SELECT AVG(val) AS median FROM ranked WHERE rn IN ((n+1)/2, (n+2)/2);

-- Option 3: Quartiles with NTILE
SELECT user_id, spend, NTILE(4) OVER (ORDER BY spend) AS quartile FROM user_totals;

P15: Funnel Template

sql
WITH stages AS (
    SELECT user_id,
        MAX(CASE WHEN event='signup' THEN 1 ELSE 0 END)   AS s1,
        MAX(CASE WHEN event='verify' THEN 1 ELSE 0 END)   AS s2,
        MAX(CASE WHEN event='purchase' THEN 1 ELSE 0 END) AS s3
    FROM events GROUP BY user_id
)
SELECT SUM(s1) AS signup, SUM(s2) AS verify, SUM(s3) AS purchase,
    ROUND(100.0*SUM(s2)/NULLIF(SUM(s1),0),2) AS s1_to_s2,
    ROUND(100.0*SUM(s3)/NULLIF(SUM(s2),0),2) AS s2_to_s3
FROM stages;
-- ⚠️ Use MAX(CASE WHEN) per user — counting unique users, not events!

THE SQL ULTRA CHEAT SHEET

📐 Architecture Diagram
╔══════════════════════════════════════════════════════════════════════╗
║              SQL INTERVIEW — ALL 15 PATTERNS                          ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  WINDOW FUNCTIONS (P1-P3):                                           ║
║  P1 RANK:     DENSE_RANK() OVER(PARTITION BY grp ORDER BY val DESC)  ║
║               → Top-N: WHERE rank <= N                               ║
║  P2 RUNNING:  SUM() OVER(PARTITION BY grp ORDER BY date              ║
║               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)      ║
║               Rolling: ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW   ║
║  P3 LAG/LEAD: LAG(col,1) OVER(PARTITION BY grp ORDER BY date)        ║
║               → MoM change, sequence detection, time between events  ║
║                                                                      ║
║  HARD PATTERNS (P4-P6):                                              ║
║  P4 GAPS&ISL: DISTINCT dates → ROW_NUMBER → date - rn = island_key   ║
║               GROUP BY island_key → MIN/MAX = streak start/end       ║
║  P5 SESSION:  LAG gap > threshold → new_session=1 → SUM as session_id║
║  P6 DEDUP:    ROW_NUMBER() OVER(PARTITION BY key ORDER BY ts DESC)=1 ║
║                                                                      ║
║  JOIN PATTERNS (P7-P9):                                              ║
║  P7 SELF:     FROM t e1 JOIN t e2 ON e1.mgr = e2.id (hierarchy)     ║
║               ON e1.shared = e2.shared AND e1.id < e2.id (pairs)    ║
║  P8 CROSS:    CROSS JOIN for all combos; WHERE id < id2 for pairs    ║
║  P9 BASKET:   JOIN orders ON same order_id AND p1.id < p2.id         ║
║               GROUP BY pair → ORDER BY count DESC LIMIT 5            ║
║                                                                      ║
║  ANALYTICS (P10-P12):                                                ║
║  P10 COND-AGG: SUM(CASE WHEN cond THEN val ELSE 0 END)               ║
║                NULLIF(denominator, 0) for ratios                     ║
║  P11 COHORT:   MIN(event_date) per user = cohort                     ║
║                JOIN back on DATEDIFF = N for retention day           ║
║  P12 DATE:     DATEDIFF(end,start) | DATE_ADD(d,N) | DATE_TRUNC      ║
║                COALESCE to handle 0-day gaps                         ║
║                                                                      ║
║  ADVANCED (P13-P15):                                                 ║
║  P13 RECURSE:  ANCHOR (roots) UNION ALL RECURSIVE (children)         ║
║                Always: WHERE level < 20 to prevent infinite loop     ║
║                Hive: no WITH RECURSIVE → use Spark SQL               ║
║  P14 MEDIAN:   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val)      ║
║                Manual: ROW_NUMBER + WHERE rn IN ((n+1)/2, (n+2)/2)  ║
║                NTILE(4) for quartiles                                ║
║  P15 FUNNEL:   MAX(CASE WHEN event=stage) per user → SUM each stage  ║
║                Conversion = next_count/this_count × 100             ║
║                                                                      ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  TOP TRAPS TO AVOID:                                                 ║
║  1. Window function in WHERE → must use CTE first                    ║
║  2. RANK not DENSE_RANK for top-N with ties                          ║
║  3. Omit ORDER BY in OVER → grand total, not running total           ║
║  4. Self-join without id < id2 → duplicate pairs                     ║
║  5. No DISTINCT before gaps & islands → same-day events break trick  ║
║  6. Division without NULLIF(denom, 0) → division by zero error       ║
║  7. WITH RECURSIVE in Hive → not supported                           ║
║  8. COUNT(CASE WHEN) for users → must be COUNT(DISTINCT user_id)     ║
║                                                                      ║
║  5-STEP SOLVING FRAMEWORK (say this in interview!):                  ║
║  1. "What is the grain of the output?" (one row = one what?)         ║
║  2. "What is the partition/group?" (what resets the window/join?)    ║
║  3. "What pattern does this match?" (use the flowchart above)        ║
║  4. "Build with CTEs — one CTE per logical step"                     ║
║  5. "Check edge cases: NULLs, ties, first/last row, empty groups"    ║
║                                                                      ║
║  HOW TO ADD YOUR OWN QUESTIONS:                                      ║
║  → Open SQL_QUESTION_BANK.md                                         ║
║  → Add new row: Q91 | question text | company | P? | difficulty |    ║
║  → Share with me and I'll identify the pattern + add solved solution  ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝

SYNTAX DIFF: Hive vs Spark vs PostgreSQL

Operation            | Hive SQL                     | Spark SQL / PostgreSQL
─────────────────────────────────────────────────────────────────────────────────────
Days between dates   | DATEDIFF(end, start)         | end - start  OR  DATEDIFF(e, s)
Add N days           | DATE_ADD(date, N)            | date + INTERVAL 'N days'
Truncate to month    | TRUNC(date, 'MM')            | DATE_TRUNC('month', date)
Current date         | CURRENT_DATE()               | CURRENT_DATE
Recursive CTE        | ❌ NOT SUPPORTED             | WITH RECURSIVE cte AS (...)
Percentile           | PERCENTILE_APPROX(col, 0.5)  | PERCENTILE_CONT(0.5) WITHIN GROUP
String concat        | CONCAT('a','b')              | 'a' || 'b'
Unix timestamp       | UNIX_TIMESTAMP(datetime)     | EXTRACT(EPOCH FROM datetime)