SQL Interview Prep — Pattern-Based Guide for Data Engineers
💡 Interview Tip
Philosophy: Don't memorize 100 answers. Master 15 patterns. Every question becomes recognizable.
Level: Medium to Hard (data engineering interview standard)
Total Questions: 90 across 15 patterns — extensible to 200+
HOW TO USE THIS PREP
📋 Overview
STEP 1: See a question → ask "what is it REALLY asking?"
STEP 2: Match to a pattern using the Decision Flowchart below
STEP 3: Apply the pattern TEMPLATE (fill in the blanks)
STEP 4: Build with CTEs (one CTE per logical step)
STEP 5: Test for edge cases (NULLs, ties, first/last row)
THE DECISION FLOWCHART — Which Pattern To Use
LOOK AT THE QUESTION. What does it ask for?
"Top N per group / highest / rank within group"
→ PATTERN 1: RANKING (DENSE_RANK, ROW_NUMBER)
"Running total / cumulative sum / month-to-date"
→ PATTERN 2: RUNNING TOTALS (SUM OVER ORDER BY)
"Previous row / next row / month-over-month / growth / streak"
→ PATTERN 3: LAG / LEAD
"Consecutive days / continuous period / streak / island"
→ PATTERN 4: GAPS & ISLANDS (date - ROW_NUMBER trick)
"Group events into sessions / inactivity window"
→ PATTERN 5: SESSIONIZATION
"Remove duplicates / keep latest / CDC dedup"
→ PATTERN 6: DEDUPLICATION (ROW_NUMBER PARTITION BY)
"Compare rows in same table / manager vs employee / hierarchy"
→ PATTERN 7: SELF-JOIN
"All combinations / every pair / Cartesian product"
→ PATTERN 8: CROSS JOIN
"Items bought together / co-occurrence / frequently paired"
→ PATTERN 9: MARKET BASKET (self-join on order_id)
"Count/sum by category in same row / pivot columns"
→ PATTERN 10: CONDITIONAL AGGREGATION (CASE WHEN)
"First time users / retention / returning customers / cohort"
→ PATTERN 11: COHORT / RETENTION
"Date difference / interval / overdue / days between events"
→ PATTERN 12: DATE ARITHMETIC
"Hierarchy / org chart / tree / all levels / recursive"
→ PATTERN 13: RECURSIVE CTE
"Median / percentile / quartile / middle value"
→ PATTERN 14: MEDIAN / PERCENTILE
"Funnel / conversion / drop-off / step-by-step users"
→ PATTERN 15: FUNNEL ANALYSIS
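To make the flowchart concrete, here is Pattern 1 end to end: "top 2 products per category" recognized as a ranking problem and solved with DENSE_RANK. A minimal runnable sketch using SQLite (3.25+ for window functions) via Python; the `sales` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (category TEXT, product TEXT, revenue INT);
INSERT INTO sales VALUES
  ('toys', 'robot', 300), ('toys', 'ball', 300), ('toys', 'kite', 100),
  ('books', 'atlas', 500), ('books', 'novel', 200);
""")

# "Top 2 per group" -> Pattern 1: rank within each category, then filter.
rows = conn.execute("""
WITH ranked AS (
  SELECT category, product, revenue,
         DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM sales
)
SELECT category, product, revenue
FROM ranked
WHERE rnk <= 2
ORDER BY category, revenue DESC, product;
""").fetchall()

print(rows)
```

Note the tie: 'robot' and 'ball' share rank 1 under DENSE_RANK, so the toys group returns three rows. With ROW_NUMBER you would get exactly two — that is the edge case Step 5 of the solving template asks you to verbalize.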
FILE STRUCTURE
| File | Patterns Covered | Questions |
|---|---|---|
SQL_00_PLAN.md | This file — flowchart + overview | — |
SQL_01_Window_Functions.md | P1: Ranking, P2: Running Totals, P3: LAG/LEAD | Q01–Q24 |
SQL_02_Hard_Patterns.md | P4: Gaps & Islands, P5: Sessionization, P6: Deduplication | Q25–Q41 |
SQL_03_Join_Patterns.md | P7: Self-Join, P8: Cross Join, P9: Market Basket | Q42–Q56 |
SQL_04_Aggregation_Cohort.md | P10: Conditional Agg, P11: Cohort/Retention, P12: Date Arithmetic | Q57–Q73 |
SQL_05_Advanced.md | P13: Recursive CTE, P14: Median/Percentile, P15: Funnel | Q74–Q90 |
SQL_06_Quick_Recall.md | All templates + ultra cheat sheet + decision card | All |
SQL_QUESTION_BANK.md | Master list of ALL 90 questions + space to add yours | All |
PATTERN DIFFICULTY HEAT MAP
EASIER ──────────────────────────────── HARDER
P6: Dedup P1: Ranking P3: LAG/LEAD P4: Gaps&Islands
P12: DateArith P10: CaseWhen P2: Running P5: Sessionize
P8: CrossJoin P7: SelfJoin P11: Cohort P13: Recursive
P14: Median P9: MarketBask P15: Funnel
THE 5-STEP SOLVING TEMPLATE (use every time)
```sql
-- STEP 1: Understand the grain
-- "What does ONE ROW in the output represent?"
-- Example: one row = one product per category

-- STEP 2: Build intermediate CTEs
WITH raw_data AS (
    -- Clean or filter source data
    SELECT ...
    FROM source_table
    WHERE conditions
),
intermediate AS (
    -- Apply window function or join
    SELECT
        ...,
        ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY sort_col) AS rn
    FROM raw_data
)

-- STEP 3: Filter to final result
SELECT ...
FROM intermediate
WHERE rn = 1;  -- or whatever condition applies

-- STEP 4: Check edge cases in your head:
-- - What if there are NULL values in partition/order columns?
-- - What if a group has only 1 row? (top-2 query returns 1 row per group — is that OK?)
-- - Are ties handled correctly? (RANK vs DENSE_RANK)
-- - What if the date column has gaps?

-- STEP 5: Verbalize to interviewer:
-- "First I'm partitioning by X because each group needs its own ranking..."
-- "I chose DENSE_RANK over RANK because I want ties to share the same rank..."
```
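The template above, instantiated end to end on SQLite via Python: keep the latest event per user (a deduplication, rn = 1). The grain of the output is one row per user. The `user_events` table and its CDC-style rows are toy data invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_events (user_id INT, event_ts TEXT, status TEXT);
INSERT INTO user_events VALUES
  (1, '2024-01-01', 'signup'),
  (1, '2024-01-05', 'upgrade'),
  (2, '2024-01-03', 'signup');
""")

rows = conn.execute("""
-- STEP 1: grain = one row per user (their latest event)
WITH raw_data AS (
  SELECT user_id, event_ts, status
  FROM user_events
),
intermediate AS (
  -- STEP 2: rank events newest-first within each user
  SELECT user_id, event_ts, status,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
  FROM raw_data
)
-- STEP 3: keep only the latest event per user
SELECT user_id, event_ts, status
FROM intermediate
WHERE rn = 1
ORDER BY user_id;
""").fetchall()

print(rows)
```

Step 4 in action: user 2 has only one event, and the pattern still works — that single row gets rn = 1 and survives the filter.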
WHAT INTERVIEWERS REALLY EVALUATE
1. PATTERN RECOGNITION (fastest signal)
Do you immediately see "this is a gaps-and-islands" or do you start
writing random JOINs hoping something works?
2. CTE HYGIENE
Do you break complex logic into named CTEs?
Good engineers write readable SQL, not one 30-line mega-query.
3. EDGE CASE AWARENESS
"What if there are NULLs in the join key?"
"What if a user has only one purchase — does your LAG() still work?"
4. PERFORMANCE THINKING (senior-level differentiator)
"On 10 billion rows in Snowflake/BigQuery, I'd partition the underlying
table by date to support partition pruning on this query..."
5. TOOL AWARENESS
Know when to say: "In Hive I'd use DISTRIBUTE BY + SORT BY instead of
ORDER BY for this window function to avoid data movement to one reducer."
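The edge case from point 3 is worth seeing once: LAG() on a user with a single purchase does not fail — it returns NULL (None in Python) for the missing previous row. A quick SQLite check with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (user_id INT, amount INT, ts TEXT);
INSERT INTO purchases VALUES
  (1, 10, '2024-01-01'),
  (1, 25, '2024-02-01'),
  (2, 40, '2024-01-15');  -- user 2 has exactly one purchase
""")

rows = conn.execute("""
SELECT user_id, amount,
       LAG(amount) OVER (PARTITION BY user_id ORDER BY ts) AS prev_amount
FROM purchases
ORDER BY user_id, ts;
""").fetchall()

print(rows)
```

If your downstream logic computes `amount - prev_amount`, decide explicitly whether that NULL should propagate or be COALESCEd to 0 — and say so to the interviewer.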
SYNTAX CHEAT SHEET — Standard SQL vs Hive SQL vs Spark SQL
| Operation | Standard SQL / PostgreSQL | Hive SQL | Spark SQL |
|---|---|---|---|
| Date difference | end_date - start_date (Postgres) / DATE_DIFF (BigQuery) | DATEDIFF(end,start) | DATEDIFF(end,start) |
| Truncate to month | DATE_TRUNC('month', date) | TRUNC(date,'MM') | DATE_TRUNC('month',date) |
| Add days | date + INTERVAL '7 days' | DATE_ADD(date, 7) | DATE_ADD(date, 7) |
| String concat | 'a' \|\| 'b' | CONCAT('a','b') | CONCAT('a','b') or 'a' \|\| 'b' |
| Recursive CTE | WITH RECURSIVE cte AS... | NOT SUPPORTED* | Supported (Spark 3+) |
| Percentile | PERCENTILE_CONT(0.5) ... | PERCENTILE(col,.5) | PERCENTILE_CONT(0.5)... |
| Extract year | EXTRACT(YEAR FROM date) | YEAR(date) | YEAR(date) or EXTRACT |
| Row number | ROW_NUMBER() OVER(...) | ROW_NUMBER() OVER | ROW_NUMBER() OVER |
| LAG/LEAD | LAG(col,1) OVER(...) | LAG(col,1) OVER | LAG(col,1) OVER |
* Hive doesn't support recursive CTEs — pre-materialize the hierarchy with one self-join per level, or run the query in Spark instead
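For the engines that do support it (standard SQL / PostgreSQL / Spark 3+), the recursive CTE from Pattern 13 looks like this — shown here on SQLite, which also supports WITH RECURSIVE. The org chart is toy data invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES
  (1, 'ceo', NULL), (2, 'vp', 1), (3, 'eng', 2), (4, 'eng2', 2);
""")

rows = conn.execute("""
WITH RECURSIVE chain AS (
  -- anchor: the root of the tree (no manager)
  SELECT id, name, 0 AS depth FROM employees WHERE manager_id IS NULL
  UNION ALL
  -- recursive step: attach direct reports, one level per iteration
  SELECT e.id, e.name, c.depth + 1
  FROM employees e
  JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, name;
""").fetchall()

print(rows)
```

In Hive the same result needs N self-joins for a hierarchy of depth N, which is why "all levels, unknown depth" questions usually signal a recursive CTE engine.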