SQL Interview Prep — Pattern-Based Guide for Data Engineers
💡 Interview Tip
Philosophy: Don't memorize 100 answers. Master 15 patterns; every question becomes recognizable.
Level: Medium to Hard (data engineering interview standard)
Total Questions: 90 across 15 patterns — extensible to 200+

HOW TO USE THIS PREP

📋 Overview
STEP 1: See a question → ask "what is it REALLY asking?"
STEP 2: Match to a pattern using the Decision Flowchart below
STEP 3: Apply the pattern TEMPLATE (fill in the blanks)
STEP 4: Build with CTEs (one CTE per logical step)
STEP 5: Test for edge cases (NULLs, ties, first/last row)

THE DECISION FLOWCHART — Which Pattern To Use

🗂️ LOOK AT THE QUESTION. What does it ask for?
"Top N per group / highest / rank within group"
→ PATTERN 1: RANKING (DENSE_RANK, ROW_NUMBER)
"Running total / cumulative sum / month-to-date"
→ PATTERN 2: RUNNING TOTALS (SUM OVER ORDER BY)
"Previous row / next row / month-over-month / growth / streak"
→ PATTERN 3: LAG / LEAD
"Consecutive days / continuous period / streak / island"
→ PATTERN 4: GAPS & ISLANDS (date - ROW_NUMBER trick)
"Group events into sessions / inactivity window"
→ PATTERN 5: SESSIONIZATION
"Remove duplicates / keep latest / CDC dedup"
→ PATTERN 6: DEDUPLICATION (ROW_NUMBER PARTITION BY)
"Compare rows in same table / manager vs employee / hierarchy"
→ PATTERN 7: SELF-JOIN
"All combinations / every pair / Cartesian product"
→ PATTERN 8: CROSS JOIN
"Items bought together / co-occurrence / frequently paired"
→ PATTERN 9: MARKET BASKET (self-join on order_id)
"Count/sum by category in same row / pivot columns"
→ PATTERN 10: CONDITIONAL AGGREGATION (CASE WHEN)
"First time users / retention / returning customers / cohort"
→ PATTERN 11: COHORT / RETENTION
"Date difference / interval / overdue / days between events"
→ PATTERN 12: DATE ARITHMETIC
"Hierarchy / org chart / tree / all levels / recursive"
→ PATTERN 13: RECURSIVE CTE
"Median / percentile / quartile / middle value"
→ PATTERN 14: MEDIAN / PERCENTILE
"Funnel / conversion / drop-off / step-by-step users"
→ PATTERN 15: FUNNEL ANALYSIS
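
To make the flowchart concrete, here is the date-minus-ROW_NUMBER trick named under Pattern 4 — a minimal sketch in PostgreSQL-style SQL, assuming a hypothetical logins(user_id, login_date) table with one row per login day:

```sql
-- PATTERN 4 sketch: find each user's consecutive-day login streaks.
WITH numbered AS (
    SELECT
        user_id,
        login_date,
        -- For consecutive dates, date minus row number is CONSTANT;
        -- that constant identifies the "island" (the streak).
        login_date - CAST(ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY login_date
        ) AS INT) AS island_id
    FROM logins
)
SELECT
    user_id,
    MIN(login_date) AS streak_start,
    MAX(login_date) AS streak_end,
    COUNT(*)        AS streak_length
FROM numbered
GROUP BY user_id, island_id;
```

Once you recognize the pattern, the only blanks to fill are the partition key and the date column.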

FILE STRUCTURE

| File | Patterns Covered | Questions |
|---|---|---|
| SQL_00_PLAN.md | This file — flowchart + overview | — |
| SQL_01_Window_Functions.md | P1: Ranking, P2: Running Totals, P3: LAG/LEAD | Q01–Q24 |
| SQL_02_Hard_Patterns.md | P4: Gaps & Islands, P5: Sessionization, P6: Deduplication | Q25–Q41 |
| SQL_03_Join_Patterns.md | P7: Self-Join, P8: Cross Join, P9: Market Basket | Q42–Q56 |
| SQL_04_Aggregation_Cohort.md | P10: Conditional Agg, P11: Cohort/Retention, P12: Date Arithmetic | Q57–Q73 |
| SQL_05_Advanced.md | P13: Recursive CTE, P14: Median/Percentile, P15: Funnel | Q74–Q90 |
| SQL_06_Quick_Recall.md | All templates + ultra cheat sheet + decision card | All |
| SQL_QUESTION_BANK.md | Master list of ALL 90 questions + space to add yours | All |

PATTERN DIFFICULTY HEAT MAP

```
EASIER ──────────────────────────────── HARDER
P6: Dedup        P1: Ranking     P3: LAG/LEAD    P4: Gaps&Islands
P12: DateArith   P10: CaseWhen   P2: Running     P5: Sessionize
P8: CrossJoin    P7: SelfJoin    P11: Cohort     P13: Recursive
                 P14: Median     P9: MarketBask  P15: Funnel
```

THE 5-STEP SOLVING TEMPLATE (use every time)

```sql
-- STEP 1: Understand the grain
--   "What does ONE ROW in the output represent?"
--   Example: one row = one product per category

-- STEP 2: Build intermediate CTEs
WITH raw_data AS (
    -- Clean or filter source data
    SELECT ...
    FROM source_table
    WHERE conditions
),

intermediate AS (
    -- Apply window function or join
    SELECT
        ...,
        ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY sort_col) AS rn
    FROM raw_data
)

-- STEP 3: Filter to final result
SELECT ...
FROM intermediate
WHERE rn = 1;  -- or whatever condition applies

-- STEP 4: Check edge cases in your head:
--   - What if there are NULL values in partition/order columns?
--   - What if a group has only 1 row? (a top-2 query returns 1 row for that group — is that OK?)
--   - Are ties handled correctly? (RANK vs DENSE_RANK)
--   - What if the date column has gaps?

-- STEP 5: Verbalize to interviewer:
--   "First I'm partitioning by X because each group needs its own ranking..."
--   "I chose DENSE_RANK over RANK because I want ties to share the same rank..."
```

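A filled-in instance of the template — a sketch assuming a hypothetical sales(category, product, revenue) table, answering "top product per category" (Pattern 1):

```sql
-- Grain of the output: one row = one category.
WITH ranked AS (
    SELECT
        category,
        product,
        revenue,
        ROW_NUMBER() OVER (
            PARTITION BY category      -- each category gets its own ranking
            ORDER BY revenue DESC      -- highest revenue first
        ) AS rn
    FROM sales
)
SELECT category, product, revenue
FROM ranked
WHERE rn = 1;   -- swap in DENSE_RANK() and keep rn = 1 to return ties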
WHAT INTERVIEWERS REALLY EVALUATE

1. PATTERN RECOGNITION (fastest signal)
Do you immediately see "this is a gaps-and-islands" or do you start
writing random JOINs hoping something works?
2. CTE HYGIENE
Do you break complex logic into named CTEs?
Good engineers write readable SQL, not one 30-line mega-query.
3. EDGE CASE AWARENESS
"What if there are NULLs in the join key?"
"What if a user has only one purchase — does your LAG() still work?"
4. PERFORMANCE THINKING (senior-level differentiator)
"On 10 billion rows in Snowflake/BigQuery, I'd partition the underlying
table by date to support partition pruning on this query..."
5. TOOL AWARENESS
Know when to say: "In Hive I'd use DISTRIBUTE BY + SORT BY instead of
ORDER BY for this window function to avoid data movement to one reducer."
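The Hive point can be shown in two lines — a sketch with a hypothetical events(user_id, event_time, amount) table:

```sql
-- A final global ORDER BY forces every row through ONE reducer in Hive.
-- If the output only needs to be sorted within each key, do this instead:
SELECT user_id, event_time, amount
FROM events
DISTRIBUTE BY user_id          -- all rows for a user land on the same reducer
SORT BY user_id, event_time;   -- each reducer sorts only its own slice
```

Saying this unprompted signals that you have actually run SQL on distributed engines, not just on a laptop.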

SYNTAX CHEAT SHEET — Standard SQL vs Hive SQL vs Spark SQL

| Operation | Standard SQL / PostgreSQL | Hive SQL | Spark SQL |
|---|---|---|---|
| Date difference | `end_date - start_date` (Postgres); `DATE_DIFF(end, start, DAY)` (BigQuery) | `DATEDIFF(end, start)` | `DATEDIFF(end, start)` |
| Truncate to month | `DATE_TRUNC('month', date)` | `TRUNC(date, 'MM')` | `DATE_TRUNC('month', date)` |
| Add days | `date + INTERVAL '7 days'` | `DATE_ADD(date, 7)` | `DATE_ADD(date, 7)` |
| String concat | `'a' \|\| 'b'` | `CONCAT('a', 'b')` | `CONCAT('a', 'b')` or `\|\|` |
| Recursive CTE | `WITH RECURSIVE cte AS ...` | NOT SUPPORTED* | Spark 4.0+ only |
| Percentile | `PERCENTILE_CONT(0.5) ...` | `PERCENTILE(col, 0.5)` | `PERCENTILE_CONT(0.5) ...` |
| Extract year | `EXTRACT(YEAR FROM date)` | `YEAR(date)` | `YEAR(date)` or `EXTRACT` |
| Row number | `ROW_NUMBER() OVER (...)` | `ROW_NUMBER() OVER (...)` | `ROW_NUMBER() OVER (...)` |
| LAG/LEAD | `LAG(col, 1) OVER (...)` | `LAG(col, 1) OVER (...)` | `LAG(col, 1) OVER (...)` |

* Hive doesn't support recursive CTEs (and has no Oracle-style CONNECT BY) — pre-compute each hierarchy level with repeated self-joins, or run the query in Spark instead
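
For the recursive-CTE row, a minimal PostgreSQL-style sketch assuming a hypothetical employees(emp_id, name, manager_id) table:

```sql
-- Walk an org chart from the root down, tracking each employee's depth.
WITH RECURSIVE org AS (
    -- Anchor: the root of the tree (no manager)
    SELECT emp_id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive step: attach the direct reports of the previous level
    SELECT e.emp_id, e.name, e.manager_id, o.level + 1
    FROM employees e
    JOIN org o ON e.manager_id = o.emp_id
)
SELECT * FROM org
ORDER BY level, emp_id;
```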
