SQL Interview Prep — Pattern-Based Guide for Data Engineers
💡 Interview Tip
Philosophy: Don't memorize 100 answers. Master 15 patterns. Every question becomes recognizable.
Level: Medium to Hard (data engineering interview standard)
Total Questions: 90 across 15 patterns — extensible to 200+
HOW TO USE THIS PREP
📋 Overview
STEP 1: See a question → ask "what is it REALLY asking?"
STEP 2: Match to a pattern using the Decision Flowchart below
STEP 3: Apply the pattern TEMPLATE (fill in the blanks)
STEP 4: Build with CTEs (one CTE per logical step)
STEP 5: Test for edge cases (NULLs, ties, first/last row)
THE DECISION FLOWCHART — Which Pattern To Use
LOOK AT THE QUESTION. What does it ask for?
"Top N per group / highest / rank within group"
→ PATTERN 1: RANKING (DENSE_RANK, ROW_NUMBER)
"Running total / cumulative sum / month-to-date"
→ PATTERN 2: RUNNING TOTALS (SUM OVER ORDER BY)
"Previous row / next row / month-over-month / growth / streak"
→ PATTERN 3: LAG / LEAD
"Consecutive days / continuous period / streak / island"
→ PATTERN 4: GAPS & ISLANDS (date - ROW_NUMBER trick)
"Group events into sessions / inactivity window"
→ PATTERN 5: SESSIONIZATION
"Remove duplicates / keep latest / CDC dedup"
→ PATTERN 6: DEDUPLICATION (ROW_NUMBER PARTITION BY)
"Compare rows in same table / manager vs employee / hierarchy"
→ PATTERN 7: SELF-JOIN
"All combinations / every pair / Cartesian product"
→ PATTERN 8: CROSS JOIN
"Items bought together / co-occurrence / frequently paired"
→ PATTERN 9: MARKET BASKET (self-join on order_id)
"Count/sum by category in same row / pivot columns"
→ PATTERN 10: CONDITIONAL AGGREGATION (CASE WHEN)
"First time users / retention / returning customers / cohort"
→ PATTERN 11: COHORT / RETENTION
"Date difference / interval / overdue / days between events"
→ PATTERN 12: DATE ARITHMETIC
"Hierarchy / org chart / tree / all levels / recursive"
→ PATTERN 13: RECURSIVE CTE
"Median / percentile / quartile / middle value"
→ PATTERN 14: MEDIAN / PERCENTILE
"Funnel / conversion / drop-off / step-by-step users"
→ PATTERN 15: FUNNEL ANALYSIS
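To make the flowchart concrete, here is Pattern 1 end to end: "top 2 products per category" recognized as a ranking problem and solved with DENSE_RANK. A minimal runnable sketch using SQLite (3.25+ for window functions) via Python; the `sales` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (category TEXT, product TEXT, revenue INT);
INSERT INTO sales VALUES
  ('toys', 'robot', 300), ('toys', 'ball', 300), ('toys', 'kite', 100),
  ('books', 'atlas', 500), ('books', 'novel', 200);
""")

# "Top 2 per group" -> Pattern 1: rank within each category, then filter.
rows = conn.execute("""
WITH ranked AS (
  SELECT category, product, revenue,
         DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM sales
)
SELECT category, product, revenue
FROM ranked
WHERE rnk <= 2
ORDER BY category, revenue DESC, product;
""").fetchall()

print(rows)
```

Note the tie: 'robot' and 'ball' share rank 1 under DENSE_RANK, so the toys group returns three rows. With ROW_NUMBER you would get exactly two — that is the edge case Step 5 of the solving template asks you to verbalize.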
FILE STRUCTURE
| File | Patterns Covered | Questions |
|---|---|---|
SQL_00_PLAN.md | This file — flowchart + overview | — |
SQL_01_Window_Functions.md | P1: Ranking, P2: Running Totals, P3: LAG/LEAD | Q01–Q24 |
SQL_02_Hard_Patterns.md | P4: Gaps & Islands, P5: Sessionization, P6: Deduplication | Q25–Q41 |
SQL_03_Join_Patterns.md | P7: Self-Join, P8: Cross Join, P9: Market Basket | Q42–Q56 |
SQL_04_Aggregation_Cohort.md | P10: Conditional Agg, P11: Cohort/Retention, P12: Date Arithmetic | Q57–Q73 |
SQL_05_Advanced.md | P13: Recursive CTE, P14: Median/Percentile, P15: Funnel | Q74–Q90 |
SQL_06_Quick_Recall.md | All templates + ultra cheat sheet + decision card | All |
SQL_QUESTION_BANK.md | Master list of ALL 90 questions + space to add yours | All |
PATTERN DIFFICULTY HEAT MAP
EASIER ──────────────────────────────── HARDER
P6: Dedup P1: Ranking P3: LAG/LEAD P4: Gaps&Islands
P12: DateArith P10: CaseWhen P2: Running P5: Sessionize
P8: CrossJoin P7: SelfJoin P11: Cohort P13: Recursive
P14: Median P9: MarketBask P15: Funnel
THE 5-STEP SOLVING TEMPLATE (use every time)
```sql
-- STEP 1: Understand the grain
-- "What does ONE ROW in the output represent?"
-- Example: one row = one product per category

-- STEP 2: Build intermediate CTEs
WITH raw_data AS (
    -- Clean or filter source data
    SELECT ...
    FROM source_table
    WHERE conditions
),
intermediate AS (
    -- Apply window function or join
    SELECT
        ...,
        ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY sort_col) AS rn
    FROM raw_data
)

-- STEP 3: Filter to final result
SELECT ...
FROM intermediate
WHERE rn = 1;  -- or whatever condition applies

-- STEP 4: Check edge cases in your head:
-- - What if there are NULL values in partition/order columns?
-- - What if a group has only 1 row? (top-2 query returns 1 row per group — is that OK?)
-- - Are ties handled correctly? (RANK vs DENSE_RANK)
-- - What if the date column has gaps?

-- STEP 5: Verbalize to interviewer:
-- "First I'm partitioning by X because each group needs its own ranking..."
-- "I chose DENSE_RANK over RANK because I want ties to share the same rank..."
```
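The template above, instantiated end to end on SQLite via Python: keep the latest event per user (a deduplication, rn = 1). The grain of the output is one row per user. The `user_events` table and its CDC-style rows are toy data invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_events (user_id INT, event_ts TEXT, status TEXT);
INSERT INTO user_events VALUES
  (1, '2024-01-01', 'signup'),
  (1, '2024-01-05', 'upgrade'),
  (2, '2024-01-03', 'signup');
""")

rows = conn.execute("""
-- STEP 1: grain = one row per user (their latest event)
WITH raw_data AS (
  SELECT user_id, event_ts, status
  FROM user_events
),
intermediate AS (
  -- STEP 2: rank events newest-first within each user
  SELECT user_id, event_ts, status,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
  FROM raw_data
)
-- STEP 3: keep only the latest event per user
SELECT user_id, event_ts, status
FROM intermediate
WHERE rn = 1
ORDER BY user_id;
""").fetchall()

print(rows)
```

Step 4 in action: user 2 has only one event, and the pattern still works — that single row gets rn = 1 and survives the filter.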
WHAT INTERVIEWERS REALLY EVALUATE
1. PATTERN RECOGNITION (fastest signal)
Do you immediately see "this is a gaps-and-islands" or do you start
writing random JOINs hoping something works?
2. CTE HYGIENE
Do you break complex logic into named CTEs?
Good engineers write readable SQL, not one 30-line mega-query.
3. EDGE CASE AWARENESS
"What if there are NULLs in the join key?"
"What if a user has only one purchase — does your LAG() still work?"
4. PERFORMANCE THINKING (senior-level differentiator)
"On 10 billion rows in Snowflake/BigQuery, I'd partition the underlying
table by date to support partition pruning on this query..."
5. TOOL AWARENESS
Know when to say: "In Hive I'd use DISTRIBUTE BY + SORT BY instead of
ORDER BY for this window function to avoid data movement to one reducer."
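The edge case from point 3 is worth seeing once: LAG() on a user with a single purchase does not fail — it returns NULL (None in Python) for the missing previous row. A quick SQLite check with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (user_id INT, amount INT, ts TEXT);
INSERT INTO purchases VALUES
  (1, 10, '2024-01-01'),
  (1, 25, '2024-02-01'),
  (2, 40, '2024-01-15');  -- user 2 has exactly one purchase
""")

rows = conn.execute("""
SELECT user_id, amount,
       LAG(amount) OVER (PARTITION BY user_id ORDER BY ts) AS prev_amount
FROM purchases
ORDER BY user_id, ts;
""").fetchall()

print(rows)
```

If your downstream logic computes `amount - prev_amount`, decide explicitly whether that NULL should propagate or be COALESCEd to 0 — and say so to the interviewer.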
SYNTAX CHEAT SHEET — Standard SQL vs Hive SQL vs Spark SQL
| Operation | Standard SQL / PostgreSQL | Hive SQL | Spark SQL |
|---|---|---|---|
| Date difference | end_date - start_date (Postgres) / DATE_DIFF (BigQuery) | DATEDIFF(end,start) | DATEDIFF(end,start) |
| Truncate to month | DATE_TRUNC('month', date) | TRUNC(date,'MM') | DATE_TRUNC('month',date) |
| Add days | date + INTERVAL '7 days' | DATE_ADD(date, 7) | DATE_ADD(date, 7) |
| String concat | 'a' \|\| 'b' | CONCAT('a','b') | CONCAT('a','b') or 'a' \|\| 'b' |
| Recursive CTE | WITH RECURSIVE cte AS... | NOT SUPPORTED* | Supported (Spark 3+) |
| Percentile | PERCENTILE_CONT(0.5) ... | PERCENTILE(col,.5) | PERCENTILE_CONT(0.5)... |
| Extract year | EXTRACT(YEAR FROM date) | YEAR(date) | YEAR(date) or EXTRACT |
| Row number | ROW_NUMBER() OVER(...) | ROW_NUMBER() OVER | ROW_NUMBER() OVER |
| LAG/LEAD | LAG(col,1) OVER(...) | LAG(col,1) OVER | LAG(col,1) OVER |
* Hive doesn't support recursive CTEs — pre-materialize the hierarchy with one self-join per level, or run the query in Spark instead
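For the engines that do support it (standard SQL / PostgreSQL / Spark 3+), the recursive CTE from Pattern 13 looks like this — shown here on SQLite, which also supports WITH RECURSIVE. The org chart is toy data invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES
  (1, 'ceo', NULL), (2, 'vp', 1), (3, 'eng', 2), (4, 'eng2', 2);
""")

rows = conn.execute("""
WITH RECURSIVE chain AS (
  -- anchor: the root of the tree (no manager)
  SELECT id, name, 0 AS depth FROM employees WHERE manager_id IS NULL
  UNION ALL
  -- recursive step: attach direct reports, one level per iteration
  SELECT e.id, e.name, c.depth + 1
  FROM employees e
  JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, name;
""").fetchall()

print(rows)
```

In Hive the same result needs N self-joins for a hierarchy of depth N, which is why "all levels, unknown depth" questions usually signal a recursive CTE engine.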