SQL Patterns 13–15: Recursive CTE, Median/Percentile, Funnel Analysis
The advanced patterns. Recursive CTEs separate good candidates from great ones, and funnel analysis comes up in almost every product-focused data engineering interview.
🧠 MEMORY MAP
🧠 ADVANCED PATTERNS = "RMF"
R — Recursive CTE: anchor + recursive step (trees, hierarchies, date spines)
M — Median/Percentile: ROW_NUMBER trick OR PERCENTILE_CONT (built-in)
F — Funnel: COUNT DISTINCT at each stage → conversion rate between stages
RECURSIVE CTE STRUCTURE
WITH RECURSIVE cte AS (
    SELECT ...                          ← ANCHOR: the starting point (root / first row)
    UNION ALL
    SELECT ... FROM cte JOIN table ...  ← RECURSIVE: add one more level
    WHERE stopping_condition            ← TERMINATION: when to stop
)
SELECT * FROM cte;
MEDIAN TRICK (no built-in MEDIAN)
"The median is the middle value"
→ Count total rows (N)
→ Assign ROW_NUMBER to each row ordered by value
→ Keep row where rn = (N+1)/2 (odd N)
→ Or AVG of rows where rn IN (N/2, N/2+1) (even N)
→ PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) is simpler if available!
FUNNEL PATTERN
Define stages: signup → verify → first_purchase
For each stage: COUNT(DISTINCT user_id) who reached that stage
Conversion: next_stage_count / this_stage_count × 100
PATTERN 13: RECURSIVE CTE
What Is It?
A CTE that calls itself repeatedly, adding one "level" per iteration. Used for: org chart traversal, category trees, date spine generation, dependency chains, graph reachability.
Recognize It When You See:
- "All employees under manager X (direct AND indirect)"
- "Full category hierarchy / subcategories recursively"
- "Generate a sequence of dates / numbers"
- "Task A depends on task B depends on task C — full chain"
- "All cities reachable from city X via connections"
THE TEMPLATE
```sql
-- TEMPLATE: Recursive CTE
WITH RECURSIVE cte_name AS (
    -- ANCHOR MEMBER: base case (starting point)
    -- This runs ONCE and returns the initial rows
    SELECT
        id,
        parent_id,
        name,
        1 AS level,          -- depth tracker
        name AS path         -- path tracker (optional)
    FROM hierarchy_table
    WHERE parent_id IS NULL  -- ← starting condition (root nodes, or a specific ID)

    UNION ALL

    -- RECURSIVE MEMBER: add one more level
    -- This references cte_name itself!
    SELECT
        h.id,
        h.parent_id,
        h.name,
        cte.level + 1,
        cte.path || ' > ' || h.name
    FROM hierarchy_table h
    JOIN cte_name cte
      ON h.parent_id = cte.id  -- each iteration: find children of current level
    WHERE cte.level < 10       -- ← TERMINATION CONDITION (prevent infinite loops!)
)
SELECT * FROM cte_name
ORDER BY path;

-- ⚠️ ALWAYS add a termination condition (level < N or a sentinel value)
-- Without one, a cycle in the data causes an infinite loop!
```
SOLVED EXAMPLE
Problem:
Table: employees (employee_id, name, manager_id)
Find ALL employees who report to manager 5 — direct AND indirect at all levels.
Return: employee_id, name, level (how many levels below manager 5)
```sql
WITH RECURSIVE org_tree AS (
    -- ANCHOR: direct reports of manager 5
    SELECT
        employee_id,
        name,
        manager_id,
        1 AS level_below_manager
    FROM employees
    WHERE manager_id = 5

    UNION ALL

    -- RECURSIVE: find reports of the reports
    SELECT
        e.employee_id,
        e.name,
        e.manager_id,
        ot.level_below_manager + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.employee_id
    WHERE ot.level_below_manager < 20  -- safety limit (company depth)
)
SELECT employee_id, name, level_below_manager
FROM org_tree
ORDER BY level_below_manager, name;
```
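To see the anchor/recursive interplay end to end, here is a minimal runnable sketch using Python's built-in `sqlite3` (SQLite supports `WITH RECURSIVE`). The employee rows are invented sample data: employee 5 manages 6 and 7, and 7 manages 8.

```python
import sqlite3

# Hypothetical org chart: 5 manages 6 and 7; 7 manages 8.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (employee_id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES
  (5, 'Eve',   1),
  (6, 'Frank', 5),
  (7, 'Grace', 5),
  (8, 'Heidi', 7);
""")
rows = con.execute("""
WITH RECURSIVE org_tree AS (
    -- ANCHOR: direct reports of manager 5
    SELECT employee_id, name, manager_id, 1 AS level_below_manager
    FROM employees
    WHERE manager_id = 5
    UNION ALL
    -- RECURSIVE: reports of the reports
    SELECT e.employee_id, e.name, e.manager_id, ot.level_below_manager + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.employee_id
    WHERE ot.level_below_manager < 20
)
SELECT employee_id, name, level_below_manager
FROM org_tree
ORDER BY level_below_manager, name
""").fetchall()
print(rows)  # [(6, 'Frank', 1), (7, 'Grace', 1), (8, 'Heidi', 2)]
```

Heidi never reports to 5 directly, yet she appears at level 2 — exactly what the recursive member adds on the second iteration.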
SOLVED EXAMPLE — Q78: Generate date spine 2024-01-01 to 2024-12-31
Problem:
Generate a table of all dates in 2024.
Use for LEFT JOIN to fill in missing dates in reporting (zero-fill).
```sql
-- PostgreSQL: use generate_series (simplest)
SELECT generate_series(
    '2024-01-01'::date,
    '2024-12-31'::date,
    '1 day'::interval
)::date AS date_val;

-- Spark SQL: use a recursive CTE (plain Hive lacks WITH RECURSIVE — see traps below)
WITH RECURSIVE date_spine AS (
    -- ANCHOR: start date
    SELECT DATE '2024-01-01' AS date_val
    UNION ALL
    -- RECURSIVE: add 1 day each iteration
    SELECT DATE_ADD(date_val, 1)
    FROM date_spine
    WHERE date_val < DATE '2024-12-31'  -- TERMINATION: stop at end date
)
SELECT date_val FROM date_spine;

-- Usage: fill gaps in a daily sales report
SELECT
    d.date_val,
    p.product_id,
    COALESCE(s.revenue, 0) AS revenue
FROM date_spine d
CROSS JOIN (SELECT DISTINCT product_id FROM products) p
LEFT JOIN daily_sales s
    ON d.date_val = s.sale_date
    AND p.product_id = s.product_id
ORDER BY d.date_val, p.product_id;
```
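The same spine is easy to verify locally: a minimal `sqlite3` sketch (SQLite has no `DATE_ADD`, so it uses `date(d, '+1 day')` instead — dialect differences like this are exactly what interviewers probe).

```python
import sqlite3

con = sqlite3.connect(":memory:")
summary = con.execute("""
WITH RECURSIVE date_spine(date_val) AS (
    SELECT '2024-01-01'                 -- ANCHOR: start date
    UNION ALL
    SELECT date(date_val, '+1 day')     -- RECURSIVE: add one day
    FROM date_spine
    WHERE date_val < '2024-12-31'       -- TERMINATION: stop at end date
)
SELECT COUNT(*), MIN(date_val), MAX(date_val) FROM date_spine
""").fetchone()
print(summary)  # (366, '2024-01-01', '2024-12-31') — 2024 is a leap year
```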
PATTERN 13 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q74 | All employees under manager_id=5 (all levels) | Anchor = direct reports, recursive = reports of reports |
| Q75 | Total subordinates (direct + indirect) per manager | Recursive + COUNT(*) in outer query |
| Q76 | Product categories + all subcategories recursively | Anchor = top-level categories (parent_id IS NULL) |
| Q77 | Full task dependency chain | Anchor = the target task, recursive = its dependencies |
| Q78 | Generate date spine 2024-01-01 to 2024-12-31 | Anchor = start date, recursive = date + 1 day |
| Q79 | All cities reachable from origin (multi-hop flights) | Anchor = direct destinations, recursive = reachable from those |
PATTERN 14: MEDIAN / PERCENTILE
What Is It?
Find the middle value of a distribution (median) or a specific percentile (90th, 95th). Built-in functions vary by database — know both the function AND the manual approach.
Recognize It When You See:
- "Median [salary/value/time]"
- "Middle value"
- "Percentile / quartile / decile"
- "P50, P90, P95, P99"
- "Divide users into quartiles / buckets"
APPROACH 1: Built-in Functions (when available)
```sql
-- PERCENTILE_CONT: continuous (interpolates between values — ideal for medians)
SELECT
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY salary) AS median_salary,
    PERCENTILE_CONT(0.9)  WITHIN GROUP (ORDER BY salary) AS p90_salary,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS q1_salary
FROM employees;
-- Availability: PostgreSQL ✓, Snowflake ✓, Spark SQL ✓; Hive only has PERCENTILE_APPROX

-- PERCENTILE_DISC: discrete (returns an actual value from the dataset, not interpolated)
SELECT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) AS exact_median
FROM employees;

-- NTILE: divide rows into N equal buckets
SELECT
    user_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend) AS quartile  -- 1 = lowest, 4 = highest
FROM user_spending;

-- Hive approximation:
SELECT PERCENTILE_APPROX(salary, 0.5) AS approx_median FROM employees;
```
APPROACH 2: Manual Median (when no built-in — important interview skill!)
```sql
-- MANUAL MEDIAN using ROW_NUMBER
-- Works in any database with window functions, including recent Hive
WITH ordered AS (
    SELECT
        value_col,
        ROW_NUMBER() OVER (ORDER BY value_col) AS rn,
        COUNT(*) OVER () AS total_count  -- total rows (same for every row)
    FROM source_table
)
SELECT AVG(value_col) AS median_value
FROM ordered
WHERE
    rn = (total_count + 1) / 2     -- middle row for odd count
    OR rn = (total_count + 2) / 2; -- second middle row for even count (AVG of the two)

-- Example (integer division):
-- N=5 (odd):  (5+1)/2 = 3, (5+2)/2 = 3 → same row → just 1 value
-- N=6 (even): (6+1)/2 = 3, (6+2)/2 = 4 → rows 3 and 4 → AVG of two values
```
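A quick `sqlite3` sketch of the ROW_NUMBER trick on invented data (six values, so the even-count branch fires and the two middle rows are averaged):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source_table (value_col REAL)")
con.executemany("INSERT INTO source_table VALUES (?)",
                [(v,) for v in [10, 20, 30, 40, 50, 60]])  # N=6 (even)
median = con.execute("""
WITH ordered AS (
    SELECT value_col,
           ROW_NUMBER() OVER (ORDER BY value_col) AS rn,
           COUNT(*) OVER () AS total_count
    FROM source_table
)
SELECT AVG(value_col) FROM ordered
WHERE rn IN ((total_count + 1) / 2, (total_count + 2) / 2)
""").fetchone()[0]
print(median)  # 35.0 — the AVG of rows 3 and 4 (30 and 40)
```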
SOLVED EXAMPLE — Q80 (Google Hard): Median from a FREQUENCY TABLE
Problem:
Table: search_frequency (searches, num_users)
Each row says: X users made exactly Y searches.
(This is a COMPRESSED representation — not individual rows!)
Find the median number of searches per user.
Example:
searches | num_users
   1     |    2       ← 2 users made 1 search each
   2     |    3       ← 3 users made 2 searches each
   3     |    1       ← 1 user made 3 searches
Total: 6 users → sorted list [1,1,2,2,2,3] → median = (2+2)/2 = 2 (3rd and 4th values)
```sql
-- KEY INSIGHT: can't use a simple ROW_NUMBER because the data is compressed!
-- Must EXPAND the frequency table first OR use cumulative counts
WITH cumulative AS (
    SELECT
        searches,
        num_users,
        -- Cumulative user count up to and including this row
        SUM(num_users) OVER (ORDER BY searches
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_count,
        -- Cumulative user count BEFORE this row
        SUM(num_users) OVER (ORDER BY searches
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS cum_count_before,
        SUM(num_users) OVER () AS total_users
    FROM search_frequency
)
SELECT ROUND(AVG(searches * 1.0), 1) AS median_searches
FROM cumulative
WHERE
    -- Each row covers sorted positions (cum_count_before, cum_count].
    -- The median positions are (N+1)/2 and (N+2)/2 (integer division),
    -- so keep every row that covers at least one of them:
    COALESCE(cum_count_before, 0) < (total_users + 2) / 2
    AND cum_count >= (total_users + 1) / 2;
-- Odd N: both positions coincide → one row → one value.
-- Even N: positions N/2 and N/2+1 → one or two rows → AVG blends the two values.
```
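The tricky case is an even total where the two middle values fall in different frequency rows. A minimal `sqlite3` sketch with invented data — [1,1,1,2,2,2], true median 1.5 — checks exactly that (here `cum_before` is derived as the running sum minus the current row, which avoids the NULL on the first row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE search_frequency (searches INT, num_users INT)")
# Compressed form of [1,1,1,2,2,2]: median should be (1+2)/2 = 1.5
con.executemany("INSERT INTO search_frequency VALUES (?, ?)", [(1, 3), (2, 3)])
median = con.execute("""
WITH cumulative AS (
    SELECT searches,
           SUM(num_users) OVER (ORDER BY searches) AS cum_count,
           SUM(num_users) OVER (ORDER BY searches) - num_users AS cum_before,
           SUM(num_users) OVER () AS total_users
    FROM search_frequency
)
SELECT AVG(searches * 1.0) FROM cumulative
WHERE cum_before < (total_users + 2) / 2   -- row covers median position (N+2)/2 ...
  AND cum_count >= (total_users + 1) / 2   -- ... or median position (N+1)/2
""").fetchone()[0]
print(median)  # 1.5
```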
SOLVED EXAMPLE — Q83: Divide users into quartiles by total spend
Problem:
Table: user_orders (user_id, order_date, amount)
Divide users into 4 quartiles based on their total spend in 2023.
Return: user_id, total_spend, quartile (1=lowest, 4=highest)
```sql
WITH user_totals AS (
    SELECT
        user_id,
        SUM(amount) AS total_spend
    FROM user_orders
    WHERE YEAR(order_date) = 2023
    GROUP BY user_id
)
SELECT
    user_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend ASC) AS quartile
    -- quartile 1 = lowest spenders, quartile 4 = highest spenders
FROM user_totals
ORDER BY total_spend;
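NTILE's bucketing is easy to sanity-check with `sqlite3` on made-up totals (eight users, so each quartile gets exactly two):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_totals (user_id INT, total_spend REAL)")
con.executemany("INSERT INTO user_totals VALUES (?, ?)",
                [(i, i * 100.0) for i in range(1, 9)])  # spends 100..800
rows = con.execute("""
SELECT user_id, total_spend,
       NTILE(4) OVER (ORDER BY total_spend) AS quartile
FROM user_totals
ORDER BY total_spend
""").fetchall()
print(rows[0], rows[-1])  # (1, 100.0, 1) (8, 800.0, 4)
```

With a row count that isn't divisible by 4, NTILE puts the extra rows in the earlier buckets — worth mentioning in an interview.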
PATTERN 14 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q80 | Median searches from frequency table (Google Hard) | Expand compressed freq table using cumulative SUM |
| Q81 | Median salary per department | ROW_NUMBER per dept + filter middle row(s) |
| Q82 | 90th percentile response time per API endpoint | PERCENTILE_CONT(0.90) or ROW_NUMBER approach |
| Q83 | Divide users into 4 quartiles by spend | NTILE(4) OVER (ORDER BY total_spend) |
| Q84 | Median days between 1st and 2nd purchase | Compute day_diff per user, then apply median pattern |
PATTERN 15: FUNNEL ANALYSIS
What Is It?
Track how many users pass through each stage of a defined sequence. Calculate conversion rates between stages and identify where users drop off.
Recognize It When You See:
- "Conversion funnel / conversion rate"
- "How many users reached step X"
- "Drop-off rate at each stage"
- "Signup → verification → first purchase"
- "Impression → click → purchase"
THE TEMPLATE
```sql
-- TEMPLATE: Funnel Analysis (non-strict)
-- A user who reached a later stage also counts toward every earlier stage
WITH user_stages AS (
    SELECT
        user_id,
        -- For each stage: 1 if the user ever performed this action, else 0
        MAX(CASE WHEN event_type = 'signup' THEN 1 ELSE 0 END) AS did_signup,
        MAX(CASE WHEN event_type = 'email_verified' THEN 1 ELSE 0 END) AS did_verify,
        MAX(CASE WHEN event_type = 'first_purchase' THEN 1 ELSE 0 END) AS did_purchase
    FROM events
    GROUP BY user_id
),
funnel_counts AS (
    SELECT
        -- Count users who reached each stage
        COUNT(*) AS total_users,
        SUM(did_signup) AS signup_count,
        SUM(did_verify) AS verify_count,
        SUM(did_purchase) AS purchase_count
    FROM user_stages
)
SELECT
    signup_count,
    verify_count,
    ROUND(100.0 * verify_count / NULLIF(signup_count, 0), 2) AS signup_to_verify_pct,
    purchase_count,
    ROUND(100.0 * purchase_count / NULLIF(verify_count, 0), 2) AS verify_to_purchase_pct,
    ROUND(100.0 * purchase_count / NULLIF(signup_count, 0), 2) AS overall_conversion_pct
FROM funnel_counts;
```
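A minimal `sqlite3` run of the template on invented events (4 signups, 2 verifications, 1 purchase) shows the per-user MAX collapsing repeated events correctly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INT, event_type TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, 'signup'), (1, 'email_verified'), (1, 'first_purchase'),
    (2, 'signup'), (2, 'email_verified'),
    (3, 'signup'),
    (4, 'signup'),
])
row = con.execute("""
WITH user_stages AS (
    SELECT user_id,
           MAX(CASE WHEN event_type = 'signup' THEN 1 ELSE 0 END) AS did_signup,
           MAX(CASE WHEN event_type = 'email_verified' THEN 1 ELSE 0 END) AS did_verify,
           MAX(CASE WHEN event_type = 'first_purchase' THEN 1 ELSE 0 END) AS did_purchase
    FROM events
    GROUP BY user_id
)
SELECT SUM(did_signup), SUM(did_verify), SUM(did_purchase),
       ROUND(100.0 * SUM(did_verify) / SUM(did_signup), 2),    -- signup → verify
       ROUND(100.0 * SUM(did_purchase) / SUM(did_verify), 2)   -- verify → purchase
FROM user_stages
""").fetchone()
print(row)  # (4, 2, 1, 50.0, 50.0)
```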
SOLVED EXAMPLE — Q85 (TikTok/Stripe): Count users at each funnel stage + conversion rates
Problem:
Table: events (user_id, event_type, event_date)
event_type values: 'signup', 'email_verified', 'profile_completed', 'first_purchase'
For each stage: count distinct users who reached it.
Also compute step-to-step conversion rate.
```sql
WITH user_funnel AS (
    SELECT
        user_id,
        MAX(CASE WHEN event_type = 'signup' THEN 1 ELSE 0 END) AS reached_signup,
        MAX(CASE WHEN event_type = 'email_verified' THEN 1 ELSE 0 END) AS reached_verify,
        MAX(CASE WHEN event_type = 'profile_completed' THEN 1 ELSE 0 END) AS reached_profile,
        MAX(CASE WHEN event_type = 'first_purchase' THEN 1 ELSE 0 END) AS reached_purchase
    FROM events
    GROUP BY user_id
),
stage_counts AS (
    SELECT
        SUM(reached_signup) AS signup_users,
        SUM(reached_verify) AS verify_users,
        SUM(reached_profile) AS profile_users,
        SUM(reached_purchase) AS purchase_users
    FROM user_funnel
)
SELECT 'Stage 1: Signup' AS stage, signup_users AS users, 100.0 AS conversion_from_prev
FROM stage_counts
UNION ALL
SELECT 'Stage 2: Email Verified', verify_users,
       ROUND(100.0 * verify_users / NULLIF(signup_users, 0), 2)
FROM stage_counts
UNION ALL
SELECT 'Stage 3: Profile Completed', profile_users,
       ROUND(100.0 * profile_users / NULLIF(verify_users, 0), 2)
FROM stage_counts
UNION ALL
SELECT 'Stage 4: First Purchase', purchase_users,
       ROUND(100.0 * purchase_users / NULLIF(profile_users, 0), 2)
FROM stage_counts;
```
SOLVED EXAMPLE — Q86 (TikTok): Signup-to-activation rate
Problem:
Table: emails (email_id, user_id, action, date)
action values: 'sent' (signup triggered), 'open' (user opened), 'answered' (activated)
Find activation rate = users who 'answered' / total users (2 decimal places)
```sql
-- Count each user at most once, even if they answered multiple emails
SELECT
    ROUND(
        COUNT(DISTINCT CASE WHEN action = 'answered' THEN user_id END) * 1.0
        / COUNT(DISTINCT user_id),
    2) AS activation_rate
FROM emails;

-- More explicit version:
WITH user_status AS (
    SELECT
        user_id,
        MAX(CASE WHEN action = 'answered' THEN 1 ELSE 0 END) AS activated
    FROM emails
    GROUP BY user_id
)
SELECT ROUND(AVG(activated * 1.0), 2) AS activation_rate FROM user_status;
```
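Deduplication is the whole point here: a `sqlite3` sketch with invented data where one user answered twice (4 users, 1 activated) confirms the rate stays 0.25 instead of drifting to 0.5:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emails (user_id INT, action TEXT)")
# User 1 answered TWICE — must still count as one activated user out of 4
con.executemany("INSERT INTO emails VALUES (?, ?)", [
    (1, 'sent'), (1, 'open'), (1, 'answered'), (1, 'answered'),
    (2, 'sent'), (2, 'open'),
    (3, 'sent'),
    (4, 'sent'),
])
rate = con.execute("""
WITH user_status AS (
    SELECT user_id,
           MAX(CASE WHEN action = 'answered' THEN 1 ELSE 0 END) AS activated
    FROM emails
    GROUP BY user_id
)
SELECT ROUND(AVG(activated * 1.0), 2) FROM user_status
""").fetchone()[0]
print(rate)  # 0.25
```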
PATTERN 15 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q85 | Count users at each funnel stage + conversion rates | MAX(CASE WHEN event=stage) per user, then aggregate |
| Q86 | Signup-to-activation rate | COUNT(answered) / COUNT(signed_up) |
| Q87 | Drop-off rate at each onboarding step | Funnel template + find biggest % drop between consecutive stages |
| Q88 | CTR-to-conversion per ad campaign | Impressions → clicks → purchases (3-stage funnel) |
| Q89 | Funnel step with highest drop-off (first 7 days) | Compute all stage counts, then find MAX(step_n - step_n+1) |
| Q90 | Free trial → paid conversion across monthly cohorts | Cohort = signup_month + funnel (trial → paid) per cohort |
⚠️ COMMON TRAPS IN ADVANCED PATTERNS
🧠 Memory Map
RECURSIVE CTE TRAPS
TRAP 1: No termination condition → infinite loop
Always add one: e.g. WHERE level < 20, or stop when a sentinel column IS NULL
Some databases cap recursion depth by default (e.g. SQL Server's MAXRECURSION defaults to 100), but many don't — never rely on it
TRAP 2: UNION ALL vs UNION in recursive CTE
Must use UNION ALL (not UNION) in recursive CTEs
UNION would deduplicate across iterations → wrong for hierarchies!
TRAP 3: Hive doesn't support WITH RECURSIVE
In Hive: use alternative approaches (multiple queries or Spark SQL for recursive)
In interview: say "Hive doesn't support recursive CTEs — I'd use Spark SQL instead"
MEDIAN TRAPS
TRAP 4: Simple AVG ≠ Median
AVG(salary) = arithmetic average (pulled by outliers)
MEDIAN(salary) = 50th percentile (middle value, resistant to outliers)
Interviewers ask for MEDIAN specifically to test this distinction
TRAP 5: Even vs Odd count
Odd count (N=5): median = row 3 → one row
Even count (N=6): median = average of rows 3 and 4 → TWO rows, then AVG
ROW_NUMBER trick: WHERE rn IN ((N+1)/2, (N+2)/2) handles BOTH cases with AVG
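The odd/even arithmetic is worth internalizing; a few lines of Python confirm that the integer-division pair picks one row for odd N and the two middle rows for even N:

```python
# rn IN ((N+1)//2, (N+2)//2) with integer division, for several row counts:
for n in (5, 6, 7, 8):
    rows = sorted({(n + 1) // 2, (n + 2) // 2})
    print(n, rows)
# 5 [3]      -> one middle row
# 6 [3, 4]   -> two middle rows, averaged
# 7 [4]
# 8 [4, 5]
```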
FUNNEL TRAPS
TRAP 6: User can appear in multiple stages
A user who completed 'first_purchase' ALSO did 'signup' and 'email_verified'
Use MAX(CASE WHEN event=stage THEN 1 ELSE 0 END) per user
NOT COUNT(CASE WHEN event=stage) — that counts events, not unique users!
TRAP 7: Strict funnel vs non-strict
Strict: user must complete stages IN ORDER (signup before purchase)
Non-strict: just check if user ever performed each action
Most interview questions are non-strict unless explicitly stated
For strict funnel: add timestamp comparison in JOIN conditions