SQL Patterns 13–15: Recursive CTE, Median/Percentile, Funnel Analysis

SQL · Section 6 of 9

The advanced patterns. Recursive CTEs separate good candidates from great ones. Funnel analysis comes up in almost every product-focused data engineering interview.

🧠 MEMORY MAP

🧠 ADVANCED PATTERNS = "RMF"
R — Recursive CTE: anchor + recursive step (trees, hierarchies, date spines)
M — Median/Percentile: ROW_NUMBER trick OR PERCENTILE_CONT (built-in)
F — Funnel: COUNT DISTINCT at each stage → conversion rate between stages
RECURSIVE CTE STRUCTURE
WITH RECURSIVE cte AS (
SELECT ... ← ANCHOR: the starting point (root / first row)
UNION ALL
SELECT ... FROM cte JOIN table ... ← RECURSIVE: add one more level
WHERE stopping_condition ← TERMINATION: when to stop
)
SELECT * FROM cte;
MEDIAN TRICK (no built-in MEDIAN)
"The median is the middle value"
→ Count total rows (N)
→ Assign ROW_NUMBER to each row ordered by value
→ Keep row where rn = (N+1)/2 (odd N)
→ Or AVG of rows where rn IN (N/2, N/2+1) (even N)
→ PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) is simpler if available!
FUNNEL PATTERN
Define stages: signup → verify → first_purchase
For each stage: COUNT(DISTINCT user_id) who reached that stage
Conversion: next_stage_count / this_stage_count × 100

PATTERN 13: RECURSIVE CTE

What Is It?

A CTE that calls itself repeatedly, adding one "level" per iteration. Used for: org chart traversal, category trees, date spine generation, dependency chains, graph reachability.

Recognize It When You See:

  • "All employees under manager X (direct AND indirect)"
  • "Full category hierarchy / subcategories recursively"
  • "Generate a sequence of dates / numbers"
  • "Task A depends on task B depends on task C — full chain"
  • "All cities reachable from city X via connections"

THE TEMPLATE

sql
-- TEMPLATE: Recursive CTE
WITH RECURSIVE cte_name AS (

    -- ANCHOR MEMBER: base case (starting point)
    -- This runs ONCE and returns the initial rows
    SELECT
        id,
        parent_id,
        name,
        1 AS level,           -- depth tracker
        name AS path          -- path tracker (optional)
    FROM hierarchy_table
    WHERE parent_id IS NULL   -- ← starting condition (root nodes, or specific ID)

    UNION ALL

    -- RECURSIVE MEMBER: add one more level
    -- This references cte_name itself!
    SELECT
        h.id,
        h.parent_id,
        h.name,
        cte.level + 1,
        cte.path || ' > ' || h.name
    FROM hierarchy_table h
    JOIN cte_name cte
        ON h.parent_id = cte.id   -- each iteration: find children of current level
    WHERE cte.level < 10          -- ← TERMINATION CONDITION (prevent infinite loops!)
)

SELECT * FROM cte_name
ORDER BY path;

-- ⚠️ ALWAYS add a termination condition (level < N OR a sentinel value)
-- Without it: infinite loop if there's a cycle in the data!
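A minimal sketch of why the termination condition matters, using Python's stdlib sqlite3 as a stand-in engine (SQLite supports recursive CTEs; the table and values are invented). The data below contains a deliberate cycle (1 → 2 → 3 → 1), and `level < 10` is the only thing that stops the recursion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hierarchy_table (id INTEGER, parent_id INTEGER, name TEXT);
    INSERT INTO hierarchy_table VALUES
        (1, 3, 'a'),   -- 1's parent is 3: a deliberate cycle 1 -> 2 -> 3 -> 1
        (2, 1, 'b'),
        (3, 2, 'c');
""")

rows = conn.execute("""
    WITH RECURSIVE cte AS (
        SELECT id, name, 1 AS level FROM hierarchy_table WHERE id = 1
        UNION ALL
        SELECT h.id, h.name, cte.level + 1
        FROM hierarchy_table h
        JOIN cte ON h.parent_id = cte.id
        WHERE cte.level < 10          -- without this, the query never terminates
    )
    SELECT id, level FROM cte ORDER BY level
""").fetchall()

print(len(rows))   # 10 rows: the cycle keeps producing rows until level hits 10
print(rows[:4])    # [(1, 1), (2, 2), (3, 3), (1, 4)] — node 1 revisited at level 4
```

Without the `WHERE cte.level < 10` guard, the cycle would make the recursion run forever (or until the engine's recursion limit, if it has one).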

SOLVED EXAMPLE

Problem:
Table: employees (employee_id, name, manager_id)
Find ALL employees who report to manager 5 — direct AND indirect at all levels.
Return: employee_id, name, level (how many levels below manager 5)
sql
WITH RECURSIVE org_tree AS (

    -- ANCHOR: direct reports of manager 5
    SELECT
        employee_id,
        name,
        manager_id,
        1 AS level_below_manager
    FROM employees
    WHERE manager_id = 5

    UNION ALL

    -- RECURSIVE: find reports of the reports
    SELECT
        e.employee_id,
        e.name,
        e.manager_id,
        ot.level_below_manager + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.employee_id
    WHERE ot.level_below_manager < 20   -- safety limit (company depth)
)

SELECT employee_id, name, level_below_manager
FROM org_tree
ORDER BY level_below_manager, name;
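A toy run of the org-tree query above via Python's stdlib sqlite3 (the employee ids and names are made up): manager 5 has direct reports 10 and 11, and 11 in turn manages 12, so the recursion must surface 12 at level 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (5,  'eve', NULL),
        (10, 'amy', 5),
        (11, 'bob', 5),
        (12, 'cal', 11);   -- indirect report: two levels below manager 5
""")

rows = conn.execute("""
    WITH RECURSIVE org_tree AS (
        SELECT employee_id, name, manager_id, 1 AS level_below_manager
        FROM employees WHERE manager_id = 5
        UNION ALL
        SELECT e.employee_id, e.name, e.manager_id, ot.level_below_manager + 1
        FROM employees e
        JOIN org_tree ot ON e.manager_id = ot.employee_id
        WHERE ot.level_below_manager < 20
    )
    SELECT employee_id, name, level_below_manager FROM org_tree
    ORDER BY level_below_manager, name
""").fetchall()

print(rows)   # [(10, 'amy', 1), (11, 'bob', 1), (12, 'cal', 2)]
```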

SOLVED EXAMPLE — Q78: Generate date spine 2024-01-01 to 2024-12-31

Problem:
Generate a table of all dates in 2024.
Use for LEFT JOIN to fill in missing dates in reporting (zero-fill).
sql
-- PostgreSQL: use generate_series (simplest)
SELECT generate_series(
    '2024-01-01'::date,
    '2024-12-31'::date,
    '1 day'::interval
)::date AS date_val;

-- Engines without generate_series: recursive CTE
-- (note: Hive has no WITH RECURSIVE — see TRAP 3; Spark SQL is the usual fallback)
WITH RECURSIVE date_spine AS (
    -- ANCHOR: start date
    SELECT DATE '2024-01-01' AS date_val

    UNION ALL

    -- RECURSIVE: add 1 day each iteration
    SELECT DATE_ADD(date_val, 1)
    FROM date_spine
    WHERE date_val < DATE '2024-12-31'   -- TERMINATION: stop at end date
)

SELECT date_val FROM date_spine;

-- Usage: fill gaps in daily sales report
-- (re-declare the date_spine CTE in this query, or materialize it as a table first)
SELECT
    d.date_val,
    p.product_id,
    COALESCE(s.revenue, 0) AS revenue
FROM date_spine d
CROSS JOIN (SELECT DISTINCT product_id FROM products) p
LEFT JOIN daily_sales s
    ON d.date_val = s.sale_date
    AND p.product_id = s.product_id
ORDER BY d.date_val, p.product_id;
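The spine-plus-zero-fill idea can be sketched end to end with Python's stdlib sqlite3 (SQLite has no generate_series by default, so the recursive form applies; the sales rows are invented and the range is shortened to four days):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (sale_date TEXT, revenue REAL);
    INSERT INTO daily_sales VALUES ('2024-01-01', 50.0), ('2024-01-03', 75.0);
""")

rows = conn.execute("""
    WITH RECURSIVE date_spine AS (
        SELECT DATE('2024-01-01') AS date_val
        UNION ALL
        SELECT DATE(date_val, '+1 day')        -- SQLite's date arithmetic
        FROM date_spine
        WHERE date_val < '2024-01-04'          -- TERMINATION: stop at end date
    )
    SELECT d.date_val, COALESCE(s.revenue, 0) AS revenue
    FROM date_spine d
    LEFT JOIN daily_sales s ON d.date_val = s.sale_date
    ORDER BY d.date_val
""").fetchall()

print(rows)   # Jan 2 and Jan 4 appear with revenue 0 — the gaps are filled
```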

PATTERN 13 QUESTIONS

| Q# | Question | Key Insight |
|----|----------|-------------|
| Q74 | All employees under manager_id=5 (all levels) | Anchor = direct reports, recursive = reports of reports |
| Q75 | Total subordinates (direct + indirect) per manager | Recursive + COUNT(*) in outer query |
| Q76 | Product categories + all subcategories recursively | Anchor = top-level categories (parent_id IS NULL) |
| Q77 | Full task dependency chain | Anchor = the target task, recursive = its dependencies |
| Q78 | Generate date spine 2024-01-01 to 2024-12-31 | Anchor = start date, recursive = date + 1 day |
| Q79 | All cities reachable from origin (multi-hop flights) | Anchor = direct destinations, recursive = reachable from those |

PATTERN 14: MEDIAN / PERCENTILE

What Is It?

Find the middle value of a distribution (median) or a specific percentile (90th, 95th). Built-in functions vary by database — know both the function AND the manual approach.

Recognize It When You See:

  • "Median [salary/value/time]"
  • "Middle value"
  • "Percentile / quartile / decile"
  • "P50, P90, P95, P99"
  • "Divide users into quartiles / buckets"

APPROACH 1: Built-in Functions (when available)

sql
-- PERCENTILE_CONT: continuous interpolation (for medians — interpolates between values)
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY salary) AS p90_salary,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS q1_salary
FROM employees;

-- Availability: PostgreSQL ✓, Snowflake ✓, Spark SQL ✓, Hive (PERCENTILE_APPROX) partial

-- PERCENTILE_DISC: discrete (returns actual value from the dataset, not interpolated)
SELECT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) AS exact_median
FROM employees;

-- NTILE: divide into N equal buckets
SELECT
    user_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend) AS quartile   -- 1=lowest, 4=highest
FROM user_spending;

-- Hive approximation:
SELECT PERCENTILE_APPROX(salary, 0.5) AS approx_median FROM employees;
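The CONT-vs-DISC distinction can be illustrated without a database, using Python's stdlib statistics module (the salary values are invented): with an even row count, PERCENTILE_CONT(0.5) interpolates between the two middle values, while PERCENTILE_DISC(0.5) returns an actual value from the data.

```python
import statistics

salaries = [40, 50, 60, 90]

# PERCENTILE_CONT(0.5) behaviour: interpolate between the two middle values
cont_median = statistics.median(salaries)       # (50 + 60) / 2

# PERCENTILE_DISC(0.5) behaviour: return a real value from the dataset
disc_median = statistics.median_low(salaries)

print(cont_median)   # 55.0 — not present in the data
print(disc_median)   # 50   — an actual salary
```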

APPROACH 2: Manual Median (when no built-in — important interview skill!)

sql
-- MANUAL MEDIAN using ROW_NUMBER
-- Works in ALL databases including Hive

WITH ordered AS (
    SELECT
        value_col,
        ROW_NUMBER() OVER (ORDER BY value_col) AS rn,
        COUNT(*) OVER () AS total_count   -- total rows (same for all rows)
    FROM source_table
)

SELECT AVG(value_col) AS median_value
FROM ordered
WHERE
    rn = (total_count + 1) / 2             -- middle row for odd count
    OR rn = (total_count + 2) / 2;         -- handles even count (avg of 2 middle rows)

-- Example:
-- N=5 (odd): middle = row 3. (5+1)/2=3, (5+2)/2=3 → same row → just 1 value
-- N=6 (even): rows 3 and 4. (6+1)/2=3 (floor), (6+2)/2=4 → two rows → AVG
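The ROW_NUMBER trick can be verified on Python's stdlib sqlite3 (window functions need SQLite ≥ 3.25, bundled with modern Python builds; the values are invented), once with an odd row count and once with an even one:

```python
import sqlite3

def manual_median(values):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_table (value_col REAL)")
    conn.executemany("INSERT INTO source_table VALUES (?)", [(v,) for v in values])
    return conn.execute("""
        WITH ordered AS (
            SELECT value_col,
                   ROW_NUMBER() OVER (ORDER BY value_col) AS rn,
                   COUNT(*) OVER () AS total_count
            FROM source_table
        )
        SELECT AVG(value_col) FROM ordered
        WHERE rn = (total_count + 1) / 2    -- integer division: middle row (odd N)
           OR rn = (total_count + 2) / 2    -- second middle row (even N)
    """).fetchone()[0]

print(manual_median([10, 20, 30, 40, 50]))      # odd N=5: row 3 -> 30.0
print(manual_median([10, 20, 30, 40, 50, 60]))  # even N=6: AVG(30, 40) -> 35.0
```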

SOLVED EXAMPLE — Q80 (Google Hard): Median from a FREQUENCY TABLE

Problem:
Table: search_frequency (searches, num_users)
Each row says: X users made exactly Y searches.
(This is a COMPRESSED representation — not individual rows!)
Find the median number of searches per user.
Example:
searches | num_users
1 | 2 ← 2 users made 1 search each
2 | 3 ← 3 users made 2 searches each
3 | 1 ← 1 user made 3 searches
Total: 6 users. Median = (2+2)/2 = 2 (3rd and 4th values in sorted list)
sql
-- KEY INSIGHT: Can't use simple ROW_NUMBER because data is compressed!
-- Must EXPAND the frequency table first OR use cumulative counts

WITH cumulative AS (
    SELECT
        searches,
        num_users,
        -- Cumulative count up to this row
        SUM(num_users) OVER (ORDER BY searches
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
            AS cum_count,
        -- Cumulative count BEFORE this row
        SUM(num_users) OVER (ORDER BY searches
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
            AS cum_count_before,
        SUM(num_users) OVER () AS total_users
    FROM search_frequency
)

SELECT ROUND(AVG(searches * 1.0), 1) AS median_searches
FROM cumulative
WHERE
    -- A row covers positions cum_before+1 .. cum_count; keep it if that range
    -- contains a median position: (N+1)/2 or (N+2)/2 (integer division)
    cum_count >= FLOOR((total_users + 1) / 2.0)
    AND COALESCE(cum_count_before, 0) < FLOOR((total_users + 2) / 2.0);
    -- Odd N: both positions coincide → one row.
    -- Even N: positions N/2 and N/2+1 → one or two rows → AVG gives the true median.
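A sketch of the frequency-table median on Python's stdlib sqlite3 (invented data; cum_before is computed as the running sum minus the current row's count to sidestep NULL handling). Each row covers positions cum_before+1 .. cum_count, and a row qualifies when that range touches a median position. Two datasets are checked: one where both middle positions land on the same value, one where they straddle two values.

```python
import sqlite3

def freq_median(freq):   # freq: list of (searches, num_users) pairs
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE search_frequency (searches INT, num_users INT)")
    conn.executemany("INSERT INTO search_frequency VALUES (?, ?)", freq)
    return conn.execute("""
        WITH cumulative AS (
            SELECT searches,
                   SUM(num_users) OVER (ORDER BY searches) AS cum_count,
                   SUM(num_users) OVER (ORDER BY searches) - num_users AS cum_before,
                   SUM(num_users) OVER () AS total_users
            FROM search_frequency
        )
        SELECT AVG(searches) FROM cumulative
        WHERE cum_count  >= (total_users + 1) / 2   -- covers the lower median position
          AND cum_before <  (total_users + 2) / 2   -- covers the upper median position
    """).fetchone()[0]

print(freq_median([(1, 2), (2, 3), (3, 1)]))   # positions 3 and 4 are both 2 -> 2.0
print(freq_median([(1, 2), (2, 1), (3, 3)]))   # positions 3,4 are 2 and 3 -> 2.5
```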

SOLVED EXAMPLE — Q83: Divide users into quartiles by total spend

Problem:
Table: user_orders (user_id, order_date, amount)
Divide users into 4 quartiles based on their total spend in 2023.
Return: user_id, total_spend, quartile (1=lowest, 4=highest)
sql
WITH user_totals AS (
    SELECT
        user_id,
        SUM(amount) AS total_spend
    FROM user_orders
    WHERE YEAR(order_date) = 2023   -- MySQL/Hive; in PostgreSQL: EXTRACT(YEAR FROM order_date) = 2023
    GROUP BY user_id
)

SELECT
    user_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend ASC) AS quartile
    -- quartile 1 = lowest spenders, quartile 4 = highest spenders
FROM user_totals
ORDER BY total_spend;
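NTILE bucketing can be checked on Python's stdlib sqlite3 (SQLite ≥ 3.25 for window functions; the eight users and spends are invented): eight rows split into four quartiles, two per bucket.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_totals (user_id INT, total_spend REAL)")
conn.executemany("INSERT INTO user_totals VALUES (?, ?)",
                 [(i, i * 10.0) for i in range(1, 9)])   # spends 10..80

rows = conn.execute("""
    SELECT user_id, NTILE(4) OVER (ORDER BY total_spend) AS quartile
    FROM user_totals ORDER BY total_spend
""").fetchall()

print(rows)   # users 1-2 -> quartile 1 (lowest spenders) ... users 7-8 -> quartile 4
```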

PATTERN 14 QUESTIONS

| Q# | Question | Key Insight |
|----|----------|-------------|
| Q80 | Median searches from frequency table (Google Hard) | Expand compressed freq table using cumulative SUM |
| Q81 | Median salary per department | ROW_NUMBER per dept + filter middle row(s) |
| Q82 | 90th percentile response time per API endpoint | PERCENTILE_CONT(0.90) or ROW_NUMBER approach |
| Q83 | Divide users into 4 quartiles by spend | NTILE(4) OVER (ORDER BY total_spend) |
| Q84 | Median days between 1st and 2nd purchase | Compute day_diff per user, then apply median pattern |

PATTERN 15: FUNNEL ANALYSIS

What Is It?

Track how many users pass through each stage of a defined sequence. Calculate conversion rates between stages and identify where users drop off.

Recognize It When You See:

  • "Conversion funnel / conversion rate"
  • "How many users reached step X"
  • "Drop-off rate at each stage"
  • "Signup → verification → first purchase"
  • "Impression → click → purchase"

THE TEMPLATE

sql
-- TEMPLATE: Funnel Analysis
-- Non-strict funnel: a user counts in every stage they ever reached (see TRAP 7 for strict ordering)

WITH user_stages AS (
    SELECT
        user_id,
        -- For each stage: 1 if user performed this action, else 0
        MAX(CASE WHEN event_type = 'signup' THEN 1 ELSE 0 END)        AS did_signup,
        MAX(CASE WHEN event_type = 'email_verified' THEN 1 ELSE 0 END) AS did_verify,
        MAX(CASE WHEN event_type = 'first_purchase' THEN 1 ELSE 0 END) AS did_purchase
    FROM events
    GROUP BY user_id
),

funnel_counts AS (
    SELECT
        -- COUNT users who reached each stage
        COUNT(*) AS total_users,
        SUM(did_signup) AS signup_count,
        SUM(did_verify) AS verify_count,
        SUM(did_purchase) AS purchase_count
    FROM user_stages
)

SELECT
    signup_count,
    verify_count,
    ROUND(100.0 * verify_count / NULLIF(signup_count, 0), 2)   AS signup_to_verify_pct,
    purchase_count,
    ROUND(100.0 * purchase_count / NULLIF(verify_count, 0), 2) AS verify_to_purchase_pct,
    ROUND(100.0 * purchase_count / NULLIF(signup_count, 0), 2) AS overall_conversion_pct
FROM funnel_counts;
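The template can be exercised on toy data with Python's stdlib sqlite3 (users and events are invented): 4 users sign up, 3 verify, 1 purchases, so the expected conversions are 75% and 33.33%.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INT, event_type TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, 'signup'), (1, 'email_verified'), (1, 'first_purchase'),
    (2, 'signup'), (2, 'email_verified'),
    (3, 'signup'), (3, 'email_verified'),
    (4, 'signup'),
])

row = conn.execute("""
    WITH user_stages AS (
        SELECT user_id,
               MAX(CASE WHEN event_type = 'signup' THEN 1 ELSE 0 END) AS did_signup,
               MAX(CASE WHEN event_type = 'email_verified' THEN 1 ELSE 0 END) AS did_verify,
               MAX(CASE WHEN event_type = 'first_purchase' THEN 1 ELSE 0 END) AS did_purchase
        FROM events GROUP BY user_id
    )
    SELECT SUM(did_signup), SUM(did_verify), SUM(did_purchase),
           ROUND(100.0 * SUM(did_verify) / SUM(did_signup), 2),
           ROUND(100.0 * SUM(did_purchase) / SUM(did_verify), 2)
    FROM user_stages
""").fetchone()

print(row)   # (4, 3, 1, 75.0, 33.33)
```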

SOLVED EXAMPLE — Q85 (TikTok/Stripe): Count users at each funnel stage + conversion rates

Problem:
Table: events (user_id, event_type, event_date)
event_type values: 'signup', 'email_verified', 'profile_completed', 'first_purchase'
For each stage: count distinct users who reached it.
Also compute step-to-step conversion rate.
sql
WITH user_funnel AS (
    SELECT
        user_id,
        MAX(CASE WHEN event_type = 'signup' THEN 1 ELSE 0 END)             AS reached_signup,
        MAX(CASE WHEN event_type = 'email_verified' THEN 1 ELSE 0 END)      AS reached_verify,
        MAX(CASE WHEN event_type = 'profile_completed' THEN 1 ELSE 0 END)   AS reached_profile,
        MAX(CASE WHEN event_type = 'first_purchase' THEN 1 ELSE 0 END)      AS reached_purchase
    FROM events
    GROUP BY user_id
),

stage_counts AS (
    SELECT
        SUM(reached_signup)   AS signup_users,
        SUM(reached_verify)   AS verify_users,
        SUM(reached_profile)  AS profile_users,
        SUM(reached_purchase) AS purchase_users
    FROM user_funnel
)

SELECT
    'Stage 1: Signup'        AS stage, signup_users   AS users, 100.0 AS conversion_from_prev FROM stage_counts
UNION ALL
SELECT 'Stage 2: Email Verified', verify_users,
    ROUND(100.0 * verify_users   / NULLIF(signup_users, 0), 2)   FROM stage_counts
UNION ALL
SELECT 'Stage 3: Profile Completed', profile_users,
    ROUND(100.0 * profile_users  / NULLIF(verify_users, 0), 2)   FROM stage_counts
UNION ALL
SELECT 'Stage 4: First Purchase', purchase_users,
    ROUND(100.0 * purchase_users / NULLIF(profile_users, 0), 2)  FROM stage_counts;

SOLVED EXAMPLE — Q86 (TikTok): Signup-to-activation rate

Problem:
Table: emails (email_id, user_id, action, date)
action values: 'sent' (signup triggered), 'open' (user opened), 'answered' (activated)
Find activation rate = users who 'answered' / total users (2 decimal places)
sql
SELECT
    ROUND(
        COUNT(DISTINCT CASE WHEN action = 'answered' THEN user_id END) * 1.0
        / COUNT(DISTINCT user_id),
    2) AS activation_rate
FROM emails;
-- COUNT(DISTINCT ...) in the numerator: SUM over events would overcount
-- users who answered more than once

-- More explicit version:
WITH user_status AS (
    SELECT
        user_id,
        MAX(CASE WHEN action = 'answered' THEN 1 ELSE 0 END) AS activated
    FROM emails
    GROUP BY user_id
)
SELECT ROUND(AVG(activated * 1.0), 2) AS activation_rate FROM user_status;
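A quick check of the distinct-user counting with Python's stdlib sqlite3 (invented rows): user 1 answered twice, so counting events would report 2 activations among 2 users, while counting distinct users correctly reports 1 of 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (user_id INT, action TEXT)")
conn.executemany("INSERT INTO emails VALUES (?, ?)", [
    (1, 'sent'), (1, 'answered'), (1, 'answered'),   # answered twice!
    (2, 'sent'),
])

rate = conn.execute("""
    SELECT ROUND(
        COUNT(DISTINCT CASE WHEN action = 'answered' THEN user_id END) * 1.0
        / COUNT(DISTINCT user_id), 2)
    FROM emails
""").fetchone()[0]

print(rate)   # 0.5 — 1 of 2 users activated, despite 2 'answered' events
```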

PATTERN 15 QUESTIONS

| Q# | Question | Key Insight |
|----|----------|-------------|
| Q85 | Count users at each funnel stage + conversion rates | MAX(CASE WHEN event=stage) per user, then aggregate |
| Q86 | Signup-to-activation rate | COUNT(answered) / COUNT(signed_up) |
| Q87 | Drop-off rate at each onboarding step | Funnel template + find biggest % drop between consecutive stages |
| Q88 | CTR-to-conversion per ad campaign | Impressions → clicks → purchases (3-stage funnel) |
| Q89 | Funnel step with highest drop-off (first 7 days) | Compute all stage counts, then find MAX(step_n - step_n+1) |
| Q90 | Free trial → paid conversion across monthly cohorts | Cohort = signup_month + funnel (trial → paid) per cohort |

⚠️ COMMON TRAPS IN ADVANCED PATTERNS

🧠 Memory Map
RECURSIVE CTE TRAPS
TRAP 1: No termination condition → infinite loop
Always add: WHERE level < 20 OR WHERE some column IS NOT NULL
Some databases enforce a default recursion limit (SQL Server: 100, MySQL: 1000); PostgreSQL does not — your condition is the only guard
TRAP 2: UNION ALL vs UNION in recursive CTE
Must use UNION ALL (not UNION) in recursive CTEs
UNION would deduplicate across iterations → wrong for hierarchies!
TRAP 3: Hive doesn't support WITH RECURSIVE
In Hive: use alternative approaches (multiple queries or Spark SQL for recursive)
In interview: say "Hive doesn't support recursive CTEs — I'd use Spark SQL instead"
MEDIAN TRAPS
TRAP 4: Simple AVG ≠ Median
AVG(salary) = arithmetic average (pulled by outliers)
MEDIAN(salary) = 50th percentile (middle value, resistant to outliers)
Interviewers ask for MEDIAN specifically to test this distinction
TRAP 5: Even vs Odd count
Odd count (N=5): median = row 3 → one row
Even count (N=6): median = average of rows 3 and 4 → TWO rows, then AVG
ROW_NUMBER trick: WHERE rn IN ((N+1)/2, (N+2)/2) handles BOTH cases with AVG
FUNNEL TRAPS
TRAP 6: User can appear in multiple stages
A user who completed 'first_purchase' ALSO did 'signup' and 'email_verified'
Use MAX(CASE WHEN event=stage THEN 1 ELSE 0 END) per user
NOT COUNT(CASE WHEN event=stage) — that counts events, not unique users!
TRAP 7: Strict funnel vs non-strict
Strict: user must complete stages IN ORDER (signup before purchase)
Non-strict: just check if user ever performed each action
Most interview questions are non-strict unless explicitly stated
For strict funnel: add timestamp comparison in JOIN conditions
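The strict-vs-non-strict distinction can be sketched with Python's stdlib sqlite3 (table layout and timestamps invented): the strict version adds a timestamp comparison in the JOIN, so user 2, whose purchase predates their signup, is excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INT, event_type TEXT, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, 'signup',   '2024-01-01'), (1, 'purchase', '2024-01-05'),
    (2, 'purchase', '2024-01-01'), (2, 'signup',   '2024-01-05'),  # out of order
])

# STRICT: purchase must come AFTER signup
strict = conn.execute("""
    SELECT COUNT(DISTINCT s.user_id)
    FROM events s
    JOIN events p ON p.user_id = s.user_id
                 AND p.event_type = 'purchase'
                 AND p.ts > s.ts              -- the ordering requirement
    WHERE s.event_type = 'signup'
""").fetchone()[0]

# NON-STRICT: user ever purchased, order ignored
non_strict = conn.execute("""
    SELECT COUNT(DISTINCT user_id) FROM events WHERE event_type = 'purchase'
""").fetchone()[0]

print(strict, non_strict)   # 1 2
```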