🗃️
SQL
SQL Patterns 10–12: Conditional Aggregation, Cohort/Retention, Date Arithmetic
🗃️
🗃️
SQL · Section 5 of 9

SQL Patterns 10–12: Conditional Aggregation, Cohort/Retention, Date Arithmetic

SQL Patterns 10–12: Conditional Aggregation, Cohort/Retention, Date Arithmetic

These patterns appear in almost every analytics and data engineering role. Cohort analysis is the most frequently asked "hard" business SQL question.

🧠 MEMORY MAP

🧠 ANALYTICS PATTERNS = "CAD"
ANALYTICS PATTERNS"CAD"
CConditional Aggregation: SUM(CASE WHEN ...) = pivot/segment in one query
AcAhort/Retention: MIN(event_date) as cohort key + offset joins
DDate Arithmetic: DATEDIFF, DATE_TRUNC, INTERVAL for time-based calculations
CONDITIONAL AGGREGATION RULE
"Count/Sum by category WITHOUT multiple queries"
SUM(CASE WHEN category='A' THEN 1 ELSE 0 END) AS count_A
SUM(CASE WHEN category='B' THEN amount ELSE 0 END) AS revenue_B
COHORT RULE: "C-M-R" (Cohort → Measure → Rate)
CCohort: MIN(event_date) per user = their "birth date" in this system
MMeasure: COUNT(users who returned on day/week N after cohort date)
RRate: returned_count / cohort_size × 100
DATE ARITHMETIC RULE: "DIFF-TRUNC-ADD"
DIFFDATEDIFF(end, start) for age/interval
TRUNCDATE_TRUNC('month', date) for grouping by period
ADDDATE_ADD(date, 7) for deadlines and offsets

PATTERN 10: CONDITIONAL AGGREGATION / PIVOT

What Is It?

Count or sum values BY CONDITION within a single row using CASE WHEN inside aggregate functions. Creates a "wide" result with category-based columns.

Recognize It When You See:

  • "Show [metric] broken down by [category] in the same row"
  • "Count paying vs non-paying users per date"
  • "Revenue by quarter as columns"
  • "Click-through rate (clicks / impressions)"
  • "Success vs failure count side by side"
  • "Pivot [categories] into columns"

THE TEMPLATE

sql
-- TEMPLATE: Conditional Aggregation
SELECT
    grouping_col,                    -- row identifier (date, user, region)

    -- Count rows matching a condition
    COUNT(CASE WHEN category = 'A' THEN 1 END)          AS count_a,
    SUM(CASE WHEN category = 'A' THEN 1 ELSE 0 END)     AS count_a_v2,  -- same result

    -- Sum values matching a condition
    SUM(CASE WHEN category = 'B' THEN amount ELSE 0 END) AS total_b,

    -- Ratio / percentage
    ROUND(
        100.0 * SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event = 'impression' THEN 1 ELSE 0 END), 0),
    2) AS ctr_pct,

    -- Average conditional
    AVG(CASE WHEN is_premium = 1 THEN rating END)        AS avg_premium_rating

FROM source_table
GROUP BY grouping_col;

-- ⚠️ NULLIF(denominator, 0) prevents division-by-zero error
-- ⚠️ COUNT(CASE WHEN ...) counts NULL as 0 — CASE WHEN false returns NULL, not counted
--    SUM(CASE WHEN ... THEN 1 ELSE 0) is safer (explicit 0 for false)

SOLVED EXAMPLE — Q57 (Microsoft): Downloads for paying vs non-paying, filter where non > paying

Problem:
Table: user_activity (date, user_id, downloads)
Table: paying_users (user_id)
Show total downloads per day for paying vs non-paying users.
Only include dates where non-paying downloads > paying downloads.
sql
WITH daily_downloads AS (
    SELECT
        ua.date,
        SUM(CASE WHEN pu.user_id IS NOT NULL THEN ua.downloads ELSE 0 END)
            AS paying_downloads,
        SUM(CASE WHEN pu.user_id IS NULL THEN ua.downloads ELSE 0 END)
            AS non_paying_downloads
    FROM user_activity ua
    LEFT JOIN paying_users pu ON ua.user_id = pu.user_id
    GROUP BY ua.date
)

SELECT date, paying_downloads, non_paying_downloads
FROM daily_downloads
WHERE non_paying_downloads > paying_downloads
ORDER BY date;

SOLVED EXAMPLE — Q58 (Facebook): Click-through rate per app in 2022

Problem:
Table: events (app_id, event_type, timestamp) -- event_type: 'click' or 'impression'
Calculate CTR = 100.0 × clicks / impressions per app in 2022.
sql
SELECT
    app_id,
    ROUND(
        100.0
        * SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0),
    2) AS ctr_percentage
FROM events
WHERE EXTRACT(YEAR FROM timestamp) = 2022
GROUP BY app_id
ORDER BY ctr_percentage DESC;

PATTERN 10 QUESTIONS

Q#QuestionKey Insight
Q57Downloads paying vs non-paying, filter non > payingLEFT JOIN to mark paying users + CASE WHEN
Q58Click-through rate (clicks/impressions)SUM(CASE WHEN 'click') / SUM(CASE WHEN 'impression') × 100
Q59Pivot monthly revenue by product categorySUM(CASE WHEN month='Jan') AS Jan, etc.
Q60Users active exactly 3 of past 7 daysSUM(CASE WHEN recent date THEN 1) per user, HAVING = 3
Q61Orders per quarter as separate columnsSUM(CASE WHEN EXTRACT(quarter)=1) for each quarter
Q62Twitch streamer vs viewer session ratioSUM(CASE WHEN type='streamer'), SUM(CASE WHEN type='viewer')

PATTERN 11: COHORT / RETENTION ANALYSIS

What Is It?

Group users by when they first appeared (sign-up cohort). Measure what % of each cohort was still active after N days/weeks/months. Most important pattern for product analytics and growth-focused data roles.

Recognize It When You See:

  • "Retention rate / monthly retention"
  • "Day-1, Day-7, Day-30 retention"
  • "Returning customers vs first-time"
  • "Cohort [by signup month] [activity over time]"
  • "Monthly active users who were also active last month"

THE TEMPLATE

sql
-- TEMPLATE: Cohort Retention
-- Step 1: Find each user's cohort date (first event)
WITH user_cohorts AS (
    SELECT
        user_id,
        MIN(event_date) AS cohort_date,              -- first ever event = cohort date
        DATE_TRUNC('month', MIN(event_date)) AS cohort_month
    FROM events
    GROUP BY user_id
),

-- Step 2: Join all events back to cohort date to compute "days since cohort"
user_activity AS (
    SELECT
        e.user_id,
        uc.cohort_month,
        -- How many months after their cohort month is this event?
        DATEDIFF('month', uc.cohort_date, e.event_date) AS months_since_cohort
        -- Hive: MONTHS_BETWEEN(e.event_date, uc.cohort_date)
        -- Spark: DATEDIFF(e.event_date, uc.cohort_date) / 30 (approximate)
    FROM events e
    JOIN user_cohorts uc ON e.user_id = uc.user_id
),

-- Step 3: Count users active at each cohort offset
retention AS (
    SELECT
        cohort_month,
        months_since_cohort AS period_offset,
        COUNT(DISTINCT user_id) AS active_users
    FROM user_activity
    GROUP BY cohort_month, months_since_cohort
),

-- Step 4: Get cohort sizes (users in month 0 = all users in that cohort)
cohort_sizes AS (
    SELECT cohort_month, active_users AS cohort_size
    FROM retention
    WHERE period_offset = 0
)

-- Step 5: Compute retention rate per cohort per period
SELECT
    r.cohort_month,
    r.period_offset,
    r.active_users,
    cs.cohort_size,
    ROUND(100.0 * r.active_users / cs.cohort_size, 2) AS retention_rate_pct
FROM retention r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
ORDER BY r.cohort_month, r.period_offset;

SOLVED EXAMPLE — Q63 (Facebook Hard): MAU July 2022 (active in BOTH June AND July)

Problem:
Table: user_actions (user_id, event_date, action_type)
Find count of monthly active users in July 2022.
MAU definition: active in BOTH June 2022 AND July 2022.
sql
WITH june_active AS (
    SELECT DISTINCT user_id
    FROM user_actions
    WHERE event_date >= '2022-06-01'
      AND event_date < '2022-07-01'
),

july_active AS (
    SELECT DISTINCT user_id
    FROM user_actions
    WHERE event_date >= '2022-07-01'
      AND event_date < '2022-08-01'
)

-- Users who were active in BOTH months = intersection
SELECT COUNT(*) AS monthly_active_users
FROM june_active ja
INNER JOIN july_active jul ON ja.user_id = jul.user_id;

-- Alternative with EXISTS:
SELECT COUNT(DISTINCT j.user_id) AS monthly_active_users
FROM july_active j
WHERE EXISTS (
    SELECT 1 FROM june_active ja WHERE ja.user_id = j.user_id
);

SOLVED EXAMPLE — Q65: Day-7 retention for January 2024 sign-ups

Problem:
Table: signups (user_id, signup_date)
Table: daily_logins (user_id, login_date)
Find: what % of users who signed up in January 2024 were active exactly on Day 7?
(Day 7 = signup_date + 7 days)
sql
WITH jan_signups AS (
    SELECT
        user_id,
        signup_date,
        signup_date + INTERVAL '7 days' AS day_7_date   -- their specific Day 7
        -- Hive/Spark: DATE_ADD(signup_date, 7) AS day_7_date
    FROM signups
    WHERE signup_date >= '2024-01-01'
      AND signup_date < '2024-02-01'
),

day7_active AS (
    -- Check if each user logged in on their Day 7
    SELECT
        js.user_id,
        MAX(CASE WHEN dl.login_date = js.day_7_date THEN 1 ELSE 0 END) AS was_active_day7
    FROM jan_signups js
    LEFT JOIN daily_logins dl
        ON js.user_id = dl.user_id
        AND dl.login_date = js.day_7_date
    GROUP BY js.user_id
)

SELECT
    COUNT(*) AS cohort_size,
    SUM(was_active_day7) AS day7_actives,
    ROUND(100.0 * SUM(was_active_day7) / COUNT(*), 2) AS day7_retention_rate
FROM day7_active;

PATTERN 11 QUESTIONS

Q#QuestionKey Insight
Q63MAU July 2022 (active both June + July)INTERSECT or JOIN on two sets of monthly active users
Q64Retention rate for monthly cohort at month 1, 2, 3DATEDIFF in months since cohort, then retention template
Q65Day-7 retention for January sign-upssignup_date + 7 as target date, LEFT JOIN logins
Q66Weekly cohort: % making 2nd purchase within 30 daysDATE_TRUNC('week', first_purchase) as cohort key
Q67D1, D7, D30 retention side by sideThree LEFT JOINs to logins on day +1, +7, +30
Q68Unsubscribe effect on logins over 4 weeksCohort = unsubscribe_date, measure logins in each week_offset

PATTERN 12: DATE ARITHMETIC

What Is It?

Calculate time intervals, filter by date ranges, create date-based groups, and compute deadlines or overdue records.

Recognize It When You See:

  • "Days between [event A] and [event B]"
  • "Orders in the past 90 days"
  • "Average time to deliver"
  • "Employees at the company more than 5 years"
  • "Renewals more than 7 days late"
  • "Group by week/month/quarter"

THE TEMPLATE

sql
-- TEMPLATE: Date Arithmetic Reference

-- 1. Difference between two dates (in days)
DATEDIFF(end_date, start_date)            -- Hive/Spark: returns integer days
end_date - start_date                      -- PostgreSQL: returns integer days
DATEDIFF(day, start_date, end_date)       -- SQL Server / Snowflake

-- 2. Add N days to a date
DATE_ADD(date_col, 7)                     -- Hive/Spark: add 7 days
date_col + INTERVAL '7 days'              -- PostgreSQL
DATEADD(day, 7, date_col)                -- SQL Server / Snowflake

-- 3. Truncate to start of period (useful for monthly/weekly grouping)
DATE_TRUNC('month', date_col)             -- PostgreSQL / Spark / Snowflake
TRUNC(date_col, 'MM')                     -- Hive / Oracle: start of month
DATE_FORMAT(date_col, 'yyyy-MM-01')       -- Hive: manual month truncation

-- 4. Extract components
EXTRACT(YEAR FROM date_col)               -- standard SQL
YEAR(date_col)                            -- Hive/Spark/MySQL shorthand
EXTRACT(QUARTER FROM date_col)            -- quarter number (1-4)

-- 5. Current date
CURRENT_DATE                              -- standard SQL
NOW()                                     -- PostgreSQL / MySQL (includes time)
GETDATE()                                 -- SQL Server
CURRENT_DATE()                            -- Hive/Spark

-- 6. Filter "past N days"
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'   -- PostgreSQL
WHERE event_date >= DATE_SUB(CURRENT_DATE(), 90)        -- Hive/Spark

-- 7. Convert epoch/unix timestamp
FROM_UNIXTIME(unix_ts)                    -- Hive: unix seconds → datetime
TO_TIMESTAMP(unix_ts)                     -- Snowflake/PostgreSQL

SOLVED EXAMPLE — Q69 (Facebook): Days between first and last post per user

Problem:
Table: posts (user_id, post_date)
For each user who posted at least twice in 2024,
find the number of days between their first and last post of the year.
sql
SELECT
    user_id,
    MIN(post_date) AS first_post,
    MAX(post_date) AS last_post,
    DATEDIFF(MAX(post_date), MIN(post_date)) AS days_between
    -- PostgreSQL: MAX(post_date) - MIN(post_date) AS days_between
FROM posts
WHERE post_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_id
HAVING COUNT(*) >= 2   -- only users with at least 2 posts
ORDER BY days_between DESC;

SOLVED EXAMPLE — Q71 (Ankit Bansal): Employees 5+ years without promotion

Problem:
Table: employees (employee_id, hire_date)
Table: promotions (employee_id, promotion_date)
Find employees who have been at the company for 5+ years
but have NEVER been promoted.
sql
SELECT
    e.employee_id,
    e.hire_date,
    DATEDIFF(CURRENT_DATE(), e.hire_date) AS days_employed
    -- PostgreSQL: CURRENT_DATE - e.hire_date
FROM employees e
LEFT JOIN promotions p ON e.employee_id = p.employee_id
WHERE
    DATEDIFF(CURRENT_DATE(), e.hire_date) > 5 * 365   -- 5+ years
    -- PostgreSQL: CURRENT_DATE - e.hire_date > 1825
    AND p.employee_id IS NULL;   -- no record in promotions table

PATTERN 12 QUESTIONS

Q#QuestionKey Insight
Q69Days between first and last post in yearMAX(date) - MIN(date) + HAVING COUNT >= 2
Q70Average hours from order placement to deliveryAVG(DATEDIFF(delivered_at, placed_at) × 24)
Q71Employees 5+ years without promotionDATEDIFF > 1825 AND LEFT JOIN promotions WHERE IS NULL
Q72% cancelled orders and revenue lost (past 90 days)WHERE date >= current - 90 + conditional aggregation
Q73Subscription renewals more than 7 days lateDATEDIFF(renewal_date, expected_renewal_date) > 7

⚠️ COMMON TRAPS IN ANALYTICS PATTERNS

🧠 CTR = clicks / impressions → fails if impressions = 0 for some group
CONDITIONAL AGGREGATION TRAPS
TRAP 1: Division by zero in ratios
CTRclicks / impressions → fails if impressions → 0 for some group
Fix: ROUND(100.0 * clicks / NULLIF(impressions, 0), 2)
NULLIF(x, 0) returns NULL if x=0 → division returns NULL (not error)
TRAP 2: COUNT vs SUM for conditional counting
COUNT(CASE WHEN cond THEN 1 END) → works (NULL is not counted)
SUM(CASE WHEN cond THEN 1 ELSE 0 END) → works (explicit 0 for false)
SUM(CASE WHEN cond THEN 1 END) → works too (NULL = 0 in SUM)
All three produce the same result for conditional counts
COHORT ANALYSIS TRAPS
TRAP 3: Using event_date not sign-up date as cohort
Cohort = first event date (MIN per user), not a separate signup table
If you use any event date as cohort: same user appears in multiple cohorts!
TRAP 4: Off-by-one in retention day calculation
Day-0 = cohort date (signup day)
Day-7 = signup_date + 7 (exactly 1 week later)
Some definitions: Day-7 = 7 days AFTER signupsignup_date + 7
Others: Day-7 = week 1DATE_TRUNC('week') approach
DATE ARITHMETIC TRAPS
TRAP 5: DATEDIFF argument order (Hive vs Postgres vs SQL Server)
Hive/Spark: DATEDIFF(end_date, start_date) → returns end - start (positive if end > start)
SQL Server: DATEDIFF(unit, start_date, end_date) → start to end
PostgreSQL: end_date - start_dateinteger subtraction
⚠️Always test: DATEDIFF('2024-01-10', '2024-01-01') should return 9
TRAP 6: Comparing timestamps vs dates
login_date = '2024-01-15'::date ← fine if column is DATE type
login_timestamp = '2024-01-15' ← WRONG if column has time component!
Fix: DATE(login_timestamp) = '2024-01-15' OR login_timestamp >= '2024-01-15 00:00:00'