SQL Patterns 10–12: Conditional Aggregation, Cohort/Retention, Date Arithmetic
These patterns appear in almost every analytics and data engineering role. Cohort analysis is among the most frequently asked "hard" business SQL questions.
🧠 MEMORY MAP
🧠 ANALYTICS PATTERNS = "CAD"
- C = Conditional Aggregation: SUM(CASE WHEN ...) to pivot/segment in one query
- A = Cohort/Retention: MIN(event_date) as cohort key + offset joins
- D = Date Arithmetic: DATEDIFF, DATE_TRUNC, INTERVAL for time-based calculations
CONDITIONAL AGGREGATION RULE
"Count/Sum by category WITHOUT multiple queries"
SUM(CASE WHEN category='A' THEN 1 ELSE 0 END) AS count_A
SUM(CASE WHEN category='B' THEN amount ELSE 0 END) AS revenue_B
COHORT RULE: "C-M-R" (Cohort → Measure → Rate)
- C = Cohort: MIN(event_date) per user = their "birth date" in this system
- M = Measure: COUNT of users who returned on day/week N after the cohort date
- R = Rate: returned_count / cohort_size × 100
DATE ARITHMETIC RULE: "DIFF-TRUNC-ADD"
- DIFF = DATEDIFF(end, start) for age/interval
- TRUNC = DATE_TRUNC('month', date) for grouping by period
- ADD = DATE_ADD(date, 7) for deadlines and offsets
PATTERN 10: CONDITIONAL AGGREGATION / PIVOT
What Is It?
Count or sum values BY CONDITION within a single row using CASE WHEN inside aggregate functions. Creates a "wide" result with category-based columns.
Recognize It When You See:
- "Show [metric] broken down by [category] in the same row"
- "Count paying vs non-paying users per date"
- "Revenue by quarter as columns"
- "Click-through rate (clicks / impressions)"
- "Success vs failure count side by side"
- "Pivot [categories] into columns"
THE TEMPLATE
```sql
-- TEMPLATE: Conditional Aggregation
SELECT
    grouping_col,  -- row identifier (date, user, region)
    -- Count rows matching a condition
    COUNT(CASE WHEN category = 'A' THEN 1 END) AS count_a,
    SUM(CASE WHEN category = 'A' THEN 1 ELSE 0 END) AS count_a_v2,  -- same result
    -- Sum values matching a condition
    SUM(CASE WHEN category = 'B' THEN amount ELSE 0 END) AS total_b,
    -- Ratio / percentage
    ROUND(
        100.0 * SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event = 'impression' THEN 1 ELSE 0 END), 0),
    2) AS ctr_pct,
    -- Average over a conditional subset
    AVG(CASE WHEN is_premium = 1 THEN rating END) AS avg_premium_rating
FROM source_table
GROUP BY grouping_col;
-- ⚠️ NULLIF(denominator, 0) prevents a division-by-zero error
-- ⚠️ COUNT(CASE WHEN ...) works because CASE without ELSE returns NULL for
--    non-matching rows, and COUNT ignores NULLs
-- SUM(CASE WHEN ... THEN 1 ELSE 0) is more explicit: every row contributes 0 or 1
```
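To sanity-check the template, here is a minimal sketch run through Python's sqlite3 (the table name and rows are invented for illustration; SQLite supports CASE, NULLIF, and ROUND exactly as used above):

```python
import sqlite3

# In-memory database with a tiny invented dataset
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_table (grouping_col TEXT, category TEXT, event TEXT, amount REAL);
INSERT INTO source_table VALUES
  ('2024-01-01', 'A', 'impression', 10),
  ('2024-01-01', 'A', 'click',       5),
  ('2024-01-01', 'B', 'impression', 20),
  ('2024-01-02', 'B', 'impression',  7);
""")

# The conditional-aggregation template, unchanged in structure
rows = conn.execute("""
SELECT
    grouping_col,
    COUNT(CASE WHEN category = 'A' THEN 1 END)           AS count_a,
    SUM(CASE WHEN category = 'B' THEN amount ELSE 0 END) AS total_b,
    ROUND(
        100.0 * SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event = 'impression' THEN 1 ELSE 0 END), 0),
    2) AS ctr_pct
FROM source_table
GROUP BY grouping_col
ORDER BY grouping_col
""").fetchall()
print(rows)
```

Day 1 has one click against two impressions (CTR 50.0); day 2 has no category-A rows at all, which is exactly where COUNT(CASE ...) correctly yields 0.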
SOLVED EXAMPLE — Q57 (Microsoft): Downloads for paying vs non-paying, filter where non > paying
Problem:
Table: user_activity (date, user_id, downloads)
Table: paying_users (user_id)
Show total downloads per day for paying vs non-paying users.
Only include dates where non-paying downloads > paying downloads.
```sql
WITH daily_downloads AS (
    SELECT
        ua.date,
        SUM(CASE WHEN pu.user_id IS NOT NULL THEN ua.downloads ELSE 0 END)
            AS paying_downloads,
        SUM(CASE WHEN pu.user_id IS NULL THEN ua.downloads ELSE 0 END)
            AS non_paying_downloads
    FROM user_activity ua
    LEFT JOIN paying_users pu ON ua.user_id = pu.user_id
    GROUP BY ua.date
)
SELECT date, paying_downloads, non_paying_downloads
FROM daily_downloads
WHERE non_paying_downloads > paying_downloads
ORDER BY date;
```
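The query can be verified end-to-end with a toy dataset in SQLite (rows invented; the SQL is the solution above, unchanged):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_activity (date TEXT, user_id INT, downloads INT);
CREATE TABLE paying_users (user_id INT);
INSERT INTO user_activity VALUES
  ('2024-06-01', 1, 5), ('2024-06-01', 2, 9),
  ('2024-06-02', 1, 8), ('2024-06-02', 2, 2);
INSERT INTO paying_users VALUES (1);  -- user 1 pays, user 2 does not
""")

rows = conn.execute("""
WITH daily_downloads AS (
    SELECT ua.date,
           SUM(CASE WHEN pu.user_id IS NOT NULL THEN ua.downloads ELSE 0 END) AS paying_downloads,
           SUM(CASE WHEN pu.user_id IS NULL     THEN ua.downloads ELSE 0 END) AS non_paying_downloads
    FROM user_activity ua
    LEFT JOIN paying_users pu ON ua.user_id = pu.user_id
    GROUP BY ua.date
)
SELECT date, paying_downloads, non_paying_downloads
FROM daily_downloads
WHERE non_paying_downloads > paying_downloads
ORDER BY date
""").fetchall()
print(rows)
```

Only June 1 survives the filter: non-paying downloads (9) beat paying (5), while June 2 goes the other way (2 vs 8).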
SOLVED EXAMPLE — Q58 (Facebook): Click-through rate per app in 2022
Problem:
Table: events (app_id, event_type, timestamp) -- event_type: 'click' or 'impression'
Calculate CTR = 100.0 × clicks / impressions per app in 2022.
```sql
SELECT
    app_id,
    ROUND(
        100.0
        * SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0),
    2) AS ctr_percentage
FROM events
WHERE EXTRACT(YEAR FROM timestamp) = 2022
GROUP BY app_id
ORDER BY ctr_percentage DESC;
```
PATTERN 10 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q57 | Downloads paying vs non-paying, filter non > paying | LEFT JOIN to mark paying users + CASE WHEN |
| Q58 | Click-through rate (clicks/impressions) | SUM(CASE WHEN 'click') / SUM(CASE WHEN 'impression') × 100 |
| Q59 | Pivot monthly revenue by product category | SUM(CASE WHEN month='Jan') AS Jan, etc. |
| Q60 | Users active exactly 3 of past 7 days | SUM(CASE WHEN recent date THEN 1) per user, HAVING = 3 |
| Q61 | Orders per quarter as separate columns | SUM(CASE WHEN EXTRACT(quarter)=1) for each quarter |
| Q62 | Twitch streamer vs viewer session ratio | SUM(CASE WHEN type='streamer'), SUM(CASE WHEN type='viewer') |
PATTERN 11: COHORT / RETENTION ANALYSIS
What Is It?
Group users by when they first appeared (sign-up cohort). Measure what % of each cohort was still active after N days/weeks/months. Most important pattern for product analytics and growth-focused data roles.
Recognize It When You See:
- "Retention rate / monthly retention"
- "Day-1, Day-7, Day-30 retention"
- "Returning customers vs first-time"
- "Cohort [by signup month] [activity over time]"
- "Monthly active users who were also active last month"
THE TEMPLATE
```sql
-- TEMPLATE: Cohort Retention
-- Step 1: Find each user's cohort date (first event)
WITH user_cohorts AS (
    SELECT
        user_id,
        MIN(event_date) AS cohort_date,  -- first ever event = cohort date
        DATE_TRUNC('month', MIN(event_date)) AS cohort_month
    FROM events
    GROUP BY user_id
),
-- Step 2: Join all events back to the cohort date to compute "months since cohort"
user_activity AS (
    SELECT
        e.user_id,
        uc.cohort_month,
        -- How many months after their cohort month is this event?
        DATEDIFF('month', uc.cohort_date, e.event_date) AS months_since_cohort
        -- Hive: MONTHS_BETWEEN(e.event_date, uc.cohort_date)
        -- Spark: DATEDIFF(e.event_date, uc.cohort_date) / 30 (approximate)
    FROM events e
    JOIN user_cohorts uc ON e.user_id = uc.user_id
),
-- Step 3: Count users active at each cohort offset
retention AS (
    SELECT
        cohort_month,
        months_since_cohort AS period_offset,
        COUNT(DISTINCT user_id) AS active_users
    FROM user_activity
    GROUP BY cohort_month, months_since_cohort
),
-- Step 4: Get cohort sizes (users at offset 0 = all users in that cohort)
cohort_sizes AS (
    SELECT cohort_month, active_users AS cohort_size
    FROM retention
    WHERE period_offset = 0
)
-- Step 5: Compute retention rate per cohort per period
SELECT
    r.cohort_month,
    r.period_offset,
    r.active_users,
    cs.cohort_size,
    ROUND(100.0 * r.active_users / cs.cohort_size, 2) AS retention_rate_pct
FROM retention r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
ORDER BY r.cohort_month, r.period_offset;
```
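SQLite has no DATE_TRUNC or month-level DATEDIFF, so this sketch emulates them with strftime; the data and the (year*12 + month) arithmetic are illustrative assumptions, but the five-step CTE structure is the template's, unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INT, event_date TEXT);
INSERT INTO events VALUES
  (1, '2024-01-05'), (1, '2024-02-10'),  -- Jan cohort, back in month 1
  (2, '2024-01-20'),                     -- Jan cohort, churned
  (3, '2024-02-01'), (3, '2024-02-15');  -- Feb cohort, month 0 only
""")

rows = conn.execute("""
WITH user_cohorts AS (
    SELECT user_id,
           MIN(event_date)                    AS cohort_date,
           strftime('%Y-%m', MIN(event_date)) AS cohort_month  -- stands in for DATE_TRUNC
    FROM events GROUP BY user_id
),
user_activity AS (
    SELECT e.user_id, uc.cohort_month,
           -- month-level DATEDIFF emulated as a (year*12 + month) difference
           (CAST(strftime('%Y', e.event_date)   AS INT) * 12 + CAST(strftime('%m', e.event_date)   AS INT))
         - (CAST(strftime('%Y', uc.cohort_date) AS INT) * 12 + CAST(strftime('%m', uc.cohort_date) AS INT))
           AS months_since_cohort
    FROM events e JOIN user_cohorts uc ON e.user_id = uc.user_id
),
retention AS (
    SELECT cohort_month, months_since_cohort AS period_offset,
           COUNT(DISTINCT user_id) AS active_users
    FROM user_activity GROUP BY cohort_month, months_since_cohort
),
cohort_sizes AS (
    SELECT cohort_month, active_users AS cohort_size
    FROM retention WHERE period_offset = 0
)
SELECT r.cohort_month, r.period_offset, r.active_users, cs.cohort_size,
       ROUND(100.0 * r.active_users / cs.cohort_size, 2) AS retention_rate_pct
FROM retention r JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
ORDER BY r.cohort_month, r.period_offset
""").fetchall()
print(rows)
```

The January cohort has 2 users at offset 0 and 1 of them back at offset 1, giving 50% month-1 retention.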
SOLVED EXAMPLE — Q63 (Facebook Hard): MAU July 2022 (active in BOTH June AND July)
Problem:
Table: user_actions (user_id, event_date, action_type)
Find count of monthly active users in July 2022.
MAU definition: active in BOTH June 2022 AND July 2022.
```sql
WITH june_active AS (
    SELECT DISTINCT user_id
    FROM user_actions
    WHERE event_date >= '2022-06-01'
      AND event_date < '2022-07-01'
),
july_active AS (
    SELECT DISTINCT user_id
    FROM user_actions
    WHERE event_date >= '2022-07-01'
      AND event_date < '2022-08-01'
)
-- Users who were active in BOTH months = intersection
SELECT COUNT(*) AS monthly_active_users
FROM june_active ja
INNER JOIN july_active jul ON ja.user_id = jul.user_id;

-- Alternative final SELECT with EXISTS (reusing the same two CTEs):
SELECT COUNT(DISTINCT j.user_id) AS monthly_active_users
FROM july_active j
WHERE EXISTS (
    SELECT 1 FROM june_active ja WHERE ja.user_id = j.user_id
);
```
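A tiny runnable check of the intersection logic (rows invented: user 1 is active in both months, users 2 and 3 in only one each):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_actions (user_id INT, event_date TEXT, action_type TEXT);
INSERT INTO user_actions VALUES
  (1, '2022-06-15', 'like'), (1, '2022-07-02', 'post'),  -- active both months
  (2, '2022-06-20', 'like'),                             -- June only
  (3, '2022-07-10', 'post');                             -- July only
""")

mau = conn.execute("""
WITH june_active AS (
    SELECT DISTINCT user_id FROM user_actions
    WHERE event_date >= '2022-06-01' AND event_date < '2022-07-01'
),
july_active AS (
    SELECT DISTINCT user_id FROM user_actions
    WHERE event_date >= '2022-07-01' AND event_date < '2022-08-01'
)
SELECT COUNT(*) FROM june_active ja
JOIN july_active jul ON ja.user_id = jul.user_id
""").fetchone()[0]
print(mau)
```

Only user 1 sits in the intersection, so the count is 1.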
SOLVED EXAMPLE — Q65: Day-7 retention for January 2024 sign-ups
Problem:
Table: signups (user_id, signup_date)
Table: daily_logins (user_id, login_date)
Find: what % of users who signed up in January 2024 were active exactly on Day 7?
(Day 7 = signup_date + 7 days)
```sql
WITH jan_signups AS (
    SELECT
        user_id,
        signup_date,
        signup_date + 7 AS day_7_date  -- PostgreSQL: DATE + integer = their specific Day 7
        -- PostgreSQL alt: signup_date + INTERVAL '7 days' (note: yields a TIMESTAMP)
        -- Hive/Spark: DATE_ADD(signup_date, 7) AS day_7_date
    FROM signups
    WHERE signup_date >= '2024-01-01'
      AND signup_date < '2024-02-01'
),
day7_active AS (
    -- Check whether each user logged in on their Day 7
    SELECT
        js.user_id,
        MAX(CASE WHEN dl.login_date = js.day_7_date THEN 1 ELSE 0 END) AS was_active_day7
    FROM jan_signups js
    LEFT JOIN daily_logins dl
        ON js.user_id = dl.user_id
        AND dl.login_date = js.day_7_date
    GROUP BY js.user_id
)
SELECT
    COUNT(*) AS cohort_size,
    SUM(was_active_day7) AS day7_actives,
    ROUND(100.0 * SUM(was_active_day7) / COUNT(*), 2) AS day7_retention_rate
FROM day7_active;
```
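A runnable sketch in SQLite, where date(signup_date, '+7 days') plays the role of the dialect-specific date addition (rows invented: user 1 returns on their Day 7, user 2 does not):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signups (user_id INT, signup_date TEXT);
CREATE TABLE daily_logins (user_id INT, login_date TEXT);
INSERT INTO signups VALUES (1, '2024-01-01'), (2, '2024-01-10');
INSERT INTO daily_logins VALUES (1, '2024-01-08'), (2, '2024-01-15');  -- only user 1 hits Day 7
""")

result = conn.execute("""
WITH jan_signups AS (
    SELECT user_id, signup_date,
           date(signup_date, '+7 days') AS day_7_date  -- SQLite's DATE_ADD equivalent
    FROM signups
    WHERE signup_date >= '2024-01-01' AND signup_date < '2024-02-01'
),
day7_active AS (
    SELECT js.user_id,
           MAX(CASE WHEN dl.login_date = js.day_7_date THEN 1 ELSE 0 END) AS was_active_day7
    FROM jan_signups js
    LEFT JOIN daily_logins dl
      ON js.user_id = dl.user_id AND dl.login_date = js.day_7_date
    GROUP BY js.user_id
)
SELECT COUNT(*), SUM(was_active_day7),
       ROUND(100.0 * SUM(was_active_day7) / COUNT(*), 2)
FROM day7_active
""").fetchone()
print(result)
```

Cohort size 2, one Day-7 active, so 50% Day-7 retention.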
PATTERN 11 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q63 | MAU July 2022 (active both June + July) | INTERSECT or JOIN on two sets of monthly active users |
| Q64 | Retention rate for monthly cohort at month 1, 2, 3 | DATEDIFF in months since cohort, then retention template |
| Q65 | Day-7 retention for January sign-ups | signup_date + 7 as target date, LEFT JOIN logins |
| Q66 | Weekly cohort: % making 2nd purchase within 30 days | DATE_TRUNC('week', first_purchase) as cohort key |
| Q67 | D1, D7, D30 retention side by side | Three LEFT JOINs to logins on day +1, +7, +30 |
| Q68 | Unsubscribe effect on logins over 4 weeks | Cohort = unsubscribe_date, measure logins in each week_offset |
PATTERN 12: DATE ARITHMETIC
What Is It?
Calculate time intervals, filter by date ranges, create date-based groups, and compute deadlines or overdue records.
Recognize It When You See:
- "Days between [event A] and [event B]"
- "Orders in the past 90 days"
- "Average time to deliver"
- "Employees at the company more than 5 years"
- "Renewals more than 7 days late"
- "Group by week/month/quarter"
THE TEMPLATE
```sql
-- TEMPLATE: Date Arithmetic Reference

-- 1. Difference between two dates (in days)
DATEDIFF(end_date, start_date)        -- Hive/Spark: returns integer days
end_date - start_date                 -- PostgreSQL: DATE minus DATE returns integer days
DATEDIFF(day, start_date, end_date)   -- SQL Server / Snowflake

-- 2. Add N days to a date
DATE_ADD(date_col, 7)                 -- Hive/Spark: add 7 days
date_col + INTERVAL '7 days'          -- PostgreSQL
DATEADD(day, 7, date_col)             -- SQL Server / Snowflake

-- 3. Truncate to start of period (useful for monthly/weekly grouping)
DATE_TRUNC('month', date_col)         -- PostgreSQL / Spark / Snowflake
TRUNC(date_col, 'MM')                 -- Hive / Oracle: start of month
DATE_FORMAT(date_col, 'yyyy-MM-01')   -- Hive: manual month truncation (string)

-- 4. Extract components
EXTRACT(YEAR FROM date_col)           -- standard SQL
YEAR(date_col)                        -- Hive / Spark / MySQL shorthand
EXTRACT(QUARTER FROM date_col)        -- quarter number (1-4)

-- 5. Current date
CURRENT_DATE                          -- standard SQL
NOW()                                 -- PostgreSQL / MySQL (includes time)
GETDATE()                             -- SQL Server
CURRENT_DATE()                        -- Hive/Spark

-- 6. Filter "past N days"
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'   -- PostgreSQL
WHERE event_date >= DATE_SUB(CURRENT_DATE(), 90)        -- Hive/Spark

-- 7. Convert epoch/unix timestamp
FROM_UNIXTIME(unix_ts)                -- Hive: unix seconds -> datetime
TO_TIMESTAMP(unix_ts)                 -- Snowflake / PostgreSQL
```
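SQLite's equivalents (julianday for day differences, date(..., modifier) for adding and truncating) can be used to check the arithmetic. This sketch is SQLite-specific, a stand-in for the dialects listed above rather than one of them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("""
SELECT
  CAST(julianday('2024-01-10') - julianday('2024-01-01') AS INT) AS diff_days,  -- like DATEDIFF
  date('2024-03-15', '+7 days')             AS plus_7,       -- like DATE_ADD
  date('2024-03-15', 'start of month')      AS month_start,  -- like DATE_TRUNC('month', ...)
  CAST(strftime('%Y', '2024-03-15') AS INT) AS yr            -- like EXTRACT(YEAR FROM ...)
""").fetchone()
print(row)
```

Note diff_days is 9, not 10: date differences count the gap between endpoints, which is exactly the trap-5 sanity check later in this section.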
SOLVED EXAMPLE — Q69 (Facebook): Days between first and last post per user
Problem:
Table: posts (user_id, post_date)
For each user who posted at least twice in 2024,
find the number of days between their first and last post of the year.
```sql
SELECT
    user_id,
    MIN(post_date) AS first_post,
    MAX(post_date) AS last_post,
    DATEDIFF(MAX(post_date), MIN(post_date)) AS days_between
    -- PostgreSQL: MAX(post_date) - MIN(post_date) AS days_between
FROM posts
WHERE post_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_id
HAVING COUNT(*) >= 2  -- only users with at least 2 posts
ORDER BY days_between DESC;
```
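Checked in SQLite with julianday standing in for DATEDIFF (rows invented; user 2 posts only once and is filtered out by HAVING):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (user_id INT, post_date TEXT);
INSERT INTO posts VALUES (1, '2024-01-01'), (1, '2024-03-01'), (2, '2024-05-05');
""")

rows = conn.execute("""
SELECT user_id,
       MIN(post_date) AS first_post,
       MAX(post_date) AS last_post,
       CAST(julianday(MAX(post_date)) - julianday(MIN(post_date)) AS INT) AS days_between
FROM posts
WHERE post_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_id
HAVING COUNT(*) >= 2
ORDER BY days_between DESC
""").fetchall()
print(rows)
```

Jan 1 to Mar 1 in the leap year 2024 is 31 + 29 = 60 days.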
SOLVED EXAMPLE — Q71 (Ankit Bansal): Employees 5+ years without promotion
Problem:
Table: employees (employee_id, hire_date)
Table: promotions (employee_id, promotion_date)
Find employees who have been at the company for 5+ years
but have NEVER been promoted.
```sql
SELECT
    e.employee_id,
    e.hire_date,
    DATEDIFF(CURRENT_DATE(), e.hire_date) AS days_employed
    -- PostgreSQL: CURRENT_DATE - e.hire_date
FROM employees e
LEFT JOIN promotions p ON e.employee_id = p.employee_id
WHERE
    DATEDIFF(CURRENT_DATE(), e.hire_date) >= 5 * 365  -- ~5+ years (365*5 ignores leap days)
    -- PostgreSQL: CURRENT_DATE - e.hire_date >= 1825
    AND p.employee_id IS NULL;  -- no matching row in promotions = never promoted
```
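A runnable version of the anti-join in SQLite, using a fixed reference date ('2025-01-01', an assumption chosen for reproducibility) in place of CURRENT_DATE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INT, hire_date TEXT);
CREATE TABLE promotions (employee_id INT, promotion_date TEXT);
INSERT INTO employees VALUES (1, '2015-01-01'), (2, '2023-06-01'), (3, '2010-05-01');
INSERT INTO promotions VALUES (3, '2018-01-01');
""")

rows = conn.execute("""
SELECT e.employee_id
FROM employees e
LEFT JOIN promotions p ON e.employee_id = p.employee_id
WHERE julianday('2025-01-01') - julianday(e.hire_date) >= 5 * 365  -- ~5+ years
  AND p.employee_id IS NULL                                        -- never promoted
""").fetchall()
print(rows)
```

Employee 2 fails the tenure check and employee 3 fails the anti-join, leaving only employee 1.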
PATTERN 12 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q69 | Days between first and last post in year | MAX(date) - MIN(date) + HAVING COUNT >= 2 |
| Q70 | Average hours from order placement to delivery | AVG(DATEDIFF(delivered_at, placed_at) × 24) |
| Q71 | Employees 5+ years without promotion | DATEDIFF > 1825 AND LEFT JOIN promotions WHERE IS NULL |
| Q72 | % cancelled orders and revenue lost (past 90 days) | WHERE date >= current - 90 + conditional aggregation |
| Q73 | Subscription renewals more than 7 days late | DATEDIFF(renewal_date, expected_renewal_date) > 7 |
⚠️ COMMON TRAPS IN ANALYTICS PATTERNS
CONDITIONAL AGGREGATION TRAPS
TRAP 1: Division by zero in ratios
CTR = clicks / impressions → fails if impressions = 0 for some group
Fix: ROUND(100.0 * clicks / NULLIF(impressions, 0), 2)
NULLIF(x, 0) returns NULL if x = 0 → the division returns NULL (not an error)
TRAP 2: COUNT vs SUM for conditional counting
COUNT(CASE WHEN cond THEN 1 END) → works (NULL is not counted)
SUM(CASE WHEN cond THEN 1 ELSE 0 END) → works (explicit 0 for false)
SUM(CASE WHEN cond THEN 1 END) → mostly works, but returns NULL instead of 0 when no row matches (SUM ignores NULLs, and SUM of nothing is NULL)
The first two always agree; prefer them when a group might have zero matches
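The edge case shows up with a group that has zero matches. In this invented sketch, group 'b' has no row with x > 1, so the ELSE-less SUM returns NULL (None in Python) while the other two return 0:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (grp TEXT, x INT);
INSERT INTO t VALUES ('a', 2), ('a', 0), ('b', 0);
""")

rows = conn.execute("""
SELECT grp,
       COUNT(CASE WHEN x > 1 THEN 1 END)      AS cnt,          -- NULLs not counted
       SUM(CASE WHEN x > 1 THEN 1 ELSE 0 END) AS sum_else,     -- explicit 0 per row
       SUM(CASE WHEN x > 1 THEN 1 END)        AS sum_no_else   -- NULL when nothing matches
FROM t GROUP BY grp ORDER BY grp
""").fetchall()
print(rows)
```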
COHORT ANALYSIS TRAPS
TRAP 3: Using an arbitrary event date instead of the first event as the cohort
Cohort = first event date (MIN per user); a separate signup table is not required
If you use any event date as the cohort, the same user appears in multiple cohorts!
TRAP 4: Off-by-one in retention day calculation
Day-0 = cohort date (signup day)
Day-7 = signup_date + 7 (exactly 1 week later)
Some definitions: Day-7 = 7 days AFTER signup → signup_date + 7
Others: Day-7 = week 1 → the DATE_TRUNC('week') approach
DATE ARITHMETIC TRAPS
TRAP 5: DATEDIFF argument order (Hive vs Postgres vs SQL Server)
Hive/Spark: DATEDIFF(end_date, start_date) → returns end - start (positive if end > start)
SQL Server: DATEDIFF(unit, start_date, end_date) → start to end
PostgreSQL: end_date - start_date → plain integer subtraction (DATE minus DATE)
Sanity check (Hive/Spark order): DATEDIFF('2024-01-10', '2024-01-01') should return 9
TRAP 6: Comparing timestamps vs dates
login_date = '2024-01-15'::date ← fine if the column is DATE type
login_timestamp = '2024-01-15' ← WRONG if the column has a time component!
Fix: DATE(login_timestamp) = '2024-01-15', or better, a half-open range:
login_timestamp >= '2024-01-15' AND login_timestamp < '2024-01-16'
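A quick sketch of the trap with invented rows. In SQLite these values are plain strings, so the naive equality matches nothing at all; in an engine with a real TIMESTAMP type the midnight row would match and the others would not, which is exactly the bug. The DATE() cast and the half-open range both behave correctly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (login_ts TEXT);
INSERT INTO logins VALUES
  ('2024-01-15 08:30:00'), ('2024-01-15 00:00:00'), ('2024-01-16 01:00:00');
""")

def count_where(predicate):
    return conn.execute(f"SELECT COUNT(*) FROM logins WHERE {predicate}").fetchone()[0]

naive  = count_where("login_ts = '2024-01-15'")                               # time part breaks equality
fixed  = count_where("date(login_ts) = '2024-01-15'")                         # cast to date first
ranged = count_where("login_ts >= '2024-01-15' AND login_ts < '2024-01-16'")  # half-open range
print(naive, fixed, ranged)
```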