SQL Patterns 4–6: Gaps & Islands, Sessionization, Deduplication
These are the patterns that separate average SQL writers from senior engineers. Gaps & Islands is THE hardest commonly asked pattern. Master it.
🧠 MEMORY MAP
🧠 HARD PATTERNS = "GSD"
G = Gaps & Islands: find consecutive sequences (date - ROW_NUMBER = constant!)
S = Sessionization: group events by inactivity gap (LAG + cumulative SUM)
D = Deduplication: keep one row per key (ROW_NUMBER = 1)
GAPS & ISLANDS KEY TRICK:
If dates are CONSECUTIVE → (date - ROW_NUMBER) = SAME VALUE for the whole group!
When the sequence BREAKS → the (date - ROW_NUMBER) VALUE CHANGES
Group by that constant → each "island" becomes one group!
| date | row_num | date - row_num = group_key | island |
|---|---|---|---|
| 2024-01-01 | 1 | 2024-01-01 - 1 = 2023-12-31 | island 1 |
| 2024-01-02 | 2 | 2024-01-02 - 2 = 2023-12-31 | island 1 (same!) |
| 2024-01-03 | 3 | 2024-01-03 - 3 = 2023-12-31 | island 1 (same!) |
| (GAP: 2024-01-04 missing) | | | |
| 2024-01-05 | 4 | 2024-01-05 - 4 = 2024-01-01 | island 2 (different!) |
| 2024-01-06 | 5 | 2024-01-06 - 5 = 2024-01-01 | island 2 (same!) |
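The arithmetic above can be verified in a few lines of SQLite (window functions need SQLite 3.25+); the five-row `activity` table is invented for illustration, and date minus integer is spelled `date(d, '-N days')` in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (activity_date TEXT);
INSERT INTO activity VALUES
('2024-01-01'), ('2024-01-02'), ('2024-01-03'),  -- island 1
('2024-01-05'), ('2024-01-06');                  -- island 2 (01-04 missing)
""")

rows = conn.execute("""
SELECT activity_date,
ROW_NUMBER() OVER (ORDER BY activity_date) AS rn,
-- subtract rn days: constant within an island, changes at each gap
date(activity_date,
'-' || ROW_NUMBER() OVER (ORDER BY activity_date) || ' days') AS group_key
FROM activity
""").fetchall()

group_keys = [r[2] for r in rows]
```

`group_keys` comes back as three copies of `2023-12-31` followed by two copies of `2024-01-01`, exactly matching the table above.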
PATTERN 4: GAPS & ISLANDS
What Is It?
Find consecutive sequences ("islands") of events/dates, or the gaps between them. Classic use: login streaks, consecutive active days, periods of server uptime, stocks with N consecutive price increases.
Recognize It When You See:
- "Consecutive days / weeks"
- "Continuous period of [status]"
- "Longest streak of [activity]"
- "Start and end date of each [active/inactive] period"
- "N or more consecutive [events]"
- "Periods of server uptime/downtime"
THE TEMPLATE
sql
-- TEMPLATE: Gaps & Islands
-- Find all consecutive date sequences per user/entity
WITH activity AS (
-- Step 1: Get one row per user per active date (deduplicate if needed)
SELECT DISTINCT
user_id,
activity_date
FROM events
),
numbered AS (
-- Step 2: Assign row number PER USER ordered by date
SELECT
user_id,
activity_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY activity_date
) AS rn
FROM activity
),
islands AS (
-- Step 3: The KEY TRICK — subtract row number from date
-- Consecutive dates → same (date - rn) value = same island group
SELECT
user_id,
activity_date,
rn,
-- This creates a constant "group key" for each consecutive run:
(activity_date - INTERVAL '1 day' * (rn - 1)) AS island_group
-- In Hive/Spark: DATE_SUB(activity_date, rn - 1) AS island_group
FROM numbered
)
-- Step 4: Aggregate each island
SELECT
user_id,
MIN(activity_date) AS streak_start,
MAX(activity_date) AS streak_end,
COUNT(*) AS streak_length
FROM islands
GROUP BY user_id, island_group
ORDER BY user_id, streak_start;
SOLVED EXAMPLE — Q25 (StrataScratch Hard): Top 3 users by longest login streak
Problem:
Table: logins (user_id, login_date)
Find the top 3 users with the longest continuous login streak (no missing days).
Return: user_id, streak_start, streak_end, streak_length
sql
WITH deduplicated AS (
-- Remove duplicate logins on same day for same user
SELECT DISTINCT user_id, login_date
FROM logins
),
numbered AS (
SELECT
user_id,
login_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY login_date
) AS rn
FROM deduplicated
),
islands AS (
SELECT
user_id,
login_date,
-- THE TRICK: consecutive dates minus sequential row numbers = constant
(login_date - CAST(rn AS INT)) AS island_key
-- Postgres: date - integer = date (the CAST is needed because ROW_NUMBER returns bigint)
-- Hive/Spark: DATE_SUB(login_date, rn)
FROM numbered
),
streaks AS (
SELECT
user_id,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS streak_length
FROM islands
GROUP BY user_id, island_key
),
ranked_streaks AS (
SELECT
user_id,
streak_start,
streak_end,
streak_length,
RANK() OVER (ORDER BY streak_length DESC) AS streak_rank
FROM streaks
)
SELECT user_id, streak_start, streak_end, streak_length
FROM ranked_streaks
WHERE streak_rank <= 3
ORDER BY streak_length DESC;
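One way to sanity-check the streak logic is to port it to SQLite, where the date arithmetic becomes `date(d, '-N days')`; the sample `logins` rows below are made up (note the deliberate same-day duplicate for user 1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INT, login_date TEXT);
INSERT INTO logins VALUES
(1,'2024-01-01'),(1,'2024-01-01'),  -- duplicate same-day login
(1,'2024-01-02'),(1,'2024-01-03'),
(2,'2024-01-10'),(2,'2024-01-12');  -- user 2: two 1-day "streaks"
""")

rows = conn.execute("""
WITH deduplicated AS (
SELECT DISTINCT user_id, login_date FROM logins
),
numbered AS (
SELECT user_id, login_date,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
FROM deduplicated
),
islands AS (
SELECT user_id, login_date,
date(login_date, '-' || rn || ' days') AS island_key
FROM numbered
),
streaks AS (
SELECT user_id, MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end, COUNT(*) AS streak_length
FROM islands GROUP BY user_id, island_key
)
SELECT user_id, streak_start, streak_end, streak_length
FROM streaks ORDER BY streak_length DESC
""").fetchall()
```

The longest streak is user 1's three-day run; without the DISTINCT step the duplicate login would shift every later row number and break the island key.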
SOLVED EXAMPLE — Q29: Stocks with 3+ consecutive days of price increases
Problem:
Table: stock_prices (stock_symbol, price_date, closing_price)
Find all stocks that had consecutive days of price INCREASES for at least 3 days.
Return: stock_symbol, start_date, end_date, streak_length
sql
WITH price_direction AS (
SELECT
stock_symbol,
price_date,
closing_price,
LAG(closing_price) OVER (
PARTITION BY stock_symbol
ORDER BY price_date
) AS prev_price,
-- Flag: 1 if price increased from yesterday, 0 otherwise
-- (the LAG is repeated because the prev_price alias can't be
-- referenced in the same SELECT list)
CASE WHEN closing_price > LAG(closing_price) OVER (
PARTITION BY stock_symbol ORDER BY price_date
) THEN 1 ELSE 0 END AS is_increase
FROM stock_prices
),
increases_only AS (
-- Keep only the increasing-price rows
SELECT stock_symbol, price_date
FROM price_direction
WHERE is_increase = 1
),
numbered AS (
SELECT
stock_symbol,
price_date,
ROW_NUMBER() OVER (PARTITION BY stock_symbol ORDER BY price_date) AS rn
FROM increases_only
),
islands AS (
SELECT
stock_symbol,
price_date,
(price_date - CAST(rn AS INT)) AS island_key
FROM numbered
),
streaks AS (
SELECT
stock_symbol,
MIN(price_date) AS streak_start,
MAX(price_date) AS streak_end,
COUNT(*) AS streak_length
FROM islands
GROUP BY stock_symbol, island_key
)
SELECT stock_symbol, streak_start, streak_end, streak_length
FROM streaks
WHERE streak_length >= 3
ORDER BY streak_length DESC;
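A compact SQLite version of the same pipeline, with invented prices (AAA rises three days running, BBB falls), shows the flag-then-islands composition end to end:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock_prices (stock_symbol TEXT, price_date TEXT, closing_price REAL);
INSERT INTO stock_prices VALUES
('AAA','2024-01-01',10),('AAA','2024-01-02',11),
('AAA','2024-01-03',12),('AAA','2024-01-04',13),  -- 3 consecutive increases
('BBB','2024-01-01',20),('BBB','2024-01-02',19);  -- a decrease, no streak
""")

rows = conn.execute("""
WITH price_direction AS (
SELECT stock_symbol, price_date,
CASE WHEN closing_price > LAG(closing_price) OVER (
PARTITION BY stock_symbol ORDER BY price_date)
THEN 1 ELSE 0 END AS is_increase
FROM stock_prices
),
numbered AS (
SELECT stock_symbol, price_date,
ROW_NUMBER() OVER (PARTITION BY stock_symbol ORDER BY price_date) AS rn
FROM price_direction WHERE is_increase = 1  -- keep increase days only
),
streaks AS (
SELECT stock_symbol, MIN(price_date) AS streak_start,
MAX(price_date) AS streak_end, COUNT(*) AS streak_length
FROM numbered
GROUP BY stock_symbol, date(price_date, '-' || rn || ' days')
)
SELECT stock_symbol, streak_start, streak_end, streak_length
FROM streaks WHERE streak_length >= 3
""").fetchall()
```

Only AAA survives the `>= 3` filter; its streak starts on the second date because the first day of data has no previous price to compare against.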
PATTERN 4 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q25 | Top 3 users with longest login streak | Full gaps & islands + RANK on streak_length |
| Q26 | Employees who worked 5+ consecutive days | island_group on work_date, filter streak >= 5 |
| Q27 | Longest gap between orders per supplier | LEAD to get next_order, subtract dates, find max gap |
| Q28 | Users who ordered every week for 4+ consecutive weeks | DATE_TRUNC('week') as the unit, then gaps & islands |
| Q29 | Stocks with 3+ consecutive days of price increases | Add price_direction flag first, then gaps & islands on increases |
| Q30 | Date ranges when server was continuously online | Gaps & islands on status='up' rows |
| Q31 | Users with 30+ day inactivity gaps in history | LEAD to find next event, filter gap > 30 days |
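Q27 and Q31 in the table are "gap" questions rather than island questions: LEAD alone is enough, no grouping trick required. A minimal SQLite sketch (the `orders` table and its rows are hypothetical; `julianday` turns the date difference into a number of days):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (supplier_id INT, order_date TEXT);
INSERT INTO orders VALUES
(1,'2024-01-01'),(1,'2024-01-05'),(1,'2024-01-20');
""")

rows = conn.execute("""
WITH gaps AS (
SELECT supplier_id, order_date,
-- days until the next order; NULL on the last order (MAX ignores NULL)
julianday(LEAD(order_date) OVER (
PARTITION BY supplier_id ORDER BY order_date))
- julianday(order_date) AS gap_days
FROM orders
)
SELECT supplier_id, MAX(gap_days) AS longest_gap
FROM gaps GROUP BY supplier_id
""").fetchall()
```

The gaps are 4 and 15 days, so the longest gap per supplier is 15; for Q31 you would filter `gap_days > 30` instead of taking the MAX.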
PATTERN 5: SESSIONIZATION
What Is It?
Group raw event logs into sessions. A new session starts when a user has been inactive for more than a threshold (e.g., 30 minutes). Every event gets a session_id.
Recognize It When You See:
- "Group events into sessions"
- "Session = no activity for more than X minutes"
- "Count number of sessions per user"
- "Average session duration"
- "Did a conversion happen within the same session?"
THE TEMPLATE
sql
-- TEMPLATE: Sessionization (create session_id from events)
WITH events_with_prev AS (
SELECT
user_id,
event_time,
event_type,
-- Get the previous event time for this user
LAG(event_time) OVER (
PARTITION BY user_id
ORDER BY event_time
) AS prev_event_time
FROM events
),
session_flags AS (
SELECT
user_id,
event_time,
event_type,
-- Flag: 1 = start of a NEW session (gap > 30 min OR first event)
CASE
WHEN prev_event_time IS NULL THEN 1 -- first event ever
WHEN event_time - prev_event_time > INTERVAL '30 minutes' THEN 1 -- gap > threshold
ELSE 0
END AS is_new_session
FROM events_with_prev
),
sessions AS (
SELECT
user_id,
event_time,
event_type,
is_new_session,
-- Cumulative sum of new_session flags = session_id per user!
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_id
FROM session_flags
)
-- Now you have session_id on every event. Use it for aggregation:
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS events_in_session,
MAX(event_time) - MIN(event_time) AS session_duration
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_id;
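The LAG-flag-cumsum chain can be exercised in SQLite (which measures time differences via `julianday`); the `events` rows below are invented, with a 50-minute gap that should open a second session for user 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INT, event_time TEXT);
INSERT INTO events VALUES
(1,'2024-01-01 09:00:00'),(1,'2024-01-01 09:10:00'),  -- session 1
(1,'2024-01-01 10:00:00'),                            -- 50-min gap -> session 2
(2,'2024-01-01 09:00:00');
""")

rows = conn.execute("""
WITH flagged AS (
SELECT user_id, event_time,
CASE WHEN LAG(event_time) OVER w IS NULL THEN 1  -- first event ever
WHEN (julianday(event_time)
- julianday(LAG(event_time) OVER w)) * 24 * 60 > 30 THEN 1  -- gap > 30 min
ELSE 0 END AS is_new_session
FROM events
WINDOW w AS (PARTITION BY user_id ORDER BY event_time)
)
SELECT user_id, event_time,
SUM(is_new_session) OVER (
PARTITION BY user_id ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM flagged ORDER BY user_id, event_time
""").fetchall()

session_ids = [r[2] for r in rows]
```

User 1's events get session_ids 1, 1, 2 and user 2's single event gets 1, confirming that the cumulative sum restarts per user and increments exactly at each flagged event.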
SOLVED EXAMPLE — Q32: Assign session_id (new session = 30 min inactivity)
Problem:
Table: user_events (user_id, event_id, event_time, event_type)
Assign a session_id to each event.
New session starts when user has been inactive for > 30 minutes.
sql
WITH events_ordered AS (
SELECT
user_id,
event_id,
event_time,
event_type,
LAG(event_time) OVER (
PARTITION BY user_id
ORDER BY event_time
) AS prev_event_time
FROM user_events
),
new_session_flags AS (
SELECT
user_id,
event_id,
event_time,
event_type,
CASE
WHEN prev_event_time IS NULL THEN 1
WHEN EXTRACT(EPOCH FROM (event_time - prev_event_time)) / 60 > 30 THEN 1
-- Hive/Spark: (UNIX_TIMESTAMP(event_time) - UNIX_TIMESTAMP(prev_event_time)) / 60 > 30
ELSE 0
END AS new_session
FROM events_ordered
)
SELECT
user_id,
event_id,
event_time,
event_type,
SUM(new_session) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_id -- starts at 1 for first session, increments at each new session
FROM new_session_flags
ORDER BY user_id, event_time;
PATTERN 5 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q32 | Assign session_id (30 min gap = new session) | LAG + CASE WHEN gap > threshold + cumulative SUM |
| Q33 | Average session duration per user | Sessionize first, then AVG(session_end - session_start) |
| Q34 | Users with most sessions in a month | Sessionize, then COUNT(DISTINCT session_id) per user per month |
| Q35 | Sessions that resulted in a purchase | Sessionize, then LEFT JOIN purchase events within session |
| Q36 | Viewers who later became streamers | ROW_NUMBER to find first session type per user |
PATTERN 6: DEDUPLICATION
What Is It?
Remove duplicate rows from a table, keeping only one record per key. Critical for ETL pipelines, CDC (Change Data Capture) processing, and data quality checks.
Recognize It When You See:
- "Remove duplicates / deduplicate"
- "Keep only the latest/most recent record"
- "CDC table — keep current state of each record"
- "ETL bug created duplicate rows"
- "Find rows that appear more than once"
THE TEMPLATE
sql
-- TEMPLATE: Deduplication — keep most recent record per key
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY unique_key_col -- "what defines a duplicate?"
ORDER BY updated_at DESC -- "which one to keep?" (DESC = latest first)
) AS rn
FROM source_table
)
SELECT * EXCEPT (rn) -- EXCEPT is BigQuery/Databricks syntax; elsewhere list the columns explicitly
FROM ranked
WHERE rn = 1; -- keep only the first row (= latest record) per key
-- ⚠️ ROW_NUMBER always gives unique ranks (no ties)
-- → exactly 1 row per partition will have rn = 1
-- ⚠️ RANK() could give multiple rows with rank=1 if tie on updated_at → unsafe for dedup
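The template runs unchanged in SQLite apart from listing the columns instead of `EXCEPT`; the `source_table` rows below are invented, with one duplicated key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_table (unique_key_col INT, payload TEXT, updated_at TEXT);
INSERT INTO source_table VALUES
(1,'old','2024-01-01'),(1,'new','2024-02-01'),  -- duplicate key 1
(2,'only','2024-01-15');
""")

rows = conn.execute("""
WITH ranked AS (
SELECT unique_key_col, payload, updated_at,
ROW_NUMBER() OVER (
PARTITION BY unique_key_col ORDER BY updated_at DESC) AS rn
FROM source_table
)
SELECT unique_key_col, payload FROM ranked WHERE rn = 1
ORDER BY unique_key_col
""").fetchall()
```

Key 1 keeps only its latest payload and key 2 passes through untouched, i.e. exactly one row per key survives.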
SOLVED EXAMPLE — Q38 (Databricks/DE): CDC dedup — keep most recent per customer
Problem:
Table: customer_cdc (customer_id, name, email, address, modified_at)
This is a CDC (change data capture) table where every UPDATE creates a new row.
Keep only the most recent record per customer_id.
sql
WITH latest_record AS (
SELECT
customer_id,
name,
email,
address,
modified_at,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY modified_at DESC -- latest timestamp first → rn=1 is most recent
) AS rn
FROM customer_cdc
)
SELECT customer_id, name, email, address, modified_at
FROM latest_record
WHERE rn = 1;
-- ⚠️ In production Databricks/Delta Lake this becomes a MERGE:
-- MERGE INTO customer_dim t
-- USING (SELECT ... FROM latest_record WHERE rn = 1) s
-- ON t.customer_id = s.customer_id
-- WHEN MATCHED THEN UPDATE SET ...
-- WHEN NOT MATCHED THEN INSERT ...
-- This is the SCD Type 1 (overwrite) pattern
SOLVED EXAMPLE — Q40: Users with more than one account (same email, different user_ids)
sql
-- Approach 1: Find the emails that have duplicates
SELECT
email,
COUNT(DISTINCT user_id) AS account_count
FROM users
GROUP BY email
HAVING COUNT(DISTINCT user_id) > 1;
-- Approach 2: Return ALL user_ids that share an email with another user
SELECT u.user_id, u.email
FROM users u
WHERE u.email IN (
SELECT email
FROM users
GROUP BY email
HAVING COUNT(DISTINCT user_id) > 1
)
ORDER BY u.email, u.user_id;
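Approach 1 is easy to demo in SQLite; the three `users` rows are invented, with one shared email:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT, email TEXT);
INSERT INTO users VALUES
(1,'a@x.com'),(2,'a@x.com'),  -- two accounts, same email
(3,'b@x.com');
""")

rows = conn.execute("""
SELECT email, COUNT(DISTINCT user_id) AS account_count
FROM users GROUP BY email
HAVING COUNT(DISTINCT user_id) > 1
""").fetchall()
```

Only the shared email comes back, with its account count; feeding this result into the Approach 2 subquery would return user_ids 1 and 2.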
PATTERN 6 QUESTIONS
| Q# | Question | Key Insight |
|---|---|---|
| Q37 | Count how many times each customer_id is duplicated | GROUP BY + HAVING COUNT > 1 |
| Q38 | CDC table — keep most recent record per customer | ROW_NUMBER PARTITION BY id ORDER BY modified_at DESC, keep rn=1 |
| Q39 | Duplicate transaction_ids — keep highest amount record | ROW_NUMBER PARTITION BY txn_id ORDER BY amount DESC |
| Q40 | Users with multiple accounts (same email) | GROUP BY email HAVING COUNT(DISTINCT user_id) > 1 |
| Q41 | Deduplicate user_profiles keeping lowest user_id per email | ROW_NUMBER PARTITION BY email ORDER BY user_id ASC, keep rn=1 |
⚠️ COMMON TRAPS IN HARD PATTERNS
🧠 Memory Map
GAPS & ISLANDS TRAPS:
TRAP 1: Forgetting to DISTINCT first
Same user can have multiple events on same day!
Always: SELECT DISTINCT user_id, activity_date BEFORE the ROW_NUMBER step
Otherwise: two events on 2024-01-01 give rn=1 and rn=2
→ dates - row numbers are different even though dates are consecutive!
TRAP 2: Integer vs Date subtraction syntax
PostgreSQL: date_col - rn (date - integer = date, works directly)
Hive/Spark: DATE_SUB(date_col, rn)
MySQL: DATE_SUB(date_col, INTERVAL rn DAY)
TRAP 3: Weekly streaks need DATE_TRUNC first
For "consecutive WEEKS" (not days): DATE_TRUNC('week', date) as the unit
Then gaps & islands on the week-truncated date
SESSIONIZATION TRAPS
TRAP 4: Hive/Spark UNIX_TIMESTAMP for time differences
PostgreSQL: event_time - prev_event_time > INTERVAL '30 minutes'
Hive/Spark: (UNIX_TIMESTAMP(event_time) - UNIX_TIMESTAMP(prev_event_time)) / 60 > 30
TRAP 5: session_id starts at 1 automatically
The running SUM(new_session_flag) starts at 1 (not 0) because the first event has new_session = 1
So session_id = 1 for the first session → good!
DEDUPLICATION TRAPS
TRAP 6: RANK() vs ROW_NUMBER() for dedup
RANK() with ties: multiple rows can have rank = 1 → keeps ALL tied rows!
ROW_NUMBER() → always exactly ONE row per partition → safe for dedup
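Trap 6 can be demonstrated directly: with two rows tied on `updated_at` (invented data), RANK marks both as rank 1 while ROW_NUMBER breaks the tie:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, updated_at TEXT);
INSERT INTO t VALUES (1,'2024-01-01'),(1,'2024-01-01');  -- tie on timestamp
""")

row = conn.execute("""
SELECT
SUM(CASE WHEN rk = 1 THEN 1 ELSE 0 END) AS kept_by_rank,
SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) AS kept_by_row_number
FROM (
SELECT RANK()       OVER (PARTITION BY id ORDER BY updated_at DESC) AS rk,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
FROM t)
""").fetchone()
```

RANK would keep both tied rows (count 2) while ROW_NUMBER keeps exactly one, which is why dedup must use ROW_NUMBER; if the surviving row matters, add a tiebreaker column to the ORDER BY.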