SQL Patterns 4–6: Gaps & Islands, Sessionization, Deduplication

SQL · Section 3 of 9
These are the patterns that separate average SQL writers from senior engineers. Gaps & Islands is THE hardest commonly-asked pattern. Master it.

🧠 MEMORY MAP

🧠 HARD PATTERNS = "GSD"
G = Gaps & Islands: find consecutive sequences (date - ROW_NUMBER = constant!)
S = Sessionization: group events by inactivity gap (LAG + cumulative SUM)
D = Deduplication: keep one row per key (ROW_NUMBER = 1)

GAPS & ISLANDS KEY TRICK:
If dates are CONSECUTIVE → (date - ROW_NUMBER) = SAME VALUE for the whole group!
When the sequence BREAKS → the (date - ROW_NUMBER) VALUE CHANGES
Group by that constant → each "island" becomes one group!

date         row_num   date - row_num = group_key
2024-01-01   1         2024-01-01 - 1 = 2023-12-31   ← island 1
2024-01-02   2         2024-01-02 - 2 = 2023-12-31   ← island 1 (same!)
2024-01-03   3         2024-01-03 - 3 = 2023-12-31   ← island 1 (same!)
             ← GAP: 01-04 missing →
2024-01-05   4         2024-01-05 - 4 = 2024-01-01   ← island 2 (different!)
2024-01-06   5         2024-01-06 - 5 = 2024-01-01   ← island 2 (same!)
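
The arithmetic in the map above can be sanity-checked in a few lines of plain Python (an illustrative sketch, not part of the SQL itself; the dates are the same toy sequence as the table):

```python
from datetime import date, timedelta

# Same sequence as above: 2024-01-04 is missing, so there are two islands
dates = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3),
         date(2024, 1, 5), date(2024, 1, 6)]

# enumerate(start=1) plays the role of ROW_NUMBER()
keys = [d - timedelta(days=rn) for rn, d in enumerate(dates, start=1)]
print(keys)
# first three dates share one key, the last two share another:
# the key only changes where the gap is
```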

PATTERN 4: GAPS & ISLANDS

What Is It?

Find consecutive sequences ("islands") of events/dates, or the gaps between them. Classic use: login streaks, consecutive active days, periods of server uptime, stocks with N consecutive price increases.

Recognize It When You See:

  • "Consecutive days / weeks"
  • "Continuous period of [status]"
  • "Longest streak of [activity]"
  • "Start and end date of each [active/inactive] period"
  • "N or more consecutive [events]"
  • "Periods of server uptime/downtime"

THE TEMPLATE

sql
-- TEMPLATE: Gaps & Islands
-- Find all consecutive date sequences per user/entity

WITH activity AS (
    -- Step 1: Get one row per user per active date (deduplicate if needed)
    SELECT DISTINCT
        user_id,
        activity_date
    FROM events
),

numbered AS (
    -- Step 2: Assign row number PER USER ordered by date
    SELECT
        user_id,
        activity_date,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY activity_date
        ) AS rn
    FROM activity
),

islands AS (
    -- Step 3: The KEY TRICK — subtract row number from date
    -- Consecutive dates → same (date - rn) value = same island group
    SELECT
        user_id,
        activity_date,
        rn,
        -- This creates a constant "group key" for each consecutive run:
        (activity_date - INTERVAL '1 day' * (rn - 1)) AS island_group
        -- Hive/Spark: DATE_SUB(activity_date, rn - 1) AS island_group
    FROM numbered
)

-- Step 4: Aggregate each island
SELECT
    user_id,
    MIN(activity_date) AS streak_start,
    MAX(activity_date) AS streak_end,
    COUNT(*) AS streak_length
FROM islands
GROUP BY user_id, island_group
ORDER BY user_id, streak_start;
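
To see the whole template run end to end, here is a minimal sketch against SQLite through Python's sqlite3 module (toy data invented for the demo; SQLite has no date-minus-integer operator, so the island key uses date(col, '-N days') instead):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (user_id INT, activity_date TEXT);
INSERT INTO events VALUES
  (1,'2024-01-01'),(1,'2024-01-01'),  -- duplicate same-day event
  (1,'2024-01-02'),(1,'2024-01-03'),
  (1,'2024-01-05'),(1,'2024-01-06');  -- 01-04 missing: two islands
""")
rows = con.execute("""
WITH activity AS (
    SELECT DISTINCT user_id, activity_date FROM events
),
numbered AS (
    SELECT user_id, activity_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date) AS rn
    FROM activity
),
islands AS (
    SELECT user_id, activity_date,
           date(activity_date, '-' || (rn - 1) || ' days') AS island_group
    FROM numbered
)
SELECT user_id, MIN(activity_date), MAX(activity_date), COUNT(*)
FROM islands
GROUP BY user_id, island_group
ORDER BY user_id, MIN(activity_date);
""").fetchall()
print(rows)  # [(1, '2024-01-01', '2024-01-03', 3), (1, '2024-01-05', '2024-01-06', 2)]
```

The four CTE steps are identical in Postgres/Hive/Spark; only the date-subtraction syntax differs.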

SOLVED EXAMPLE — Q25 (StrataScratch Hard): Top 3 users by longest login streak

Problem:
Table: logins (user_id, login_date)
Find the top 3 users with the longest continuous login streak (no missing days).
Return: user_id, streak_start, streak_end, streak_length
sql
WITH deduplicated AS (
    -- Remove duplicate logins on same day for same user
    SELECT DISTINCT user_id, login_date
    FROM logins
),

numbered AS (
    SELECT
        user_id,
        login_date,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY login_date
        ) AS rn
    FROM deduplicated
),

islands AS (
    SELECT
        user_id,
        login_date,
        -- THE TRICK: consecutive dates minus sequential row numbers = constant
        (login_date - CAST(rn AS INT)) AS island_key
        -- Postgres: login_date - rn  (integer subtraction from date)
        -- Hive/Spark: DATE_SUB(login_date, rn)
    FROM numbered
),

streaks AS (
    SELECT
        user_id,
        MIN(login_date) AS streak_start,
        MAX(login_date) AS streak_end,
        COUNT(*) AS streak_length
    FROM islands
    GROUP BY user_id, island_key
),

ranked_streaks AS (
    SELECT
        user_id,
        streak_start,
        streak_end,
        streak_length,
        RANK() OVER (ORDER BY streak_length DESC) AS streak_rank
    FROM streaks
)

SELECT user_id, streak_start, streak_end, streak_length
FROM ranked_streaks
WHERE streak_rank <= 3
ORDER BY streak_length DESC;

SOLVED EXAMPLE — Q29: Stocks with 3+ consecutive days of price increases

Problem:
Table: stock_prices (stock_symbol, price_date, closing_price)
Find all stocks that had consecutive days of price INCREASES for at least 3 days.
Return: stock_symbol, start_date, end_date, streak_length
sql
WITH price_direction AS (
    SELECT
        stock_symbol,
        price_date,
        closing_price,
        LAG(closing_price) OVER (
            PARTITION BY stock_symbol
            ORDER BY price_date
        ) AS prev_price,
        -- Flag: 1 if price increased from yesterday, 0 otherwise
        CASE WHEN closing_price > LAG(closing_price) OVER (
            PARTITION BY stock_symbol ORDER BY price_date
        ) THEN 1 ELSE 0 END AS is_increase
    FROM stock_prices
),

increases_only AS (
    -- Keep only the increasing-price rows
    SELECT stock_symbol, price_date
    FROM price_direction
    WHERE is_increase = 1
),

numbered AS (
    SELECT
        stock_symbol,
        price_date,
        ROW_NUMBER() OVER (PARTITION BY stock_symbol ORDER BY price_date) AS rn
    FROM increases_only
),

islands AS (
    SELECT
        stock_symbol,
        price_date,
        (price_date - CAST(rn AS INT)) AS island_key
    FROM numbered
),

streaks AS (
    SELECT
        stock_symbol,
        MIN(price_date) AS streak_start,
        MAX(price_date) AS streak_end,
        COUNT(*) AS streak_length
    FROM islands
    GROUP BY stock_symbol, island_key
)

SELECT stock_symbol, streak_start, streak_end, streak_length
FROM streaks
WHERE streak_length >= 3
ORDER BY streak_length DESC;
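
The flag-then-island pipeline above can be exercised on SQLite with made-up prices (a sketch, assuming the date(col, '-N days') form of date arithmetic):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stock_prices (stock_symbol TEXT, price_date TEXT, closing_price REAL);
INSERT INTO stock_prices VALUES
  ('AAA','2024-01-01',10),('AAA','2024-01-02',11),
  ('AAA','2024-01-03',12),('AAA','2024-01-04',13),
  ('AAA','2024-01-05',12),                        -- streak breaks here
  ('BBB','2024-01-01',5),('BBB','2024-01-02',4);  -- no increases at all
""")
rows = con.execute("""
WITH price_direction AS (
    SELECT stock_symbol, price_date,
           CASE WHEN closing_price > LAG(closing_price) OVER (
               PARTITION BY stock_symbol ORDER BY price_date)
           THEN 1 ELSE 0 END AS is_increase
    FROM stock_prices
),
numbered AS (
    -- keep only increase rows, THEN number them
    SELECT stock_symbol, price_date,
           ROW_NUMBER() OVER (PARTITION BY stock_symbol ORDER BY price_date) AS rn
    FROM price_direction WHERE is_increase = 1
),
streaks AS (
    SELECT stock_symbol, MIN(price_date) AS streak_start,
           MAX(price_date) AS streak_end, COUNT(*) AS streak_length
    FROM numbered
    GROUP BY stock_symbol, date(price_date, '-' || rn || ' days')
)
SELECT * FROM streaks WHERE streak_length >= 3;
""").fetchall()
print(rows)  # [('AAA', '2024-01-02', '2024-01-04', 3)]
```

Note that streak_start is the first day the price was already higher (01-02), matching the query above, not the day before the run began.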

PATTERN 4 QUESTIONS

Q#  | Question                                              | Key Insight
Q25 | Top 3 users with longest login streak                 | Full gaps & islands + RANK on streak_length
Q26 | Employees who worked 5+ consecutive days              | island_group on work_date, filter streak >= 5
Q27 | Longest gap between orders per supplier               | LEAD to get next_order, subtract dates, find max gap
Q28 | Users who ordered every week for 4+ consecutive weeks | DATE_TRUNC('week') as the unit, then gaps & islands
Q29 | Stocks with 3+ consecutive days of price increases    | Add price_direction flag first, then gaps & islands on increases
Q30 | Date ranges when server was continuously online       | Gaps & islands on status='up' rows
Q31 | Users with 30+ day inactivity gaps in history         | LEAD to find next event, filter gap > 30 days

PATTERN 5: SESSIONIZATION

What Is It?

Group raw event logs into sessions. A new session starts when a user has been inactive for more than a threshold (e.g., 30 minutes). Every event gets a session_id.

Recognize It When You See:

  • "Group events into sessions"
  • "Session = no activity for more than X minutes"
  • "Count number of sessions per user"
  • "Average session duration"
  • "Did a conversion happen within the same session?"

THE TEMPLATE

sql
-- TEMPLATE: Sessionization (create session_id from events)

WITH events_with_prev AS (
    SELECT
        user_id,
        event_time,
        event_type,
        -- Get the previous event time for this user
        LAG(event_time) OVER (
            PARTITION BY user_id
            ORDER BY event_time
        ) AS prev_event_time
    FROM events
),

session_flags AS (
    SELECT
        user_id,
        event_time,
        event_type,
        -- Flag: 1 = start of a NEW session (gap > 30 min OR first event)
        CASE
            WHEN prev_event_time IS NULL THEN 1                       -- first event ever
            WHEN event_time - prev_event_time > INTERVAL '30 minutes' THEN 1  -- gap > threshold
            ELSE 0
        END AS is_new_session
    FROM events_with_prev
),

sessions AS (
    SELECT
        user_id,
        event_time,
        event_type,
        is_new_session,
        -- Cumulative sum of new_session flags = session_id per user!
        SUM(is_new_session) OVER (
            PARTITION BY user_id
            ORDER BY event_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_id
    FROM session_flags
)

-- Now you have session_id on every event. Use it for aggregation:
SELECT
    user_id,
    session_id,
    MIN(event_time) AS session_start,
    MAX(event_time) AS session_end,
    COUNT(*) AS events_in_session,
    MAX(event_time) - MIN(event_time) AS session_duration
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_id;
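
Here is the template run against SQLite via Python (toy events; SQLite has no INTERVAL type, so the 30-minute gap is compared in epoch seconds via strftime('%s', ...)):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (user_id INT, event_time TEXT);
INSERT INTO events VALUES
  (1,'2024-01-01 09:00:00'),
  (1,'2024-01-01 09:10:00'),   -- 10 min gap: same session
  (1,'2024-01-01 10:00:00'),   -- 50 min gap: NEW session
  (1,'2024-01-01 10:05:00');
""")
rows = con.execute("""
WITH events_with_prev AS (
    SELECT user_id, event_time,
           LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_event_time
    FROM events
),
session_flags AS (
    SELECT user_id, event_time,
           CASE
               WHEN prev_event_time IS NULL THEN 1
               WHEN strftime('%s', event_time) - strftime('%s', prev_event_time) > 30 * 60 THEN 1
               ELSE 0
           END AS is_new_session
    FROM events_with_prev
)
SELECT user_id, event_time,
       SUM(is_new_session) OVER (
           PARTITION BY user_id ORDER BY event_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM session_flags
ORDER BY user_id, event_time;
""").fetchall()
print([r[2] for r in rows])  # [1, 1, 2, 2]
```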

SOLVED EXAMPLE — Q32: Assign session_id (new session = 30 min inactivity)

Problem:
Table: user_events (user_id, event_id, event_time, event_type)
Assign a session_id to each event.
New session starts when user has been inactive for > 30 minutes.
sql
WITH events_ordered AS (
    SELECT
        user_id,
        event_id,
        event_time,
        event_type,
        LAG(event_time) OVER (
            PARTITION BY user_id
            ORDER BY event_time
        ) AS prev_event_time
    FROM user_events
),

new_session_flags AS (
    SELECT
        user_id,
        event_id,
        event_time,
        event_type,
        CASE
            WHEN prev_event_time IS NULL THEN 1
            WHEN EXTRACT(EPOCH FROM (event_time - prev_event_time)) / 60 > 30 THEN 1
            -- Hive/Spark: (UNIX_TIMESTAMP(event_time) - UNIX_TIMESTAMP(prev_event_time)) / 60 > 30
            ELSE 0
        END AS new_session
    FROM events_ordered
)

SELECT
    user_id,
    event_id,
    event_time,
    event_type,
    SUM(new_session) OVER (
        PARTITION BY user_id
        ORDER BY event_time
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS session_id   -- starts at 1 for first session, increments at each new session
FROM new_session_flags
ORDER BY user_id, event_time;

PATTERN 5 QUESTIONS

Q#  | Question                                      | Key Insight
Q32 | Assign session_id (30 min gap = new session)  | LAG + CASE WHEN gap > threshold + cumulative SUM
Q33 | Average session duration per user             | Sessionize first, then AVG(session_end - session_start)
Q34 | Users with most sessions in a month           | Sessionize, then COUNT(DISTINCT session_id) per user per month
Q35 | Sessions that resulted in a purchase          | Sessionize, then LEFT JOIN purchase events within session
Q36 | Viewers who later became streamers            | ROW_NUMBER to find first session type per user

PATTERN 6: DEDUPLICATION

What Is It?

Remove duplicate rows from a table, keeping only one record per key. Critical for ETL pipelines, CDC (Change Data Capture) processing, and data quality checks.

Recognize It When You See:

  • "Remove duplicates / deduplicate"
  • "Keep only the latest/most recent record"
  • "CDC table — keep current state of each record"
  • "ETL bug created duplicate rows"
  • "Find rows that appear more than once"

THE TEMPLATE

sql
-- TEMPLATE: Deduplication — keep most recent record per key
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY unique_key_col     -- "what defines a duplicate?"
            ORDER BY updated_at DESC        -- "which one to keep?" (DESC = latest first)
        ) AS rn
    FROM source_table
)

SELECT * EXCEPT (rn)   -- or SELECT col1, col2... (exclude the rn column)
FROM ranked
WHERE rn = 1;          -- keep only the first row (= latest record) per key

-- ⚠️ ROW_NUMBER always gives unique ranks (no ties)
-- → exactly 1 row per partition will have rn = 1
-- ⚠️ RANK() could give multiple rows with rank=1 if tie on updated_at → unsafe for dedup
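
A minimal SQLite run of the template through Python (hypothetical rows; SQLite lacks SELECT * EXCEPT, so the kept columns are listed explicitly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source_table (id INT, val TEXT, updated_at TEXT);
INSERT INTO source_table VALUES
  (1,'old','2024-01-01'),
  (1,'new','2024-02-01'),   -- later update: this row should win
  (2,'only','2024-01-15');
""")
rows = con.execute("""
WITH ranked AS (
    SELECT id, val, updated_at,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM source_table
)
SELECT id, val, updated_at FROM ranked WHERE rn = 1 ORDER BY id;
""").fetchall()
print(rows)  # [(1, 'new', '2024-02-01'), (2, 'only', '2024-01-15')]
```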

SOLVED EXAMPLE — Q38 (Databricks/DE): CDC dedup — keep most recent per customer

Problem:
Table: customer_cdc (customer_id, name, email, address, modified_at)
This is a CDC (change data capture) table where every UPDATE creates a new row.
Keep only the most recent record per customer_id.
sql
WITH latest_record AS (
    SELECT
        customer_id,
        name,
        email,
        address,
        modified_at,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY modified_at DESC    -- latest timestamp first → rn=1 is most recent
        ) AS rn
    FROM customer_cdc
)

SELECT customer_id, name, email, address, modified_at
FROM latest_record
WHERE rn = 1;

-- ⚠️ In production Databricks/Delta Lake, the same dedup feeds a MERGE:
-- MERGE INTO customer_dim t
-- USING (SELECT ... WHERE rn = 1) s
-- ON t.customer_id = s.customer_id
-- WHEN MATCHED THEN UPDATE SET ...
-- This is the SCD Type 1 (overwrite) pattern

SOLVED EXAMPLE — Q40: Users with more than one account (same email, different user_ids)

sql
-- Approach 1: Find the emails that have duplicates
SELECT
    email,
    COUNT(DISTINCT user_id) AS account_count
FROM users
GROUP BY email
HAVING COUNT(DISTINCT user_id) > 1;

-- Approach 2: Return ALL user_ids that share an email with another user
SELECT u.user_id, u.email
FROM users u
WHERE u.email IN (
    SELECT email
    FROM users
    GROUP BY email
    HAVING COUNT(DISTINCT user_id) > 1
)
ORDER BY u.email, u.user_id;
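
Approach 1 can be verified quickly on SQLite with toy rows (illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (user_id INT, email TEXT);
INSERT INTO users VALUES
  (1,'a@x.com'),(2,'a@x.com'),  -- two accounts, same email
  (3,'b@x.com');
""")
rows = con.execute("""
SELECT email, COUNT(DISTINCT user_id) AS account_count
FROM users
GROUP BY email
HAVING COUNT(DISTINCT user_id) > 1;
""").fetchall()
print(rows)  # [('a@x.com', 2)]
```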

PATTERN 6 QUESTIONS

Q#  | Question                                                    | Key Insight
Q37 | Count how many times each customer_id is duplicated         | GROUP BY + HAVING COUNT > 1
Q38 | CDC table: keep most recent record per customer             | ROW_NUMBER PARTITION BY id ORDER BY modified_at DESC, keep rn=1
Q39 | Duplicate transaction_ids: keep highest-amount record       | ROW_NUMBER PARTITION BY txn_id ORDER BY amount DESC
Q40 | Users with multiple accounts (same email)                   | GROUP BY email HAVING COUNT(DISTINCT user_id) > 1
Q41 | Deduplicate user_profiles keeping lowest user_id per email  | ROW_NUMBER PARTITION BY email ORDER BY user_id ASC, keep rn=1

⚠️ COMMON TRAPS IN HARD PATTERNS

🧠 Memory Map
GAPS & ISLANDS TRAPS:
TRAP 1: Forgetting to DISTINCT first
Same user can have multiple events on same day!
Always: SELECT DISTINCT user_id, activity_date BEFORE the ROW_NUMBER step
Otherwise: two events on 2024-01-01 give rn=1 and rn=2
→ dates - row numbers are different even though dates are consecutive!
TRAP 2: Integer vs Date subtraction syntax
PostgreSQL: date_col - rn (date - integer = date, works directly)
Hive/Spark: DATE_SUB(date_col, rn)
MySQL: DATE_SUB(date_col, INTERVAL rn DAY)
TRAP 3: Weekly streaks need DATE_TRUNC first
For "consecutive WEEKS" (not days): DATE_TRUNC('week', date) as the unit
Then gaps & islands on the week-truncated date
SESSIONIZATION TRAPS:
TRAP 4: Hive/Spark UNIX_TIMESTAMP for time differences
PostgreSQL: event_time - prev_event_time > INTERVAL '30 minutes'
Hive/Spark: (UNIX_TIMESTAMP(event_time) - UNIX_TIMESTAMP(prev_event_time)) / 60 > 30
TRAP 5: Session_id starts at 1 automatically
SUM(new_session_flag) starts at 1 (not 0) because the first event has new_session = 1
So session_id = 1 for the first session → exactly what you want!
DEDUPLICATION TRAPS:
TRAP 6: RANK() vs ROW_NUMBER() for dedup
RANK() with ties: multiple rows can have rank=1 → keeps ALL tied rows!
ROW_NUMBER() → always exactly ONE row per partition → safe for dedup
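
TRAP 6 is easy to demonstrate: on a tied updated_at, RANK() leaves two rows at rank 1 while ROW_NUMBER() leaves exactly one. A small SQLite sketch (toy data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INT, val TEXT, updated_at TEXT);
INSERT INTO t VALUES
  (1,'a','2024-01-01'),
  (1,'b','2024-01-01');   -- tie on updated_at!
""")
q = """
SELECT COUNT(*) FROM (
    SELECT {fn}() OVER (PARTITION BY id ORDER BY updated_at DESC) AS r
    FROM t
) WHERE r = 1;
"""
rank_rows = con.execute(q.format(fn="RANK")).fetchone()[0]
rownum_rows = con.execute(q.format(fn="ROW_NUMBER")).fetchone()[0]
print(rank_rows, rownum_rows)  # 2 1  ← RANK keeps both tied rows, ROW_NUMBER keeps one
```

To make ROW_NUMBER's pick deterministic under ties, add a tiebreaker column to the ORDER BY (e.g. ORDER BY updated_at DESC, id DESC).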