🗃️
SQL
SQL Patterns 7–9: Self-Join, Cross Join, Market Basket
🗃️
🗃️
SQL · Section 4 of 9

SQL Patterns 7–9: Self-Join, Cross Join, Market Basket

SQL Patterns 7–9: Self-Join, Cross Join, Market Basket

💡 Insight
Join-based patterns. The key insight: SQL can join a table to ITSELF. Once you understand this, a whole category of questions becomes easy.

🧠 MEMORY MAP

🧠 JOIN PATTERNS = "SCM"
JOIN PATTERNS"SCM"
SSelf-Join: same table twice (hierarchy, pairs, comparisons)
CCross Join: every row × every row (combinations, grids)
MMarket Basket: self-join on order_id (items bought together)
SELF-JOIN KEY RULE:
Use TWO aliases for the same table: FROM employees e1 JOIN employees e2
The JOIN condition defines the RELATIONSHIP between the two copies:
→ e2.manager_id = e1.id (hierarchy)
→ e1.score - e2.score <= 5 (similarity)
→ e1.item_id < e2.item_id (pairs, avoid duplicates)
CROSS JOIN RULE
N rows × M rows = N×M rows (Cartesian product)
Use only when you WANT every combination (grids, matchups, combinations)
⚠️Never accidentally CROSS JOIN large tables → explodes data!
MARKET BASKET KEY RULE
Self-join order_items ON same order_id BUT different product
Use: o1.product_id < o2.product_idprevents (A,B) AND (B,A) both appearing

PATTERN 7: SELF-JOIN

What Is It?

Join a table to itself to compare rows within the same table, traverse hierarchies, or find relationships between records of the same type.

Recognize It When You See:

  • "Employee and their manager" (same employees table)
  • "Find pairs that share [something]"
  • "Users who both attended [same event]"
  • "Salary higher than [their own manager]"
  • "Students who scored within N points of each other"
  • "Players who beat [same opponent] twice"

THE TEMPLATE

sql
-- TEMPLATE: Self-Join (two aliases for same table)

-- Pattern A: Hierarchy (employee → manager)
SELECT
    e.name AS employee_name,
    e.salary AS employee_salary,
    m.name AS manager_name,
    m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id   -- join condition defines relationship
WHERE e.salary > m.salary;                          -- filter condition

-- Pattern B: Finding pairs (avoid duplicates with < operator)
SELECT
    t1.id AS id1,
    t2.id AS id2,
    t1.col AS col1,
    t2.col AS col2
FROM table t1
JOIN table t2
    ON t1.shared_key = t2.shared_key    -- what links them
    AND t1.id < t2.id                   -- prevent (A,B) AND (B,A), prevent self-pairs
WHERE ABS(t1.value - t2.value) <= 5;   -- similarity condition

SOLVED EXAMPLE — Q42 (Amazon): Employees earning more than their manager

Problem:
Table: employees (employee_id, name, salary, manager_id)
manager_id references employee_id in the same table.
Find employees who earn MORE than their direct manager.
sql
SELECT
    e.name AS employee_name,
    e.salary AS employee_salary,
    m.name AS manager_name,
    m.salary AS manager_salary
FROM employees e
JOIN employees m
    ON e.manager_id = m.employee_id    -- link employee to their manager
WHERE e.salary > m.salary              -- filter: only employees earning more
ORDER BY e.salary DESC;

-- ⚠️ TRAP: Don't forget to handle NULLs!
--   Top-level managers have manager_id = NULL → JOIN filters them out (correct!)
--   Use INNER JOIN → rows without a manager are excluded automatically

SOLVED EXAMPLE — Q43 (Facebook): Friend recommendations via shared private events

Problem:
Table: event_rsvp (user_id, event_id, event_type)
Table: friendships (user_id_1, user_id_2)
Recommend friends: users who attended 2+ same PRIVATE events but aren't yet friends.
Return: user_id_1, user_id_2, shared_event_count
sql
WITH private_rsvp AS (
    -- Step 1: Only private events
    SELECT user_id, event_id
    FROM event_rsvp
    WHERE event_type = 'private'
),

shared_events AS (
    -- Step 2: Self-join to find users who attended same events
    SELECT
        r1.user_id AS user_1,
        r2.user_id AS user_2,
        COUNT(r1.event_id) AS shared_events
    FROM private_rsvp r1
    JOIN private_rsvp r2
        ON r1.event_id = r2.event_id      -- same event
        AND r1.user_id < r2.user_id        -- avoid (A,B) and (B,A), avoid self-pairs
    GROUP BY r1.user_id, r2.user_id
    HAVING COUNT(r1.event_id) >= 2         -- attended 2+ same events
),

existing_friends AS (
    -- Step 3: Normalize friendships table (ensure both directions covered)
    SELECT user_id_1 AS u1, user_id_2 AS u2 FROM friendships
    UNION
    SELECT user_id_2, user_id_1 FROM friendships
)

-- Step 4: Exclude pairs who are already friends
SELECT s.user_1, s.user_2, s.shared_events
FROM shared_events s
LEFT JOIN existing_friends f
    ON s.user_1 = f.u1 AND s.user_2 = f.u2
WHERE f.u1 IS NULL;   -- not already friends

PATTERN 7 QUESTIONS

Q#QuestionKey Insight
Q42Employees earning more than their managere JOIN e ON e.manager_id = m.id, WHERE e.salary > m.salary
Q43Friend recommendations via shared eventsSelf-join events + HAVING count >= 2 + exclude existing friends
Q44Student pairs in same class within 5 score pointsSelf-join ON same class + ABS(s1.score - s2.score) <= 5
Q45Cheapest two-stop flight routesSelf-join flights ON f1.destination = f2.origin
Q46Players who beat same opponent at least twiceSelf-join matches + GROUP BY player + HAVING COUNT >= 2
Q47Managers with 5+ direct reportsGROUP BY manager_id + HAVING COUNT(*) >= 5 (no self-join needed!)

PATTERN 8: CROSS JOIN — COMBINATION GENERATION

What Is It?

Generate every possible combination of rows from two tables. The result has N × M rows (Cartesian product). Used deliberately for: grid creation, tournament schedules, filling coverage gaps.

Recognize It When You See:

  • "All possible combinations of X and Y"
  • "All matchups / every team plays every other team"
  • "Generate all size-color combinations"
  • "Fill in missing date × product combinations"
  • "Find zero-sales days per product"

THE TEMPLATE

sql
-- TEMPLATE A: Simple combination generation
SELECT
    a.col1,
    b.col2
FROM table_a a
CROSS JOIN table_b b;   -- generates every row of a paired with every row of b

-- TEMPLATE B: Self-cross-join for pairs (e.g., tournament matchups)
SELECT
    t1.team_name AS home_team,
    t2.team_name AS away_team
FROM teams t1
CROSS JOIN teams t2
WHERE t1.team_id < t2.team_id;   -- prevents (A vs B) AND (B vs A), prevents self-matches

-- TEMPLATE C: Fill gaps — generate full grid then LEFT JOIN to actuals
WITH all_dates AS (
    SELECT generate_series(
        '2024-01-01'::date,
        '2024-12-31'::date,
        '1 day'::interval
    )::date AS date_val
    -- Hive/Spark: Use recursive CTE or pre-built date table
),
all_products AS (SELECT DISTINCT product_id FROM products)

SELECT
    d.date_val,
    p.product_id,
    COALESCE(s.revenue, 0) AS revenue   -- 0 for days with no sales
FROM all_dates d
CROSS JOIN all_products p
LEFT JOIN daily_sales s
    ON d.date_val = s.sale_date
    AND p.product_id = s.product_id;

SOLVED EXAMPLE — Q51 (McKinsey): All 3-topping pizza combinations with total cost

Problem:
Table: toppings (topping_name, ingredient_cost)
Generate all possible 3-topping combinations.
Return: topping1, topping2, topping3, total_cost (sorted alphabetically).
No topping can be used twice in a combination.
sql
SELECT
    t1.topping_name AS topping_1,
    t2.topping_name AS topping_2,
    t3.topping_name AS topping_3,
    t1.ingredient_cost + t2.ingredient_cost + t3.ingredient_cost AS total_cost
FROM toppings t1
JOIN toppings t2 ON t1.topping_name < t2.topping_name   -- ensures t1 < t2 < t3 alphabetically
JOIN toppings t3 ON t2.topping_name < t3.topping_name   -- avoids all duplicate combos
ORDER BY total_cost, topping_1, topping_2, topping_3;

-- ⚠️ Using < on the name column (lexicographic order) ensures:
--   (Basil, Mushroom, Peppers) appears but NOT (Mushroom, Basil, Peppers)
--   This gives exactly C(n,3) = n!/3!(n-3)! unique combinations

PATTERN 8 QUESTIONS

Q#QuestionKey Insight
Q48All size-color combinationsCROSS JOIN sizes × colors
Q49Full date × product grid for zero-fill reportingCROSS JOIN dates × products + LEFT JOIN sales
Q50Round-robin tournament scheduleCROSS JOIN teams WHERE t1.id < t2.id
Q513-topping pizza combinations with costThree-way JOIN with t1 < t2 < t3
Q52Region × category grid for monthly reportCROSS JOIN regions × categories + LEFT JOIN actuals

PATTERN 9: MARKET BASKET / PRODUCT PAIRS

What Is It?

Find items (products, events, attributes) that frequently appear TOGETHER in the same transaction/order/session. The secret weapon: self-join on the grouping key (order_id).

Recognize It When You See:

  • "Products/items frequently bought together"
  • "Most common pair in same order"
  • "Items purchased in the same transaction"
  • "Co-occurrence count"
  • "Lift / confidence / association rules"

THE TEMPLATE

sql
-- TEMPLATE: Market Basket — find co-occurring item pairs

WITH item_pairs AS (
    SELECT
        o1.product_id AS product_1,
        o2.product_id AS product_2,
        COUNT(DISTINCT o1.order_id) AS co_purchase_count
    FROM order_items o1
    JOIN order_items o2
        ON o1.order_id = o2.order_id        -- same order
        AND o1.product_id < o2.product_id    -- avoid (A,B) AND (B,A), avoid (A,A)
    GROUP BY o1.product_id, o2.product_id
    ORDER BY co_purchase_count DESC
)

SELECT
    p1.product_name AS product_1,
    p2.product_name AS product_2,
    ip.co_purchase_count
FROM item_pairs ip
JOIN products p1 ON ip.product_1 = p1.product_id
JOIN products p2 ON ip.product_2 = p2.product_id
ORDER BY co_purchase_count DESC
LIMIT 5;   -- top 5 pairs

SOLVED EXAMPLE — Q53 (Interview Query Hard): Top 5 product pairs bought together

Problem:
Table: transactions (order_id, user_id, product_id, quantity)
Table: products (product_id, product_name)
Find the 5 product pairs most frequently purchased in the same order.
Return: product1, product2, purchase_count
sql
WITH product_pairs AS (
    SELECT
        t1.product_id AS product_1_id,
        t2.product_id AS product_2_id,
        COUNT(DISTINCT t1.order_id) AS times_bought_together
    FROM transactions t1
    JOIN transactions t2
        ON t1.order_id = t2.order_id        -- same order
        AND t1.product_id < t2.product_id    -- canonical pair, no duplicates
    GROUP BY t1.product_id, t2.product_id
)

SELECT
    p1.product_name AS product_1,
    p2.product_name AS product_2,
    pp.times_bought_together
FROM product_pairs pp
JOIN products p1 ON pp.product_1_id = p1.product_id
JOIN products p2 ON pp.product_2_id = p2.product_id
ORDER BY pp.times_bought_together DESC
LIMIT 5;

SOLVED EXAMPLE — Q55: Product pairs with co-purchase > 100 and lift score

Problem:
Table: orders (order_id, product_id)
Find product pairs where co-purchase count > 100.
Also compute LIFT = pair_frequency / (freq_A × freq_B)
(Lift > 1 means the pair appears MORE than expected by chance)
sql
WITH total_orders AS (
    SELECT COUNT(DISTINCT order_id) AS total FROM orders
),

product_freq AS (
    -- How often does each product appear in ANY order?
    SELECT product_id, COUNT(DISTINCT order_id) AS freq
    FROM orders
    GROUP BY product_id
),

pair_freq AS (
    -- How often do pairs appear in the SAME order?
    SELECT
        o1.product_id AS prod_1,
        o2.product_id AS prod_2,
        COUNT(DISTINCT o1.order_id) AS pair_count
    FROM orders o1
    JOIN orders o2
        ON o1.order_id = o2.order_id
        AND o1.product_id < o2.product_id
    GROUP BY o1.product_id, o2.product_id
    HAVING COUNT(DISTINCT o1.order_id) > 100
)

SELECT
    pf.prod_1,
    pf.prod_2,
    pf.pair_count,
    -- LIFT = (pair_count / total) / ((freq_A / total) × (freq_B / total))
    --      = pair_count × total / (freq_A × freq_B)
    ROUND(
        pf.pair_count * t.total * 1.0 / (f1.freq * f2.freq),
    3) AS lift
FROM pair_freq pf
JOIN product_freq f1 ON pf.prod_1 = f1.product_id
JOIN product_freq f2 ON pf.prod_2 = f2.product_id
CROSS JOIN total_orders t
ORDER BY lift DESC;

-- Lift interpretation:
-- > 1.0: products bought together MORE than expected → strong association
-- = 1.0: products bought together exactly as expected → no association
-- < 1.0: products bought together LESS than expected → negative association

PATTERN 9 QUESTIONS

Q#QuestionKey Insight
Q53Top 5 product pairs most frequently bought togetherSelf-join ON order_id, prod1 < prod2, GROUP BY + LIMIT
Q54Product most commonly bought alongside Product XSame self-join but filter one side = Product X
Q55Pairs where co-purchase > 100 + lift scoreAdd product_freq and total_orders CTEs for lift calculation
Q56Menu items ordered together > 30% of the timeconfidence = pair_count / item_count; filter HAVING conf > 0.3

⚠️ COMMON TRAPS IN JOIN PATTERNS

🧠 Memory Map
SELF-JOIN TRAPS:
TRAP 1: Forgetting id < id2duplicate pairs
Without: SELECT * FROM t a JOIN t b ON a.shared = b.shared
→ Returns (A,B) AND (B,A) and (A,A) → pairs are doubled + self-pairs!
With: AND a.id < b.ideach pair appears exactly ONCE, no self-pairs
TRAP 2: Manager NULL handling
Top-level employees have manager_id = NULL
Inner JOIN on manager_id excludes them (good for "emp earns more than manager")
Use LEFT JOIN if you want to return ALL employees including top-level
CROSS JOIN TRAPS
TRAP 3: Accidental cross join on large tables
FROM table1, table2 ← old syntax for CROSS JOIN (comma = cross join!)
If you meant to write a regular JOIN and forgot the ON clausemassive data explosion
Always: write CROSS JOIN explicitly when you mean it, or INNER/LEFT JOIN with ON clause
MARKET BASKET TRAPS
TRAP 4: COUNT(order_id) vs COUNT(DISTINCT order_id)
If same product appears twice in one order (quantity > 1), order_id appears twice
Use COUNT(DISTINCT order_id) to count ORDERS (not line items)
Use COUNT(*) if you want to count total line item co-occurrences
TRAP 5: Performance on large order tables
Self-join on order_items can explode: 10M rows × 10M rows = 100 trillion pairs!
Always add PARTITION to narrow the self-join (e.g., same date or same user)
Or: pre-filter to only frequent products before the self-join