Hadoop · Section 4 of 8

Day 2: Hive + Hadoop Ecosystem — Deep Interview Guide
🧠 MASTER MEMORY MAP — Day 2

🧠 HIVE ARCHITECTURE = "DMCTE"
D → Driver (receives SQL queries, manages lifecycle)
M → Metastore (schema + HDFS location mapping — the catalog)
C → Compiler (parses SQL → query plan)
T → Tez/MR Engine (executes the plan)
E → Execution (launches tasks on YARN)
HIVE OPTIMIZATION = "VECTOR-TOP":
V → Vectorization (process 1024 rows at once, not 1)
E → Engine: use Tez, not MapReduce (10x faster DAG vs chain)
C → Column pruning (SELECT only needed columns)
T → Table stats: ANALYZE TABLE for CBO
O → ORC format (columnar + predicate pushdown)
R → REPARTITION/REDISTRIBUTE (avoid data skew)
T → Tez DAG (avoid MapReduce chains)
O → Order of joins (small table FIRST for map join)
P → Partition pruning (WHERE on partition columns only)
HIVE FILE FORMATS = "OPTA"
O → ORC (best for Hive: columnar, ACID, compression)
P → Parquet (best for cross-tool: Spark, Impala, Hive)
T → Text (simple, no compression, never use in production)
A → Avro (row-based, best for Kafka/Sqoop schema evolution)
PARTITIONING RULES = "QDLF"
Q → Query filter columns → partition by those columns
D → Date/region usually best partition keys
L → Low cardinality (10s to 100s of values, not millions!)
F → Few files per partition (avoid small files inside partitions)

SECTION 1: HIVE ARCHITECTURE

What is Hive?

Hive is SQL-on-Hadoop. You write HiveQL (SQL dialect), Hive translates it to MapReduce or Tez jobs that run on YARN. Data lives in HDFS, schema lives in Metastore. Analogy: Hive is the "database frontend" for HDFS. HDFS stores the data like a file system, but you want to query it like a database. Hive bridges that gap.

🧠 Memory Map
HOW HIVE WORKS (end-to-end)
You type: SELECT country, COUNT(*) FROM bookings GROUP BY country;
Driver: Receives your query, manages session
Compiler: Parses SQL → AST (Abstract Syntax Tree)
Metastore: Compiler asks Metastore: "where is 'bookings' table in HDFS?"
→ Returns: /user/hive/warehouse/bookings, format=ORC, partitions=[2023,2024]
Optimizer: Rewrite query to most efficient plan (Cost-Based Optimizer if stats exist)
→ Decides: map join or shuffle join? apply partition pruning?
Execution: Submit MapReduce/Tez job to YARN
Output: Results written to HDFS or returned to client

Hive Metastore — The Most Important Component

🧠 METASTORE = "The Schema Registry for Hive"
What it stores:
Table name → HDFS path mapping (/user/hive/warehouse/bookings)
Table schema (columns, data types)
Partition information (which partition directories exist)
File format (ORC, Parquet, Text)
Statistics (number of rows, column stats — used by Cost-Based Optimizer)
Where it lives:
Backend: MySQL/PostgreSQL database (NOT HDFS!)
The MySQL DB stores all the schema metadata as rows
Metastore modes:
1. Embedded: Metastore + Hive in same JVM (dev/testing only)
2. Local: Separate Metastore process, same machine (not used much)
3. Remote: Metastore runs as separate service (PRODUCTION standard)
Multiple HiveServer2 instances connect to ONE Metastore
→ Allows BI tools (Tableau, PowerBI) to connect simultaneously
⚠️CRITICAL: If Metastore MySQL goes down → ALL Hive queries fail!
Because Hive cannot find WHERE tables are in HDFS without Metastore.
Solution: Metastore HA with MySQL replication or AWS RDS Multi-AZ
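To make the Metastore's role concrete, here is a toy sketch (plain Python, not real Hive code) of the catalog lookup the compiler performs — table names, paths, and the error values are illustrative:

```python
# Toy sketch: the Metastore as a catalog mapping table names to
# HDFS locations, schemas, formats, and registered partitions.
metastore = {
    "bookings": {
        "location": "/user/hive/warehouse/bookings",
        "format": "ORC",
        "columns": {"booking_id": "STRING", "amount": "DOUBLE"},
        "partitions": ["booking_date=2024-01-15", "booking_date=2024-01-16"],
    }
}

def resolve(table: str) -> str:
    """What the compiler asks the Metastore: where does this table live in HDFS?"""
    if table not in metastore:
        # This is why a Metastore outage fails ALL queries:
        # Hive has no other way to locate the data in HDFS.
        raise RuntimeError(f"Metastore lookup failed for table '{table}'")
    return metastore[table]["location"]

print(resolve("bookings"))  # /user/hive/warehouse/bookings
```

The sketch also shows the failure mode from the warning above: with the catalog unavailable, every query dies at planning time, before a single byte of HDFS data is read.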

SECTION 2: HIVE INTERNAL vs EXTERNAL TABLES

The Most Commonly Asked Hive Question

🧠 Memory Map
INTERNAL TABLE (Managed Table)
CREATE TABLE bookings (
    booking_id STRING,
    passenger  STRING,
    flight     STRING,
    amount     DOUBLE
)
STORED AS ORC;
-- Data goes to: /user/hive/warehouse/bookings/
-- Hive OWNS this data
WHAT HAPPENS ON DROP TABLE
→ DROP TABLE bookings;
→ Hive deletes BOTH metadata (from Metastore) AND data (from HDFS)!
⚠️Data is GONE. Cannot recover unless Trash is enabled.
EXTERNAL TABLE
CREATE EXTERNAL TABLE bookings_external (
    booking_id STRING,
    passenger  STRING
)
STORED AS ORC
LOCATION '/data/raw/bookings/';  -- you specify the HDFS path (LOCATION comes after STORED AS)
-- Hive points TO this data but does NOT own it
WHAT HAPPENS ON DROP TABLE
→ DROP TABLE bookings_external;
→ Hive deletes ONLY metadata from Metastore
→ Data in /data/raw/bookings/ is UNTOUCHED ← this is the key difference!

Decision Guide — Internal vs External

USE INTERNAL WHEN
✓ Hive is the only tool accessing this data
✓ You want Hive to manage lifecycle (cleanup = just DROP TABLE)
✓ Gold/final tables that won't be shared
USE EXTERNAL WHEN
✓ Data is shared with Spark, Presto, Pig, other tools
✓ Data is ingested by another process (Sqoop, Flume, Kafka)
✓ You CANNOT afford accidental data deletion on DROP TABLE
✓ Data lives outside /user/hive/warehouse/
✓ Production Bronze/Silver tables (raw landing, shared data)
AMADEUS EXAMPLE
Raw flight data ingested by Sqoop every hour → EXTERNAL TABLE
(Sqoop writes to HDFS path, Hive just reads it, never owns it)
Aggregated booking report table → INTERNAL TABLE
(Hive fully manages this output)

SECTION 3: HIVE PARTITIONING

Static vs Dynamic Partitioning

WHAT IS A HIVE PARTITION?
A partition = a subdirectory in HDFS under the table's location
Example: /user/hive/warehouse/bookings/booking_date=2024-01-15/
Hive reads ONLY the subdirectory for the date you filter on
Result: instead of scanning 365 days of data → scan 1 day = 365x faster!
TABLE DEFINITION WITH PARTITION
sql
-- Partitioned table: booking_date is the partition column
-- booking_date does NOT appear in the regular column list!
CREATE TABLE bookings (
    booking_id  STRING,
    passenger   STRING,
    flight_code STRING,
    amount      DOUBLE
)
PARTITIONED BY (booking_date STRING)  -- partition key goes here
STORED AS ORC;

-- This creates directory structure:
-- /user/hive/warehouse/bookings/booking_date=2024-01-15/part-00000.orc
-- /user/hive/warehouse/bookings/booking_date=2024-01-16/part-00000.orc
-- ...
sql
-- STATIC PARTITIONING: you manually specify the partition value
INSERT INTO bookings PARTITION (booking_date='2024-01-15')
SELECT booking_id, passenger, flight_code, amount
FROM bookings_raw
WHERE booking_date = '2024-01-15';
-- Loads ONLY one partition at a time
-- SAFE: you control exactly which partition is written
-- Use for: loading historical data one date at a time
sql
-- DYNAMIC PARTITIONING: Hive reads the partition value from the data itself
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- nonstrict = allow ALL partitions to be dynamic (at least 1 static needed for strict mode)

INSERT INTO bookings PARTITION (booking_date)
SELECT booking_id, passenger, flight_code, amount, booking_date  -- booking_date last!
FROM bookings_raw;
-- Hive reads booking_date column from each row, creates partition automatically
-- ⚠️ Can create HUNDREDS of partitions in one query → many small files problem!
-- ⚠️ If booking_date has 1000 unique dates → 1000 partitions × reducers = MANY files
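The small-files warning above is worth quantifying. A back-of-the-envelope sketch (the partition, reducer, and volume numbers are illustrative assumptions, not Hive defaults):

```python
# Why dynamic partitioning explodes file counts: in the worst case, each
# reducer that sees at least one row for a partition writes its own file
# into that partition's directory.
unique_dates = 1000      # distinct booking_date values in the source (assumed)
reducers = 200           # reducers in the INSERT job (assumed)

worst_case_files = unique_dates * reducers
print(worst_case_files)  # 200000 small files from ONE insert

total_gb = 100           # assumed total data volume
avg_file_mb = total_gb * 1024 / worst_case_files
print(f"{avg_file_mb:.1f} MB average file")  # ~0.5 MB — far below a 128 MB HDFS block
```

This is exactly the situation the merge settings in the next section exist to fix: thousands of sub-megabyte files, each costing a NameNode metadata entry and a task-startup overhead at read time.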

Partition Optimization Settings

sql
-- Limit max dynamic partitions (safety net)
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=256;
-- If query would create more than 1000 partitions → FAIL (prevents cluster overload)

-- Merge small output files after insert
SET hive.merge.mapredfiles=true;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;  -- 128 MB target
SET hive.merge.size.per.task=256000000;       -- 256 MB max
-- After INSERT, Hive runs a merge job to combine small output files into larger ones
-- This prevents small files inside partitions!

CRITICAL: MSCK REPAIR TABLE

sql
-- THE SCENARIO: Files added directly to HDFS (bypassing Hive)
-- Example: Sqoop wrote data to /user/hive/warehouse/bookings/booking_date=2024-02-01/
-- But Hive Metastore doesn't know this partition exists!

-- SYMPTOM:
SELECT * FROM bookings WHERE booking_date = '2024-02-01';
-- Returns: 0 rows (Metastore says this partition doesn't exist!)

-- FIX:
MSCK REPAIR TABLE bookings;
-- "MSCK" = MetaStore Check
-- What it does: scans HDFS for partition directories, adds missing partitions to Metastore
-- After this: Hive knows about the new partition → query works!

-- FASTER ALTERNATIVE (add one specific partition):
ALTER TABLE bookings ADD PARTITION (booking_date='2024-02-01')
LOCATION '/user/hive/warehouse/bookings/booking_date=2024-02-01/';

-- ⚠️ MSCK REPAIR can be SLOW on tables with THOUSANDS of partitions
-- (scans all of HDFS under the table path)
-- For large tables: use ALTER TABLE ADD PARTITION instead!

SECTION 4: HIVE BUCKETING

WHAT IS BUCKETING?
Divide table data into N fixed buckets (fixed number of files)
based on HASH of a column's value
Each bucket = one file in the partition directory
BUCKETING vs PARTITIONING:
PARTITIONING:
  Subdirectory per value
  Good for: date/region (LOW cardinality)
  Pruning: YES → skip whole partitions
BUCKETING:
  Fixed N files per table/partition
  Good for: user_id, booking_id (HIGH cardinality)
  Pruning: NO (usually reads all buckets)
  Benefit: efficient sampling + JOIN optimization
sql
-- Create bucketed table: divide by booking_id into 32 buckets
CREATE TABLE bookings_bucketed (
    booking_id  STRING,
    passenger   STRING,
    amount      DOUBLE
)
CLUSTERED BY (booking_id) INTO 32 BUCKETS
STORED AS ORC;

-- Insert data — Hive automatically routes rows to correct bucket via hash(booking_id) % 32
INSERT INTO bookings_bucketed SELECT * FROM bookings_raw;

-- Bucket JOIN optimization (two tables bucketed by SAME column in SAME number of buckets)
-- Bucket 1 from table A joins with Bucket 1 from table B → NO SHUFFLE!
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Efficient SAMPLING (read a single bucket):
SELECT * FROM bookings_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON booking_id);
-- Reads exactly 1 of the 32 bucket files = ~3% sample (consistent, not random)
-- Use a denominator that divides the bucket count so Hive can prune to whole buckets
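The routing rule behind bucketing can be sketched in a few lines of Python (illustrative only — real Hive uses its own hash function on the column value, not CRC32):

```python
import zlib

NUM_BUCKETS = 32

def bucket_of(key: str, n: int = NUM_BUCKETS) -> int:
    # Stand-in for Hive's hash: any deterministic hash works for the demo.
    return zlib.crc32(key.encode()) % n

# Same key always lands in the same bucket — in BOTH tables, which is
# what makes the shuffle-free bucket map join possible:
assert bucket_of("BK12345") == bucket_of("BK12345")

# Route rows to bucket "files" the way a bucketed INSERT would:
buckets = {}
for booking_id in ["BK00001", "BK00002", "BK12345"]:
    buckets.setdefault(bucket_of(booking_id), []).append(booking_id)
```

Because the assignment is purely `hash(key) % N`, two tables clustered by the same column into the same N buckets have perfectly aligned files: bucket i of table A can only ever join bucket i of table B.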

SECTION 5: HIVE FILE FORMATS — DEEP COMPARISON

🧠 Memory Map
ORC (Optimized Row Columnar) — Best for HIVE:
Structure: Stripe → Row group → Column data
Each stripe = 250 MB (configurable)
Each stripe has: min/max stats for every column
→ Predicate pushdown: "WHERE amount > 1000" → skip stripes where max ≤ 1000!
Features:
✓ Built-in compression (ZLIB, Snappy)
✓ ACID support (INSERT/UPDATE/DELETE in Hive)
✓ Column-level min/max for predicate pushdown
✓ Native to Hive — best optimization
✓ Bloom filters per column (point lookup optimization)
✗ Less inter-operable with non-Hive tools
PARQUET — Best for CROSS-TOOL use (Hive + Spark + Impala):
Structure: Row group → Column chunk → Data pages
Default row group: 128 MB
Column encoding: dictionary, RLE, delta
Features:
✓ Supported by Hive, Spark, Impala, Presto, Drill, Pandas
✓ Nested schemas (arrays, maps inside columns) — great for JSON-like data
✓ Good compression (Snappy or GZIP)
✗ ACID support in Hive is weaker than ORC
✗ Slightly less optimized in pure Hive workloads vs ORC
TEXT/CSV — Never use in production:
✗ No compression
✗ Full row scan always (no column pruning)
✗ No predicate pushdown
✗ Slow to read/write
✓ Human-readable (good for quick debugging only)
AVRO — Best for SCHEMA EVOLUTION + KAFKA/SQOOP:
Row-based (not columnar) → not great for analytics
✓ Schema evolution: add/remove fields without breaking downstream
✓ Works perfectly with Kafka (Confluent Schema Registry uses Avro)
✓ Sqoop uses Avro by default for imports
✗ Not columnar → full row scan
Use for: staging/landing tables, Kafka consumers, Sqoop imports
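The stripe-level min/max skipping that makes ORC fast can be sketched as a toy reader (the stripe stats and row counts below are invented for the demo):

```python
# Toy sketch of ORC-style predicate pushdown: each "stripe" carries min/max
# stats per column, so a reader can skip stripes that cannot match the filter
# without decompressing a single row of them.
stripes = [
    {"min_amount": 10,   "max_amount": 900,  "rows": 100_000},
    {"min_amount": 50,   "max_amount": 4000, "rows": 100_000},
    {"min_amount": 1200, "max_amount": 9000, "rows": 100_000},
]

# Filter: WHERE amount > 1000 — skip any stripe whose max is <= 1000
to_read = [s for s in stripes if s["max_amount"] > 1000]
print(len(to_read), "of", len(stripes), "stripes read")  # 2 of 3
```

Note the first stripe is eliminated from its footer stats alone; with a Text table the same query must scan all three stripes' worth of rows.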

File Format Decision Guide

sql
-- Create ORC table with compression:
CREATE TABLE bookings_orc (
    booking_id STRING, passenger STRING, amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
-- SNAPPY = fast (good for hot frequently-read data)
-- ZLIB = more compressed (good for cold archive data)

-- Create Parquet table:
CREATE TABLE bookings_parquet (
    booking_id STRING, passenger STRING, amount DOUBLE
)
STORED AS PARQUET;

-- Convert existing Text table to ORC (HUGE performance win):
CREATE TABLE bookings_orc STORED AS ORC AS
SELECT * FROM bookings_text;
-- This alone can make queries 10-50x faster!

SECTION 6: HIVE QUERY OPTIMIZATION — ALL SETTINGS

⚡ THIS IS WHAT A SENIOR ENGINEER MUST KNOW DEEPLY

Optimization 1: Execution Engine — Use Tez, NOT MapReduce

sql
-- DEFAULT in older clusters: MapReduce (slow)
-- SET this at the START of every Hive session:
SET hive.execution.engine=tez;  -- Or 'mr' for MapReduce (don't use)

-- WHY TEZ IS FASTER:
-- MapReduce: Chain of MR jobs → each writes to HDFS between stages
-- SELECT country, COUNT(*), SUM(amount) GROUP BY country ORDER BY country
-- MapReduce: Job1 (GROUP BY) → write HDFS → Job2 (ORDER BY) → write HDFS
-- Tez:       One DAG: GROUP BY → ORDER BY (in memory, no intermediate HDFS write!)
-- Result: 5-10x faster for multi-stage queries

Optimization 2: Vectorization (Process Rows in Batches)

sql
-- Enable vectorized query execution
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- WHAT IT DOES:
-- Default: process 1 row at a time (row-by-row)
-- Vectorized: process 1024 rows at once (batch)
-- Uses CPU SIMD instructions (process multiple values in parallel on the CPU)
-- Result: 2-5x faster for column-heavy aggregations (SUM, AVG, COUNT)
-- Works BEST with ORC format (which stores data column-by-column)
-- ⚠️ Only works with ORC (not Text or CSV tables)
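A minimal sketch of the batching idea, in plain Python (real Hive vectorization happens in the JVM on column vectors with SIMD; the batch size is the only real constant borrowed here):

```python
# Illustrative only: process values in batches of 1024 instead of one row
# per operator call. The answer is identical; the per-row overhead is not.
BATCH_SIZE = 1024
amounts = list(range(5000))  # pretend column of 5000 values

def scalar_sum(values):
    total = 0.0
    for v in values:          # one operator invocation per ROW
        total += v
    return total

def batched_sum(values):
    total = 0.0
    for i in range(0, len(values), BATCH_SIZE):
        total += sum(values[i:i + BATCH_SIZE])  # one invocation per BATCH
    return total

assert scalar_sum(amounts) == batched_sum(amounts)  # same result, fewer calls
```

The speedup in real Hive comes from amortizing per-row interpretation cost over 1024 values and letting the CPU apply the same instruction to a whole column slice.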

Optimization 3: Cost-Based Optimizer (CBO)

sql
-- Enable Cost-Based Optimizer
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- CBO needs TABLE STATISTICS to work:
ANALYZE TABLE bookings COMPUTE STATISTICS;           -- row count, file sizes
ANALYZE TABLE bookings COMPUTE STATISTICS FOR COLUMNS booking_id, country, amount;
-- Column stats: min, max, distinct count, nulls
-- CBO uses these stats to choose optimal join order and join type

-- WITHOUT stats: Hive guesses → wrong join order → shuffle join for a small table
-- WITH stats:    CBO says "passengers table has 10k rows → use MAP JOIN, no shuffle"

Optimization 4: Map Join (Small Table Broadcast)

sql
-- MAP JOIN = load small table into EVERY mapper's RAM, join locally (NO SHUFFLE!)
-- Auto map join (CBO decides based on table size):
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- 25 MB threshold
-- If the smaller table < 25 MB → Hive automatically uses map join
-- ⚠️ Increase this if you have slightly larger but still "small" dimension tables

-- FORCE map join for a specific query with hint:
SELECT /*+ MAPJOIN(airports) */
    b.booking_id,
    a.airport_name
FROM bookings b
JOIN airports a ON b.departure_airport = a.code;
-- airports table is small (100 airports) → broadcast to all mappers → NO shuffle!

-- BUCKET MAP JOIN (both tables bucketed by join key):
SET hive.optimize.bucketmapjoin=true;
-- Even faster: only load matching bucket from small table into mapper
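What the map join actually does on each mapper can be sketched as a broadcast hash join (toy data, illustrative airport codes):

```python
# Sketch of a map join: build a hash table from the SMALL table and stream
# the big table through it locally — no shuffle, no reduce phase.
airports = {           # small dimension table, fits in every mapper's RAM
    "CDG": "Paris Charles de Gaulle",
    "MAD": "Madrid Barajas",
}
bookings = [           # big fact "table", streamed row by row by each mapper
    {"booking_id": "BK1", "departure_airport": "CDG"},
    {"booking_id": "BK2", "departure_airport": "MAD"},
]

joined = [
    {"booking_id": b["booking_id"],
     "airport_name": airports[b["departure_airport"]]}   # O(1) hash lookup
    for b in bookings
    if b["departure_airport"] in airports                # inner-join semantics
]
print(joined[0]["airport_name"])  # Paris Charles de Gaulle
```

The whole cost of a shuffle join — partitioning both sides by key and moving them over the network — disappears because the small side is replicated instead of partitioned.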

Optimization 5: Partition Pruning

sql
-- PARTITION PRUNING: Hive SKIPS partitions not matching WHERE clause
-- ⚠️ ONLY WORKS if you filter on the PARTITION COLUMN!

-- ✓ GOOD — partition pruning works (only scans booking_date=2024-01-15):
SELECT COUNT(*) FROM bookings WHERE booking_date = '2024-01-15';

-- ✗ BAD — NO partition pruning (scans ALL partitions!):
SELECT COUNT(*) FROM bookings WHERE YEAR(booking_date) = 2024;
-- Hive applies function YEAR() to each row → can't prune without scanning all!
-- FIX: WHERE booking_date BETWEEN '2024-01-01' AND '2024-12-31'

-- ✗ ANOTHER TRAP — dynamic partition filter from subquery:
-- This may or may not prune depending on Hive version and optimizer
SELECT * FROM bookings b
JOIN date_dim d ON b.booking_date = d.date
WHERE d.year = 2024;
-- May not prune! Use: WHERE b.booking_date >= '2024-01-01'

Optimization 6: GROUP BY Skew Handling

sql
-- Data skew in GROUP BY (one key dominates):
-- Example: 50% of bookings are from 'US' → one reducer handles all US data!

SET hive.groupby.skewindata=true;
-- What Hive does: 2-phase aggregation:
-- Phase 1: randomly distribute keys → partial aggregation in parallel
-- Phase 2: final aggregation by real key
-- Result: no single reducer bottleneck for skewed keys

-- Also for skewed JOIN:
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;  -- Keys with > 100k rows are treated as "skew keys"
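The 2-phase trick behind `hive.groupby.skewindata` can be sketched with an explicit salt (toy row counts; Hive's actual random distribution is internal to the shuffle):

```python
import random
from collections import Counter

# Phase 1 spreads the hot key over several "reducers" via a random salt,
# Phase 2 merges the partial counts back under the real key.
random.seed(42)
rows = ["US"] * 5000 + ["FR"] * 50 + ["DE"] * 50  # 'US' is the hot key

SALTS = 4                                # pretend 4 reducers share the hot key
phase1 = Counter()
for country in rows:
    salted_key = (country, random.randrange(SALTS))  # random distribute
    phase1[salted_key] += 1                          # partial aggregation

phase2 = Counter()
for (country, _salt), partial in phase1.items():
    phase2[country] += partial                       # final aggregation by real key

assert phase2["US"] == 5000  # same answer — but no single reducer saw all of US
```

Phase 1 turns one reducer processing 5000 'US' rows into four reducers processing ~1250 each; phase 2 only merges 4 partial counts, which is cheap.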

Optimization 7: LLAP (Live Long And Process) — Hive Sub-Second Queries

Problem: Every Hive query spins up new containers → JVM startup = 5-30 seconds overhead
LLAP Solution:
Persistent JVM daemons on each node (always running, no startup!)
In-memory columnar cache (keep hot data in RAM across queries)
Sub-second queries for dashboard/BI workloads (like Impala but inside Hive)
SET hive.llap.execution.mode=auto; -- Use LLAP when available, fallback to Tez
⚡ INTERVIEW TIP:
"LLAP is Hive's answer to Impala. It adds persistent daemons with in-memory cache.
Without LLAP, every query has cold-start overhead. With LLAP, same data stays warm
in memory — especially valuable for BI dashboards hitting the same tables repeatedly."

Optimization 8: Sorting and Ordering

sql
-- ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY:

ORDER BY column     -- Global sort (all data goes to ONE reducer) → SLOW for large data
SORT BY column      -- Local sort (each reducer sorts its own output) → FAST but not globally sorted
DISTRIBUTE BY col   -- Control which rows go to which reducer (like partitioner)
CLUSTER BY col      -- = DISTRIBUTE BY col SORT BY col (same column for both)

-- ⚡ INTERVIEW TRAP: "What's the difference between ORDER BY and SORT BY?"
-- ORDER BY: guaranteed global order, uses 1 reducer (slow for billions of rows!)
-- SORT BY: each reducer output is sorted, but globally NOT sorted (multiple reducers)
-- Use SORT BY when you just need local sorted chunks (e.g., before writing partitioned output)
-- Use ORDER BY only for final result sets that need to be globally ordered

SECTION 7: WHAT-IF SCENARIOS — HIVE

SCENARIO 1: Data deleted manually from HDFS (most common interview question!)

sql
-- WHAT HAPPENED:
-- Someone ran: hdfs dfs -rm -r /user/hive/warehouse/bookings/booking_date=2024-01-15/
-- Now this query returns 0 rows:
SELECT COUNT(*) FROM bookings WHERE booking_date = '2024-01-15';
-- Even though Metastore thinks the partition exists!

-- DIAGNOSIS:
-- 1. Check if partition exists in Metastore:
SHOW PARTITIONS bookings;
-- Shows: booking_date=2024-01-15 (it's there in Metastore!)
-- 2. Check if files exist in HDFS:
-- hdfs dfs -ls /user/hive/warehouse/bookings/booking_date=2024-01-15/
-- Output: No such file or directory ← CONFIRMED — data deleted!

-- FIX OPTIONS:
-- Option 1: Restore data from backup and MSCK REPAIR TABLE bookings
-- Option 2: If no backup — DROP the stale partition from Metastore:
ALTER TABLE bookings DROP PARTITION (booking_date='2024-01-15');
-- Now Hive is consistent again (it won't show non-existent partition)
-- Then: reload data into that partition from source

-- PREVENTION:
-- 1. NEVER run hdfs dfs -rm directly on Hive-managed table locations
-- 2. Use HDFS Trash (hdfs dfs -rm moves to .Trash, not permanent delete)
-- 3. Enable HDFS Snapshots for the warehouse directory

SCENARIO 2: New data added to HDFS but Hive doesn't see it

sql
-- WHAT HAPPENED: Sqoop/Spark wrote new partition directory to HDFS
-- but forgot to register it in Hive Metastore

-- DIAGNOSIS:
SELECT * FROM bookings WHERE booking_date = '2024-03-01';
-- Returns 0 rows... but file exists in HDFS!

-- FIX — Option 1: MSCK REPAIR (for EXTERNAL tables, scans all partitions):
MSCK REPAIR TABLE bookings;

-- FIX — Option 2: ADD PARTITION (fast, for specific known partition):
ALTER TABLE bookings ADD IF NOT EXISTS PARTITION (booking_date='2024-03-01')
LOCATION '/user/hive/warehouse/bookings/booking_date=2024-03-01/';

-- ⚠️ MSCK REPAIR on a table with 10,000 partitions → scans 10,000 HDFS paths!
-- Use ALTER TABLE ADD PARTITION when you know exactly which partition to add

SCENARIO 3: Hive query is very slow — debug and fix

sql
-- STEP 1: Enable EXPLAIN to see query plan:
EXPLAIN SELECT country, COUNT(*) FROM bookings GROUP BY country;
-- Look for: Map Join vs Shuffle Join, partition pruning status, # of Map/Reduce stages

-- STEP 2: Check query stats after run:
SET hive.stats.autogather=true;
-- After INSERT: Hive auto-collects stats (rows, bytes)

-- STEP 3: Check if vectorization is working:
EXPLAIN VECTORIZATION SELECT country, COUNT(*) FROM bookings_orc GROUP BY country;
-- Should show "Vectorized execution: true" for ORC tables

-- STEP 4: Check for data skew:
-- If one reducer takes much longer than others → skew problem
SET hive.groupby.skewindata=true;

-- STEP 5: Confirm Tez engine:
SET hive.execution.engine;
-- Should show: tez (if mr → change to tez immediately!)

SECTION 8: HBASE

What is HBase?

HBASE = NoSQL column-family database on top of HDFS
WHY HBASE? HDFS is write-once (append-only). HBase gives you random read/write!
Use when:
✓ Need random row lookups by key (e.g., GET booking_id = 'B12345')
✓ Need sub-10 ms response time for individual record access
✓ Millions of concurrent point queries
✓ Wide-column schema (each row can have different columns)
HBASE vs HIVE:
HBase: real-time point lookups (GET/PUT/DELETE by row key)
Hive: batch analytics (SELECT COUNT(*) GROUP BY)
HBase: single-digit milliseconds per lookup, Hive: seconds to minutes per query
HBase: NoSQL, Hive: SQL
Both: run on top of HDFS

HBase Architecture

🧠 Memory Map
HBASE ARCHITECTURE
HMaster (like NameNode):
Manages region assignment (which RegionServer handles which range of rows)
Handles RegionServer failures, load balancing
ZooKeeper: elects HMaster (Active/Standby HA)
RegionServer (like DataNode):
Each RegionServer handles multiple Regions
Each Region = a range of row keys (e.g., rows A000-A999)
Contains MemStore (RAM) + HFiles (HDFS)
WRITE PATH (HBase)
1. Client writes to WAL (Write-Ahead Log = HLog on HDFS) → durability
2. Data written to MemStore (in RAM) → fast
3. When MemStore full (128 MB default) → flush to HFile on HDFS
4. Background: compact small HFiles into larger ones (compaction)
READ PATH
1. Client → ZooKeeper: "Where is the META table?" (META = region location index)
2. ZooKeeper → "META is on RegionServer 3"
3. Client → META: "Where is row key BK12345?"
4. META → "BK12345 is in a Region on RegionServer 7"
5. Client → RegionServer 7: GET row BK12345
6. RegionServer: check MemStore (RAM first) → if not found → BlockCache → HFile
7. Return row to client
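The write path above (WAL for durability, MemStore for speed, flush to HFiles) can be sketched as a toy state machine — the flush threshold is shrunk to 3 rows for the demo (the real default is 128 MB of MemStore data):

```python
# Toy sketch of the HBase write path. Not real HBase code; sizes are shrunk.
FLUSH_THRESHOLD = 3          # pretend the MemStore holds only 3 rows

wal, memstore, hfiles = [], {}, []

def put(row_key, value):
    wal.append((row_key, value))        # 1. append to WAL on HDFS → durable
    memstore[row_key] = value           # 2. write to in-RAM MemStore → fast
    if len(memstore) >= FLUSH_THRESHOLD:
        hfiles.append(dict(memstore))   # 3. MemStore full → flush to a new HFile
        memstore.clear()                # (compaction would later merge HFiles)

for i in range(7):
    put(f"BK{i:04d}", {"amount": 100 + i})

print(f"{len(hfiles)} HFiles, {len(memstore)} row(s) in MemStore")  # 2 HFiles, 1 row(s) in MemStore
```

Note the ordering: the WAL write happens before the MemStore write, so a RegionServer crash between steps loses nothing — the WAL is replayed on recovery.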

HBase Row Key Design — CRITICAL INTERVIEW TOPIC

🧠 ROW KEY = Primary key + sorting key + access key in HBase
All HBase reads are by row key (or row key range scan)
❌ BAD ROW KEY: timestamp prefix
Row key: "2024-01-15_BK12345"
Problem: ALL writes go to ONE region (latest timestamp → same RegionServer = HOTSPOT!)
All 1000 writes/second → RegionServer 7 handles everything → overload!
✓ GOOD ROW KEY: reversed timestamp or salted key
Reversed timestamp: Long.MAX_VALUE - timestamp = rows stored newest-first
Salted: hash(booking_id) % 10 as prefix → distribute writes across 10 regions
ROW KEY DESIGN RULES
1. AVOID monotonically increasing keys (timestamps, auto-increment IDs)
→ Creates write hotspot on last region
2. DISTRIBUTE writes across regions (salt prefix, hash prefix, reverse timestamp)
3. Row key = most frequent access pattern (you can only get rows by key!)
4. Keep row keys SHORT (they're stored with every row, saved bytes matter)
5. Row key is LEXICOGRAPHICALLY sorted (alphabetical, not numeric!)
→ "9" > "10" in HBase (string sort, not number sort!)
→ Store numbers with zero-padding: "0009", "0010"
AMADEUS EXAMPLE
Booking lookup service:
✓ Row key: booking_id (used for direct lookups)
✗ NOT: booking_date + booking_id (hotspot on recent dates)
Use: salt_prefix(booking_id) + booking_id → distribute across regions
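Two of the rules above — lexicographic sorting and salting — are easy to demonstrate directly (the CRC32 salt function is an illustrative stand-in, not what HBase or any library mandates):

```python
import zlib

# 1. Row keys sort LEXICOGRAPHICALLY: "9" comes after "10" as strings
assert sorted(["9", "10"]) == ["10", "9"]            # string sort, not numeric!
assert sorted(["0009", "0010"]) == ["0009", "0010"]  # zero-padding restores numeric order

# 2. Salting: prefix with hash(key) % N to spread writes over N regions
def salted_key(booking_id: str, regions: int = 10) -> str:
    salt = zlib.crc32(booking_id.encode()) % regions  # deterministic → re-derivable on read
    return f"{salt}_{booking_id}"

# Sequential IDs no longer cluster on the "last" region:
keys = [salted_key(f"BK{i}") for i in range(100)]
prefixes = {k.split("_")[0] for k in keys}
assert len(prefixes) > 1   # writes land on multiple regions, no hotspot
```

The salt must be deterministic (a hash of the key, never a random number) so that a reader can recompute the full row key from the booking_id alone.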

SECTION 9: SQOOP — RDBMS to Hadoop

SQOOP = SQL-to-Hadoop
Purpose: Import data FROM relational databases (Oracle, MySQL, PostgreSQL) INTO HDFS/Hive
Export data FROM HDFS TO relational databases
Use case at Amadeus: Import flight inventory from Oracle → HDFS for Hive analysis
bash
# BASIC IMPORT: Oracle table → HDFS
sqoop import \
  --connect jdbc:oracle:thin:@//db-host:1521/AMADEUS \
  --username etl_user \
  --password 'secret' \
  --table FLIGHT_INVENTORY \
  --target-dir /data/raw/flight_inventory/ \
  --num-mappers 4             # 4 parallel import tasks

# IMPORT WITH HIVE INTEGRATION: directly to Hive table
# (--hive-import + --create-hive-table create the Hive table automatically;
#  comments can't follow a line-continuation backslash, so they live up here)
sqoop import \
  --connect jdbc:mysql://db-host/bookings \
  --username user --password pass \
  --table bookings \
  --hive-import \
  --hive-table booking_db.bookings_raw \
  --create-hive-table \
  --num-mappers 8

# INCREMENTAL IMPORT (most important for production!):
# Only import NEW rows since last run (much faster than full import daily)

# Mode 1: append — for insert-only tables (new rows have higher IDs)
# --check-column: column to check for new rows
# --last-value:   only import rows with booking_id > 1000000
sqoop import \
  --connect jdbc:oracle:thin:@//db-host:1521/AMADEUS \
  --table BOOKINGS \
  --incremental append \
  --check-column booking_id \
  --last-value 1000000 \
  --target-dir /data/raw/bookings_incremental/

# Mode 2: lastmodified — for tables with an update timestamp
# --check-column: the timestamp column to watch
sqoop import \
  --connect jdbc:oracle:thin:@//db-host:1521/AMADEUS \
  --table BOOKINGS \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2024-01-15 00:00:00" \
  --target-dir /data/raw/bookings_incremental/

# --split-by: control how data is split across mappers (affects parallelism)
# (here each mapper handles a range of booking_date values)
sqoop import \
  --table BOOKINGS \
  --split-by booking_date \
  --num-mappers 8
# ⚠️ Default: split by primary key (must be numeric!)
# ⚠️ If no primary key and no --split-by → must use --num-mappers 1 (slow!)

Sqoop Interview Key Points

🧠 Memory Map
1. Sqoop uses MapReduce underneath (each mapper connects to DB, reads a range)
2. --num-mappers = number of DB connections simultaneously!
⚠️Too many mappers → DB connection pool exhausted → DB overload
Production recommendation: 4-8 mappers max (check with DBA)
3. Incremental import saves Sqoop job config in "metastore":
sqoop job --create daily_booking_import -- import ...
sqoop job --exec daily_booking_import
→ Saves last-value, auto-updates each run
4. Sqoop EXPORT (HDFS → Oracle):
sqoop export --connect ... --table TARGET_TABLE --export-dir /data/output/
⚠️EXPORT is NOT atomic — if it fails halfway, partial data is in DB!
Solution: use staging table + DB transaction to swap atomically

SECTION 10: OOZIE — WORKFLOW SCHEDULER

🧠 OOZIE = Hadoop workflow scheduler
Think: Apache Oozie is like Apache Airflow for Hadoop
Schedule chains of: MapReduce → Hive → Sqoop → Pig jobs
TWO JOB TYPES
1. WORKFLOW JOB:
A DAG of actions (nodes)
Actions: MapReduce, Hive, Sqoop, Pig, Shell, Java, Spark
Runs once (you submit it)
Has: start node → action nodes → decision nodes → fork/join → end node
2. COORDINATOR JOB:
Schedules Workflow jobs on:
a) TIME trigger: run every day at 8am
b) DATA trigger: run when this HDFS path has new data
"Coordinator" = orchestrator of workflows
More powerful than cron: waits for both time AND data availability
BUNDLE JOB
A collection of Coordinator jobs managed together
Like a "project" that contains multiple pipelines
xml
<!-- workflow.xml — simple Hive workflow -->
<workflow-app name="daily-booking-report" xmlns="uri:oozie:workflow:0.5">

    <start to="hive-action"/>

    <!-- Action 1: Run Hive query -->
    <action name="hive-action">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>/user/oozie/scripts/aggregate_bookings.hql</script>
            <param>booking_date=${booking_date}</param>
        </hive>
        <ok to="sqoop-export"/>
        <error to="fail"/>
    </action>

    <!-- Action 2: Export result to Oracle -->
    <action name="sqoop-export">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <command>export --connect jdbc:oracle:... --table REPORT_TABLE ...</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>
</workflow-app>

SECTION 11: ZOOKEEPER — DISTRIBUTED COORDINATION

🧠 ZOOKEEPER = Distributed coordination service
Think: ZooKeeper is like a "trusted referee" for distributed systems.
When nodes disagree (who is the master? is this node alive?),
ZooKeeper is the single source of truth.
WHAT ZOOKEEPER DOES IN HADOOP
1. NameNode HA:
Keeps track of which NameNode is Active
ZKFC (ZooKeeper Failover Controller) on each NameNode
If Active NN dies → ZKFC detects → ZK elects new Active
2. HBase:
HMaster election (Active/Standby)
Stores META table location (where is the region directory?)
RegionServer health monitoring
3. YARN ResourceManager HA:
Active/Standby RM election
Stores AM state for recovery after RM restart
ZooKeeper DATA MODEL:
Tree of "znodes" (like directories/files)
Ephemeral znodes: auto-deleted when creator disconnects
→ Used for leader election (NN creates /namenode/active as ephemeral znode)
→ If NN dies → ephemeral znode deleted → Standby sees znode gone → takes over!
Persistent znodes: stay until explicitly deleted
→ Used for configuration storage
ZooKeeper ENSEMBLE:
Always odd number of ZooKeeper servers (3, 5, 7)
Quorum needed: majority must be up (3-node → need 2; 5-node → need 3)
3-node ZK: can handle 1 failure
5-node ZK: can handle 2 failures
⚠️NEVER run 2 ZooKeeper nodes! (1 failure = no quorum = total outage)
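The quorum arithmetic behind the odd-number rule is worth writing out once:

```python
# Quorum math behind the "always an odd number of ZooKeeper nodes" rule:
def tolerated_failures(nodes: int) -> int:
    quorum = nodes // 2 + 1          # strict majority required to operate
    return nodes - quorum            # how many nodes can die before outage

for n in [2, 3, 4, 5]:
    print(f"{n} nodes → quorum {n // 2 + 1} → tolerates {tolerated_failures(n)} failure(s)")
# 2 nodes tolerate 0 failures — a 2-node ensemble is no better than 1 for availability
# 4 nodes tolerate only 1 failure, same as 3 — the extra node buys nothing
```

This is exactly why the even sizes are traps: going from 3 to 4 nodes raises the quorum from 2 to 3 without raising fault tolerance, and 2 nodes cannot survive any failure at all.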

SECTION 12: FLUME — LOG INGESTION

🧠 FLUME = Distributed, reliable log collection service
Use case: Stream web server access logs from hundreds of servers INTO HDFS in real-time
FLUME ARCHITECTURE
Source → Channel → Sink
(events in)   (buffer)   (output)
SOURCE: Where data comes from
Exec source: tail -F /var/log/apache/access.log
Spooling directory: watches a folder for new files
Avro source: receives Avro-encoded events from other Flume agents
Kafka source: reads from Kafka topic
CHANNEL: Buffer between source and sink
Memory channel: events in RAM → FAST but data loss on crash!
File channel: events on local disk → DURABLE, survives restart
Kafka channel: uses Kafka as the buffer (most durable, 2023+)
⚡ INTERVIEW: "Memory channel = fast but not durable, File channel = durable"
SINK: Where data goes
HDFS sink: writes to HDFS (most common)
Kafka sink: forward to Kafka topic
Avro sink: forward to another Flume agent (fan-out)
Logger sink: just logs events (debugging)
HDFS SINK SETTINGS
hdfs.rollInterval = 3600 # create new HDFS file every 1 hour
hdfs.rollSize = 134217728 # roll file when it reaches 128 MB
hdfs.rollCount = 0 # don't roll by event count (0 = disabled)
hdfs.path = /user/logs/%Y/%m/%d/%H # time-based partitioning!

SECTION 13: DESIGN SCENARIO — Amadeus End-to-End Pipeline

"Design a Hadoop pipeline to process 100 million daily flight bookings at Amadeus"

🧠 Memory Map
INTERVIEWER: "Walk me through how you'd design the data pipeline."
YOUR ANSWER
INGESTION LAYER
Option A: Batch (daily) — Sqoop incremental import from Oracle booking DB
→ Run at 2am, import new bookings since last run
→ Write to HDFS: /data/raw/bookings/booking_date=YYYY-MM-DD/ (ORC format)
Option B: Real-time — Flume agents on web servers → Kafka → Spark Streaming
→ Near-real-time (minutes), for fraud detection use case
STORAGE
Raw (Bronze): HDFS, External Hive table, Avro or ORC format, partitioned by date
Processed (Silver): HDFS, ORC + Snappy, partitioned by booking_date + region
Aggregated (Gold): HDFS, ORC + ZLIB (more compressed for smaller aggregations)
PROCESSING
Hive + Tez for batch transformations
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;
Partition pruning for date-range queries
Map join for dimension tables (airports 10k rows, aircraft_types 500 rows)
SERVING
HBase: booking lookup service (real-time GET by booking_id)
Hive: ad-hoc analytics queries by data scientists
LLAP: BI dashboards (Tableau/PowerBI connecting via HiveServer2)
ORCHESTRATION
Oozie Coordinator: time trigger at 3am + data availability check
Workflow: SqoopHive transformations → Hive aggregations → Sqoop export to Oracle
OPTIMIZATION APPLIED
✓ ORC format throughout Silver/Gold layers
✓ Partition by booking_date (pruning for date-range queries)
✓ Tez execution engine (not MapReduce)
✓ Vectorization enabled
✓ CBO with ANALYZE TABLE for optimal join plans
✓ Map join for small dimension tables (airports, aircraft)
✓ Sort buffer 512 MB (not default 100 MB)
✓ Compression: Snappy for hot data, ZLIB for cold archive