🐘 Hadoop · Section 5 of 8

Day 2: Hive + Ecosystem — Quick Recall Guide
Legend: Must remember · 🔑 Key concept · ⚠️ Common trap · 🧠 Memory Map · 📝 One-liner

🧠 MASTER MEMORY MAP — Day 2

🧠 HIVE ARCHITECTURE = "DMCTE"
D — Driver (receives SQL, manages lifecycle)
M — Metastore (schema + HDFS location → MySQL backend!)
C — Compiler (SQL → query plan)
T — Tez/MR engine (executes plan on YARN)
E — Execution (submits to YARN, reads from HDFS)
HIVE OPTIMIZATION = "VECTOR-TOP":
V — Vectorization (1024 rows at once: hive.vectorized.execution.enabled=true)
E — Engine=Tez (hive.execution.engine=tez, not mr)
C — CBO stats (ANALYZE TABLE + hive.cbo.enable=true)
T — Tez DAG (in-memory multi-stage, no intermediate HDFS writes)
O — ORC format (columnar + predicate pushdown + ACID)
R — Repartition/skew fix (hive.groupby.skewindata=true)
T — Table map join (hive.auto.convert.join=true, small table → broadcast)
O — ORDER BY → SORT BY (local sort, use instead of global sort)
P — Partition pruning (filter on partition column, not derived expressions)
HIVE FILE FORMATS = "OPTA":
O — ORC (Hive-native, ACID, predicate pushdown — BEST FOR HIVE)
P — Parquet (cross-tool: Spark+Impala+Hive — BEST FOR MULTI-TOOL)
T — Text (never in production, no compression, full scan always)
A — Avro (schema evolution, Kafka/Sqoop landing — ROW-based)
INTERNAL vs EXTERNAL:
Internal = Hive OWNS data → DROP TABLE = data DELETED from HDFS
External = Hive POINTS to data → DROP TABLE = only metadata deleted, data SAFE
⚡ Rule: External for shared/raw data, Internal for Hive-only output
ECOSYSTEM = "SQOF-ZK":
S — Sqoop (RDBMS ↔ HDFS transfer)
Q — Query (Hive = SQL on HDFS)
O — Oozie (workflow + coordinator jobs)
F — Flume (log streaming → HDFS)
ZK — ZooKeeper (distributed coordination, leader election)

SECTION 1: HIVE — DIRECT QUESTIONS

Q1 — What is the Hive Metastore?

MySQL/PostgreSQL database that stores schema metadata: table names, column types, HDFS paths, partition information, file formats. Hive queries it to know WHERE data lives in HDFS. If Metastore goes down → ALL Hive queries fail.

Q2 — Internal vs External table — key difference?

Internal: Hive owns data. DROP TABLE = delete metadata + delete HDFS data. External: Hive points to data. DROP TABLE = delete metadata only, HDFS data untouched.
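
A minimal HiveQL sketch of the difference (table names, columns, and the HDFS path are illustrative, not from this guide):

```sql
-- Internal (managed): Hive owns the files under its warehouse directory.
CREATE TABLE bookings_managed (
  booking_id BIGINT,
  amount     DOUBLE
)
STORED AS ORC;

-- External: Hive only points at an existing HDFS location.
CREATE EXTERNAL TABLE bookings_raw (
  booking_id BIGINT,
  amount     DOUBLE
)
STORED AS ORC
LOCATION '/data/raw/bookings';

-- DROP TABLE bookings_managed;  -- deletes metadata AND the HDFS files
-- DROP TABLE bookings_raw;      -- deletes metadata only; /data/raw/bookings survives
```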

Q3 — When to use External table?

Always for: raw/Bronze landing tables, shared data (Spark also reads it), Sqoop-ingested data, data you can't afford to accidentally delete.

Q4 — What is a Hive partition?

A subdirectory in HDFS organized by a column value. booking_date=2024-01-15/ is one partition. Partition pruning skips unrelated directories → reads only what's needed.
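
A sketch of a partitioned table DDL (schema and location are illustrative). Note the partition column lives in the directory names, not inside the data files:

```sql
CREATE EXTERNAL TABLE bookings (
  booking_id BIGINT,
  amount     DOUBLE
)
PARTITIONED BY (booking_date STRING)
STORED AS ORC
LOCATION '/data/bookings';

-- Resulting HDFS layout (one directory per partition):
--   /data/bookings/booking_date=2024-01-15/part-00000.orc
--   /data/bookings/booking_date=2024-01-16/part-00000.orc
```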

⚠️ Q5 — Common partition column trap?

WHERE YEAR(booking_date) = 2024 → NO partition pruning (function applied to column). Fix: WHERE booking_date BETWEEN '2024-01-01' AND '2024-12-31' — Hive can prune date range.
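
Side by side (assuming the illustrative bookings table partitioned by booking_date):

```sql
-- BAD: function wraps the partition column → Hive scans every partition
SELECT COUNT(*) FROM bookings
WHERE YEAR(booking_date) = 2024;

-- GOOD: plain range predicate on the partition column → pruning works
SELECT COUNT(*) FROM bookings
WHERE booking_date BETWEEN '2024-01-01' AND '2024-12-31';

-- Verify: EXPLAIN DEPENDENCY lists the partitions a query will actually read
```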

Q6 — Static vs Dynamic partitioning?

Static: you specify partition value in INSERT (PARTITION (date='2024-01-15')). Dynamic: Hive reads partition value from the data itself (PARTITION (date), last column in SELECT). Dynamic can create many files → enable merge settings.
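
Both styles sketched on the illustrative bookings/staging tables:

```sql
-- Static: the partition value is a literal in the statement
INSERT OVERWRITE TABLE bookings PARTITION (booking_date='2024-01-15')
SELECT booking_id, amount FROM staging
WHERE event_date = '2024-01-15';

-- Dynamic: Hive derives partition values from the LAST column of the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE bookings PARTITION (booking_date)
SELECT booking_id, amount, event_date FROM staging;
-- hive.merge.tezfiles=true can merge the many small files dynamic inserts create
```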

SECTION 2: HIVE SCENARIOS

Q7 SCENARIO: Sqoop wrote new data to HDFS but Hive shows 0 rows. Fix?

Metastore doesn't know about the new partition. Fix: MSCK REPAIR TABLE bookings; — scans HDFS, adds missing partitions to Metastore. Or faster: ALTER TABLE bookings ADD PARTITION (booking_date='2024-03-01') LOCATION '...'

⚠️ Q8 SCENARIO: Someone ran hdfs dfs -rm -r on a Hive partition. Now what?

Check Trash first: hdfs dfs -ls /user/username/.Trash/Current/ → restore if found. If no trash: restore from backup. If no backup: ALTER TABLE bookings DROP PARTITION (booking_date='...') to remove stale metadata. Prevent with snapshots.

Q9 SCENARIO: Hive query runs for hours. Top 5 things to check?

    1. SET hive.execution.engine → should be tez, not mr
    2. SET hive.vectorized.execution.enabled=true (ORC only)
    3. Check if partition pruning works (WHERE on partition column directly)
    4. Any shuffle joins that could be map joins? (small tables < 25 MB)
    5. Data skew? SET hive.groupby.skewindata=true
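
The checklist above can be applied as one session-level settings block (all properties appear elsewhere in this guide; the ANALYZE statement uses the illustrative bookings table):

```sql
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;   -- ORC tables only
SET hive.cbo.enable=true;
SET hive.auto.convert.join=true;              -- map join for small tables
SET hive.groupby.skewindata=true;             -- two-phase GROUP BY for hot keys
ANALYZE TABLE bookings COMPUTE STATISTICS;    -- feed stats to the CBO
```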

SECTION 3: HIVE OPTIMIZATIONS — FLASH CARDS

Q10 — What does hive.execution.engine=tez do?

Replaces MapReduce with Apache Tez. Tez builds a DAG of operators in memory — no intermediate HDFS writes between stages. Result: 5-10x faster for multi-stage queries.

Q11 — What does vectorization do?

SET hive.vectorized.execution.enabled=true — processes 1024 rows at once using CPU SIMD instructions instead of row-by-row. 2-5x faster for aggregations. Only works with ORC format.

Q12 — What is a Map Join and when does Hive use it?

Load small table into every mapper's RAM → join locally without shuffle. Hive auto-decides: hive.auto.convert.join=true with threshold hive.mapjoin.smalltable.filesize (default ≈25 MB, value set in bytes). Force with /*+ MAPJOIN(table) */ hint.
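
A sketch of both routes (dim_country is a hypothetical small dimension table):

```sql
-- Let Hive decide automatically (threshold is in bytes)
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- ~25 MB

-- Or force the map join with a hint
SELECT /*+ MAPJOIN(c) */ b.booking_id, c.country_name
FROM bookings b
JOIN dim_country c ON b.country_code = c.country_code;
```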

⚠️ Q13 — ORDER BY vs SORT BY — when to use each?

Pro Tip: ORDER BY: global sort → all data → ONE reducer (slow for billions!). SORT BY: each reducer sorts its own output (local sort, faster). Use SORT BY + multiple reducers when global order not required.
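
A sketch on the illustrative bookings table (DISTRIBUTE BY is assumed here to control which reducer sees which keys):

```sql
-- Global order, one reducer — only when total order is truly required:
SELECT * FROM bookings ORDER BY amount DESC LIMIT 100;

-- Per-reducer (local) sort, parallel across reducers:
SELECT * FROM bookings
DISTRIBUTE BY country_code
SORT BY amount DESC;
```
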

Q14 — What is MSCK REPAIR TABLE?

MetaStore Check — scans HDFS for new partition directories and adds them to the Hive Metastore. Needed when files are added directly to HDFS without going through Hive INSERT. ⚠️ Slow on tables with 10,000+ partitions.

Q15 — What is ORC predicate pushdown?

ORC stores min/max statistics per column per stripe. When you do WHERE amount > 1000, Hive reads only stripes where max(amount) > 1000. Skip stripes without touching them → can skip 90%+ of data!
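
A minimal sketch (table and column names are illustrative; whether the pushdown flag is on by default varies by Hive version):

```sql
SET hive.optimize.index.filter=true;  -- enable ORC predicate pushdown

CREATE TABLE payments (
  payment_id BIGINT,
  amount     DOUBLE
)
STORED AS ORC;

-- Only stripes whose stored max(amount) exceeds 1000 are read at all:
SELECT payment_id FROM payments WHERE amount > 1000;
```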

Q16 — What is hive.groupby.skewindata=true?

Two-phase aggregation for skewed GROUP BY. Phase 1: randomly distribute keys → partial aggregation in parallel. Phase 2: final aggregation by real key. Prevents one reducer getting all "US" country records.

SECTION 4: HBASE — FLASH CARDS

Q17 — What is HBase and when do you use it?

NoSQL column-family database on top of HDFS. Use for: random row-level reads/writes (GET/PUT by key), sub-millisecond access, millions of concurrent point queries. NOT for: batch analytics (use Hive for that).

Q18 — What is a Row Key in HBase?

Primary sort key for all HBase data. All access is BY row key. Lexicographically sorted. Design it based on your access pattern (how will you look up rows?).

⚠️ Q19 — Why is timestamp as row key bad?

Monotonically increasing timestamps → all new writes go to the LAST region (last RegionServer) → write hotspot! All 1000 writes/second hit one machine. Fix: salt prefix (hash(id) % 10) or reverse timestamp (Long.MAX_VALUE - timestamp).

Q20 — HBase write path?
    1. WAL (Write-Ahead Log on HDFS) for durability
    2. MemStore (RAM) for fast writes
    3. When MemStore full (128 MB) → flush to HFile on HDFS
    4. Background: compact HFiles to reduce read overhead.

SECTION 5: SQOOP — FLASH CARDS

Q21 — What is Sqoop incremental import?

Import only NEW rows since last run. Mode append: only rows where check-column > last-value. Mode lastmodified: rows updated after last-value timestamp. Much faster than full daily import!
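
A hedged command-line sketch of an append-mode incremental import (connection string, table, check column, and last value are all illustrative):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table bookings \
  --target-dir /data/raw/bookings \
  --incremental append \
  --check-column booking_id \
  --last-value 1000000 \
  --num-mappers 4
```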

Q22 — What does --num-mappers do in Sqoop?

Number of parallel DB connections to use. Each mapper reads a range of rows. ⚠️ Too high → DB connection pool exhausted. Typical production: 4-8 mappers.

⚠️ Q23 — Sqoop export data safety issue?

Sqoop EXPORT (HDFS → DB) is NOT atomic. If it fails halfway → partial data in DB. Solution: export to staging table → run DB transaction to swap (DELETE + INSERT in one transaction).
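
The staging-table swap, sketched as SQL run on the RDBMS after the Sqoop export into the staging table succeeds (table names are illustrative; transaction syntax varies by database):

```sql
BEGIN;
DELETE FROM bookings;                            -- clear the live table
INSERT INTO bookings
SELECT * FROM bookings_staging;                  -- promote the staged rows
COMMIT;                                          -- one atomic swap: readers never see partial data
```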

SECTION 6: OOZIE — FLASH CARDS

Q24 — Oozie Workflow vs Coordinator job?

Workflow: a DAG of actions (Hive→Sqoop→MapReduce), runs once when submitted. Coordinator: schedules Workflows based on TIME trigger (every 8am) or DATA trigger (when HDFS path has new data). Use Coordinator for production pipelines.

Q25 — What's the advantage of Oozie Coordinator data trigger?

Job starts only when upstream data arrives in HDFS — not just at a scheduled time. So if upstream job is delayed 2 hours, Coordinator waits → no failed run due to missing data. Airflow equivalent: ExternalTaskSensor.

SECTION 7: ZOOKEEPER + FLUME — FLASH CARDS

Q26 — What does ZooKeeper do in Hadoop?

Distributed coordination: NameNode HA leader election, HMaster election for HBase, RegionServer health monitoring, YARN ResourceManager HA. Uses ephemeral znodes (auto-deleted on disconnect) for leader tracking.

Q27 — Why must ZooKeeper have an odd number of nodes?

Quorum = majority needed for decisions (3 → need 2, 5 → need 3). Even number: possible tie → no majority → ZK stuck. Minimum 3 nodes. ⚠️ Never run 2-node ZK!

Q28 — Flume Memory Channel vs File Channel?

Memory channel: events in RAM, FAST but DATA LOSS if agent crashes. File channel: events on local disk, DURABLE (survives restart). Production: always use File channel or Kafka channel.

🧠 FINAL REVISION — Day 2 Summary Card

📐 Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│                  DAY 2: HIVE + ECOSYSTEM                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  HIVE ARCHITECTURE = "DMCTE":                                    │
│  Driver → Metastore → Compiler → Tez Engine → Execution          │
│  Metastore backend = MySQL (NOT HDFS) — it goes down → all fail  │
│                                                                  │
│  INTERNAL vs EXTERNAL:                                           │
│  Internal: DROP TABLE = metadata + HDFS data deleted!            │
│  External: DROP TABLE = only metadata, HDFS data SAFE            │
│  Use External for: raw data, shared data, Sqoop imports          │
│                                                                  │
│  PARTITIONING:                                                   │
│  Static: PARTITION (date='2024-01-15') — manual                  │
│  Dynamic: SET hive.exec.dynamic.partition.mode=nonstrict         │
│  MSCK REPAIR TABLE = add missing partitions to Metastore         │
│  Partition pruning: filter DIRECTLY on partition column!         │
│                                                                  │
│  HIVE OPTIMIZATION = "VECTOR-TOP":                               │
│  1. Engine=Tez (5-10x vs MapReduce)                              │
│  2. Vectorization (1024 rows batch, ORC only)                    │
│  3. CBO + ANALYZE TABLE (stats for optimal join plan)            │
│  4. Map Join (small table < 25 MB → broadcast, no shuffle)       │
│  5. ORC + predicate pushdown (skip stripes by min/max)           │
│  6. skewindata=true (2-phase GROUP BY for hot keys)              │
│  7. SORT BY not ORDER BY (unless global sort needed)             │
│                                                                  │
│  FILE FORMATS = "OPTA":                                          │
│  ORC: Hive-native, ACID, predicate pushdown                      │
│  Parquet: cross-tool (Spark+Impala+Hive)                         │
│  Text: never in production                                       │
│  Avro: schema evolution, Kafka/Sqoop landing                     │
│                                                                  │
│  HBASE:                                                          │
│  Random read/write on HDFS (what Hive can't do!)                 │
│  Row key: NEVER monotonic (hotspot!), use salt/reverse timestamp │
│  Write path: WAL → MemStore → HFile (flush on 128 MB)            │
│                                                                  │
│  SQOOP:                                                          │
│  Incremental import: --incremental append/lastmodified           │
│  --split-by: parallelize by column (default: primary key)        │
│  Export: NOT atomic → use staging table!                         │
│                                                                  │
│  OOZIE:                                                          │
│  Workflow: DAG of actions (run once)                             │
│  Coordinator: time + data trigger (production pipelines)         │
│                                                                  │
│  ZOOKEEPER:                                                      │
│  Leader election for NameNode/HMaster/YARN RM                    │
│  Always odd nodes (3/5/7) for quorum                             │
│  Ephemeral znodes = auto-deleted on crash = leader detection     │
│                                                                  │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                               │
│  1. "External tables for all raw data — DROP TABLE is safe"      │
│  2. "MSCK REPAIR TABLE when Sqoop adds files outside Hive"       │
│  3. "Tez + Vectorization + ORC = 10-50x faster than MR+Text"     │
│  4. "HBase row key must not be monotonic — causes hotspot!"      │
│  5. "Oozie Coordinator = time + data trigger (smart scheduling)" │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘