Day 2: Hive + Ecosystem — Quick Recall Guide
🧠 MASTER MEMORY MAP — Day 2
SECTION 1: HIVE — DIRECT QUESTIONS
Q1: What is the Hive Metastore?
MySQL/PostgreSQL database that stores schema metadata: table names, column types, HDFS paths, partition information, file formats. Hive queries it to know WHERE data lives in HDFS. If Metastore goes down → ALL Hive queries fail.
Q2: Internal vs External tables — what's the difference?
Internal: Hive owns the data. DROP TABLE = delete metadata + delete HDFS data. External: Hive points to the data. DROP TABLE = delete metadata only, HDFS data untouched.
Q3: When should you use External tables?
Always for: raw/Bronze landing tables, shared data (Spark also reads it), Sqoop-ingested data, data you can't afford to accidentally delete.
Q4: What is a partition in Hive?
A subdirectory in HDFS organized by a column value. booking_date=2024-01-15/ is one partition. Partition pruning skips unrelated directories → reads only what's needed.
Q5: Why does WHERE YEAR(booking_date) = 2024 not use partition pruning?
WHERE YEAR(booking_date) = 2024 → NO partition pruning (function applied to the partition column hides it from the planner). Fix: WHERE booking_date BETWEEN '2024-01-01' AND '2024-12-31' — Hive can prune the date range.
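A minimal sketch of the two filters, assuming a hypothetical bookings table partitioned by booking_date:

```sql
-- NO pruning: the function wraps the partition column,
-- so Hive scans every partition directory
SELECT * FROM bookings
WHERE YEAR(booking_date) = 2024;

-- Pruning works: the filter is on the raw partition column,
-- so only the 2024 directories are read
SELECT * FROM bookings
WHERE booking_date BETWEEN '2024-01-01' AND '2024-12-31';
```

Verify with EXPLAIN: the pruned plan lists only the matching partitions.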
Q6: Static vs dynamic partitioning?
Static: you specify the partition value in the INSERT (PARTITION (date='2024-01-15')). Dynamic: Hive reads the partition value from the data itself (PARTITION (date), last column in SELECT). Dynamic can create many small files → enable merge settings.
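Both forms side by side — a sketch assuming hypothetical bookings and staging_bookings tables:

```sql
-- Static: partition value written explicitly in the INSERT
INSERT INTO bookings PARTITION (booking_date='2024-01-15')
SELECT id, amount FROM staging_bookings WHERE dt = '2024-01-15';

-- Dynamic: Hive derives the partition value from the LAST column of the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO bookings PARTITION (booking_date)
SELECT id, amount, booking_date FROM staging_bookings;
```

nonstrict mode is required when no static partition key is given at all; strict mode (the default) demands at least one.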
SECTION 2: HIVE SCENARIOS
⚡ Q7 SCENARIO: Sqoop wrote new data to HDFS but Hive shows 0 rows. Fix?
Metastore doesn't know about the new partition. Fix: MSCK REPAIR TABLE bookings; — scans HDFS, adds missing partitions to Metastore. Or faster: ALTER TABLE bookings ADD PARTITION (booking_date='2024-03-01') LOCATION '...'
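The two fixes as runnable statements (the HDFS path is illustrative, not from the source):

```sql
-- Option 1: scan HDFS and register every missing partition
MSCK REPAIR TABLE bookings;

-- Option 2 (faster when you know the one partition):
ALTER TABLE bookings ADD IF NOT EXISTS
  PARTITION (booking_date='2024-03-01')
  LOCATION '/data/bookings/booking_date=2024-03-01';  -- hypothetical path

-- Verify the Metastore now sees it
SHOW PARTITIONS bookings;
```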
⚠️ Q8 SCENARIO: Someone ran hdfs dfs -rm -r on a Hive partition. Now what?
Check Trash first: hdfs dfs -ls /user/username/.Trash/Current/ → restore if found. If no trash: restore from backup. If no backup: ALTER TABLE bookings DROP PARTITION (booking_date='...') to remove stale metadata. Prevent with snapshots.
Q9 SCENARIO: Hive query runs for hours. Top 5 things to check?
1. SET hive.execution.engine — should be tez, not mr
2. SET hive.vectorized.execution.enabled=true (ORC only)
3. Check if partition pruning works (WHERE on partition column directly)
4. Any shuffle joins that could be map joins? (small tables < 25 MB)
5. Data skew? SET hive.groupby.skewindata=true
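The checklist above, collected as a session preamble you could paste before the slow query (the 25 MB threshold is in bytes):

```sql
SET hive.execution.engine=tez;                  -- 1. Tez, not MapReduce
SET hive.vectorized.execution.enabled=true;     -- 2. vectorization (ORC only)
-- 3. pruning: filter directly on the partition column, no functions on it
SET hive.auto.convert.join=true;                -- 4. auto-convert to map join
SET hive.mapjoin.smalltable.filesize=25000000;  --    small-table cutoff ~25 MB
SET hive.groupby.skewindata=true;               -- 5. two-phase GROUP BY for skew
```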
SECTION 3: HIVE OPTIMIZATIONS — FLASH CARDS
What does hive.execution.engine=tez do? Replaces MapReduce with Apache Tez. Tez builds a DAG of operators in memory — no intermediate HDFS writes between stages. Result: 5-10x faster for multi-stage queries.
SET hive.vectorized.execution.enabled=true — processes 1024 rows at once using CPU SIMD instructions instead of row-by-row. 2-5x faster for aggregations. Only works with ORC format.
Map join: load the small table into every mapper's RAM → join locally without a shuffle. Hive auto-decides: hive.auto.convert.join=true with threshold hive.mapjoin.smalltable.filesize=25 MB. Force with the /*+ MAPJOIN(table) */ hint.
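A sketch of the forced hint, assuming a hypothetical small dim_country table joined to bookings:

```sql
-- Force dim_country into mappers' memory; the hint name must match the alias
SELECT /*+ MAPJOIN(c) */ b.id, c.country_name
FROM bookings b
JOIN dim_country c ON b.country_code = c.country_code;
```

With auto-convert enabled the hint is usually unnecessary — Hive picks the map join itself when the table is under the size threshold.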
MSCK REPAIR TABLE = MetaStore Check — scans HDFS for new partition directories and adds them to the Hive Metastore. Needed when files are added directly to HDFS without going through Hive INSERT. ⚠️ Slow on tables with 10,000+ partitions.
ORDER BY: global sort → all data → ONE reducer (slow for billions!). SORT BY: each reducer sorts its own output (local sort, faster). Use SORT BY + multiple reducers when global order is not required.
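The ORDER BY / SORT BY distinction in query form (bookings is a hypothetical table):

```sql
-- Global total order: every row funnels through ONE reducer
SELECT * FROM bookings ORDER BY amount DESC;

-- Per-reducer local order: parallel, but no global ordering guarantee
SELECT * FROM bookings SORT BY amount DESC;

-- DISTRIBUTE BY + SORT BY: all rows for a key land on one reducer, sorted there
SELECT * FROM bookings
DISTRIBUTE BY country
SORT BY country, amount DESC;
```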
ORC stores min/max statistics per column per stripe. When you do WHERE amount > 1000, Hive reads only stripes where max(amount) > 1000. Skip stripes without touching them → can skip 90%+ of data!
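A sketch of an ORC table that benefits from stripe skipping (table and compression choice are illustrative):

```sql
-- ORC keeps min/max per column per stripe, enabling predicate pushdown
CREATE TABLE bookings_orc (
  id     BIGINT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

SET hive.optimize.ppd=true;  -- predicate pushdown (on by default)
-- Stripes whose max(amount) <= 1000 are skipped without being read
SELECT * FROM bookings_orc WHERE amount > 1000;
```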
What does hive.groupby.skewindata=true do? Two-phase aggregation for skewed GROUP BY. Phase 1: randomly distribute keys → partial aggregation in parallel. Phase 2: final aggregation by real key. Prevents one reducer getting all "US" country records.
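What the setting automates, sketched by hand — a random bucket spreads the hot key in phase 1, then phase 2 re-aggregates by the real key (bookings/country are hypothetical names):

```sql
SELECT country, SUM(partial_cnt) AS cnt        -- phase 2: final aggregation
FROM (
  SELECT country, bucket, COUNT(*) AS partial_cnt  -- phase 1: partial, parallel
  FROM (
    SELECT country, CAST(RAND() * 10 AS INT) AS bucket
    FROM bookings
  ) r
  GROUP BY country, bucket
) t
GROUP BY country;
```

The inner subquery materializes the random bucket once so both grouping levels see the same value.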
SECTION 4: HBASE — FLASH CARDS
HBase: NoSQL column-family database on top of HDFS. Use for: random row-level reads/writes (GET/PUT by key), sub-millisecond access, millions of concurrent point queries. NOT for: batch analytics (use Hive for that).
Row key: the primary sort key for all HBase data. All access is BY row key. Lexicographically sorted. Design it based on your access pattern (how will you look up rows?).
Monotonically increasing timestamps → all new writes go to the LAST region (last RegionServer) → write hotspot! All 1000 writes/second hit one machine. Fix: salt prefix (hash(id) % 10) or reverse timestamp (Long.MAX_VALUE - timestamp).
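Both fixes, sketched in Hive SQL for a hypothetical events table (PMOD/HASH are standard Hive functions; key layout is illustrative):

```sql
-- Salt prefix: '0_evt123' .. '9_evt123' spreads sequential writes
-- across 10 row-key buckets instead of one hot region
SELECT CONCAT(CAST(PMOD(HASH(event_id), 10) AS STRING), '_', event_id) AS rowkey
FROM events;

-- Reverse timestamp: Long.MAX_VALUE - ts makes the NEWEST rows sort first,
-- so recent writes no longer pile onto the last region
SELECT CONCAT(event_id, '_',
              CAST(CAST(9223372036854775807 AS BIGINT) - event_ts AS STRING))
FROM events;
```

Trade-off: salting kills simple range scans by time — a scan must fan out across all salt buckets.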
HBase write path:
1. WAL (Write-Ahead Log on HDFS) for durability
2. MemStore (RAM) for fast writes
3. When MemStore fills (default 128 MB) → flush to HFile on HDFS
4. Background: compact HFiles to reduce read overhead
SECTION 5: SQOOP — FLASH CARDS
Sqoop incremental import: import only NEW rows since the last run. Mode append: only rows where check-column > last-value. Mode lastmodified: rows updated after the last-value timestamp. Much faster than a full daily import!
What does --num-mappers do in Sqoop? Number of parallel DB connections to use. Each mapper reads a range of rows. ⚠️ Too high → DB connection pool exhausted. Typical production: 4-8 mappers.
Sqoop EXPORT (HDFS → DB) is NOT atomic. If it fails halfway → partial data in DB. Solution: export to staging table → run DB transaction to swap (DELETE + INSERT in one transaction).
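The staging swap on the RDBMS side, sketched in generic ANSI SQL (table names hypothetical; Sqoop 1 can also automate this via --staging-table / --clear-staging-table):

```sql
-- Sqoop exported into bookings_staging; now swap atomically in ONE transaction
BEGIN;
DELETE FROM bookings WHERE booking_date = '2024-03-01';
INSERT INTO bookings
SELECT * FROM bookings_staging;
COMMIT;
```

If the export itself fails, only the staging table holds partial data — the live table is never half-written.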
SECTION 6: OOZIE — FLASH CARDS
Workflow: a DAG of actions (Hive→Sqoop→MapReduce), runs once when submitted. Coordinator: schedules Workflows based on TIME trigger (every 8am) or DATA trigger (when HDFS path has new data). Use Coordinator for production pipelines.
Job starts only when upstream data arrives in HDFS — not just at a scheduled time. So if upstream job is delayed 2 hours, Coordinator waits → no failed run due to missing data. Airflow equivalent: ExternalTaskSensor.
SECTION 7: ZOOKEEPER + FLUME — FLASH CARDS
Distributed coordination: NameNode HA leader election, HMaster election for HBase, RegionServer health monitoring, YARN ResourceManager HA. Uses ephemeral znodes (auto-deleted on disconnect) for leader tracking.
Quorum = majority needed for decisions (3 → need 2, 5 → need 3). Even number: possible tie → no majority → ZK stuck. Minimum 3 nodes. ⚠️ Never run 2-node ZK!
Flume channels — Memory channel: events in RAM, FAST but DATA LOSS if the agent crashes. File channel: events on local disk, DURABLE (survives restart). Production: always use the File channel or Kafka channel.
🧠 FINAL REVISION — Day 2 Summary Card
DAY 2: HIVE + ECOSYSTEM

HIVE ARCHITECTURE = "DMCTE":
  Driver → Metastore → Compiler → Tez Engine → Execution
  Metastore backend = MySQL (NOT HDFS) — it goes down → all queries fail

INTERNAL vs EXTERNAL:
  Internal: DROP TABLE = metadata + HDFS data deleted!
  External: DROP TABLE = only metadata, HDFS data SAFE
  Use External for: raw data, shared data, Sqoop imports

PARTITIONING:
  Static: PARTITION (date='2024-01-15') — manual
  Dynamic: SET hive.exec.dynamic.partition.mode=nonstrict
  MSCK REPAIR TABLE = add missing partitions to Metastore
  Partition pruning: filter DIRECTLY on partition column!

HIVE OPTIMIZATION = "VECTOR-TOP":
  1. Engine=Tez (5-10x vs MapReduce)
  2. Vectorization (1024-row batches, ORC only)
  3. CBO + ANALYZE TABLE (stats for optimal join plan)
  4. Map Join (small table < 25 MB → broadcast, no shuffle)
  5. ORC + predicate pushdown (skip stripes by min/max)
  6. skewindata=true (2-phase GROUP BY for hot keys)
  7. SORT BY not ORDER BY (unless global sort needed)

FILE FORMATS = "OPTA":
  ORC: Hive-native, ACID, predicate pushdown
  Parquet: cross-tool (Spark+Impala+Hive)
  Text: never in production
  Avro: schema evolution, Kafka/Sqoop landing

HBASE:
  Random read/write on HDFS (what Hive can't do!)
  Row key: NEVER monotonic (hotspot!), use salt/reverse timestamp
  Write path: WAL → MemStore → HFile (flush at 128 MB)

SQOOP:
  Incremental import: --incremental append/lastmodified
  --split-by: parallelize by column (default: primary key)
  Export: NOT atomic → use staging table!

OOZIE:
  Workflow: DAG of actions (run once)
  Coordinator: time + data trigger (production pipelines)

ZOOKEEPER:
  Leader election for NameNode/HMaster/YARN RM
  Always odd node count (3/5/7) for quorum
  Ephemeral znodes = auto-deleted on crash = leader detection

TOP 5 THINGS TO SAY IN INTERVIEW:
  1. "External tables for all raw data — DROP TABLE is safe"
  2. "MSCK REPAIR TABLE when Sqoop adds files outside Hive"
  3. "Tez + Vectorization + ORC = 10-50x faster than MR+Text"
  4. "HBase row key must not be monotonic — causes hotspot!"
  5. "Oozie Coordinator = time + data trigger (smart scheduling)"