Day 2: Hive + Ecosystem — Quick Recall Guide
Hadoop · Section 4 of 9


🧠 MASTER MEMORY MAP — Day 2

🧠 HIVE ARCHITECTURE = "DMCTE"
D: Driver (receives SQL, manages the query lifecycle)
M: Metastore (schema + HDFS location, kept in an RDBMS backend such as MySQL, not in HDFS)
C: Compiler (SQL → query plan)
T: Tez/MR engine (executes the plan on YARN)
E: Execution (submits to YARN, reads from HDFS)
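
As a rough illustration of the Compiler step, Hive's EXPLAIN statement prints the plan the compiler produces without running the query. The table `sales` and its columns are hypothetical, chosen only for this sketch.

```sql
-- Hypothetical table: sales(region STRING, amount DOUBLE), partitioned by dt.
-- EXPLAIN prints the stage plan built by the compiler; nothing is executed.
EXPLAIN
SELECT region, SUM(amount) AS total
FROM sales
WHERE dt = '2024-01-15'
GROUP BY region;
```

With the Tez engine selected, the output typically lists map and reducer vertices of a single DAG rather than a chain of separate MapReduce jobs.
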
HIVE OPTIMIZATION = "VECTOR-TOP":
V: Vectorization (processes 1024 rows per batch: hive.vectorized.execution.enabled=true)
E: Engine = Tez (hive.execution.engine=tez, not mr)
C: CBO stats (ANALYZE TABLE ... COMPUTE STATISTICS + hive.cbo.enable=true)
T: Tez DAG (multi-stage plan runs as one DAG, no intermediate HDFS writes between stages)
O: ORC format (columnar + predicate pushdown + ACID)
R: Repartition/skew fix (hive.groupby.skewindata=true)
T: Table map join (hive.auto.convert.join=true, small table → broadcast)
O: ORDER BY → SORT BY (ORDER BY forces a global sort through one reducer; SORT BY sorts within each reducer, so prefer it when a total order is not required)
P: Partition pruning (filter directly on the partition column, not on an expression derived from it)
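
The settings in the list above can be sketched as a session preamble followed by the queries they affect. This is a minimal sketch, not a tuned configuration: the property names come from the list, while the table `sales`, its columns, and the partition column `dt` are hypothetical.

```sql
-- Session-level switches named in the list above.
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;
SET hive.cbo.enable=true;
SET hive.auto.convert.join=true;
SET hive.groupby.skewindata=true;

-- CBO relies on statistics. `sales` is a hypothetical table partitioned by dt.
ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;

-- Partition pruning: the filter names the partition column directly.
SELECT region, SUM(amount) AS total
FROM sales
WHERE dt = '2024-01-15'
GROUP BY region;

-- Wrapping the partition column in an expression may prevent pruning.
SELECT region, SUM(amount) AS total
FROM sales
WHERE substr(dt, 1, 4) = '2024'
GROUP BY region;

-- SORT BY orders rows within each reducer; DISTRIBUTE BY controls
-- which reducer a row goes to. No single-reducer global sort is needed.
SELECT region, amount
FROM sales
DISTRIBUTE BY region
SORT BY region, amount DESC;
```

Running EXPLAIN on the two filtered queries is one way to check whether pruning actually happened on a given Hive version.
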
HIVE FILE FORMATS = "OPTA":
O: ORC (Hive-native, ACID, predicate pushdown; BEST FOR HIVE)
P: Parquet (cross-tool: Spark + Impala + Hive; BEST FOR MULTI-TOOL)
T: Text (avoid in production: row-based, no columnar compression or predicate pushdown, always a full scan)
A: Avro (ROW-based, schema evolution; Kafka/Sqoop landing format)
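
In HiveQL the format is chosen per table with STORED AS. The sketch below shows one table per format; the table and column names are hypothetical, and the ORC compression property is just one common choice.

```sql
-- Hypothetical tables with the same columns, one per format from the list above.
CREATE TABLE events_orc (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

CREATE TABLE events_parquet (id BIGINT, payload STRING)
STORED AS PARQUET;

CREATE TABLE events_avro (id BIGINT, payload STRING)
STORED AS AVRO;

-- Plain delimited text, usually only as a raw landing format.
CREATE TABLE events_text (id BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```
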
INTERNAL vs EXTERNAL:
Internal = Hive OWNS the data. DROP TABLE = data DELETED from HDFS
External = Hive POINTS to the data. DROP TABLE = only metadata deleted, data SAFE
⚡ Rule: External for shared/raw data, Internal for Hive-only output
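
The difference shows up in the DDL and in what DROP TABLE does. A minimal sketch, assuming hypothetical table names and a hypothetical HDFS path `/data/raw/logs`:

```sql
-- Internal (managed) table: Hive owns the files under its warehouse directory.
CREATE TABLE daily_summary (region STRING, total DOUBLE)
STORED AS ORC;

-- External table: Hive only records the schema and points at an existing path.
-- The LOCATION below is a hypothetical path used for illustration.
CREATE EXTERNAL TABLE raw_logs (ts STRING, line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/logs';

-- The "Table Type" field in the output reports MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED raw_logs;

-- Removes metadata and the underlying data files.
DROP TABLE daily_summary;

-- Removes metadata only; the files under /data/raw/logs stay in HDFS.
DROP TABLE raw_logs;
```
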
ECOSYSTEM = "SQOF-ZK":
S: Sqoop (RDBMS ↔ HDFS transfer)
Q: Query (Hive = SQL on HDFS)
O: Oozie (workflow + coordinator jobs)
F: Flume (log streaming → HDFS)
ZK: ZooKeeper (distributed coordination, leader election)

SECTION 1: HIVE — DIRECT QUESTIONS

Q1. What is the Hive Metastore?

MySQL/PostgreSQL dat