Day 2: Hive + Ecosystem — Quick Recall Guide
🧠 MASTER MEMORY MAP — Day 2
🧠 HIVE ARCHITECTURE = "DMCTE"
D — Driver (receives SQL, manages lifecycle)
M — Metastore (schema + HDFS location → MySQL backend!)
C — Compiler (SQL → query plan)
T — Tez/MR engine (executes plan on YARN)
E — Execution (submits to YARN, reads from HDFS)
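The Driver → Compiler → Tez flow can be observed directly with `EXPLAIN`, which prints the compiled plan without running it (illustrative HiveQL; the `sales` table is hypothetical):

```sql
-- Ask the Compiler for the query plan without executing it.
-- With hive.execution.engine=tez, the plan is shown as a Tez DAG of stages.
EXPLAIN
SELECT region, SUM(amount)
FROM sales          -- hypothetical table
GROUP BY region;
```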
🧠 HIVE OPTIMIZATION = "VECTOR-TOP"
V — Vectorization (processes 1024 rows at a time: hive.vectorized.execution.enabled=true)
E — Engine = Tez (hive.execution.engine=tez, not mr)
C — CBO stats (ANALYZE TABLE + hive.cbo.enable=true)
T — Tez DAG (in-memory multi-stage, no intermediate HDFS writes)
O — ORC format (columnar + predicate pushdown + ACID)
R — Repartition/skew fix (hive.groupby.skewindata=true)
T — Table map join (hive.auto.convert.join=true, small table → broadcast)
O — ORDER BY → SORT BY (per-reducer local sort instead of a global sort)
P — Partition pruning (filter on the partition column, not on derived expressions)
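The settings above can be enabled per session with `SET` (a sketch; defaults vary by Hive version, and the `sales` table is hypothetical):

```sql
-- Session-level switches for the "VECTOR-TOP" optimizations:
SET hive.execution.engine=tez;                -- E: Tez instead of MR
SET hive.vectorized.execution.enabled=true;   -- V: 1024-row batches
SET hive.cbo.enable=true;                     -- C: cost-based optimizer
SET hive.auto.convert.join=true;              -- T: map-side (broadcast) join
SET hive.groupby.skewindata=true;             -- R: two-stage skewed GROUP BY

-- The CBO needs statistics to work from:
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;  -- hypothetical table
```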
🧠 HIVE FILE FORMATS = "OPTA"
O — ORC (Hive-native, ACID, predicate pushdown — BEST FOR HIVE)
P — Parquet (cross-tool: Spark + Impala + Hive — BEST FOR MULTI-TOOL)
T — Text (never in production: no compression, always a full scan)
A — Avro (schema evolution, Kafka/Sqoop landing — ROW-based)
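The file format is picked per table with `STORED AS` (sketch; table names are hypothetical):

```sql
-- Hive-native columnar format: ACID support + predicate pushdown.
CREATE TABLE events_orc (id BIGINT, payload STRING)
STORED AS ORC;

-- Best choice when Spark/Impala also need to read the same data.
CREATE TABLE events_parquet (id BIGINT, payload STRING)
STORED AS PARQUET;
```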
INTERNAL vs EXTERNAL:
Internal = Hive OWNS the data → DROP TABLE = data DELETED from HDFS
External = Hive POINTS to the data → DROP TABLE = only metadata deleted, data SAFE
⚡ Rule: External for shared/raw data, Internal for Hive-only output
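The distinction is one keyword plus a `LOCATION` clause (sketch; table names and the HDFS path are hypothetical):

```sql
-- Managed (internal) table: files live under Hive's warehouse directory.
-- DROP TABLE deletes both metadata AND the data files.
CREATE TABLE hive_output (id INT, v STRING);

-- External table: Hive only records the location.
-- DROP TABLE removes metadata; the files under /data/raw/logs survive.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/raw/logs';   -- hypothetical HDFS path
```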
🧠 ECOSYSTEM = "SQOF-ZK"
S — Sqoop (RDBMS ↔ HDFS transfer)
Q — Query layer: Hive (SQL on HDFS)
O — Oozie (workflow + coordinator jobs)
F — Flume (log streaming → HDFS)
ZK — ZooKeeper (distributed coordination, leader election)
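A typical Sqoop RDBMS → HDFS pull looks like this (a sketch; the connection string, credentials, table, and target path are all hypothetical):

```shell
# Import one RDBMS table into HDFS, split across 4 parallel mappers.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/landing/orders \
  --num-mappers 4
```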
SECTION 1: HIVE — DIRECT QUESTIONS
⚡ Q1 — What is the Hive Metastore?
MySQL/PostgreSQL dat