Day 2: Hive + Ecosystem — Quick Recall Guide
🧠 MASTER MEMORY MAP — Day 2
SECTION 1: HIVE — DIRECT QUESTIONS
Q1: What is the Hive Metastore?
MySQL/PostgreSQL database that stores schema metadata: table names, column types, HDFS paths, partition information, file formats. Hive queries it to know WHERE data lives in HDFS. If Metastore goes down → ALL Hive queries fail.
Q2: Internal vs External tables — what's the difference?
Internal: Hive owns the data. DROP TABLE = delete metadata + delete HDFS data. External: Hive points to the data. DROP TABLE = delete metadata only, HDFS data untouched.
Q3: When should you use External tables?
Always for: raw/Bronze landing tables, shared data (Spark also reads it), Sqoop-ingested data, data you can't afford to accidentally delete.
Q4: What is a partition in Hive?
A subdirectory in HDFS organized by a column value. booking_date=2024-01-15/ is one partition. Partition pruning skips unrelated directories → reads only what's needed.
Q5: Why does WHERE YEAR(booking_date) = 2024 not use partition pruning?
WHERE YEAR(booking_date) = 2024 → NO partition pruning (function applied to the partition column hides it from the planner). Fix: WHERE booking_date BETWEEN '2024-01-01' AND '2024-12-31' — Hive can prune the date range.
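A minimal sketch of the two filters, assuming a hypothetical bookings table partitioned by booking_date:

```sql
-- NO pruning: the function wraps the partition column,
-- so Hive scans every partition directory
SELECT * FROM bookings
WHERE YEAR(booking_date) = 2024;

-- Pruning works: the filter is on the raw partition column,
-- so only the 2024 directories are read
SELECT * FROM bookings
WHERE booking_date BETWEEN '2024-01-01' AND '2024-12-31';
```

Verify with EXPLAIN: the pruned plan lists only the matching partitions.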
Q6: Static vs dynamic partitioning?
Static: you specify the partition value in the INSERT (PARTITION (date='2024-01-15')). Dynamic: Hive reads the partition value from the data itself (PARTITION (date), last column in SELECT). Dynamic can create many small files → enable merge settings.
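Both forms side by side — a sketch assuming hypothetical bookings and staging_bookings tables:

```sql
-- Static: partition value written explicitly in the INSERT
INSERT INTO bookings PARTITION (booking_date='2024-01-15')
SELECT id, amount FROM staging_bookings WHERE dt = '2024-01-15';

-- Dynamic: Hive derives the partition value from the LAST column of the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO bookings PARTITION (booking_date)
SELECT id, amount, booking_date FROM staging_bookings;
```

nonstrict mode is required when no static partition key is given at all; strict mode (the default) demands at least one.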
SECTION 2: HIVE SCENARIOS
⚡ Q7 SCENARIO: Sqoop wrote new data to HDFS but Hive shows 0 rows. Fix?
Metastore doesn't know about the new partition. Fix: MSCK REPAIR TABLE bookings; — scans HDFS, adds missing partitions to Metastore. Or faster: ALTER TABLE bookings ADD PARTITION (booking_date='2024-03-01') LOCATION '...'
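The two fixes as runnable statements (the HDFS path is illustrative, not from the source):

```sql
-- Option 1: scan HDFS and register every missing partition
MSCK REPAIR TABLE bookings;

-- Option 2 (faster when you know the one partition):
ALTER TABLE bookings ADD IF NOT EXISTS
  PARTITION (booking_date='2024-03-01')
  LOCATION '/data/bookings/booking_date=2024-03-01';  -- hypothetical path

-- Verify the Metastore now sees it
SHOW PARTITIONS bookings;
```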
⚠️ Q8 SCENARIO: Someone ran hdfs dfs -rm -r on a Hive partition. Now what?
Check Trash first: hdfs dfs -ls /user/username/.Trash/Current/ → restore if found. If no trash: restore from backup. If no backup: ALTER TABLE bookings DROP PARTITION (booking_date='...') to remove stale metadata. Prevent with snapshots.
Q9 SCENARIO: Hive query runs for hours. Top 5 things to check?
1. SET hive.execution.engine — should be tez, not mr
2. SET hive.vectorized.execution.enabled=true (ORC only)
3. Check if partition pruning works (WHERE on partition column directly)
4. Any shuffle joins that could be map joins? (small tables < 25 MB)
5. Data skew? SET hive.groupby.skewindata=true
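The checklist above, collected as a session preamble you could paste before the slow query (the 25 MB threshold is in bytes):

```sql
SET hive.execution.engine=tez;                  -- 1. Tez, not MapReduce
SET hive.vectorized.execution.enabled=true;     -- 2. vectorization (ORC only)
-- 3. pruning: filter directly on the partition column, no functions on it
SET hive.auto.convert.join=true;                -- 4. auto-convert to map join
SET hive.mapjoin.smalltable.filesize=25000000;  --    small-table cutoff ~25 MB
SET hive.groupby.skewindata=true;               -- 5. two-phase GROUP BY for skew
```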
SECTION 3: HIVE OPTIMIZATIONS — FLASH CARDS
What does hive.execution.engine=tez do? Replaces MapReduce with Apache Tez. Tez builds a DAG of operators in memory — no intermediate HDFS writes between stages. Result: 5-10x faster for multi-stage queries.
SET hive.vectorized.execution.enabled=true — processes 1024 rows at once using CPU SIMD instructions instead of row-by-row. 2-5x faster for aggregations. Only works with ORC format.
Map join: load the small table into every mapper's RAM → join locally without a shuffle. Hive auto-decides: hive.auto.convert.join=true with threshold hive.mapjoin.smalltable.filesize=25 MB. Force with the /*+ MAPJOIN(table) */ hint.
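A sketch of the forced hint, assuming a hypothetical small dim_country table joined to bookings:

```sql
-- Force dim_country into mappers' memory; the hint name must match the alias
SELECT /*+ MAPJOIN(c) */ b.id, c.country_name
FROM bookings b
JOIN dim_country c ON b.country_code = c.country_code;
```

With auto-convert enabled the hint is usually unnecessary — Hive picks the map join itself when the table is under the size threshold.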
MSCK REPAIR TABLE = MetaStore Check — scans HDFS for new partition directories and adds them to the Hive Metastore. Needed when files are added directly to HDFS without going through Hive INSERT. ⚠️ Slow on tables with 10,000+ partitions.
ORDER BY: global sort → all data → ONE reducer (slow for billions!). SORT BY: each reducer sorts its own output (local sort, faster). Use SORT BY + multiple reducers when global order is not required.
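The ORDER BY / SORT BY distinction in query form (bookings is a hypothetical table):

```sql
-- Global total order: every row funnels through ONE reducer
SELECT * FROM bookings ORDER BY amount DESC;

-- Per-reducer local order: parallel, but no global ordering guarantee
SELECT * FROM bookings SORT BY amount DESC;

-- DISTRIBUTE BY + SORT BY: all rows for a key land on one reducer, sorted there
SELECT * FROM bookings
DISTRIBUTE BY country
SORT BY country, amount DESC;
```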
ORC stores min/max statistics per column per stripe. When you do WHERE amount > 1000, Hive reads only stripes where max(amount) > 1000. Skip stripes without touching them → can skip 90%+ of data!
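A sketch of an ORC table that benefits from stripe skipping (table and compression choice are illustrative):

```sql
-- ORC keeps min/max per column per stripe, enabling predicate pushdown
CREATE TABLE bookings_orc (
  id     BIGINT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

SET hive.optimize.ppd=true;  -- predicate pushdown (on by default)
-- Stripes whose max(amount) <= 1000 are skipped without being read
SELECT * FROM bookings_orc WHERE amount > 1000;
```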
What does hive.groupby.skewindata=true do? Two-phase aggregation for skewed GROUP BY. Phase 1: randomly distribute keys → partial aggregation in parallel. Phase 2: final aggregation by real key. Prevents one reducer getting all "US" country records.
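What the setting automates, sketched by hand — a random bucket spreads the hot key in phase 1, then phase 2 re-aggregates by the real key (bookings/country are hypothetical names):

```sql
SELECT country, SUM(partial_cnt) AS cnt        -- phase 2: final aggregation
FROM (
  SELECT country, bucket, COUNT(*) AS partial_cnt  -- phase 1: partial, parallel
  FROM (
    SELECT country, CAST(RAND() * 10 AS INT) AS bucket
    FROM bookings
  ) r
  GROUP BY country, bucket
) t
GROUP BY country;
```

The inner subquery materializes the random bucket once so both grouping levels see the same value.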
SECTION 4: HBASE — FLASH CARDS
HBase: NoSQL column-family database on top of HDFS. Use for: random row-level reads/writes (GET/PUT by key), sub-millisecond access, millions of concurrent point queries. NOT for: batch analytics (use Hive for that).
Row key: the primary sort key for all HBase data. All access is BY row key. Lexicographically sorted. Design it based on your access pattern (how will you look up rows?).
Monotonically increasing timestamps → all new writes go to the LAST region (last RegionServer) → write hotspot! All 1000 writes/second hit one machine. Fix: salt prefix (hash(id) % 10) or reverse timestamp (Long.MAX_VALUE - timestamp).
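Both fixes, sketched in Hive SQL for a hypothetical events table (PMOD/HASH are standard Hive functions; key layout is illustrative):

```sql
-- Salt prefix: '0_evt123' .. '9_evt123' spreads sequential writes
-- across 10 row-key buckets instead of one hot region
SELECT CONCAT(CAST(PMOD(HASH(event_id), 10) AS STRING), '_', event_id) AS rowkey
FROM events;

-- Reverse timestamp: Long.MAX_VALUE - ts makes the NEWEST rows sort first,
-- so recent writes no longer pile onto the last region
SELECT CONCAT(event_id, '_',
              CAST(CAST(9223372036854775807 AS BIGINT) - event_ts AS STRING))
FROM events;
```

Trade-off: salting kills simple range scans by time — a scan must fan out across all salt buckets.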
HBase write path:
1. WAL (Write-Ahead Log on HDFS) for durability
2. MemStore (RAM) for fast writes
3. When MemStore fills (default 128 MB) → flush to HFile on HDFS
4. Background: compact HFiles to reduce read overhead
SECTION 5: SQOOP — FLASH CARDS
Sqoop incremental import: import only NEW rows since the last run. Mode append: only rows where check-column > last-value. Mode lastmodified: rows updated after the last-value timestamp. Much faster than a full daily import!
What does --num-mappers do in Sqoop? Number of parallel DB connections to use. Each mapper reads a range of rows. ⚠️ Too high → DB connection pool exhausted. Typical production: 4-8 mappers.
Sqoop EXPORT (HDFS → DB) is NOT atomic. If it fails halfway → partial data in DB. Solution: export to staging table → run DB transaction to swap (DELETE + INSERT in one transaction).
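The staging swap on the RDBMS side, sketched in generic ANSI SQL (table names hypothetical; Sqoop 1 can also automate this via --staging-table / --clear-staging-table):

```sql
-- Sqoop exported into bookings_staging; now swap atomically in ONE transaction
BEGIN;
DELETE FROM bookings WHERE booking_date = '2024-03-01';
INSERT INTO bookings
SELECT * FROM bookings_staging;
COMMIT;
```

If the export itself fails, only the staging table holds partial data — the live table is never half-written.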
SECTION 6: OOZIE — FLASH CARDS
Workflow: a DAG of actions (Hive→Sqoop→MapReduce), runs once when submitted. Coordinator: schedules Workflows based on TIME trigger (every 8am) or DATA trigger (when HDFS path has new data). Use Coordinator for production pipelines.
Job starts only when upstream data arrives in HDFS — not just at a scheduled time. So if upstream job is delayed 2 hours, Coordinator waits → no failed run due to missing data. Airflow equivalent: ExternalTaskSensor.
SECTION 7: ZOOKEEPER + FLUME — FLASH CARDS
Distributed coordination: NameNode HA leader election, HMaster election for HBase, RegionServer health monitoring, YARN ResourceManager HA. Uses ephemeral znodes (auto-deleted on disconnect) for leader tracking.
Quorum = majority needed for decisions (3 → need 2, 5 → need 3). Even number: possible tie → no majority → ZK stuck. Minimum 3 nodes. ⚠️ Never run 2-node ZK!
Flume channels — Memory channel: events in RAM, FAST but DATA LOSS if the agent crashes. File channel: events on local disk, DURABLE (survives restart). Production: always use the File channel or Kafka channel.
🧠 FINAL REVISION — Day 2 Summary Card
DAY 2: HIVE + ECOSYSTEM

HIVE ARCHITECTURE = "DMCTE":
  Driver → Metastore → Compiler → Tez Engine → Execution
  Metastore backend = MySQL (NOT HDFS) — it goes down → all queries fail

INTERNAL vs EXTERNAL:
  Internal: DROP TABLE = metadata + HDFS data deleted!
  External: DROP TABLE = only metadata, HDFS data SAFE
  Use External for: raw data, shared data, Sqoop imports

PARTITIONING:
  Static: PARTITION (date='2024-01-15') — manual
  Dynamic: SET hive.exec.dynamic.partition.mode=nonstrict
  MSCK REPAIR TABLE = add missing partitions to Metastore
  Partition pruning: filter DIRECTLY on partition column!

HIVE OPTIMIZATION = "VECTOR-TOP":
  1. Engine=Tez (5-10x vs MapReduce)
  2. Vectorization (1024-row batches, ORC only)
  3. CBO + ANALYZE TABLE (stats for optimal join plan)
  4. Map Join (small table < 25 MB → broadcast, no shuffle)
  5. ORC + predicate pushdown (skip stripes by min/max)
  6. skewindata=true (2-phase GROUP BY for hot keys)
  7. SORT BY not ORDER BY (unless global sort needed)

FILE FORMATS = "OPTA":
  ORC: Hive-native, ACID, predicate pushdown
  Parquet: cross-tool (Spark+Impala+Hive)
  Text: never in production
  Avro: schema evolution, Kafka/Sqoop landing

HBASE:
  Random read/write on HDFS (what Hive can't do!)
  Row key: NEVER monotonic (hotspot!), use salt/reverse timestamp
  Write path: WAL → MemStore → HFile (flush at 128 MB)

SQOOP:
  Incremental import: --incremental append/lastmodified
  --split-by: parallelize by column (default: primary key)
  Export: NOT atomic → use staging table!

OOZIE:
  Workflow: DAG of actions (run once)
  Coordinator: time + data trigger (production pipelines)

ZOOKEEPER:
  Leader election for NameNode/HMaster/YARN RM
  Always odd node count (3/5/7) for quorum
  Ephemeral znodes = auto-deleted on crash = leader detection

TOP 5 THINGS TO SAY IN INTERVIEW:
  1. "External tables for all raw data — DROP TABLE is safe"
  2. "MSCK REPAIR TABLE when Sqoop adds files outside Hive"
  3. "Tez + Vectorization + ORC = 10-50x faster than MR+Text"
  4. "HBase row key must not be monotonic — causes hotspot!"
  5. "Oozie Coordinator = time + data trigger (smart scheduling)"