Day 1: HDFS + YARN + MapReduce — Quick Recall Guide
🧠 MASTER MEMORY MAP — Day 1
🧠 HDFS KEY NUMBERS = "1-3-128-3-64"
1 — NameNode (one namespace, stores ALL metadata in RAM)
3 — Default replication factor (3 copies of every block)
128 — Default block size (128 MB per block)
3 — Heartbeat interval in seconds (DataNodes heartbeat every 3 s; a node is marked dead only after ~10.5 minutes of missed heartbeats)
64 — Hadoop 1.x default block size (64 MB; raised to 128 MB in Hadoop 2+)
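These defaults are visible from client code. Below is a minimal Java sketch (the path /data/input.txt is hypothetical) that reads the cluster-wide dfs.blocksize and dfs.replication settings, then the per-file values via FileStatus:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Cluster defaults: 128 MB blocks, 3-way replication
        System.out.println("dfs.blocksize   = " + conf.getLong("dfs.blocksize", 0));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 0));

        // Per-file values can differ; both are fixed at write time
        FileStatus st = fs.getFileStatus(new Path("/data/input.txt"));
        System.out.println(st.getBlockSize() + " bytes / " + st.getReplication() + " replicas");
    }
}
```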
NAMENODE HA = "JZ-FENCE"
J — JournalNodes (quorum edit log — the Active writes edits here, the Standby reads and replays them)
Z — ZooKeeper (elects the Active; a ZKFC process runs alongside each NameNode)
FENCE — Fencing (kill the old Active before new one takes over — prevents split-brain!)
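Here is a hedged sketch of how J, Z, and FENCE show up in configuration. These properties normally live in hdfs-site.xml; the nameservice ID and hostnames below are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;

public class HaConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        // J: quorum of JournalNodes holding the shared edit log
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
        // Z: ZooKeeper quorum used by the ZKFCs for automatic failover
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
        // FENCE: isolate the old Active before the Standby takes over
        conf.set("dfs.ha.fencing.methods", "sshfence");
        return conf;
    }
}
```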
MAPREDUCE PHASES = "I-Map-CBS-Reduce-O" (8 steps — Shuffle and Sort count separately)
I — Input Splits (determine how many mappers run)
Map — Map phase (your map() function runs)
C — Combiner (optional mini-reducer, runs locally on the mapper node)
B — Buffer (in-RAM sort buffer — mapreduce.task.io.sort.mb)
S — Shuffle + Sort (data crosses the network, sorted by key)
Reduce — Reduce phase (your reduce() function runs)
O — Output (written to HDFS)
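To tie the letters to code, here is a standard WordCount-style sketch: the Mapper is the Map step, its ctx.write() output lands in the in-RAM sort buffer (B), and the same summing Reducer can double as the Combiner (C) because addition is associative:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: called once per input record from a split (I)
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);   // goes into the sort buffer (B), then shuffle (S)
        }
    }
}

// Reduce phase: also usable as the Combiner (C), since summing is associative
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));   // final output (O), written to HDFS
    }
}
```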
YARN = "RM-NM-AM"
RM — ResourceManager (cluster boss: Scheduler + ApplicationsManager)
NM — NodeManager (per-node worker, runs containers, monitors health)
AM — ApplicationMaster (per-JOB manager, negotiates resources from RM)
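The RM side of this is directly scriptable. A minimal sketch using the YarnClient API to ask the ResourceManager for its list of applications (each of which is driven by its own ApplicationMaster):

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());   // picks up yarn-site.xml (RM address)
        yarn.start();

        // Each report describes one application and its AM's state
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId()
                    + " state=" + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```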
OPTIMIZATION CHECKLIST = "MCJ-COMP"
M — Memory: increase the sort buffer and container memory
C — Combiner: can cut shuffle data by 60-80%
J — JVM Reuse: reuse the JVM across tasks (mapreduce.job.jvm.numtasks=10)
C — Compression: Snappy for shuffle data, GZIP for final output
O — Output: compress final output to save disk space
M — More reducers: raise from the default 1 to ~0.95 × (nodes × containers per node)
P — Partitioner: even key distribution (write a custom partitioner for skewed data)
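A driver sketch applying the checklist, reusing the WordCountMapper/SumReducer classes from the MapReduce example above (assumed to be in the same package; the paths and the reducer count of 8 are placeholders, in practice ~0.95 × total container slots):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TunedWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);           // M: bigger sort buffer
        conf.setBoolean("mapreduce.map.output.compress", true);  // C: compress the shuffle
        conf.set("mapreduce.map.output.compress.codec",          // (Snappy needs native libs)
                 SnappyCodec.class.getName());

        Job job = Job.getInstance(conf, "tuned wordcount");
        job.setJarByClass(TunedWordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(SumReducer.class);                  // C: combiner cuts shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(8);                                // M: more reducers than 1

        FileInputFormat.addInputPath(job, new Path("/data/in"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        FileOutputFormat.setCompressOutput(job, true);           // O: compress final output
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```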
SECTION 1: HDFS — DIRECT QUESTIONS
⚡ MUST-KNOW DIRECT QUESTIONS
**📝 Q1: What