🐘 Hadoop
Day 1: HDFS + YARN + MapReduce — Quick Recall Guide

Hadoop · Section 3 of 8

Legend: ⚡ Must remember · 🔑 Key concept · ⚠️ Common trap · 🧠 Memory Map · 📝 One-liner

🧠 MASTER MEMORY MAP — Day 1

🧠 HDFS KEY NUMBERS = "1-3-128-3-64"
1 — NameNode (one namespace, stores ALL metadata in RAM)
3 — Default replication factor (3 copies of every block)
128 — Default block size (128 MB per block)
3 — Heartbeat interval (DataNodes send a heartbeat every 3 seconds; a node is marked dead after ~10 minutes without one)
64 — Old Hadoop 1 block size (64 MB, now 128 MB)
NAMENODE HA = "JZ-FENCE"
J — JournalNodes (quorum log — both Active and Standby read the edit log here)
Z — ZooKeeper (elects the Active, runs ZKFC on each NameNode)
FENCE — Fencing (kill the old Active before the new one takes over — prevents split-brain!)
MAPREDUCE PHASES = "I-Map-CBS-Reduce-O" (7 steps)
I — Input Splits (decide how many mappers)
Map — Map phase (your map() function runs)
C — Combiner (optional mini-reducer, runs locally on the mapper node)
B — Buffer (sort buffer in RAM — mapreduce.task.io.sort.mb)
S — Shuffle + Sort (data crosses the network, sorted by key)
Reduce — Reduce phase (your reduce() function runs)
O — Output (written to HDFS)
YARN = "RM-NM-AM"
RM — ResourceManager (cluster boss: Scheduler + ApplicationsManager)
NM — NodeManager (per-node worker, runs containers, monitors health)
AM — ApplicationMaster (per-JOB manager, negotiates resources from RM)
OPTIMIZATION CHECKLIST = "MCJ-COMP"
M — Memory: increase sort buffer + container memory
C — Combiner: cut shuffle data by 60-80%
J — JVM Reuse: reuse the JVM across tasks (mapreduce.job.jvm.numtasks=10)
C — Compression: Snappy for shuffle, GZIP for output
O — Output: compress final output to save disk
M — More reducers: increase from 1 to 0.95 × nodes × containers_per_node
P — Partitioner: even distribution (custom for skewed data)

SECTION 1: HDFS — DIRECT QUESTIONS

⚡ MUST KNOW DIRECT QUESTIONS

Q1. What is HDFS?

Hadoop Distributed File System — splits large files into 128 MB blocks and distributes them across DataNodes. NameNode stores metadata (which blocks are where), DataNodes store actual data.

Q2. What does NameNode store?

File namespace, directory tree, file-to-block mapping, block-to-DataNode mapping. ALL IN RAM — that's why NameNode needs a lot of memory.

Q3. What does DataNode store?

Actual data blocks (128 MB each). It sends heartbeats every 3 seconds + block reports every 6 hours to NameNode. If no heartbeat for 10 minutes → DataNode marked dead.

Q4. What is replication factor?

Number of copies of each block. Default = 3. First replica on writer's node, second on a different rack, third on same rack as second (rack-aware). If a DataNode dies, NameNode triggers re-replication.
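The default rack-aware placement can be sketched in plain Python — a simplified illustration of the policy described above, not HDFS's actual BlockPlacementPolicy code (node and rack names are made up):

```python
def place_replicas(writer_node, writer_rack, other_racks):
    """Simplified HDFS default placement: replica 1 on the writer's node,
    replica 2 on a node in a different rack, replica 3 on another node
    in that same remote rack."""
    remote_rack, remote_nodes = other_racks[0]     # pick any other rack
    return [
        (writer_rack, writer_node),                # replica 1: writer's node
        (remote_rack, remote_nodes[0]),            # replica 2: remote rack
        (remote_rack, remote_nodes[1]),            # replica 3: same remote rack
    ]

placement = place_replicas(
    "dn1", "rack-A",
    other_racks=[("rack-B", ["dn4", "dn5"])],
)
racks = [rack for rack, _ in placement]
# Exactly two racks are used: one local replica, two on the remote rack,
# so losing an entire rack never loses all three copies.
```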

Q5. What is the default HDFS block size?

128 MB (Hadoop 2+). Was 64 MB in Hadoop 1. Larger blocks = fewer blocks = less NameNode memory. Each block = one mapper input (usually).

⚠️ Q6. What is the small files problem?

Millions of tiny files (1 KB, 10 KB) each create a separate block entry in NameNode. 1 million files × 150 bytes = 150 MB of NameNode RAM just for metadata! NameNode runs OOM. Also: one mapper per file = overhead with no actual work.
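The RAM arithmetic above checks out directly (~150 bytes per namespace object is the commonly cited estimate; exact numbers vary by Hadoop version):

```python
BYTES_PER_OBJECT = 150  # rough NameNode metadata cost per file/block object

def namenode_metadata_mb(num_files, objects_per_file=1):
    """Approximate NameNode RAM (MB) consumed by metadata alone."""
    return num_files * objects_per_file * BYTES_PER_OBJECT / 1_000_000

small = namenode_metadata_mb(1_000_000)   # 1 million tiny files
# → 150.0 MB of NameNode heap spent on metadata before storing any data
```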

Q7. Solutions to the small files problem?
    1. HAR files (archive, read-only)
    2. SequenceFile (bundle many → one file)
    3. CombineFileInputFormat (one mapper for many small files)
    4. Hive: hive.merge.smallfiles.avgsize
    5. Spark: coalesce() before writing
    6. Fix the upstream pipeline to write fewer, larger files.


Q8. Explain the HDFS Write Path step by step.
    1. Client → NameNode: "I want to create /data/file.txt"
    2. NN checks permissions, creates metadata, returns a list of 3 DataNodes
    3. Client opens a pipeline to DN1 → DN1 connects to DN2 → DN2 connects to DN3
    4. Client sends blocks (64 KB packets) down the pipeline
    5. DN3 sends ACK → DN2 → DN1 → Client (ACK chain confirms receipt)
    6. After all blocks are written, client closes the stream and notifies the NameNode
    7. NameNode marks the file as complete.
Q9. Explain the HDFS Read Path.
    1. Client → NameNode: "I want to read /data/file.txt"
    2. NN returns a list of block locations (DataNodes, ordered by proximity)
    3. Client reads DIRECTLY from DataNodes (bypasses NameNode)
    4. Priority: local node > same rack > different rack (data locality!)
    5. If a DataNode fails mid-read, the client tries the next DataNode for that block.
⚠️ Q10. What happens when a DataNode dies?

NameNode stops receiving heartbeats. After the timeout (~10 min) it marks the DataNode dead, identifies all blocks that were on that node (now below target replication), and queues re-replication from the surviving replicas. HDFS recovers automatically, without manual intervention.

SECTION 1: HDFS — SCENARIO QUESTIONS

Q11 SCENARIO: A DataNode is dead and your replication factor is 3. What happens?

NameNode detects missing heartbeat → marks DataNode dead → checks all blocks that were on it → finds blocks now have replication factor 2 → schedules re-replication from surviving DataNodes → target replication (3) is restored. Process is automatic. Time depends on block size and network.

Q12 SCENARIO: Someone accidentally deleted a critical file from HDFS. How do you recover?

    1. Check HDFS Trash: hdfs dfs -ls /user/<username>/.Trash/Current/
    2. If in trash: hdfs dfs -mv /user/<username>/.Trash/Current/data/file.txt /data/file.txt
    3. If the trash was emptied: restore from a snapshot (if snapshots were enabled): hdfs dfs -cp /data/.snapshot/s1/file.txt /data/file.txt
    4. If no snapshot: restore from backup (HDFS to S3/Azure, Hive metastore backup). Lesson: always enable snapshots on critical directories!

Q13 SCENARIO: NameNode is in safe mode and cluster is not accepting writes. Fix it?

NameNode enters safe mode on startup or when blocks fall below minimum replication. Check with hdfs dfsadmin -safemode get. Wait for DataNodes to report blocks — safe mode usually exits automatically. If stuck: hdfs dfsadmin -safemode leave (only after verifying blocks are actually replicated, not just because you're impatient!).

SECTION 2: NameNode HA — QUESTIONS

Q14. What is the NameNode Single Point of Failure (SPOF)?

Hadoop 1 had ONE NameNode — if it crashed, entire HDFS went down. No writes, no reads. Hadoop 2 introduced Active/Standby HA to eliminate this.

Q15. How does NameNode HA work?

Active NameNode writes edit log to JournalNodes (quorum = must write to N/2+1 JNs). Standby NameNode reads same JournalNode logs, stays in sync. ZooKeeper detects Active failure (via ZKFC - ZooKeeper Failover Controller). ZKFC triggers fencing → kills old Active → promotes Standby to Active.

⚠️ Q16. What is split-brain in NameNode HA? How does fencing prevent it?

Split-brain: both Active and Standby think they are Active → two NameNodes write metadata simultaneously → data corruption. Fencing: before promoting Standby, ZKFC sends SSH command to old Active to kill itself (or calls a fencing script). New Active only starts AFTER old Active is confirmed dead.

Q17. What are JournalNodes?

Separate lightweight daemons (typically 3 or 5, odd number for quorum). Active NameNode writes every edit log entry to a MAJORITY (quorum) of JournalNodes. Standby reads from JournalNodes to stay in sync. If a JournalNode is down, writes still proceed (quorum, not all-or-nothing). Typical setup: 3 JournalNodes → can tolerate 1 failure.
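The quorum arithmetic is easy to verify (a sketch of the majority rule, not ZooKeeper or JournalNode code):

```python
def quorum(n):
    """Majority needed for an edit-log write to N JournalNodes: N//2 + 1."""
    return n // 2 + 1

def tolerated_failures(n):
    """JournalNodes that can be down while writes still reach a majority."""
    return n - quorum(n)

# 3 JournalNodes → quorum of 2, tolerates 1 failure
# 5 JournalNodes → quorum of 3, tolerates 2 failures
# This is why JournalNode counts are odd: 4 nodes tolerate no more than 3 do.
```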

SECTION 3: YARN — QUESTIONS

Q18. What is YARN?

Yet Another Resource Negotiator — Hadoop's cluster resource manager. Decouples resource management from processing framework. Before YARN (Hadoop 1): only MapReduce ran on the cluster. With YARN: Spark, Flink, MapReduce, Tez all share same cluster resources.

Q19. YARN components — explain each.

ResourceManager (RM): Cluster boss. Scheduler allocates containers; ApplicationsManager tracks running apps.
NodeManager (NM): Per-node daemon. Launches containers, monitors CPU/memory per container, kills containers that exceed limits.
ApplicationMaster (AM): Per-JOB daemon. Requests containers from RM, coordinates job execution, handles task failures within a job.

Q20. What is a Container in YARN?

A unit of resource allocation (CPU vcores + memory) on a specific NodeManager. A MapReduce mapper = one container. A Spark executor = one container. RM assigns containers, NM enforces resource limits.

Q21. YARN Scheduler types — when to use each?
Pro Tip
FIFO: First in, first out. Never use in multi-user production.
Capacity Scheduler: Multiple queues, guaranteed % capacity per team. Use for enterprise multi-tenant clusters.
Fair Scheduler: All jobs get equal resources over time. Use when running many small-to-medium jobs from multiple users.
⚠️ Q22. What happens when a NodeManager dies?

RM detects missing heartbeat → marks NM dead → all containers on that NM are lost → ApplicationMaster is notified → AM re-requests containers from RM for failed tasks → job continues but with delay. YARN handles NM failures transparently.

SECTION 4: MAPREDUCE — QUESTIONS

Q23. Explain the MapReduce phases in order.
    1. Input Splits (decide mapper count) →
    2. Map (process each split) →
    3. Combiner (optional local aggregation) →
    4. Partitioner (decides which reducer gets which key) →
    5. Shuffle + Sort (move + sort data across the network) →
    6. Reduce (aggregate, output) →
    7. Output (write to HDFS).
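The phases above can be mimicked with a toy in-memory word count — plain Python standing in for the framework, no Hadoop API involved:

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit (word, 1) for every word in the input split."""
    return [(word, 1) for word in split.split()]

def partition(key, num_reducers):
    """Default-HashPartitioner idea: same key always lands on the same reducer."""
    return hash(key) % num_reducers

def shuffle_sort(mapped, num_reducers):
    """Shuffle: group values by key into per-reducer buckets."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets

def reduce_phase(buckets):
    """Reduce: each reducer sees its keys in sorted order and sums the values."""
    out = {}
    for bucket in buckets:
        for key in sorted(bucket):
            out[key] = sum(bucket[key])
    return out

splits = ["big data big", "data big"]                 # two input splits → 2 mappers
mapped = [kv for s in splits for kv in map_phase(s)]  # Map
result = reduce_phase(shuffle_sort(mapped, 2))        # Shuffle+Sort → Reduce
# result == {"big": 3, "data": 2}
```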
Q24. What is the Shuffle phase? Why is it the most expensive?
Pro Tip
Shuffle = moving map outputs to the correct reducer across the network. Sorted by key. Data goes: mapper → local disk → network → reducer's disk → reducer. It's expensive because: hits local disk twice + crosses network + sorts data. Performance tip: minimize shuffle data with combiners and compression.
Q25. What is a Combiner? Rules for using it?
A mini-reducer that runs on the mapper's local node BEFORE the shuffle. Reduces network traffic by pre-aggregating locally. RULES:
  1. Must be commutative AND associative (order doesn't matter, grouping doesn't matter).
  2. Works for: SUM, COUNT, MAX, MIN.
  3. DOES NOT work for: AVERAGE (average of averages ≠ total average).
⚠️ Q26. Why can't you use a Combiner for calculating Average?

Average is NOT associative when group sizes differ: avg(avg(1), avg(2,3,4)) = avg(1, 3) = 2, but the correct avg(1,2,3,4) = 2.5. Solution: have the combiner emit (sum, count) pairs → the reducer computes total_sum / total_count.
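The (sum, count) fix is easy to demonstrate in plain Python (not MapReduce API code; the groups stand in for per-mapper partial results):

```python
def naive_avg_of_avgs(groups):
    """WRONG with unequal group sizes: averaging the per-group averages."""
    avgs = [sum(g) / len(g) for g in groups]
    return sum(avgs) / len(avgs)

def combiner_style_avg(groups):
    """RIGHT: each 'combiner' emits (sum, count); the reducer divides totals."""
    partials = [(sum(g), len(g)) for g in groups]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

groups = [[1], [2, 3, 4]]               # unequal group sizes
naive = naive_avg_of_avgs(groups)       # avg(1, 3) = 2.0 — wrong
correct = combiner_style_avg(groups)    # 10 / 4 = 2.5 — matches avg(1,2,3,4)
```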

Q27. What is a Partitioner?

Decides which reducer receives which key. Default: HashPartitioner (key.hashCode() % numReducers). Custom partitioner used when you want: specific keys to go to specific reducers, or when default causes data skew.
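A minimal Python analogue of hash partitioning (the Java default additionally masks the sign bit of hashCode(); this sketch only shows the routing idea):

```python
def hash_partition(key, num_reducers):
    """Default-style partitioner: hash(key) % num_reducers.
    The same key maps to the same reducer on every call."""
    return hash(key) % num_reducers

NUM_REDUCERS = 4
# Every record with the same key lands on the same reducer...
targets = {hash_partition("US", NUM_REDUCERS) for _ in range(100)}
# ...which is exactly why one hot key ("US") can overload a single reducer.
```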

Q28. What is Speculative Execution?

When a task runs much slower than peers (straggler), Hadoop launches a DUPLICATE of that task on another node. Whichever finishes first wins, the other is killed. Enabled by default. ⚠️ DISABLE if tasks have side effects (writing to external DB, calling APIs) — otherwise same operation runs TWICE!

SECTION 5: OPTIMIZATIONS — QUESTIONS

Q29. What is mapreduce.task.io.sort.mb and why does it matter?

The in-memory sort buffer for map output. Default: 100 MB (often too small). When the buffer is 80% full, Hadoop spills to disk. Many spills = many small disk files + merge overhead = slow shuffle. Increase it (e.g. to 512 MB) to reduce spills. Rule of thumb: the buffer must fit comfortably within the mapper's container memory.
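A back-of-the-envelope spill count under the 80% threshold mentioned above (illustrative arithmetic only; real spill behavior also depends on record-metadata overhead in the buffer):

```python
import math

def estimated_spills(map_output_mb, sort_buffer_mb, spill_threshold=0.80):
    """Each spill flushes roughly (threshold * buffer) MB of map output to disk."""
    return math.ceil(map_output_mb / (sort_buffer_mb * spill_threshold))

# A mapper producing 1 GB of output:
default_buf = estimated_spills(1024, 100)   # 100 MB buffer → 13 spill files to merge
bigger_buf = estimated_spills(1024, 512)    # 512 MB buffer → only 3 spill files
```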

Q30. How do you compress map output and why?

Set mapreduce.map.output.compress=true and mapreduce.map.output.compress.codec=SnappyCodec. Snappy is fast compress/decompress — compresses shuffle data 50-70% → massive network bandwidth savings. Cost: small CPU overhead (almost always worth it).
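In mapred-site.xml those two properties would look roughly like this (property names as given above; the codec class shown is the standard org.apache.hadoop.io.compress.SnappyCodec):

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```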

⚠️ Q31. GZIP vs Snappy vs LZO — when to use each?

Snappy: Fast, medium compression, NOT splittable → use for shuffle intermediate data.
GZIP: Medium speed, high compression, NOT splittable → use for final archive output.
LZO: Fast, medium compression, SPLITTABLE (with an index) → use for output that will be INPUT to another MapReduce job.
⚠️ A non-splittable codec on a large output file = ONE mapper handles the entire file!
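Why splittability matters for the next job — mapper-count arithmetic (an illustration of the one-mapper-per-block rule, with a made-up 10 GB file):

```python
import math

def num_mappers(file_size_mb, block_size_mb=128, splittable=True):
    """Splittable input: roughly one mapper per block.
    Non-splittable input: one mapper reads the whole file."""
    return math.ceil(file_size_mb / block_size_mb) if splittable else 1

# A 10 GB output file read as input by the next job:
lzo_indexed = num_mappers(10_240, splittable=True)   # → 80 parallel mappers
gzip_file = num_mappers(10_240, splittable=False)    # → 1 mapper grinds through 10 GB
```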

Q32. What is JVM Reuse and when should you use it?

mapreduce.job.jvm.numtasks=10 — same JVM handles 10 tasks before restart. Saves JVM startup time (100-500 ms per task). Use when: thousands of small tasks (small files). ⚠️ Can cause memory leaks if tasks have bugs → only use with stable code.

Q33. How do you fix a slow MapReduce job where 99 reducers are done but 1 has been running for hours?
Classic DATA SKEW: one key has a disproportionate share of the data. Solutions:
  1. Salting (add a random suffix to the hot key to spread it across reducers)
  2. Custom partitioner
  3. Add a Combiner to pre-reduce hot-key data
  4. Increase the reducer count so the hot key's impact is diluted.
Typical root cause: HashPartitioner sent all "US" records to one reducer.
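Salting the hot key, sketched in plain Python (the suffix count and the "US" key are made-up examples; real jobs would salt inside the mapper and re-aggregate in a second pass):

```python
import random

NUM_SALTS = 4  # spread the hot key over 4 distinct sub-keys

def salt(key):
    """Map side: 'US' becomes one of 'US#0' .. 'US#3' at random,
    so the records hash to up to 4 different reducers."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

def unsalt(salted_key):
    """Second aggregation pass: strip the suffix to recombine partial results."""
    return salted_key.rsplit("#", 1)[0]

salted = [salt("US") for _ in range(1000)]
recombined = {unsalt(k) for k in salted}
# 1000 hot-key records now spread over up to 4 sub-keys,
# yet all recombine under the single original key 'US'.
```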
Q34. What is the optimal number of reducers formula?

0.95 × (num_nodes × containers_per_node). The 0.95 factor ensures slightly fewer reducers than total capacity → room for ApplicationMaster and other overhead. ⚠️ Default is 1 reducer — terrible for large jobs!
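Plugging numbers into the formula (the cluster size here is a made-up example):

```python
def optimal_reducers(num_nodes, containers_per_node, factor=0.95):
    """0.95 x total container slots, leaving headroom for the AM and overhead."""
    return int(factor * num_nodes * containers_per_node)

# e.g. 20 nodes x 8 containers each:
reducers = optimal_reducers(20, 8)   # → 152 reducers instead of the default 1
```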

SECTION 6: HDFS COMMANDS — QUICK FIRE

Q35. How to check HDFS file system health?

hdfs fsck / -files -blocks -locations — shows corrupt blocks, missing blocks, under-replicated blocks. -files: list each file. -blocks: show block details. -locations: show which DataNodes have each block.

Q36. How to check cluster usage?

hdfs dfsadmin -report — shows each DataNode: capacity, used, remaining. Shows total cluster storage.

Q37. How to check safe mode status?

hdfs dfsadmin -safemode get → outputs Safe mode is OFF or Safe mode is ON.

Q38. How to run the HDFS balancer?

hdfs balancer -threshold 10 — rebalances blocks so no DataNode is more than 10% above/below average utilization. Run after adding new DataNodes!

🧠 FINAL REVISION — Day 1 Summary Card

📐 Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│               DAY 1: HDFS + YARN + MAPREDUCE                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  HDFS KEY FACTS:                                                 │
│  NameNode: metadata IN RAM (file→block→DataNode mapping)         │
│  DataNode: actual 128 MB blocks, heartbeat every 3 seconds        │
│  Replication: 3 copies, rack-aware (cross-rack for 2nd copy)     │
│  Write path: client→NN→pipeline(DN1→DN2→DN3)→ACK chain          │
│  Read path: NN gives locations → client reads directly from DNs  │
│                                                                  │
│  NAMENODE HA:                                                    │
│  Active + Standby (both read JournalNodes)                       │
│  ZooKeeper + ZKFC = automatic failover                           │
│  FENCING = kill old Active before promoting Standby (split-brain!)│
│  3 JournalNodes (quorum) → can tolerate 1 JN failure            │
│                                                                  │
│  YARN:                                                           │
│  RM (cluster boss) + NM (per-node) + AM (per-job)               │
│  Schedulers: FIFO (dev) / Capacity (multi-tenant) / Fair (mixed) │
│  Container = CPU + Memory allocation on a NodeManager            │
│                                                                  │
│  MAPREDUCE FLOW = "I-Map-CBS-Reduce-O":                         │
│  Input Splits → Map → Combiner → Buffer+Sort → Shuffle → Reduce  │
│  Shuffle: most expensive (disk + network)                        │
│  Combiner: cut network 60-80% (only for commutative+associative) │
│  Speculative: duplicate slow tasks (disable if side effects!)    │
│                                                                  │
│  OPTIMIZATIONS = "MCJ-COMP":                                     │
│  M-Memory: sort buffer 100 MB → 512 MB (biggest win!)            │
│  C-Combiner: pre-aggregate locally (SUM/COUNT/MAX/MIN)           │
│  J-JVM Reuse: numtasks=10 for small file jobs                    │
│  C-Compress shuffle: Snappy codec → 50-70% network reduction     │
│  O-Output compress: GZIP for archives, LZO for pipeline output   │
│  M-More reducers: 1→N (formula: nodes×containers×0.95)          │
│  P-Partitioner: custom for skewed data (salting)                 │
│                                                                  │
│  SMALL FILES PROBLEM:                                            │
│  Each file = 1 metadata entry in NameNode RAM                    │
│  Fix: HAR / SequenceFile / CombineInputFormat / Hive merge       │
│                                                                  │
│  HADOOP VERSIONS:                                                │
│  v1: Single NameNode (SPOF), only MapReduce                      │
│  v2: YARN, NameNode HA, HDFS Federation                          │
│  v3: Erasure Coding (50% storage savings for cold data)          │
│                                                                  │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                               │
│  1. "NameNode HA uses JournalNodes + ZooKeeper + FENCING"        │
│  2. "Shuffle is the bottleneck — minimize with combiner+compress" │
│  3. "Sort buffer default 100 MB is too small — increase to 512 MB" │
│  4. "Speculative execution: disable if tasks have side effects"  │
│  5. "Erasure coding in Hadoop 3: 50% storage savings for cold"   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘