Day 1: HDFS + YARN + MapReduce — Quick Recall Guide
🧠 MASTER MEMORY MAP — Day 1
SECTION 1: HDFS — DIRECT QUESTIONS
⚡ MUST KNOW DIRECT QUESTIONS
Hadoop Distributed File System — splits large files into 128 MB blocks and distributes them across DataNodes. NameNode stores metadata (which blocks are where), DataNodes store actual data.
File namespace, directory tree, file-to-block mapping, block-to-DataNode mapping. ALL IN RAM — that's why NameNode needs a lot of memory.
Actual data blocks (128 MB each). It sends heartbeats every 3 seconds + block reports every 6 hours to NameNode. If no heartbeat for 10 minutes → DataNode marked dead.
Number of copies of each block. Default = 3. First replica on writer's node, second on a different rack, third on same rack as second (rack-aware). If a DataNode dies, NameNode triggers re-replication.
128 MB (Hadoop 2+). Was 64 MB in Hadoop 1. Larger blocks = fewer blocks = less NameNode memory. Each block = one mapper input (usually).
Millions of tiny files (1 KB, 10 KB) each create a separate block entry in NameNode. 1 million files × 150 bytes = 150 MB of NameNode RAM just for metadata! NameNode runs OOM. Also: one mapper per file = overhead with no actual work.
1. HAR files (archive, read-only)
2. SequenceFile (bundle many small files → one file)
3. CombineFileInputFormat (one mapper reads many small files)
4. Hive: hive.merge.smallfiles.avgsize
5. Spark: coalesce() before writing
6. Fix the upstream pipeline to write fewer, larger files.
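To make the cost concrete, a back-of-the-envelope sketch using the ~150 bytes-per-object figure quoted above (the helper name and the one-object-per-file-plus-one-per-block accounting are illustrative simplifications; real per-object costs vary by NameNode version):

```python
# Rough NameNode RAM cost: ~150 bytes per metadata object,
# counting one object per file plus one object per block.
BYTES_PER_OBJECT = 150

def namenode_metadata_bytes(num_files, blocks_per_file=1):
    # total metadata objects = files + blocks
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# The same ~10 GB stored as 1 million tiny files vs. 80 files of 128 MB:
tiny = namenode_metadata_bytes(1_000_000)  # 300,000,000 bytes (~300 MB of RAM)
big = namenode_metadata_bytes(80)          # 24,000 bytes (~24 KB of RAM)
```

Same data, four orders of magnitude less metadata — which is why every fix above amounts to "fewer, bigger files".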
SECTION 1: HDFS — WRITE AND READ PATHS
1. Client → NameNode: "I want to create /data/file.txt"
2. NN checks permissions, creates metadata, returns a list of 3 DataNodes
3. Client opens a pipeline to DN1 → DN1 connects to DN2 → DN2 connects to DN3
4. Client sends blocks (in 64 KB packets) down the pipeline
5. DN3 sends ACK → DN2 → DN1 → Client (the ACK chain confirms receipt)
6. After all blocks are written, the client closes the stream and notifies the NameNode
7. NameNode marks the file as complete.
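The pipeline and ACK chain can be sketched as a toy simulation (the DataNode class and receive method below are illustrative, not Hadoop APIs; real ACKs flow back asynchronously per packet):

```python
# Toy model of the HDFS write pipeline: the client streams packets to DN1
# only; each DataNode stores the packet and forwards it downstream, and the
# ACK propagates back up the chain as each nested call returns.
class DataNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next DataNode in the pipeline
        self.stored = []

    def receive(self, packet):
        self.stored.append(packet)           # persist the replica locally
        if self.downstream:
            self.downstream.receive(packet)  # forward down the pipeline
        return f"ACK:{packet}"               # ACK: DN3 -> DN2 -> DN1 -> client

# The client builds the pipeline DN1 -> DN2 -> DN3 from the NameNode's list
dn3 = DataNode("DN3")
dn2 = DataNode("DN2", downstream=dn3)
dn1 = DataNode("DN1", downstream=dn2)

for packet in ["pkt-0", "pkt-1"]:  # 64 KB packets in the real protocol
    ack = dn1.receive(packet)

# All three replicas now hold every packet
assert dn1.stored == dn2.stored == dn3.stored == ["pkt-0", "pkt-1"]
```

Note the key point the sketch preserves: the client only talks to DN1; replication to DN2 and DN3 happens inside the pipeline, not from the client.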
1. Client → NameNode: "I want to read /data/file.txt"
2. NN returns the list of block locations (DataNodes, ordered by proximity)
3. Client reads DIRECTLY from the DataNodes (bypassing the NameNode)
4. Priority: local node > same rack > different rack (data locality!)
5. If a DataNode fails mid-read, the client tries the next DataNode for that block.
NameNode stops receiving heartbeats. After the timeout (~10 min) it marks the DataNode dead, identifies all blocks that were on that node and now sit below target replication, and queues re-replication from the surviving replicas. HDFS recovers automatically, with no manual intervention.
SECTION 1: HDFS — SCENARIO QUESTIONS
⚡ Q11 SCENARIO: A DataNode is dead and your replication factor is 3. What happens?
NameNode detects missing heartbeat → marks DataNode dead → checks all blocks that were on it → finds blocks now have replication factor 2 → schedules re-replication from surviving DataNodes → target replication (3) is restored. Process is automatic. Time depends on block size and network.
Q12 SCENARIO: Someone accidentally deleted a critical file from HDFS. How do you recover?
1. Check HDFS Trash:
hdfs dfs -ls /user/<username>/.Trash/Current/
2. If the file is in the trash, move it back:
hdfs dfs -mv /user/<username>/.Trash/Current/data/file.txt /data/file.txt
3. If the trash was already emptied, restore from a snapshot (if snapshots were enabled):
hdfs dfs -cp /data/.snapshot/s1/file.txt /data/file.txt
4. If there is no snapshot, restore from backup (HDFS to S3/Azure, Hive metastore backup). Lesson: always enable snapshots on critical directories!
Q13 SCENARIO: NameNode is in safe mode and cluster is not accepting writes. Fix it?
NameNode enters safe mode on startup or when blocks fall below minimum replication. Check with hdfs dfsadmin -safemode get. Wait for DataNodes to report blocks — safe mode usually exits automatically. If stuck: hdfs dfsadmin -safemode leave (only after verifying blocks are actually replicated, not just because you're impatient!).
SECTION 2: NameNode HA — QUESTIONS
Hadoop 1 had ONE NameNode — if it crashed, entire HDFS went down. No writes, no reads. Hadoop 2 introduced Active/Standby HA to eliminate this.
Active NameNode writes edit log to JournalNodes (quorum = must write to N/2+1 JNs). Standby NameNode reads same JournalNode logs, stays in sync. ZooKeeper detects Active failure (via ZKFC - ZooKeeper Failover Controller). ZKFC triggers fencing → kills old Active → promotes Standby to Active.
Split-brain: both Active and Standby think they are Active → two NameNodes write metadata simultaneously → data corruption. Fencing: before promoting Standby, ZKFC sends SSH command to old Active to kill itself (or calls a fencing script). New Active only starts AFTER old Active is confirmed dead.
Separate lightweight daemons (typically 3 or 5, odd number for quorum). Active NameNode writes every edit log entry to a MAJORITY (quorum) of JournalNodes. Standby reads from JournalNodes to stay in sync. If a JournalNode is down, writes still proceed (quorum, not all-or-nothing). Typical setup: 3 JournalNodes → can tolerate 1 failure.
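The quorum arithmetic is simple enough to state as code (a sketch; the function names are mine):

```python
# JournalNode quorum: an edit is committed once a majority (N//2 + 1) of the
# JNs have it, so the cluster tolerates N - quorum simultaneous JN failures.
def quorum(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - quorum(n)

assert quorum(3) == 2 and tolerated_failures(3) == 1  # typical setup
assert quorum(5) == 3 and tolerated_failures(5) == 2  # larger clusters
```

This also shows why odd counts are used: 4 JournalNodes need a quorum of 3 and still only tolerate 1 failure, so the fourth node buys nothing.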
SECTION 3: YARN — QUESTIONS
Yet Another Resource Negotiator — Hadoop's cluster resource manager. Decouples resource management from processing framework. Before YARN (Hadoop 1): only MapReduce ran on the cluster. With YARN: Spark, Flink, MapReduce, Tez all share same cluster resources.
ResourceManager (RM): Cluster boss. Scheduler allocates containers, ApplicationsManager tracks running apps. NodeManager (NM): Per-node daemon. Launches containers, monitors CPU/memory per container, kills containers that exceed limits. ApplicationMaster (AM): Per-JOB daemon. Requests containers from RM, coordinates job execution, handles task failures within a job.
A unit of resource allocation (CPU vcores + memory) on a specific NodeManager. A MapReduce mapper = one container. A Spark executor = one container. RM assigns containers, NM enforces resource limits.
RM detects missing heartbeat → marks NM dead → all containers on that NM are lost → ApplicationMaster is notified → AM re-requests containers from RM for failed tasks → job continues but with delay. YARN handles NM failures transparently.
SECTION 4: MAPREDUCE — QUESTIONS
1. Input Splits (decide mapper count) →
2. Map (process each split) →
3. Combiner (optional local aggregation) →
4. Partitioner (decides which reducer gets which key) →
5. Shuffle+Sort (move + sort data across the network) →
6. Reduce (aggregate, output) →
7. Output (write to HDFS).
Mini-reducer that runs on the mapper's local node BEFORE shuffle. Reduces network traffic by pre-aggregating locally. RULES:
1. Must be commutative AND associative (order doesn't matter, grouping doesn't matter).
2. Works for: SUM, COUNT, MAX, MIN.
3. DOES NOT work for: AVERAGE (average of averages ≠ total average).
Average is NOT associative when group sizes differ: avg(avg(1), avg(2,3,4)) = avg(1, 3) = 2, but the correct avg(1,2,3,4) = 2.5. (With equal group sizes the answers can coincide — avg(avg(1,2), avg(3,4)) = 2.5 — but a combiner cannot guarantee equal groups.) Solution: emit (sum, count) as the value → the reducer calculates sum/count.
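A minimal Python sketch of the (sum, count) trick (function names are illustrative; in real MapReduce the combiner and reducer would be Java classes):

```python
# Combiner-safe averaging: partials are (sum, count) pairs, which ARE
# commutative and associative, unlike partial averages.
def combine(partials):
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))

def reduce_avg(partials):
    s, c = combine(partials)  # the reducer merges partials the same way
    return s / c              # ...and only divides at the very end

# Mappers emit (value, 1) per record; note the unequal group sizes:
mapper1 = [(1, 1)]                  # records [1]
mapper2 = [(2, 1), (3, 1), (4, 1)]  # records [2, 3, 4]

partial1 = combine(mapper1)  # (1, 1)
partial2 = combine(mapper2)  # (9, 3)
assert reduce_avg([partial1, partial2]) == 2.5  # correct; avg-of-avgs gives 2
```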
Decides which reducer receives which key. Default: HashPartitioner (key.hashCode() % numReducers). Custom partitioner used when you want: specific keys to go to specific reducers, or when default causes data skew.
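To see the default behavior concretely, here is a Python re-implementation of Java's String.hashCode plus the HashPartitioner formula (a sketch; Python's built-in hash() differs from Java's, hence the manual version):

```python
# Java String.hashCode: h = 31*h + char, with 32-bit signed overflow.
def java_string_hash(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # emulate 32-bit overflow
    return h - (1 << 32) if h >= (1 << 31) else h

# HashPartitioner: (key.hashCode() & Integer.MAX_VALUE) % numReducers
def partition(key, num_reducers):
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

# Deterministic: every record with the same key goes to the same reducer.
# That guarantees correctness, but it is also the root cause of data skew.
assert partition("US", 4) == partition("US", 4)
```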
When a task runs much slower than peers (straggler), Hadoop launches a DUPLICATE of that task on another node. Whichever finishes first wins, the other is killed. Enabled by default. ⚠️ DISABLE if tasks have side effects (writing to external DB, calling APIs) — otherwise same operation runs TWICE!
SECTION 5: OPTIMIZATIONS — QUESTIONS
What is mapreduce.task.io.sort.mb and why does it matter? It is the in-memory sort buffer for map output. Default: 100 MB (too small!). When the buffer is 80% full, Hadoop spills to disk. Many spills = many small disk files + merge overhead = slow shuffle. Increase it to 512 MB to reduce spills. Rule: the buffer must fit within the mapper container's memory (usually under ~80% of the container).
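The spill arithmetic can be sketched as follows (assuming the default spill threshold of 0.80, mapreduce.map.sort.spill.percent; the helper name is mine):

```python
import math

# The buffer spills to disk each time it fills to the spill threshold,
# so spill count ~= map output size / usable buffer size.
def spill_count(map_output_mb, sort_buffer_mb, spill_percent=0.80):
    return math.ceil(map_output_mb / (sort_buffer_mb * spill_percent))

# 1 GB of map output:
assert spill_count(1024, 100) == 13  # default 100 MB buffer: 13 spill files
assert spill_count(1024, 512) == 3   # 512 MB buffer: only 3 files to merge
```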
Set mapreduce.map.output.compress=true and mapreduce.map.output.compress.codec=SnappyCodec. Snappy is fast compress/decompress — compresses shuffle data 50-70% → massive network bandwidth savings. Cost: small CPU overhead (almost always worth it).
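In mapred-site.xml (or per job via -D) this looks like the fragment below; the codec class is the standard Hadoop Snappy codec:

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```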
Snappy: Fast, medium compression, NOT splittable → use for shuffle intermediate data. GZIP: Medium speed, high compression, NOT splittable → use for final archive output. LZO: Fast, medium compression, SPLITTABLE (with index) → use for output that will be INPUT to another MapReduce job. ⚠️ Non-splittable codec on large output file = ONE mapper handles entire file!
mapreduce.job.jvm.numtasks=10 — same JVM handles 10 tasks before restart. Saves JVM startup time (100-500 ms per task). Use when: thousands of small tasks (small files). ⚠️ Can cause memory leaks if tasks have bugs → only use with stable code.
Classic DATA SKEW: one key has disproportionate data. Solutions:
1. Salting (add a random suffix to the hot key to spread it across reducers)
2. Custom partitioner
3. Add a Combiner to pre-reduce the hot key's data
4. Increase the reducer count so the hot key's impact is diluted.
Root cause: HashPartitioner sent all "US" records to one reducer.
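A Python sketch of salting, the first fix above (the key format and NUM_SALTS are arbitrary choices for illustration):

```python
import random

NUM_SALTS = 4  # fan the hot key out over up to 4 reducer partitions

def salt_key(key):
    # first pass: append a random suffix so the hot key's records spread out
    return f"{key}#{random.randrange(NUM_SALTS)}"

def unsalt(salted_key):
    # second pass: strip the suffix and merge the partial aggregates
    return salted_key.rsplit("#", 1)[0]

random.seed(0)                 # deterministic for the example
records = [("US", 1)] * 1000   # skewed input: every record is "US"
salted = [(salt_key(k), v) for k, v in records]

assert len({k for k, _ in salted}) > 1           # hot key now spread out
assert {unsalt(k) for k, _ in salted} == {"US"}  # merge pass recovers the key
```

The cost is a second aggregation pass over the salted partials, which is almost always cheaper than one reducer grinding through the entire hot key alone.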
0.95 × (num_nodes × containers_per_node). The 0.95 factor ensures slightly fewer reducers than total capacity → room for ApplicationMaster and other overhead. ⚠️ Default is 1 reducer — terrible for large jobs!
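As arithmetic (the function name is mine):

```python
# Reducer count heuristic: 0.95 x (nodes x containers per node),
# leaving headroom for the ApplicationMaster and other overhead.
def recommended_reducers(num_nodes, containers_per_node):
    return int(0.95 * num_nodes * containers_per_node)

assert recommended_reducers(10, 8) == 76  # 80 total slots -> 76 reducers
```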
SECTION 6: HDFS COMMANDS — QUICK FIRE
hdfs fsck / -files -blocks -locations — shows corrupt blocks, missing blocks, under-replicated blocks. -files: list each file. -blocks: show block details. -locations: show which DataNodes have each block.
hdfs dfsadmin -report — shows each DataNode: capacity, used, remaining. Shows total cluster storage.
hdfs dfsadmin -safemode get → outputs Safe mode is OFF or Safe mode is ON.
hdfs balancer -threshold 10 — rebalances blocks so no DataNode is more than 10% above/below average utilization. Run after adding new DataNodes!
🧠 FINAL REVISION — Day 1 Summary Card
DAY 1: HDFS + YARN + MAPREDUCE

HDFS KEY FACTS:
- NameNode: metadata IN RAM (file→block→DataNode mapping)
- DataNode: actual 128 MB blocks, heartbeat every 3 seconds
- Replication: 3 copies, rack-aware (cross-rack for 2nd copy)
- Write path: client → NN → pipeline (DN1→DN2→DN3) → ACK chain
- Read path: NN gives locations → client reads directly from DNs

NAMENODE HA:
- Active + Standby (both read JournalNodes)
- ZooKeeper + ZKFC = automatic failover
- FENCING = kill old Active before promoting Standby (split-brain!)
- 3 JournalNodes (quorum) → can tolerate 1 JN failure

YARN:
- RM (cluster boss) + NM (per-node) + AM (per-job)
- Schedulers: FIFO (dev) / Capacity (multi-tenant) / Fair (mixed)
- Container = CPU + Memory allocation on a NodeManager

MAPREDUCE FLOW = "I-Map-CBS-Reduce-O":
- Input Splits → Map → Combiner → Buffer+Sort → Shuffle → Reduce
- Shuffle: most expensive (disk + network)
- Combiner: cut network 60-80% (only for commutative+associative)
- Speculative: duplicate slow tasks (disable if side effects!)

OPTIMIZATIONS = "MCJ-COMP":
- M-Memory: sort buffer 100 MB→512 MB (biggest win!)
- C-Combiner: pre-aggregate locally (SUM/COUNT/MAX/MIN)
- J-JVM Reuse: numtasks=10 for small file jobs
- C-Compress shuffle: Snappy codec → 50-70% network reduction
- O-Output compress: GZIP for archives, LZO for pipeline output
- M-More reducers: 1→N (formula: nodes×containers×0.95)
- P-Partitioner: custom for skewed data (salting)

SMALL FILES PROBLEM:
- Each file = 1 metadata entry in NameNode RAM
- Fix: HAR / SequenceFile / CombineFileInputFormat / Hive merge

HADOOP VERSIONS:
- v1: Single NameNode (SPOF), only MapReduce
- v2: YARN, NameNode HA, HDFS Federation
- v3: Erasure Coding (50% storage savings for cold data)

TOP 5 THINGS TO SAY IN INTERVIEW:
1. "NameNode HA uses JournalNodes + ZooKeeper + FENCING"
2. "Shuffle is the bottleneck — minimize with combiner + compression"
3. "Sort buffer default 100 MB is too small — increase to 512 MB"
4. "Speculative execution: disable if tasks have side effects"
5. "Erasure coding in Hadoop 3: 50% storage savings for cold data"