🐘 Hadoop
Day 1: HDFS + YARN + MapReduce — Quick Recall Guide

Hadoop · Section 3 of 8

Legend: ⚡ Must remember · 🔑 Key concept · ⚠️ Common trap · 🧠 Memory Map · 📝 One-liner

🧠 MASTER MEMORY MAP — Day 1

🧠 HDFS KEY NUMBERS = "1-3-128-3-64"
1 — NameNode (one namespace, stores ALL metadata in RAM)
3 — Default replication factor (3 copies of every block)
128 — Default block size (128 MB per block)
3 — Heartbeat interval (DataNodes send a heartbeat every 3 seconds; a node is marked dead after ~10 minutes without one)
64 — Old Hadoop 1 block size (64 MB, now 128 MB)
NAMENODE HA = "JZ-FENCE"
J — JournalNodes (quorum log — both Active and Standby read the edit log here)
Z — ZooKeeper (elects the Active, runs ZKFC on each NameNode)
FENCE — Fencing (kill the old Active before the new one takes over — prevents split-brain!)
MAPREDUCE PHASES = "I-Map-CBS-Reduce-O" (7 steps)
I — Input Splits (decide how many mappers)
Map — Map phase (your map() function runs)
C — Combiner (optional mini-reducer, runs locally on the mapper node)
B — Buffer (sort buffer in RAM — mapreduce.task.io.sort.mb)
S — Shuffle + Sort (data crosses the network, sorted by key)
Reduce — Reduce phase (your reduce() function runs)
O — Output (written to HDFS)
YARN = "RM-NM-AM"
RM — ResourceManager (cluster boss: Scheduler + ApplicationsManager)
NM — NodeManager (per-node worker, runs containers, monitors health)
AM — ApplicationMaster (per-JOB manager, negotiates resources from RM)
OPTIMIZATION CHECKLIST = "MCJ-COMP"
M — Memory: increase sort buffer + container memory
C — Combiner: cut shuffle data by 60-80%
J — JVM Reuse: reuse the JVM across tasks (mapreduce.job.jvm.numtasks=10)
C — Compression: Snappy for shuffle, GZIP for output
O — Output: compress final output to save disk
M — More reducers: increase from 1 to 0.95 × nodes × containers_per_node
P — Partitioner: even distribution (custom for skewed data)

SECTION 1: HDFS — DIRECT QUESTIONS

⚡ MUST KNOW DIRECT QUESTIONS

Q1. What is HDFS?

Hadoop Distributed File System — splits large files into 128 MB blocks and distributes them across DataNodes. NameNode stores metadata (which blocks are where), DataNodes store actual data.

Q2. What does NameNode store?

File namespace, directory tree, file-to-block mapping, block-to-DataNode mapping. ALL IN RAM — that's why NameNode needs a lot of memory.

Q3. What does DataNode store?

Actual data blocks (128 MB each). It sends heartbeats every 3 seconds + block reports every 6 hours to NameNode. If no heartbeat for 10 minutes → DataNode marked dead.

Q4. What is replication factor?

Number of copies of each block. Default = 3. First replica on writer's node, second on a different rack, third on same rack as second (rack-aware). If a DataNode dies, NameNode triggers re-replication.
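The default rack-aware placement can be sketched in plain Python — a simplified illustration of the policy described above, not HDFS's actual BlockPlacementPolicy code (node and rack names are made up):

```python
def place_replicas(writer_node, writer_rack, other_racks):
    """Simplified HDFS default placement: replica 1 on the writer's node,
    replica 2 on a node in a different rack, replica 3 on another node
    in that same remote rack."""
    remote_rack, remote_nodes = other_racks[0]     # pick any other rack
    return [
        (writer_rack, writer_node),                # replica 1: writer's node
        (remote_rack, remote_nodes[0]),            # replica 2: remote rack
        (remote_rack, remote_nodes[1]),            # replica 3: same remote rack
    ]

placement = place_replicas(
    "dn1", "rack-A",
    other_racks=[("rack-B", ["dn4", "dn5"])],
)
racks = [rack for rack, _ in placement]
# Exactly two racks are used: one local replica, two on the remote rack,
# so losing an entire rack never loses all three copies.
```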

Q5. What is the default HDFS block size?

128 MB (Hadoop 2+). Was 64 MB in Hadoop 1. Larger blocks = fewer blocks = less NameNode memory. Each block = one mapper input (usually).

⚠️ Q6. What is the small files problem?

Millions of tiny files (1 KB, 10 KB) each create a separate block entry in NameNode. 1 million files × 150 bytes = 150 MB of NameNode RAM just for metadata! NameNode runs OOM. Also: one mapper per file = overhead with no actual work.
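The RAM arithmetic above checks out directly (~150 bytes per namespace object is the commonly cited estimate; exact numbers vary by Hadoop version):

```python
BYTES_PER_OBJECT = 150  # rough NameNode metadata cost per file/block object

def namenode_metadata_mb(num_files, objects_per_file=1):
    """Approximate NameNode RAM (MB) consumed by metadata alone."""
    return num_files * objects_per_file * BYTES_PER_OBJECT / 1_000_000

small = namenode_metadata_mb(1_000_000)   # 1 million tiny files
# → 150.0 MB of NameNode heap spent on metadata before storing any data
```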

Q7. Solutions to the small files problem?
    1. HAR files (archive, read-only)
    2. SequenceFile (bundle many → one file)
    3. CombineFileInputFormat (one mapper for many small files)
    4. Hive: hive.merge.smallfiles.avgsize
    5. Spark: coalesce() before writing
    6. Fix the upstream pipeline to write fewer, larger files.


Q8. Explain the HDFS Write Path step by step.
    1. Client → NameNode: "I want to create /data/file.txt"
    2. NN checks permissions, creates metadata, returns a list of 3 DataNodes
    3. Client opens a pipeline to DN1 → DN1 connects to DN2 → DN2 connects to DN3
    4. Client sends blocks (64 KB packets) down the pipeline
    5. DN3 sends ACK → DN2 → DN1 → Client (ACK chain confirms receipt)
    6. After all blocks are written, client closes the stream and notifies the NameNode
    7. NameNode marks the file as complete.
Q9. Explain the HDFS Read Path.
    1. Client → NameNode: "I want to read /data/file.txt"
    2. NN returns a list of block locations (DataNodes, ordered by proximity)
    3. Client reads DIRECTLY from DataNodes (bypasses NameNode)
    4. Priority: local node > same rack > different rack (data locality!)
    5. If a DataNode fails mid-read, the client tries the next DataNode for that block.
⚠️ Q10. What happens when a DataNode dies?

NameNode stops receiving heartbeats. After the timeout (~10 min) it marks the DataNode dead, identifies all blocks that were on that node (now below target replication), and queues re-replication from the surviving replicas. HDFS recovers automatically, without manual intervention.

SECTION 1: HDFS — SCENARIO QUESTIONS

Q11 SCENARIO: A DataNode is dead and your replication factor is 3. What happens?

NameNode detects missing heartbeat → marks DataNode dead → checks all blocks that were on it → finds blocks now have replication factor 2 → schedules re-replication from surviving DataNodes → target replication (3) is restored. Process is automatic. Time depends on block size and network.

Q12 SCENARIO: Someone accidentally deleted a critical file from HDFS. How do you recover?

    1. Check HDFS Trash: hdfs dfs -ls /user/<username>/.Trash/Current/
    2. If in trash: hdfs dfs -mv /user/<username>/.Trash/Current/data/file.txt /data/file.txt
    3. If the trash was emptied: restore from a snapshot (if snapshots were enabled): hdfs dfs -cp /data/.snapshot/s1/file.txt /data/file.txt
    4. If no snapshot: restore from backup (HDFS to S3/Azure, Hive metastore backup). Lesson: always enable snapshots on critical directories!

Q13 SCENARIO: NameNode is in safe mode and cluster is not accepting writes. Fix it?

NameNode enters safe mode on startup or when blocks fall below minimum replication. Check with hdfs dfsadmin -safemode get. Wait for DataNodes to report blocks — safe mode usually exits automatically. If stuck: hdfs dfsadmin -safemode leave (only after verifying blocks are actually replicated, not just because you're impatient!).

SECTION 2: NameNode HA — QUESTIONS

Q14. What is the NameNode Single Point of Failure (SPOF)?

Hadoop 1 had ONE NameNode — if it crashed, entire HDFS went down. No writes, no reads. Hadoop 2 introduced Active/Standby HA to eliminate this.

Q15. How does NameNode HA work?

Active NameNode writes edit log to JournalNodes (quorum = must write to N/2+1 JNs). Standby NameNode reads same JournalNode logs, stays in sync. ZooKeeper detects Active failure (via ZKFC - ZooKeeper Failover Controller). ZKFC triggers fencing → kills old Active → promotes Standby to Active.

⚠️ Q16. What is split-brain in NameNode HA? How does fencing prevent it?

Split-brain: both Active and Standby think they are Active → two NameNodes write metadata simultaneously → data corruption. Fencing: before promoting Standby, ZKFC sends SSH command to old Active to kill itself (or calls a fencing script). New Active only starts AFTER old Active is confirmed dead.

Q17. What are JournalNodes?

Separate lightweight daemons (typically 3 or 5, odd number for quorum). Active NameNode writes every edit log entry to a MAJORITY (quorum) of JournalNodes. Standby reads from JournalNodes to stay in sync. If a JournalNode is down, writes still proceed (quorum, not all-or-nothing). Typical setup: 3 JournalNodes → can tolerate 1 failure.
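The quorum arithmetic is easy to verify (a sketch of the majority rule, not ZooKeeper or JournalNode code):

```python
def quorum(n):
    """Majority needed for an edit-log write to N JournalNodes: N//2 + 1."""
    return n // 2 + 1

def tolerated_failures(n):
    """JournalNodes that can be down while writes still reach a majority."""
    return n - quorum(n)

# 3 JournalNodes → quorum of 2, tolerates 1 failure
# 5 JournalNodes → quorum of 3, tolerates 2 failures
# This is why JournalNode counts are odd: 4 nodes tolerate no more than 3 do.
```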

SECTION 3: YARN — QUESTIONS

Q18. What is YARN?

Yet Another Resource Negotiator — Hadoop's cluster resource manager. Decouples resource management from processing framework. Before YARN (Hadoop 1): only MapReduce ran on the cluster. With YARN: Spark, Flink, MapReduce, Tez all share same cluster resources.

Q19. YARN components — explain each.

ResourceManager (RM): Cluster boss. Scheduler allocates containers; ApplicationsManager tracks running apps.
NodeManager (NM): Per-node daemon. Launches containers, monitors CPU/memory per container, kills containers that exceed limits.
ApplicationMaster (AM): Per-JOB daemon. Requests containers from RM, coordinates job execution, handles task failures within a job.

Q20. What is a Container in YARN?

A unit of resource allocation (CPU vcores + memory) on a specific NodeManager. A MapReduce mapper = one container. A Spark executor = one container. RM assigns containers, NM enforces resource limits.

Q21. YARN Scheduler types — when to use each?
Pro Tip
FIFO: First in, first out. Never use in multi-user production.
Capacity Scheduler: Multiple queues, guaranteed % capacity per team. Use for enterprise multi-tenant clusters.
Fair Scheduler: All jobs get equal resources over time. Use when running many small-to-medium jobs from multiple users.
⚠️ Q22. What happens when a NodeManager dies?

RM detects missing heartbeat → marks NM dead → all containers on that NM are lost → ApplicationMaster is notified → AM re-requests containers from RM for failed tasks → job continues but with delay. YARN handles NM failures transparently.

SECTION 4: MAPREDUCE — QUESTIONS

Q23. Explain the MapReduce phases in order.
    1. Input Splits (decide mapper count) →
    2. Map (process each split) →
    3. Combiner (optional local aggregation) →
    4. Partitioner (decides which reducer gets which key) →
    5. Shuffle + Sort (move + sort data across the network) →
    6. Reduce (aggregate, output) →
    7. Output (write to HDFS).
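The phases above can be mimicked with a toy in-memory word count — plain Python standing in for the framework, no Hadoop API involved:

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit (word, 1) for every word in the input split."""
    return [(word, 1) for word in split.split()]

def partition(key, num_reducers):
    """Default-HashPartitioner idea: same key always lands on the same reducer."""
    return hash(key) % num_reducers

def shuffle_sort(mapped, num_reducers):
    """Shuffle: group values by key into per-reducer buckets."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets

def reduce_phase(buckets):
    """Reduce: each reducer sees its keys in sorted order and sums the values."""
    out = {}
    for bucket in buckets:
        for key in sorted(bucket):
            out[key] = sum(bucket[key])
    return out

splits = ["big data big", "data big"]                 # two input splits → 2 mappers
mapped = [kv for s in splits for kv in map_phase(s)]  # Map
result = reduce_phase(shuffle_sort(mapped, 2))        # Shuffle+Sort → Reduce
# result == {"big": 3, "data": 2}
```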
Q24. What is the Shuffle phase? Why is it the most expensive?
Pro Tip
Shuffle = moving map outputs to the correct reducer across the network. Sorted by key. Data goes: mapper → local disk → network → reducer's disk → reducer. It's expensive because: hits local disk twice + crosses network + sorts data. Performance tip: minimize shuffle data with combiners and compression.
Q25. What is a Combiner? Rules for using it?
A mini-reducer that runs on the mapper's local node BEFORE the shuffle. Reduces network traffic by pre-aggregating locally. RULES:
  1. Must be commutative AND associative (order doesn't matter, grouping doesn't matter).
  2. Works for: SUM, COUNT, MAX, MIN.
  3. DOES NOT work for: AVERAGE (average of averages ≠ total average).
⚠️ Q26. Why can't you use a Combiner for calculating Average?

Average is NOT associative when group sizes differ: avg(avg(1), avg(2,3,4)) = avg(1, 3) = 2, but the correct avg(1,2,3,4) = 2.5. Solution: have the combiner emit (sum, count) pairs → the reducer computes total_sum / total_count.
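The (sum, count) fix is easy to demonstrate in plain Python (not MapReduce API code; the groups stand in for per-mapper partial results):

```python
def naive_avg_of_avgs(groups):
    """WRONG with unequal group sizes: averaging the per-group averages."""
    avgs = [sum(g) / len(g) for g in groups]
    return sum(avgs) / len(avgs)

def combiner_style_avg(groups):
    """RIGHT: each 'combiner' emits (sum, count); the reducer divides totals."""
    partials = [(sum(g), len(g)) for g in groups]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

groups = [[1], [2, 3, 4]]               # unequal group sizes
naive = naive_avg_of_avgs(groups)       # avg(1, 3) = 2.0 — wrong
correct = combiner_style_avg(groups)    # 10 / 4 = 2.5 — matches avg(1,2,3,4)
```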

Q27. What is a Partitioner?

Decides which reducer receives which key. Default: HashPartitioner (key.hashCode() % numReducers). Custom partitioner used when you want: specific keys to go to specific reducers, or when default causes data skew.
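A minimal Python analogue of hash partitioning (the Java default additionally masks the sign bit of hashCode(); this sketch only shows the routing idea):

```python
def hash_partition(key, num_reducers):
    """Default-style partitioner: hash(key) % num_reducers.
    The same key maps to the same reducer on every call."""
    return hash(key) % num_reducers

NUM_REDUCERS = 4
# Every record with the same key lands on the same reducer...
targets = {hash_partition("US", NUM_REDUCERS) for _ in range(100)}
# ...which is exactly why one hot key ("US") can overload a single reducer.
```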

Q28. What is Speculative Execution?

When a task runs much slower than peers (straggler), Hadoop launches a DUPLICATE of that task on another node. Whichever finishes first wins, the other is killed. Enabled by default. ⚠️ DISABLE if tasks have side effects (writing to external DB, calling APIs) — otherwise same operation runs TWICE!

SECTION 5: OPTIMIZATIONS — QUESTIONS

Q29. What is mapreduce.task.io.sort.mb and why does it matter?

The in-memory sort buffer for map output. Default: 100 MB (often too small). When the buffer is 80% full, Hadoop spills to disk. Many spills = many small disk files + merge overhead = slow shuffle. Increase it (e.g. to 512 MB) to reduce spills. Rule of thumb: the buffer must fit comfortably within the mapper's container memory.
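A back-of-the-envelope spill count under the 80% threshold mentioned above (illustrative arithmetic only; real spill behavior also depends on record-metadata overhead in the buffer):

```python
import math

def estimated_spills(map_output_mb, sort_buffer_mb, spill_threshold=0.80):
    """Each spill flushes roughly (threshold * buffer) MB of map output to disk."""
    return math.ceil(map_output_mb / (sort_buffer_mb * spill_threshold))

# A mapper producing 1 GB of output:
default_buf = estimated_spills(1024, 100)   # 100 MB buffer → 13 spill files to merge
bigger_buf = estimated_spills(1024, 512)    # 512 MB buffer → only 3 spill files
```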

Q30. How do you compress map output and why?

Set mapreduce.map.output.compress=true and mapreduce.map.output.compress.codec=SnappyCodec. Snappy is fast compress/decompress — compresses shuffle data 50-70% → massive network bandwidth savings. Cost: small CPU overhead (almost always worth it).
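In mapred-site.xml those two properties would look roughly like this (property names as given above; the codec class shown is the standard org.apache.hadoop.io.compress.SnappyCodec):

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```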

⚠️ Q31. GZIP vs Snappy vs LZO — when to use each?

Snappy: Fast, medium compression, NOT splittable → use for shuffle intermediate data.
GZIP: Medium speed, high compression, NOT splittable → use for final archive output.
LZO: Fast, medium compression, SPLITTABLE (with an index) → use for output that will be INPUT to another MapReduce job.
⚠️ A non-splittable codec on a large output file = ONE mapper handles the entire file!
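Why splittability matters for the next job — mapper-count arithmetic (an illustration of the one-mapper-per-block rule, with a made-up 10 GB file):

```python
import math

def num_mappers(file_size_mb, block_size_mb=128, splittable=True):
    """Splittable input: roughly one mapper per block.
    Non-splittable input: one mapper reads the whole file."""
    return math.ceil(file_size_mb / block_size_mb) if splittable else 1

# A 10 GB output file read as input by the next job:
lzo_indexed = num_mappers(10_240, splittable=True)   # → 80 parallel mappers
gzip_file = num_mappers(10_240, splittable=False)    # → 1 mapper grinds through 10 GB
```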

Q32. What is JVM Reuse and when should you use it?

mapreduce.job.jvm.numtasks=10 — same JVM handles 10 tasks before restart. Saves JVM startup time (100-500 ms per task). Use when: thousands of small tasks (small files). ⚠️ Can cause memory leaks if tasks have bugs → only use with stable code.

Q33. How do you fix a slow MapReduce job where 99 reducers are done but 1 has been running for hours?
Classic DATA SKEW: one key has a disproportionate share of the data. Solutions:
  1. Salting (add a random suffix to the hot key to spread it across reducers)
  2. Custom partitioner
  3. Add a Combiner to pre-reduce hot-key data
  4. Increase the reducer count so the hot key's impact is diluted.
Typical root cause: HashPartitioner sent all "US" records to one reducer.
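Salting the hot key, sketched in plain Python (the suffix count and the "US" key are made-up examples; real jobs would salt inside the mapper and re-aggregate in a second pass):

```python
import random

NUM_SALTS = 4  # spread the hot key over 4 distinct sub-keys

def salt(key):
    """Map side: 'US' becomes one of 'US#0' .. 'US#3' at random,
    so the records hash to up to 4 different reducers."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

def unsalt(salted_key):
    """Second aggregation pass: strip the suffix to recombine partial results."""
    return salted_key.rsplit("#", 1)[0]

salted = [salt("US") for _ in range(1000)]
recombined = {unsalt(k) for k in salted}
# 1000 hot-key records now spread over up to 4 sub-keys,
# yet all recombine under the single original key 'US'.
```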
Q34. What is the optimal number of reducers formula?

0.95 × (num_nodes × containers_per_node). The 0.95 factor ensures slightly fewer reducers than total capacity → room for ApplicationMaster and other overhead. ⚠️ Default is 1 reducer — terrible for large jobs!
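Plugging numbers into the formula (the cluster size here is a made-up example):

```python
def optimal_reducers(num_nodes, containers_per_node, factor=0.95):
    """0.95 x total container slots, leaving headroom for the AM and overhead."""
    return int(factor * num_nodes * containers_per_node)

# e.g. 20 nodes x 8 containers each:
reducers = optimal_reducers(20, 8)   # → 152 reducers instead of the default 1
```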

SECTION 6: HDFS COMMANDS — QUICK FIRE

Q35. How to check HDFS file system health?

hdfs fsck / -files -blocks -locations — shows corrupt blocks, missing blocks, under-replicated blocks. -files: list each file. -blocks: show block details. -locations: show which DataNodes have each block.

Q36. How to check cluster usage?

hdfs dfsadmin -report — shows each DataNode: capacity, used, remaining. Shows total cluster storage.

Q37. How to check safe mode status?

hdfs dfsadmin -safemode get → outputs Safe mode is OFF or Safe mode is ON.

Q38. How to run the HDFS balancer?

hdfs balancer -threshold 10 — rebalances blocks so no DataNode is more than 10% above/below average utilization. Run after adding new DataNodes!

🧠 FINAL REVISION — Day 1 Summary Card

📐 Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│               DAY 1: HDFS + YARN + MAPREDUCE                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  HDFS KEY FACTS:                                                 │
│  NameNode: metadata IN RAM (file→block→DataNode mapping)         │
│  DataNode: actual 128 MB blocks, heartbeat every 3 seconds        │
│  Replication: 3 copies, rack-aware (cross-rack for 2nd copy)     │
│  Write path: client→NN→pipeline(DN1→DN2→DN3)→ACK chain          │
│  Read path: NN gives locations → client reads directly from DNs  │
│                                                                  │
│  NAMENODE HA:                                                    │
│  Active + Standby (both read JournalNodes)                       │
│  ZooKeeper + ZKFC = automatic failover                           │
│  FENCING = kill old Active before promoting Standby (split-brain!)│
│  3 JournalNodes (quorum) → can tolerate 1 JN failure            │
│                                                                  │
│  YARN:                                                           │
│  RM (cluster boss) + NM (per-node) + AM (per-job)               │
│  Schedulers: FIFO (dev) / Capacity (multi-tenant) / Fair (mixed) │
│  Container = CPU + Memory allocation on a NodeManager            │
│                                                                  │
│  MAPREDUCE FLOW = "I-Map-CBS-Reduce-O":                         │
│  Input Splits → Map → Combiner → Buffer+Sort → Shuffle → Reduce  │
│  Shuffle: most expensive (disk + network)                        │
│  Combiner: cut network 60-80% (only for commutative+associative) │
│  Speculative: duplicate slow tasks (disable if side effects!)    │
│                                                                  │
│  OPTIMIZATIONS = "MCJ-COMP":                                     │
│  M-Memory: sort buffer 100 MB → 512 MB (biggest win!)            │
│  C-Combiner: pre-aggregate locally (SUM/COUNT/MAX/MIN)           │
│  J-JVM Reuse: numtasks=10 for small file jobs                    │
│  C-Compress shuffle: Snappy codec → 50-70% network reduction     │
│  O-Output compress: GZIP for archives, LZO for pipeline output   │
│  M-More reducers: 1→N (formula: nodes×containers×0.95)          │
│  P-Partitioner: custom for skewed data (salting)                 │
│                                                                  │
│  SMALL FILES PROBLEM:                                            │
│  Each file = 1 metadata entry in NameNode RAM                    │
│  Fix: HAR / SequenceFile / CombineInputFormat / Hive merge       │
│                                                                  │
│  HADOOP VERSIONS:                                                │
│  v1: Single NameNode (SPOF), only MapReduce                      │
│  v2: YARN, NameNode HA, HDFS Federation                          │
│  v3: Erasure Coding (50% storage savings for cold data)          │
│                                                                  │
│  TOP 5 THINGS TO SAY IN INTERVIEW:                               │
│  1. "NameNode HA uses JournalNodes + ZooKeeper + FENCING"        │
│  2. "Shuffle is the bottleneck — minimize with combiner+compress" │
│  3. "Sort buffer default 100 MB is too small — increase to 512 MB" │
│  4. "Speculative execution: disable if tasks have side effects"  │
│  5. "Erasure coding in Hadoop 3: 50% storage savings for cold"   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘