Hadoop — Confusions, Labs, Gotchas & Mock Interview
💡 Interview Tip
The video-free pack. Read this end-to-end and you can walk into any Hadoop/Hive interview without opening YouTube.
🧠 Memory Map: BLOCK-YARN-HIVE
Hadoop interviews boil down to 3 pillars. Remember BYH:
| Letter | Pillar | What it controls |
|---|---|---|
| B | Block storage (HDFS) | How data is SPLIT and REPLICATED across nodes |
| Y | YARN (resources) | How CPU/RAM are SCHEDULED for jobs |
| H | Hive (SQL layer) | How you QUERY data sitting on HDFS |
Master these 3 and you can answer most Hadoop interview questions.
SECTION 1 — TOP 8 CONFUSIONS CLEARED
Confusion #1 — HDFS Block vs OS Block vs Split
All three sound similar but are different layers:
| Concept | Size | Controlled by | Purpose |
|---|---|---|---|
| OS block | 4 KB (typical) | Linux/filesystem | Physical disk I/O unit |
| HDFS block | 128 MB (default) | HDFS config | Storage + replication unit |
| Input split | ≈ HDFS block size (by default) | InputFormat | Unit of work per mapper |
Why the HDFS block is huge: disk seeks are expensive relative to transfer time. Bigger blocks = fewer block entries for the NameNode to hold in memory + longer sequential reads per seek.
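To put rough numbers on the metadata argument, here is a minimal Python sketch that counts how many blocks the NameNode would have to track for one file at two block sizes. The 1 TB file size and replication factor of 3 are illustrative assumptions, and `block_count` is a hypothetical helper, not a Hadoop API.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB

def block_count(file_size_bytes, block_size_bytes):
    """Blocks needed to store one file; the last block may be partially filled."""
    # Ceiling division on integers avoids floating-point rounding.
    return -(-file_size_bytes // block_size_bytes)

file_size = 1 * TB   # assumed file size, for illustration only
replication = 3      # assumed replication factor

for label, size in [("4 KB (OS-style block)", 4 * KB),
                    ("128 MB (HDFS default)", 128 * MB)]:
    blocks = block_count(file_size, size)
    print(f"{label}: {blocks:,} blocks, {blocks * replication:,} replicas to track")
```

Under these assumptions the same file needs 8,192 blocks at 128 MB versus roughly 268 million at 4 KB, which is the gap behind the "less metadata pressure" argument.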
Interview one-liner: "HDFS block is the storage unit; split is the computation unit. They're usually the same size, so one mapper reads one block, ideally from a local replica, with no network transfer for the read (data locality)."
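As a rough illustration of why splits usually line up with blocks, the sketch below mirrors the split-size rule used by Hadoop's `FileInputFormat`: `max(minSize, min(maxSize, blockSize))`. The Python function names are hypothetical, and the sketch deliberately ignores details such as the small slack Hadoop allows on the final split and non-splittable compressed files.

```python
MB = 1024 * 1024
GB = 1024 * MB

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule: clamp the block size between min_size and max_size."""
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, split_size):
    """Approximate number of splits (and so mappers) for one splittable file."""
    return -(-file_size // split_size)  # integer ceiling division

block = 128 * MB

# With default min/max settings the split size equals the block size.
default_split = compute_split_size(block)
print(default_split == block, num_splits(1 * GB, default_split))

# Lowering the max split size produces more, smaller splits per block.
small_split = compute_split_size(block, max_size=64 * MB)
print(small_split // MB, num_splits(1 * GB, small_split))
```

With defaults, a 1 GB file yields 8 splits of 128 MB each, one per block. Capping the maximum at 64 MB doubles that to 16, so two mappers share each block.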
Confusion #2 — NameNode vs DataNode vs Secondary NameNode vs Standby NameNode
Common trap: Secondary ≠ Standby.
| Node | Role | HA? |
|---|---|---|
| NameNode (active) | Holds filesystem metadata (where blocks live) | Single point of failure in Hadoop 1 |
| DataNode | Stores actual blocks, sends heartbeats | Horizontal scale, N copies |
| Secondary NameNode | Periodically merges fsimage + edits log. NOT a backup. | Housekeeping helper |
| Standby NameNode (HA) | Hot replica of the Active NameNode, kept in sync with its edits. Can take over quickly on failover. | True HA (Hadoop 2+) |
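To make the Secondary NameNode's "merge fsimage + edits" job concrete, here is a toy Python model of a checkpoint. It is a sketch under simplifying assumptions: the dictionary-based `fsimage`, the tuple-based edit records, and the `checkpoint` function are all hypothetical and do not reflect Hadoop's real on-disk formats or APIs.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> block ids)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = list(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown edit op: {op}")

def checkpoint(fsimage, edits):
    """Replay the edit log onto a copy of fsimage; return (new_fsimage, empty_log)."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edits:
        apply_edit(new_image, edit)
    return new_image, []

# Hypothetical state: one file in the last snapshot, two operations logged since.
fsimage = {"/logs/a.txt": [1, 2]}
edits = [("create", "/logs/b.txt", [3]), ("delete", "/logs/a.txt")]

fsimage, edits = checkpoint(fsimage, edits)
print(fsimage, edits)
```

The checkpoint only compacts the log into a fresh snapshot. Any edits made after the last checkpoint are missing from the Secondary's copy, which is why it is a housekeeping helper rather than a hot backup.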
Memory trick: Secondary = "Scroll edi