Day 1: HDFS + YARN + MapReduce — Deep Dive
Time: 5-6 hours | Covers internals + scenario questions at all levels
Philosophy: Learn the WHY, not just the WHAT. Every concept taught through the problem it solves.
Levels: ⬜ Direct (what/define) | 🟨 Mid-level (how/why) | 🟥 Scenario (design/debug/fix)
SECTION 1: HDFS — HOW HADOOP STORES DATA
🧠 The Core Problem HDFS Solves
Before HDFS: You have 100 TB of booking logs. One server has 4 TB disk. You need 25 servers. But:
- If one server crashes → that portion of data is LOST
- To process data → move it all to one machine → network bottleneck
- No way to scale beyond one machine's processing power
HDFS solution: Split files into fixed-size blocks, store each block on 3 different servers (replication), and run computation WHERE the data lives (data locality). Computation moves to the data instead of the data to the computation, and fault tolerance is built in.
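The numbers above can be sanity-checked with some back-of-envelope math. This is an illustrative sketch (not Hadoop code), assuming the HDFS defaults of 128 MB blocks and replication factor 3:

```python
BLOCK_SIZE_MB = 128   # default dfs.blocksize
REPLICATION = 3       # default dfs.replication

def hdfs_footprint(file_size_tb):
    """Block count and raw disk consumed for one file at the defaults above."""
    size_mb = file_size_tb * 1024 * 1024       # TB -> MB
    blocks = -(-size_mb // BLOCK_SIZE_MB)      # ceiling division
    raw_tb = file_size_tb * REPLICATION        # every block stored 3x
    return {"blocks": int(blocks), "raw_storage_tb": raw_tb}

# The 100 TB of booking logs from the example:
print(hdfs_footprint(100))   # {'blocks': 819200, 'raw_storage_tb': 300}
```

So the 100 TB dataset becomes ~819k blocks scattered across the cluster, and costs 300 TB of raw disk — the price paid for surviving server crashes.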
Q1: What is HDFS Architecture? Explain every component.
Simple Explanation: HDFS (Hadoop Distributed File System) is a distributed file system that splits large files into fixed-size blocks and stores them across many commodity servers. It has two types of nodes: one master (NameNode) that tracks WHERE everything is, and many workers (DataNodes) that actually store the data blocks.
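The master/worker split can be sketched as a toy model (illustrative only — real HDFS is Java, and the names `dn1`…`dn4` and `blk_001` are made up here). The key idea: the NameNode only answers "WHERE is the data?"; the bytes themselves flow directly between client and DataNodes:

```python
# Toy NameNode metadata: file path -> ordered list of
# (block_id, [DataNodes holding a replica of that block]).
namenode = {
    "/logs/bookings.log": [
        ("blk_001", ["dn1", "dn2", "dn3"]),
        ("blk_002", ["dn2", "dn3", "dn4"]),
    ],
}

def locate(path, block_index):
    """Client asks the NameNode for a block's locations, then reads
    the actual bytes straight from one of the returned DataNodes."""
    block_id, replicas = namenode[path][block_index]
    return replicas

print(locate("/logs/bookings.log", 0))   # ['dn1', 'dn2', 'dn3']
```

Note the design choice this models: metadata lookups are cheap and centralized, while the heavy data traffic is spread across many DataNodes, so the NameNode never becomes a bandwidth bottleneck.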
HDFS CLUSTER ARCHITECTURE:
═══════════════════════════════════════════════════════════════════
Client (your application)
│
▼
┌─────────────────────────────────────────────────────────────┐
│ NAMENODE (Master — 1 per cluster, critical!) │
│ │
│ Metadata held IN MEMORY (live namespace), persisted via:    │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ FsImage = snapshot of entire filesystem namespace │ │
│ │ (all file paths, permissions, replication) │ │
│ │ EditLog = transaction log of every change since │ │
│ │ last FsImage snapshot                                 │ │