Day 3: Performance, Security + Cloud Migration — Quick Recall Guide
🧠 MASTER MEMORY MAP — Day 3
SECTION 1: SECURITY — DIRECT QUESTIONS
Kerberos — authentication protocol. Users/services prove identity to the KDC (Key Distribution Center) and get cryptographic tickets. Without Kerberos, Hadoop accepts any claimed identity — zero security.
Keytab — file containing pre-stored Kerberos credentials for a service account. Used by automated processes (Oozie jobs, cron) to authenticate without a password prompt: kinit -kt /etc/keytabs/hive.keytab hive/host@REALM
Expired Kerberos ticket — job fails with "Authentication failure" mid-execution. Fix: use long-lived keytabs + run kinit -R (renewal) before jobs start, or rely on delegation tokens for long-running MapReduce/Spark jobs.
Apache Ranger — centralized, fine-grained authorization for ALL Hadoop services. Policies define who can do what on which resource. Supports table/column/row-level policies, column masking (e.g. show only the last 4 digits of an SSN), full audit logging.
Apache Knox — API gateway and SSL proxy for the Hadoop cluster. Single HTTPS entry point — users never connect directly to internal services. Simplifies firewall rules, provides SSO, terminates SSL at the edge.
Kerberos = AUTHENTICATION (verifies who you are). Ranger = AUTHORIZATION (decides what you can do). Both are needed: Kerberos proves identity, Ranger enforces permissions. Like: Kerberos = ID check at door, Ranger = bouncer with guest list.
SECTION 2: PERFORMANCE — FLASH CARDS
NameNode heap sizing — 1 GB per 1 million files/blocks. 10 million files → 10 GB minimum. Add a 50% buffer. Use G1GC for low-pause GC: -Xmx32g -Xms32g -XX:+UseG1GC. ⚠️ The default heap is only 1 GB — far too small for production!
dfs.namenode.handler.count — number of RPC handler threads in the NameNode. Default: 10 (too small for large clusters!). Rule: 20 × log2(cluster_nodes). 100 nodes → ~132 → set to 100-150. Low handler count → RPC queue builds up → timeouts.
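The handler-count heuristic above is just arithmetic; a minimal sketch in Python (the function name is illustrative, not a Hadoop API):

```python
import math

# Sketch of the NameNode RPC handler-count heuristic quoted above:
# dfs.namenode.handler.count ≈ 20 * log2(cluster_nodes).

def recommended_handler_count(cluster_nodes: int) -> int:
    """Rule of thumb for dfs.namenode.handler.count."""
    return int(20 * math.log2(cluster_nodes))

print(recommended_handler_count(100))   # → 132 for a 100-node cluster
print(recommended_handler_count(8))     # → 60 for a small 8-node cluster
```

In practice the result is a starting point — round it into the 100-150 range for a 100-node cluster and watch the RPC queue metrics.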
mapreduce.task.io.sort.mb — the in-memory sort buffer. Default 100 MB causes constant disk spills during shuffle. Increase to 512 MB → fewer spills → much faster shuffle. This one change often reduces job time 30-60%.
mapreduce.reduce.shuffle.parallelcopies — number of parallel threads each reducer uses to copy map outputs. Default: 5 (very low). Increase to 50 → reducers fetch shuffle data faster → the reduce phase starts sooner.
G1GC (Garbage First): concurrent collector with predictable short pauses (< 200 ms). CMS GC: can have multi-second "stop-the-world" pauses on large heaps → NameNode appears unresponsive → DataNodes time out → false "dead DataNode" alerts.
dfs.client.read.shortcircuit=true — if client runs on SAME node as DataNode, it reads the block file directly from local disk (bypasses network stack). Faster local reads for jobs with data locality.
SECTION 3: DATA SKEW — FLASH CARDS
Classic data-skew symptom — 99 reducers finish in 20 minutes, 1 reducer runs for 6 hours. One key (like "US") has millions of values → all sent to one reducer → bottleneck.
SET hive.groupby.skewindata=true → Hive runs a 2-phase GROUP BY. SET hive.optimize.skewjoin=true for skewed JOINs. ⚠️ NULL values all go to one reducer by default — on join keys, use COALESCE(key, CONCAT('null_', RAND())) (NULL keys never match in a join anyway, so scattering them is safe).
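The salting idea behind these fixes can be sketched as a toy Python simulation (not Hive internals; NUM_SALTS and the hot-key set are assumptions for illustration):

```python
import random
from collections import Counter

NUM_SALTS = 10          # spread each hot key across 10 reducer keys
HOT_KEYS = {"US"}       # keys known (or measured) to be skewed

def salt_key(key):
    """Append a random salt to hot keys so no single reducer receives
    every record for that key; a second aggregation pass then merges
    the partial counts per salted bucket."""
    if key in HOT_KEYS:
        return f"{key}_{random.randrange(NUM_SALTS)}"
    return key

records = [("US", 1)] * 1000 + [("DE", 1)] * 10

# Phase 1: aggregate on salted keys (US_0 ... US_9 instead of one US).
phase1 = Counter(salt_key(k) for k, _ in records)

# Phase 2: strip the salt and merge partial counts back together.
phase2 = Counter()
for k, v in phase1.items():
    phase2[k.split("_")[0]] += v

print(phase2)   # Counter({'US': 1000, 'DE': 10}) — same totals, no hot reducer
```

This is exactly the shape of Hive's 2-phase skewed GROUP BY: phase 1 spreads the load, phase 2 produces the final totals.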
SECTION 4: CLOUD MIGRATION — FLASH CARDS
Three strategies — Lift-and-Shift: copy HDFS → S3/ADLS with DistCp, keep the same code. Replatform: keep cloud storage, replace MapReduce with Spark. Refactor: full rebuild with Delta Lake + Databricks + Unity Catalog.
DistCp (Distributed Copy) — MapReduce-based tool to copy data between HDFS paths or from HDFS to the cloud: hadoop distcp hdfs://namenode/data/ s3a://bucket/data/. Use the -update flag for incremental copies (skip already-copied files).
Why -skipcrccheck with DistCp to S3/ADLS? HDFS uses CRC32 checksums; S3/ADLS use MD5-based ETags. They're incompatible — DistCp sees them as mismatches and re-copies everything on each run. -skipcrccheck (used together with -update) skips the checksum comparison → only genuinely new/changed files are copied.
Databricks Lakebridge — official Databricks migration tool (2025) for Hive → Delta Lake migration. Auto-converts Hive DDL to Delta Lake CREATE TABLE, migrates the Hive Metastore → Unity Catalog, translates HiveQL → Databricks SQL, assesses compatibility.
MapReduce: disk-based (reads HDFS, writes HDFS between every stage). Spark: in-memory (keeps data in RAM across stages). Result: Spark is 10-100x faster. Spark also supports Python/SQL, streaming, and ML. MapReduce: only Java, batch only.
SECTION 5: HADOOP 3 + ERASURE CODING — FLASH CARDS
Erasure Coding — alternative to 3x replication for cold data. RS-6-3: 6 data blocks + 3 parity blocks = 9 blocks total. Storage overhead: 1.5x vs 3x for replication. Saves ~50% storage on archive data. Cost: higher CPU for reconstruction.
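The overhead numbers above follow from simple arithmetic; a quick Python check (plain math, not a Hadoop API):

```python
# Raw bytes stored per byte of user data under each redundancy scheme.

def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Storage overhead of Reed-Solomon erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

rs_6_3 = ec_overhead(6, 3)     # 9 blocks stored for 6 blocks of data = 1.5x
replication_3x = 3.0           # every block stored three times = 3x

savings = 1 - rs_6_3 / replication_3x
print(rs_6_3, savings)         # 1.5 0.5 → EC uses half the raw storage of 3x
```

The same function gives RS-10-4 (another common Hadoop 3 policy) an overhead of 1.4x — slightly better storage efficiency, at the cost of reading more blocks during reconstruction.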
EC vs replication — EC: cold/archive data (accessed rarely), large files, /archive directories. Replication: hot production data, small files, data locality needed for MapReduce jobs. Never use EC for frequently accessed data (reconstruction is slow!).
SECTION 6: CAPACITY PLANNING — FLASH CARDS
NameNode metadata footprint — 150 bytes per file/block object in NameNode heap. 100 million files → 15 GB minimum. Add a 50% buffer → 22.5 GB → round up to 32 GB, plus OS overhead. Use a dedicated high-memory server for the NameNode.
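A minimal sketch of this sizing rule, assuming 150 bytes per namespace object, a 50% buffer, and rounding up to a power of two in GB (the power-of-two rounding mirrors the 22.5 GB → 32 GB step above and is a convention, not a Hadoop formula):

```python
import math

BYTES_PER_OBJECT = 150   # per-file/block metadata cost in NameNode heap

def namenode_heap_gb(num_objects: int, buffer: float = 0.5) -> int:
    """Heap estimate: raw footprint, +50% headroom, round up to 2^n GB."""
    raw_gb = num_objects * BYTES_PER_OBJECT / 1e9   # 100M objects → 15 GB
    with_buffer = raw_gb * (1 + buffer)             # 15 GB → 22.5 GB
    return 2 ** math.ceil(math.log2(with_buffer))   # 22.5 GB → 32 GB

print(namenode_heap_gb(100_000_000))   # → 32
```

This matches the -Xmx32g figure used in the JVM tuning card for a 100-million-object namespace.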
Typical DataNode spec — 64-128 GB RAM, 12-24 CPU cores, 10-12 × 4 TB SATA drives (JBOD — no RAID, HDFS handles redundancy), 10GbE network. Co-locate HBase RegionServers on DataNodes for data locality.
🧠 HADOOP ULTRA CHEAT SHEET — ALL 3 DAYS
╔══════════════════════════════════════════════════════════════════════╗
║                     HADOOP COMPLETE CHEAT SHEET                      ║
║                         3-Day Interview Prep                         ║
╠══════════════════════════════════════════════════════════════════════
║
║ ■ DAY 1: HDFS + YARN + MAPREDUCE
║
║   HDFS KEY NUMBERS: 128 MB blocks, 3x replication, 3s heartbeat
║   NameNode: metadata IN RAM (150 bytes per file)
║   DataNode: actual blocks, heartbeat every 3s, block report every 6h
║
║   NameNode HA = "JZ-FENCE":
║     JournalNodes (quorum log) + ZooKeeper (election) + FENCING
║     Fencing = kill old Active BEFORE promoting Standby (split-brain!)
║
║   YARN = RM + NM + AM:
║     ResourceManager: cluster boss (Scheduler + AppManager)
║     NodeManager: per-node (runs containers, monitors resources)
║     ApplicationMaster: per-JOB (negotiates resources, handles fails)
║     Schedulers: FIFO (dev) / Capacity (multi-tenant) / Fair (mixed)
║
║   MapReduce = "I-Map-CBS-Reduce-O":
║     Input → Map → Combiner → Buffer+Sort → Shuffle → Reduce → Output
║     Shuffle = BOTTLENECK (disk + network + sort)
║     Combiner: cut shuffle 60-80% (only commutative+associative!)
║     Speculative: duplicate slow tasks (DISABLE if side effects!)
║
║   Optimizations = "MCJ-COMP":
║     Memory: sort buffer 100 MB → 512 MB (biggest single win)
║     Combiner: pre-aggregate locally
║     JVM Reuse: jvm.numtasks=10 for many small tasks
║     Compression: Snappy shuffle, GZIP output
║     Output: LZO if output is input to next MR job (splittable!)
║     More Reducers: nodes × containers × 0.95 (not 1!)
║     Partitioner: custom for skewed keys
║
╠══════════════════════════════════════════════════════════════════════
║
║ ■ DAY 2: HIVE + ECOSYSTEM
║
║   Hive Architecture = "DMCTE":
║     Driver → Metastore → Compiler → Tez Engine → Execution
║     Metastore = MySQL backend, stores schema + HDFS paths
║     ⚠️ Metastore DOWN → ALL Hive queries FAIL
║
║   Internal vs External:
║     Internal: DROP TABLE = data DELETED from HDFS!
║     External: DROP TABLE = metadata only, data SAFE
║     Rule: always External for raw/shared/Sqoop-ingested data
║
║   Partitioning:
║     Static: PARTITION (date='..') — one partition at a time
║     Dynamic: nonstrict mode, Hive reads value from data
║     MSCK REPAIR TABLE → sync new partitions from HDFS to Metastore
║     Pruning: filter DIRECTLY on partition column (not YEAR(date)!)
║
║   Hive Optimization = "VECTOR-TOP":
║     Engine=Tez (5-10x vs MapReduce), Vectorization (1024-row batches)
║     CBO + ANALYZE TABLE, Map Join (< 25 MB → broadcast)
║     ORC format + predicate pushdown (skip stripes by min/max)
║     skewindata=true (2-phase GROUP BY for hot keys)
║     SORT BY not ORDER BY (unless global order needed)
║     Merge small files: hive.merge.mapredfiles=true
║
║   File Formats = "OPTA":
║     ORC: Hive-native, ACID, predicate pushdown (best for Hive)
║     Parquet: cross-tool Spark+Impala+Hive (best for multi-tool)
║     Text: never in production (no compress, full scan)
║     Avro: schema evolution, Kafka/Sqoop landing (row-based)
║
║   HBase: NoSQL random R/W on HDFS
║     Row key: NEVER monotonic (hotspot!) → use salt/reverse timestamp
║     Write: WAL → MemStore → HFile flush at 128 MB
║
║   Sqoop: RDBMS ↔ HDFS
║     Incremental: --incremental append (new rows) or lastmodified
║     --num-mappers: parallel DB connections (4-8 max!)
║     Export NOT atomic → use staging table!
║
║   Oozie: Workflow (run once) + Coordinator (time+data trigger)
║   ZooKeeper: leader election (NameNode/HMaster/YARN RM)
║     Always odd nodes (3/5/7), ephemeral znodes for leader detection
║   Flume: Source → Channel → Sink for log ingestion
║     File channel = durable; Memory channel = fast but lossy
║
╠══════════════════════════════════════════════════════════════════════
║
║ ■ DAY 3: PERFORMANCE, SECURITY, MIGRATION
║
║   Security = "KRK":
║     Kerberos: Authentication (kinit → TGT → Service Ticket)
║     Ranger: Authorization (table/column/row-level policies + audit)
║     Knox: Gateway (single HTTPS entry + SSL + SSO)
║
║   JVM Tuning:
║     NameNode: -Xmx32g -Xms32g -XX:+UseG1GC
║     NameNode heap: 1 GB per 1 million files (150 bytes/file)
║     G1GC: short predictable pauses vs CMS long stop-the-world
║
║   MapReduce Critical Settings:
║     mapreduce.task.io.sort.mb=512 (default 100 = too small!)
║     mapreduce.map.output.compress=true + Snappy codec
║     mapreduce.reduce.shuffle.parallelcopies=50 (default 5)
║     mapreduce.job.jvm.numtasks=10 (JVM reuse for small-task jobs)
║     mapreduce.job.reduces=N (never leave at default 1!)
║
║   Data Skew Solutions:
║     MapReduce: salting (US → US_0...US_99) + custom Partitioner
║     Hive GROUP BY: hive.groupby.skewindata=true (2-phase agg)
║     Hive JOIN: hive.optimize.skewjoin=true
║     NULL keys: COALESCE(key, CONCAT('null_', RAND()))
║
║   Erasure Coding (Hadoop 3):
║     RS-6-3: 6 data + 3 parity blocks = 1.5x overhead vs 3x
║     50% storage savings for cold/archive data
║     ⚠️ High CPU for reconstruction → cold data only!
║
║   Cloud Migration = "LRR":
║     Lift-and-Shift: DistCp HDFS → S3/ADLS, same code
║     Replatform: MapReduce → Spark (10-100x faster)
║     Refactor: Hive → Delta Lake + Databricks Lakebridge (2025)
║     Always: parallel run → compare → cutover (never big-bang!)
║
║   HADOOP vs SPARK:
║     MapReduce: disk-based, Java, batch, slow (minutes to hours)
║     Spark: in-memory, Python/SQL, streaming+batch (seconds-minutes)
║     New development: always Spark. Legacy: MapReduce still runs.
║
╠══════════════════════════════════════════════════════════════════════
║
║ TOP 10 THINGS TO SAY IN INTERVIEW:
║
║   1. "NameNode HA: JournalNodes + ZooKeeper + FENCING (split-brain)"
║   2. "Shuffle is bottleneck: sort buffer 100 MB → 512 MB is biggest win"
║   3. "External tables for all raw/shared data — DROP TABLE is safe"
║   4. "MSCK REPAIR TABLE when Sqoop adds partitions outside Hive"
║   5. "Tez + Vectorization + ORC = 10-50x faster than MR+Text"
║   6. "HBase row key: NEVER monotonic (hotspot) → use salting"
║   7. "Kerberos = authentication, Ranger = authorization, Knox = gateway"
║   8. "Erasure Coding saves 50% storage — use for cold/archive data"
║   9. "Migration: Lift-and-Shift → Replatform → Refactor (never big-bang)"
║  10. "skewindata=true for GROUP BY skew, custom partitioner for MR"
║
║ 10-YEAR ENGINEER FRAMING:
║   "I've not just run Hadoop queries — I've tuned NameNode heap,
║   fixed GC pauses causing DataNode timeouts, debugged ZKFC failover,
║   implemented salting for skewed keys, and migrated pipelines from
║   MapReduce to Spark and then to Databricks lakehouses."
║
╚══════════════════════════════════════════════════════════════════════