🐘 Hadoop · Section 7 of 8

Day 3: Performance, Security + Cloud Migration — Quick Recall Guide
Must remember: 🔑 Key concept · ⚠️ Common trap · 🧠 Memory Map · 📝 One-liner

🧠 MASTER MEMORY MAP — Day 3

🧠 HADOOP SECURITY = "KRK" (Kerberos → Ranger → Knox):
  K — Kerberos: AUTHENTICATION (who are you? → TGT from KDC)
  R — Ranger: AUTHORIZATION (what can you do? → table/column/row policies)
  K — Knox: GATEWAY (how do you get in? → SSL + single entry point)
🧠 PERFORMANCE TUNING = "JYH-COMP":
  J — JVM: heap sizes (NN heap = 1 GB per million files!), G1GC
  Y — YARN: container sizes, node memory, scheduler config
  H — Hadoop: block size, handler count, short-circuit reads
  C — Compression: Snappy shuffle, GZIP/ZLIB archive
  O — Optimize sort buffer: 100 MB → 512 MB (mapreduce.task.io.sort.mb)
  M — More reducers: 1 → nodes × containers × 0.95
  P — Parallel shuffle copies: parallelcopies default 5 → 50
🧠 CLOUD MIGRATION = "LRR":
  L — Lift-and-Shift: HDFS → S3/ADLS + same code (fastest, least benefit)
  R — Replatform: MapReduce → Spark (same data, better processing)
  R — Refactor: Hive → Delta Lake + Databricks (full modernization)
🧠 HADOOP vs SPARK = "Disk vs RAM":
  MapReduce: disk-based, slow, Java only, fault-tolerant
  Spark: in-memory, up to 100x faster, Python/SQL/Scala, streaming+batch

SECTION 1: SECURITY — DIRECT QUESTIONS

Q1. What is Kerberos in Hadoop?

Authentication protocol. Users/services prove identity to KDC (Key Distribution Center) and get cryptographic tickets. Without Kerberos, Hadoop accepts any claimed identity — zero security.

Q2. What is a keytab file?

File containing pre-stored Kerberos credentials for a service account. Used by automated processes (Oozie jobs, cron) to authenticate without a password prompt. kinit -kt /etc/keytabs/hive.keytab hive/host@REALM

⚠️ Q3. What happens if a Kerberos ticket expires during a long-running job?

Job fails with "Authentication failure" mid-execution. Fix: use long-lived keytabs and renew tickets (kinit -R) before jobs start, or rely on Hadoop delegation tokens — MapReduce/Spark jobs obtain them at submission so individual tasks don't need to contact the KDC.
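For automated pipelines, one common pattern is refreshing the ticket on a schedule from the keytab; a sketch (keytab path, principal, and realm are illustrative — match your own environment):

```
# crontab entry: re-obtain a fresh TGT from the keytab every 6 hours so
# long-running jobs never start with a nearly-expired ticket
# (path and principal below are examples, not real values)
0 */6 * * * kinit -kt /etc/keytabs/etl.keytab etl/host@EXAMPLE.COM
```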

Q4. What does Apache Ranger do?

Centralized fine-grained authorization for ALL Hadoop services. Policies define: who can do what on which resource. Supports table/column/row-level policies, column masking (show last 4 digits of SSN), full audit logging.

Q5. What does Apache Knox do?

API Gateway and SSL proxy for Hadoop cluster. Single HTTPS entry point — users never connect directly to internal services. Simplifies firewall rules, provides SSO, terminates SSL at edge.

Q6. Kerberos vs Ranger — what's the difference?

Kerberos = AUTHENTICATION (verifies who you are). Ranger = AUTHORIZATION (decides what you can do). Both are needed: Kerberos proves identity, Ranger enforces permissions. Like: Kerberos = ID check at door, Ranger = bouncer with guest list.

SECTION 2: PERFORMANCE — FLASH CARDS

Q7. NameNode heap sizing rule?

1 GB per 1 million files/blocks. 10 million files → 10 GB minimum. Add 50% buffer. Use G1GC for low-pause GC: -Xmx32g -Xms32g -XX:+UseG1GC. ⚠️ Default heap is only 1 GB — way too small for production!

Q8. What is dfs.namenode.handler.count?

Number of RPC handler threads in NameNode. Default: 10 (too small for large clusters!). Rule: 20 × log2(cluster_nodes). 100 nodes → ~132 → set to 100-150. Low handler count → RPC queue builds up → timeouts.
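The rule of thumb is easy to sanity-check; a minimal sketch of the 20 × log2(N) heuristic from this card (a sizing heuristic, not an official formula):

```python
import math

def nn_handler_count(cluster_nodes: int) -> int:
    """Heuristic from the card: 20 * log2(number of cluster nodes)."""
    return int(20 * math.log2(cluster_nodes))

print(nn_handler_count(100))  # 132 -> in practice, set to 100-150
```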

Q9. What setting has the biggest MapReduce performance impact?

mapreduce.task.io.sort.mb — the in-memory sort buffer. Default 100 MB causes constant disk spills during shuffle. Increase to 512 MB → fewer spills → much faster shuffle. This one change often reduces job time 30-60%.

Q10. What is mapreduce.reduce.shuffle.parallelcopies?

Number of parallel threads each reducer uses to copy map outputs. Default: 5 (very low). Increase to 50 → reducers fetch shuffle data faster → reduce phase starts sooner.
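The shuffle knobs from Q9 and Q10 typically live in mapred-site.xml; a sketch with the values from these cards (illustrative starting points — tune per workload):

```xml
<!-- mapred-site.xml: shuffle tuning from the cards above (illustrative values) -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>  <!-- sort buffer: default 100 MB spills constantly -->
</property>
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>50</value>   <!-- parallel fetch threads per reducer (default 5) -->
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value> <!-- compress map output to shrink shuffle traffic -->
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```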

⚠️ Q11. Why is G1GC preferred over CMS GC for the NameNode?

G1GC (Garbage First): concurrent collector with predictable short pauses (< 200 ms). CMS GC: can have multi-second "stop-the-world" pauses on large heaps → NameNode appears unresponsive → DataNodes time out → false "dead DataNode" alerts.

Q12. What is short-circuit read in HDFS?

dfs.client.read.shortcircuit=true — if client runs on SAME node as DataNode, it reads the block file directly from local disk (bypasses network stack). Faster local reads for jobs with data locality.

SECTION 3: DATA SKEW — FLASH CARDS

Q13. What is the data-skew symptom in MapReduce?

99 reducers finish in 20 minutes. 1 reducer runs for 6 hours. One key (like "US") has millions of values → all sent to one reducer → bottleneck.

Q14. How do you fix skew in MapReduce?

Pro Tip: Salting — append a random suffix to the hot key ("US" → "US_0" through "US_99") to spread it across 100 reducers, then merge the partial results in a second pass. Alternative: a custom Partitioner that splits hot keys across multiple reducers.
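The two-phase salting idea can be sketched outside Hadoop; a toy simulation of it (key names and salt count are illustrative, not from any real job):

```python
import random
from collections import Counter

NUM_SALTS = 100  # "US" -> "US_0" .. "US_99"

def salt_key(key: str) -> str:
    """Phase 1: spread a hot key over NUM_SALTS reducer partitions."""
    return f"{key}_{random.randrange(NUM_SALTS)}"

def unsalt_key(salted: str) -> str:
    """Phase 2: strip the suffix before the final merge."""
    return salted.rsplit("_", 1)[0]

# One hot key ("US") that would otherwise land on a single reducer:
records = ["US"] * 1000 + ["DE"] * 10
partials = Counter(salt_key(k) for k in records)   # phase-1 partial counts
final = Counter()
for salted, count in partials.items():             # phase-2 merge
    final[unsalt_key(salted)] += count
print(final["US"], final["DE"])  # 1000 10 — totals survive the two-phase pass
```

The "US" counts are now spread over up to 100 phase-1 keys, so no single reducer sees all million values; the cheap second pass restores exact totals.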

Q15. How do you fix skew in Hive?

SET hive.groupby.skewindata=true → Hive runs 2-phase GROUP BY. SET hive.optimize.skewjoin=true for skewed JOIN. ⚠️ NULL values all go to one reducer by default — use COALESCE(key, CONCAT('null_', RAND())).

SECTION 4: CLOUD MIGRATION — FLASH CARDS

Q16. Three migration patterns from Hadoop to cloud?

Lift-and-Shift: copy HDFS → S3/ADLS with DistCp, keep same code. Replatform: keep cloud storage, replace MapReduce with Spark. Refactor: full rebuild with Delta Lake + Databricks + Unity Catalog.

Q17. What is DistCp?

Distributed Copy — a MapReduce-based tool for copying data between HDFS paths or from HDFS to cloud storage. hadoop distcp hdfs://namenode/data/ s3a://bucket/data/. Use the -update flag for incremental copies (skip already-copied files).

⚠️ Q18. Why use -skipcrccheck with DistCp to S3/ADLS?

HDFS uses CRC32 checksums; S3/ADLS use MD5-based ETags. They're incompatible — with -update alone, DistCp treats every file as changed and re-copies everything on each run. -update -skipcrccheck compares file sizes instead → only genuinely new/changed files are copied.

Q19. What is Databricks Lakebridge?

Official Databricks migration tool (2025) for Hive → Delta Lake migration. Auto-converts Hive DDL to Delta Lake CREATE TABLE, migrates Hive Metastore → Unity Catalog, translates HiveQL → Databricks SQL, assesses compatibility.

Q20. Hadoop MapReduce vs Spark — key difference?

MapReduce: disk-based (reads HDFS, writes HDFS between every stage). Spark: in-memory (keeps data in RAM across stages). Result: Spark is 10-100x faster. Spark also supports Python/SQL, streaming, and ML. MapReduce: only Java, batch only.

SECTION 5: HADOOP 3 + ERASURE CODING — FLASH CARDS

Q21. What is Erasure Coding in Hadoop 3?

Alternative to 3x replication for cold data. RS-6-3: 6 data blocks + 3 parity blocks = 9 blocks total. Storage overhead: 1.5x vs 3x replication. Saves ~50% storage on archive data. Cost: higher CPU for reconstruction.
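The overhead arithmetic is worth being able to reproduce on a whiteboard:

```python
DATA, PARITY = 6, 3                       # RS-6-3 policy
ec_overhead = (DATA + PARITY) / DATA      # 9 blocks stored per 6 data blocks
replication_overhead = 3.0                # classic 3x replication
savings = 1 - ec_overhead / replication_overhead
print(ec_overhead, f"{savings:.0%}")      # 1.5 50%
```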

Q22. When to use Erasure Coding vs Replication?

EC: cold/archive data (accessed rarely), large files, /archive directories. Replication: hot production data, small files, data locality needed for MapReduce jobs. Never use EC for frequently-accessed data (reconstruction is slow!).

Q23. New YARN and HDFS features in Hadoop 3?

Pro Tip: Opportunistic Containers (run low-priority tasks in spare capacity), Timeline Service v2 (better job history), multiple Standby NameNodes (not just one standby), Intra-DataNode Balancer (balance disks within one DataNode).

SECTION 6: CAPACITY PLANNING — FLASH CARDS

Q24. NameNode memory formula?

150 bytes per file/block in NameNode heap. 100 million files → 15 GB minimum. Add 50% buffer → 22.5 GB → round up to 32 GB. Plus OS overhead. Use dedicated high-memory server for NameNode.
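The sizing arithmetic from this card as a quick calculation (decimal GB, with the 50% buffer from above):

```python
def namenode_heap_gb(num_objects: int, bytes_per_object: int = 150,
                     buffer: float = 1.5) -> float:
    """150 bytes per file/block object in NameNode heap, plus a 50% buffer."""
    return num_objects * bytes_per_object * buffer / 1e9

print(namenode_heap_gb(100_000_000))  # 22.5 -> round up to a 32 GB heap
```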

Q25. DataNode hardware recommendation?

64-128 GB RAM, 12-24 CPU cores, 10-12 × 4 TB SATA drives (JBOD — no RAID, HDFS handles redundancy), 10GbE network. Co-locate HBase RegionServers on DataNodes for data locality.

🧠 HADOOP ULTRA CHEAT SHEET — ALL 3 DAYS

╔══════════════════════════════════════════════════════════════════════╗
║                  HADOOP COMPLETE CHEAT SHEET                          ║
║                    3-Day Interview Prep                                ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  ■ DAY 1: HDFS + YARN + MAPREDUCE                                    ║
║                                                                      ║
║  HDFS KEY NUMBERS: 128 MB blocks, 3x replication, 3s heartbeat        ║
║  NameNode: metadata IN RAM (150 bytes per file)                      ║
║  DataNode: actual blocks, heartbeat every 3s, block report every 6h  ║
║                                                                      ║
║  NameNode HA = "JZ-FENCE":                                           ║
║    JournalNodes (quorum log) + ZooKeeper (election) + FENCING        ║
║    Fencing = kill old Active BEFORE promoting Standby (split-brain!) ║
║                                                                      ║
║  YARN = RM + NM + AM:                                                ║
║    ResourceManager: cluster boss (Scheduler + AppManager)            ║
║    NodeManager: per-node (runs containers, monitors resources)       ║
║    ApplicationMaster: per-JOB (negotiates resources, handles fails)  ║
║  Schedulers: FIFO (dev) / Capacity (multi-tenant) / Fair (mixed)     ║
║                                                                      ║
║  MapReduce = "I-Map-CBS-Reduce-O":                                   ║
║    Input → Map → Combiner → Buffer+Sort → Shuffle → Reduce → Output  ║
║    Shuffle = BOTTLENECK (disk + network + sort)                      ║
║    Combiner: cut shuffle 60-80% (only commutative+associative!)       ║
║    Speculative: duplicate slow tasks (DISABLE if side effects!)       ║
║                                                                      ║
║  Optimizations = "MCJ-COMP":                                         ║
║    Memory: sort buffer 100 MB → 512 MB (biggest single win)           ║
║    Combiner: pre-aggregate locally                                    ║
║    JVM Reuse: jvm.numtasks=10 for many small tasks                   ║
║    Compression: Snappy shuffle, GZIP output                          ║
║    Output: LZO if output is input to next MR job (splittable!)       ║
║    More Reducers: nodes×containers×0.95 (not 1!)                    ║
║    Partitioner: custom for skewed keys                               ║
║                                                                      ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  ■ DAY 2: HIVE + ECOSYSTEM                                           ║
║                                                                      ║
║  Hive Architecture = "DMCTE":                                        ║
║    Driver → Metastore → Compiler → Tez Engine → Execution            ║
║    Metastore = MySQL backend, stores schema+HDFS paths               ║
║    ⚠️ Metastore DOWN → ALL Hive queries FAIL                         ║
║                                                                      ║
║  Internal vs External:                                               ║
║    Internal: DROP TABLE = data DELETED from HDFS!                    ║
║    External: DROP TABLE = metadata only, data SAFE                   ║
║    Rule: always External for raw/shared/Sqoop-ingested data          ║
║                                                                      ║
║  Partitioning:                                                       ║
║    Static: PARTITION (date='..') — one partition at a time           ║
║    Dynamic: nonstrict mode, Hive reads value from data               ║
║    MSCK REPAIR TABLE → sync new partitions from HDFS to Metastore    ║
║    Pruning: filter DIRECTLY on partition column (not YEAR(date)!)    ║
║                                                                      ║
║  Hive Optimization = "VECTOR-TOP":                                   ║
║    Engine=Tez (5-10x vs MapReduce), Vectorization (1024-row batches) ║
║    CBO + ANALYZE TABLE, Map Join (< 25 MB → broadcast)                ║
║    ORC format + predicate pushdown (skip stripes by min/max)         ║
║    skewindata=true (2-phase GROUP BY for hot keys)                   ║
║    SORT BY not ORDER BY (unless global order needed)                 ║
║    Merge small files: hive.merge.mapredfiles=true                    ║
║                                                                      ║
║  File Formats = "OPTA":                                              ║
║    ORC: Hive-native, ACID, predicate pushdown (best for Hive)        ║
║    Parquet: cross-tool Spark+Impala+Hive (best for multi-tool)       ║
║    Text: never in production (no compress, full scan)                ║
║    Avro: schema evolution, Kafka/Sqoop landing (row-based)           ║
║                                                                      ║
║  HBase: NoSQL random R/W on HDFS                                     ║
║    Row key: NEVER monotonic (hotspot!) → use salt/reverse timestamp  ║
║    Write: WAL → MemStore → HFile flush at 128 MB                      ║
║                                                                      ║
║  Sqoop: RDBMS ↔ HDFS                                                 ║
║    Incremental: --incremental append (new rows) or lastmodified      ║
║    --num-mappers: parallel DB connections (4-8 max!)                 ║
║    Export NOT atomic → use staging table!                            ║
║                                                                      ║
║  Oozie: Workflow (run once) + Coordinator (time+data trigger)        ║
║  ZooKeeper: leader election (NameNode/HMaster/YARN RM)               ║
║    Always odd nodes (3/5/7), ephemeral znodes for leader detection   ║
║  Flume: Source → Channel → Sink for log ingestion                    ║
║    File channel = durable; Memory channel = fast but lossy           ║
║                                                                      ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  ■ DAY 3: PERFORMANCE, SECURITY, MIGRATION                           ║
║                                                                      ║
║  Security = "KRK":                                                   ║
║    Kerberos: Authentication (kinit → TGT → Service Ticket)           ║
║    Ranger: Authorization (table/column/row-level policies + audit)   ║
║    Knox: Gateway (single HTTPS entry + SSL + SSO)                    ║
║                                                                      ║
║  JVM Tuning:                                                         ║
║    NameNode: -Xmx32g -Xms32g -XX:+UseG1GC                           ║
║    NameNode heap: 1 GB per 1 million files (150 bytes/file)           ║
║    G1GC: short predictable pauses vs CMS long stop-the-world         ║
║                                                                      ║
║  MapReduce Critical Settings:                                        ║
║    mapreduce.task.io.sort.mb=512     (default 100 = too small!)      ║
║    mapreduce.map.output.compress=true + Snappy codec                 ║
║    mapreduce.reduce.shuffle.parallelcopies=50 (default 5)            ║
║    mapreduce.job.jvm.numtasks=10 (JVM reuse — MRv1 only)             ║
║    mapreduce.job.reduces=N (never leave at default 1!)               ║
║                                                                      ║
║  Data Skew Solutions:                                                ║
║    MapReduce: salting (US → US_0...US_99) + custom Partitioner       ║
║    Hive GROUP BY: hive.groupby.skewindata=true (2-phase agg)         ║
║    Hive JOIN: hive.optimize.skewjoin=true                            ║
║    NULL keys: COALESCE(key, CONCAT('null_', RAND()))                 ║
║                                                                      ║
║  Erasure Coding (Hadoop 3):                                          ║
║    RS-6-3: 6 data + 3 parity blocks = 1.5x overhead vs 3x           ║
║    50% storage savings for cold/archive data                         ║
║    ⚠️ High CPU for reconstruction → cold data only!                  ║
║                                                                      ║
║  Cloud Migration = "LRR":                                            ║
║    Lift-and-Shift: DistCp HDFS→S3/ADLS, same code                   ║
║    Replatform: MapReduce → Spark (10-100x faster)                    ║
║    Refactor: Hive → Delta Lake + Databricks Lakebridge (2025)        ║
║    Always: parallel run → compare → cutover (never big-bang!)        ║
║                                                                      ║
║  HADOOP vs SPARK:                                                    ║
║    MapReduce: disk-based, Java, batch, slow (minutes to hours)       ║
║    Spark: in-memory, Python/SQL, streaming+batch (seconds-minutes)   ║
║    New development: always Spark. Legacy: MapReduce still runs.      ║
║                                                                      ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  TOP 10 THINGS TO SAY IN INTERVIEW:                                  ║
║                                                                      ║
║  1. "NameNode HA: JournalNodes + ZooKeeper + FENCING (split-brain)"  ║
║  2. "Shuffle is bottleneck: sort buffer 100 → 512 MB is biggest win"║
║  3. "External tables for all raw/shared data — DROP TABLE is safe"   ║
║  4. "MSCK REPAIR TABLE when Sqoop adds partitions outside Hive"      ║
║  5. "Tez + Vectorization + ORC = 10-50x faster than MR+Text"        ║
║  6. "HBase row key: NEVER monotonic (hotspot) → use salting"         ║
║  7. "Kerberos=authentication, Ranger=authorization, Knox=gateway"    ║
║  8. "Erasure Coding saves 50% storage — use for cold/archive data"   ║
║  9. "Migration: Lift-Shift → Replatform → Refactor (never big-bang!)"║
║  10. "skewindata=true for GROUP BY skew, custom partitioner for MR"  ║
║                                                                      ║
║  10-YEAR ENGINEER FRAMING:                                           ║
║  "I've not just run Hadoop queries — I've tuned NameNode heap,       ║
║   fixed GC pauses causing DataNode timeouts, debugged ZKFC failover, ║
║   implemented salting for skewed keys, and migrated pipelines from   ║
║   MapReduce to Spark and then to Databricks lakehouses."             ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝