🐘 Hadoop · Section 6 of 8

Day 3: Performance, Security + Cloud Migration — Deep Interview Guide

🧠 MASTER MEMORY MAP — Day 3

🧠 HADOOP SECURITY = "KRK" (Kerberos → Ranger → Knox):
K = Kerberos: AUTHENTICATION (who are you?)
R = Ranger: AUTHORIZATION (what can you do?)
K = Knox: GATEWAY (how do you get in? SSL + API proxy)
PERFORMANCE TUNING LAYERS = "JYH" (JVM → YARN → Hadoop):
J = JVM tuning (heap sizes, GC, JVM reuse)
Y = YARN tuning (container sizes, scheduler config)
H = Hadoop-level (block size, replication, compression)
CLOUD MIGRATION PATTERNS = "LRR" (Lift → Replatform → Refactor):
L = Lift-and-Shift: HDFS → S3/ADLS (same MapReduce/Hive, just different storage)
R = Replatform: MapReduce → Spark (same data in cloud, better processing)
R = Refactor: Hive → Delta Lake / Snowflake (rebuild for cloud-native)
HADOOP vs SPARK = "DISK vs RAM":
Hadoop MapReduce: disk-based, fault-tolerant, Java-only
Spark: in-memory (up to 100x faster), Python/SQL/Scala, streaming + batch
When to still use Hadoop: legacy code, no migration budget, on-prem HBase

SECTION 1: HADOOP SECURITY

Layer 1: Kerberos — Authentication

KERBEROS = MIT protocol for distributed authentication
WITHOUT KERBEROS (plain Hadoop)
User: "I am root"
Hadoop: "OK, here's all the data" ← takes your word for it!
Security: ZERO
WITH KERBEROS
User: must have valid Kerberos ticket (cryptographically signed by KDC)
Hadoop: verifies ticket with KDC before granting access
Can't fake identity → proper enterprise security
HOW KERBEROS WORKS (simplified)
KDC = Key Distribution Center (the central auth server)
Authentication Service (AS): verifies your password
Ticket Granting Service (TGS): gives service-specific tickets
STEP 1: User runs kinit (login):
$ kinit krishna@AMADEUS.COM
Enter password → KDC verifies → gives Ticket-Granting Ticket (TGT)
TGT stored locally (~8 hour validity)
STEP 2: Access Hadoop service:
User app: "I want to access HDFS, here's my TGT"
KDC/TGS: verifies TGT → issues Service Ticket for HDFS (specific to HDFS daemon)
User app presents Service Ticket to HDFS NameNode
NameNode: "This ticket is valid" → access granted
STEP 3: Each service has its own Kerberos principal:
hdfs/namenode-host@AMADEUS.COM (HDFS NameNode)
yarn/rm-host@AMADEUS.COM (YARN ResourceManager)
hive/hiveserver2-host@AMADEUS.COM (HiveServer2)
KEY TERMS
Principal: identity in Kerberos realm (user or service)
keytab: file with pre-stored credentials (for automated services like Oozie)
→ No password prompt, used in cron jobs and service accounts
realm: Kerberos domain (e.g., AMADEUS.COM)
kinit: manual login to get TGT (for humans)
klist: show current Kerberos tickets
kdestroy: destroy tickets (logout)
bash
# Production Hadoop with Kerberos:
kinit -kt /etc/security/keytabs/hive.service.keytab hive/hiveserver2@AMADEUS.COM
# Login using keytab file (automated, no password prompt)
# Used in Oozie jobs, cron jobs, service accounts

klist
# Credentials cache: FILE:/tmp/krb5cc_1000
# Default principal: hive/hiveserver2@AMADEUS.COM
# Expires: Thu Jan 16 08:00:00 2025

# Renew ticket before it expires (important for long-running jobs!):
kinit -R
# ⚠️ If ticket expires mid-job → job fails with "Authentication failure"
# Solution: use long-lived keytabs + auto-renewal in production

Layer 2: Apache Ranger — Authorization

🧠 RANGER = Role-Based Access Control for ALL Hadoop services
Without Ranger: access control via HDFS POSIX permissions only
→ hdfs dfs -chmod 750 /data/pii/ (coarse-grained, file-level)
→ Can't say "marketing can read column1 but not SSN column"
With Ranger: fine-grained policies for EVERY service:
→ Hive: table-level, column-level, row-level masking
→ HDFS: path-level access
→ HBase: table + column family level
→ Kafka: topic-level
→ YARN: queue-level
All in ONE central UI!
RANGER POLICY EXAMPLES
Policy 1: Hive table access
Name: "Marketing team can read bookings"
Resource: database=bookings_db, table=bookings, columns=*
Permissions: user=marketing_role → SELECT
Exclude: columns=credit_card_number (masked!)
Policy 2: Column masking (PII protection)
Name: "Mask SSN for non-PII team"
Resource: database=bookings_db, table=customers, column=ssn
Masking: MASK (show only last 4 digits: ***-**-1234)
Applied to: all users EXCEPT pii_approved_role
Policy 3: Row-level filter
Name: "Regional teams see only their region"
Resource: database=bookings_db, table=bookings
Row filter: region = current_user_region()
Applied to: regional_analyst role
RANGER AUDIT
Every access (success AND failure) is logged
"Who accessed what, when, from where"
Critical for compliance (GDPR, PCI-DSS, SOX)
Stored in Solr or HDFS for long-term retention
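The column-masking policy above can be pictured with a tiny sketch. This is not Ranger's code (Ranger rewrites the query inside Hive itself); `mask_ssn` is a hypothetical helper that just illustrates the "show last 4" transform:

```python
def mask_ssn(ssn: str) -> str:
    """Illustrate Ranger's 'show last 4' masking: 123-45-6789 -> ***-**-6789."""
    return f"***-**-{ssn[-4:]}"

print(mask_ssn("123-45-6789"))  # ***-**-6789
```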

Layer 3: Apache Knox — Gateway

🧠 KNOX = API Gateway for Hadoop cluster
Problems Knox solves:
1. Users need to access multiple services (HiveServer2, HDFS, YARN UI, Oozie)
Each service has its own host:port → complex firewall rules
2. No SSL on internal Hadoop services (encrypted in transit needed for compliance)
3. Direct service access exposes internal cluster topology
Knox solution:
Single entry point: https://knox-gateway:8443/gateway/
→ Users never connect directly to internal services
→ Knox proxies requests to appropriate backend service
→ SSL termination at Knox (external traffic encrypted)
→ Authentication at Knox (SSO with LDAP/Kerberos)
External: User → HTTPS → Knox:8443
Internal: Knox → HTTP → HiveServer2, HDFS, YARN UI, Oozie
KNOX PATHS
/gateway/default/webhdfs/ → HDFS WebHDFS API
/gateway/default/hive/ → HiveServer2
/gateway/default/yarn/ → YARN ResourceManager UI
/gateway/default/oozie/ → Oozie API

SECTION 2: PERFORMANCE TUNING

Level 1: JVM Tuning

bash
# hadoop-env.sh — Tune NameNode JVM heap (most critical — all metadata in RAM!)
export HADOOP_NAMENODE_OPTS="-Xmx16g -Xms16g -XX:+UseG1GC"
# -Xmx16g: max heap 16 GB
# -Xms16g: initial heap = max heap (prevents resize pauses)
# -XX:+UseG1GC: G1 garbage collector (low-pause, good for large heaps)
# Default: only 1 GB → too small for clusters with >10 million files!

# Rule for NameNode heap sizing:
# ~1 GB per 1 million files/blocks (rough estimate)
# 10M files → 10 GB NameNode heap minimum
# 100M files → 100 GB NameNode heap (need large server!)

# DataNode JVM: less critical (doesn't keep metadata)
export HADOOP_DATANODE_OPTS="-Xmx4g -XX:+UseG1GC"

# YARN NodeManager JVM:
export YARN_NODEMANAGER_OPTS="-Xmx4g -XX:+UseG1GC"
GC PAUSES — THE SILENT KILLER:
JVM Garbage Collection = JVM pauses ALL threads to reclaim memory
On NameNode: GC pause = HDFS appears DOWN during pause (RPCs timeout!)
On DataNode: GC pause = heartbeats delayed → NN marks DN as dead!
Fix:
1. Use G1GC (-XX:+UseG1GC) — concurrent collector, shorter pauses
2. Set -Xms = -Xmx (avoid heap resizing GC)
3. Don't make heap too large — longer full GC if it happens
NameNode: max ~32 GB is practical; larger = painful GC pauses
Signs of GC problems: "Long GC pause" in logs, DataNodes showing as dead
Check: hdfs dfsadmin -report | grep "Dead datanodes"
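The heap-sizing rule above (~1 GB per million files/blocks) can be wrapped in a small helper. A rough planning sketch, not an official formula: `namenode_opts` and its 4 GB floor are illustrative assumptions, not Hadoop defaults.

```python
def namenode_opts(files_millions: int) -> str:
    """Build HADOOP_NAMENODE_OPTS from the ~1 GB per 1M files/blocks rule.

    The 4 GB floor is an illustrative minimum, not a Hadoop default.
    """
    heap_gb = max(4, files_millions)
    # -Xms = -Xmx to avoid resize pauses, G1GC for shorter GC pauses
    return f"-Xmx{heap_gb}g -Xms{heap_gb}g -XX:+UseG1GC"

print(namenode_opts(10))  # -Xmx10g -Xms10g -XX:+UseG1GC
```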

Level 2: YARN Memory Tuning (Full Reference)

xml
<!-- yarn-site.xml — Complete memory configuration -->

<!-- Node-level memory available to YARN: -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value>   <!-- 48 GB on a 64 GB server -->
  <!-- Reserve: 8 GB for OS + 8 GB for other services (HBase, DN, NM JVMs) -->
</property>

<!-- Node-level CPU vcores for YARN: -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>   <!-- All cores on a 16-core server -->
</property>

<!-- Scheduler min/max container sizes: -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>   <!-- No container gets < 1 GB -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value>   <!-- No container gets > 16 GB -->
</property>

<!-- Memory increment: allocations happen in multiples of this: -->
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>512</value>
</property>
🧠 Memory Map
MEMORY CALCULATION EXAMPLE — 10 node cluster, 64 GB each:
PER NODE
Total RAM: 64 GB
OS overhead: -4 GB
HDFS DataNode: -2 GB (hdfs heap)
YARN NodeManager: -2 GB (nm heap)
Available for YARN containers: 56 GB → set to 49152 MB (leave buffer)
CONTAINER SIZING
Mapper: 2048 MB (map container) → Java heap: 1638 MB (80%)
Reducer: 4096 MB (reduce container) → Java heap: 3276 MB (80%)
Containers per node: 49152 / 2048 = 24 mappers OR 12 reducers
TOTAL CLUSTER
10 nodes × 24 containers = 240 concurrent mapper slots
10 nodes × 12 containers = 120 concurrent reducer slots
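The container math above as a quick sanity-check script (same numbers as the example: 80% heap fraction, 49152 MB usable per node; the helper names are made up for illustration):

```python
def heap_mb(container_mb: int, fraction: float = 0.80) -> int:
    """Java heap (-Xmx) as a fraction of the YARN container size."""
    return int(container_mb * fraction)

def cluster_slots(nodes: int, node_mem_mb: int, container_mb: int) -> int:
    """Concurrent containers across the cluster for a given container size."""
    return nodes * (node_mem_mb // container_mb)

assert heap_mb(2048) == 1638                   # mapper heap
assert heap_mb(4096) == 3276                   # reducer heap
assert cluster_slots(10, 49152, 2048) == 240   # concurrent mapper slots
assert cluster_slots(10, 49152, 4096) == 120   # concurrent reducer slots
```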

Level 3: Hadoop Configuration Performance Settings

xml
<!-- hdfs-site.xml performance settings -->

<!-- Block size: increase for large sequential files -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>   <!-- 256 MB for large files (default: 128 MB) -->
</property>

<!-- DataNode transfer thread pool (how many concurrent block transfers): -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>   <!-- Default: 4096, can increase for high-throughput nodes -->
</property>

<!-- NameNode handler count (parallel RPC threads for client requests): -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value>
  <!-- Default: 10 → way too small for large clusters! -->
  <!-- Rule: 20 * log2(cluster_size) -->
  <!-- 100 nodes → 20 * log2(100) = 20 * 6.6 = 132 → set to 100-150 -->
</property>

<!-- Short-circuit reads: client reads directly from local DataNode (no network!) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>   <!-- If client is on same node as DataNode → read locally -->
</property>
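The handler-count rule of thumb from the comment above (20 × log2(cluster size)) as a one-liner, truncating the same way the comment's "= 132" does:

```python
import math

def handler_count(num_nodes: int) -> int:
    """dfs.namenode.handler.count rule of thumb: 20 * log2(cluster_size)."""
    return int(20 * math.log2(num_nodes))

print(handler_count(100))  # 132 → set dfs.namenode.handler.count to 100-150
```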
xml
<!-- mapred-site.xml performance settings — FULL REFERENCE -->

<!-- MAP task memory -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m -XX:+UseG1GC</value>   <!-- 80% of container + G1GC -->
</property>

<!-- REDUCE task memory -->
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m -XX:+UseG1GC</value>
</property>

<!-- Sort buffer: MOST IMPORTANT for shuffle performance -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>   <!-- Default: 100 MB → increase to 512 MB! -->
</property>

<!-- Sort spill threshold: spill to disk when X% of sort buffer used -->
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>

<!-- Parallel copies during shuffle (how many parallel transfers from mappers): -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>50</value>   <!-- Default: 5 → very low! Increase for faster shuffle -->
</property>

<!-- In-memory merge threshold (before writing merged spills to disk): -->
<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.70</value>   <!-- 70% of reducer heap used for in-memory shuffled data -->
</property>

<!-- Output compression -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- JVM reuse for jobs with many small tasks (classic MR1-era setting): -->
<!-- Note: JVM reuse was dropped in YARN/MR2; the closest analog there is
     uber mode (mapreduce.job.ubertask.enable) for very small jobs. -->
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>   <!-- Reuse same JVM for 10 tasks before restart -->
</property>

<!-- Number of parallel reducers: -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>10</value>   <!-- Default: 1 — always set this based on data volume! -->
</property>
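Two bits of arithmetic behind the settings above, sketched in Python: the ~80% heap-to-container ratio and the sort-buffer spill threshold (`xmx_opt` is a hypothetical helper name):

```python
SORT_MB = 512        # mapreduce.task.io.sort.mb
SPILL_PCT = 0.80     # mapreduce.map.sort.spill.percent

def xmx_opt(container_mb: int) -> str:
    """Heap flag at ~80% of the container (leaves room for JVM overhead)."""
    return f"-Xmx{int(container_mb * 0.8)}m"

assert xmx_opt(2048) == "-Xmx1638m"   # matches mapreduce.map.java.opts
assert xmx_opt(4096) == "-Xmx3276m"   # matches mapreduce.reduce.java.opts

spill_mb = SORT_MB * SPILL_PCT        # ≈ 409.6 MB: background spill starts here
```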

Level 4: Hive Performance Settings (Full Reference)

sql
-- HIVE PERFORMANCE SETTINGS — run at start of session or in hive-site.xml

-- MUST-SET EVERY TIME:
SET hive.execution.engine=tez;                        -- NOT mr!
SET hive.vectorized.execution.enabled=true;           -- batch row processing
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;                             -- Cost-Based Optimizer
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- JOIN OPTIMIZATION:
SET hive.auto.convert.join=true;                      -- auto map join for small tables
SET hive.mapjoin.smalltable.filesize=25000000;        -- 25 MB threshold
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=20971520;  -- 20 MB

-- SKEW HANDLING:
SET hive.groupby.skewindata=true;                     -- 2-phase GROUP BY
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;

-- DYNAMIC PARTITIONING:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=500;

-- SMALL FILE MERGING (prevents small files after INSERT):
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;          -- target 128 MB
SET hive.merge.size.per.task=268435456;               -- max 256 MB per output file

-- PARALLELISM:
SET hive.exec.parallel=true;                          -- run independent stages in parallel
SET hive.exec.parallel.thread.number=8;               -- max parallel threads

-- LLAP (if available):
SET hive.llap.execution.mode=auto;

Level 5: HDFS Balancer Tuning

bash
# HDFS Balancer: rebalance blocks across DataNodes
# When to run:
#   1. After adding new DataNodes (new nodes start empty!)
#   2. After replacing failed DataNodes
#   3. When some DataNodes are much more full than others

# Cap rebalancing bandwidth first (don't saturate the network):
hdfs dfsadmin -setBalancerBandwidth 104857600
# 104857600 bytes/s = max 100 MB/s per DataNode for balancing traffic

# Then run the balancer:
hdfs balancer -threshold 10
# -threshold 10: a node is "balanced" if within 10% of average utilization
# Without a bandwidth cap, the balancer can flood the production network → impacts jobs!

# Check balance:
hdfs dfsadmin -report
# Look for: "DFS Used%" column across DataNodes
# If one node is at 90% and another at 20% → run balancer

# Intra-DataNode balancer (Hadoop 3 — balance disks within one DataNode):
hdfs diskbalancer -plan datanode-hostname
# Writes a plan file (path is printed in the output); then execute it:
hdfs diskbalancer -execute datanode-hostname.plan.json
# Useful when one disk on a DataNode is much fuller than others
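What `-threshold 10` actually checks, sketched below: a node counts as balanced when its DFS Used% is within 10 percentage points of the cluster average. Illustrative logic only, not the balancer's actual source:

```python
def needs_balancing(used_pct, threshold=10.0):
    """True if any DataNode's DFS Used% is more than `threshold`
    percentage points away from the cluster average."""
    avg = sum(used_pct) / len(used_pct)
    return any(abs(p - avg) > threshold for p in used_pct)

assert needs_balancing([90.0, 20.0, 50.0]) is True    # 90% vs ~53% average
assert needs_balancing([48.0, 52.0, 50.0]) is False   # all within 10 points
```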

SECTION 3: DATA SKEW IN HADOOP — COMPLETE GUIDE

🧠 DATA SKEW = uneven data distribution across nodes/reducers
SYMPTOMS
MapReduce: 99 reducers done, 1 reducer running for hours
Hive: GROUP BY query takes much longer than expected
YARN: one container using 10x memory of others
ROOT CAUSES
Hot keys: "US" has 10M rows, "Fiji" has 100 rows → one reducer handles all US
Non-uniform partitioning: HashPartitioner with non-uniform key distribution
NULL values: all NULL keys go to same reducer by default
SOLUTIONS BY TYPE
1. MAPREDUCE SKEW → Custom Partitioner + Salting
Hot key "US" → spread as "US_0", "US_1", ..., "US_99" across 100 reducers
Then second step: combine partial aggregations back
2. HIVE GROUP BY SKEW → skewindata
SET hive.groupby.skewindata=true;
Hive automatically uses 2-phase aggregation
3. HIVE JOIN SKEW → skewjoin
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;
Hive splits the join: hot keys via separate broadcast, rest via shuffle join
4. NULL KEY SKEW → handle NULLs explicitly
-- Instead of: SELECT a.*, b.* FROM a JOIN b ON a.key = b.key
-- (all NULLs in a.key go to one reducer!)
-- Use:
SELECT a.*, b.*
FROM a JOIN b ON COALESCE(a.key, CONCAT('null_', RAND())) = b.key;
-- Random substitutes for NULL keys → distributed across reducers
5. SPARK (if migrated): repartition(), broadcast(), salting still apply
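The salting pattern from point 1, simulated end to end. The helper names are hypothetical, but the two-phase shape (salt the hot key → partial aggregate → strip the salt → merge) is the real technique:

```python
import random
from collections import Counter

def salt_key(key: str, hot_keys: set, buckets: int = 20) -> str:
    """Append a random suffix to hot keys so they spread across reducers."""
    return f"{key}_{random.randrange(buckets)}" if key in hot_keys else key

random.seed(42)
rows = ["US"] * 1000 + ["Fiji"] * 10          # heavily skewed toward "US"

# Phase 1: aggregate by salted key (what each reducer would see)
partial = Counter(salt_key(k, {"US"}) for k in rows)

# Phase 2: strip the salt and merge the partial counts
final = Counter()
for k, n in partial.items():
    final[k.split("_")[0]] += n

assert final["US"] == 1000 and final["Fiji"] == 10
assert sum(k.startswith("US_") for k in partial) > 1  # "US" load was spread
```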

SECTION 4: CLOUDERA CDP vs HDP vs APACHE HADOOP

APACHE HADOOP (bare): Open source, no support, you manage everything
Use for: learning, custom builds, research
HORTONWORKS HDP (Hortonworks Data Platform)
Was THE enterprise Hadoop distribution (2011-2019)
Merged with Cloudera in 2019
Legacy: many enterprises still run HDP 2.x or 3.x
Tools included: Ambari (management UI), Tez, ORC, Ranger, Atlas, Knox
EOL: HDP 2.x is end-of-life → companies migrating to CDP
CLOUDERA CDP (Cloudera Data Platform)
Combined Cloudera + Hortonworks (2019 merger)
On-premises: CDP Private Cloud
Cloud: CDP Public Cloud (AWS, Azure, GCP)
Management: Cloudera Manager (better than Ambari for large clusters)
Includes: CDW (Cloudera Data Warehouse), CML (ML), CDE (Data Engineering)
Key differentiator: SDX (Shared Data Experience) — unified security across all services
Atlas (data lineage + governance) is deeply integrated
INTERVIEW POSITIONING
"I've worked with HDP in production — Ambari for management, Ranger for security,
Atlas for lineage. With the Cloudera-Hortonworks merger, companies are moving to CDP.
I understand the migration path: HDP Ambari → Cloudera Manager, same security model."

SECTION 5: CLOUD MIGRATION PATTERNS

Pattern 1: Lift-and-Shift (HDFS → Cloud Storage)

🧠 Memory Map
WHAT: Move data from HDFS to cloud object storage (ADLS Gen2, S3, GCS)
Keep same processing (Hive/MapReduce) but on cloud compute
HOW
Tool: DistCp (Distributed Copy — uses MapReduce to copy files in parallel)
hadoop distcp \
  hdfs://namenode:8020/user/hive/warehouse/ \
  wasbs://container@storageaccount.blob.core.windows.net/hive/warehouse/
# wasbs = Windows Azure Storage Blob Secure
Or to S3:
hadoop distcp hdfs://namenode:8020/data/ s3a://my-bucket/data/
For large datasets — incremental copy:
hadoop distcp -update -skipcrccheck \
  hdfs://namenode/data/ \
  s3a://bucket/data/
# -update: skip files already copied (incremental)
# -skipcrccheck: S3 CRC differs from HDFS CRC (avoid false mismatches)
PROS
✓ Minimal code changes (Hive still works with ADLS/S3 as backend)
✓ Low risk, fast migration
✓ Can keep Hive metastore pointing to new paths
CONS
✗ Still paying for compute clusters (no serverless)
✗ MapReduce still disk-based and slow
✗ Misses opportunity to modernize processing
WHEN TO CHOOSE
→ Budget limited, need quick win
→ Large Hadoop investment, can't rewrite apps
→ As Phase 1 of a larger migration

Pattern 2: Replatform (MapReduce → Spark)

🧠 HIVE → SPARK SQL MIGRATION:
WHAT: Keep data in cloud storage, replace MapReduce processing with Spark
Data: HDFS → S3/ADLS
Processing: MapReduce Java → Spark (Python/Scala)
CHANGES
Old: MapReduce Job → YARN → HDFS
New: Spark Job → Databricks/EMR/HDInsight → S3/ADLS
MIGRATION STEPS
1. Copy HDFS data to S3/ADLS (DistCp)
2. Rewrite MapReduce jobs as Spark jobs (logic same, API different)
3. Migrate Hive tables: update LOCATION to cloud paths
4. Migrate Oozie workflows to Airflow/Databricks Workflows
5. Migrate Sqoop imports to ADF/Glue/Spark JDBC
HIVE → SPARK SQL MIGRATION:
Most Hive SQL is compatible with Spark SQL!
-- Old Hive:
SET hive.execution.engine=tez;
INSERT INTO bookings_silver PARTITION (booking_date)
SELECT ... FROM bookings_bronze WHERE booking_date >= '2024-01-01';
-- New Spark SQL (almost identical!):
spark.sql("""
INSERT INTO bookings_silver PARTITION (booking_date)
SELECT ... FROM bookings_bronze WHERE booking_date >= '2024-01-01'
""")
PROS
✓ 10-100x faster processing (Spark vs MapReduce)
✓ Python/SQL (no more Java)
✓ Unified streaming + batch
✓ Lower compute costs (Spark finishes faster = fewer cluster-hours)
CONS
✗ Rewriting all jobs takes time
✗ Spark expertise needed
✗ Still no ACID transactions without Delta Lake

Pattern 3: Refactor (Hive → Delta Lake / Databricks)

🧠 HIVE TABLE → DELTA LAKE TABLE:
WHAT: Full modernization to cloud-native lakehouse
Storage: HDFS → ADLS Gen2 or S3
Format: ORC/Parquet → Delta Lake (ACID transactions!)
Processing: Hive+Tez → Spark+Databricks
Orchestration: Oozie → Databricks Workflows / ADF / Airflow
Security: Ranger → Unity Catalog (Databricks)
HIVE TABLE → DELTA LAKE TABLE:
Old Hive (ORC):
CREATE EXTERNAL TABLE bookings (...)
STORED AS ORC
LOCATION 'hdfs://namenode/user/hive/warehouse/bookings/';
New Delta Lake:
CREATE TABLE bookings (...)
USING DELTA
LOCATION 'abfss://container@storage.dfs.core.windows.net/bookings/'
Convert existing ORC to Delta (in Databricks):
CONVERT TO DELTA parquet.`/mnt/data/bookings/` -- if Parquet
-- Or write new Delta table from Hive:
spark.table("hive_db.bookings").write.format("delta").save("/mnt/delta/bookings")
MIGRATION TOOL: Databricks Lakebridge (2025):
Official Databricks tool for Hive → Delta Lake migration
Auto-converts Hive DDL to Delta Lake CREATE TABLE statements
Migrates Hive Metastore → Unity Catalog
Translates HiveQL → Databricks SQL
Handles data copy from HDFS to cloud storage
Can assess compatibility: show what will break vs work automatically
PROS
✓ Full ACID (what Hive ACID never quite delivered)
✓ Time travel / data versioning
✓ Schema enforcement + evolution
✓ Unified batch + streaming on same tables
✓ Unity Catalog = better governance than Ranger + Atlas
✓ Serverless (no cluster management)
CONS
✗ Largest investment (rewrite + retrain team)
✗ Vendor lock-in (Databricks)
✗ Cost of migration project
WHEN TO CHOOSE
→ Greenfield or major modernization initiative
→ Current Hadoop cost is very high (license + hardware)
→ Team is growing and needs SQL-first tooling
→ Need ACID and time travel on large tables

Hadoop vs Spark — Complete Comparison

FEATURE            HADOOP MAPREDUCE            APACHE SPARK
─────────────────────────────────────────────────────────────────────
Storage            HDFS                        HDFS, S3, ADLS, local
Processing model   Disk-based                  In-memory (up to 100x faster)
API                Java only                   Python, Scala, Java, R, SQL
Fault tolerance    Re-run from HDFS            Re-compute from DAG lineage
Streaming          Kafka → HDFS → MR           Spark Streaming (micro-batch)
                   (latency: minutes)          (latency: seconds)
Machine Learning   Mahout (weak)               MLlib, TensorFlow (strong)
Interactive SQL    Hive via MR (slow)          Spark SQL (fast)
ACID               Hive ACID (complex)         Delta Lake (native)
Latency            Minutes to hours            Seconds to minutes
Code complexity    High (Java verbose)         Low (Python/SQL)
When still used    Legacy code,                New development,
                   HBase on-prem,              cloud-native
                   budget constraints          lakehouse workloads
10-YEAR ENGINEER'S ANSWER:
"I've built production MapReduce pipelines and I understand the disk-based model.
In practice, I moved our critical pipelines to Spark years ago — the speed difference
is real. MapReduce still runs in companies because of existing code investment.
But for new development, there's no reason to use MapReduce over Spark today."

SECTION 6: HADOOP 3 FEATURES — DEEP DIVE

Erasure Coding

🧠 ERASURE CODING = alternative to 3x replication for cold/archive data
OLD WAY (Replication)
Store 1 GB file → 3 copies → 3 GB used on HDFS
Storage overhead: 200% (3x replication = 200% overhead)
ERASURE CODING
Default EC policy: RS-6-3-1024k (Reed-Solomon, 6 data + 3 parity, 1 MB cells)
Store 1 GB file → 6 data blocks + 3 parity blocks = 9 blocks
9 blocks total, each 1/6 size → total = 9/6 = 1.5x storage
Storage overhead: 50% (vs 200% for 3x replication)
SAVINGS: 50% storage reduction!
MATH EXAMPLE
Without EC: 128 MB block × 3 copies = 384 MB used
With EC RS(6,3): 128 MB data → 6 × (~21.3 MB data blocks) + 3 × (~21.3 MB parity blocks)
= 9 × ~21.3 MB ≈ 192 MB used (1.5x overhead vs 3x overhead)
TRADEOFFS
✓ 50% storage savings (saves money!)
✗ Higher CPU overhead for reconstruction (when a node fails)
✗ Cannot reconstruct if more than 3 nodes fail (lose 4+ → data loss)
✗ Not good for hot data (reconstruction is slow on read)
USE ERASURE CODING FOR
✓ Cold/archive data (accessed rarely)
✓ Large files in archive directories
✓ /archive, /backup, /historical paths
KEEP 3x REPLICATION FOR:
✓ Hot production data (fast recovery needed)
✓ Small files (EC overhead not worth it for small files)
✓ Data accessed frequently by MapReduce (data locality doesn't work well with EC)
bash
# Set EC policy on a directory:
hdfs ec -setPolicy -policy RS-6-3-1024k -path /data/archive/2020/
# All new files in /data/archive/2020/ will use EC instead of 3x replication

# Check current EC policy:
hdfs ec -getPolicy -path /data/archive/2020/

# List available EC policies:
hdfs ec -listPolicies

# Disable EC (back to replication) on a path:
hdfs ec -unsetPolicy -path /data/archive/2020/
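The RS-6-3 storage math as a quick check (using the exact 128/6 ≈ 21.3 MB block size; `raw_factor` is an illustrative helper name):

```python
def raw_factor(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under Reed-Solomon erasure coding."""
    return (data_units + parity_units) / data_units

assert raw_factor(6, 3) == 1.5        # RS-6-3 → 1.5x raw storage

block_mb = 128
print(block_mb * raw_factor(6, 3))    # 192.0 MB raw with EC
print(block_mb * 3)                   # 384 MB raw with 3x replication
```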

SECTION 7: MOCK INTERVIEW — Top 10 Questions for 10-Year Engineers

Q1: "Walk me through the complete lifecycle of a Hive query"

"When I run SELECT country, COUNT(*) FROM bookings GROUP BY country in Hive:
1. Driver receives query, initiates session
2. Compiler sends query to Metastore: 'where is bookings table?'
Metastore returns: HDFS path, ORC format, partition scheme
3. Parser builds AST (Abstract Syntax Tree)
4. Semantic analyzer validates: table exists? columns exist? permissions OK?
5. Optimizer (CBO if stats exist): decides join type, partition pruning, predicate pushdown
6. Physical plan: break into stages (Stage 1: MAP + partial agg, Stage 2: reduce+sort)
7. With Tez: translate plan to Tez DAG (not MapReduce chain)
8. Submit to YARN: YARN allocates containers for Tez workers
9. Each Tez task reads ORC file from HDFS, uses vectorization (1024 rows at a time)
10. Results aggregated, written to temp HDFS path
11. Driver returns results to client
Key optimizations I'd verify: execution.engine=tez, vectorization=true,
ORC format, partition pruning active, CBO stats current."

Q2: "How did you handle data skew in a production pipeline?"

"I had a job grouping e-commerce transactions by country — US had 70% of all data.
One reducer was running for 6 hours while others finished in 20 minutes.
My fix: two-step approach.
Step 1: I set hive.groupby.skewindata=true in Hive — that fixed the Hive queries.
For MapReduce jobs: I implemented salting. I appended a random suffix to US keys:
'US_0', 'US_1' ... 'US_19' to distribute US data across 20 reducers.
Then a second MapReduce job merged the partial aggregations.
I also ran ANALYZE TABLE to let CBO make better decisions.
The job went from 6 hours to 45 minutes."

Q3: "HDFS NameNode HA failed. Active NameNode is dead, Standby is not taking over. Debug?"

"I'd debug in layers:
1. Check ZooKeeper: is ZK quorum healthy?
echo 'ruok' | nc zk-host 2181 → expects 'imok'
If ZK is down → ZKFC can't elect new Active → failover stuck!
2. Check ZKFC on both NameNode hosts:
systemctl status hadoop-hdfs-zkfc
Look for: 'ZKFC is connected to ZooKeeper' vs 'Connection timeout'
3. Check FENCING: Did fencing script work on old Active?
Check logs: /var/log/hadoop-hdfs/hadoop-hdfs-namenode.log on standby
Look for: 'Fencing successful' or 'Fencing FAILED'
If fencing failed: Standby refuses to promote (correctly! prevents split-brain)
4. Force failover manually (with caution):
hdfs haadmin -failover --forceactive nn1 nn2 ← fails over FROM nn1 TO nn2
Only after confirming old Active is truly dead!
5. If Standby is way behind on edit logs:
hdfs namenode -bootstrapStandby
Re-sync Standby from JournalNodes"

Q4: "How would you migrate our 500 TB Hadoop cluster to Databricks on Azure?"

🧠 Memory Map
"I'd phase it over 6 months:
Phase 1 (Month 1-2): Assessment
Inventory all tables, jobs, data volumes, SLAs
Identify: HiveQL compatibility with Spark SQL (most is compatible!)
Find blockers: Hive UDFs in Java → need to rewrite in Python
Use Databricks Lakebridge for automated assessment
Phase 2 (Month 2-3): Infrastructure + Data Migration
Set up ADLS Gen2 as new storage layer
Use DistCp to copy HDFS → ADLS (start with cold/historical data first)
Convert ORC tables to Delta Lake format
Set up Databricks workspace + Unity Catalog for governance
Phase 3 (Month 3-5): Job Migration (start with easy wins)
Migrate Hive SQL jobs first (SQL is most compatible)
Migrate Oozie → Databricks Workflows
Migrate Sqoop → Azure Data Factory + Databricks autoloader
Keep old Hadoop running for validation (dual-run)
Phase 4 (Month 5-6): Cutover + Decommission
Run old and new in parallel for 2-4 weeks, compare results
Cut over teams one by one (analytics first, ETL pipeline last)
Decommission Hadoop cluster after validation period
RISK MITIGATION
Never do big-bang migration (always parallel run)
Start with read workloads, then write workloads
Hive ACID tables → need careful migration to Delta Lake (preserve row-level changes)
Test SLAs: some Hive batch jobs may need Spark optimization to meet same SLAs"

Q5: "What are the biggest performance problems you've seen in Hadoop at scale?"

"Three big ones from production experience:
1. NameNode memory pressure:
50 million small files → NameNode OOM crash
Fix: HAR files for archive, Hive merge settings for new writes,
NameNode heap increased to 32 GB with G1GC
2. Shuffle bottleneck in MapReduce:
Sort buffer at default 100 MB → constant disk spill
Fix: increased mapreduce.task.io.sort.mb to 512 MB + enabled Snappy compression
Result: 60% job time reduction
3. Hive queries on Text format tables:
DBA created tables as TEXT (CSV) — every query scanned billions of rows fully
Fix: converted all production tables to ORC + enabled vectorization
Result: 15x query speed improvement for analytical queries
The Hadoop small files problem is the most insidious — it degrades NameNode
stability gradually and only becomes critical when it's already a crisis."

SECTION 8: CAPACITY PLANNING

🧠 Memory Map
HADOOP CLUSTER SIZING GUIDELINES (10-year engineer reference)
STORAGE SIZING
Raw data size × replication factor × 1.25 (growth headroom)
Example: 100 TB raw data × 3 replication × 1.25 = 375 TB raw disk needed
With Erasure Coding (for cold data): 100 TB × 1.5 × 1.25 = 187.5 TB
COMPUTE SIZING
Determine: daily data processed + concurrent user queries
Rule of thumb: 1 compute node per 5 TB of hot data
For Hive/YARN: plan for peak concurrency (how many jobs run simultaneously)
DataNode spec: 64-128 GB RAM, 12-24 cores, 10 × 4 TB drives (JBOD, not RAID)
NameNode spec: 32-128 GB RAM (depends on file count!), fast SSDs
NAMENODE MEMORY FORMULA
Each file/block = ~150 bytes in NameNode heap
10 million files → 10M × 150 bytes = 1.5 GB
100 million files → 100M × 150 bytes = 15 GB minimum
Add 50% safety buffer → 22.5 GB → round to 32 GB
NODE COUNT (rough guideline)
NameNode: 2 (Active + Standby)
ZooKeeper: 3 or 5
JournalNode: 3 (can co-locate with ZooKeeper)
DataNode: depends on storage needs
YARN ResourceManager: 2 (Active + Standby)
HiveServer2: 2-4 (multiple for load balancing)
HBase RegionServer: co-locate with DataNodes (data locality!)
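The sizing rules above bundled into one rough calculator. All of these are rules of thumb for planning conversations, not exact models, and the function names are made up for illustration:

```python
def raw_disk_tb(raw_data_tb: float, replication: float = 3.0,
                growth: float = 1.25) -> float:
    """Disk to provision: raw data x replication x growth headroom."""
    return raw_data_tb * replication * growth

def namenode_heap_gb(num_files: int, bytes_per_object: int = 150,
                     safety: float = 1.5) -> float:
    """NameNode heap: ~150 bytes per file/block object + 50% safety buffer."""
    return num_files * bytes_per_object * safety / 1e9

assert raw_disk_tb(100) == 375.0                    # 100 TB × 3 × 1.25
assert raw_disk_tb(100, replication=1.5) == 187.5   # erasure-coded cold data
assert namenode_heap_gb(100_000_000) == 22.5        # → round up to 32 GB
```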