Day 3: Performance, Security + Cloud Migration — Quick Recall Guide
Legend: Must remember · Key concept · Common trap · Memory Map · One-liner
🧠 MASTER MEMORY MAP — Day 3
🧠 HADOOP SECURITY = "KRK" (Kerberos → Ranger → Knox):
K = Kerberos: AUTHENTICATION (who are you? → TGT from KDC)
R = Ranger: AUTHORIZATION (what can you do? → table/column/row policies)
K = Knox: GATEWAY (how do you get in? → SSL + single entry point)
PERFORMANCE TUNING = "JYH-COMP":
J = JVM: heap sizes (NameNode heap ≈ 1 GB per million files!), G1GC
Y = YARN: container sizes, node memory, scheduler config
H = Hadoop: block size, handler count, short-circuit reads
C = Compression: Snappy for shuffle, GZIP/ZLIB for archive
O = Optimize sort buffer: 100 MB → 512 MB (mapreduce.task.io.sort.mb)
M = More reducers: 1 → nodes × containers × 0.95
P = Parallel shuffle copies: mapreduce.reduce.shuffle.parallelcopies, default 5 → 50
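The sizing rules in this list are plain arithmetic, so they are easy to sanity-check. Below is a minimal Python sketch, assuming a hypothetical cluster of 10 nodes with 8 containers each and 50 million files. The function names and cluster numbers are illustrative, not from the guide, and the results are rule-of-thumb starting points rather than exact requirements. The property names are the standard MapReduce ones (the guide abbreviates the last as "parallelcopies").

```python
import math

def namenode_heap_gb(num_files: int) -> int:
    """Rule of thumb from the list: about 1 GB of NameNode heap per million files."""
    return max(1, math.ceil(num_files / 1_000_000))

def suggested_reducers(nodes: int, containers_per_node: int) -> int:
    """Reducer count = nodes x containers x 0.95, rounded down.

    Integer arithmetic (x 95 // 100) avoids floating-point rounding surprises.
    """
    return max(1, (nodes * containers_per_node * 95) // 100)

# Hypothetical cluster: 10 nodes, 8 containers per node, 50 million files.
heap = namenode_heap_gb(50_000_000)
reducers = suggested_reducers(10, 8)

# Job-level settings named in the list, with the tuned values.
job_conf = {
    "mapreduce.task.io.sort.mb": 512,
    "mapreduce.reduce.shuffle.parallelcopies": 50,
    "mapreduce.job.reduces": reducers,
}

print(f"NameNode heap: about {heap} GB")
for key, value in job_conf.items():
    # Printed in the -Dkey=value form accepted by Hadoop's generic options.
    print(f"-D{key}={value}")
```

The factor 0.95 leaves a little headroom so that all reducers can launch in a single wave once the maps finish.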
CLOUD MIGRATION = "LRR":
L = Lift-and-Shift: HDFS → S3/ADLS + same code (fastest, least benefit)
R = Replatform: MapReduce → Spark (same data, better processing)
R = Refactor: Hive → Delta Lake + Databricks (full modernization)
HADOOP vs SPARK = "Disk vs RAM":
MapReduce: disk-based, slow, Java-centric (other languages only via Hadoop Streaming), fault-tolerant
Spark: in-memory, up to 100x faster on in-memory workloads, Python/SQL/Scala, streaming + batch
SECTION 1: SECURITY — DIRECT QUESTIONS
⚡ Q1: What is Kerberos in Hadoop?
An authentication protocol. Users and services prove their identity to the KDC (Key Distribution Center) and receive cryptographic tickets. Without Kerberos, Hadoop accepts any claimed identity: zero security.
Q2: What is a keytab file?
A file containing pre-stored Kerberos credentials (principal names and their encrypted keys) for a service account. Automated processes (Oozie jobs, cron) use it to authenticate without a password prompt.
Example: kinit -kt /etc/keytabs/hive.keytab hive/host@REALM
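For scheduled jobs, the keytab login is usually wrapped in a script. The following is a minimal Python sketch, assuming the kinit binary is on the PATH and reusing the keytab path and principal from the example in this answer. The helper names kinit_command and login_from_keytab are hypothetical.

```python
import shlex
import subprocess
from typing import List

def kinit_command(keytab: str, principal: str) -> List[str]:
    """Build the argv for a keytab-based login: kinit -kt <keytab> <principal>."""
    return ["kinit", "-kt", keytab, principal]

def login_from_keytab(keytab: str, principal: str) -> None:
    """Obtain a TGT without a password prompt.

    Raises subprocess.CalledProcessError if kinit exits with a non-zero status.
    """
    subprocess.run(kinit_command(keytab, principal), check=True)

# Build (but do not run) the command from the example above.
cmd = kinit_command("/etc/keytabs/hive.keytab", "hive/host@REALM")
print(shlex.join(cmd))
```

Calling login_from_keytab(...) at the start of a cron or Oozie wrapper script gets a fresh ticket before the job starts. Passing the arguments as a list instead of a shell string avoids quoting problems with unusual paths.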
⚠️ Q3: What happens if a Kerberos ticket expires during a long-running job?
Job fails with "Au