Hadoop Commands Reference — Interview Quick-Fire Guide

Hadoop · Section 8 of 8
Purpose: Every command interviewers test, with real examples and traps to avoid
Experience: 10 years — interviewers expect you to type these from memory
Levels: ⬜ Direct (what/define) | 🟨 Mid-level (how/why) | 🟥 Scenario (debug/fix)
Format: What it does → Command syntax → Practical example → Interview tip

SECTION 1: HDFS COMMANDS (hadoop fs / hdfs dfs)

💡 Interview Tip
Key distinction: hadoop fs works with ANY filesystem (HDFS, S3, local). hdfs dfs works ONLY with HDFS. In interviews, use hdfs dfs — it shows you know the difference.

ls / ls -R — List files and directories

What it does: Lists files and directories in HDFS, similar to Linux ls.

Syntax:

bash
hdfs dfs -ls <path>
hdfs dfs -ls -R <path>          # Recursive listing (all subdirectories)
hdfs dfs -ls -h <path>          # Human-readable file sizes (KB, MB, GB)

Practical example:

bash
# List all files in the bookings directory
hdfs dfs -ls /data/amadeus/bookings/
# Output:
# -rw-r--r--   3 krishna hadoop  1073741824 2026-03-25 14:30 /data/amadeus/bookings/booking_2026.orc

# Recursive list to see all partitions under a Hive table
hdfs dfs -ls -R /user/hive/warehouse/bookings_db.db/flights/

# List with human-readable sizes
hdfs dfs -ls -h /data/amadeus/bookings/
# Output: shows 1.0 G instead of 1073741824

Interview tip: The output columns are: permissions, replication factor, owner, group, size (bytes), date, time, path. The replication factor column is what catches people — they forget it's there. Files show replication (e.g., 3), directories show -.

mkdir / mkdir -p — Create directories

What it does: Creates directories in HDFS. -p creates parent directories if they don't exist.

Syntax:

bash
hdfs dfs -mkdir <path>
hdfs dfs -mkdir -p <path>       # Create parents (like Linux mkdir -p)

Practical example:

bash
# Create a single directory (parent must exist)
hdfs dfs -mkdir /data/amadeus/

# Create full directory tree in one shot
hdfs dfs -mkdir -p /data/amadeus/bookings/year=2026/month=03/day=25

# Common pattern: create staging + final directories
hdfs dfs -mkdir -p /data/staging/flights/
hdfs dfs -mkdir -p /data/processed/flights/

Interview tip: Without -p, the command fails if the parent doesn't exist. Always use -p in scripts and pipelines — it's idempotent (safe to run multiple times).

put / copyFromLocal — Upload files to HDFS

What it does: Copies a file from the local filesystem to HDFS. put and copyFromLocal are almost identical; put also reads from stdin.

Syntax:

bash
hdfs dfs -put <localPath> <hdfsPath>
hdfs dfs -copyFromLocal <localPath> <hdfsPath>
hdfs dfs -put -f <localPath> <hdfsPath>     # Overwrite if exists

Practical example:

bash
# Upload a local CSV to HDFS
hdfs dfs -put /home/krishna/booking_data.csv /data/amadeus/staging/

# Upload and overwrite if file already exists
hdfs dfs -put -f /home/krishna/daily_export.csv /data/amadeus/staging/daily_export.csv

# Upload multiple files at once
hdfs dfs -put /home/krishna/logs/*.log /data/amadeus/raw_logs/

Interview tip: put fails if the destination file already exists (unless you use -f). The trap question: "What's the difference between put and copyFromLocal?" Answer: put can also read from stdin (echo "test" | hdfs dfs -put - /data/test.txt), while copyFromLocal only works with local files. In practice, they're interchangeable for file uploads.

get / copyToLocal — Download files from HDFS

What it does: Copies a file from HDFS to the local filesystem.

Syntax:

bash
hdfs dfs -get <hdfsPath> <localPath>
hdfs dfs -copyToLocal <hdfsPath> <localPath>

Practical example:

bash
# Download a single file from HDFS
hdfs dfs -get /data/amadeus/reports/monthly_summary.csv /home/krishna/

# Download an entire directory
hdfs dfs -get /data/amadeus/bookings/year=2026/month=03/ /home/krishna/march_data/

# Download and overwrite existing local file
hdfs dfs -get -f /data/amadeus/reports/latest.csv /home/krishna/latest.csv

Interview tip: get downloads to the edge node's local filesystem, not to your laptop. For large files, prefer processing in HDFS rather than downloading — that defeats the purpose of distributed storage.

cat / head / tail — View file content

What it does: Reads and displays file content from HDFS.

Syntax:

bash
hdfs dfs -cat <hdfsPath>
hdfs dfs -head <hdfsPath>           # First 1 KB of file
hdfs dfs -tail <hdfsPath>           # Last 1 KB of file
hdfs dfs -cat <hdfsPath> | head -20 # First 20 lines (pipe to local head)

Practical example:

bash
# Quick peek at a small file
hdfs dfs -cat /data/amadeus/config/etl_params.json

# View first 1 KB of a large log file
hdfs dfs -head /data/amadeus/raw_logs/access.log

# View last 1 KB (check latest entries)
hdfs dfs -tail /data/amadeus/raw_logs/access.log

# Practical: view first 20 lines of a CSV to check schema
hdfs dfs -cat /data/amadeus/staging/bookings.csv | head -20

Interview tip: NEVER cat a large file (GBs) — it will stream the entire file to your terminal and kill your session. Always pipe to head for large files. head and tail in HDFS show only 1 KB, not lines — different from Linux.

mv / cp — Move and copy within HDFS

What it does: mv moves/renames files within HDFS. cp copies files within HDFS.

Syntax:

bash
hdfs dfs -mv <source> <destination>
hdfs dfs -cp <source> <destination>

Practical example:

bash
# Move processed files from staging to final location
hdfs dfs -mv /data/staging/bookings_2026.orc /data/processed/bookings_2026.orc

# Rename a file
hdfs dfs -mv /data/amadeus/old_name.csv /data/amadeus/new_name.csv

# Copy a file (creates a new copy with full replication)
hdfs dfs -cp /data/amadeus/bookings/current.orc /data/amadeus/bookings/backup_current.orc

# Move entire directory
hdfs dfs -mv /data/staging/batch_20260325/ /data/processed/batch_20260325/

Interview tip: mv within HDFS is a metadata-only operation (instant, no data movement) as long as source and destination are in the same filesystem. cp actually copies the data blocks — slow for large files. This is why Hive partition operations using ALTER TABLE ... SET LOCATION are fast — it's just a metadata change.

rm / rm -r — Delete files and directories

What it does: Deletes files or directories from HDFS. Deleted items go to the HDFS Trash (if enabled).

Syntax:

bash
hdfs dfs -rm <filePath>                 # Delete a file
hdfs dfs -rm -r <directoryPath>         # Delete directory recursively
hdfs dfs -rm -r -skipTrash <path>       # Permanent delete (bypass Trash)
hdfs dfs -rm -r -f <path>              # Force delete (no error if path missing)

Practical example:

bash
# Delete a single file
hdfs dfs -rm /data/staging/temp_file.csv

# Delete a directory and all contents
hdfs dfs -rm -r /data/staging/batch_20260324/

# Permanent delete when disk is full (bypass trash)
hdfs dfs -rm -r -skipTrash /data/old_logs/2024/

# Safe delete in scripts (no error if already deleted)
hdfs dfs -rm -r -f /data/staging/temp_dir/

Interview tip: By default, deleted files go to /user/&lt;username&gt;/.Trash/. The Trash auto-purge interval is set by fs.trash.interval in core-site.xml (default: 0 = trash disabled; a common setting is 1440 minutes = 24 hours). If the cluster is running low on space, use -skipTrash. Interview trap: "A DataNode disk is 95% full but you deleted files yesterday — why?" Answer: the files are still in Trash.

du / du -s / du -h — Disk usage

What it does: Shows disk space used by files and directories in HDFS.

Syntax:

bash
hdfs dfs -du <path>                  # Size of each item in the path
hdfs dfs -du -s <path>              # Summary (total size of directory)
hdfs dfs -du -s -h <path>           # Human-readable summary

Practical example:

bash
# Check size of each subdirectory
hdfs dfs -du -h /data/amadeus/
# Output:
# 12.5 G  37.4 G  /data/amadeus/bookings
# 3.2 G   9.6 G   /data/amadeus/flights
# 156.7 M 470.1 M /data/amadeus/config
# First number = raw size, Second number = space with replication

# Total size of one directory
hdfs dfs -du -s -h /data/amadeus/bookings/
# Output: 12.5 G  37.4 G  /data/amadeus/bookings

# Check which partitions are largest (find data skew)
hdfs dfs -du -h /user/hive/warehouse/bookings_db.db/flights/year=2026/

Interview tip: du shows TWO numbers: raw file size and actual disk consumed (raw x replication factor). If replication=3, the second number is 3x the first. Interviewers ask: "Your HDFS is 80% full but data is only 10 TB — why?" Answer: with replication 3, 10 TB actually consumes 30 TB of disk.
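The replication math in that answer is plain arithmetic and worth being able to do on the spot; the numbers below are the illustrative 10 TB / replication-3 figures from the tip, not cluster measurements:

```shell
# Disk actually consumed = raw size (du's first column) x replication factor.
raw_tb=10          # raw data size, illustrative
replication=3      # cluster default replication
consumed_tb=$((raw_tb * replication))
echo "${raw_tb} TB raw at replication ${replication} occupies ${consumed_tb} TB of disk"
```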

chmod / chown — Permissions management

What it does: Changes file/directory permissions (chmod) or ownership (chown) in HDFS.

Syntax:

bash
hdfs dfs -chmod <permissions> <path>
hdfs dfs -chmod -R <permissions> <path>     # Recursive
hdfs dfs -chown <owner>:<group> <path>
hdfs dfs -chown -R <owner>:<group> <path>   # Recursive

Practical example:

bash
# Give read-write-execute to owner, read-execute to group and others
hdfs dfs -chmod 755 /data/amadeus/bookings/

# Recursively set permissions for a Hive table directory
hdfs dfs -chmod -R 770 /user/hive/warehouse/bookings_db.db/

# Change ownership (for service accounts)
hdfs dfs -chown -R hive:hadoop /user/hive/warehouse/bookings_db.db/

# Give only the ETL service account write access
hdfs dfs -chown etl_user:etl_group /data/staging/
hdfs dfs -chmod 750 /data/staging/

Interview tip: HDFS permissions work like POSIX (Unix) permissions but with an important difference: there's no setuid/setgid concept. Also, HDFS has ACLs (Access Control Lists) for fine-grained permissions beyond the basic owner/group/others model. In production, Apache Ranger is used instead of raw chmod/chown for enterprise authorization.

count — File and directory count

What it does: Counts the number of directories, files, and total bytes under a path.

Syntax:

bash
hdfs dfs -count <path>
hdfs dfs -count -q <path>          # Include quota information
hdfs dfs -count -h <path>          # Human-readable sizes

Practical example:

bash
# Count files under a Hive table directory
hdfs dfs -count /user/hive/warehouse/bookings_db.db/flights/
# Output: 365  4380  45231965798  /user/hive/warehouse/bookings_db.db/flights/
# Meaning: 365 directories, 4380 files, 45.2 GB total

# Human-readable format
hdfs dfs -count -h /user/hive/warehouse/bookings_db.db/flights/
# Output: 365  4380  42.1 G  /user/hive/warehouse/bookings_db.db/flights/

# Check quota usage
hdfs dfs -count -q -h /data/amadeus/

Interview tip: This is the go-to command for detecting the small files problem. If count shows 100,000 files but only 5 GB total, you have a small files problem (average file size = 50 KB, should be 128-256 MB). Each file consumes ~150 bytes of NameNode memory, so 100 million small files can crash the NameNode.
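The thresholds in that tip reduce to simple arithmetic; a sketch using the tip's illustrative numbers (100,000 files holding 5 GB, and roughly 150 bytes of NameNode heap per namespace object):

```shell
# Average file size for 100,000 files holding 5 GB total.
files=100000
total_bytes=$((5 * 1024 * 1024 * 1024))
avg_kb=$((total_bytes / files / 1024))     # ~52 KB, far below the 128 MB block size
echo "average file size: ${avg_kb} KB (target: 128-256 MB)"

# Rough NameNode heap estimate: ~150 bytes per namespace object;
# each small file is at least 2 objects (1 file entry + 1 block).
small_files=100000000
heap_gb=$((small_files * 2 * 150 / 1024 / 1024 / 1024))
echo "~${heap_gb} GB of NameNode heap just for metadata"
```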

stat — File statistics

What it does: Displays statistics about a file or directory in a custom format.

Syntax:

bash
hdfs dfs -stat <format> <path>
# Format specifiers:
# %b = file size (bytes), %n = name, %o = block size
# %r = replication factor, %y = modification date

Practical example:

bash
# Get replication factor of a file
hdfs dfs -stat %r /data/amadeus/bookings/booking_2026.orc
# Output: 3

# Get block size
hdfs dfs -stat %o /data/amadeus/bookings/booking_2026.orc
# Output: 134217728  (128 MB)

# Get file size and modification time
hdfs dfs -stat "%b %y" /data/amadeus/bookings/booking_2026.orc
# Output: 1073741824 2026-03-25 14:30:22

# Quick check: is replication factor correct?
hdfs dfs -stat "%n: replication=%r, blocksize=%o" /data/amadeus/bookings/booking_2026.orc
# Output: booking_2026.orc: replication=3, blocksize=134217728

Interview tip: Use stat to quickly verify replication factor and block size during troubleshooting. If a critical file has replication=1, it's a single point of failure. Default block size is 128 MB (Hadoop 2+), was 64 MB in Hadoop 1.
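A file's block count follows from these two stat values by ceiling division; the sizes below reproduce the 1 GB file / 128 MB block example above:

```shell
# Number of HDFS blocks = ceil(file_size / block_size), as integer arithmetic.
file_size=1073741824      # stat %b: 1 GB
block_size=134217728      # stat %o: 128 MB
blocks=$(( (file_size + block_size - 1) / block_size ))
echo "${blocks} blocks"
```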

touchz — Create empty file

What it does: Creates a zero-length (empty) file in HDFS. Returns an error if the path already exists as a directory or as a file with non-zero length.

Syntax:

bash
hdfs dfs -touchz <path>

Practical example:

bash
# Create a _SUCCESS flag file to signal ETL completion
hdfs dfs -touchz /data/amadeus/bookings/year=2026/month=03/day=25/_SUCCESS

# Create a lock file for job coordination
hdfs dfs -touchz /data/amadeus/locks/daily_etl.lock

# Create marker file for downstream pipelines
hdfs dfs -touchz /data/staging/batch_complete_20260325

Interview tip: touchz is commonly used to create _SUCCESS flag files that signal job completion. Oozie and custom ETL pipelines check for these files before triggering downstream jobs. Unlike Linux touch, HDFS touchz errors out if the path exists as a directory or as a file with non-zero length; re-running it on an existing zero-length flag file succeeds, so it will never truncate real data.

setrep — Set replication factor

What it does: Changes the replication factor of a file or directory in HDFS.

Syntax:

bash
hdfs dfs -setrep <replication> <path>
hdfs dfs -setrep -R <replication> <path>    # Recursive
hdfs dfs -setrep -w <replication> <path>    # Wait until replication completes

Practical example:

bash
# Increase replication for critical booking data
hdfs dfs -setrep 5 /data/amadeus/bookings/current_month.orc

# Reduce replication for cold/archive data to save space
hdfs dfs -setrep -R 2 /data/amadeus/archive/2024/

# Set replication for hot data and wait for completion
hdfs dfs -setrep -w 3 /data/amadeus/bookings/today.orc

# Reduce replication on staging data (temporary, don't need 3 copies)
hdfs dfs -setrep -R 1 /data/staging/

Interview tip: Changing replication factor is async — the command returns immediately but blocks are replicated/removed in the background. Use -w to wait. Interview question: "How do you handle hot vs cold data in HDFS?" Answer: hot data at replication 3, cold/archive data at replication 2, staging at replication 1. In Hadoop 3, use erasure coding instead of reducing replication — it gives fault tolerance with only 1.5x overhead vs 3x.

getmerge — Merge HDFS files to local

What it does: Merges multiple HDFS files into a single local file. Useful for exporting MapReduce/Hive output.

Syntax:

bash
hdfs dfs -getmerge <hdfsDir> <localFile>
hdfs dfs -getmerge -nl <hdfsDir> <localFile>   # Add newline between files

Practical example:

bash
# Merge all MapReduce output parts into one file
hdfs dfs -getmerge /output/job_20260325/ /home/krishna/merged_output.csv

# Merge Hive query output (multiple part-00000 files)
hdfs dfs -getmerge /user/hive/warehouse/temp_results/ /home/krishna/query_results.txt

# Add newline separator between merged files
hdfs dfs -getmerge -nl /output/daily_reports/ /home/krishna/full_report.csv

Interview tip: getmerge downloads to the LOCAL filesystem, not HDFS. It's useful for getting MapReduce/Hive output that's split across many part-00000, part-00001 files into one usable file. Warning: don't use this on huge directories — it's pulling everything to one machine.

fsck — Filesystem check

What it does: Checks the health of the HDFS filesystem, reports missing blocks, under-replicated blocks, and corrupt files.

Syntax:

bash
hdfs fsck <path> [options]
hdfs fsck <path> -files              # List all files
hdfs fsck <path> -blocks             # Show block information
hdfs fsck <path> -locations          # Show block locations (which DataNodes)
hdfs fsck <path> -racks              # Show rack information
hdfs fsck <path> -files -blocks -locations   # Full detail

Practical example:

bash
# Check overall cluster health
hdfs fsck /

# Check health of a specific directory
hdfs fsck /data/amadeus/bookings/ -files -blocks -locations
# Output shows:
# Total files: 4380
# Total blocks: 13140
# Minimally replicated blocks: 13140 (100.0%)
# Under-replicated blocks: 0 (0.0%)
# Missing blocks: 0 (0.0%)
# Corrupt blocks: 0

# Find which DataNodes hold blocks of a specific file
hdfs fsck /data/amadeus/bookings/booking_2026.orc -files -blocks -locations
# Output:
# /data/amadeus/bookings/booking_2026.orc 1073741824 bytes, 8 block(s):
#   blk_1073741825 len=134217728 [10.0.0.5:9866, 10.0.0.7:9866, 10.0.0.12:9866]
#   blk_1073741826 len=134217728 [10.0.0.3:9866, 10.0.0.8:9866, 10.0.0.15:9866]

# Check for corrupt or missing blocks across the cluster
hdfs fsck / -list-corruptfileblocks

Interview tip: hdfs fsck is THE command for diagnosing data loss and under-replication. If you see "Missing blocks" > 0, data is potentially lost. "Under-replicated" means blocks have fewer copies than the replication factor — not yet lost but at risk. Interviewers love: "You get an alert that HDFS has missing blocks — what do you do?" Answer: run fsck, identify affected files, check DataNode health with dfsadmin -report, check DataNode logs for disk failures.
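Because fsck's summary is plain text, on-call triage scripts usually grep/awk the counters out of it. A minimal sketch, with a captured sample summary standing in for live `hdfs fsck /` output (the counts are invented for illustration):

```shell
# Sample fsck summary; in practice, pipe `hdfs fsck /` into the same awk.
summary='Total blocks: 13140
Under-replicated blocks: 12 (0.09%)
Missing blocks: 2 (0.01%)
Corrupt blocks: 0'

# Extract the Missing blocks counter ("+0" coerces "2 (0.01%)" to the number 2).
missing=$(echo "$summary" | awk -F': ' '/^Missing blocks/ {print $2+0}')
if [ "$missing" -gt 0 ]; then
    echo "ALERT: ${missing} missing blocks, run fsck -list-corruptfileblocks next"
fi
```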

balancer — Rebalance blocks across DataNodes

What it does: Redistributes blocks across DataNodes to achieve even disk usage. Required after adding new nodes or when some nodes are significantly fuller than others.

Syntax:

bash
hdfs balancer
hdfs balancer -threshold <percentage>     # Default 10%
hdfs balancer -policy datanode            # Balance by DataNode usage

Practical example:

bash
# Run balancer with default 10% threshold
# (stops when all nodes are within 10% of average utilization)
hdfs balancer

# Tighter balance — within 5% of average
hdfs balancer -threshold 5

# Cap balancer bandwidth at runtime to avoid impacting production jobs
# (persistent default comes from dfs.datanode.balance.bandwidthPerSec in hdfs-site.xml)
hdfs dfsadmin -setBalancerBandwidth 104857600   # 100 MB/s per DataNode
hdfs balancer -threshold 10

# Check balance status — look at utilization spread
hdfs dfsadmin -report | grep "DFS Used%"

Interview tip: The balancer runs as a background process and is bandwidth-limited to avoid saturating the network. Default bandwidth is 10 MB/s per DataNode (dfs.datanode.balance.bandwidthPerSec). Interview question: "You added 10 new DataNodes but new data still goes to old nodes — why?" Answer: existing data is not automatically rebalanced. New writes are balanced, but you need to run the balancer for existing data. Also, HDFS prefers writing to local DataNode first (data locality), so new data naturally goes to wherever the writers run.
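The threshold is a band around the cluster-average utilization: balancing stops once every DataNode's DFS Used% sits within that band. A toy check of the stopping condition, with invented per-node percentages:

```shell
# Flag DataNodes outside the threshold band around the cluster average.
threshold=10
utilizations="42 55 48 71 39"          # DFS Used% per DataNode, illustrative
sum=0; count=0
for u in $utilizations; do sum=$((sum + u)); count=$((count + 1)); done
avg=$((sum / count))
for u in $utilizations; do
    diff=$((u - avg))
    if [ $diff -lt 0 ]; then diff=$((-diff)); fi
    if [ $diff -gt $threshold ]; then
        echo "node at ${u}% is outside the ${threshold}% band around ${avg}%"
    fi
done
```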

dfsadmin -report — Cluster health report

What it does: Displays a comprehensive report of the HDFS cluster including capacity, usage, and DataNode status.

Syntax:

bash
hdfs dfsadmin -report
hdfs dfsadmin -report -live           # Only live DataNodes
hdfs dfsadmin -report -dead           # Only dead DataNodes

Practical example:

bash
# Full cluster health report
hdfs dfsadmin -report
# Output:
# Configured Capacity: 500 TB
# Present Capacity: 480 TB
# DFS Remaining: 200 TB (41.67%)
# DFS Used: 280 TB (58.33%)
# Under replicated blocks: 0
# Blocks with corrupt replicas: 0
# Missing blocks: 0
#
# ----- Live DataNodes (50) -----
# Name: 10.0.0.5:9866 (datanode05.cluster.local)
# Decommission Status: Normal
# Configured Capacity: 10 TB
# DFS Used: 5.6 TB (56.00%)
# ...

# Quick check: are any DataNodes dead?
hdfs dfsadmin -report -dead

Interview tip: This is the FIRST command you run when troubleshooting any HDFS issue. It shows you dead DataNodes, disk usage, under-replicated blocks — everything at a glance. Interviewers ask: "How do you monitor HDFS health?" Answer: dfsadmin -report for manual checks, plus Ambari/Cloudera Manager/Grafana dashboards for continuous monitoring with alerting on missing blocks, capacity thresholds, and dead DataNodes.

dfsadmin -safemode — Safe mode operations

What it does: Safe mode is a read-only state where HDFS doesn't allow any modifications. NameNode enters safe mode on startup until enough DataNodes report their blocks.

Syntax:

bash
hdfs dfsadmin -safemode get            # Check if safe mode is ON/OFF
hdfs dfsadmin -safemode enter          # Manually enter safe mode
hdfs dfsadmin -safemode leave          # Manually leave safe mode
hdfs dfsadmin -safemode wait           # Block until safe mode exits

Practical example:

bash
# Check safe mode status
hdfs dfsadmin -safemode get
# Output: Safe mode is OFF

# Enter safe mode before maintenance
hdfs dfsadmin -safemode enter
# Do maintenance (e.g., snapshot, config change)
hdfs dfsadmin -safemode leave

# In scripts: wait for safe mode to finish after cluster restart
hdfs dfsadmin -safemode wait
echo "HDFS is ready, starting ETL jobs..."

Interview tip: NameNode auto-enters safe mode on startup and waits until 99.9% of blocks are reported by DataNodes (configurable via dfs.namenode.safemode.threshold-pct). Classic interview scenario: "Your ETL job fails with 'Cannot create file, NameNode is in safe mode' — what happened?" Answer: the NameNode restarted (or is still starting up), or someone manually entered safe mode for maintenance. Fix: check why NameNode restarted, wait for safe mode to auto-exit, or manually leave with safemode leave (only if you understand why it was in safe mode).

SECTION 2: YARN COMMANDS

YARN = Yet Another Resource Negotiator. It manages cluster resources (CPU + memory) across all applications.

yarn application -list — List running applications

What it does: Lists all running (or filtered by state) YARN applications.

Syntax:

bash
yarn application -list                               # Running apps
yarn application -list -appStates ALL                # All states
yarn application -list -appStates FINISHED           # Completed apps
yarn application -list -appStates FAILED,KILLED      # Failed or killed
yarn application -list -appTypes SPARK               # Only Spark apps

Practical example:

bash
# See what's currently running on the cluster
yarn application -list
# Output:
# Application-Id          Name                  State    Queue     Progress
# application_1234_0001   daily_booking_etl     RUNNING  default   65%
# application_1234_0002   hive_query_xyz        RUNNING  adhoc     30%

# Check how many jobs finished today
yarn application -list -appStates FINISHED

# Find all failed Spark jobs
yarn application -list -appStates FAILED -appTypes SPARK

Interview tip: This is your first command when the cluster seems slow — check if rogue applications are consuming all resources. Look for apps stuck at 0% progress (possible data skew or deadlock) or apps running for hours in a queue that should take minutes.
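The "stuck at 0% progress" check is easy to script against that tabular output. A sketch using a captured sample of the listing (real usage would pipe the live `yarn application -list` into the same awk; the app rows here are invented):

```shell
# Sample rows in the same column order: Id, Name, State, Queue, Progress.
applist='application_1234_0001  daily_booking_etl  RUNNING  default  65%
application_1234_0002  hive_query_xyz     RUNNING  adhoc    0%
application_1234_0003  ingest_kafka       RUNNING  default  0%'

# Print application IDs whose Progress column reads 0%.
stuck=$(echo "$applist" | awk '$5 == "0%" {print $1}')
echo "$stuck"
```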

yarn application -status — Application details

What it does: Shows detailed status of a specific YARN application.

Syntax:

bash
yarn application -status <applicationId>

Practical example:

bash
# Check status of a specific application
yarn application -status application_1711350000000_0042
# Output:
# Application Report:
#   Application-Id: application_1711350000000_0042
#   Application-Name: daily_booking_etl
#   Application-Type: SPARK
#   State: RUNNING
#   Final-Status: UNDEFINED
#   Progress: 65%
#   Queue: production
#   AM Host: datanode15.cluster.local
#   Allocated Resources: <memory:40960 MB, vCores:16>
#   Running Containers: 8
#   Start-Time: 1711350120000

Interview tip: Key things to check: the Queue (is it in the right queue?), Allocated Resources (is it hogging too much?), Running Containers (are they as expected?), and Start-Time (has it been running too long?). If Final-Status is UNDEFINED while State is RUNNING, the job is still in progress.

yarn application -kill — Kill an application

What it does: Forcefully kills a running YARN application.

Syntax:

bash
yarn application -kill <applicationId>

Practical example:

bash
# Kill a stuck Hive query that's consuming all cluster resources
yarn application -kill application_1711350000000_0042
# Output: Killing application application_1711350000000_0042
# Application application_1711350000000_0042 has been killed.

# Common scenario: kill all stuck applications in a loop
for app_id in $(yarn application -list -appStates RUNNING | grep "stuck_job" | awk '{print $1}'); do
    yarn application -kill $app_id
done

Interview tip: You need sufficient permissions to kill an application — either be the owner or have admin rights. In a production environment, always check what the application is doing BEFORE killing it. Killing a Hive INSERT OVERWRITE mid-way can leave partial data. Interview scenario: "A Spark job is using 80% of cluster resources and blocking other jobs — what do you do?" Answer: check the queue configuration first (Capacity Scheduler limits), then kill if necessary, then fix the root cause (add resource limits, use separate queues).

yarn logs — Application logs

What it does: Retrieves logs for a completed or running YARN application.

Syntax:

bash
yarn logs -applicationId <appId>
yarn logs -applicationId <appId> -containerId <containerId>   # Specific container
yarn logs -applicationId <appId> -nodeAddress <nodeAddress>    # Specific node
yarn logs -applicationId <appId> -log_files stderr             # Only stderr

Practical example:

bash
# Get all logs for a failed application
yarn logs -applicationId application_1711350000000_0042 > /home/krishna/job_logs.txt

# Get only stderr (where exceptions appear)
yarn logs -applicationId application_1711350000000_0042 -log_files stderr

# Get logs from a specific container (useful for debugging specific task failures)
yarn logs -applicationId application_1711350000000_0042 \
  -containerId container_1711350000000_0042_01_000005

# Get logs from a specific node
yarn logs -applicationId application_1711350000000_0042 \
  -nodeAddress datanode15.cluster.local:8041

Interview tip: Logs are available AFTER the application finishes (unless log aggregation is enabled). Log aggregation (yarn.log-aggregation-enable=true) collects logs from all NodeManagers and stores them in HDFS (/app-logs/ by default). Without log aggregation, you must SSH to each NodeManager to read logs. Interview trap: "Your Spark job failed yesterday but you can't find the logs — why?" Answer: log aggregation might be disabled, or the aggregated logs have been cleaned up (check yarn.log-aggregation.retain-seconds).

yarn node -list — List cluster nodes

What it does: Lists all NodeManagers in the YARN cluster with their status and resources.

Syntax:

bash
yarn node -list                    # All nodes
yarn node -list -states RUNNING    # Only active nodes
yarn node -list -all               # Include decommissioned/lost nodes

Practical example:

bash
# List all active NodeManagers
yarn node -list -states RUNNING
# Output:
# Node-Id                  Node-State  Node-Http-Address     Containers
# datanode01:45454         RUNNING     datanode01:8042       4
# datanode02:45454         RUNNING     datanode02:8042       6
# datanode03:45454         RUNNING     datanode03:8042       3

# Check for unhealthy or lost nodes
yarn node -list -all

Interview tip: Healthy NodeManagers regularly heartbeat to the ResourceManager. If a NodeManager stops heartbeating (default timeout: 10 minutes via yarn.nm.liveness-monitor.expiry-interval-ms), it's marked as LOST and its containers are killed. YARN then reschedules those containers on other healthy nodes — this is automatic fault tolerance.

yarn queue -status — Queue information

What it does: Shows the status and resource allocation of a specific YARN queue.

Syntax:

bash
yarn queue -status <queueName>

Practical example:

bash
# Check the production queue status
yarn queue -status production
# Output:
# Queue Name: production
# State: RUNNING
# Capacity: 60.0%
# Current Capacity: 45.0%
# Maximum Capacity: 80.0%
# Default Node Label: <DEFAULT>
# Number of Applications: 3

# Check the adhoc queue
yarn queue -status adhoc

Interview tip: Queue configuration is how enterprises control resource sharing. Capacity Scheduler (default in HDP) defines percentage-based queues. Fair Scheduler (default in CDH) shares resources equally. Interview question: "How do you prevent one team from monopolizing cluster resources?" Answer: configure separate queues with capacity limits (e.g., production=60%, analytics=30%, adhoc=10%) and set maximum-capacity to prevent elastic growth beyond a threshold.
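Those percentages translate directly into guaranteed resources. A quick calc using the 60/30/10 split from the answer and an assumed 800 GB cluster (both numbers illustrative):

```shell
# Guaranteed memory per queue = cluster total x queue capacity percentage.
cluster_mem_gb=800                          # assumed total cluster memory
prod_gb=$((cluster_mem_gb * 60 / 100))
analytics_gb=$((cluster_mem_gb * 30 / 100))
adhoc_gb=$((cluster_mem_gb * 10 / 100))
echo "production=${prod_gb}G analytics=${analytics_gb}G adhoc=${adhoc_gb}G"
```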

yarn top — Resource usage overview

What it does: Shows real-time cluster resource usage, similar to Linux top.

Syntax:

bash
yarn top

Practical example:

bash
# Real-time resource monitoring
yarn top
# Output (refreshes every 3 seconds):
# CLUSTER SUMMARY: 50 NodeManagers
# Memory: 500 GB / 800 GB used (62.5%)
# VCores: 200 / 400 used (50.0%)
# Queue    Capacity  Used  Apps  Containers
# root.prod  60%     45%    3      24
# root.adhoc 30%     15%    2      8
# root.dev   10%     2%     1      2

Interview tip: yarn top gives you the real-time bird's eye view. If memory usage is at 95%, new applications will be queued (pending). If vCores are maxed out, jobs will run but slower. The balance between memory and vCores matters — you can have free memory but no vCores, or vice versa.

SECTION 3: HIVE COMMANDS (beeline / hive CLI)

📝 Note
Note: The hive CLI is deprecated since Hive 2.0. Use beeline for all production work: it connects to HiveServer2 via JDBC and supports authentication and concurrent sessions.

beeline — Connection string

What it does: Connects to HiveServer2 for executing Hive queries.

Syntax:

bash
beeline -u "jdbc:hive2://<host>:<port>/<database>"
beeline -u "jdbc:hive2://<host>:<port>/<database>" -n <username> -p <password>
beeline -u "jdbc:hive2://<host>:10000/default" --hiveconf hive.execution.engine=tez

Practical example:

bash
# Connect to HiveServer2 on default port
beeline -u "jdbc:hive2://hiveserver.cluster.local:10000/default"

# Connect with Kerberos authentication
beeline -u "jdbc:hive2://hiveserver.cluster.local:10000/default;principal=hive/_HOST@REALM.COM"

# Connect and run a single query (non-interactive)
beeline -u "jdbc:hive2://hiveserver.cluster.local:10000/default" \
  -e "SELECT count(*) FROM bookings WHERE year=2026"

# Connect and run a script file
beeline -u "jdbc:hive2://hiveserver.cluster.local:10000/default" \
  -f /home/krishna/etl_daily.hql

Interview tip: beeline vs hive CLI — know the difference. hive CLI runs an embedded Metastore and doesn't go through HiveServer2, so it bypasses security (no authentication). beeline connects via JDBC to HiveServer2, supports Kerberos, LDAP, and concurrent users. In interviews, always say you use beeline.

SHOW / DESCRIBE — Metadata exploration

What it does: Explores databases, tables, and schema metadata.

Syntax:

sql
SHOW DATABASES;
SHOW TABLES;
SHOW TABLES IN <database>;
DESCRIBE <table>;
DESCRIBE FORMATTED <table>;        -- Full metadata including location, format, partitions
DESCRIBE EXTENDED <table>;         -- Similar but less readable
SHOW PARTITIONS <table>;
SHOW CREATE TABLE <table>;         -- DDL to recreate the table

Practical example:

sql
-- List all databases
SHOW DATABASES;

-- Switch to bookings database
USE bookings_db;

-- List all tables
SHOW TABLES;

-- Quick schema check
DESCRIBE flights;
-- Output:
-- flight_id    int
-- origin       string
-- destination  string
-- departure    timestamp
-- year         int        (partition column)
-- month        int        (partition column)

-- Full metadata (CRITICAL for interviews)
DESCRIBE FORMATTED flights;
-- Shows: location (HDFS path), InputFormat, OutputFormat, SerDe,
-- partition columns, table type (MANAGED/EXTERNAL), creation time

-- See all partitions
SHOW PARTITIONS flights;
-- Output:
-- year=2025/month=01
-- year=2025/month=02
-- ...
-- year=2026/month=03

-- Get exact DDL
SHOW CREATE TABLE flights;

Interview tip: DESCRIBE FORMATTED is the most powerful metadata command — it shows the HDFS location, file format (ORC/Parquet), SerDe, partition columns, and whether it's MANAGED or EXTERNAL. Interview question: "How do you find where a Hive table's data is stored?" Answer: DESCRIBE FORMATTED table_name — look for the Location field.

CREATE TABLE — Internal, external, partitioned, bucketed

What it does: Creates Hive tables with various storage configurations.

Syntax and examples:

sql
-- INTERNAL (Managed) TABLE: Hive owns the data. DROP TABLE = data deleted.
CREATE TABLE bookings (
    booking_id   BIGINT,
    passenger    STRING,
    flight_code  STRING,
    amount       DOUBLE,
    booking_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- EXTERNAL TABLE: Hive only manages metadata. DROP TABLE = data stays.
CREATE EXTERNAL TABLE flights_raw (
    flight_id    INT,
    origin       STRING,
    destination  STRING,
    departure    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/amadeus/raw/flights/';

-- PARTITIONED TABLE (most common in production)
CREATE EXTERNAL TABLE bookings_partitioned (
    booking_id   BIGINT,
    passenger    STRING,
    flight_code  STRING,
    amount       DOUBLE
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS ORC
LOCATION '/data/amadeus/bookings/';

-- BUCKETED TABLE (for optimized joins)
CREATE TABLE bookings_bucketed (
    booking_id   BIGINT,
    passenger    STRING,
    flight_code  STRING,
    amount       DOUBLE
)
CLUSTERED BY (booking_id) INTO 32 BUCKETS
STORED AS ORC;

Interview tip: The #1 Hive interview question: "Internal vs External table — when to use which?" Answer: Use EXTERNAL for raw/shared data (dropping table won't delete data, safe for multiple consumers). Use INTERNAL/MANAGED for intermediate/temp tables where Hive should manage lifecycle. In production, 90% of tables are EXTERNAL. Bucketing: use when you frequently join two large tables on the same key — bucketed tables enable bucket map join (no shuffle).
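The mechanics behind bucket map join can be sketched in a few lines. Hive routes each row to a bucket via a hash of the clustering key modulo the bucket count, so two tables bucketed on the same key into the same number of buckets always place matching keys in the same bucket number. This is a conceptual Python sketch, not Hive's actual hash implementation, and the sample key is made up:

```python
# Sketch: how CLUSTERED BY (booking_id) INTO 32 BUCKETS routes rows.
# Hive's real hash function differs by column type; the modulo principle
# is the same.
NUM_BUCKETS = 32

def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Bucket number a row with this key lands in."""
    return hash(key) % num_buckets

# Two tables bucketed the same way: matching keys share a bucket number,
# so the join can pair bucket i of one table with bucket i of the other
# (bucket map join) instead of shuffling every row across the cluster.
bookings_key, flights_key = 4521, 4521
assert bucket_for(bookings_key) == bucket_for(flights_key)
```

This is why the bucket count must match (or be a multiple) on both tables for the optimization to kick in.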

LOAD DATA / INSERT — Loading data into tables

What it does: Loads data from files or query results into Hive tables.

Syntax:

sql
-- LOAD DATA: moves a file into the table's HDFS directory (no transformation)
LOAD DATA INPATH '<hdfs_path>' INTO TABLE <table>;
LOAD DATA INPATH '<hdfs_path>' OVERWRITE INTO TABLE <table>;
LOAD DATA LOCAL INPATH '<local_path>' INTO TABLE <table>;

-- INSERT INTO: appends query results
INSERT INTO TABLE <target> SELECT * FROM <source>;

-- INSERT OVERWRITE: replaces all data (or partition)
INSERT OVERWRITE TABLE <target> SELECT * FROM <source>;

-- DYNAMIC PARTITION INSERT
INSERT OVERWRITE TABLE bookings PARTITION (year, month, day)
SELECT booking_id, passenger, flight_code, amount, year, month, day
FROM staging_bookings;

Practical example:

sql
-- Load a CSV file from HDFS into a raw table
LOAD DATA INPATH '/data/staging/bookings_20260325.csv' INTO TABLE bookings_raw;

-- Load from local filesystem (copies to HDFS first)
LOAD DATA LOCAL INPATH '/home/krishna/test_data.csv' INTO TABLE test_table;

-- ETL pattern: transform and load into partitioned ORC table
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE bookings_orc PARTITION (year, month)
SELECT
    booking_id, passenger, flight_code, amount,
    year(booking_time) AS year,
    month(booking_time) AS month
FROM bookings_raw
WHERE booking_date = '2026-03-25';

-- Overwrite a specific partition only
INSERT OVERWRITE TABLE bookings_orc PARTITION (year=2026, month=3)
SELECT booking_id, passenger, flight_code, amount
FROM bookings_raw
WHERE year(booking_time) = 2026 AND month(booking_time) = 3;

Interview tip: LOAD DATA INPATH MOVES the file (not copies) — the source file is gone after the load. Use LOAD DATA LOCAL INPATH to copy from local. INSERT OVERWRITE with a partition spec only overwrites THAT partition, not the entire table. Dynamic partitioning requires hive.exec.dynamic.partition.mode=nonstrict — without this, Hive requires at least one static partition. Interview trap: "Your INSERT OVERWRITE deleted all data instead of just one partition" — you forgot the PARTITION clause.
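What a dynamic-partition INSERT does conceptually: the trailing SELECT columns (year, month) decide which partition directory each row is written to. A minimal Python sketch of that routing, with made-up rows (Hive writes unpadded values like month=3 for INT partition columns):

```python
from collections import defaultdict

# Sketch: dynamic partitioning groups rows by the values of the partition
# columns, which must come LAST in the SELECT list. Sample data is made up.
# Row layout: (booking_id, amount, year, month)
rows = [
    (1, 120.0, 2026, 3),
    (2, 89.5, 2026, 3),
    (3, 250.0, 2026, 2),
]

partitions = defaultdict(list)
for booking_id, amount, year, month in rows:
    # Hive writes each row under <table_location>/year=YYYY/month=M/
    partitions[f"year={year}/month={month}"].append((booking_id, amount))

assert sorted(partitions) == ["year=2026/month=2", "year=2026/month=3"]
```

One insert statement can therefore create many partition directories at once, which is why Hive guards it behind `hive.exec.dynamic.partition.mode`.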

ALTER TABLE — Modify table structure

What it does: Modifies table schema, properties, partitions, or location.

Syntax and examples:

sql
-- Add a partition manually (static partitioning)
ALTER TABLE bookings ADD PARTITION (year=2026, month=3, day=25)
LOCATION '/data/amadeus/bookings/year=2026/month=03/day=25';

-- Drop a partition (data deleted for managed tables, kept for external)
ALTER TABLE bookings DROP PARTITION (year=2024, month=1);

-- Rename table
ALTER TABLE old_bookings RENAME TO bookings_archive;

-- Add a column
ALTER TABLE bookings ADD COLUMNS (loyalty_tier STRING);

-- Change column name/type
ALTER TABLE bookings CHANGE old_column_name new_column_name BIGINT;

-- Change table properties
ALTER TABLE bookings SET TBLPROPERTIES ('orc.compress'='ZLIB');

-- Change HDFS location
ALTER TABLE bookings SET LOCATION '/data/amadeus/bookings_v2/';

Interview tip: Adding partitions with ALTER TABLE ADD PARTITION is manual, static partition registration — you define each partition yourself. This is needed when data is already in HDFS but Hive doesn't know about it. More common: use MSCK REPAIR TABLE to auto-discover all partitions. ALTER TABLE on an external table only changes metadata — the data in HDFS is untouched.

MSCK REPAIR TABLE — Sync partitions

What it does: Scans the HDFS directory structure and automatically adds any partitions that exist in HDFS but not in the Hive Metastore.

Syntax:

sql
MSCK REPAIR TABLE <table>;

Practical example:

sql
-- Scenario: Spark/ETL wrote new partition directories to HDFS
-- HDFS has: /data/bookings/year=2026/month=03/day=25/
-- But Hive doesn't know about it yet

-- Sync Hive Metastore with HDFS
MSCK REPAIR TABLE bookings;
-- Output: Partitions not in metastore: bookings:year=2026/month=03/day=25
-- Now the partition is queryable

-- Verify
SHOW PARTITIONS bookings;

Interview tip: MSCK REPAIR TABLE is essential when external tools (Spark, Sqoop, manual hdfs dfs -put) create partition directories without going through Hive. It only ADDS partitions — it does NOT remove partitions whose HDFS directories were deleted. For large tables with thousands of partitions, MSCK REPAIR can be slow — prefer ALTER TABLE ADD PARTITION for specific partitions. Interview question: "Spark wrote data to HDFS but Hive query returns 0 rows — why?" Answer: partitions not registered in Metastore. Fix: MSCK REPAIR TABLE.
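The repair step boils down to a set difference. A sketch of the comparison MSCK performs between HDFS directories and Metastore entries (partition names are illustrative):

```python
# Sketch of MSCK REPAIR TABLE: compare partition directories found in
# HDFS against partitions registered in the Metastore, and ADD the
# missing ones. It never drops metastore entries whose directories are gone.
hdfs_dirs = {"year=2026/month=02", "year=2026/month=03", "year=2026/month=04"}
metastore = {"year=2026/month=02", "year=2026/month=03", "year=2026/month=01"}

to_add = hdfs_dirs - metastore     # what MSCK registers
orphaned = metastore - hdfs_dirs   # what MSCK leaves untouched

assert to_add == {"year=2026/month=04"}
assert orphaned == {"year=2026/month=01"}
```

The `orphaned` set is exactly why MSCK alone can leave stale partitions behind: removing those requires explicit ALTER TABLE DROP PARTITION.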

EXPLAIN — Query execution plan

What it does: Shows the execution plan of a Hive query without running it. Essential for optimization.

Syntax:

sql
EXPLAIN <query>;
EXPLAIN EXTENDED <query>;           -- More detail
EXPLAIN FORMATTED <query>;          -- JSON output (machine-parsable)

Practical example:

sql
-- See execution plan for a join query
EXPLAIN
SELECT b.booking_id, f.origin, f.destination
FROM bookings b
JOIN flights f ON b.flight_code = f.flight_code
WHERE b.year = 2026 AND b.month = 3;

-- Output shows:
-- Stage-1: Map (reads bookings, applies partition filter)
-- Stage-2: Map (reads flights)
-- Stage-3: Reduce (shuffle join on flight_code)

-- Check if partition pruning is working
EXPLAIN
SELECT count(*) FROM bookings WHERE year = 2026;
-- Look for: filterExpr: (year = 2026) — means partition pruning is active

Interview tip: Always EXPLAIN before running expensive queries. Look for: (1) partition pruning — is Hive scanning only needed partitions? (2) join strategy — map join (broadcast) vs reduce join (shuffle). (3) number of stages — fewer stages = faster. Interview scenario: "Your Hive query scans 5 TB but should only scan 50 GB — what's wrong?" Answer: run EXPLAIN, check if partition pruning is happening. If WHERE clause uses a function on the partition column (e.g., WHERE year(dt) = 2026 instead of WHERE year = 2026), Hive can't do partition pruning.

SET — Configuration at runtime

What it does: Sets Hive configuration parameters for the current session.

Key settings for interviews:

sql
-- Use Tez engine (10-100x faster than MapReduce)
SET hive.execution.engine=tez;

-- Enable vectorized execution (process 1024 rows at a time instead of 1)
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- Enable dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Enable map join (broadcast small table)
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- 25 MB threshold

-- Enable CBO (Cost-Based Optimizer)
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Enable compression
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Control parallelism
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

Interview tip: The top 3 Hive performance settings interviewers expect you to know: (1) hive.execution.engine=tez — switch from MapReduce to Tez, (2) vectorized execution — processes batches of 1024 rows, (3) CBO with ANALYZE TABLE for statistics. These three alone can improve query performance by 10-50x.

SECTION 4: SQOOP COMMANDS

Sqoop = SQL-to-Hadoop. Imports data from RDBMS (Oracle, MySQL, PostgreSQL) into HDFS/Hive and exports back. Uses MapReduce under the hood for parallel data transfer.

sqoop import — Basic import

What it does: Imports a table from an RDBMS into HDFS or Hive.

Syntax and examples:

bash
# Basic import from MySQL to HDFS
sqoop import \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --table bookings \
  --target-dir /data/amadeus/sqoop_import/bookings/ \
  --as-avrodatafile \
  --num-mappers 8

# Import into Hive table directly
sqoop import \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --table bookings \
  --hive-import \
  --hive-table bookings_db.bookings_raw \
  --hive-overwrite \
  --num-mappers 8

# Import with WHERE clause (subset of data)
sqoop import \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --table bookings \
  --where "booking_date >= '2026-03-01'" \
  --target-dir /data/amadeus/sqoop_import/bookings_march/ \
  --num-mappers 4

# Import with custom query
sqoop import \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --query "SELECT b.*, f.origin, f.destination FROM bookings b JOIN flights f ON b.flight_code = f.flight_code WHERE \$CONDITIONS" \
  --split-by b.booking_id \
  --target-dir /data/amadeus/sqoop_import/enriched_bookings/ \
  --num-mappers 8

Interview tip: --split-by determines how Sqoop parallelizes the import. By default, it uses the primary key. If no primary key, you MUST specify --split-by or use --num-mappers 1. The --split-by column should be numeric and evenly distributed — if it's skewed (e.g., 90% of values in one range), most mappers will be idle. With --query, you must include WHERE $CONDITIONS — Sqoop replaces this with range conditions for each mapper. Always use --password-file instead of --password to avoid credentials in process listings.
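The split logic itself is simple range arithmetic: Sqoop first runs SELECT MIN, MAX on the split column, then divides that interval into one WHERE range per mapper. A hedged Python sketch of the idea (boundary rounding here is illustrative, not Sqoop's exact implementation):

```python
# Sketch: how --split-by becomes per-mapper WHERE ranges. Real Sqoop runs
# SELECT MIN(col), MAX(col) first; the id span below is made up.
def split_ranges(lo: int, hi: int, num_mappers: int):
    """Evenly split [lo, hi] into num_mappers inclusive ranges."""
    step = (hi - lo + 1) / num_mappers
    ranges = []
    for i in range(num_mappers):
        start = lo + round(i * step)
        end = lo + round((i + 1) * step) - 1 if i < num_mappers - 1 else hi
        ranges.append((start, end))
    return ranges

# booking_id spans 1..1,000,000 with 4 mappers -> 4 roughly equal ranges.
# If 90% of the ids sat inside one range (skew), that mapper would do
# 90% of the work while the others sit idle.
ranges = split_ranges(1, 1_000_000, 4)
assert ranges[0] == (1, 250_000) and ranges[-1] == (750_001, 1_000_000)
```

Even splitting of the *value range* is not even splitting of the *row count*, which is exactly the skew trap the tip above warns about.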

sqoop import --incremental — Incremental imports

What it does: Imports only new or modified rows, not the entire table.

Syntax:

bash
# APPEND mode: import only rows with ID greater than last imported
sqoop import \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --table bookings \
  --incremental append \
  --check-column booking_id \
  --last-value 1000000 \
  --target-dir /data/amadeus/sqoop_import/bookings/ \
  --num-mappers 4

# LASTMODIFIED mode: import rows modified since last import
sqoop import \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --table bookings \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2026-03-24 00:00:00" \
  --target-dir /data/amadeus/sqoop_import/bookings/ \
  --merge-key booking_id \
  --num-mappers 4

Interview tip: Two modes — know the difference: append is for INSERT-only tables (new rows have higher ID, no updates). lastmodified is for tables with updates (uses a timestamp column). append just adds new files to HDFS. lastmodified with --merge-key does a MapReduce merge of old and new data — slower but handles updates. Interview question: "How do you do incremental loads from Oracle to HDFS?" Answer: Sqoop --incremental lastmodified with --check-column on updated_at and --merge-key on primary key. Store --last-value in a Sqoop job or external metadata table.
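The --merge-key step can be sketched as "newest row per key wins": Sqoop's merge job combines the previously imported data with the new delta and keeps the latest version of each primary key. A conceptual Python sketch with made-up records, not Sqoop's MapReduce implementation:

```python
# Sketch of the lastmodified + --merge-key semantics.
# Row layout: (booking_id, amount, updated_at); sample data is made up.
old = [(1, 100.0, "2026-03-20"), (2, 75.0, "2026-03-21")]
new = [(2, 80.0, "2026-03-24"), (3, 60.0, "2026-03-25")]

merged = {}
for booking_id, amount, updated_at in old + new:
    prev = merged.get(booking_id)
    if prev is None or updated_at > prev[1]:   # newest updated_at wins
        merged[booking_id] = (amount, updated_at)

assert merged[2] == (80.0, "2026-03-24")   # update applied over old row
assert sorted(merged) == [1, 2, 3]         # new insert picked up too
```

Plain append mode skips this merge entirely, which is why it only works for insert-only source tables.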

sqoop export — Export to RDBMS

What it does: Exports data from HDFS/Hive back to an RDBMS table.

Syntax:

bash
# Basic export from HDFS to MySQL
sqoop export \
  --connect jdbc:mysql://db.amadeus.local:3306/reports_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --table daily_summary \
  --export-dir /data/amadeus/reports/daily_summary/ \
  --input-fields-terminated-by ',' \
  --num-mappers 4

# Export with update mode (upsert: insert or update existing rows)
sqoop export \
  --connect jdbc:mysql://db.amadeus.local:3306/reports_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --table booking_metrics \
  --export-dir /user/hive/warehouse/bookings_db.db/booking_metrics/ \
  --update-key booking_date \
  --update-mode allowinsert \
  --num-mappers 4

Interview tip: By default, Sqoop export does INSERT. If the target table has a unique key constraint and a row already exists, the export FAILS. Use --update-key with --update-mode allowinsert for upsert behavior. Sqoop export is NOT atomic: if it fails halfway, partial data is already in the RDBMS. Solution: export to a staging table (Sqoop supports --staging-table, with --clear-staging-table to empty it first), and Sqoop moves the rows into the final table in a single transaction only after the full export succeeds.
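The upsert semantics of --update-mode allowinsert reduce to "update the key if it exists, otherwise insert it". A minimal Python sketch of that behavior on a keyed target (table contents are made up):

```python
# Sketch of --update-key booking_date with --update-mode allowinsert:
# rows whose key already exists in the target are UPDATEd, the rest are
# INSERTed. With 'updateonly', unmatched keys would simply be skipped.
target = {"2026-03-24": 1500.0}                          # booking_date -> revenue
export = {"2026-03-24": 1620.0, "2026-03-25": 980.0}     # rows being exported

for booking_date, revenue in export.items():
    target[booking_date] = revenue   # update if present, else insert

assert target == {"2026-03-24": 1620.0, "2026-03-25": 980.0}
```

Note this is exactly dict-assignment semantics, which is a handy way to remember the mode in an interview.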

sqoop eval — Test connection and run queries

What it does: Executes a SQL query on the source database. Used to test connectivity and verify schemas before import.

Syntax:

bash
sqoop eval \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --query "SELECT count(*) FROM bookings"

Practical example:

bash
# Test database connectivity
sqoop eval \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --query "SELECT 1"

# Check row count before import
sqoop eval \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --query "SELECT count(*) FROM bookings WHERE booking_date = '2026-03-25'"

# Check table schema
sqoop eval \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password \
  --query "DESCRIBE bookings"

Interview tip: Always run sqoop eval first to verify: (1) network connectivity to the database, (2) credentials work, (3) the table exists and schema is as expected. This saves you from debugging a failed 2-hour import that failed in the first second due to wrong credentials.

sqoop list-databases / list-tables — Discovery

What it does: Lists available databases or tables in the source RDBMS.

Syntax:

bash
# List all databases
sqoop list-databases \
  --connect jdbc:mysql://db.amadeus.local:3306/ \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password

# List all tables in a database
sqoop list-tables \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password

Practical example:

bash
# Discover what's available on a new database server
sqoop list-databases \
  --connect jdbc:mysql://db.amadeus.local:3306/ \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password
# Output:
# information_schema
# bookings_db
# flights_db
# reports_db

# List tables in the bookings database
sqoop list-tables \
  --connect jdbc:mysql://db.amadeus.local:3306/bookings_db \
  --username etl_user \
  --password-file /home/krishna/.sqoop_password
# Output:
# bookings
# passengers
# flights
# airports
# booking_audit_log

Interview tip: These commands are useful during the discovery phase of a migration project. When migrating an entire Oracle/MySQL database to Hadoop, first run list-tables to inventory everything, then plan imports table by table with appropriate --split-by columns and file formats.

SECTION 5: QUICK-FIRE INTERVIEW QUESTIONS

💡 Interview Tip
These are rapid-fire questions interviewers ask to check your hands-on experience. Answer in 1-2 sentences + the exact command.

Q1: How do you check HDFS cluster health?

Answer: Run hdfs dfsadmin -report — it shows total capacity, used space, remaining space, number of live/dead DataNodes, under-replicated blocks, and missing blocks. For a quick filesystem integrity check, run hdfs fsck /.

bash
hdfs dfsadmin -report
hdfs fsck /

Q2: How do you find which DataNode a specific block is on?

Answer: Use hdfs fsck with the -files -blocks -locations flags on the specific file. It shows every block ID and the DataNodes holding each replica.

bash
hdfs fsck /data/amadeus/bookings/booking_2026.orc -files -blocks -locations
# Output shows:
# blk_1073741825 len=134217728 [10.0.0.5:9866, 10.0.0.7:9866, 10.0.0.12:9866]
# This block has 3 replicas on DataNodes at 10.0.0.5, 10.0.0.7, 10.0.0.12

Q3: How do you check if NameNode is in safe mode?

Answer: Run hdfs dfsadmin -safemode get. If it returns "Safe mode is ON", no write operations are allowed. The NameNode auto-enters safe mode on startup until enough blocks are reported.

bash
hdfs dfsadmin -safemode get
# Output: Safe mode is ON  (or OFF)

# To exit safe mode manually:
hdfs dfsadmin -safemode leave

Q4: How do you decommission a DataNode?

Answer: Decommissioning is a graceful removal — HDFS first replicates all blocks from that node to other nodes before shutting it down. This ensures no data loss.

bash
# Step 1: Add the DataNode hostname to the exclude file
# (configured in hdfs-site.xml as dfs.hosts.exclude)
echo "datanode15.cluster.local" >> /etc/hadoop/conf/dfs.exclude

# Step 2: Tell NameNode to refresh the node list
hdfs dfsadmin -refreshNodes

# Step 3: Monitor decommission progress
hdfs dfsadmin -report
# Look for: Decommission Status: Decommission in progress
# Wait until it says: Decommission Status: Decommissioned

# Step 4: Once decommissioned, stop the DataNode service
# (all blocks have been replicated to other nodes)

Key point: Never just shut down a DataNode without decommissioning. If replication factor is 3 and you kill a node, those blocks temporarily have only 2 replicas. If another node dies before HDFS re-replicates, you lose data.

Q5: How do you check YARN resource usage?

Answer: Use yarn top for real-time monitoring, or yarn node -list to see per-node container counts. For queue-level usage, use yarn queue -status <queue_name>.

bash
# Real-time cluster resource usage
yarn top

# Per-node resource usage
yarn node -list

# Queue-specific usage
yarn queue -status production

# List all running applications consuming resources
yarn application -list

Q6: How do you kill a stuck YARN application?

Answer: First identify the application ID with yarn application -list, then kill it with yarn application -kill. Always check what the application is doing before killing it.

bash
# Find the stuck application
yarn application -list
# Look for apps stuck at 0% progress or running too long

# Kill it
yarn application -kill application_1711350000000_0042

# Verify it's gone
yarn application -status application_1711350000000_0042
# State should be KILLED

Q7: How do you see Hive table partition information?

Answer: Use SHOW PARTITIONS to list all partitions, and DESCRIBE FORMATTED to see partition columns and table metadata including HDFS location.

sql
-- List all partitions
SHOW PARTITIONS bookings;
-- Output:
-- year=2025/month=01
-- year=2025/month=02
-- ...
-- year=2026/month=03

-- See partition columns and full table metadata
DESCRIBE FORMATTED bookings;

-- To check the physical HDFS data behind those partitions, exit beeline and use:

bash
# Verify the HDFS directory structure matches partitions
hdfs dfs -ls -R /user/hive/warehouse/bookings_db.db/bookings/ | head -20

# Check size of each partition
hdfs dfs -du -h /user/hive/warehouse/bookings_db.db/bookings/

MEMORY MAP: COMMAND CATEGORIES

📐 Architecture Diagram
HDFS COMMANDS — Remember: "CRUD + Health"
═════════════════════════════════════════
C = Create     → mkdir, touchz, put/copyFromLocal
R = Read       → ls, cat, head, tail, stat, count, du
U = Update     → mv, cp, chmod, chown, setrep
D = Delete     → rm, rm -r
Health         → fsck, balancer, dfsadmin -report, dfsadmin -safemode

YARN COMMANDS — Remember: "LASK-N-Q-T"
═══════════════════════════════════════
L = List       → yarn application -list
A = App status → yarn application -status
S = Stop (kill)→ yarn application -kill
K = Know logs  → yarn logs -applicationId
N = Nodes      → yarn node -list
Q = Queues     → yarn queue -status
T = Top        → yarn top

HIVE COMMANDS — Remember: "SCALD-ME"
═════════════════════════════════════
S = Show       → SHOW DATABASES, TABLES, PARTITIONS
C = Create     → CREATE TABLE (internal, external, partitioned, bucketed)
A = Alter      → ALTER TABLE (add partition, rename, add column)
L = Load       → LOAD DATA, INSERT INTO, INSERT OVERWRITE
D = Describe   → DESCRIBE FORMATTED (the power command)
M = MSCK       → MSCK REPAIR TABLE (sync partitions)
E = Explain    → EXPLAIN (query plan)

SQOOP COMMANDS — Remember: "I-I-E-E-L"
═══════════════════════════════════════
I = Import          → sqoop import (full table)
I = Incremental     → sqoop import --incremental (append/lastmodified)
E = Export          → sqoop export (HDFS to RDBMS)
E = Eval            → sqoop eval (test connection)
L = List            → sqoop list-databases, list-tables
Pro Tip
Final Interview Tip: When asked "What commands do you use daily?", frame it as a senior engineer: "On a typical day, I check cluster health with dfsadmin -report, monitor jobs with yarn application -list and yarn top, optimize Hive queries using EXPLAIN, and manage data pipelines that use Sqoop for RDBMS ingestion with incremental loads. For troubleshooting, hdfs fsck and yarn logs are my go-to tools."