Airflow — Confusions, Labs, Gotchas & Mock Interview

Apache Airflow · Section 1 of 1
💡 Interview Tip
The video-free pack. Read this end-to-end and you can walk into any Airflow interview without opening YouTube.

🧠 Memory Map: DAG-TASK-RUN

Airflow boils down to 3 concepts. Remember DTR:

Letter | Pillar                        | What it is
D      | DAG (Directed Acyclic Graph)  | Workflow definition — the recipe
T      | Task                          | Single unit of work (an Operator call)
R      | Run (DAG Run / Task Instance) | An EXECUTION of the DAG/Task at a specific time

Drawing DAG → Task → Task Instance → Task Run explains 80% of Airflow confusion.

SECTION 1 — TOP 8 CONFUSIONS CLEARED

Confusion #1 — execution_date vs data_interval_start vs logical_date vs start_date

Biggest interview trap. Clarified in Airflow 2.2+:

Term                                | What it means
start_date (DAG arg)                | Earliest date the DAG is allowed to run
logical_date (= old execution_date) | The timestamp the DAG run represents (NOT when it ran)
data_interval_start                 | Start of the data window this run processes
data_interval_end                   | End of the window
actual run time                     | Wall-clock time when the task executes

Example: schedule @daily, DAG run for 2026-04-01:

  • logical_date = 2026-04-01 00:00:00 UTC
  • data_interval_start = 2026-04-01 00:00:00
  • data_interval_end = 2026-04-02 00:00:00
  • ACTUAL run happens at end of interval → ~2026-04-02 00:00:01

Interview one-liner: "A DAG scheduled @daily for 2026-04-01 runs AT 2026-04-02. logical_date is the window it represents, not when it executed."
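The interval arithmetic above can be sketched in plain Python (stdlib datetime only, no Airflow needed; purely illustrative):

```python
from datetime import datetime, timedelta, timezone

# For a @daily schedule, each run covers a one-day data window.
logical_date = datetime(2026, 4, 1, tzinfo=timezone.utc)

data_interval_start = logical_date                    # window start
data_interval_end = logical_date + timedelta(days=1)  # window end

# The scheduler only triggers the run once the window has CLOSED,
# so the wall-clock run time is at (or just after) the interval end.
earliest_actual_run = data_interval_end

print(data_interval_start.isoformat())  # 2026-04-01T00:00:00+00:00
print(data_interval_end.isoformat())    # 2026-04-02T00:00:00+00:00
```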

Confusion #2 — Scheduler vs Executor vs Worker

Three things with overlapping names.

Component | What it does                                           | Where it runs
Scheduler | Parses DAGs, decides what should run NOW, queues tasks | Dedicated process
Executor  | Delivers queued tasks to workers                       | Inside the Scheduler (config choice)
Worker    | Actually runs the task                                 | Separate process (Celery/K8s)

Executor types:

  • LocalExecutor — single machine, subprocesses
  • CeleryExecutor — distributed via Celery queue
  • KubernetesExecutor — each task → a new K8s pod
  • CeleryKubernetesExecutor — mix both

Memory trick: Scheduler decides → Executor dispatches → Worker executes.
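The executor is a deployment-wide choice made in airflow.cfg (or via the AIRFLOW__CORE__EXECUTOR environment variable). A minimal Celery setup looks roughly like this; the broker and backend URLs are placeholder values, not a recommendation:

```ini
[core]
; which Executor class the scheduler uses to dispatch queued tasks
executor = CeleryExecutor

[celery]
; workers pull tasks from this broker; the result backend tracks state
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
```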

Confusion #3 — Operator vs Task vs TaskInstance

Concept           | Role
Operator          | CLASS template (BashOperator, PythonOperator)
Task              | An INSTANCE of an Operator in a DAG (a node in the graph)
TaskInstance (TI) | An EXECUTION of that task for a specific DAG run

python
from airflow.operators.bash import BashOperator

# BashOperator is the OPERATOR
# `extract` is the TASK (lives in the DAG graph)
extract = BashOperator(
    task_id='extract',
    bash_command='echo hello',
    dag=dag,
)
# Every day, ONE TaskInstance of `extract` is created

Confusion #4 — XCom vs Variable vs Connection vs Params

Mechanism                  | Scope                         | Use case
XCom (cross-communication) | Between tasks in same DAG run | Pass a small result: count, path, status
Variable                   | Global                        | Env config: API_URL, feature flags
Connection                 | Global                        | Credentials: DB, S3, APIs
Params                     | Per-DAG-run                   | Runtime overrides at trigger time

python
# XCom push/pull
def extract(**ctx):
    ctx['ti'].xcom_push(key='row_count', value=1000)

def load(**ctx):
    count = ctx['ti'].xcom_pull(task_ids='extract', key='row_count')
    print(f"Loading {count} rows")

⚠️ Gotcha: XCom is stored in metadata DB — DON'T push large DataFrames. Push S3 path, not the data itself.

Confusion #5 — Trigger Rules: all_success vs all_done vs one_failed vs none_failed

Default = all_success (all upstreams must succeed). Override when you want different behavior.

Rule                        | Triggers when
all_success (default)       | All upstream tasks succeeded
all_failed                  | All upstreams failed
all_done                    | All upstreams finished (success OR fail)
one_success                 | At least one upstream succeeded
one_failed                  | At least one upstream failed
none_failed                 | None failed (success OR skipped OK)
none_failed_min_one_success | No failures, at least one success

Common pattern: Cleanup task with trigger_rule='all_done' — runs whether pipeline succeeded or failed.

python
cleanup = PythonOperator(
    task_id='cleanup',
    python_callable=delete_temp_files,
    trigger_rule='all_done',
    dag=dag,
)

Confusion #6 — Sensor vs Operator + poke vs reschedule mode

Sensor = special operator that waits for a condition. Two modes:

Mode           | Behavior                                               | When to use
poke (default) | Task holds a worker slot, checks every poke_interval   | Short waits (< 5 min)
reschedule     | Task releases the slot between pokes, re-queues itself | Long waits (hours)

python
from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id='wait_file',
    filepath='/data/ready.flag',
    poke_interval=60,
    timeout=3600,
    mode='reschedule',   # release worker between checks
)

Horror story: a poke sensor waiting 6 hours blocks a worker slot for 6 hours. Use reschedule for long waits.
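The cost difference is easy to quantify. A back-of-the-envelope sketch (plain arithmetic, not Airflow API; the 1-second check duration is an assumption):

```python
# A 6-hour wait, checking every 60 seconds.
wait_seconds = 6 * 3600
poke_interval = 60
check_duration = 1  # assume each filesystem check takes about 1s

# poke mode: the TaskInstance occupies a worker slot for the WHOLE wait
poke_slot_seconds = wait_seconds

# reschedule mode: the slot is held only while a check actually runs;
# between checks the task is up_for_reschedule and the slot is free
checks = wait_seconds // poke_interval
reschedule_slot_seconds = checks * check_duration

print(poke_slot_seconds)        # 21600 (6 slot-hours held hostage)
print(reschedule_slot_seconds)  # 360   (about 6 slot-minutes total)
```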

Confusion #7 — Catchup vs Backfill

Term                   | What it does
catchup=True (default) | On DAG creation, create runs for ALL past dates since start_date
catchup=False          | Only start running from "now" onwards
Backfill (CLI)         | MANUALLY run the DAG for a past date range

python
dag = DAG('my_dag', start_date=datetime(2020,1,1), catchup=False)
# Without catchup=False, Airflow would try to run 6+ years of historical runs!
bash
# Backfill by command
airflow dags backfill -s 2026-01-01 -e 2026-01-31 my_dag

Interview rule: almost always set catchup=False unless you need historical backfill.
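You can estimate the blast radius of catchup=True with simple date arithmetic (stdlib only, one run per day for a @daily schedule):

```python
from datetime import date

start_date = date(2020, 1, 1)   # the start_date from the DAG above
today = date(2026, 4, 1)

# One @daily DAG run per day between start_date and today
missed_runs = (today - start_date).days
print(missed_runs)  # 2282 runs created as soon as the DAG is unpaused
```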

Confusion #8 — Airflow vs Prefect vs Dagster vs Luigi

Aspect     | Airflow                                 | Prefect              | Dagster                        | Luigi
Definition | Static Python DAGs                      | Dynamic Python flows | Assets-first                   | Task-based
Scheduling | Cron / timetable                        | Dynamic triggers     | Cron / triggers                | Manual/cron
Strength   | Ecosystem, ops                          | Developer ergonomics | Data-aware (assets)            | Simple, minimal
Weakness   | Verbose DAG files, dynamic DAGs awkward | Younger ecosystem    | Newer, lock-in to assets model | Dated

2026 reality: Airflow still dominates in enterprises. Prefect/Dagster grow fast in startups.

SECTION 2 — PRACTICE LABS

Lab 1: Minimal DAG with 3 tasks + XCom (15 mins)

bash
# Install Airflow locally
pip install "apache-airflow==2.9.0"
export AIRFLOW_HOME=$HOME/airflow_lab
airflow db init
airflow users create --username admin --password admin --role Admin \
  --email a@a.com --firstname A --lastname B

Save as $AIRFLOW_HOME/dags/etl_lab.py:

python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**ctx):
    rows = [{"id": 1, "amt": 100}, {"id": 2, "amt": 200}]
    print(f"extracted: {rows}")  # intermediate step
    ctx['ti'].xcom_push(key='rows', value=rows)

def transform(**ctx):
    rows = ctx['ti'].xcom_pull(task_ids='extract', key='rows')
    doubled = [{"id": r["id"], "amt": r["amt"] * 2} for r in rows]
    print(f"transformed: {doubled}")  # intermediate step
    ctx['ti'].xcom_push(key='doubled', value=doubled)

def load(**ctx):
    rows = ctx['ti'].xcom_pull(task_ids='transform', key='doubled')
    print(f"loaded to DB: {rows}")  # intermediate step
    return len(rows)

with DAG(
    dag_id='etl_lab',
    start_date=datetime(2026, 4, 1),
    schedule='@daily',
    catchup=False,
) as dag:
    e = PythonOperator(task_id='extract', python_callable=extract)
    t = PythonOperator(task_id='transform', python_callable=transform)
    l = PythonOperator(task_id='load', python_callable=load)
    e >> t >> l
bash
# Start scheduler + webserver
airflow standalone
# Open http://localhost:8080 → trigger 'etl_lab' → inspect XCom in UI

What you proved: DAG with dependencies, XCom passing between tasks, visible in UI.

Lab 2: Branching (5 mins)

python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.empty import EmptyOperator

def choose_path(**ctx):
    import random
    return 'path_a' if random.random() > 0.5 else 'path_b'

with DAG('branching_lab', start_date=datetime(2026,4,1), schedule=None, catchup=False) as dag:
    branch = BranchPythonOperator(task_id='choose', python_callable=choose_path)
    a = EmptyOperator(task_id='path_a')
    b = EmptyOperator(task_id='path_b')
    join = EmptyOperator(task_id='join', trigger_rule='none_failed_min_one_success')

    branch >> [a, b] >> join

What you proved: one path runs, other is SKIPPED (not failed), and join still runs because of trigger_rule.

Lab 3: Sensor + dynamic task mapping (10 mins)

python
from datetime import datetime
from airflow.decorators import task, dag
from airflow.sensors.filesystem import FileSensor

@dag(dag_id='dynamic_lab', start_date=datetime(2026,4,1), schedule=None, catchup=False)
def flow():
    wait = FileSensor(task_id='wait', filepath='/tmp/ready.flag',
                      poke_interval=10, mode='reschedule', timeout=300)

    @task
    def list_files():
        files = ['a.csv', 'b.csv', 'c.csv']
        print(f"files: {files}")
        return files

    @task
    def process(name: str):
        print(f"processing {name}")
        return f"done:{name}"

    files = list_files()
    results = process.expand(name=files)  # dynamic mapping

    wait >> files >> results

flow()

What you proved: .expand() creates one task instance per list element — parallel fan-out without hardcoding.

SECTION 3 — LIVE VISUAL ANIMATIONS

Animation 1: Scheduler heartbeat loop

📐 Architecture Diagram
EVERY FEW SECONDS:
  ┌────────────────────────────┐
  │  1. Parse DAG files        │
  │     (dags/ folder)         │
  └────────────┬───────────────┘
               ▼
  ┌────────────────────────────┐
  │  2. For each DAG:          │
  │     compute next run       │
  │     based on schedule      │
  └────────────┬───────────────┘
               ▼
  ┌────────────────────────────┐
  │  3. Create DAG Run rows    │
  │     in metadata DB         │
  └────────────┬───────────────┘
               ▼
  ┌────────────────────────────┐
  │  4. Queue ready tasks      │
  │     (deps met, retries ok) │
  └────────────┬───────────────┘
               ▼
  ┌────────────────────────────┐
  │  5. Executor picks from    │
  │     queue, sends to worker │
  └────────────────────────────┘

Takeaway: the scheduler is a LOOP. DAG changes picked up on next parse (configurable via dag_dir_list_interval).

Animation 2: XCom data flow

📐 Architecture Diagram
DAG run 2026-04-01
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│  extract     │ push    │  transform   │ push    │    load      │
│  returns 1000│────────▶│  adds *2     │────────▶│  saves to DB │
└──────┬───────┘         └──────▲───────┘         └──────▲───────┘
       │                        │                        │
       └──────────▶ Metadata DB (xcom table) ◀────────────┘
             (key="row_count", value=1000, dag_run_id=...)

Rule: XCom lives in metadata DB. One row per (dag_run, task, key). Keep values SMALL.

Animation 3: Task state machine

none ──(scheduler creates TI)──> scheduled ──> queued ──> running ──> success
                                                 │
                                                 ├──> failed ──> up_for_retry ──> queued (retry loop)
                                                 └──> up_for_reschedule (sensor mode=reschedule) ──> scheduled
skipped: set on tasks in a branch not taken (they never run)

TI states to know: none, scheduled, queued, running, success, failed, up_for_retry, skipped, up_for_reschedule, upstream_failed.

SECTION 4 — GOTCHAS (REAL PRODUCTION FAILURES)

Gotcha 1: Big XComs = metadata DB bloat

Pushing a 50 MB DataFrame via XCom → Postgres metadata DB explodes. Fix: push S3/GCS path, read the actual data in next task.
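The pointer-not-payload pattern looks like this. Sketched with a stand-in for the TaskInstance XCom API so it runs anywhere; in a real DAG you would use `ti.xcom_push`/`ti.xcom_pull` and implement the upload/download with boto3 (the S3 path below is a made-up example):

```python
class FakeTI:
    """Stand-in for Airflow's TaskInstance xcom_push/xcom_pull (illustration only)."""
    _store = {}

    def xcom_push(self, key, value):
        self._store[key] = value

    def xcom_pull(self, task_ids=None, key=None):
        return self._store.get(key)

def extract(ti):
    big_data = list(range(1_000_000))  # pretend this is a huge DataFrame
    # In real life: path = upload_to_s3(big_data)  <- hypothetical helper
    path = "s3://bucket/staging/2026-04-01.parquet"
    ti.xcom_push(key="data_path", value=path)  # push the POINTER, not the data

def load(ti):
    path = ti.xcom_pull(task_ids="extract", key="data_path")
    # Real task would read the data back from S3; XCom only carried ~40 bytes
    return path

ti = FakeTI()
extract(ti)
print(load(ti))  # s3://bucket/staging/2026-04-01.parquet
```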

Gotcha 2: top-level code runs EVERY parse (every 30s)

python
# BAD: module-level I/O runs on EVERY DAG parse (roughly every 30s by default)
import pandas as pd
df = pd.read_csv('huge.csv')  # scheduler freezes

with DAG(...) as dag:
    ...

# GOOD: do the read inside the task callable; it runs only at execution time
def extract():
    import pandas as pd
    return pd.read_csv('huge.csv')

Fix: put I/O inside task functions, never at module level.

Gotcha 3: schedule_interval is deprecated, use schedule=

Airflow 2.4+ uses schedule= (supports cron, timedelta, datasets, timetables).

Gotcha 4: catchup=True on a DAG with start_date 2 years ago

Creates 730 DAG runs immediately → scheduler overload. Fix: catchup=False unless you truly want backfill.

Gotcha 5: Sensor in poke mode holds worker slot for hours

Starves downstream tasks. Fix: mode='reschedule' OR use SmartSensor (deprecated) OR Deferrable operators (2.2+).

Gotcha 6: Timezone confusion

Airflow uses UTC by default. logical_date is always UTC. Local cron expressions like "run at 9am" need explicit timezone. Fix:

python
import pendulum
dag = DAG(..., start_date=pendulum.datetime(2026,1,1, tz="Asia/Kolkata"))
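You can verify the conversion with stdlib zoneinfo: "9am in Kolkata" is a different UTC instant than "9am UTC", which is exactly the bug a naive cron expression introduces:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# "Run at 9am" in local Kolkata time...
local_9am = datetime(2026, 1, 1, 9, 0, tzinfo=ZoneInfo("Asia/Kolkata"))

# ...is 03:30 UTC, which is what Airflow actually stores and schedules on
as_utc = local_9am.astimezone(ZoneInfo("UTC"))
print(as_utc)  # 2026-01-01 03:30:00+00:00  (IST is UTC+5:30)
```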

SECTION 5 — TIMED MOCK INTERVIEW (45 MIN)

Q1 (8 min) — "Design a daily ETL that pulls from API, transforms, loads to warehouse"

Answer structure:

  1. DAG: schedule='@daily', catchup=False, start_date=datetime(2026,1,1)
  2. Tasks: api_sensor → extract → validate → transform → load → quality_check → notify
  3. Use TaskFlow API (@task) for clean Python code
  4. Connection for API credentials stored in Airflow Connections
  5. Retries: retries=3, retry_delay=timedelta(minutes=5), retry_exponential_backoff=True
  6. Alerts: on_failure_callback → Slack/PagerDuty
  7. SLA: 2 hours end-to-end

Q2 (5 min) — "Difference between @daily and 0 0 * * *?"

Both run at midnight. But:

  • @daily = 0 0 * * * internally
  • Cron gives precise timing control (e.g., 30 2 * * MON-FRI = 2:30 AM Mon-Fri)
  • Use cron when you need non-standard windows
  • Datasets + timetables (2.4+) allow event-driven runs instead

Q3 (6 min) — "A DAG is slow, how do you debug?"

  1. Check Gantt chart in UI → which task is longest?
  2. Check worker logs → CPU-bound? I/O-bound?
  3. Is XCom huge? Check metadata DB size
  4. Too many top-level imports? Run airflow dags list-import-errors
  5. Parser slow? Enable DAG file processor stats
  6. Task queued but not running → executor capacity / pool config
  7. Consider KubernetesExecutor for per-task isolation

Q4 (4 min) — "How do you handle task retries for flaky API calls?"

python
task = PythonOperator(
    task_id='api_call',
    python_callable=call_api,
    retries=5,
    retry_delay=timedelta(seconds=30),
    retry_exponential_backoff=True,  # 30s, 60s, 120s, 240s...
    max_retry_delay=timedelta(minutes=10),
)
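With those settings the waits grow like this (illustrative arithmetic; Airflow's actual implementation may add jitter, so real delays vary slightly):

```python
from datetime import timedelta

retry_delay = timedelta(seconds=30)
max_retry_delay = timedelta(minutes=10)
retries = 5

delays = []
for attempt in range(retries):
    d = retry_delay * (2 ** attempt)        # 30s, 60s, 120s, ...
    delays.append(min(d, max_retry_delay))  # never longer than max_retry_delay

print([int(d.total_seconds()) for d in delays])  # [30, 60, 120, 240, 480]
```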

Q5 (4 min) — "DAG runs are piling up after a weekend outage"

  1. Pause DAG: airflow dags pause my_dag
  2. Mark old runs as success (skip) OR manual backfill with --rerun-failed-tasks
  3. If you only want LATEST state: max_active_runs=1 + depends_on_past=True prevents pile-up
  4. Consider the LatestOnlyOperator to short-circuit all runs except the most recent

SECTION 6 — FINAL READINESS CHECKLIST

  • Can I explain logical_date vs execution time vs data_interval?
  • Do I know Scheduler → Executor → Worker pipeline?
  • Can I list the 4 executor types and when to use each?
  • Do I know XCom limits and what NOT to push through it?
  • Can I write 3 trigger rules and a use case for each?
  • Do I know sensor poke vs reschedule mode?
  • Can I explain catchup=True disaster?
  • Can I use dynamic task mapping (.expand())?
  • Do I know how to set retries + exponential backoff?
  • Can I debug a slow DAG (5 checks)?

If all 10 = YES, you're Airflow interview-ready.

Remember DAG-TASK-RUN. Everything else is details.