Airflow — Confusions, Labs, Gotchas & Mock Interview
🧠 Memory Map: DAG-TASK-RUN
Airflow boils down to 3 concepts. Remember DTR:
| Letter | Pillar | What it is |
|---|---|---|
| D | DAG (Directed Acyclic Graph) | Workflow definition — the recipe |
| T | Task | Single unit of work (an Operator call) |
| R | Run (DAG Run / Task Instance) | An EXECUTION of DAG/Task at a specific time |
Drawing DAG → Task → Task Instance → Task Run explains 80% of Airflow confusion.
SECTION 1 — TOP 8 CONFUSIONS CLEARED
Confusion #1 — execution_date vs data_interval_start vs logical_date vs start_date
Biggest interview trap. Clarified in Airflow 2.2+:
| Term | What it means |
|---|---|
| start_date (DAG arg) | Earliest date the DAG is allowed to run |
| logical_date (= old execution_date) | The timestamp the DAG represents (NOT when it ran) |
| data_interval_start | Start of the data window this run processes |
| data_interval_end | End of the window |
| actual run time | Wall-clock time when the task executes |
Example: schedule @daily, DAG run for 2026-04-01:
- logical_date = 2026-04-01 00:00:00 UTC
- data_interval_start = 2026-04-01 00:00:00
- data_interval_end = 2026-04-02 00:00:00
- ACTUAL run fires at the end of the interval → ~2026-04-02 00:00:01
Interview one-liner: "A DAG scheduled @daily for 2026-04-01 runs AT 2026-04-02. logical_date is the window it represents, not when it executed."
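The relationship can be sketched with plain datetimes, no Airflow required: for a @daily schedule, the run labeled with a given logical_date covers the interval [logical_date, logical_date + 1 day) and fires at the interval's end.

```python
from datetime import datetime, timedelta

def daily_interval(logical_date):
    """For a @daily schedule: the run labeled `logical_date` covers
    [logical_date, logical_date + 1 day) and fires at the interval's end."""
    return logical_date, logical_date + timedelta(days=1)

start, end = daily_interval(datetime(2026, 4, 1))
print(start.isoformat())  # 2026-04-01T00:00:00  (logical_date / data_interval_start)
print(end.isoformat())    # 2026-04-02T00:00:00  (data_interval_end ≈ actual run time)
```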
Confusion #2 — Scheduler vs Executor vs Worker
Three things with overlapping names.
| Component | What it does | Where it runs |
|---|---|---|
| Scheduler | Parses DAGs, decides what should run NOW, queues tasks | Dedicated process |
| Executor | Delivers queued tasks to workers | Inside Scheduler (config choice) |
| Worker | Actually runs the task | Separate process (Celery/K8s) |
Executor types:
- LocalExecutor — single machine, subprocesses
- CeleryExecutor — distributed via Celery queue
- KubernetesExecutor — each task → a new K8s pod
- CeleryKubernetesExecutor — mix of both
Memory trick: Scheduler decides → Executor dispatches → Worker executes.
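The executor is a deployment setting, not code. A minimal sketch of switching it, either in airflow.cfg's [core] section or via the equivalent environment variable:

```shell
# airflow.cfg:  [core] executor = LocalExecutor  — or equivalently:
export AIRFLOW__CORE__EXECUTOR=LocalExecutor

# Verify which executor is active:
airflow config get-value core executor
```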
Confusion #3 — Operator vs Task vs TaskInstance
| Concept | Role |
|---|---|
| Operator | CLASS template (BashOperator, PythonOperator) |
| Task | An INSTANCE of an Operator in a DAG (a node in the graph) |
| TaskInstance (TI) | An EXECUTION of that task for a specific DAG run |
from airflow.operators.bash import BashOperator
# BashOperator is the OPERATOR
# `extract` is the TASK (lives in the DAG graph)
extract = BashOperator(
task_id='extract',
bash_command='echo hello',
dag=dag,
)
# Every day, ONE TaskInstance of `extract` is created
Confusion #4 — XCom vs Variable vs Connection vs Params
| Mechanism | Scope | Use case |
|---|---|---|
| XCom (cross-communication) | Between tasks in same DAG run | Pass small result: count, path, status |
| Variable | Global | Env config: API_URL, feature flags |
| Connection | Global | Credentials: DB, S3, APIs |
| Params | Per-DAG-run | Runtime overrides at trigger time |
# XCom push/pull
def extract(**ctx):
ctx['ti'].xcom_push(key='row_count', value=1000)
def load(**ctx):
count = ctx['ti'].xcom_pull(task_ids='extract', key='row_count')
print(f"Loading {count} rows")
⚠️ Gotcha: XCom is stored in metadata DB — DON'T push large DataFrames. Push S3 path, not the data itself.
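The fix in miniature (FakeTI is a stand-in for Airflow's real TaskInstance, and the S3 path is hypothetical): push a pointer to the data, not the data itself.

```python
# Minimal sketch of the "push a path, not a payload" pattern.
class FakeTI:
    """Stand-in for Airflow's TaskInstance, just enough for push/pull."""
    def __init__(self):
        self._store = {}
    def xcom_push(self, key, value):
        self._store[key] = value
    def xcom_pull(self, task_ids, key):
        return self._store[key]

def extract(ti):
    path = "s3://my-bucket/staging/2026-04-01/rows.parquet"  # hypothetical location
    # real code would write the DataFrame here, e.g. df.to_parquet(path)
    ti.xcom_push(key='data_path', value=path)  # ~60 bytes in the metadata DB

def load(ti):
    path = ti.xcom_pull(task_ids='extract', key='data_path')
    # real code would read the data back here, e.g. pd.read_parquet(path)
    return path

ti = FakeTI()
extract(ti)
print(load(ti))  # s3://my-bucket/staging/2026-04-01/rows.parquet
```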
Confusion #5 — Trigger Rules: all_success vs all_done vs one_failed vs none_failed
Default = all_success (all upstreams must succeed). Override when you want different behavior.
| Rule | Triggers when |
|---|---|
| all_success (default) | All upstream tasks succeeded |
| all_failed | All upstreams failed |
| all_done | All upstreams finished (success OR fail) |
| one_success | At least one upstream succeeded |
| one_failed | At least one upstream failed |
| none_failed | None failed (success OR skipped OK) |
| none_failed_min_one_success | No failures, at least one success |
Common pattern: Cleanup task with trigger_rule='all_done' — runs whether pipeline succeeded or failed.
cleanup = PythonOperator(
task_id='cleanup',
python_callable=delete_temp_files,
trigger_rule='all_done',
dag=dag,
)
Confusion #6 — Sensor vs Operator + poke vs reschedule mode
Sensor = special operator that waits for a condition. Two modes:
| Mode | Behavior | When to use |
|---|---|---|
| poke (default) | Task holds a worker slot, checks every poke_interval | Short waits (< 5 min) |
| reschedule | Task releases slot between pokes, re-queues itself | Long waits (hours) |
from airflow.sensors.filesystem import FileSensor
wait_for_file = FileSensor(
task_id='wait_file',
filepath='/data/ready.flag',
poke_interval=60,
timeout=3600,
mode='reschedule', # release worker between checks
)
Horror story: a poke sensor waiting 6 hours blocks a worker slot for 6 hours. Use reschedule for long waits.
Confusion #7 — Catchup vs Backfill
| Term | What it does |
|---|---|
| catchup=True (default) | On DAG creation, create runs for ALL past dates since start_date |
| catchup=False | Only start running from "now" onwards |
| Backfill (CLI) | MANUALLY run DAG for a past date range |
dag = DAG('my_dag', start_date=datetime(2020,1,1), catchup=False)
# Without catchup=False, Airflow would try to run 5+ years of historical runs!
# Backfill by command
airflow dags backfill -s 2026-01-01 -e 2026-01-31 my_dag
Interview rule: almost always set catchup=False unless you need historical backfill.
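Back-of-envelope for the catchup warning above: with a @daily schedule, one DAG run is created per missed interval since start_date.

```python
from datetime import date

# One DAG run per missed @daily interval between start_date and "now".
start = date(2020, 1, 1)
today = date(2026, 4, 2)  # hypothetical first-unpause date
missed_runs = (today - start).days
print(missed_runs)  # 2283 runs queued at once with catchup=True
```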
Confusion #8 — Airflow vs Prefect vs Dagster vs Luigi
| | Airflow | Prefect | Dagster | Luigi |
|---|---|---|---|---|
| Definition | Static Python DAGs | Dynamic Python flows | Assets-first | Task-based |
| Scheduling | Cron / timetable | Dynamic triggers | Cron / triggers | Manual/cron |
| Strength | Ecosystem, ops | Developer ergonomics | Data-aware (assets) | Simple, minimal |
| Weakness | Verbose DAG files, dynamic DAGs awkward | Younger ecosystem | Newer, lock-in to assets model | Dated |
2026 reality: Airflow still dominates in enterprises. Prefect/Dagster grow fast in startups.
SECTION 2 — PRACTICE LABS
Lab 1: Minimal DAG with 3 tasks + XCom (15 mins)
# Install Airflow locally (official docs recommend pinning with a constraints file)
pip install "apache-airflow==2.9.0"
export AIRFLOW_HOME=$HOME/airflow_lab
airflow db migrate   # `airflow db init` is deprecated since 2.7
airflow users create --username admin --password admin --role Admin \
    --email a@a.com --firstname A --lastname B
Save as $AIRFLOW_HOME/dags/etl_lab.py:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
def extract(**ctx):
rows = [{"id": 1, "amt": 100}, {"id": 2, "amt": 200}]
print(f"extracted: {rows}") # intermediate step
ctx['ti'].xcom_push(key='rows', value=rows)
def transform(**ctx):
rows = ctx['ti'].xcom_pull(task_ids='extract', key='rows')
doubled = [{"id": r["id"], "amt": r["amt"] * 2} for r in rows]
print(f"transformed: {doubled}") # intermediate step
ctx['ti'].xcom_push(key='doubled', value=doubled)
def load(**ctx):
rows = ctx['ti'].xcom_pull(task_ids='transform', key='doubled')
print(f"loaded to DB: {rows}") # intermediate step
return len(rows)
with DAG(
dag_id='etl_lab',
start_date=datetime(2026, 4, 1),
schedule='@daily',
catchup=False,
) as dag:
e = PythonOperator(task_id='extract', python_callable=extract)
t = PythonOperator(task_id='transform', python_callable=transform)
l = PythonOperator(task_id='load', python_callable=load)
e >> t >> l
# Start scheduler + webserver
airflow standalone
# Open http://localhost:8080 → trigger 'etl_lab' → inspect XCom in UI
What you proved: DAG with dependencies, XCom passing between tasks, visible in UI.
Lab 2: Branching (5 mins)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.empty import EmptyOperator
def choose_path(**ctx):
import random
return 'path_a' if random.random() > 0.5 else 'path_b'
with DAG('branching_lab', start_date=datetime(2026,4,1), schedule=None, catchup=False) as dag:
branch = BranchPythonOperator(task_id='choose', python_callable=choose_path)
a = EmptyOperator(task_id='path_a')
b = EmptyOperator(task_id='path_b')
join = EmptyOperator(task_id='join', trigger_rule='none_failed_min_one_success')
branch >> [a, b] >> join
What you proved: one path runs, other is SKIPPED (not failed), and join still runs because of trigger_rule.
Lab 3: Sensor + dynamic task mapping (10 mins)
from datetime import datetime

from airflow.decorators import task, dag
from airflow.sensors.filesystem import FileSensor
@dag(dag_id='dynamic_lab', start_date=datetime(2026,4,1), schedule=None, catchup=False)
def flow():
wait = FileSensor(task_id='wait', filepath='/tmp/ready.flag',
poke_interval=10, mode='reschedule', timeout=300)
@task
def list_files():
files = ['a.csv', 'b.csv', 'c.csv']
print(f"files: {files}")
return files
@task
def process(name: str):
print(f"processing {name}")
return f"done:{name}"
files = list_files()
results = process.expand(name=files) # dynamic mapping
wait >> files >> results
flow()
What you proved: .expand() creates one task instance per list element — parallel fan-out without hardcoding.
SECTION 3 — LIVE VISUAL ANIMATIONS
Animation 1: Scheduler heartbeat loop
EVERY FEW SECONDS:
┌────────────────────────────┐
│ 1. Parse DAG files │
│ (dags/ folder) │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 2. For each DAG: │
│ compute next run │
│ based on schedule │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 3. Create DAG Run rows │
│ in metadata DB │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 4. Queue ready tasks │
│ (deps met, retries ok) │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 5. Executor picks from │
│ queue, sends to worker │
└────────────────────────────┘
Takeaway: the scheduler is a LOOP. DAG changes picked up on next parse (configurable via dag_dir_list_interval).
Animation 2: XCom data flow
DAG run 2026-04-01
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ extract │ push │ transform │ push │ load │
│ returns 1000│────────▶│ adds *2 │────────▶│ saves to DB │
└──────┬───────┘ └──────▲───────┘ └──────▲───────┘
│ │ │
└──────────▶ Metadata DB (xcom table) ◀────────────┘
(key="row_count", value=1000, dag_run_id=...)
Rule: XCom lives in metadata DB. One row per (dag_run, task, key). Keep values SMALL.
Animation 3: Task state machine
TI states to know: none, scheduled, queued, running, success, failed, up_for_retry, skipped, up_for_reschedule, upstream_failed.
SECTION 4 — GOTCHAS (REAL PRODUCTION FAILURES)
Gotcha 1: Big XComs = metadata DB bloat
Pushing a 50 MB DataFrame via XCom → Postgres metadata DB explodes. Fix: push S3/GCS path, read the actual data in next task.
Gotcha 2: top-level code runs EVERY parse (every 30s)
# BAD: runs on every DAG parse
import pandas as pd

df = pd.read_csv('huge.csv')  # executes in the scheduler on every parse → scheduler freezes
with DAG(...) as dag:
...
Fix: put I/O inside task functions, not module-level.
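A minimal, Airflow-free sketch of why the fix works: wrap the expensive call in the task callable, so it runs only when the task executes, not every time the scheduler imports the module.

```python
# CALLS records when the expensive work actually happens.
CALLS = []

def expensive_read():  # stands in for pd.read_csv('huge.csv')
    CALLS.append("read")
    return "data"

def my_task():  # what you'd hand to PythonOperator(python_callable=my_task)
    return expensive_read()

# "Parse time": module imported, callable defined — but no work done yet.
assert CALLS == []
# "Run time": the worker invokes the callable; only now does the read happen.
my_task()
assert CALLS == ["read"]
```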
Gotcha 3: schedule_interval is deprecated, use schedule=
Airflow 2.4+ uses schedule= (supports cron, timedelta, datasets, timetables).
Gotcha 4: catchup=True on a DAG with start_date 2 years ago
Creates 730 DAG runs immediately → scheduler overload.
Fix: catchup=False unless you truly want backfill.
Gotcha 5: Sensor in poke mode holds worker slot for hours
Starves downstream tasks.
Fix: mode='reschedule', or better, deferrable operators (2.2+). SmartSensors were deprecated and removed in Airflow 2.4.
Gotcha 6: Timezone confusion
Airflow uses UTC by default. logical_date is always UTC. Local cron expressions like "run at 9am" need explicit timezone.
Fix:
import pendulum
dag = DAG(..., start_date=pendulum.datetime(2026,1,1, tz="Asia/Kolkata"))
SECTION 5 — TIMED MOCK INTERVIEW (45 MIN)
Q1 (8 min) — "Design a daily ETL that pulls from API, transforms, loads to warehouse"
Answer structure:
- DAG: schedule='@daily', catchup=False, start_date=datetime(2026,1,1)
- Tasks: api_sensor → extract → validate → transform → load → quality_check → notify
- Use TaskFlow API (@task) for clean Python code
- Connections: API credentials stored in Airflow Connections
- Retries: retries=3, retry_delay=timedelta(minutes=5), retry_exponential_backoff=True
- Alerts: on_failure_callback → Slack/PagerDuty
- SLA: 2 hours end-to-end
Q2 (5 min) — "Difference between @daily and 0 0 * * *?"
Both run at midnight. But:
- @daily = 0 0 * * * internally
- Cron gives precise timing control (e.g., 30 2 * * MON-FRI = 2:30 AM Mon-Fri)
- Use cron when you need non-standard windows
- Datasets + timetables (2.4+) allow event-driven runs instead
Q3 (6 min) — "A DAG is slow, how do you debug?"
- Check Gantt chart in UI → which task is longest?
- Check worker logs → CPU-bound? I/O-bound?
- Is XCom huge? Check metadata DB size
- Too many top-level imports? Run airflow dags list-import-errors
- Parser slow? Enable DAG file processor stats
- Task queued but not running → executor capacity / pool config
- Consider KubernetesExecutor for per-task isolation
Q4 (4 min) — "How do you handle task retries for flaky API calls?"
task = PythonOperator(
task_id='api_call',
python_callable=call_api,
retries=5,
retry_delay=timedelta(seconds=30),
retry_exponential_backoff=True, # 30s, 60s, 120s, 240s...
max_retry_delay=timedelta(minutes=10),
)
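Back-of-envelope for the delay schedule the comment above claims. Airflow also applies random jitter, so real delays vary; this only sketches the doubling-with-cap shape.

```python
from datetime import timedelta

base = timedelta(seconds=30)  # retry_delay
cap = timedelta(minutes=10)   # max_retry_delay
# Delay roughly doubles per retry, capped at max_retry_delay (jitter ignored here).
delays = [min(base * 2 ** n, cap) for n in range(5)]
print([int(d.total_seconds()) for d in delays])  # [30, 60, 120, 240, 480]
```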
Q5 (4 min) — "DAG runs are piling up after a weekend outage"
- Pause DAG: airflow dags pause my_dag
- Mark old runs as success (skip) OR manual backfill with --rerun-failed-tasks
- If you only want the LATEST state: max_active_runs=1 + depends_on_past=True prevents pile-up
- Consider the LatestOnlyOperator to short-circuit historical runs
SECTION 6 — FINAL READINESS CHECKLIST
- Can I explain logical_date vs execution time vs data_interval?
- Do I know Scheduler → Executor → Worker pipeline?
- Can I list the 4 executor types and when to use each?
- Do I know XCom limits and what NOT to push through it?
- Can I write 3 trigger rules and a use case for each?
- Do I know sensor poke vs reschedule mode?
- Can I explain catchup=True disaster?
- Can I use dynamic task mapping (.expand())?
- Do I know how to set retries + exponential backoff?
- Can I debug a slow DAG (5 checks)?
If all 10 = YES, you're Airflow interview-ready.
Remember DAG-TASK-RUN. Everything else is details.