Airflow — Confusions, Labs, Gotchas & Mock Interview
🧠 Memory Map: DAG-TASK-RUN
Airflow boils down to 3 concepts. Remember DTR:
| Letter | Pillar | What it is |
|---|---|---|
| D | DAG (Directed Acyclic Graph) | Workflow definition — the recipe |
| T | Task | Single unit of work (an Operator call) |
| R | Run (DAG Run / Task Instance) | An EXECUTION of DAG/Task at a specific time |
Drawing DAG → Task → Task Instance → Task Run explains 80% of Airflow confusion.
SECTION 1 — TOP 8 CONFUSIONS CLEARED
Confusion #1 — execution_date vs data_interval_start vs logical_date vs start_date
Biggest interview trap. Clarified in Airflow 2.2+:
| Term | What it means |
|---|---|
| start_date (DAG arg) | Earliest date the DAG is allowed to run |
| logical_date (= old execution_date) | The timestamp the DAG represents (NOT when it ran) |
| data_interval_start | Start of the data window this run processes |
| data_interval_end | End of the window |
| actual run time | Wall-clock time when the task executes |
Example: schedule @daily, DAG run for 2026-04-01:
- logical_date = 2026-04-01 00:00:00 UTC
- data_interval_start = 2026-04-01 00:00:00
- data_interval_end = 2026-04-02 00:00:00
- ACTUAL run fires at the end of the interval → ~2026-04-02 00:00:01
Interview one-liner: "A DAG scheduled @daily for 2026-04-01 runs AT 2026-04-02. logical_date is the window it represents, not when it executed."
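The relationship can be sketched with plain datetimes, no Airflow required: for a @daily schedule, the run labeled with a given logical_date covers the interval [logical_date, logical_date + 1 day) and fires at the interval's end.

```python
from datetime import datetime, timedelta

def daily_interval(logical_date):
    """For a @daily schedule: the run labeled `logical_date` covers
    [logical_date, logical_date + 1 day) and fires at the interval's end."""
    return logical_date, logical_date + timedelta(days=1)

start, end = daily_interval(datetime(2026, 4, 1))
print(start.isoformat())  # 2026-04-01T00:00:00  (logical_date / data_interval_start)
print(end.isoformat())    # 2026-04-02T00:00:00  (data_interval_end ≈ actual run time)
```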
Confusion #2 — Scheduler vs Executor vs Worker
Three things with overlapping names.
| Component | What it does | Where it runs |
|---|---|---|
| Scheduler | Parses DAGs, decides what should run NOW, queues tasks | Dedicated process |
| Executor | Delivers queued tasks to workers | Inside Scheduler (config choice) |
| Worker | Actually runs the task | Separate process (Celery/K8s) |
Executor types:
- LocalExecutor — single machine, subprocesses
- CeleryExecutor — distributed via Celery queue
- KubernetesExecutor — each task → a new K8s pod
- CeleryKubernetesExecutor — mix of both
Memory trick: Scheduler decides → Executor dispatches → Worker executes.
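The executor is a deployment setting, not code. A minimal sketch of switching it, either in airflow.cfg's [core] section or via the equivalent environment variable:

```shell
# airflow.cfg:  [core] executor = LocalExecutor  — or equivalently:
export AIRFLOW__CORE__EXECUTOR=LocalExecutor

# Verify which executor is active:
airflow config get-value core executor
```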
Confusion #3 — Operator vs Task vs TaskInstance
| Concept | Role |
|---|---|
| Operator | CLASS template (BashOperator, PythonOperator) |
| Task | An INSTANCE of an Operator in a DAG (a node in the graph) |
| TaskInstance (TI) | An EXECUTION of that task for a specific DAG run |
from airflow.operators.bash import BashOperator
# BashOperator is the OPERATOR
# `extract` is the TASK (lives in the DAG graph)
extract = BashOperator(
task_id='extract',
bash_command='echo hello',
dag=dag,
)
# Every day, ONE TaskInstance of `extract` is created
Confusion #4 — XCom vs Variable vs Connection vs Params
| Mechanism | Scope | Use case |
|---|---|---|
| XCom (cross-communication) | Between tasks in same DAG run | Pass small result: count, path, status |
| Variable | Global | Env config: API_URL, feature flags |
| Connection | Global | Credentials: DB, S3, APIs |
| Params | Per-DAG-run | Runtime overrides at trigger time |
# XCom push/pull
def extract(**ctx):
ctx['ti'].xcom_push(key='row_count', value=1000)
def load(**ctx):
count = ctx['ti'].xcom_pull(task_ids='extract', key='row_count')
print(f"Loading {count} rows")
⚠️ Gotcha: XCom is stored in metadata DB — DON'T push large DataFrames. Push S3 path, not the data itself.
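The fix in miniature (FakeTI is a stand-in for Airflow's real TaskInstance, and the S3 path is hypothetical): push a pointer to the data, not the data itself.

```python
# Minimal sketch of the "push a path, not a payload" pattern.
class FakeTI:
    """Stand-in for Airflow's TaskInstance, just enough for push/pull."""
    def __init__(self):
        self._store = {}
    def xcom_push(self, key, value):
        self._store[key] = value
    def xcom_pull(self, task_ids, key):
        return self._store[key]

def extract(ti):
    path = "s3://my-bucket/staging/2026-04-01/rows.parquet"  # hypothetical location
    # real code would write the DataFrame here, e.g. df.to_parquet(path)
    ti.xcom_push(key='data_path', value=path)  # ~60 bytes in the metadata DB

def load(ti):
    path = ti.xcom_pull(task_ids='extract', key='data_path')
    # real code would read the data back here, e.g. pd.read_parquet(path)
    return path

ti = FakeTI()
extract(ti)
print(load(ti))  # s3://my-bucket/staging/2026-04-01/rows.parquet
```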
Confusion #5 — Trigger Rules: all_success vs all_done vs one_failed vs none_failed
Default = all_success (all upstreams must succeed). Override when you want different behavior.
| Rule | Triggers when |
|---|---|
| all_success (default) | All upstream tasks succeeded |
| all_failed | All upstreams failed |
| all_done | All upstreams finished (success OR fail) |
| one_success | At least one upstream succeeded |
| one_failed | At least one upstream failed |
| none_failed | None failed (success OR skipped OK) |
| none_failed_min_one_success | No failures, at least one success |
Common pattern: Cleanup task with trigger_rule='all_done' — runs whether pipeline succeeded or failed.
cleanup = PythonOperator(
task_id='cleanup',
python_callable=delete_temp_files,
trigger_rule='all_done',
dag=dag,
)
Confusion #6 — Sensor vs Operator + poke vs reschedule mode
Sensor = special operator that waits for a condition. Two modes:
| Mode | Behavior | When to use |
|---|---|---|
| poke (default) | Task holds a worker slot, checks every poke_interval | Short waits (< 5 min) |
| reschedule | Task releases slot between pokes, re-queues itself | Long waits (hours) |
from airflow.sensors.filesystem import FileSensor
wait_for_file = FileSensor(
task_id='wait_file',
filepath='/data/ready.flag',
poke_interval=60,
timeout=3600,
mode='reschedule', # release worker between checks
)
Horror story: a poke sensor waiting 6 hours blocks a worker slot for 6 hours. Use reschedule for long waits.
Confusion #7 — Catchup vs Backfill
| Term | What it does |
|---|---|
| catchup=True (default) | On DAG creation, create runs for ALL past dates since start_date |
| catchup=False | Only start running from "now" onwards |
| Backfill (CLI) | MANUALLY run DAG for a past date range |
dag = DAG('my_dag', start_date=datetime(2020,1,1), catchup=False)
# Without catchup=False, Airflow would try to run 5+ years of historical runs!
# Backfill by command
airflow dags backfill -s 2026-01-01 -e 2026-01-31 my_dag
Interview rule: almost always set catchup=False unless you need historical backfill.
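Back-of-envelope for the catchup warning above: with a @daily schedule, one DAG run is created per missed interval since start_date.

```python
from datetime import date

# One DAG run per missed @daily interval between start_date and "now".
start = date(2020, 1, 1)
today = date(2026, 4, 2)  # hypothetical first-unpause date
missed_runs = (today - start).days
print(missed_runs)  # 2283 runs queued at once with catchup=True
```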
Confusion #8 — Airflow vs Prefect vs Dagster vs Luigi
| | Airflow | Prefect | Dagster | Luigi |
|---|---|---|---|---|
| Definition | Static Python DAGs | Dynamic Python flows | Assets-first | Task-based |
| Scheduling | Cron / timetable | Dynamic triggers | Cron / triggers | Manual/cron |
| Strength | Ecosystem, ops | Developer ergonomics | Data-aware (assets) | Simple, minimal |
| Weakness | Verbose DAG files, dynamic DAGs awkward | Younger ecosystem | Newer, lock-in to assets model | Dated |
2026 reality: Airflow still dominates in enterprises. Prefect/Dagster grow fast in startups.
SECTION 2 — PRACTICE LABS
Lab 1: Minimal DAG with 3 tasks + XCom (15 mins)
# Install Airflow locally (official docs recommend pinning with a constraints file)
pip install "apache-airflow==2.9.0"
export AIRFLOW_HOME=$HOME/airflow_lab
airflow db migrate   # `airflow db init` is deprecated since 2.7
airflow users create --username admin --password admin --role Admin \
    --email a@a.com --firstname A --lastname B
Save as $AIRFLOW_HOME/dags/etl_lab.py:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
def extract(**ctx):
rows = [{"id": 1, "amt": 100}, {"id": 2, "amt": 200}]
print(f"extracted: {rows}") # intermediate step
ctx['ti'].xcom_push(key='rows', value=rows)
def transform(**ctx):
rows = ctx['ti'].xcom_pull(task_ids='extract', key='rows')
doubled = [{"id": r["id"], "amt": r["amt"] * 2} for r in rows]
print(f"transformed: {doubled}") # intermediate step
ctx['ti'].xcom_push(key='doubled', value=doubled)
def load(**ctx):
rows = ctx['ti'].xcom_pull(task_ids='transform', key='doubled')
print(f"loaded to DB: {rows}") # intermediate step
return len(rows)
with DAG(
dag_id='etl_lab',
start_date=datetime(2026, 4, 1),
schedule='@daily',
catchup=False,
) as dag:
e = PythonOperator(task_id='extract', python_callable=extract)
t = PythonOperator(task_id='transform', python_callable=transform)
l = PythonOperator(task_id='load', python_callable=load)
e >> t >> l
# Start scheduler + webserver
airflow standalone
# Open http://localhost:8080 → trigger 'etl_lab' → inspect XCom in UI
What you proved: DAG with dependencies, XCom passing between tasks, visible in UI.
Lab 2: Branching (5 mins)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.empty import EmptyOperator
def choose_path(**ctx):
import random
return 'path_a' if random.random() > 0.5 else 'path_b'
with DAG('branching_lab', start_date=datetime(2026,4,1), schedule=None, catchup=False) as dag:
branch = BranchPythonOperator(task_id='choose', python_callable=choose_path)
a = EmptyOperator(task_id='path_a')
b = EmptyOperator(task_id='path_b')
join = EmptyOperator(task_id='join', trigger_rule='none_failed_min_one_success')
branch >> [a, b] >> join
What you proved: one path runs, other is SKIPPED (not failed), and join still runs because of trigger_rule.
Lab 3: Sensor + dynamic task mapping (10 mins)
from datetime import datetime

from airflow.decorators import task, dag
from airflow.sensors.filesystem import FileSensor
@dag(dag_id='dynamic_lab', start_date=datetime(2026,4,1), schedule=None, catchup=False)
def flow():
wait = FileSensor(task_id='wait', filepath='/tmp/ready.flag',
poke_interval=10, mode='reschedule', timeout=300)
@task
def list_files():
files = ['a.csv', 'b.csv', 'c.csv']
print(f"files: {files}")
return files
@task
def process(name: str):
print(f"processing {name}")
return f"done:{name}"
files = list_files()
results = process.expand(name=files) # dynamic mapping
wait >> files >> results
flow()
What you proved: .expand() creates one task instance per list element — parallel fan-out without hardcoding.
SECTION 3 — LIVE VISUAL ANIMATIONS
Animation 1: Scheduler heartbeat loop
EVERY FEW SECONDS:
┌────────────────────────────┐
│ 1. Parse DAG files │
│ (dags/ folder) │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 2. For each DAG: │
│ compute next run │
│ based on schedule │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 3. Create DAG Run rows │
│ in metadata DB │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 4. Queue ready tasks │
│ (deps met, retries ok) │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ 5. Executor picks from │
│ queue, sends to worker │
└────────────────────────────┘
Takeaway: the scheduler is a LOOP. DAG changes picked up on next parse (configurable via dag_dir_list_interval).
Animation 2: XCom data flow
DAG run 2026-04-01
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ extract │ push │ transform │ push │ load │
│ returns 1000│────────▶│ adds *2 │────────▶│ saves to DB │
└──────┬───────┘ └──────▲───────┘ └──────▲───────┘
│ │ │
└──────────▶ Metadata DB (xcom table) ◀────────────┘
(key="row_count", value=1000, dag_run_id=...)
Rule: XCom lives in metadata DB. One row per (dag_run, task, key). Keep values SMALL.
Animation 3: Task state machine
TI states to know: none, scheduled, queued, running, success, failed, up_for_retry, skipped, up_for_reschedule, upstream_failed.
SECTION 4 — GOTCHAS (REAL PRODUCTION FAILURES)
Gotcha 1: Big XComs = metadata DB bloat
Pushing a 50 MB DataFrame via XCom → Postgres metadata DB explodes. Fix: push S3/GCS path, read the actual data in next task.
Gotcha 2: top-level code runs EVERY parse (every 30s)
# BAD: runs on every DAG parse
import pandas as pd

df = pd.read_csv('huge.csv')  # executes in the scheduler on every parse → scheduler freezes
with DAG(...) as dag:
...
Fix: put I/O inside task functions, not module-level.
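A minimal, Airflow-free sketch of why the fix works: wrap the expensive call in the task callable, so it runs only when the task executes, not every time the scheduler imports the module.

```python
# CALLS records when the expensive work actually happens.
CALLS = []

def expensive_read():  # stands in for pd.read_csv('huge.csv')
    CALLS.append("read")
    return "data"

def my_task():  # what you'd hand to PythonOperator(python_callable=my_task)
    return expensive_read()

# "Parse time": module imported, callable defined — but no work done yet.
assert CALLS == []
# "Run time": the worker invokes the callable; only now does the read happen.
my_task()
assert CALLS == ["read"]
```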
Gotcha 3: schedule_interval is deprecated, use schedule=
Airflow 2.4+ uses schedule= (supports cron, timedelta, datasets, timetables).
Gotcha 4: catchup=True on a DAG with start_date 2 years ago
Creates 730 DAG runs immediately → scheduler overload.
Fix: catchup=False unless you truly want backfill.
Gotcha 5: Sensor in poke mode holds worker slot for hours
Starves downstream tasks.
Fix: mode='reschedule', or better, deferrable operators (2.2+). SmartSensors were deprecated and removed in Airflow 2.4.
Gotcha 6: Timezone confusion
Airflow uses UTC by default. logical_date is always UTC. Local cron expressions like "run at 9am" need explicit timezone.
Fix:
import pendulum
dag = DAG(..., start_date=pendulum.datetime(2026,1,1, tz="Asia/Kolkata"))
SECTION 5 — TIMED MOCK INTERVIEW (45 MIN)
Q1 (8 min) — "Design a daily ETL that pulls from API, transforms, loads to warehouse"
Answer structure:
- DAG: schedule='@daily', catchup=False, start_date=datetime(2026,1,1)
- Tasks: api_sensor → extract → validate → transform → load → quality_check → notify
- Use TaskFlow API (@task) for clean Python code
- Connections: API credentials stored in Airflow Connections
- Retries: retries=3, retry_delay=timedelta(minutes=5), retry_exponential_backoff=True
- Alerts: on_failure_callback → Slack/PagerDuty
- SLA: 2 hours end-to-end
Q2 (5 min) — "Difference between @daily and 0 0 * * *?"
Both run at midnight. But:
- @daily = 0 0 * * * internally
- Cron gives precise timing control (e.g., 30 2 * * MON-FRI = 2:30 AM Mon-Fri)
- Use cron when you need non-standard windows
- Datasets + timetables (2.4+) allow event-driven runs instead
Q3 (6 min) — "A DAG is slow, how do you debug?"
- Check Gantt chart in UI → which task is longest?
- Check worker logs → CPU-bound? I/O-bound?
- Is XCom huge? Check metadata DB size
- Too many top-level imports? Run airflow dags list-import-errors
- Parser slow? Enable DAG file processor stats
- Task queued but not running → executor capacity / pool config
- Consider KubernetesExecutor for per-task isolation
Q4 (4 min) — "How do you handle task retries for flaky API calls?"
task = PythonOperator(
task_id='api_call',
python_callable=call_api,
retries=5,
retry_delay=timedelta(seconds=30),
retry_exponential_backoff=True, # 30s, 60s, 120s, 240s...
max_retry_delay=timedelta(minutes=10),
)
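Back-of-envelope for the delay schedule the comment above claims. Airflow also applies random jitter, so real delays vary; this only sketches the doubling-with-cap shape.

```python
from datetime import timedelta

base = timedelta(seconds=30)  # retry_delay
cap = timedelta(minutes=10)   # max_retry_delay
# Delay roughly doubles per retry, capped at max_retry_delay (jitter ignored here).
delays = [min(base * 2 ** n, cap) for n in range(5)]
print([int(d.total_seconds()) for d in delays])  # [30, 60, 120, 240, 480]
```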
Q5 (4 min) — "DAG runs are piling up after a weekend outage"
- Pause DAG: airflow dags pause my_dag
- Mark old runs as success (skip) OR manual backfill with --rerun-failed-tasks
- If you only want the LATEST state: max_active_runs=1 + depends_on_past=True prevents pile-up
- Consider the LatestOnlyOperator to short-circuit historical runs
SECTION 6 — FINAL READINESS CHECKLIST
- Can I explain logical_date vs execution time vs data_interval?
- Do I know Scheduler → Executor → Worker pipeline?
- Can I list the 4 executor types and when to use each?
- Do I know XCom limits and what NOT to push through it?
- Can I write 3 trigger rules and a use case for each?
- Do I know sensor poke vs reschedule mode?
- Can I explain catchup=True disaster?
- Can I use dynamic task mapping (.expand())?
- Do I know how to set retries + exponential backoff?
- Can I debug a slow DAG (5 checks)?
If all 10 = YES, you're Airflow interview-ready.
Remember DAG-TASK-RUN. Everything else is details.