Python Interview Question Bank — Data Engineer Edition
Q01 — What is the difference between a list and a tuple?
Question: Explain the key differences between lists and tuples in Python. When would you use each?
Quick Answer: Lists are mutable (changeable), tuples are immutable (fixed). Tuples are faster, use less memory, and can be used as dictionary keys.
# Example 1: Mutability difference
my_list = [1, 2, 3]
my_list[0] = 99 # Works fine — lists are mutable
print(my_list)
# Output: [99, 2, 3]
my_tuple = (1, 2, 3)
# my_tuple[0] = 99 # Raises: TypeError: 'tuple' object does not support item assignment
print(my_tuple)
# Output: (1, 2, 3)
# Example 2: Tuples as dictionary keys (lists cannot do this)
# Tuples are hashable because they are immutable
coords = {(10, 20): "New York", (40, 50): "London"}
print(coords[(10, 20)])
# Output: New York
# Lists are NOT hashable — this would fail:
# bad_dict = {[10, 20]: "New York"}
# Raises: TypeError: unhashable type: 'list'
# Example 3: Memory and performance comparison
import sys
my_list = [1, 2, 3, 4, 5]
my_tuple = (1, 2, 3, 4, 5)
print(f"List size: {sys.getsizeof(my_list)} bytes")
# Output: List size: 120 bytes
print(f"Tuple size: {sys.getsizeof(my_tuple)} bytes")
# Output: Tuple size: 80 bytes
# Tuples use less memory — no resize over-allocation (exact sizes vary by Python version)
🎯 Tip: "Tuples are hashable so they can be dict keys. Lists cannot because they're mutable. I use tuples for fixed data like DB row results or coordinates."
Q02 — What are Python's mutable and immutable types?
Question: Name Python's mutable and immutable types. What happens when you modify an immutable object?
Quick Answer: Immutable types (int, float, str, tuple, frozenset, bytes) create a new object on modification. Mutable types (list, dict, set, bytearray) change in place.
# Example 1: Immutable strings — "modification" creates a new object
a = "hello"
b = a # b points to the same object as a
print(id(a) == id(b))
# Output: True
a = a + " world" # a now points to a NEW string object
print(a)
# Output: hello world
print(b)
# Output: hello
print(id(a) == id(b))
# Output: False
# b still points to the original "hello" — it was never modified
# Example 2: Mutable lists — modification changes the SAME object
x = [1, 2, 3]
y = x # y points to the same list object
x.append(4) # Modifies the list in place
print(x)
# Output: [1, 2, 3, 4]
print(y)
# Output: [1, 2, 3, 4]
# Both x and y see the change because they share the same object
print(id(x) == id(y))
# Output: True
# Example 3: Immutable integers — reassignment creates a new object
a = 10
print(id(a))
# Output: 4344024144 (some memory address)
a = a + 5 # Creates a brand new int object 15
print(a)
# Output: 15
# The integer 10 still exists (until garbage collected)
# Python caches small integers (-5 to 256), so id() for those may be reused
🎯 Tip: "Understanding mutability prevents aliasing bugs. In data pipelines, I'm careful with mutable default arguments — def f(lst=[]) is a classic trap because the default list is shared across calls."
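The mutable-default trap mentioned in the tip is worth being able to demonstrate. A minimal sketch (function names here are illustrative):

```python
def append_bad(item, bucket=[]):
    # The default list is created ONCE, at definition time,
    # and shared across every call that omits `bucket`
    bucket.append(item)
    return bucket

print(append_bad(1))
# Output: [1]
print(append_bad(2))
# Output: [1, 2] — the default list remembered the previous call

def append_good(item, bucket=None):
    # Idiomatic fix: use None as a sentinel and build a fresh list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

print(append_good(1))
# Output: [1]
print(append_good(2))
# Output: [2] — calls are independent
```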
Q03 — Explain list comprehension vs generator expression
Question: What is the difference between [x for x in range(n)] and (x for x in range(n))? When would you use each?
Quick Answer: List comprehension creates the full list in memory. Generator expression produces values lazily, one at a time. For large data, generators save memory.
# Example 1: List comprehension — entire list stored in memory
squares_list = [x ** 2 for x in range(6)]
print(squares_list)
# Output: [0, 1, 4, 9, 16, 25]
print(type(squares_list))
# Output: <class 'list'>
# Example 2: Generator expression — lazy, one value at a time
squares_gen = (x ** 2 for x in range(6))
print(type(squares_gen))
# Output: <class 'generator'>
# Must iterate or convert to see values
print(next(squares_gen))
# Output: 0
print(next(squares_gen))
# Output: 1
print(list(squares_gen)) # Remaining values
# Output: [4, 9, 16, 25]
# Example 3: Memory comparison — generators win for large data
import sys
list_comp = [x for x in range(10000)]
gen_expr = (x for x in range(10000))
print(f"List memory: {sys.getsizeof(list_comp)} bytes")
# Output: List memory: 87624 bytes
print(f"Generator memory: {sys.getsizeof(gen_expr)} bytes")
# Output: Generator memory: 200 bytes
# Generator uses constant memory regardless of data size
🎯 Tip: "In data pipelines, I use generators to avoid OOM on large datasets. If I only need to iterate once, a generator is usually the better choice — but remember it can be consumed only once."
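One caveat worth raising in the interview: a generator is single-use. Once exhausted, it yields nothing — you must re-create it to iterate again.

```python
gen = (x ** 2 for x in range(4))
print(list(gen))
# Output: [0, 1, 4, 9]
print(list(gen))
# Output: [] — the generator is exhausted; re-create it to iterate again
```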
Q04 — What is the difference between deepcopy and shallow copy?
Question: Explain shallow copy vs deep copy. When does each matter? Provide an example where shallow copy causes a bug.
Quick Answer: Shallow copy creates a new outer object but shares inner objects. Deep copy creates new copies of everything recursively. Matters when you have nested structures.
# Example 1: Shallow copy — inner lists are shared
import copy
a = [[1, 2], [3, 4]]
b = copy.copy(a) # Shallow copy
a[0].append(999) # Modify an inner list
print(a)
# Output: [[1, 2, 999], [3, 4]]
print(b)
# Output: [[1, 2, 999], [3, 4]]
# BUG: b was affected because inner lists are shared references
# Example 2: Deep copy — completely independent
import copy
a = [[1, 2], [3, 4]]
c = copy.deepcopy(a) # Deep copy — all nested objects are cloned
a[0].append(999)
print(a)
# Output: [[1, 2, 999], [3, 4]]
print(c)
# Output: [[1, 2], [3, 4]]
# c is fully independent — no shared references
# Example 3: Multiple ways to shallow copy a list
original = [1, 2, 3]
# Method 1: copy module
copy1 = copy.copy(original)
# Method 2: list slicing
copy2 = original[:]
# Method 3: list() constructor
copy3 = list(original)
# All produce independent shallow copies for flat lists
original.append(4)
print(original)
# Output: [1, 2, 3, 4]
print(copy1)
# Output: [1, 2, 3]
print(copy2)
# Output: [1, 2, 3]
print(copy3)
# Output: [1, 2, 3]
# For flat lists (no nesting), shallow copy is safe and sufficient
🎯 Tip: "If your data has nested structures (list of lists, dict of dicts) and you need a fully independent copy, use deepcopy. For flat structures, shallow copy or slicing is fine and faster."
Q05 — What does if __name__ == '__main__' do?
Question: What is the purpose of if __name__ == '__main__' in Python? Why is it important?
Quick Answer: It checks if the file is being run directly (not imported). Code inside this block only executes when the file is the entry point.
# Example 1: Basic usage in a module file
# File: utils.py
def helper():
return "I'm a helper function"
def add(a, b):
return a + b
if __name__ == '__main__':
# Only runs when: python utils.py
# Does NOT run when: from utils import helper
print(helper())
# Output: I'm a helper function
print(add(3, 4))
# Output: 7
# Example 2: Understanding __name__ value
# When run directly: __name__ == '__main__'
# When imported: __name__ == 'utils' (the module name)
# File: demo.py
print(f"__name__ is: {__name__}")
# Running: python demo.py
# Output: __name__ is: __main__
# Importing: import demo
# Output: __name__ is: demo
# Example 3: Real-world pattern — ETL script with reusable functions
# File: etl_pipeline.py
def extract(source):
"""Extract data from source — reusable when imported"""
return [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
def transform(data):
"""Transform data — reusable when imported"""
return [row for row in data if row["id"] > 0]
def load(data):
"""Load data — reusable when imported"""
print(f"Loaded {len(data)} rows")
if __name__ == '__main__':
# Orchestration only runs when script is executed directly
raw = extract("db://source")
clean = transform(raw)
load(clean)
# Output: Loaded 2 rows
🎯 Tip: "This pattern makes ETL code both runnable as a script AND importable as a module. Without it, import side effects would trigger pipeline runs unintentionally."
Q06 — What are *args and **kwargs?
Question: Explain *args and **kwargs. When would you use them? Show the order of parameters.
Quick Answer: *args collects extra positional arguments as a tuple. **kwargs collects extra keyword arguments as a dict. Order: def f(positional, *args, **kwargs).
# Example 1: Basic *args — collects positional arguments into a tuple
def add_all(*args):
# args is a tuple of all positional arguments
print(f"args = {args}")
return sum(args)
result = add_all(1, 2, 3, 4, 5)
# Output: args = (1, 2, 3, 4, 5)
print(result)
# Output: 15
# Example 2: Basic **kwargs — collects keyword arguments into a dict
def create_user(**kwargs):
# kwargs is a dict of all keyword arguments
for key in sorted(kwargs.keys()): # sorted for deterministic output
print(f" {key}: {kwargs[key]}")
create_user(name="Alice", age=30, role="engineer")
# Output: age: 30
# Output: name: Alice
# Output: role: engineer
# Example 3: Combined usage with correct ordering
def func(required, *args, **kwargs):
print(f"required: {required}")
print(f"args: {args}")
# Sort kwargs keys for deterministic output
print(f"kwargs: {dict(sorted(kwargs.items()))}")
func(1, 2, 3, x=4, y=5)
# Output: required: 1
# Output: args: (2, 3)
# Output: kwargs: {'x': 4, 'y': 5}
# Real-world use: wrapper functions, decorators, flexible APIs
def log_call(func_name, *args, **kwargs):
"""Log function calls in a data pipeline"""
sorted_kw = dict(sorted(kwargs.items()))
print(f"Calling {func_name} with args={args}, kwargs={sorted_kw}")
log_call("extract", "table_a", limit=100, format="parquet")
# Output: Calling extract with args=('table_a',), kwargs={'format': 'parquet', 'limit': 100}
🚫 What NOT to Say: "args and kwargs are special keywords." They are just conventions -- you could use *numbers and **options, but *args/**kwargs is the standard everyone follows.
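To back that up: only the * and ** matter, the names are pure convention. A quick sketch with made-up parameter names:

```python
def report(*numbers, **options):
    # Behaves exactly like *args/**kwargs — only the stars are special syntax
    total = sum(numbers)
    unit = options.get("unit", "")
    return f"{total}{unit}"

print(report(1, 2, 3, unit="kg"))
# Output: 6kg
```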
Q07 — What is a decorator?
Question: What is a decorator in Python? How does it work internally? Give a practical example.
Quick Answer: A decorator is a function that wraps another function to add behavior without modifying the original. It takes a function as input and returns a new function.
# Example 1: Simple decorator — timing function execution
import time
def timer(func):
"""Decorator that measures execution time"""
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs) # Call the original function
elapsed = time.time() - start
print(f"{func.__name__} took {elapsed:.4f}s")
return result # Return the original result
return wrapper
@timer
def process_data():
total = sum(range(1000000)) # Some computation
return total
result = process_data()
# Output: process_data took 0.0312s (approximate)
print(result)
# Output: 499999500000
# Example 2: Decorator with arguments — retry logic
import time
def retry(max_attempts=3):
"""Decorator factory — returns a decorator configured with max_attempts"""
def decorator(func):
def wrapper(*args, **kwargs):
for attempt in range(1, max_attempts + 1):
try:
return func(*args, **kwargs)
except Exception as e:
print(f"Attempt {attempt} failed: {e}")
if attempt == max_attempts:
raise
return None
return wrapper
return decorator
@retry(max_attempts=2)
def fetch_data():
print("Fetching...")
return {"status": "ok"}
result = fetch_data()
# Output: Fetching...
print(result)
# Output: {'status': 'ok'}
# Example 3: Understanding decorator syntax — @ is syntactic sugar
def shout(func):
def wrapper():
result = func()
return result.upper()
return wrapper
# These two are IDENTICAL:
@shout
def greet():
return "hello world"
# Same as: greet = shout(greet)
print(greet())
# Output: HELLO WORLD
🎯 Tip: "I use decorators for timing ETL steps, retrying failed API calls, and caching expensive computations with @functools.lru_cache."
Q08 — What is the GIL (Global Interpreter Lock)?
Question: What is the GIL? How does it affect multi-threaded Python programs? How do you work around it?
Quick Answer: The GIL allows only one thread to execute Python bytecode at a time. CPU-bound tasks don't benefit from multi-threading. Use multiprocessing for CPU work, threading for I/O work.
# Example 1: Threading for I/O-bound work (GIL is released during I/O)
import threading
import time
results = []
lock = threading.Lock()
def fetch_url(url, delay):
"""Simulate an I/O-bound task (API call)"""
time.sleep(delay) # GIL is released during sleep/I/O
with lock:
results.append(f"Done: {url}")
threads = [
threading.Thread(target=fetch_url, args=("api/users", 0.1)),
threading.Thread(target=fetch_url, args=("api/orders", 0.1)),
]
start = time.time()
for t in threads:
t.start()
for t in threads:
t.join()
elapsed = time.time() - start
print(sorted(results))
# Output: ['Done: api/orders', 'Done: api/users']
print(f"Time: {elapsed:.1f}s (parallel, not 0.2s)")
# Output: Time: 0.1s (parallel, not 0.2s)
# Example 2: Multiprocessing for CPU-bound work (bypasses GIL)
from multiprocessing import Pool
def square(n):
"""CPU-bound computation"""
return n * n
# Each process has its own Python interpreter and GIL
with Pool(processes=2) as pool:
results = pool.map(square, [1, 2, 3, 4, 5])
print(results)
# Output: [1, 4, 9, 16, 25]
# Example 3: When to use what — quick reference
# I/O-bound (network, disk, DB) --> threading or asyncio
# - API calls, file reads, database queries
# - GIL is released during I/O operations
# CPU-bound (math, data processing) --> multiprocessing
# - Number crunching, image processing, data transformation
# - Each process has its own GIL
# Data engineering --> PySpark / Dask
# - Distributed computing across machines
# - Each executor runs a separate Python process
print("Threading: best for I/O-bound tasks")
print("Multiprocessing: best for CPU-bound tasks")
print("PySpark/Dask: best for distributed big data")
# Output: Threading: best for I/O-bound tasks
# Output: Multiprocessing: best for CPU-bound tasks
# Output: PySpark/Dask: best for distributed big data
🎯 Tip: "PySpark avoids the GIL by running Python in separate processes per executor. That's why it scales for big data workloads."
Q09 — Difference between is and ==?
Question: What is the difference between is and == in Python? When should you use each?
Quick Answer: == checks value equality (are the contents the same?). is checks identity (are they the exact same object in memory?). Use is only for None checks.
# Example 1: Value equality vs identity
a = [1, 2, 3]
b = [1, 2, 3] # Same values, different object
print(a == b)
# Output: True (same values)
print(a is b)
# Output: False (different objects in memory)
c = a # c points to the SAME object as a
print(a is c)
# Output: True (same object)
# Example 2: The correct way to check for None
value = None
# CORRECT — use 'is' for None checks
if value is None:
print("Value is None")
# Output: Value is None
# ALSO CORRECT — 'is not' for the opposite
value = 42
if value is not None:
print(f"Value is {value}")
# Output: Value is 42
# WHY? None is a singleton — there's only ONE None object in Python
# So 'is' is both correct and faster than ==
# Example 3: Python's integer caching — a common gotcha
# Python caches small integers from -5 to 256
a = 256
b = 256
print(a is b)
# Output: True (cached — same object)
a = 257
b = 257
print(a is b)
# Output: False (not cached — different objects)
# LESSON: Never use 'is' to compare integers or strings
# Always use == for value comparison
print(a == b)
# Output: True (correct way to compare values)
🚫 What NOT to Say: "I use is to compare strings or numbers." -- Only use is for None checks. For everything else, use ==.
Q10 — How does Python handle memory management?
Question: Explain how Python manages memory. What is reference counting? What is garbage collection?
Quick Answer: Python uses reference counting (tracks how many variables point to each object) plus a cyclic garbage collector for circular references. When reference count hits 0, memory is freed immediately.
# Example 1: Reference counting in action
import sys
a = [1, 2, 3]
print(sys.getrefcount(a))
# Output: 2 (one for 'a', one for the getrefcount argument)
b = a # Another reference to the same list
print(sys.getrefcount(a))
# Output: 3 (a + b + getrefcount arg)
del b # Remove one reference
print(sys.getrefcount(a))
# Output: 2 (back to a + getrefcount arg)
# Example 2: Circular references — garbage collector handles these
import gc
class Node:
def __init__(self, name):
self.name = name
self.ref = None # Will create circular reference
# Create circular reference
a = Node("A")
b = Node("B")
a.ref = b # A points to B
b.ref = a # B points to A (circular!)
# Delete external references
del a
del b
# Reference count for both is still 1 (they point to each other)
# But Python's garbage collector detects and cleans circular references
collected = gc.collect() # Force garbage collection
print(f"Garbage collector cleaned up {collected} objects")
# Output: Garbage collector cleaned up 4 objects
# (the two Nodes plus their __dict__s; exact count varies by Python version)
# Example 3: Checking memory usage of objects
import sys
# Different types use different amounts of memory
print(f"int(0): {sys.getsizeof(0)} bytes")
# Output: int(0): 28 bytes
print(f"int(1): {sys.getsizeof(1)} bytes")
# Output: int(1): 28 bytes
print(f"str(''): {sys.getsizeof('')} bytes")
# Output: str(''): 49 bytes
print(f"str('hello'):{sys.getsizeof('hello')} bytes")
# Output: str('hello'):54 bytes
print(f"list([]): {sys.getsizeof([])} bytes")
# Output: list([]): 56 bytes
print(f"dict({{}}): {sys.getsizeof({})} bytes")
# Output: dict({}): 64 bytes
🎯 Tip: "In data pipelines, I watch for circular references in custom classes and use weakref when needed. For large DataFrames, I del them explicitly and call gc.collect() to free memory sooner."
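The weakref module mentioned in the tip lets one object refer to another without keeping it alive. In CPython, reference counting makes the collection below immediate:

```python
import weakref

class Record:
    pass

obj = Record()
r = weakref.ref(obj)   # a weak reference does NOT increase the refcount
print(r() is obj)
# Output: True — the target is still alive
del obj                # drop the last strong reference
print(r())
# Output: None — the weakref did not keep the object alive
```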
Q11 — What is a lambda function?
Question: What is a lambda function? How is it different from a regular function? When would you use it?
Quick Answer: A lambda is an anonymous, single-expression function. Used for short operations, especially with sorted(), map(), filter(). Cannot contain statements or multiple expressions.
# Example 1: Lambda vs regular function — equivalent code
# Lambda version
square = lambda x: x ** 2
print(square(5))
# Output: 25
# Equivalent regular function
def square_func(x):
return x ** 2
print(square_func(5))
# Output: 25
# Example 2: Lambda with sorted() — most common real-world use
# Sort list of dicts by a specific key
employees = [
{"name": "Charlie", "salary": 75000},
{"name": "Alice", "salary": 90000},
{"name": "Bob", "salary": 60000},
]
# Sort by salary ascending
by_salary = sorted(employees, key=lambda emp: emp["salary"])
for emp in by_salary:
print(f"{emp['name']}: {emp['salary']}")
# Output: Bob: 60000
# Output: Charlie: 75000
# Output: Alice: 90000
# Sort by name descending
by_name_desc = sorted(employees, key=lambda emp: emp["name"], reverse=True)
for emp in by_name_desc:
print(emp["name"])
# Output: Charlie
# Output: Bob
# Output: Alice
# Example 3: Lambda with map() and filter()
numbers = [1, 2, 3, 4, 5, 6, 7, 8]
# filter: keep only even numbers
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)
# Output: [2, 4, 6, 8]
# map: double each number
doubled = list(map(lambda x: x * 2, numbers))
print(doubled)
# Output: [2, 4, 6, 8, 10, 12, 14, 16]
# Note: List comprehensions are usually more Pythonic
evens_lc = [x for x in numbers if x % 2 == 0]
doubled_lc = [x * 2 for x in numbers]
print(evens_lc)
# Output: [2, 4, 6, 8]
print(doubled_lc)
# Output: [2, 4, 6, 8, 10, 12, 14, 16]
🚫 What NOT to Say: "Lambda can have multiple statements." -- Lambda is ONE expression only. No assignments, no loops, no if/else blocks (only ternary x if cond else y).
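The one conditional form a lambda does allow is the ternary expression:

```python
# Ternary inside a lambda — the only branching a lambda supports
parity = lambda x: "even" if x % 2 == 0 else "odd"
print(parity(4))
# Output: even
print(parity(7))
# Output: odd
```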
Q12 — Difference between append() and extend()?
Question: What is the difference between append() and extend() on a list? What about +=?
Quick Answer: append() adds one item (even if it's a list, it becomes a nested element). extend() adds each element individually from an iterable. += behaves like extend().
# Example 1: append() adds ONE item — even if that item is a list
a = [1, 2]
a.append([3, 4]) # Adds the entire list as a single element
print(a)
# Output: [1, 2, [3, 4]]
print(len(a))
# Output: 3 (the inner list counts as one element)
# Example 2: extend() unpacks the iterable and adds each element
b = [1, 2]
b.extend([3, 4]) # Adds 3 and 4 individually
print(b)
# Output: [1, 2, 3, 4]
print(len(b))
# Output: 4
# extend works with any iterable, not just lists
c = [1, 2]
c.extend("abc") # Strings are iterable — adds each character
print(c)
# Output: [1, 2, 'a', 'b', 'c']
# Example 3: += is equivalent to extend() (not append!)
d = [1, 2]
d += [3, 4] # Same as d.extend([3, 4])
print(d)
# Output: [1, 2, 3, 4]
# Comparison side by side
append_result = [1, 2]
extend_result = [1, 2]
pluseq_result = [1, 2]
append_result.append([3, 4])
extend_result.extend([3, 4])
pluseq_result += [3, 4]
print(f"append: {append_result}")
# Output: append: [1, 2, [3, 4]]
print(f"extend: {extend_result}")
# Output: extend: [1, 2, 3, 4]
print(f"+=: {pluseq_result}")
# Output: +=: [1, 2, 3, 4]
🎯 Tip: "A common bug: using append when you meant extend, causing nested lists. In data pipelines, extend is usually what you want when combining batches of records."
Q13 — What are Python generators?
Question: What is a generator in Python? How does yield work? Why are generators useful in data engineering?
Quick Answer: Generators are functions that use yield instead of return. They produce values lazily -- one at a time -- and pause between yields. Memory efficient for large data.
# Example 1: Simple generator with yield
def count_up_to(n):
"""Generator that yields numbers from 1 to n"""
i = 1
while i <= n:
yield i # Pause here, return value, resume on next call
i += 1
gen = count_up_to(5)
print(next(gen))
# Output: 1
print(next(gen))
# Output: 2
print(list(gen)) # Remaining values
# Output: [3, 4, 5]
# Example 2: Generator for processing data in chunks
def chunked_range(start, end, chunk_size):
"""Yield data in chunks — simulates batch processing"""
for i in range(start, end, chunk_size):
chunk = list(range(i, min(i + chunk_size, end)))
yield chunk # Only one chunk in memory at a time
for batch in chunked_range(0, 10, 3):
print(f"Processing batch: {batch}")
# Output: Processing batch: [0, 1, 2]
# Output: Processing batch: [3, 4, 5]
# Output: Processing batch: [6, 7, 8]
# Output: Processing batch: [9]
# Example 3: Generator pipeline — chaining generators for ETL
def generate_rows():
"""Extract: produce raw rows"""
data = [
{"name": "Alice", "score": 85},
{"name": "Bob", "score": 45},
{"name": "Charlie", "score": 92},
]
for row in data:
yield row
def filter_passing(rows, threshold=60):
"""Transform: keep only passing scores"""
for row in rows:
if row["score"] >= threshold:
yield row
def format_output(rows):
"""Transform: format for display"""
for row in rows:
yield f"{row['name']}: {row['score']}"
# Chain generators — no intermediate lists created
pipeline = format_output(filter_passing(generate_rows()))
for record in pipeline:
print(record)
# Output: Alice: 85
# Output: Charlie: 92
🎯 Tip: "In ETL pipelines, generators let me process millions of rows without loading everything into memory. I chain them together like Unix pipes."
Q14 — What is the difference between @staticmethod and @classmethod?
Question: Explain @staticmethod and @classmethod. When would you use each? How do they differ from regular methods?
Quick Answer: @staticmethod has no access to class or instance (just a function inside a class). @classmethod gets the class (cls) as its first argument and can access/modify class state.
# Example 1: All three method types compared
class DataProcessor:
default_format = "csv" # Class variable
def __init__(self, name):
self.name = name # Instance variable
def process(self):
"""Regular method — has access to instance (self)"""
return f"{self.name} processing as {self.default_format}"
@classmethod
def set_format(cls, fmt):
"""Class method — has access to class (cls), not instance"""
cls.default_format = fmt
return f"Format set to {fmt}"
@staticmethod
def validate_extension(filename):
"""Static method — no access to class or instance"""
return filename.endswith((".csv", ".json", ".parquet"))
# Regular method — needs an instance
dp = DataProcessor("Pipeline-1")
print(dp.process())
# Output: Pipeline-1 processing as csv
# Class method — can be called on the class itself
print(DataProcessor.set_format("parquet"))
# Output: Format set to parquet
# Static method — utility function, no class/instance needed
print(DataProcessor.validate_extension("data.csv"))
# Output: True
print(DataProcessor.validate_extension("data.exe"))
# Output: False
# Example 2: @classmethod as alternative constructor
class Config:
def __init__(self, host, port, db):
self.host = host
self.port = port
self.db = db
@classmethod
def from_string(cls, config_str):
"""Alternative constructor — parses a config string"""
host, port, db = config_str.split(":")
return cls(host, int(port), db) # cls() creates a new instance
def __repr__(self):
return f"Config({self.host}:{self.port}/{self.db})"
# Standard constructor
c1 = Config("localhost", 5432, "mydb")
print(c1)
# Output: Config(localhost:5432/mydb)
# Alternative constructor via classmethod
c2 = Config.from_string("prod-server:5432:analytics")
print(c2)
# Output: Config(prod-server:5432/analytics)
# Example 3: When to use which — decision guide
class DateUtils:
date_format = "%Y-%m-%d"
@staticmethod
def is_weekend(day_number):
"""No class or instance data needed — pure utility"""
return day_number >= 5 # 5=Sat, 6=Sun
@classmethod
def get_format(cls):
"""Needs access to class-level configuration"""
return cls.date_format
print(DateUtils.is_weekend(6))
# Output: True
print(DateUtils.get_format())
# Output: %Y-%m-%d
🎯 Tip: "Use @classmethod for factory methods (alternative constructors) and @staticmethod for utility functions that logically belong to the class but don't need class/instance data."
Q15 — How do you handle file reading in Python?
Question: What is the best practice for reading files in Python? How do you handle large files efficiently?
Quick Answer: Always use the with statement (context manager) to ensure files are closed properly, even if exceptions occur. For large files, read line by line instead of loading everything.
# Example 1: Reading a file with context manager
import json
# CORRECT — 'with' ensures file is closed even if an exception occurs
data = '{"name": "Alice", "role": "engineer"}'
# Simulating file read with json.loads (no external file needed)
config = json.loads(data)
print(config["name"])
# Output: Alice
print(config["role"])
# Output: engineer
# Pattern for real file reading:
# with open("config.json") as f:
# config = json.load(f)
# BAD — file may not be closed if an exception occurs
# f = open("config.json")
# config = json.load(f)
# f.close() # Skipped if exception happens above!
# Example 2: Reading large files — line by line (memory efficient)
import io
# Simulate a large CSV file
csv_content = "id,name,score\n1,Alice,85\n2,Bob,92\n3,Charlie,78\n"
fake_file = io.StringIO(csv_content)
# Line-by-line reading — only ONE line in memory at a time
row_count = 0
for line in fake_file:
row_count += 1
print(line.strip())
# Output: id,name,score
# Output: 1,Alice,85
# Output: 2,Bob,92
# Output: 3,Charlie,78
print(f"Total lines: {row_count}")
# Output: Total lines: 4
# Example 3: Reading in chunks — for binary or very large files
import io
# Simulate a large file
large_content = "A" * 100 # 100 characters
fake_file = io.StringIO(large_content)
chunk_size = 30
chunks_read = 0
while True:
chunk = fake_file.read(chunk_size)
if not chunk:
break
chunks_read += 1
print(f"Chunk {chunks_read}: {len(chunk)} chars")
# Output: Chunk 1: 30 chars
# Output: Chunk 2: 30 chars
# Output: Chunk 3: 30 chars
# Output: Chunk 4: 10 chars
print(f"Total chunks: {chunks_read}")
# Output: Total chunks: 4
🚫 What NOT to Say: "I use f = open(...) without with." -- That's a resource leak if an exception occurs. Always use context managers.
Q16 — What is a dictionary comprehension?
Question: What is a dictionary comprehension? How does it compare to building a dict with a loop? Give practical examples.
Quick Answer: A dictionary comprehension creates a dict in a single expression: {key: value for item in iterable}. It's more concise and often faster than a loop.
# Example 1: Basic dict comprehension — creating a mapping
# Square mapping
squares = {x: x ** 2 for x in range(6)}
print(squares)
# Output: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
# Equivalent loop (more verbose)
squares_loop = {}
for x in range(6):
squares_loop[x] = x ** 2
print(squares_loop)
# Output: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
# Example 2: Filtering with dict comprehension
scores = {"Alice": 85, "Bob": 55, "Charlie": 92, "Diana": 48}
# Keep only passing scores (>= 60)
passed = {k: v for k, v in sorted(scores.items()) if v >= 60}
print(passed)
# Output: {'Alice': 85, 'Charlie': 92}
# Invert a dictionary (swap keys and values)
inverted = {v: k for k, v in sorted(scores.items())}
print(inverted)
# Output: {48: 'Diana', 55: 'Bob', 85: 'Alice', 92: 'Charlie'}
# Example 3: Real-world use — transforming data structures
# Convert list of tuples to dict
raw_data = [("host", "localhost"), ("port", "5432"), ("db", "analytics")]
config = {k: v for k, v in raw_data}
print(config)
# Output: {'host': 'localhost', 'port': '5432', 'db': 'analytics'}
# Create a lookup table from a list of records
employees = [
{"id": 101, "name": "Alice"},
{"id": 102, "name": "Bob"},
{"id": 103, "name": "Charlie"},
]
lookup = {emp["id"]: emp["name"] for emp in employees}
print(lookup)
# Output: {101: 'Alice', 102: 'Bob', 103: 'Charlie'}
print(lookup[102])
# Output: Bob
🎯 Tip: "Dict comprehensions are great for building lookup tables. In data engineering, I use them to create column mappings, config transforms, and ID-to-name lookups."
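As a concrete sketch of the column-mapping use case from the tip (the column names are made up):

```python
raw_columns = ["User ID", "First Name", "Signup Date"]
# Normalize messy source headers to snake_case for a warehouse schema
rename_map = {col: col.lower().replace(" ", "_") for col in raw_columns}
print(rename_map)
# Output: {'User ID': 'user_id', 'First Name': 'first_name', 'Signup Date': 'signup_date'}
```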
Q17 — What is enumerate() and why use it?
Question: What does enumerate() do? Why is it better than using range(len(...))?
Quick Answer: enumerate() adds a counter to an iterable, returning (index, value) pairs. It's cleaner, more Pythonic, and less error-prone than manual indexing.
# Example 1: Bad vs Good — manual index vs enumerate
names = ["Alice", "Bob", "Charlie"]
# BAD — manual indexing with range(len(...))
for i in range(len(names)):
print(f"{i}: {names[i]}")
# Output: 0: Alice
# Output: 1: Bob
# Output: 2: Charlie
# GOOD — enumerate is cleaner and more Pythonic
for i, name in enumerate(names):
print(f"{i}: {name}")
# Output: 0: Alice
# Output: 1: Bob
# Output: 2: Charlie
# Example 2: Custom start index
tasks = ["Extract", "Transform", "Load"]
# Start counting from 1 instead of 0
for step, task in enumerate(tasks, start=1):
print(f"Step {step}: {task}")
# Output: Step 1: Extract
# Output: Step 2: Transform
# Output: Step 3: Load
# Example 3: Practical use — finding positions of matching elements
scores = [72, 85, 91, 45, 88, 55, 93]
threshold = 80
# Find indices of scores above threshold
high_scorers = [(i, score) for i, score in enumerate(scores) if score > threshold]
print(high_scorers)
# Output: [(1, 85), (2, 91), (4, 88), (6, 93)]
# Convert enumerate to a dict
index_map = dict(enumerate(["zero", "one", "two", "three"]))
print(index_map)
# Output: {0: 'zero', 1: 'one', 2: 'two', 3: 'three'}
🚫 What NOT to Say: "I use for i in range(len(list)) to iterate with indices." -- That's unpythonic. Always use enumerate().
Q18 — What is the difference between a module, package, and library?
Question: Explain the difference between a module, a package, and a library in Python. How does __init__.py work?
Quick Answer: A module is a single .py file. A package is a directory with __init__.py containing multiple modules. A library is a collection of packages (e.g., pandas, numpy).
# Example 1: Structure visualization
# myproject/ <- Project root
# main.py <- Script
# mypackage/ <- Package (directory with __init__.py)
# __init__.py <- Makes this directory a package
# utils.py <- Module (single .py file)
# models.py <- Module
# subpackage/ <- Sub-package
# __init__.py
# helpers.py <- Module
# Import a module from a package
# from mypackage import utils
# from mypackage.subpackage import helpers
# Example 2: Creating and using a simple module
# Simulating what a module looks like
# --- File: math_utils.py (this would be a module) ---
def add(a, b):
    return a + b
def multiply(a, b):
    return a * b
PI = 3.14159
# --- File: main.py (importing the module) ---
# import math_utils
# result = math_utils.add(3, 4)
# Or selective import:
# from math_utils import add, PI
# result = add(3, 4)
# Demonstrating the concept inline
print(add(3, 4))
# Output: 7
print(f"PI = {PI}")
# Output: PI = 3.14159
# Example 3: __init__.py controls what gets exported
# --- File: mypackage/__init__.py ---
# This file runs when you do: import mypackage
# You can expose specific items at the package level:
# from .utils import helper_function
# from .models import DataModel
# Then users can do:
# from mypackage import helper_function (clean import)
# Instead of:
# from mypackage.utils import helper_function (longer path)
# Check if something is a module or a package
import json
import os
print(type(json))
# Output: <class 'module'>
print(hasattr(json, '__path__'))
# Output: True (json is a package: a directory with __init__.py)
print(hasattr(os, '__path__'))
# Output: False (os is a single-file module: os.py)
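As a cross-check, the standard library's `importlib.util.find_spec` reports whether an import name resolves to a package: a package's spec carries submodule search locations, a plain module's does not.

```python
import importlib.util

# submodule_search_locations is non-None only for packages
spec = importlib.util.find_spec("json")
print(spec.submodule_search_locations is not None)
# Output: True  (json is a package in CPython)

spec = importlib.util.find_spec("os")
print(spec.submodule_search_locations is not None)
# Output: False (os is a single-file module)
```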
🎯 Tip: "In data engineering projects, I organize code into packages: etl/extract.py, etl/transform.py, etl/load.py with an __init__.py that exposes the main pipeline function."
Q19 — How does error handling work with try/except/else/finally?
Question: Explain Python's error handling. What is the purpose of else and finally in a try block? Give practical examples.
Quick Answer: try runs risky code. except catches specific exceptions. else runs only if no exception occurred. finally always runs (cleanup). Order matters.
# Example 1: Full try/except/else/finally structure
def divide(a, b):
    try:
        result = a / b  # Risky operation
    except ZeroDivisionError:
        print("Error: Cannot divide by zero!")
        return None
    except TypeError as e:
        print(f"Error: Wrong types — {e}")
        return None
    else:
        print(f"Success: {a} / {b} = {result}")  # Only if NO exception
        return result
    finally:
        print("Cleanup: division attempt complete")  # ALWAYS runs
print(divide(10, 3))
# Output: Success: 10 / 3 = 3.3333333333333335
# Output: Cleanup: division attempt complete
# Output: 3.3333333333333335
print(divide(10, 0))
# Output: Error: Cannot divide by zero!
# Output: Cleanup: division attempt complete
# Output: None
# Example 2: Catching multiple exception types
def parse_config(value):
    """Parse a config value — handle various errors"""
    try:
        # Try converting to int
        result = int(value)
        return result
    except ValueError:
        print(f"'{value}' is not a valid integer")
        return None
    except TypeError:
        print(f"Expected string or number, got {type(value).__name__}")
        return None
print(parse_config("42"))
# Output: 42
print(parse_config("abc"))
# Output: 'abc' is not a valid integer
# Output: None
print(parse_config(None))
# Output: Expected string or number, got NoneType
# Output: None
# Example 3: Custom exceptions for data pipelines
class DataValidationError(Exception):
    """Custom exception for data quality issues"""
    def __init__(self, column, message):
        self.column = column
        self.message = message
        super().__init__(f"Column '{column}': {message}")
def validate_row(row):
    if row.get("age") is not None and row["age"] < 0:
        raise DataValidationError("age", "negative value not allowed")
    if not row.get("name"):
        raise DataValidationError("name", "cannot be empty")
    return True
# Test validation
test_rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": -5},
    {"name": "", "age": 25},
]
for row in test_rows:
    try:
        validate_row(row)
        print(f"Valid: {row}")
    except DataValidationError as e:
        print(f"Invalid: {e}")
# Output: Valid: {'name': 'Alice', 'age': 30}
# Output: Invalid: Column 'age': negative value not allowed
# Output: Invalid: Column 'name': cannot be empty
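One detail worth demonstrating: finally runs even when the try block has already hit a return. A minimal sketch (the function name is made up):

```python
def fetch_value():
    try:
        return "data"            # the return value is computed and held...
    finally:
        print("finally ran")     # ...then finally executes before the function exits

result = fetch_value()
# Prints: finally ran
print(result)
# Output: data
```

This is why finally is the right place for cleanup like closing a DB connection: it cannot be skipped by an early return or an exception.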
🎯 Tip: "else runs only on success, finally always runs -- even if there's a return statement. In ETL, I use finally for closing DB connections and else for logging success."
Q20 — What is zip() and how do you unzip?
Question: What does zip() do? How do you unzip? What happens when iterables have different lengths?
Quick Answer: zip() pairs elements from multiple iterables into tuples. Unzip with zip(*zipped). In Python 3, zip() returns a lazy iterator and stops at the shortest iterable.
# Example 1: Basic zip — pairing two lists
names = ["Alice", "Bob", "Charlie"]
scores = [85, 92, 78]
paired = list(zip(names, scores))
print(paired)
# Output: [('Alice', 85), ('Bob', 92), ('Charlie', 78)]
# Create a dict from two lists
score_map = dict(zip(names, scores))
print(score_map)
# Output: {'Alice': 85, 'Bob': 92, 'Charlie': 78}
# Example 2: Unzipping with zip(*)
pairs = [("Alice", 85), ("Bob", 92), ("Charlie", 78)]
# Unzip — separate back into individual tuples
names_back, scores_back = zip(*pairs)
print(names_back)
# Output: ('Alice', 'Bob', 'Charlie')
print(scores_back)
# Output: (85, 92, 78)
# Convert back to lists if needed
print(list(names_back))
# Output: ['Alice', 'Bob', 'Charlie']
# Example 3: Unequal lengths and zip_longest
from itertools import zip_longest
names = ["Alice", "Bob", "Charlie"]
scores = [85, 92] # Shorter!
# Regular zip — stops at shortest
print(list(zip(names, scores)))
# Output: [('Alice', 85), ('Bob', 92)]
# Charlie is DROPPED silently!
# zip_longest — fills missing values with a default
print(list(zip_longest(names, scores, fillvalue=0)))
# Output: [('Alice', 85), ('Bob', 92), ('Charlie', 0)]
# Practical use: iterate multiple lists in parallel
columns = ["id", "name", "score"]
values = [101, "Alice", 85]
for col, val in zip(columns, values):
    print(f" {col} = {val}")
# Output: id = 101
# Output: name = Alice
# Output: score = 85
🚫 What NOT to Say: "zip returns a list." -- In Python 3, zip() returns a lazy iterator. Wrap in list() if you need a list. Also beware: regular zip silently drops extra elements from longer iterables.