Python Interview Questions for Data Science
Python has become the dominant programming language for data science. Whether you are interviewing for a data analyst, data scientist, or machine learning engineer role, you will likely face Python coding questions. This guide covers the essential Python concepts and questions you need to master for your data science interview.
Python Fundamentals for Data Science
Data Types and Structures
Before diving into pandas and numpy, make sure you understand Python's built-in data structures:
# Lists - ordered, mutable
numbers = [1, 2, 3, 4, 5]
numbers.append(6)
# Dictionaries - key-value pairs
user = {'name': 'Alice', 'age': 30, 'city': 'Mumbai'}
user['email'] = 'alice@example.com'
# Sets - unique values, unordered
unique_ids = {101, 102, 103, 101} # Results in {101, 102, 103}
# Tuples - ordered, immutable
coordinates = (10.5, 20.3)
# List comprehensions - powerful and common in data work
squares = [x**2 for x in range(10)]
even_squares = [x**2 for x in range(10) if x % 2 == 0]
Common Interview Question: Difference between list and tuple?
Lists are mutable (they can be changed after creation), while tuples are immutable. Tuples are slightly faster and, as long as their elements are hashable, can be used as dictionary keys. Use tuples for fixed collections and lists when you need to modify the data.
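For example:
# Lists can be modified in place; tuples cannot
scores = [10, 20, 30]
scores[0] = 99                  # fine
point = (10.5, 20.3)
# point[0] = 99                 # TypeError: 'tuple' object does not support item assignment
# Tuples (with hashable elements) work as dictionary keys
distances = {(0, 0): 0.0, (3, 4): 5.0}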
Pandas for Data Manipulation
Creating and Exploring DataFrames
import pandas as pd
import numpy as np
# Create DataFrame from dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 75000, 55000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})
# Basic exploration
df.head() # First 5 rows
df.tail(3) # Last 3 rows
df.shape # (rows, columns)
df.info() # Data types and memory
df.describe() # Statistical summary
df.columns # Column names
df.dtypes # Data types per column
Filtering and Selection
# Select single column
df['name']
# Select multiple columns
df[['name', 'salary']]
# Filter rows
df[df['age'] > 28]
df[df['department'] == 'Engineering']
# Multiple conditions
df[(df['age'] > 25) & (df['salary'] > 55000)]
df[(df['department'] == 'Engineering') | (df['department'] == 'Sales')]
# Using query method
df.query('age > 25 and salary > 55000')
# Using isin for multiple values
df[df['department'].isin(['Engineering', 'Sales'])]
Interview Question: What is the difference between loc and iloc?
# loc - label-based selection
df.loc[0, 'name'] # Row 0, column 'name'
df.loc[0:2, ['name', 'age']] # Rows 0 through 2 (loc slicing includes the end label), specific columns
# iloc - integer position-based selection
df.iloc[0, 0] # First row, first column
df.iloc[0:2, 0:2] # First 2 rows, first 2 columns
df.iloc[-1] # Last row
Grouping and Aggregation
# Basic groupby
df.groupby('department')['salary'].mean()
# Multiple aggregations
df.groupby('department').agg({
    'salary': ['mean', 'min', 'max'],
    'age': 'mean'
})
# Named aggregations (cleaner output)
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    total_employees=('name', 'count'),
    avg_age=('age', 'mean')
)
# Transform - returns same shape as original
df['salary_zscore'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)
Handling Missing Data
# Check for missing values
df.isnull().sum()
df.isna().any()
# Drop missing values
df.dropna() # Drop rows with any NaN
df.dropna(subset=['salary']) # Drop only if salary is NaN
df.dropna(thresh=2) # Keep rows with at least 2 non-NaN values
# Fill missing values
df['salary'].fillna(0)
df['salary'].fillna(df['salary'].mean())
df['salary'].ffill() # Forward fill (fillna(method='ffill') is deprecated in recent pandas)
df['salary'].bfill() # Backward fill
# Interpolation
df['value'].interpolate(method='linear')
Merging and Joining DataFrames
# Sample DataFrames
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 102, 101, 103],
    'amount': [150, 200, 300, 175]
})
customers = pd.DataFrame({
    'customer_id': [101, 102, 104],
    'name': ['Alice', 'Bob', 'Diana']
})
# Inner join (default)
pd.merge(orders, customers, on='customer_id')
# Left join
pd.merge(orders, customers, on='customer_id', how='left')
# When the key columns have different names, use left_on / right_on
pd.merge(orders, customers.rename(columns={'customer_id': 'cust_id'}),
         left_on='customer_id', right_on='cust_id')
# Concatenation (df1 and df2 stand in for any two compatible DataFrames)
pd.concat([df1, df2], axis=0) # Stack vertically
pd.concat([df1, df2], axis=1) # Stack horizontally
NumPy for Numerical Computing
Array Creation and Operations
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
ones = np.ones((2, 3)) # 2x3 matrix of ones
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced values
# Array operations (vectorized)
arr * 2 # Element-wise multiplication
arr + arr # Element-wise addition
np.sqrt(arr) # Square root of each element
np.exp(arr) # Exponential
# Statistical operations
arr.mean()
arr.std()
arr.sum()
arr.min(), arr.max()
np.percentile(arr, 50) # Median
Interview Question: Why use NumPy over Python lists?
NumPy arrays are faster and more memory-efficient than Python lists for numerical operations. NumPy uses contiguous memory blocks and vectorized operations implemented in C, while Python lists store pointers to objects scattered in memory. For large datasets, NumPy can be 10-100x faster.
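A rough way to see this yourself (exact numbers depend on your machine, but the gap is usually dramatic):
import numpy as np
import timeit
py_list = list(range(1_000_000))
np_arr = np.arange(1_000_000)
# Sum a million numbers: a Python-level loop vs. a vectorized NumPy call
list_time = timeit.timeit(lambda: sum(py_list), number=100)
numpy_time = timeit.timeit(lambda: np_arr.sum(), number=100)
print(f'list: {list_time:.3f}s, numpy: {numpy_time:.3f}s')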
Array Reshaping and Manipulation
# Reshaping
arr = np.arange(12)
arr.reshape(3, 4) # 3 rows, 4 columns
arr.reshape(2, -1) # 2 rows, auto-calculate columns
# Transposing
matrix = np.array([[1, 2], [3, 4], [5, 6]])
matrix.T # Transpose
# Stacking (assume arr1 and arr2 are arrays with compatible shapes)
np.vstack([arr1, arr2]) # Vertical stack
np.hstack([arr1, arr2]) # Horizontal stack
np.concatenate([arr1, arr2], axis=0)
Data Visualization
Matplotlib Basics
import matplotlib.pyplot as plt
# Line plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Sales', color='blue', linestyle='-', marker='o')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.title('Monthly Sales Trend')
plt.legend()
plt.grid(True)
plt.savefig('sales_chart.png', dpi=300)
plt.show()
# Bar chart
plt.bar(categories, values)
plt.xticks(rotation=45)
# Histogram
plt.hist(data, bins=30, edgecolor='black')
# Scatter plot
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5)
Seaborn for Statistical Visualization
import seaborn as sns
# Distribution plot
sns.histplot(data=df, x='salary', hue='department', kde=True)
# Box plot
sns.boxplot(data=df, x='department', y='salary')
# Scatter plot with regression line
sns.regplot(data=df, x='age', y='salary')
# Correlation heatmap (numeric_only=True skips non-numeric columns)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
# Pair plot for multiple variables
sns.pairplot(df, hue='department')
Machine Learning with Scikit-learn
Basic ML Workflow
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Prepare data
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# Predict and evaluate
y_pred = model.predict(X_test_scaled)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(classification_report(y_test, y_pred))
Interview Question: What is the difference between fit, transform, and fit_transform?
- fit: Learns parameters from the training data (like the mean and std for StandardScaler).
- transform: Applies the learned parameters to transform data.
- fit_transform: Does both in one step; use it only on the training data.
Always fit on training data only, then use transform on both training and test data to prevent data leakage.
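To make the leakage point concrete, here is a minimal sketch of the wrong and right patterns, reusing the variable names from the workflow above:
# Wrong: fitting on the full dataset lets test-set statistics leak into training
# scaler = StandardScaler().fit(X)
# Right: learn scaling parameters from the training split only, then reuse them
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform on training data
X_test_scaled = scaler.transform(X_test)         # apply training parameters to test data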
Common Coding Challenges
Challenge 1: Find the most common value in each group
def most_common_per_group(df, group_col, value_col):
    return df.groupby(group_col)[value_col].agg(
        lambda x: x.mode().iloc[0] if not x.mode().empty else None
    )
Challenge 2: Calculate rolling statistics
def add_rolling_stats(df, value_col, window=7):
    df[f'{value_col}_rolling_mean'] = df[value_col].rolling(window).mean()
    df[f'{value_col}_rolling_std'] = df[value_col].rolling(window).std()
    return df
Challenge 3: Remove outliers using IQR
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
Challenge 4: Create dummy variables and handle the dummy trap
def create_dummies(df, categorical_cols):
    # drop_first=True drops one dummy per category, avoiding the perfect collinearity known as the dummy variable trap
    return pd.get_dummies(df, columns=categorical_cols, drop_first=True)
Best Practices for Python Coding Interviews
- Write readable code: Use meaningful variable names and add comments
- Think before coding: Outline your approach verbally before writing
- Handle edge cases: Empty DataFrames, missing values, data type issues
- Use vectorized operations: Avoid loops when pandas or numpy operations exist (see the short example after this list)
- Test your code: Check with simple examples before finalizing
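To illustrate the vectorization point, here is a small sketch using the employee DataFrame from earlier (the 10% bonus rate is just an arbitrary example value):
# Slow: explicit Python loop over rows
bonuses = []
for _, row in df.iterrows():
    bonuses.append(row['salary'] * 0.10)
df['bonus'] = bonuses
# Fast: the vectorized version does the same work in a single expression
df['bonus'] = df['salary'] * 0.10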
Make sure your resume highlights your Python skills with specific libraries and projects you have worked on. Mentioning pandas, scikit-learn, and visualization libraries demonstrates practical data science readiness.