Python Interview Questions for Data Science

Python has become the dominant programming language for data science. Whether you are interviewing for a data analyst, data scientist, or machine learning engineer role, you will likely face Python coding questions. This guide covers the essential Python concepts and questions you need to master for your data science interview.

Python Fundamentals for Data Science

Data Types and Structures

Before diving into pandas and numpy, make sure you understand Python's built-in data structures:

# Lists - ordered, mutable
numbers = [1, 2, 3, 4, 5]
numbers.append(6)

# Dictionaries - key-value pairs
user = {'name': 'Alice', 'age': 30, 'city': 'Mumbai'}
user['email'] = 'alice@example.com'

# Sets - unique values, unordered
unique_ids = {101, 102, 103, 101}  # Results in {101, 102, 103}

# Tuples - ordered, immutable
coordinates = (10.5, 20.3)

# List comprehensions - powerful and common in data work
squares = [x**2 for x in range(10)]
even_squares = [x**2 for x in range(10) if x % 2 == 0]

Common Interview Question: What is the difference between a list and a tuple?

Lists are mutable (they can be changed after creation), while tuples are immutable. Tuples are slightly faster and, because they are hashable (as long as their elements are), they can be used as dictionary keys. Use tuples for fixed collections and lists when you need to modify the data.
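
For example, a tuple can serve as a dictionary key, which is handy for composite keys like coordinates, while a list raises a TypeError because it is unhashable (a small illustrative sketch; the variable names are made up):

# Tuples are hashable, so they can be dictionary keys
warehouse_by_location = {(10.5, 20.3): 'Warehouse A', (11.0, 22.1): 'Warehouse B'}

# Lists are not hashable, so this would fail
# warehouse_by_location[[10.5, 20.3]] = 'Warehouse C'  # TypeError: unhashable type: 'list'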

Pandas for Data Manipulation

Creating and Exploring DataFrames

import pandas as pd
import numpy as np

# Create DataFrame from dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 75000, 55000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})

# Basic exploration
df.head()           # First 5 rows
df.tail(3)          # Last 3 rows
df.shape            # (rows, columns)
df.info()           # Data types and memory
df.describe()       # Statistical summary
df.columns          # Column names
df.dtypes           # Data types per column

Filtering and Selection

# Select single column
df['name']

# Select multiple columns
df[['name', 'salary']]

# Filter rows
df[df['age'] > 28]
df[df['department'] == 'Engineering']

# Multiple conditions
df[(df['age'] > 25) & (df['salary'] > 55000)]
df[(df['department'] == 'Engineering') | (df['department'] == 'Sales')]

# Using query method
df.query('age > 25 and salary > 55000')

# Using isin for multiple values
df[df['department'].isin(['Engineering', 'Sales'])]

Interview Question: What is the difference between loc and iloc?

# loc - label-based selection (slices include the end label)
df.loc[0, 'name']           # Row 0, column 'name'
df.loc[0:2, ['name', 'age']] # Rows labelled 0 through 2 (3 rows), specific columns

# iloc - integer position-based selection (slices exclude the end position)
df.iloc[0, 0]               # First row, first column
df.iloc[0:2, 0:2]           # First 2 rows, first 2 columns
df.iloc[-1]                 # Last row

Grouping and Aggregation

# Basic groupby
df.groupby('department')['salary'].mean()

# Multiple aggregations
df.groupby('department').agg({
    'salary': ['mean', 'min', 'max'],
    'age': 'mean'
})

# Named aggregations (cleaner output)
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    total_employees=('name', 'count'),
    avg_age=('age', 'mean')
)

# Transform - returns same shape as original
df['salary_zscore'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)

Handling Missing Data

# Check for missing values
df.isnull().sum()
df.isna().any()

# Drop missing values
df.dropna()                    # Drop rows with any NaN
df.dropna(subset=['salary'])   # Drop only if salary is NaN
df.dropna(thresh=2)            # Keep rows with at least 2 non-NaN values

# Fill missing values
df['salary'].fillna(0)
df['salary'].fillna(df['salary'].mean())
df['salary'].ffill()                 # Forward fill (fillna(method=...) is deprecated)
df['salary'].bfill()                 # Backward fill

# Interpolation
df['value'].interpolate(method='linear')

Merging and Joining DataFrames

# Sample DataFrames
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 102, 101, 103],
    'amount': [150, 200, 300, 175]
})

customers = pd.DataFrame({
    'customer_id': [101, 102, 104],
    'name': ['Alice', 'Bob', 'Diana']
})

# Inner join (default)
pd.merge(orders, customers, on='customer_id')

# Left join
pd.merge(orders, customers, on='customer_id', how='left')

# When the key columns have different names, use left_on / right_on
# (here both are named 'customer_id', so this is equivalent to on='customer_id')
pd.merge(orders, customers, left_on='customer_id', right_on='customer_id')

# Concatenation (df1 and df2 stand for any two DataFrames)
pd.concat([df1, df2], axis=0)  # Stack vertically (rows)
pd.concat([df1, df2], axis=1)  # Stack horizontally (columns)
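
A quick sanity check on the sample frames above: customer 103 has no match in customers, and customer 104 has no order, so the join types produce different row counts. A minimal sketch using the orders and customers DataFrames defined above:

inner = pd.merge(orders, customers, on='customer_id')               # 3 rows: only matched customers
left = pd.merge(orders, customers, on='customer_id', how='left')    # 4 rows: customer 103 gets NaN name
outer = pd.merge(orders, customers, on='customer_id', how='outer')  # 5 rows: customer 104 added with NaN order fields
print(inner.shape, left.shape, outer.shape)                         # (3, 4) (4, 4) (5, 4)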

NumPy for Numerical Computing

Array Creation and Operations

import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
zeros = np.zeros((3, 4))       # 3x4 matrix of zeros
ones = np.ones((2, 3))         # 2x3 matrix of ones
range_arr = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced values

# Array operations (vectorized)
arr * 2                         # Element-wise multiplication
arr + arr                       # Element-wise addition
np.sqrt(arr)                    # Square root of each element
np.exp(arr)                     # Exponential

# Statistical operations
arr.mean()
arr.std()
arr.sum()
arr.min(), arr.max()
np.percentile(arr, 50)          # Median

Interview Question: Why use NumPy over Python lists?

NumPy arrays are faster and more memory-efficient than Python lists for numerical operations. NumPy uses contiguous memory blocks and vectorized operations implemented in C, while Python lists store pointers to objects scattered in memory. For large datasets, NumPy can be 10-100x faster.
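
A minimal sketch to see this yourself, using only NumPy and the standard library (exact timings vary by machine):

import timeit
import numpy as np

py_list = list(range(1_000_000))
np_arr = np.arange(1_000_000)

# Sum one million numbers: built-in sum over a list vs vectorized NumPy sum
list_time = timeit.timeit(lambda: sum(py_list), number=10)
numpy_time = timeit.timeit(lambda: np_arr.sum(), number=10)
print(f'list sum: {list_time:.4f}s, numpy sum: {numpy_time:.4f}s')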

Array Reshaping and Manipulation

# Reshaping
arr = np.arange(12)
arr.reshape(3, 4)              # 3 rows, 4 columns
arr.reshape(2, -1)             # 2 rows, auto-calculate columns

# Transposing
matrix = np.array([[1, 2], [3, 4], [5, 6]])
matrix.T                        # Transpose

# Stacking (arr1 and arr2 stand for any arrays with compatible shapes)
np.vstack([arr1, arr2])        # Vertical stack
np.hstack([arr1, arr2])        # Horizontal stack
np.concatenate([arr1, arr2], axis=0)

Data Visualization

Matplotlib Basics

import matplotlib.pyplot as plt

# Line plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Sales', color='blue', linestyle='-', marker='o')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.title('Monthly Sales Trend')
plt.legend()
plt.grid(True)
plt.savefig('sales_chart.png', dpi=300)
plt.show()

# Bar chart
plt.bar(categories, values)
plt.xticks(rotation=45)

# Histogram
plt.hist(data, bins=30, edgecolor='black')

# Scatter plot
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5)

Seaborn for Statistical Visualization

import seaborn as sns

# Distribution plot
sns.histplot(data=df, x='salary', hue='department', kde=True)

# Box plot
sns.boxplot(data=df, x='department', y='salary')

# Scatter plot with regression line
sns.regplot(data=df, x='age', y='salary')

# Correlation heatmap (numeric_only avoids errors from non-numeric columns)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

# Pair plot for multiple variables
sns.pairplot(df, hue='department')

Machine Learning with Scikit-learn

Basic ML Workflow

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Prepare data
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_scaled)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(classification_report(y_test, y_pred))

Interview Question: What is the difference between fit, transform, and fit_transform?

fit: Learns parameters from the training data (like mean and std for StandardScaler).

transform: Applies learned parameters to transform data.

fit_transform: Does both in one step; use it only on training data.

Always fit on training data only, then use transform on both training and test data to prevent data leakage.
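
One common way to make this foolproof is to wrap the scaler and the model in a scikit-learn Pipeline, so the scaler is fit on training data only whenever the pipeline is fit. A minimal sketch, reusing the X_train/X_test split from the workflow above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X_train, y_train)      # scaling statistics are learned from training data only
y_pred = pipe.predict(X_test)   # the same statistics are reused to transform test data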

Common Coding Challenges

Challenge 1: Find the most common value in each group

def most_common_per_group(df, group_col, value_col):
    return df.groupby(group_col)[value_col].agg(
        lambda x: x.mode().iloc[0] if not x.mode().empty else None
    )
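
A quick usage sketch with a small, made-up DataFrame (the column names and values are purely illustrative):

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'A', 'B', 'C', 'B']
})
print(most_common_per_group(sales, 'region', 'product'))
# North -> 'A', South -> 'B'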

Challenge 2: Calculate rolling statistics

def add_rolling_stats(df, value_col, window=7):
    df[f'{value_col}_rolling_mean'] = df[value_col].rolling(window).mean()
    df[f'{value_col}_rolling_std'] = df[value_col].rolling(window).std()
    return df
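
Usage sketch on a small hypothetical series (window=3 keeps the example readable; the first window - 1 rolling values are NaN):

daily = pd.DataFrame({'sales': [10, 12, 11, 15, 14, 20, 18]})
daily = add_rolling_stats(daily, 'sales', window=3)
print(daily[['sales', 'sales_rolling_mean', 'sales_rolling_std']])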

Challenge 3: Remove outliers using IQR

def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
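
Usage sketch with one obvious outlier (hypothetical values):

scores = pd.DataFrame({'score': [55, 60, 62, 58, 61, 59, 300]})
clean = remove_outliers_iqr(scores, 'score')
print(len(scores), '->', len(clean))  # 7 -> 6; the 300 row falls outside the IQR bounds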

Challenge 4: Create dummy variables and handle the dummy trap

def create_dummies(df, categorical_cols):
    return pd.get_dummies(df, columns=categorical_cols, drop_first=True)
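
Usage sketch on the sample df from earlier. With drop_first=True, one category per column is dropped, leaving k-1 dummy columns and avoiding the dummy variable trap (perfectly collinear indicators):

encoded = create_dummies(df, ['department'])
print(encoded.columns.tolist())
# One department category is dropped; the rest become 0/1 indicator columns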

Best Practices for Python Coding Interviews

  • Write readable code: Use meaningful variable names and add comments
  • Think before coding: Outline your approach verbally before writing
  • Handle edge cases: Empty DataFrames, missing values, data type issues
  • Use vectorized operations: Avoid loops when pandas or numpy operations exist (see the sketch after this list)
  • Test your code: Check with simple examples before finalizing
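
As an illustration of the vectorization point above, here is a sketch using the sample df; the 10% bonus column is hypothetical:

# Slow: iterating row by row with iterrows
bonuses = []
for _, row in df.iterrows():
    bonuses.append(row['salary'] * 0.10)
df['bonus'] = bonuses

# Fast: one vectorized column operation
df['bonus'] = df['salary'] * 0.10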

Make sure your resume highlights your Python skills with specific libraries and projects you have worked on. Mentioning pandas, scikit-learn, and visualization libraries demonstrates practical data science readiness.

Frequently Asked Questions

Do I need to know Python for data analyst interviews?

While SQL is often the primary technical requirement for data analyst roles, Python knowledge is increasingly expected, especially at tech companies and startups. For data scientist roles, Python proficiency is almost always required. Knowing pandas, basic statistics in Python, and visualization libraries gives you a significant advantage.

Should I learn Python or R for data science interviews?

Python is the safer choice for most data science interviews in 2026. It is more versatile, has stronger industry adoption, and is used across more companies. R remains valuable in academia, biostatistics, and some finance roles. If you can only learn one, choose Python.

How much Python do I need to know for a data science interview?

You should be comfortable with pandas for data manipulation, numpy for numerical operations, matplotlib or seaborn for visualization, and scikit-learn for basic machine learning. You do not need to be a software engineer, but you should write clean, readable code and understand Python fundamentals like data structures, functions, and object-oriented basics.
