Machine Learning Interview: Key Concepts
Machine learning interviews can range from conceptual questions about algorithms to hands-on coding challenges. This guide covers the essential ML concepts you need to understand for data science interviews, with clear explanations and practical examples.
Supervised vs Unsupervised Learning
Supervised Learning
In supervised learning, we have labeled data: input features X and known output labels y. The model learns to map inputs to outputs.
Types:
- Classification: Predicting discrete categories (spam/not spam, customer churn)
- Regression: Predicting continuous values (house prices, sales forecasts)
Unsupervised Learning
In unsupervised learning, we only have input features X with no labels. The model finds patterns or structure in the data.
Types:
- Clustering: Grouping similar data points (customer segments, document topics)
- Dimensionality Reduction: Reducing features while preserving information (PCA, t-SNE)
- Anomaly Detection: Finding unusual data points (fraud detection)
Interview Question: Give an example where you would use unsupervised learning
Customer segmentation is a classic example. You have customer data (purchase history, demographics, behavior) but no predefined segments. Clustering algorithms can discover natural groupings that the business can then use for targeted marketing.
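A minimal sketch of how this might look with scikit-learn's KMeans. The customer feature matrix and the choice of four clusters are purely illustrative; in practice you would pick k with the elbow method or silhouette score.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# X_customers is a hypothetical feature matrix (e.g., spend, visit frequency, basket size)
X_customers_scaled = StandardScaler().fit_transform(X_customers)
# n_clusters=4 is illustrative; choose k via elbow method or silhouette score
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_customers_scaled)  # one segment label per customer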
Common Algorithms
Linear Regression
Predicts continuous output as a linear combination of input features.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f'R2 Score: {r2_score(y_test, predictions):.3f}')
print(f'RMSE: {mean_squared_error(y_test, predictions) ** 0.5:.3f}')  # RMSE = sqrt(MSE)
Logistic Regression
Despite the name, logistic regression is for classification. It predicts the probability of belonging to a class.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]
print(f'Accuracy: {accuracy_score(y_test, predictions):.3f}')
Decision Trees
Makes decisions by splitting data based on feature thresholds. Easy to interpret but prone to overfitting.
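A minimal example: capping max_depth is one simple guard against the overfitting mentioned above (the depth value here is illustrative).
from sklearn.tree import DecisionTreeClassifier
# Limiting tree depth restricts how finely the data can be split
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print(f'Train accuracy: {tree.score(X_train, y_train):.3f}')
print(f'Test accuracy: {tree.score(X_test, y_test):.3f}')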
Random Forest
Ensemble of many decision trees. Each tree is trained on a random subset of data and features. Predictions are averaged (regression) or voted (classification).
Interview Question: Why does Random Forest reduce overfitting compared to a single decision tree?
By training many trees on different subsets of data and features, individual tree errors tend to cancel out. The averaging/voting process reduces variance while maintaining low bias. This is the power of ensemble methods.
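A short sketch of fitting a Random Forest in scikit-learn; n_estimators=200 is an illustrative starting point, not a tuned choice.
from sklearn.ensemble import RandomForestClassifier
# Each tree sees a bootstrap sample of rows and a random subset of features at each split
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(f'Test accuracy: {rf.score(X_test, y_test):.3f}')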
Gradient Boosting (XGBoost, LightGBM)
Builds trees sequentially, with each new tree correcting errors from previous trees. Often achieves best performance on tabular data.
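scikit-learn's GradientBoostingClassifier illustrates the same sequential idea that XGBoost and LightGBM implement; the hyperparameter values below are common defaults, not tuned choices.
from sklearn.ensemble import GradientBoostingClassifier
# learning_rate shrinks each tree's contribution; smaller values typically need more trees
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
print(f'Test accuracy: {gb.score(X_test, y_test):.3f}')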
K-Nearest Neighbors (KNN)
Classifies points based on the majority class of their k nearest neighbors. Simple but slow for large datasets.
Support Vector Machines (SVM)
Finds the hyperplane that best separates classes. Effective in high-dimensional spaces. Kernel trick allows nonlinear boundaries.
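Both are available in scikit-learn; a minimal sketch (k=5 and the RBF kernel are common defaults, not tuned choices). Note that both benefit from the feature scaling discussed later in this guide.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
knn = KNeighborsClassifier(n_neighbors=5)  # majority vote among the 5 nearest points
svm = SVC(kernel='rbf', C=1.0)             # RBF kernel allows a nonlinear decision boundary
for name, clf in [('KNN', knn), ('SVM', svm)]:
    clf.fit(X_train, y_train)
    print(f'{name} test accuracy: {clf.score(X_test, y_test):.3f}')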
The Bias-Variance Tradeoff
Interview Question: Explain the bias-variance tradeoff
Bias: Error from overly simplistic assumptions. High bias models underfit; they fail to capture the underlying pattern.
Variance: Error from sensitivity to small fluctuations in training data. High variance models overfit; they memorize noise.
Total Error = Bias^2 + Variance + Irreducible Error
As model complexity increases, bias decreases but variance increases. The goal is to find the sweet spot that minimizes total error.
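One way to see this empirically is a validation curve: as a decision tree gets deeper, training accuracy keeps rising while validation accuracy eventually falls. A minimal sketch, assuming a classification dataset X, y:
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
# Vary model complexity (tree depth) and compare training vs validation accuracy
depths = [1, 2, 4, 8, 16]
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name='max_depth', param_range=depths, cv=5
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f'depth={d}: train={tr:.3f}, validation={va:.3f}')
# Shallow trees underfit (both scores low); deep trees overfit (train high, validation drops)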
Overfitting vs Underfitting
Overfitting: Model performs well on training data but poorly on new data. Signs: training error much lower than validation error.
Underfitting: Model performs poorly on both training and new data. Signs: high error on both training and validation sets.
How to Address Overfitting
- Get more training data
- Reduce model complexity: Fewer features, simpler algorithms
- Regularization: L1 (Lasso), L2 (Ridge), or ElasticNet (see the sketch after this list)
- Cross-validation: Use to select hyperparameters
- Early stopping: For iterative algorithms
- Dropout: For neural networks
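For the regularization bullet above, a minimal sketch of L1 and L2 penalties on a linear model, assuming a regression-style X_train and y_train; the alpha values are illustrative.
from sklearn.linear_model import Ridge, Lasso
# Larger alpha = stronger penalty on coefficient size = simpler model
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1: can set some coefficients exactly to zero
print(f'Ridge R2: {ridge.score(X_test, y_test):.3f}')
print(f'Lasso R2: {lasso.score(X_test, y_test):.3f}')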
Model Evaluation
Classification Metrics
Accuracy: (TP + TN) / Total. Can be misleading with imbalanced classes.
Precision: TP / (TP + FP). Of predicted positives, how many are correct?
Recall (Sensitivity): TP / (TP + FN). Of actual positives, how many did we find?
F1 Score: Harmonic mean of precision and recall. 2 * (Precision * Recall) / (Precision + Recall)
AUC-ROC: Area under the ROC curve. Measures ability to distinguish classes across all thresholds.
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(f'AUC-ROC: {roc_auc_score(y_test, probabilities):.3f}')
Interview Question: When would you optimize for precision vs recall?
High precision priority: When false positives are costly. Example: spam filter where legitimate emails should never be blocked.
High recall priority: When false negatives are costly. Example: cancer screening where missing a case is dangerous.
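One practical lever is the classification threshold: lowering it trades precision for recall. A minimal sketch reusing the probabilities computed in the logistic regression example; the threshold values are illustrative.
from sklearn.metrics import precision_score, recall_score
# Default threshold is 0.5; lowering it catches more positives (higher recall, lower precision)
for threshold in [0.3, 0.5, 0.7]:
    preds = (probabilities >= threshold).astype(int)
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    print(f'threshold={threshold}: precision={p:.3f}, recall={r:.3f}')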
Regression Metrics
MAE (Mean Absolute Error): Average absolute difference. Robust to outliers.
MSE (Mean Squared Error): Average squared difference. Penalizes large errors more.
RMSE: Square root of MSE. Same units as target variable.
R-squared: Proportion of variance explained. 1 is perfect, 0 is baseline.
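These can all be computed directly in scikit-learn, reusing the linear regression predictions from earlier.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = mse ** 0.5                      # same units as the target variable
r2 = r2_score(y_test, predictions)
print(f'MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}')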
Cross-Validation
Why Use Cross-Validation?
A single train/test split can be noisy. Cross-validation provides more robust estimates of model performance by using multiple splits.
K-Fold Cross-Validation
Split data into k folds. Train on k-1 folds, validate on remaining fold. Repeat k times and average results.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'CV Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})')
Interview Question: What is data leakage and how do you prevent it?
Data leakage occurs when information from the test set influences training. This leads to overly optimistic performance estimates that do not hold in production.
Common causes:
- Scaling/normalizing before splitting data
- Using future information to predict past events
- Including target-derived features
Prevention: Always split data first, then fit preprocessing steps only on training data.
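A Pipeline makes this automatic: preprocessing is refit on the training portion of each cross-validation fold, so no test information leaks in. A minimal sketch:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# The scaler is fit only on the training folds inside each CV split
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f'Leakage-free CV accuracy: {scores.mean():.3f}')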
Feature Engineering
Why Feature Engineering Matters
Good features often matter more than algorithm choice. A simple model with excellent features can outperform a complex model with poor features.
Common Techniques
Handling missing values:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median') # or 'mean', 'most_frequent'
X_imputed = imputer.fit_transform(X)
Encoding categorical variables:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
# One-hot encoding (for nominal categories)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X[['category_column']])
# Ordinal encoding (for ordered categories; LabelEncoder is intended for target labels, not features)
oe = OrdinalEncoder()  # pass categories=[...] to specify the true order rather than alphabetical
X['encoded'] = oe.fit_transform(X[['ordinal_column']]).ravel()
Feature scaling:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Normalization (range 0-1)
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)
Interview Question: When is feature scaling necessary?
Feature scaling is important for:
- Distance-based algorithms (KNN, SVM, K-means)
- Gradient descent optimization (neural networks, logistic regression)
- Regularized models (features should be on similar scales for fair penalization)
Tree-based models (Random Forest, XGBoost) generally do not require scaling.
Handling Imbalanced Data
The Problem
When one class is much rarer than another (e.g., 1% fraud rate), models can achieve high accuracy by predicting the majority class for everything while failing to detect the minority class.
Solutions
- Resampling: Oversample minority class (SMOTE) or undersample majority class
- Class weights: Tell the algorithm to penalize minority class errors more
- Different metrics: Use precision, recall, F1, or AUC instead of accuracy
- Threshold tuning: Adjust classification threshold based on business needs
# Using class weights
model = LogisticRegression(class_weight='balanced')
# Using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
Model Selection and Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='f1'
)
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
When to Use Which Algorithm
Start simple: Logistic regression for classification, linear regression for regression. These establish baselines.
Tabular data: Tree ensembles (Random Forest, XGBoost, LightGBM) often perform best.
Small datasets: Simpler models to avoid overfitting. Regularization is crucial.
Interpretability needed: Linear models, decision trees, or use SHAP values for complex models.
Interview Tips
- Explain your reasoning: Why did you choose this algorithm? What assumptions does it make?
- Discuss tradeoffs: Accuracy vs interpretability, speed vs accuracy
- Think about production: How would the model be deployed? How would you monitor it?
- Ask clarifying questions: What is the business problem? What data is available?
When preparing your resume for ML roles, highlight specific models you have built, the metrics you optimized, and the business impact of your work. Mentioning experience with scikit-learn, XGBoost, or deep learning frameworks demonstrates practical ML skills.