Machine Learning Interview: Key Concepts
Machine learning interviews can range from conceptual questions about algorithms to hands-on coding challenges. This guide covers the essential ML concepts you need to understand for data science interviews, with clear explanations and practical examples.
Supervised vs Unsupervised Learning
Supervised Learning
In supervised learning, we have labeled data: input features X and known output labels y. The model learns to map inputs to outputs.
Types:
- Classification: Predicting discrete categories (spam/not spam, customer churn)
- Regression: Predicting continuous values (house prices, sales forecasts)
Unsupervised Learning
In unsupervised learning, we only have input features X with no labels. The model finds patterns or structure in the data.
Types:
- Clustering: Grouping similar data points (customer segments, document topics)
- Dimensionality Reduction: Reducing features while preserving information (PCA, t-SNE)
- Anomaly Detection: Finding unusual data points (fraud detection)
Interview Question: Give an example where you would use unsupervised learning
Customer segmentation is a classic example. You have customer data (purchase history, demographics, behavior) but no predefined segments. Clustering algorithms can discover natural groupings that the business can then use for targeted marketing.
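A minimal sketch of how this might look with scikit-learn's KMeans. The customer feature matrix and the choice of four clusters are purely illustrative; in practice you would pick k with the elbow method or silhouette score.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# X_customers is a hypothetical feature matrix (e.g., spend, visit frequency, basket size)
X_customers_scaled = StandardScaler().fit_transform(X_customers)
# n_clusters=4 is illustrative; choose k via elbow method or silhouette score
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_customers_scaled)  # one segment label per customer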
Common Algorithms
Linear Regression
Predicts continuous output as a linear combination of input features.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f'R2 Score: {r2_score(y_test, predictions):.3f}')
print(f'RMSE: {mean_squared_error(y_test, predictions) ** 0.5:.3f}')  # RMSE = sqrt(MSE)
Logistic Regression
Despite the name, logistic regression is for classification. It predicts the probability of belonging to a class.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]
print(f'Accuracy: {accuracy_score(y_test, predictions):.3f}')
Decision Trees
Makes decisions by splitting data based on feature thresholds. Easy to interpret but prone to overfitting.
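A minimal example: capping max_depth is one simple guard against the overfitting mentioned above (the depth value here is illustrative).
from sklearn.tree import DecisionTreeClassifier
# Limiting tree depth restricts how finely the data can be split
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print(f'Train accuracy: {tree.score(X_train, y_train):.3f}')
print(f'Test accuracy: {tree.score(X_test, y_test):.3f}')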
Random Forest
Ensemble of many decision trees. Each tree is trained on a random subset of data and features. Predictions are averaged (regression) or voted (classification).
Interview Question: Why does Random Forest reduce overfitting compared to a single decision tree?
By training many trees on different subsets of data and features, individual tree errors tend to cancel out. The averaging/voting process reduces variance while maintaining low bias. This is the power of ensemble methods.
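A short sketch of fitting a Random Forest in scikit-learn; n_estimators=200 is an illustrative starting point, not a tuned choice.
from sklearn.ensemble import RandomForestClassifier
# Each tree sees a bootstrap sample of rows and a random subset of features at each split
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(f'Test accuracy: {rf.score(X_test, y_test):.3f}')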
Gradient Boosting (XGBoost, LightGBM)
Builds trees sequentially, with each new tree correcting errors from previous trees. Often achieves best performance on tabular data.
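scikit-learn's GradientBoostingClassifier illustrates the same sequential idea that XGBoost and LightGBM implement; the hyperparameter values below are common defaults, not tuned choices.
from sklearn.ensemble import GradientBoostingClassifier
# learning_rate shrinks each tree's contribution; smaller values typically need more trees
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
print(f'Test accuracy: {gb.score(X_test, y_test):.3f}')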
K-Nearest Neighbors (KNN)
Classifies points based on the majority class of their k nearest neighbors. Simple but slow for large datasets.
Support Vector Machines (SVM)
Finds the hyperplane that best separates classes. Effective in high-dimensional spaces. Kernel trick allows nonlinear boundaries.
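Both are available in scikit-learn; a minimal sketch (k=5 and the RBF kernel are common defaults, not tuned choices). Note that both benefit from the feature scaling discussed later in this guide.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
knn = KNeighborsClassifier(n_neighbors=5)  # majority vote among the 5 nearest points
svm = SVC(kernel='rbf', C=1.0)             # RBF kernel allows a nonlinear decision boundary
for name, clf in [('KNN', knn), ('SVM', svm)]:
    clf.fit(X_train, y_train)
    print(f'{name} test accuracy: {clf.score(X_test, y_test):.3f}')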
The Bias-Variance Tradeoff
Interview Question: Explain the bias-variance tradeoff
Bias: Error from overly simplistic assumptions. High bias models underfit; they fail to capture the underlying pattern.
Variance: Error from sensitivity to small fluctuations in training data. High variance models overfit; they memorize noise.
Total Error = Bias^2 + Variance + Irreducible Error
As model complexity increases, bias decreases but variance increases. The goal is to find the sweet spot that minimizes total error.
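One way to see this empirically is a validation curve: as a decision tree gets deeper, training accuracy keeps rising while validation accuracy eventually falls. A minimal sketch, assuming a classification dataset X, y:
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
# Vary model complexity (tree depth) and compare training vs validation accuracy
depths = [1, 2, 4, 8, 16]
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name='max_depth', param_range=depths, cv=5
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f'depth={d}: train={tr:.3f}, validation={va:.3f}')
# Shallow trees underfit (both scores low); deep trees overfit (train high, validation drops)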
Overfitting vs Underfitting
Overfitting: Model performs well on training data but poorly on new data. Signs: training error much lower than validation error.
Underfitting: Model performs poorly on both training and new data. Signs: high error on both training and validation sets.
How to Address Overfitting
- Get more training data
- Reduce model complexity: Fewer features, simpler algorithms
- Regularization: L1 (Lasso), L2 (Ridge), or ElasticNet (see the sketch after this list)
- Cross-validation: Use to select hyperparameters
- Early stopping: For iterative algorithms
- Dropout: For neural networks
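For the regularization bullet above, a minimal sketch of L1 and L2 penalties on a linear model, assuming a regression-style X_train and y_train; the alpha values are illustrative.
from sklearn.linear_model import Ridge, Lasso
# Larger alpha = stronger penalty on coefficient size = simpler model
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1: can set some coefficients exactly to zero
print(f'Ridge R2: {ridge.score(X_test, y_test):.3f}')
print(f'Lasso R2: {lasso.score(X_test, y_test):.3f}')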
Model Evaluation
Classification Metrics
Accuracy: (TP + TN) / Total. Can be misleading with imbalanced classes.
Precision: TP / (TP + FP). Of predicted positives, how many are correct?
Recall (Sensitivity): TP / (TP + FN). Of actual positives, how many did we find?
F1 Score: Harmonic mean of precision and recall. 2 * (Precision * Recall) / (Precision + Recall)
AUC-ROC: Area under the ROC curve. Measures ability to distinguish classes across all thresholds.
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(f'AUC-ROC: {roc_auc_score(y_test, probabilities):.3f}')
Interview Question: When would you optimize for precision vs recall?
High precision priority: When false positives are costly. Example: spam filter where legitimate emails should never be blocked.
High recall priority: When false negatives are costly. Example: cancer screening where missing a case is dangerous.
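One practical lever is the classification threshold: lowering it trades precision for recall. A minimal sketch reusing the probabilities computed in the logistic regression example; the threshold values are illustrative.
from sklearn.metrics import precision_score, recall_score
# Default threshold is 0.5; lowering it catches more positives (higher recall, lower precision)
for threshold in [0.3, 0.5, 0.7]:
    preds = (probabilities >= threshold).astype(int)
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    print(f'threshold={threshold}: precision={p:.3f}, recall={r:.3f}')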
Regression Metrics
MAE (Mean Absolute Error): Average absolute difference. Robust to outliers.
MSE (Mean Squared Error): Average squared difference. Penalizes large errors more.
RMSE: Square root of MSE. Same units as target variable.
R-squared: Proportion of variance explained. 1 is perfect, 0 is baseline.
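These can all be computed directly in scikit-learn, reusing the linear regression predictions from earlier.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = mse ** 0.5                      # same units as the target variable
r2 = r2_score(y_test, predictions)
print(f'MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}')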
Cross-Validation
Why Use Cross-Validation?
A single train/test split can be noisy. Cross-validation provides more robust estimates of model performance by using multiple splits.
K-Fold Cross-Validation
Split data into k folds. Train on k-1 folds, validate on remaining fold. Repeat k times and average results.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'CV Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})')
Interview Question: What is data leakage and how do you prevent it?
Data leakage occurs when information from the test set influences training. This leads to overly optimistic performance estimates that do not hold in production.
Common causes:
- Scaling/normalizing before splitting data
- Using future information to predict past events
- Including target-derived features
Prevention: Always split data first, then fit preprocessing steps only on training data.
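A Pipeline makes this automatic: preprocessing is refit on the training portion of each cross-validation fold, so no test information leaks in. A minimal sketch:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# The scaler is fit only on the training folds inside each CV split
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f'Leakage-free CV accuracy: {scores.mean():.3f}')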
Feature Engineering
Why Feature Engineering Matters
Good features often matter more than algorithm choice. A simple model with excellent features can outperform a complex model with poor features.
Common Techniques
Handling missing values:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median') # or 'mean', 'most_frequent'
X_imputed = imputer.fit_transform(X)
Encoding categorical variables:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
# One-hot encoding (for nominal categories)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X[['category_column']])
# Ordinal encoding (for ordered categories; LabelEncoder is intended for target labels, not features)
oe = OrdinalEncoder()  # pass categories=[...] to specify the true order rather than alphabetical
X['encoded'] = oe.fit_transform(X[['ordinal_column']]).ravel()
Feature scaling:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Normalization (range 0-1)
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)
Interview Question: When is feature scaling necessary?
Feature scaling is important for:
- Distance-based algorithms (KNN, SVM, K-means)
- Gradient descent optimization (neural networks, logistic regression)
- Regularized models (features should be on similar scales for fair penalization)
Tree-based models (Random Forest, XGBoost) generally do not require scaling.
Handling Imbalanced Data
The Problem
When one class is much rarer than another (e.g., 1% fraud rate), models can achieve high accuracy by predicting the majority class for everything while failing to detect the minority class.
Solutions
- Resampling: Oversample minority class (SMOTE) or undersample majority class
- Class weights: Tell the algorithm to penalize minority class errors more
- Different metrics: Use precision, recall, F1, or AUC instead of accuracy
- Threshold tuning: Adjust classification threshold based on business needs
# Using class weights
model = LogisticRegression(class_weight='balanced')
# Using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
Model Selection and Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='f1'
)
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
When to Use Which Algorithm
Start simple: Logistic regression for classification, linear regression for regression. These establish baselines.
Tabular data: Tree ensembles (Random Forest, XGBoost, LightGBM) often perform best.
Small datasets: Simpler models to avoid overfitting. Regularization is crucial.
Interpretability needed: Linear models, decision trees, or use SHAP values for complex models.
Interview Tips
- Explain your reasoning: Why did you choose this algorithm? What assumptions does it make?
- Discuss tradeoffs: Accuracy vs interpretability, speed vs accuracy
- Think about production: How would the model be deployed? How would you monitor it?
- Ask clarifying questions: What is the business problem? What data is available?
When preparing your resume for ML roles, highlight specific models you have built, the metrics you optimized, and the business impact of your work. Mentioning experience with scikit-learn, XGBoost, or deep learning frameworks demonstrates practical ML skills.