A/B Testing Interview Questions
A/B testing is one of the most practical applications of statistics in the tech industry. Companies from startups to FAANG rely on controlled experiments to make product decisions. If you are interviewing for a data scientist, product analyst, or growth role, expect A/B testing questions. This guide covers the concepts, calculations, and common pitfalls you need to know.
A/B Testing Fundamentals
What is an A/B Test?
An A/B test (also called a split test or randomized controlled experiment) compares two versions of something to determine which performs better. Users are randomly assigned to either the control group (A) or treatment group (B), and we measure a key metric to see if the treatment has an effect.
Interview Question: Walk me through how you would design an A/B test
Strong candidates follow a structured approach:
- Define the hypothesis: What specific change are we testing? What do we expect to happen?
- Choose the primary metric: What is the one metric that defines success? (e.g., conversion rate, revenue per user)
- Identify guardrail metrics: What should NOT get worse? (e.g., page load time, customer support tickets)
- Calculate sample size: How many users do we need to detect a meaningful effect?
- Define the randomization unit: Usually users, but could be sessions, devices, or other units
- Determine test duration: Account for weekly patterns, minimum sample size, and maximum time
- Plan the analysis: What statistical test will you use? What is your significance threshold?
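One way to make this concrete is to write the plan down before launch. Below is a minimal sketch of such a pre-registration record as a Python dataclass; the field names and defaults are illustrative assumptions, not any standard API.
from dataclasses import dataclass, field
@dataclass
class ExperimentPlan:
    # Illustrative structure only; field names are assumptions, not a standard API
    hypothesis: str
    primary_metric: str
    guardrail_metrics: list = field(default_factory=list)
    randomization_unit: str = 'user'
    baseline_rate: float = 0.10
    mde: float = 0.02            # minimum detectable effect (absolute)
    alpha: float = 0.05
    power: float = 0.80
    min_duration_days: int = 14  # cover at least two full weekly cycles
plan = ExperimentPlan(
    hypothesis='Shorter checkout form increases checkout completion',
    primary_metric='checkout_completion_rate',
    guardrail_metrics=['p95_page_load_time', 'support_tickets_per_1k_users'],
)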
Sample Size Calculation
Why Sample Size Matters
Running an A/B test without proper sample size calculation is like guessing. Too small a sample means you might miss real effects (low power). Running longer than necessary wastes time and resources.
The Key Inputs
- Baseline conversion rate: Current performance of the control
- Minimum Detectable Effect (MDE): Smallest improvement worth detecting
- Significance level (alpha): Usually 0.05 (5% false positive rate)
- Statistical power: Usually 0.80 (80% chance of detecting a true effect)
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
# Parameters
baseline_rate = 0.10 # 10% current conversion
mde = 0.02 # Want to detect 2 percentage point lift (to 12%)
alpha = 0.05 # 5% significance level
power = 0.80 # 80% power
# Calculate effect size
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)
# Calculate sample size per group
analysis = NormalIndPower()
sample_size = analysis.solve_power(
effect_size=effect_size,
alpha=alpha,
power=power,
ratio=1.0, # Equal sized groups
alternative='two-sided'
)
print(f'Sample size needed per group: {int(sample_size)}')
# Output: approximately 3,842 per group
Interview Question: How does changing MDE affect sample size?
Smaller MDE (wanting to detect smaller effects) requires LARGER sample sizes. This is intuitive: detecting a subtle difference requires more data. Sample size scales roughly with the inverse square of the MDE, so detecting a 1 percentage point lift instead of a 2 point lift requires roughly 4x the sample size.
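To see the relationship directly, you can rerun the statsmodels calculation above for a few MDE values (same baseline and assumptions):
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
baseline_rate = 0.10
analysis = NormalIndPower()
for mde in [0.04, 0.02, 0.01]:
    effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80,
                             ratio=1.0, alternative='two-sided')
    print(f'MDE = {mde:.2f} -> about {int(n):,} per group')
# Halving the MDE roughly quadruples the required sample size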
Statistical Significance
Interview Question: A test shows p-value of 0.03. What does this mean?
A p-value of 0.03 means: IF there were truly no difference between control and treatment (null hypothesis is true), there would be only a 3% chance of seeing results at least as extreme as what we observed.
It does NOT mean there is a 97% chance the treatment is better. It does NOT mean the effect size is large or practically significant.
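For illustration, here is how the p-value for a two-proportion comparison could be computed with statsmodels; the 120/1000 vs 100/1000 counts are the same example numbers used in the confidence-interval code below.
from statsmodels.stats.proportion import proportions_ztest
# Treatment: 120/1000 conversions, Control: 100/1000
z_stat, p_value = proportions_ztest(count=[120, 100], nobs=[1000, 1000])
print(f'z = {z_stat:.2f}, p = {p_value:.3f}')
# Roughly z = 1.4, p = 0.15 for these counts: a 2 point observed lift
# that is not statistically significant at alpha = 0.05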
Confidence Intervals vs P-values
Confidence intervals are often more useful than p-values because they show the range of plausible effect sizes, not just whether an effect exists.
from statsmodels.stats.proportion import confint_proportions_2indep
# Example: Treatment has 120/1000 conversions, Control has 100/1000
n_treatment, conv_treatment = 1000, 120
n_control, conv_control = 1000, 100
# Calculate 95% CI for the difference
ci = confint_proportions_2indep(
conv_treatment, n_treatment,
conv_control, n_control,
method='wald'
)
print(f'95% CI for difference: ({ci[0]:.3f}, {ci[1]:.3f})')
Common Pitfalls
1. Peeking and Early Stopping
Problem: Checking results daily and stopping when you see significance inflates your false positive rate dramatically.
Solution: Decide sample size upfront and commit to it. If you need to peek, use sequential testing methods that account for multiple looks.
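A small simulation makes the problem concrete. The sketch below assumes an A/A test (no true difference), a naive two-proportion z-test, and a check after every batch of users; the traffic numbers are made up.
import numpy as np
rng = np.random.default_rng(0)
n_sims, n_looks, n_per_look, p = 2000, 10, 500, 0.10
false_positives = 0
for _ in range(n_sims):
    a = rng.binomial(1, p, n_looks * n_per_look)  # control, no true effect
    b = rng.binomial(1, p, n_looks * n_per_look)  # "treatment", identical
    for look in range(1, n_looks + 1):
        n = look * n_per_look
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(a[:n].mean() - b[:n].mean()) / se > 1.96:
            false_positives += 1  # stopped early and "shipped" a null effect
            break
print(f'False positive rate with peeking: {false_positives / n_sims:.1%}')
# A single pre-planned look would land near 5%; peeking after every batch
# typically pushes this to roughly 15-20%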
Interview Question: We ran a test for 3 days and saw p=0.04. Should we launch?
Not necessarily. Ask: Did we reach our planned sample size? 3 days might not capture weekly variation (weekday vs weekend behavior can differ). A barely significant result (p=0.04) could easily flip with more data. Check if the effect size is practically meaningful, not just statistically significant.
2. Multiple Comparisons
Problem: Testing many metrics or segments increases the chance of false positives. With 20 independent metrics at alpha=0.05, you expect about 1 false positive on average even when nothing actually changed.
Solution: Define one primary metric upfront. Use corrections (Bonferroni, Benjamini-Hochberg) for secondary metrics. Pre-register your analysis plan.
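Both corrections mentioned above are available in statsmodels; the p-values below are made-up secondary-metric results used only to show the mechanics.
import numpy as np
from statsmodels.stats.multitest import multipletests
p_values = [0.003, 0.012, 0.021, 0.26, 0.49, 0.73]  # hypothetical metric p-values
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print('Bonferroni significant:', list(np.array(p_values)[reject_bonf]))
print('Benjamini-Hochberg significant:', list(np.array(p_values)[reject_bh]))
# Bonferroni keeps only 0.003 here; Benjamini-Hochberg keeps the first three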
3. Simpson's Paradox
Problem: A treatment can appear better overall but worse in every subgroup (or vice versa), because the two groups have different mixes of high- and low-converting segments.
# Example: Treatment appears better overall
# Control: 100/1000 (10.0%), Treatment: 126/1200 (10.5%)
# But when you split by platform, Control wins in BOTH segments:
# Mobile - Control: 80/400 (20.0%), Treatment: 108/600 (18.0%)
# Desktop - Control: 20/600 (3.3%), Treatment: 18/600 (3.0%)
# Treatment simply had a larger share of mobile users, who convert at higher rates
Solution: Check for balance across important segments. Stratify your analysis if needed.
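Using the illustrative numbers above, a quick pandas check shows how the overall comparison and the per-segment comparisons disagree:
import pandas as pd
data = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment'],
    'platform': ['mobile', 'desktop', 'mobile', 'desktop'],
    'conversions': [80, 20, 108, 18],
    'users': [400, 600, 600, 600],
})
overall = data.groupby('group')[['conversions', 'users']].sum()
print(overall['conversions'] / overall['users'])        # treatment looks better overall
by_segment = data.set_index(['platform', 'group'])
print(by_segment['conversions'] / by_segment['users'])  # control wins in each segment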
4. Network Effects and Interference
Problem: In social products, users in treatment can affect users in control (e.g., messaging features, viral content).
Solution: Use cluster randomization (randomize groups of connected users together) or geographic randomization.
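A common way to implement this is deterministic bucketing: hash the cluster identifier (a geo region, a community of connected users) together with the experiment name, so every user in a cluster lands in the same arm. A minimal sketch, with made-up identifiers:
import hashlib
def assign_variant(cluster_id: str, experiment: str = 'social_feed_test') -> str:
    """Deterministically assign an entire cluster to one variant so that
    connected users share an experience and cross-arm interference is reduced."""
    digest = hashlib.md5(f'{experiment}:{cluster_id}'.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return 'treatment' if bucket < 50 else 'control'
print(assign_variant('geo:seattle'))
print(assign_variant('community:12345'))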
5. Novelty and Primacy Effects
Problem: Users may react differently to something new. Initial lift might fade as novelty wears off, or initial resistance might decrease as users adapt.
Solution: Run tests long enough to see stabilization. Consider holdout groups for long-term measurement.
Interpreting Results
Interview Question: Our test shows 5% lift with p=0.02. Should we ship it?
Before recommending launch, consider:
- Is it practically significant? A 5% lift in a metric that does not matter is worthless
- What do guardrail metrics show? Did anything get worse?
- Is the confidence interval narrow enough? CI of [1%, 9%] is very different from [4.5%, 5.5%]
- Are there segment differences? Does it help some users but hurt others?
- What is the cost of being wrong? Is this reversible?
- Did we reach planned sample size? Were there any data quality issues?
When Results Are Inconclusive
A non-significant result does not prove there is no effect. It means we failed to detect one. Consider:
- Was sample size sufficient? (Check power analysis)
- Was MDE realistic? (Maybe the true effect is smaller than we hoped)
- Was implementation correct? (Check that users actually saw the change)
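One useful follow-up to the power question is to ask what effect your actual sample size could have detected. The sketch below reuses the statsmodels power calculator, solving for effect size instead of sample size; the 5,000-per-group figure is a made-up example.
import numpy as np
from statsmodels.stats.power import NormalIndPower
n_collected = 5000   # hypothetical users per group actually collected
baseline = 0.10
# Smallest Cohen's h detectable with 80% power at this sample size
h = NormalIndPower().solve_power(effect_size=None, nobs1=n_collected,
                                 alpha=0.05, power=0.80, ratio=1.0,
                                 alternative='two-sided')
# Convert h back into an absolute lift on the baseline conversion rate
detectable_rate = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
print(f'Smallest detectable lift: {detectable_rate - baseline:.3f}')
# Roughly 1.7 percentage points for these inputs; if your MDE was 1 point,
# the test was underpowered from the start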
Advanced Topics
Bayesian A/B Testing
Bayesian methods offer an alternative to frequentist testing. Instead of p-values, they give probability distributions over the true effect size.
Advantages: More intuitive interpretation, can incorporate prior knowledge, naturally handles peeking
Disadvantages: Requires choosing priors, computationally heavier, less familiar to many stakeholders
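A minimal Beta-Binomial sketch shows the flavor, reusing the 120/1000 vs 100/1000 counts from the earlier example and uniform Beta(1, 1) priors (an assumption for illustration):
import numpy as np
rng = np.random.default_rng(42)
# Posterior draws for each group's conversion rate (Beta prior + binomial data)
post_treatment = rng.beta(1 + 120, 1 + 880, size=100_000)
post_control = rng.beta(1 + 100, 1 + 900, size=100_000)
lift = post_treatment - post_control
print(f'P(treatment > control) = {(lift > 0).mean():.2f}')   # roughly 0.92 here
print(f'95% credible interval for lift: '
      f'({np.percentile(lift, 2.5):.3f}, {np.percentile(lift, 97.5):.3f})')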
Multi-armed Bandits
Bandits dynamically allocate traffic to better-performing variants during the test. Good for optimizing quickly when exploration cost is high.
Trade-off: Faster optimization but harder to measure precise effect sizes. Best for non-critical decisions where speed matters more than precision.
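A minimal Thompson sampling sketch, with made-up true conversion rates (unknown in a real test), shows how traffic drifts toward the better arm:
import numpy as np
rng = np.random.default_rng(0)
true_rates = np.array([0.10, 0.12])   # unknown in practice
successes = np.zeros(2)
failures = np.zeros(2)
for _ in range(10_000):
    # Draw a plausible rate for each arm from its Beta posterior, play the best draw
    draws = rng.beta(1 + successes, 1 + failures)
    arm = int(np.argmax(draws))
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted
traffic = successes + failures
print('Share of traffic per arm:', traffic / traffic.sum())
# The better arm ends up with most of the traffic, but the weaker arm gets
# too few users for a precise estimate of the true difference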
Interleaving
Used for ranking systems (search, recommendations). Shows items from both variants in a single list and measures which gets more engagement.
Advantage: More sensitive than A/B testing for ranking changes
Limitation: Only works for certain types of changes
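A highly simplified team-draft style sketch (real interleaving implementations handle ties, deduplication, and attribution more carefully):
import random
def interleave(ranking_a, ranking_b, k=6):
    """Simplified team-draft style interleaving: the two rankers take turns,
    in random order each round, contributing their best not-yet-shown item."""
    ranking_a, ranking_b = list(ranking_a), list(ranking_b)
    shown, team_of = [], {}
    while len(shown) < k and (ranking_a or ranking_b):
        teams = [('A', ranking_a), ('B', ranking_b)]
        random.shuffle(teams)
        for team, ranking in teams:
            while ranking and ranking[0] in team_of:
                ranking.pop(0)              # skip items already contributed
            if ranking and len(shown) < k:
                item = ranking.pop(0)
                shown.append(item)
                team_of[item] = team
    return shown, team_of
shown, team_of = interleave(['a', 'b', 'c', 'd'], ['b', 'd', 'e', 'f'])
clicks = shown[:2]                          # pretend the user clicked the top two
credit = {'A': 0, 'B': 0}
for item in clicks:
    credit[team_of[item]] += 1
print(shown, credit)                        # the variant credited with more clicks wins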
Real Interview Questions
Question 1: We want to test a new checkout flow. What metrics would you track?
Primary metric: Checkout completion rate (conversions / users who reached checkout)
Secondary metrics: Revenue per user, average order value, time to complete checkout
Guardrail metrics: Page load time, error rate, customer support contacts, refund rate
Question 2: Our test has been running for 2 weeks but is still not significant. What would you do?
Check if you have reached planned sample size. If not, wait. If yes, the true effect is likely smaller than your MDE. Options: accept that the change is not impactful enough to warrant the work, or decide based on directional results plus qualitative factors if the cost is low.
Question 3: We see a 10% lift in conversion but a 5% drop in revenue per user. What do you recommend?
Calculate net impact. If total revenue increased (lift in conversion outweighs drop in order value), it might still be positive. But investigate why revenue per user dropped. Are we attracting lower-value customers? Are people buying cheaper items? Could this indicate a long-term problem? Consider testing on a segment or running longer to understand the dynamics.
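A back-of-envelope calculation helps frame the answer. Assume the 5% drop is in average order value (revenue per converting user); the baseline numbers here are made up:
baseline_conversion = 0.10        # hypothetical baseline conversion rate
baseline_order_value = 50.0       # hypothetical average order value
new_conversion = baseline_conversion * 1.10     # +10% relative lift
new_order_value = baseline_order_value * 0.95   # -5% relative drop
baseline_rev = baseline_conversion * baseline_order_value
new_rev = new_conversion * new_order_value
print(f'Net change in revenue per visitor: {new_rev / baseline_rev - 1:+.1%}')
# 1.10 * 0.95 - 1 = +4.5%; but if the 5% drop is in revenue per *visitor*,
# total revenue is already down 5% and the conversion lift does not offset it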
Interview Tips
- Think out loud: Show your reasoning process, not just answers
- Ask clarifying questions: What is the business context? What is the cost of wrong decisions?
- Consider trade-offs: There is rarely one right answer in experimentation
- Be practical: Perfect experiments are not always possible; discuss what you would do with constraints
Demonstrating A/B testing expertise on your resume is valuable for product and analytics roles. Highlight specific experiments you designed, sample sizes you calculated, and decisions that resulted from your analysis.