A/B Testing Interview Questions
A/B testing is one of the most practical applications of statistics in the tech industry. Companies from startups to FAANG rely on controlled experiments to make product decisions. If you are interviewing for a data scientist, product analyst, or growth role, expect A/B testing questions. This guide covers the concepts, calculations, and common pitfalls you need to know.
A/B Testing Fundamentals
What is an A/B Test?
An A/B test (also called a split test or randomized controlled experiment) compares two versions of something to determine which performs better. Users are randomly assigned to either the control group (A) or treatment group (B), and we measure a key metric to see if the treatment has an effect.
Interview Question: Walk me through how you would design an A/B test
Strong candidates follow a structured approach:
- Define the hypothesis: What specific change are we testing? What do we expect to happen?
- Choose the primary metric: What is the one metric that defines success? (e.g., conversion rate, revenue per user)
- Identify guardrail metrics: What should NOT get worse? (e.g., page load time, customer support tickets)
- Calculate sample size: How many users do we need to detect a meaningful effect?
- Define the randomization unit: Usually users, but could be sessions, devices, or other units
- Determine test duration: Account for weekly patterns, minimum sample size, and maximum time
- Plan the analysis: What statistical test will you use? What is your significance threshold?
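One way to make this concrete is to write the plan down before launch. Below is a minimal sketch of such a pre-registration record as a Python dataclass; the field names and defaults are illustrative assumptions, not any standard API.
from dataclasses import dataclass, field
@dataclass
class ExperimentPlan:
    # Illustrative structure only; field names are assumptions, not a standard API
    hypothesis: str
    primary_metric: str
    guardrail_metrics: list = field(default_factory=list)
    randomization_unit: str = 'user'
    baseline_rate: float = 0.10
    mde: float = 0.02            # minimum detectable effect (absolute)
    alpha: float = 0.05
    power: float = 0.80
    min_duration_days: int = 14  # cover at least two full weekly cycles
plan = ExperimentPlan(
    hypothesis='Shorter checkout form increases checkout completion',
    primary_metric='checkout_completion_rate',
    guardrail_metrics=['p95_page_load_time', 'support_tickets_per_1k_users'],
)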
Sample Size Calculation
Why Sample Size Matters
Running an A/B test without proper sample size calculation is like guessing. Too small a sample means you might miss real effects (low power). Running longer than necessary wastes time and resources.
The Key Inputs
- Baseline conversion rate: Current performance of the control
- Minimum Detectable Effect (MDE): Smallest improvement worth detecting
- Significance level (alpha): Usually 0.05 (5% false positive rate)
- Statistical power: Usually 0.80 (80% chance of detecting a true effect)
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
# Parameters
baseline_rate = 0.10 # 10% current conversion
mde = 0.02 # Want to detect 2 percentage point lift (to 12%)
alpha = 0.05 # 5% significance level
power = 0.80 # 80% power
# Calculate effect size
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)
# Calculate sample size per group
analysis = NormalIndPower()
sample_size = analysis.solve_power(
effect_size=effect_size,
alpha=alpha,
power=power,
ratio=1.0, # Equal sized groups
alternative='two-sided'
)
print(f'Sample size needed per group: {int(sample_size)}')
# Output: approximately 3,842 per group
Interview Question: How does changing MDE affect sample size?
Smaller MDE (wanting to detect smaller effects) requires LARGER sample sizes. This is intuitive: detecting a subtle difference requires more data. Sample size scales roughly with the inverse square of the MDE, so detecting a 1 percentage point lift instead of a 2 point lift requires roughly 4x the sample size.
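To see the relationship directly, you can rerun the statsmodels calculation above for a few MDE values (same baseline and assumptions):
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
baseline_rate = 0.10
analysis = NormalIndPower()
for mde in [0.04, 0.02, 0.01]:
    effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80,
                             ratio=1.0, alternative='two-sided')
    print(f'MDE = {mde:.2f} -> about {int(n):,} per group')
# Halving the MDE roughly quadruples the required sample size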
Statistical Significance
Interview Question: A test shows p-value of 0.03. What does this mean?
A p-value of 0.03 means: IF there were truly no difference between control and treatment (null hypothesis is true), there would be only a 3% chance of seeing results at least as extreme as what we observed.
It does NOT mean there is a 97% chance the treatment is better. It does NOT mean the effect size is large or practically significant.
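For illustration, here is how the p-value for a two-proportion comparison could be computed with statsmodels; the 120/1000 vs 100/1000 counts are the same example numbers used in the confidence-interval code below.
from statsmodels.stats.proportion import proportions_ztest
# Treatment: 120/1000 conversions, Control: 100/1000
z_stat, p_value = proportions_ztest(count=[120, 100], nobs=[1000, 1000])
print(f'z = {z_stat:.2f}, p = {p_value:.3f}')
# Roughly z = 1.4, p = 0.15 for these counts: a 2 point observed lift
# that is not statistically significant at alpha = 0.05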
Confidence Intervals vs P-values
Confidence intervals are often more useful than p-values because they show the range of plausible effect sizes, not just whether an effect exists.
from statsmodels.stats.proportion import confint_proportions_2indep
# Example: Treatment has 120/1000 conversions, Control has 100/1000
n_treatment, conv_treatment = 1000, 120
n_control, conv_control = 1000, 100
# Calculate 95% CI for the difference
ci = confint_proportions_2indep(
conv_treatment, n_treatment,
conv_control, n_control,
method='wald'
)
print(f'95% CI for difference: ({ci[0]:.3f}, {ci[1]:.3f})')
Common Pitfalls
1. Peeking and Early Stopping
Problem: Checking results daily and stopping when you see significance inflates your false positive rate dramatically.
Solution: Decide sample size upfront and commit to it. If you need to peek, use sequential testing methods that account for multiple looks.
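A small simulation makes the problem concrete. The sketch below assumes an A/A test (no true difference), a naive two-proportion z-test, and a check after every batch of users; the traffic numbers are made up.
import numpy as np
rng = np.random.default_rng(0)
n_sims, n_looks, n_per_look, p = 2000, 10, 500, 0.10
false_positives = 0
for _ in range(n_sims):
    a = rng.binomial(1, p, n_looks * n_per_look)  # control, no true effect
    b = rng.binomial(1, p, n_looks * n_per_look)  # "treatment", identical
    for look in range(1, n_looks + 1):
        n = look * n_per_look
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(a[:n].mean() - b[:n].mean()) / se > 1.96:
            false_positives += 1  # stopped early and "shipped" a null effect
            break
print(f'False positive rate with peeking: {false_positives / n_sims:.1%}')
# A single pre-planned look would land near 5%; peeking after every batch
# typically pushes this to roughly 15-20%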
Interview Question: We ran a test for 3 days and saw p=0.04. Should we launch?
Not necessarily. Ask: Did we reach our planned sample size? 3 days might not capture weekly variation (weekday vs weekend behavior can differ). A barely significant result (p=0.04) could easily flip with more data. Check if the effect size is practically meaningful, not just statistically significant.
2. Multiple Comparisons
Problem: Testing many metrics or segments increases the chance of false positives. With 20 independent metrics at alpha=0.05, you expect about 1 false positive on average even when nothing actually changed.
Solution: Define one primary metric upfront. Use corrections (Bonferroni, Benjamini-Hochberg) for secondary metrics. Pre-register your analysis plan.
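Both corrections mentioned above are available in statsmodels; the p-values below are made-up secondary-metric results used only to show the mechanics.
import numpy as np
from statsmodels.stats.multitest import multipletests
p_values = [0.003, 0.012, 0.021, 0.26, 0.49, 0.73]  # hypothetical metric p-values
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print('Bonferroni significant:', list(np.array(p_values)[reject_bonf]))
print('Benjamini-Hochberg significant:', list(np.array(p_values)[reject_bh]))
# Bonferroni keeps only 0.003 here; Benjamini-Hochberg keeps the first three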
3. Simpson's Paradox
Problem: A treatment can appear better overall but worse in every subgroup (or vice versa), because the two groups have different mixes of high- and low-converting segments.
# Example: Treatment appears better overall
# Control: 100/1000 (10.0%), Treatment: 126/1200 (10.5%)
# But when you split by platform, Control wins in BOTH segments:
# Mobile - Control: 80/400 (20.0%), Treatment: 108/600 (18.0%)
# Desktop - Control: 20/600 (3.3%), Treatment: 18/600 (3.0%)
# Treatment simply had a larger share of mobile users, who convert at higher rates
Solution: Check for balance across important segments. Stratify your analysis if needed.
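Using the illustrative numbers above, a quick pandas check shows how the overall comparison and the per-segment comparisons disagree:
import pandas as pd
data = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment'],
    'platform': ['mobile', 'desktop', 'mobile', 'desktop'],
    'conversions': [80, 20, 108, 18],
    'users': [400, 600, 600, 600],
})
overall = data.groupby('group')[['conversions', 'users']].sum()
print(overall['conversions'] / overall['users'])        # treatment looks better overall
by_segment = data.set_index(['platform', 'group'])
print(by_segment['conversions'] / by_segment['users'])  # control wins in each segment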
4. Network Effects and Interference
Problem: In social products, users in treatment can affect users in control (e.g., messaging features, viral content).
Solution: Use cluster randomization (randomize groups of connected users together) or geographic randomization.
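A common way to implement this is deterministic bucketing: hash the cluster identifier (a geo region, a community of connected users) together with the experiment name, so every user in a cluster lands in the same arm. A minimal sketch, with made-up identifiers:
import hashlib
def assign_variant(cluster_id: str, experiment: str = 'social_feed_test') -> str:
    """Deterministically assign an entire cluster to one variant so that
    connected users share an experience and cross-arm interference is reduced."""
    digest = hashlib.md5(f'{experiment}:{cluster_id}'.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return 'treatment' if bucket < 50 else 'control'
print(assign_variant('geo:seattle'))
print(assign_variant('community:12345'))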
5. Novelty and Primacy Effects
Problem: Users may react differently to something new. Initial lift might fade as novelty wears off, or initial resistance might decrease as users adapt.
Solution: Run tests long enough to see stabilization. Consider holdout groups for long-term measurement.
Interpreting Results
Interview Question: Our test shows 5% lift with p=0.02. Should we ship it?
Before recommending launch, consider:
- Is it practically significant? A 5% lift in a metric that does not matter is worthless
- What do guardrail metrics show? Did anything get worse?
- Is the confidence interval narrow enough? CI of [1%, 9%] is very different from [4.5%, 5.5%]
- Are there segment differences? Does it help some users but hurt others?
- What is the cost of being wrong? Is this reversible?
- Did we reach planned sample size? Were there any data quality issues?
When Results Are Inconclusive
A non-significant result does not prove there is no effect. It means we failed to detect one. Consider:
- Was sample size sufficient? (Check power analysis)
- Was MDE realistic? (Maybe the true effect is smaller than we hoped)
- Was implementation correct? (Check that users actually saw the change)
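One useful follow-up to the power question is to ask what effect your actual sample size could have detected. The sketch below reuses the statsmodels power calculator, solving for effect size instead of sample size; the 5,000-per-group figure is a made-up example.
import numpy as np
from statsmodels.stats.power import NormalIndPower
n_collected = 5000   # hypothetical users per group actually collected
baseline = 0.10
# Smallest Cohen's h detectable with 80% power at this sample size
h = NormalIndPower().solve_power(effect_size=None, nobs1=n_collected,
                                 alpha=0.05, power=0.80, ratio=1.0,
                                 alternative='two-sided')
# Convert h back into an absolute lift on the baseline conversion rate
detectable_rate = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
print(f'Smallest detectable lift: {detectable_rate - baseline:.3f}')
# Roughly 1.7 percentage points for these inputs; if your MDE was 1 point,
# the test was underpowered from the start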
Advanced Topics
Bayesian A/B Testing
Bayesian methods offer an alternative to frequentist testing. Instead of p-values, they give probability distributions over the true effect size.
Advantages: More intuitive interpretation, can incorporate prior knowledge, naturally handles peeking
Disadvantages: Requires choosing priors, computationally heavier, less familiar to many stakeholders
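A minimal Beta-Binomial sketch shows the flavor, reusing the 120/1000 vs 100/1000 counts from the earlier example and uniform Beta(1, 1) priors (an assumption for illustration):
import numpy as np
rng = np.random.default_rng(42)
# Posterior draws for each group's conversion rate (Beta prior + binomial data)
post_treatment = rng.beta(1 + 120, 1 + 880, size=100_000)
post_control = rng.beta(1 + 100, 1 + 900, size=100_000)
lift = post_treatment - post_control
print(f'P(treatment > control) = {(lift > 0).mean():.2f}')   # roughly 0.92 here
print(f'95% credible interval for lift: '
      f'({np.percentile(lift, 2.5):.3f}, {np.percentile(lift, 97.5):.3f})')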
Multi-armed Bandits
Bandits dynamically allocate traffic to better-performing variants during the test. Good for optimizing quickly when exploration cost is high.
Trade-off: Faster optimization but harder to measure precise effect sizes. Best for non-critical decisions where speed matters more than precision.
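A minimal Thompson sampling sketch, with made-up true conversion rates (unknown in a real test), shows how traffic drifts toward the better arm:
import numpy as np
rng = np.random.default_rng(0)
true_rates = np.array([0.10, 0.12])   # unknown in practice
successes = np.zeros(2)
failures = np.zeros(2)
for _ in range(10_000):
    # Draw a plausible rate for each arm from its Beta posterior, play the best draw
    draws = rng.beta(1 + successes, 1 + failures)
    arm = int(np.argmax(draws))
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted
traffic = successes + failures
print('Share of traffic per arm:', traffic / traffic.sum())
# The better arm ends up with most of the traffic, but the weaker arm gets
# too few users for a precise estimate of the true difference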
Interleaving
Used for ranking systems (search, recommendations). Shows items from both variants in a single list and measures which gets more engagement.
Advantage: More sensitive than A/B testing for ranking changes
Limitation: Only works for certain types of changes
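A highly simplified team-draft style sketch (real interleaving implementations handle ties, deduplication, and attribution more carefully):
import random
def interleave(ranking_a, ranking_b, k=6):
    """Simplified team-draft style interleaving: the two rankers take turns,
    in random order each round, contributing their best not-yet-shown item."""
    ranking_a, ranking_b = list(ranking_a), list(ranking_b)
    shown, team_of = [], {}
    while len(shown) < k and (ranking_a or ranking_b):
        teams = [('A', ranking_a), ('B', ranking_b)]
        random.shuffle(teams)
        for team, ranking in teams:
            while ranking and ranking[0] in team_of:
                ranking.pop(0)              # skip items already contributed
            if ranking and len(shown) < k:
                item = ranking.pop(0)
                shown.append(item)
                team_of[item] = team
    return shown, team_of
shown, team_of = interleave(['a', 'b', 'c', 'd'], ['b', 'd', 'e', 'f'])
clicks = shown[:2]                          # pretend the user clicked the top two
credit = {'A': 0, 'B': 0}
for item in clicks:
    credit[team_of[item]] += 1
print(shown, credit)                        # the variant credited with more clicks wins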
Real Interview Questions
Question 1: We want to test a new checkout flow. What metrics would you track?
Primary metric: Checkout completion rate (conversions / users who reached checkout)
Secondary metrics: Revenue per user, average order value, time to complete checkout
Guardrail metrics: Page load time, error rate, customer support contacts, refund rate
Question 2: Our test has been running for 2 weeks but is still not significant. What would you do?
Check if you have reached planned sample size. If not, wait. If yes, the true effect is likely smaller than your MDE. Options: accept that the change is not impactful enough to warrant the work, or decide based on directional results plus qualitative factors if the cost is low.
Question 3: We see a 10% lift in conversion but a 5% drop in revenue per user. What do you recommend?
Calculate net impact. If total revenue increased (lift in conversion outweighs drop in order value), it might still be positive. But investigate why revenue per user dropped. Are we attracting lower-value customers? Are people buying cheaper items? Could this indicate a long-term problem? Consider testing on a segment or running longer to understand the dynamics.
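A back-of-envelope calculation helps frame the answer. Assume the 5% drop is in average order value (revenue per converting user); the baseline numbers here are made up:
baseline_conversion = 0.10        # hypothetical baseline conversion rate
baseline_order_value = 50.0       # hypothetical average order value
new_conversion = baseline_conversion * 1.10     # +10% relative lift
new_order_value = baseline_order_value * 0.95   # -5% relative drop
baseline_rev = baseline_conversion * baseline_order_value
new_rev = new_conversion * new_order_value
print(f'Net change in revenue per visitor: {new_rev / baseline_rev - 1:+.1%}')
# 1.10 * 0.95 - 1 = +4.5%; but if the 5% drop is in revenue per *visitor*,
# total revenue is already down 5% and the conversion lift does not offset it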
Interview Tips
- Think out loud: Show your reasoning process, not just answers
- Ask clarifying questions: What is the business context? What is the cost of wrong decisions?
- Consider trade-offs: There is rarely one right answer in experimentation
- Be practical: Perfect experiments are not always possible; discuss what you would do with constraints
Demonstrating A/B testing expertise on your resume is valuable for product and analytics roles. Highlight specific experiments you designed, sample sizes you calculated, and decisions that resulted from your analysis.