A/B testing & Interview Question

Here’s some collection on forum. All the code can be found at my github repository.

is significant level?

What’s P-value?

P-value best explained with example Single sentence explain: The p-value is the probability of observing a value as or more extreme than the one observed under the Null Hypothesis.

What is power?

What is a confidence interval and how do you interpret it?

def get_ci(value, cl, sd):
    loc = stats.norm.ppf(1 - cl/2)
    rng_val = sci.norm.cdf(loc - value/sd)

    lwr_bnd = value - rng_val
    upr_bnd = value + rng_val 

    return_val = (lwr_bnd, upr_bnd)
    return(return_val)

What’s the difference between a MAP, MOM, MLE estima-tor? In which cases would you want to use each?

Special Case

In an A/B test, how can you check if assignment to the various buckets was truly random?

It’s almost impossible to check randomness. We can approach the problem in different way. Design A/A test to check if two groups are coming from same distribution.

Too Many Variants?

I need to aware of the # of variants might have different effect on overall test. First number we need to aware of is “cumulative alpha error” There’s an inequality formula using two idea: n := # of variants

Family-wise error (FWE) $\alpha_{FWE} = 1 -(1-\alpha_{EC})^n$
Bonferroni correction $\alpha_{B} = \alpha_{FWE}/ n$

Actually significant level should between these 2 number. It vary by the dependence of different variants.

What would be the hazards of letting users sneak a peek at the other bucket in an A/B test?

Answer from quora

In an A/B test, how can you check if assignment to the various buckets was truly random?

It’s almost impossible to test randomness on statstic. However we can do another trial. We can try to test equivalence for the A/B two part. That’s some time we call** A/A test**.

Here’s an example to show how A/A test going.

How would you run an A/B test if the observations are extremely right-skewed?

Lower the variability of your metric. Say you’re looking at revenue for a retail site (my experience is from Amazon) or revenue from clicks (my experience is from Bing ads). Both have a large mass at $0 (most people don’t purchase or click on ads), and a long right tail.
Instead of revenue, look at two metrics: an indicator for conversion (yes/no), and the conditional (revenue if purchased; null otherwise).
Each of two will have lower variance, thus give more signal for A/B testing.
Their product is revenue.
Cap values.
Here I would caution you to understand the reasons for the skew.
For example, if you’re looking at time on site, you might find a very right-skewed metrics, but also lumpiness. For example, mass around 1440 minutes may mean users visited your site again the day after about the same time. Assuming that’s not your intention, capping time on site to 30 minutes is a common strategy.
At Amazon, we found significant outliers to revenue and number of items purchased. After looking at the details, it turned out these were libraries or purchasing departments. If one of the treatments gets a couple of more libraries than another, it could skew the results.
Capping at a certain dollar values and items purchased lowered the variance (and improved trustworthiness).
Look at percentile metrics or trimmed means. For example, time-to-X (time-to-onload event for page-load, time-to-click) tends to have a lot right-skew. Instead of the mean, look at some percentile, such as 90th.
U-test

How many sample we need in experiment?

We should believe as many as possible is the answer. In real world problem(or Interview problem), the answer is far more complicate. Apply MCEC principle to answer this, to cover multiple aspect is a good idea.

Design Problem

Here’s real problem I’ve enconterd and failed like idot.

Assume we are the a platform provide the suggestion on Cloth style There are two color we need to decide whtich one is better? That’s say we can only test 1000 time, how you design the experiment??

Desgin Process

Case Design

For a general purpose design, there are many many factors in overall performance. Need to be trim down as simple question so that we can apply the A/B test on it.

Build the funnel
focus one stage into another stage
Find out small change and make hypothesis.
Decide the matrices. Geek Link: this part share similar problem with reinforcement learning.

Metrics

click-through-rate
click-through-probability

Sanity checking metrics evaluation metrics

Distribution

bionomial distribution

Design

How many sample need to be set on sample size and control size?
Ethic issue

Reference

Strongly recommend to read These as source:

10 Statistics Traps in A/B Testing: The Ultimate Guide for Optimizers

What is an A/A Test

The Math behind A/B Testing

DS Question Collection-A/B Test

Failure doesn't defeat you

A/B testing & Interview Question

is significant level?

What’s P-value?

What is power?

What is a confidence interval and how do you interpret it?

What’s the difference between a MAP, MOM, MLE estima-tor? In which cases would you want to use each?

Special Case

In an A/B test, how can you check if assignment to the various buckets was truly random?

Too Many Variants?

What would be the hazards of letting users sneak a peek at the other bucket in an A/B test?

In an A/B test, how can you check if assignment to the various buckets was truly random?

How would you run an A/B test if the observations are extremely right-skewed?

How many sample we need in experiment?

Design Problem

Case Design

Metrics

click-through-probability

Distribution

Design

Ethic issue

Reference

CATALOG

FEATURED TAGS