Experimentation infrastructure for recommendation systems: user assignment to control and treatment groups, metric tracking for business impact (CTR, conversion, revenue per user, engagement), statistical significance testing, and reporting dashboards. A/B testing that tells you whether your recommendations are actually driving the outcomes you care about, not just whether they look different.
Holdout groups are assigned at the user level, with user IDs hashed consistently so that each user stays in the same bucket across sessions. Online metrics tracked per experiment include click-through rate at position k, add-to-cart rate, conversion rate, revenue per impression, and session depth. Offline evaluation during development uses precision@10 and NDCG (normalised discounted cumulative gain) measured on a time-based held-out split rather than a random split, because a random split leaks future signal into training. Bandit algorithms (epsilon-greedy, Thompson sampling) are available for recommendation contexts where a hard A/B split spends too much traffic on clearly inferior variants. Experiment duration is calculated from expected traffic volume and minimum detectable effect before any test is launched, so you know upfront whether the test can reach statistical significance before your product cycle closes.
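To make two of the mechanics above concrete, here is a minimal sketch of consistent user-level bucketing and an upfront duration estimate. It is an illustration under stated assumptions, not the production implementation: the function names (`assign_bucket`, `users_per_arm`, `days_needed`), the SHA-256 salt scheme, and the defaults (5% significance level, 80% power, a two-sided two-proportion z-test on a conversion-style metric) are all hypothetical choices made for the example.

```python
import hashlib
import math
from statistics import NormalDist


def assign_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    The same (experiment, user_id) pair always hashes to the same point,
    so a user keeps their bucket across sessions without stored state.
    """
    # Salting with the experiment name (an assumed convention) keeps
    # assignments uncorrelated between different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if point < treatment_share else "control"


def users_per_arm(baseline_rate: float, relative_mde: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect a relative lift.

    Uses the standard normal-approximation formula for comparing two
    proportions with a two-sided test.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_needed(n_per_arm: int, daily_users: int, arms: int = 2) -> int:
    """Days of traffic required, assuming all eligible daily users are enrolled."""
    return math.ceil(n_per_arm * arms / daily_users)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline conversion, 10% relative MDE,
    # 5,000 eligible users per day.
    n = users_per_arm(baseline_rate=0.05, relative_mde=0.10)
    print(assign_bucket("user-123", "homepage-recs-v2"))
    print(n, "users per arm,", days_needed(n, daily_users=5000), "days")
```

Comparing the day count from `days_needed` with the time left in the product cycle gives the upfront go/no-go check described above. Hashing rather than storing assignments is a design choice that keeps bucketing stateless; the trade-off is that changing the salt or the treatment share reshuffles users.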