Experimentation
Learn to run controlled experiments and measure what actually works • 34 papers
A/B Test Design & Analysis
Design experiments that give you clear, trustworthy answers
Controlled Experiments on the Web: Survey and Practical Guide
The foundational paper on web experimentation. Covers hypothesis testing, sample size, metrics, and common pitfalls. Cited more than 2,000 times.
Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
Google's infrastructure for running overlapping experiments, enabling hundreds of simultaneous tests without interference.
Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners
The definitive sample ratio mismatch (SRM) paper, from authors at Microsoft and Booking.com. Provides a taxonomy of causes and 10 diagnostic rules of thumb; reports that roughly 6% of experiments exhibit SRM.
Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
Foundational Microsoft paper on experiment validity; its guidance on designing the Overall Evaluation Criterion (OEC) and its trustworthiness checks are now industry standard.
A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
Essential guide to metric design. Introduces a metric taxonomy (OEC, guardrail, diagnostic) and illustrates each pitfall with real experiment examples.
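The sample ratio mismatch check described in the SRM entry above can be sketched as a chi-square goodness-of-fit test on the arm counts. This is a minimal illustration, not the paper's own procedure: the function name `srm_p_value`, the example counts, and the 0.001 alert threshold are all assumptions chosen for the sketch.

```python
import math

def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts for illustration only.
p_balanced = srm_p_value(50_000, 50_000)
p_skewed = srm_p_value(50_000, 51_500)

# Assumed alert threshold; teams pick their own.
SRM_THRESHOLD = 0.001
flag_skewed = p_skewed < SRM_THRESHOLD
```

A very small p-value says the observed split is unlikely under the configured allocation, which is a signal to investigate the assignment and logging pipeline before reading any metric results.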
Variance Reduction
Get faster experiment results with less data
Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (CUPED)
The CUPED method that uses pre-experiment covariates to dramatically reduce variance in experiment metrics.
Regression Adjustments for Analyzing Randomized Experiments
Shows how regression adjustment in randomized experiments yields valid inference even with heterogeneous treatment effects.
Variance Reduction in Randomized Experiments
Optimal stratification and regression adjustment methods for experimental data with theoretical guarantees.
Machine Learning for Variance Reduction in Online Experiments
Introduces MLRATE, which extends CUPED-style adjustment to ML predictions via cross-fitting; reports 70%+ variance reduction over difference-in-means at Meta.
Adjusting Treatment Effect Estimates by Post-Stratification in Randomized Experiments
Theoretical foundation for post-stratification; shows it is nearly as efficient as blocking, with the difference in variances on the order of 1/n².
Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix
Industry case study comparing stratification, post-stratification, and CUPED at scale; recommends post-assignment techniques.
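The core CUPED adjustment from the entries above fits in a few lines. The sketch below is a hedged illustration on simulated data: the function name `cuped_adjust`, the synthetic `pre` and `post` metrics, and the noise levels are assumptions, not values from any of the papers. It subtracts the part of the outcome that is linearly predictable from a pre-experiment covariate.

```python
import random
import statistics

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x))."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)  # slope of y regressed on x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

# Simulated users: the in-experiment metric is the pre-period metric plus noise.
rng = random.Random(0)
pre = [rng.gauss(10, 2) for _ in range(5000)]
post = [p + rng.gauss(0, 1) for p in pre]

adjusted = cuped_adjust(post, pre)
var_raw = statistics.variance(post)
var_cuped = statistics.variance(adjusted)
```

The adjustment leaves the mean unchanged and shrinks the variance by roughly the squared correlation between the covariate and the outcome. In a real experiment theta would typically be estimated on data pooled across arms, and the adjusted means would then be compared between treatment and control.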
Sequential & Adaptive Testing
Know when to stop an experiment early with confidence
Peeking at A/B Tests: Why It Matters, and What to Do About It
Analyzes the peeking problem in A/B tests and introduces always-valid p-values for sequential testing.
Safe Testing
E-values and safe anytime-valid inference that allows optional stopping while controlling type I error.
A/B Testing with Fat Tails
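Studies how to allocate experimentation resources when the distribution of idea quality is fat-tailed, where a few rare ideas account for most of the gains; argues for running many small experiments to find them.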
A Multiple Testing Procedure for Clinical Trials
Foundational classic (3,200+ citations) that established the O'Brien-Fleming group sequential boundaries on which many modern methods build.
Time-uniform, Nonparametric, Nonasymptotic Confidence Sequences
Modern theoretical foundation for anytime-valid inference; backbone for GrowthBook, Spotify, and other industry tools.
Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing
Rigorous theoretical grounding for Bayesian sequential testing with continuous monitoring; Microsoft's framework.
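The always-valid p-values mentioned in the peeking entry above can be illustrated with a mixture sequential probability ratio test for a normal mean. This is a minimal sketch under strong assumptions, all of them introduced here: a one-sample stream of per-unit differences, a known standard deviation `sigma`, a normal mixing prior with standard deviation `tau`, and a simulated effect of 0.5. The function name is hypothetical.

```python
import math
import random

def always_valid_p_values(observations, sigma=1.0, tau=1.0):
    """Running always-valid p-values for H0: mean == 0, via a normal-mixture SPRT."""
    p_values = []
    p = 1.0
    total = 0.0
    for n, x in enumerate(observations, start=1):
        total += x
        mean = total / n
        denom = sigma ** 2 + n * tau ** 2
        # Log of the mixture likelihood ratio against the null mean of 0.
        log_lr = (0.5 * math.log(sigma ** 2 / denom)
                  + (n ** 2 * tau ** 2 * mean ** 2) / (2 * sigma ** 2 * denom))
        # The p-value is the running minimum of 1 / likelihood ratio, capped at 1.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values

# Simulated stream with a true effect of 0.5 (illustrative only).
rng = random.Random(1)
effect_stream = [rng.gauss(0.5, 1.0) for _ in range(500)]
p_path = always_valid_p_values(effect_stream)
first_rejection = next((i + 1 for i, p in enumerate(p_path) if p < 0.05), None)
```

Because the p-value path never increases, stopping the first time it crosses the significance level keeps the type I error controlled no matter how often the results are checked, which is the property that fixed-horizon p-values lack.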
Interference & Spillovers
Handle experiments where users affect each other
Detecting Network Effects: Randomizing Over Randomized Experiments
Runs cluster-randomized and individually randomized designs side by side and compares their estimates, giving a design-based test for interference in network experiments.
Estimating Peer Effects in Networks with Peer Encouragement Designs
Two-stage randomization design for identifying peer effects in social networks.
Experimentation in Two-Sided Marketplaces
Framework for experiments in marketplaces where treatment of one side affects the other.
Estimating Average Causal Effects Under General Interference
The seminal exposure mapping paper: a unified framework linking experimental design, exposure mappings, and estimands under network interference.
Design and Analysis of Bipartite Experiments Under a Linear Exposure-Response Model
Essential for two-sided marketplaces: introduces the Exposure Reweighted Linear (ERL) estimator for buyer/seller and rider/driver experiment structures.
A Review of Spatial Causal Inference Methods for Environmental and Epidemiological Applications
Comprehensive review of spatial interference methods including geostatistical approaches for location-based spillovers.
Switchback & Geo-Experiments
Run experiments when you can't randomize individual users
Switchback Experiments and Randomized Experiments for Estimating Platform-Level Effects
Design and analysis of switchback experiments for marketplace interventions at Airbnb.
Causal Impact: A New Approach to Estimate Causal Effects
Google's Bayesian structural time-series approach for measuring impact of geo-level interventions.
GeoLift: Open Source Solution for Measuring Incremental Impact
Meta's open-source tool for geo-experiment design and measurement using synthetic control methods.
Design and Analysis of Switchback Experiments
The definitive switchback paper: derives optimal designs under carryover effects via a minimax formulation. Switchback designs are used at Uber, Lyft, and DoorDash.
Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects
Authoritative methodological review from synthetic control's creator; Athey and Imbens called the method 'arguably the most important innovation in the policy evaluation literature in the last 15 years'.
Trimmed Match Design for Randomized Paired Geo Experiments
Addresses power analysis for geo experiments with few heterogeneous regions; foundation for Google's Trimmed Match library.
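The basic mechanics of a switchback experiment, as covered in the switchback entries above, can be sketched as randomizing whole time windows and comparing window-level outcomes. This is a deliberately naive illustration: it assumes no carryover between windows, which is exactly the assumption the design papers relax, and the function names, window count, and outcome values are hypothetical.

```python
import random
from statistics import fmean

def assign_switchback(num_windows, seed=0):
    """Independently assign each time window to treatment (True) or control (False)."""
    rng = random.Random(seed)
    return [rng.random() < 0.5 for _ in range(num_windows)]

def switchback_effect(window_outcomes, assignments):
    """Difference in mean window-level outcome, treated minus control.

    Assumes no carryover: each window's outcome depends only on its own assignment.
    """
    treated = [y for y, t in zip(window_outcomes, assignments) if t]
    control = [y for y, t in zip(window_outcomes, assignments) if not t]
    return fmean(treated) - fmean(control)

# 48 hypothetical windows (for example, two days of hourly windows).
assignments = assign_switchback(48, seed=7)
# Synthetic outcomes with a built-in lift of 2.0 in treated windows.
outcomes = [10.0 + (2.0 if treated else 0.0) for treated in assignments]
estimate = switchback_effect(outcomes, assignments)
```

With real marketplace data, effects from one window usually leak into the next, so window length and the randomization pattern need to account for carryover rather than rely on this simple difference in means.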
Long-run Effects & Surrogates
Predict long-term impact from short-term metrics
Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects
Framework for using short-term outcomes to predict long-term treatment effects in experiments.
Long-term Causal Inference Under Persistent Confounding via Data Combination
Methods for combining experimental and observational data to estimate persistent treatment effects.
Novelty and Primacy: A Long-Term Estimator for Online Experiments
First scalable method to estimate user-learning effects (novelty/primacy) via difference-in-differences across thousands of Microsoft experiments.
Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix
Largest empirical validation of surrogates: 95% consistency between 14-day surrogate predictions and 63-day outcomes across 1,098 test arms.
Estimation of the Proportion of Treatment Effect Explained by a High-Dimensional Surrogate
First rigorous method for high-dimensional surrogates where number of surrogates exceeds sample size; essential for multi-metric settings.