Experimentation

Learn to run controlled experiments and measure what actually works • 34 papers

6 subtopics

A/B Test Design & Analysis

Design experiments that give you clear, trustworthy answers

2009 668 cited

Controlled Experiments on the Web: Survey and Practical Guide

Ron Kohavi, Roger Longbotham, Dan Sommerfield, Randal M. Henne

The foundational paper on web experimentation. Covers hypothesis testing, sample size, metrics, and common pitfalls. 2000+ citations.
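
As a rough illustration of the sample-size topic this paper covers, here is a minimal sketch of the standard two-sample power calculation. The function name and defaults (two-sided alpha of 0.05, 80% power, equal arm sizes, known standard deviation) are assumptions for the example, not taken from the paper. With those defaults the result is close to the familiar rule of thumb n ≈ 16σ²/Δ² per arm.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Users needed in each arm to detect an absolute difference `delta`.

    Assumes a two-sided z-test, equal arm sizes, and a known standard
    deviation `sigma` of the metric (illustrative assumptions).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


n = sample_size_per_arm(sigma=1.0, delta=0.1)
print(n)
```

Halving the detectable difference roughly quadruples the required sample, which is why small effects are expensive to measure.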

2010 316 cited

Overlapping Experiment Infrastructure: More, Better, Faster Experimentation

Diane Tang, Ashish Agarwal, Deirdre O'Brien, Mike Meyer

Google's infrastructure for running overlapping experiments, enabling hundreds of simultaneous tests without interference.

2019 25 cited

Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners

Aleksander Fabijan, Jayant Gupchup, Somit Gupta, Jeff Omhover, Wen Qin, Lukas Vermeer, Pavel Dmitriev

The definitive SRM paper from Microsoft/Booking.com—provides taxonomy of causes and 10 diagnostic rules; ~6% of experiments exhibit SRM.
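
A sample ratio mismatch check is usually a chi-square goodness-of-fit test on the observed arm counts. The sketch below is a minimal two-arm version; the function name and the 0.001 alert threshold are assumptions for illustration, not rules from the paper.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm experiment."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


p = srm_p_value(n_control=50_000, n_treatment=51_200)
# Hypothetical alert threshold; teams choose their own.
print(p, "possible SRM" if p < 0.001 else "no SRM flagged")
```

A very small p-value says the split itself is unlikely under the intended allocation, so the experiment's results should not be trusted until the cause is found.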

2012 223 cited

Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained

Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, Ya Xu

Foundational Microsoft paper on experiment validity; its OEC design principles and trustworthiness checks are now industry standard.

2017 94 cited

A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments

Pavel Dmitriev, Somit Gupta, Dong Woo Kim, Garnet Vaz

Essential guide to metric design—introduces metric taxonomy (OEC, guardrail, diagnostic) with real experiment examples.

Variance Reduction

Get faster experiment results with less data

2013 211 cited

Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (CUPED)

Alex Deng, Ya Xu, Ron Kohavi, Toby Walker

The CUPED method that uses pre-experiment covariates to dramatically reduce variance in experiment metrics.
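
The core CUPED adjustment can be sketched in a few lines: subtract from each user's metric the part explained by a pre-experiment covariate. The function name and the synthetic data below are assumptions for illustration. In practice theta is estimated on pooled data from both arms so the adjustment does not bias the treatment effect.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y)
                 for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Synthetic example: the in-experiment metric tracks the pre-period metric.
rng = random.Random(0)
pre = [rng.gauss(10, 2) for _ in range(2000)]
post = [xi + rng.gauss(0, 1) for xi in pre]
adjusted = cuped_adjust(post, pre)
print(statistics.variance(post), statistics.variance(adjusted))
```

The adjusted metric keeps the same mean but has lower variance; the reduction grows with the correlation between the covariate and the metric.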

2013

Regression Adjustments for Analyzing Randomized Experiments

Winston Lin

Shows how regression adjustment in randomized experiments yields valid inference even with heterogeneous treatment effects.

2016 2 cited

Variance Reduction in Randomized Experiments

Susan Athey, Guido Imbens

Optimal stratification and regression adjustment methods for experimental data with theoretical guarantees.

2021 9 cited

Machine Learning for Variance Reduction in Online Experiments

Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, Matt Goldman

Introduces MLRATE—extends CUPED to ML models via cross-fitting; achieved 70%+ variance reduction over difference-in-means at Meta.

2013 162 cited

Adjusting Treatment Effect Estimates by Post-Stratification in Randomized Experiments

Luke Miratrix, Jasjeet Sekhon, Bin Yu

Theoretical foundation for post-stratification; proves it's nearly as efficient as blocking with variance difference O(1/n²).
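
A minimal sketch of a post-stratified estimate: compute the treatment-control difference within each stratum, then weight by the stratum's share of all units. The row layout and function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import fmean


def post_stratified_effect(rows):
    """rows: iterable of (stratum, treated, outcome) tuples."""
    by_stratum = defaultdict(lambda: {True: [], False: []})
    for stratum, treated, outcome in rows:
        by_stratum[stratum][bool(treated)].append(outcome)

    n_total = sum(len(g[True]) + len(g[False]) for g in by_stratum.values())
    estimate = 0.0
    for groups in by_stratum.values():
        # fmean raises if a stratum has no treated or no control units.
        n_k = len(groups[True]) + len(groups[False])
        diff_k = fmean(groups[True]) - fmean(groups[False])
        estimate += (n_k / n_total) * diff_k
    return estimate


rows = [
    ("a", True, 3.0), ("a", True, 5.0), ("a", False, 1.0), ("a", False, 1.0),
    ("b", True, 10.0), ("b", False, 8.0), ("b", False, 8.0),
]
effect = post_stratified_effect(rows)
print(effect)
```

Here stratum "a" contributes a difference of 3 with weight 4/7 and stratum "b" a difference of 2 with weight 3/7.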

2016 91 cited

Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix

Huizhi Xie, Juliette Aurisset

Industry case study comparing stratification, post-stratification, and CUPED at scale; recommends post-assignment techniques (post-stratification, CUPED) over stratified sampling at assignment.

Sequential & Adaptive Testing

Know when to stop an experiment early with confidence

2017

Peeking at A/B Tests: Why It Matters, and What to Do About It

Ramesh Johari, Pete Koomen, Leonid Pekelis, David Walsh

Analyzes the peeking problem in A/B tests and introduces always-valid p-values for sequential testing.
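
The peeking problem is easy to reproduce in simulation. The sketch below runs A/A tests (no true effect) and compares the false positive rate when stopping at the first significant interim look against a single test at the planned end. It only illustrates the problem; it does not implement always-valid p-values. All names and parameters are assumptions for the example.

```python
import math
import random


def run_aa_test(rng, n_per_arm=1000, look_every=100, z_crit=1.96):
    """Simulate one A/A test with unit-variance outcomes and periodic looks."""
    diff_sum = 0.0
    rejected_at_some_look = False
    rejected_at_final_look = False
    for i in range(1, n_per_arm + 1):
        diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
        if i % look_every == 0:
            # Sum of i differences has variance 2 * i under the null.
            z = diff_sum / math.sqrt(2 * i)
            rejected_now = abs(z) > z_crit
            rejected_at_some_look = rejected_at_some_look or rejected_now
            rejected_at_final_look = rejected_now
    return rejected_at_some_look, rejected_at_final_look


def false_positive_rates(n_sims=500, seed=0):
    rng = random.Random(seed)
    results = [run_aa_test(rng) for _ in range(n_sims)]
    peeking = sum(r[0] for r in results) / n_sims
    fixed = sum(r[1] for r in results) / n_sims
    return peeking, fixed


peeking_rate, fixed_rate = false_positive_rates()
print(peeking_rate, fixed_rate)
```

With ten looks, the "stop at first significance" rate comes out well above the nominal 5%, while the single final test stays near it.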

2019 46 cited

Safe Testing

Peter Grünwald, Rianne de Heide, Wouter Koolen

E-values and safe anytime-valid inference that allows optional stopping while controlling type I error.

2019 50 cited

A/B Testing with Fat Tails

Eduardo Azevedo, Alex Deng, José Montiel Olea, Justin Rao, E. Glen Weyl

Shows that when the distribution of treatment effects is fat-tailed, running many lean experiments to find rare big winners can beat running a few high-powered ones.

1979 3274 cited

A Multiple Testing Procedure for Clinical Trials

Peter O'Brien, Thomas Fleming

Foundational classic (3,200+ citations) that established the group sequential boundaries many modern methods build on.
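
O'Brien-Fleming boundaries are strict at early looks and relax toward the usual critical value at the final look: at look k of K, reject if |Z_k| exceeds c·sqrt(K/k). The sketch below assumes equally spaced looks and a two-sided alpha of 0.05; the constants are the values commonly tabulated in group sequential references and should be checked against a trusted table or library before use.

```python
import math

# Commonly tabulated constants for two-sided alpha = 0.05 and K equally
# spaced looks (assumed here; verify before relying on them).
OBF_CONSTANTS = {1: 1.960, 2: 1.977, 3: 2.004, 4: 2.024, 5: 2.040}


def obrien_fleming_boundaries(num_looks):
    """Critical |z| values for looks 1..num_looks."""
    c = OBF_CONSTANTS[num_looks]
    return [c * math.sqrt(num_looks / k) for k in range(1, num_looks + 1)]


bounds = obrien_fleming_boundaries(5)
print([round(b, 2) for b in bounds])
```

The first look demands overwhelming evidence, so early stopping is rare unless the effect is large, and little power is lost at the final analysis.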

2021 82 cited

Time-uniform, Nonparametric, Nonasymptotic Confidence Sequences

Steven Howard, Aaditya Ramdas, Jon McAuliffe, Jasjeet Sekhon

Modern theoretical foundation for anytime-valid inference; backbone for GrowthBook, Spotify, and other industry tools.

2016 60 cited

Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing

Alex Deng, Jiannan Lu, Shouyuan Chen

Rigorous theoretical grounding for Bayesian sequential testing with continuous monitoring; Microsoft's framework.

Interference & Spillovers

Handle experiments where users affect each other

2016

Detecting Network Effects: Randomizing Over Randomized Experiments

Dean Eckles, Brian Karrer, Johan Ugander

Graph cluster randomization and design-based methods for detecting interference in network experiments.

2016 119 cited

Estimating Peer Effects in Networks with Peer Encouragement Designs

Dean Eckles, René Kizilcec, Eytan Bakshy

Two-stage randomization design for identifying peer effects in social networks.

2022 1209 cited

Experimentation in Two-Sided Marketplaces

Ramesh Johari, Hannah Li, Inessa Liskovich, Gabriel Weintraub

Framework for experiments in marketplaces where treatment of one side affects the other.

2017 41 cited

Estimating Average Causal Effects Under General Interference

Peter Aronow, Cyrus Samii

The seminal exposure mapping paper—unified framework for design, exposure mapping, and estimands under network interference.

2023 12 cited

Design and Analysis of Bipartite Experiments Under a Linear Exposure-Response Model

Christopher Harshaw, Fredrik Sävje, David Eisenstat, Vahab Mirrokni, Jean Pouget-Abadie

Essential for two-sided marketplaces: introduces the Exposure Reweighted Linear (ERL) estimator for buyer/seller and rider/driver experiment structures.

2021 11 cited

A Review of Spatial Causal Inference Methods for Environmental and Epidemiological Applications

Brian Reich, Shu Yang, Yawen Guan, Andrew Giffin, Matthew Miller, Ana Rappold

Comprehensive review of spatial interference methods including geostatistical approaches for location-based spillovers.

Switchback & Geo-Experiments

Run experiments when you can't randomize individual users

2020

Switchback Experiments and Randomized Experiments for Estimating Platform-Level Effects

David Holtz, Ruben Lobel, Inessa Liskovich, Sinan Aral

Design and analysis of switchback experiments for marketplace interventions at Airbnb.

2015

Causal Impact: A New Approach to Estimate Causal Effects

Kay Brodersen, Fabian Gallusser, Jim Koehler, et al.

Google's Bayesian structural time-series approach for measuring impact of geo-level interventions.

2022

GeoLift: Open Source Solution for Measuring Incremental Impact

Arturo Esquerra, Nicolas Besasie

Meta's open-source tool for geo-experiment design and measurement using synthetic control methods.

2023 7 cited

Design and Analysis of Switchback Experiments

Iavor Bojinov, David Simchi-Levi, Jinglong Zhao

The definitive switchback paper—optimal design under carryover effects with minimax formulation; used by Uber, Lyft, DoorDash.

2021 1186 cited

Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects

Alberto Abadie

Authoritative methodological review from synthetic control's creator; Athey and Imbens called the method 'arguably the most important innovation in the policy evaluation literature in the last 15 years'.

2021 1 cited

Trimmed Match Design for Randomized Paired Geo Experiments

Aiyou Chen, Marco Longfils, Nicolas Remy

Addresses power analysis for geo experiments with few heterogeneous regions; foundation for Google's Trimmed Match library.

Long-run Effects & Surrogates

Predict long-term impact from short-term metrics

2019 109 cited

Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects

Susan Athey, Raj Chetty, Guido Imbens, Hyunseung Kang

Framework for using short-term outcomes to predict long-term treatment effects in experiments.

2022 10 cited

Long-term Causal Inference Under Persistent Confounding via Data Combination

Guido Imbens, Nathan Kallus, Xiaojie Mao, Yuhao Wang

Methods for combining experimental and observational data to estimate persistent treatment effects.

2022 16 cited

Novelty and Primacy: A Long-Term Estimator for Online Experiments

Soheil Sadeghi, Somit Gupta, Anca Gramatovici, Jiannan Lu, Meng Ai, Mia Zhang

First scalable method to estimate user-learning effects (novelty/primacy) via difference-in-differences across thousands of Microsoft experiments.

2023

Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix

Di Zhang, Sophia Zhao, Maria Dimakopoulou, Diarle Le, Nathan Kallus

Largest empirical validation of surrogates—95% consistency between 14-day surrogate predictions and 63-day outcomes across 1,098 test arms.

2022 1 cited

Estimation of the Proportion of Treatment Effect Explained by a High-Dimensional Surrogate

Xuan Zhou, Xin Zhao, Layla Parast

First rigorous method for high-dimensional surrogates, where the number of surrogates exceeds the sample size; essential for multi-metric settings.

Must-read papers for tech economists and applied researchers