Statistics
Foundational statistical methods for inference and uncertainty quantification • 53 papers
Bootstrap & Resampling Methods
Estimate uncertainty for any statistic without closed-form solutions
Bootstrap Methods: Another Look at the Jackknife
THE paper that created the bootstrap field. Shows that by drawing samples with replacement from observed data, you can estimate the sampling distribution of virtually any statistic—no closed-form solutions required. Its author, Bradley Efron, received the 2018 International Prize in Statistics for this work.
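A minimal NumPy sketch of the core idea (the data and the choice of statistic are illustrative): resample with replacement, recompute the statistic, and read a percentile interval off the resampled values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)      # illustrative skewed sample

def bootstrap_ci(data, stat=np.median, n_boot=5000, alpha=0.05, rng=rng):
    """Percentile bootstrap: resample with replacement, recompute the statistic."""
    n = len(data)
    boot_stats = np.array([
        stat(rng.choice(data, size=n, replace=True)) for _ in range(n_boot)
    ])
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return stat(data), (lo, hi)

print(bootstrap_ci(x))   # median of x with a 95% percentile interval
```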
Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy
The paper that made bootstrap accessible to practitioners. Shows how to apply bootstrap to real problems: bias estimation, prediction error, confidence intervals, time series, regression. Covers bootstrap CIs (percentile, BCa), when bootstrap fails, and practical diagnostics.
A Scalable Bootstrap for Massive Data
The Bag of Little Bootstraps (BLB) solves the fundamental problem that the standard bootstrap requires O(B×n) operations—infeasible for terabyte-scale data. BLB takes small subsamples of size n^0.6, runs a weighted bootstrap within each, and achieves the same statistical efficiency while being trivially parallelizable across Spark/MapReduce.
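A compact NumPy sketch of the BLB recipe above, with illustrative sizes and the mean as the statistic; each subsample loop is independent and would be farmed out to separate workers in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)         # stand-in for a massive dataset
n, b = x.size, int(x.size ** 0.6)                    # subsample size b = n^0.6
s, r = 20, 100                                       # s subsamples, r resamples each

se_estimates = []
for _ in range(s):                                   # embarrassingly parallel in practice
    sub = rng.choice(x, size=b, replace=False)       # small subsample
    stats = []
    for _ in range(r):
        # multinomial weights summing to n: a full-size bootstrap that only touches b points
        w = rng.multinomial(n, np.full(b, 1.0 / b))
        stats.append(np.average(sub, weights=w))     # weighted statistic (mean here)
    se_estimates.append(np.std(stats, ddof=1))       # per-subsample uncertainty estimate

print("BLB standard error of the mean:", np.mean(se_estimates))
```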
Estimating Uncertainty for Massive Data Streams
The Poisson bootstrap replaces multinomial resampling with independent Poisson(1) weights for each observation. Enables single-pass streaming computation where you don't need to know n in advance, and each data shard can be processed independently. Built into Google's production analysis primitives.
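A minimal NumPy sketch of the weighting trick (single shard, mean statistic, both illustrative): each observation gets independent Poisson(1) weights, so shards only need to accumulate weighted sums that are later added together.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=3.0, size=50_000)      # one shard of a larger stream

n_boot = 200
# Poisson(1) weights per observation replace multinomial resampling; no need to know
# the total n in advance, and shards can be processed independently.
w = rng.poisson(lam=1.0, size=(n_boot, x.size))
boot_means = (w * x).sum(axis=1) / w.sum(axis=1)

lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(f"mean = {x.mean():.3f}, 95% CI ≈ ({lo:.3f}, {hi:.3f})")
```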
Bootstrap-Based Improvements for Inference with Clustered Errors
The wild cluster bootstrap handles clustered/panel data (users within cities, sessions within users, days within experiments) and provides valid inference even with few clusters (5-30) where standard cluster-robust SEs severely over-reject. Essential for diff-in-diff, geographic experiments, and switchback designs.
Resampling-Free Bootstrap Inference for Quantiles
A Spotify paper that achieves an 828x speedup for quantile bootstrap by deriving the analytical distribution of bootstrap quantile indices. Enables bootstrap CIs for medians/percentiles on hundreds of millions of observations in milliseconds, and is already deployed in production.
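A rough illustration of the underlying idea, not the paper's exact derivation: under multinomial resampling, the number of bootstrap draws at or below the k-th order statistic is Binomial(n, k/n), so percentile-style bounds for a quantile can be read directly off the order statistics without simulating any resamples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.sort(rng.lognormal(size=100_000))              # illustrative data, sorted once
n, q, alpha = x.size, 0.5, 0.05
m = int(np.ceil(q * n))                               # index of the empirical q-quantile

# P(bootstrap q-quantile <= x_(k)) = P(Binomial(n, k/n) >= m) -- no resampling needed
k = np.arange(1, n + 1)
cdf = stats.binom.sf(m - 1, n, k / n)

lo = x[np.searchsorted(cdf, alpha / 2)]               # smallest k with cdf >= alpha/2
hi = x[np.searchsorted(cdf, 1 - alpha / 2)]
print(f"median = {x[m - 1]:.4f}, 95% bootstrap CI ≈ ({lo:.4f}, {hi:.4f})")
```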
Survival Analysis & Time-to-Event Models
Model censored time-to-event outcomes for churn, LTV, and engagement
Nonparametric Estimation from Incomplete Observations
Introduced the Kaplan-Meier estimator (product-limit estimator), the universal method for estimating survival curves from censored data. Every survival analysis begins here—essential for visualizing retention curves, comparing cohorts, and calculating median time-to-churn.
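A from-scratch sketch of the product-limit estimate on toy retention data (durations and churn flags are made up): at each event time, multiply the running survival probability by (1 − d/n) where d is the number of events and n the number still at risk.

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate: at each event time multiply by (1 - d / n_at_risk)."""
    time, event = np.asarray(time, float), np.asarray(event, bool)
    event_times = np.unique(time[event])
    surv, curve = 1.0, []
    for t in event_times:
        n_at_risk = np.sum(time >= t)            # still under observation just before t
        d = np.sum((time == t) & event)          # events (e.g., churns) at t
        surv *= 1.0 - d / n_at_risk
        curve.append((t, surv))
    return curve

# toy retention data: days observed, 1 = churned, 0 = still active (censored)
days  = [5, 8, 8, 12, 20, 20, 25, 30, 30, 30]
churn = [1, 1, 0,  1,  1,  0,  1,  0,  0,  0]
for t, s in kaplan_meier(days, churn):
    print(f"S({t:>2.0f}) = {s:.3f}")
```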
Regression Models and Life-Tables
Introduced the Cox proportional hazards model, enabling regression analysis on survival data while leaving the baseline hazard unspecified. Ranked 24th among the most-cited scientific papers of all time. The workhorse for identifying churn drivers and estimating treatment effects of retention interventions.
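A minimal sketch assuming the lifelines package; the simulated churn data and column names are illustrative.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
visits = rng.poisson(3, n)
promo = rng.integers(0, 2, n)
# simulate exponential survival times whose hazard depends on the covariates
rate = 0.05 * np.exp(-0.3 * visits - 0.5 * promo)
t = rng.exponential(1 / rate)
c = rng.exponential(40, n)                       # independent censoring times
df = pd.DataFrame({
    "duration": np.minimum(t, c),
    "churned": (t <= c).astype(int),
    "weekly_visits": visits,
    "on_promo": promo,
})

# h(t | x) = h0(t) * exp(x'beta); the baseline hazard h0 stays unspecified
cph = CoxPHFitter().fit(df, duration_col="duration", event_col="churned")
cph.print_summary()                              # hazard ratios exp(beta) per covariate
```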
A Proportional Hazards Model for the Subdistribution of a Competing Risk
Extended Cox regression to competing risks—situations where multiple mutually exclusive event types are possible. Answers 'what's the probability of Event A by time t, given Event B could happen first?' Critical for subscription dynamics where users can churn, upgrade, downgrade, or convert.
Random Survival Forests
Extended random forests to censored survival data, creating a nonparametric, ensemble-based alternative to Cox regression. Captures complex interactions without proportional hazards assumption. The go-to ML approach for churn prediction with high-dimensional feature sets.
DeepSurv: Personalized Treatment Recommender System Using a Cox Proportional Hazards Deep Neural Network
Married deep learning with Cox regression by replacing the linear predictor with a neural network. Includes framework for personalized treatment recommendations—identifying which retention interventions work best for which users. The bridge between survival analysis and modern recommender systems.
Bayesian Hierarchical Models
Pool information across groups with principled uncertainty quantification
Sampling-Based Approaches to Calculating Marginal Densities
Birth of modern Bayesian computation. Demonstrated that the Gibbs sampler could handle a vast range of Bayesian posterior computations by iteratively sampling from full conditional distributions. Before this paper, hierarchical models were limited to conjugate priors. After it, arbitrary model complexity became computationally feasible.
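A toy Gibbs sampler (not from the paper) showing the alternating-conditional mechanic: unknown normal mean and variance under flat/Jeffreys priors, where both full conditionals are easy to draw from.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=10.0, scale=2.0, size=200)         # observed data
n, ybar = y.size, y.mean()

# alternate draws from each full conditional: mu | sigma2, y and sigma2 | mu, y
n_iter, mu, sigma2 = 5000, y.mean(), y.var()
draws = np.empty((n_iter, 2))
for i in range(n_iter):
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))                      # mu | sigma2, y
    sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / np.sum((y - mu) ** 2))  # sigma2 | mu, y
    draws[i] = mu, sigma2

burned = draws[1000:]                                  # discard burn-in
print("posterior mean of mu     :", burned[:, 0].mean())
print("posterior mean of sigma^2:", burned[:, 1].mean())
```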
Data Analysis Using Stein's Estimator and Its Generalizations
Made James-Stein shrinkage practical and intuitive using baseball batting averages. Showed that shrinking individual estimates toward their collective mean reduces MSE by >50% vs raw sample means. Explains why hierarchical models work—borrowing information via shrinkage slashes estimation error for small cells.
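A NumPy sketch of the shrinkage rule on simulated data (the numbers are made up, not the Efron-Morris batting averages): shrink each raw mean toward the grand mean by a factor that grows as the observed spread shrinks relative to the sampling noise.

```python
import numpy as np

rng = np.random.default_rng(5)
k, sigma = 18, 0.07                        # 18 units, known sampling SD (illustrative)
theta = rng.normal(0.26, 0.03, size=k)     # true long-run means

mse_raw, mse_js = [], []
for _ in range(2000):
    x = rng.normal(theta, sigma)                       # noisy observed means
    grand = x.mean()
    shrink = max(0.0, 1.0 - (k - 3) * sigma**2 / np.sum((x - grand) ** 2))
    js = grand + shrink * (x - grand)                  # James-Stein (positive-part) estimate
    mse_raw.append(np.mean((x - theta) ** 2))
    mse_js.append(np.mean((js - theta) ** 2))

print("MSE of raw means  :", round(np.mean(mse_raw), 5))
print("MSE of James-Stein:", round(np.mean(mse_js), 5))
```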
Prior Distributions for Variance Parameters in Hierarchical Models
Identified that conventional inverse-gamma priors for variance parameters often dominate the posterior when group-level sample sizes are small. Introduced half-Cauchy and half-t priors as robust alternatives—now the default in Stan and PyMC. The fix for when hierarchical models return implausible variance estimates.
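A minimal sketch assuming the PyMC library (v4+ API), using the well-known eight-schools numbers: the group-level SD gets a half-Cauchy prior instead of the old inverse-gamma default.

```python
import numpy as np
import pymc as pm

# eight group-level estimates and their standard errors (the classic eight-schools data)
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
se = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=25.0)
    tau = pm.HalfCauchy("tau", beta=25.0)              # weakly informative prior on group SD
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=y.size)
    pm.Normal("obs", mu=theta, sigma=se, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print("posterior mean of tau:", float(idata.posterior["tau"].mean()))
```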
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
NUTS eliminated painful manual tuning that made HMC impractical for non-experts. By automatically determining trajectory lengths, achieved near-optimal efficiency without user intervention. Powers Stan, PyMC, NumPyro—the reason you can fit 50-parameter hierarchical MMMs without tuning anything.
Inferring Causal Impact Using Bayesian Structural Time-Series Models
CausalImpact combines Bayesian structural time-series with synthetic control to estimate counterfactual outcomes when experiments are impossible. Answers 'what would have happened without the intervention?' Google's most-used internal causal inference tool for measuring TV campaigns and regional launches.
Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects
Established modern Bayesian MMM paradigm now implemented in Google's LightweightMMM and Meridian. Key innovations: flexible functional forms for adstock decay and saturation, full Bayesian treatment propagating uncertainty to ROAS estimates. The starting point for marketing budget optimization at scale.
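A compact NumPy sketch of the two transforms—geometric adstock for carryover and a Hill curve for saturation; the decay, half-saturation, and shape values are illustrative and would carry priors in a full Bayesian MMM.

```python
import numpy as np

def geometric_adstock(spend, decay=0.6, max_lag=12):
    """Carryover: today's effective spend is a decaying sum of recent spend."""
    weights = decay ** np.arange(max_lag)
    padded = np.concatenate([np.zeros(max_lag - 1), spend])
    return np.array([
        np.dot(padded[t:t + max_lag][::-1], weights) for t in range(len(spend))
    ])

def hill_saturation(x, half_sat=100.0, shape=1.5):
    """Diminishing returns: response saturates as effective spend grows."""
    return x ** shape / (x ** shape + half_sat ** shape)

weekly_spend = np.array([0, 50, 120, 200, 80, 0, 0, 150, 300, 60], dtype=float)
effective = hill_saturation(geometric_adstock(weekly_spend))
print(np.round(effective, 3))                         # feeds into the regression on sales
```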
Generalized Linear Models
Model count data, overdispersion, excess zeros, and bounded rate outcomes
Generalized Linear Models
THE seminal paper that unified linear, logistic, and Poisson regression under a single framework. Showed any exponential family outcome can be modeled through a link function connecting the mean response to a linear predictor. Introduced iteratively reweighted least squares (IRLS) for maximum-likelihood estimation. Understanding this lets you choose the right GLM family for any outcome type.
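A minimal statsmodels example on simulated counts: the same GLM call handles other outcome types by swapping the family, and the fit itself runs IRLS under the hood.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))          # intercept + two covariates
mu = np.exp(X @ np.array([0.5, 0.3, -0.2]))           # log link: E[y] = exp(X beta)
y = rng.poisson(mu)

# swap Poisson() for Binomial(), Gamma(), Gaussian(), ... within the same framework
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.summary())
```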
Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method
Introduced quasi-likelihood—requiring only mean-variance relationship specification, not full distribution. Enables valid inference when count data has variance exceeding mean (overdispersion). Real behavioral data almost never follows Poisson assumptions; quasi-likelihood gives valid SEs by specifying variance = φ × mean.
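A sketch of the quasi-Poisson idea assuming statsmodels: refit the Poisson GLM but estimate the dispersion φ from the Pearson statistic, which leaves the point estimates unchanged and inflates the standard errors by √φ.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_obs = 2000
X = sm.add_constant(rng.normal(size=(n_obs, 1)))
mu = np.exp(1.0 + 0.4 * X[:, 1])
y = rng.negative_binomial(n=2, p=2 / (2 + mu))        # overdispersed counts: Var > mean

poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")   # Pearson-based phi

print("estimated dispersion phi:", round(float(quasi.scale), 2))
print("Poisson SEs      :", np.round(poisson.bse, 4))
print("quasi-Poisson SEs:", np.round(quasi.bse, 4))
```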
Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing
Developed Zero-Inflated Poisson (ZIP) model for data from a mixture: with probability p, outcome is always zero (structural zeros), otherwise follows Poisson. Essential for engagement metrics—separates 'never-users' from 'not-yet users' when modeling clicks, sessions, or purchases. Implemented in R's pscl package.
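A sketch assuming statsmodels' ZeroInflatedPoisson (an alternative to the R pscl implementation mentioned above), with data simulated for illustration: a logit model for the structural zeros and a Poisson model for the counts.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 3000
x = rng.normal(size=n)
X = sm.add_constant(x)

# simulate the mixture: structural zeros with probability p, otherwise Poisson counts
p_zero = 1 / (1 + np.exp(-(-0.5 + 0.8 * x)))          # "never-user" probability
lam = np.exp(0.7 + 0.3 * x)
y = np.where(rng.random(n) < p_zero, 0, rng.poisson(lam))

zip_model = sm.ZeroInflatedPoisson(y, X, exog_infl=X, inflation="logit")
res = zip_model.fit(method="bfgs", maxiter=500)
print(res.summary())                                   # inflation (logit) and count equations
```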
Beta Regression for Modelling Rates and Proportions
Proposed regression with beta-distributed responses for bounded (0,1) outcomes. Handles natural heteroskedasticity of rate data—variance highest near 0.5, decreasing toward boundaries. Linear regression on rates produces nonsensical predictions outside [0,1]. Essential for CTR, conversion rates, retention rates. Implemented in R's betareg package.
Regression-Based Tests for Overdispersion in the Poisson Model
Developed practical regression-based tests for overdispersion requiring only mean-variance specification. The optimal test reduces to a simple t-test from auxiliary OLS regression. Before fitting negative binomial for overdispersed counts, use this test to formally reject Poisson. Implemented in R's AER package via dispersiontest().
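A from-scratch version of the auxiliary regression (mirroring what AER's dispersiontest does, written here with statsmodels and simulated data): fit the Poisson GLM, then regress z_i = ((y_i − μ̂_i)² − y_i)/μ̂_i on μ̂_i without an intercept and read off the t-statistic.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n_obs = 2000
X = sm.add_constant(rng.normal(size=(n_obs, 1)))
mu = np.exp(1.0 + 0.5 * X[:, 1])
y = rng.negative_binomial(n=3, p=3 / (3 + mu))        # truly overdispersed: Var = mu + mu^2/3

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues

# auxiliary OLS without intercept: tests H0 Var = mu against Var = mu + alpha * mu^2
z = ((y - mu_hat) ** 2 - y) / mu_hat
aux = sm.OLS(z, mu_hat).fit()
print("alpha-hat:", round(float(aux.params[0]), 3), " t-stat:", round(float(aux.tvalues[0]), 2))
# a large positive t-statistic rejects the Poisson mean = variance restriction
```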
Mixed Effects & Multilevel Models
Handle user-level heterogeneity, repeated measures, and hierarchical platform data
Random-Effects Models for Longitudinal Data
Its lead author, Nan Laird, received the 2021 International Prize in Statistics. Unified empirical Bayes and maximum likelihood via the EM algorithm for unbalanced longitudinal data with subject-specific random effects. Solves the 'users have different numbers of sessions' problem—models user heterogeneity while borrowing strength across users through partial pooling.
Recovery of Inter-Block Information when Block Sizes are Unequal
Invented Restricted Maximum Likelihood (REML), which accounts for degrees of freedom lost to estimating fixed effects. Now the default estimation method in virtually every mixed model software. Critical for unbiased variance component estimates, especially important for power analysis in A/B testing.
That BLUP is a Good Thing: The Estimation of Random Effects
Unified BLUP (Best Linear Unbiased Prediction) theory—showing it's the same as Kalman filtering, kriging, and credibility theory. Explains why shrinkage toward the grand mean is optimal: users with little data shrink toward population mean, users with abundant data reflect their own history.
Fitting Linear Mixed-Effects Models Using lme4
One of the most cited statistical papers in history (~75,000 citations). Documents lme4's computational algorithms and formula syntax: (1|user_id) for random intercepts, (treatment|user_id) for random slopes, (1|user_id) + (1|market) for crossed effects. The implementation guide for mixed models in R.
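lme4 is R; a rough Python analogue with statsmodels' MixedLM on simulated data (column names are illustrative): (treatment | user_id) maps to groups= plus re_formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n_users, n_obs = 200, 8
user = np.repeat(np.arange(n_users), n_obs)
treatment = rng.integers(0, 2, size=user.size)
u_int = rng.normal(0, 1.0, n_users)[user]             # random intercept per user
u_slope = rng.normal(0, 0.5, n_users)[user]           # random treatment slope per user
y = 2.0 + u_int + (0.3 + u_slope) * treatment + rng.normal(0, 1.0, user.size)
df = pd.DataFrame({"y": y, "treatment": treatment, "user_id": user})

# roughly lme4's  y ~ treatment + (treatment | user_id)
model = smf.mixedlm("y ~ treatment", df, groups=df["user_id"], re_formula="~treatment")
result = model.fit(reml=True)                          # REML estimation, as in lme4's default
print(result.summary())
```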
On the Pooling of Time Series and Cross Section Data
Bridges econometrics and biostatistics: proves that the fixed-effects estimator coincides with random effects once the group means of the regressors are included as covariates. The 'Mundlak approach' offers a compromise—random effects for efficiency plus cluster means to allow correlation between effects and regressors. Critical for choosing between plm and lmer.
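A small simulated illustration of the Mundlak device, assuming statsmodels' MixedLM and made-up column names: adding the user-level mean of the regressor to a random-effects model recovers the within (fixed-effects) coefficient even though the user effect is correlated with x.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n_users, n_obs = 300, 6
user = np.repeat(np.arange(n_users), n_obs)
ability = rng.normal(0, 1, n_users)[user]             # unobserved user effect
x = 0.5 * ability + rng.normal(0, 1, user.size)       # regressor correlated with the effect
y = 1.0 + 0.8 * x + ability + rng.normal(0, 1, user.size)
df = pd.DataFrame({"y": y, "x": x, "user_id": user})

# Mundlak device: add the user-level mean of x as an extra covariate
df["x_bar"] = df.groupby("user_id")["x"].transform("mean")
mundlak = smf.mixedlm("y ~ x + x_bar", df, groups=df["user_id"]).fit()
print(mundlak.summary())                               # coefficient on x ≈ 0.8 (within effect)
```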
Multiple Testing & False Discovery Rate
Control error rates when testing many hypotheses simultaneously
Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
One of the most-cited statistics papers ever (~100,000 citations). Introduced FDR as alternative to FWER—controls expected proportion of false discoveries among rejections. The BH step-up procedure is now default in experimentation platforms at Google, Netflix, Meta. Essential when testing 500 experiments or 20 metrics per A/B test.
A Simple Sequentially Rejective Multiple Test Procedure
Dominant method for FWER control when you cannot tolerate any false positives. Step-down approach uniformly more powerful than Bonferroni with same guarantee, no dependence assumptions required. The right choice for guardrail metrics (revenue, latency, crash rates) where a single false positive could ship a harmful feature.
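Both of the procedures above are one call away in statsmodels (p-values here are simulated for illustration): BH for FDR on exploratory metrics, Holm for strict FWER on guardrails.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(12)
# 200 tests: 20 real effects with small p-values, 180 true nulls with uniform p-values
pvals = np.concatenate([rng.uniform(0, 0.005, 20), rng.uniform(0, 1, 180)])

rej_bh, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")   # Benjamini-Hochberg
rej_holm, _, _, _ = multipletests(pvals, alpha=0.05, method="holm")   # Holm step-down

print("raw p < 0.05   :", int((pvals < 0.05).sum()))
print("BH rejections  :", int(rej_bh.sum()))
print("Holm rejections:", int(rej_holm.sum()))
```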
A Direct Approach to False Discovery Rates
Introduced q-values—the FDR analogue of p-values. A q-value tells you the minimum FDR threshold at which a test becomes significant. Also introduced π₀ estimation (proportion of true nulls), which can boost power up to 8× compared to BH when many tests are truly non-null.
The Control of the False Discovery Rate in Multiple Testing Under Dependency
Proves BH controls FDR under positive regression dependence (PRDS), covering most real-world cases. For arbitrary dependence, provides the BY correction guaranteeing FDR control under any correlation structure. Essential since A/B test metrics are correlated, user segments overlap, and experimental units cluster.
Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis
Introduced empirical null and local FDR (lfdr). When testing thousands of hypotheses, the theoretical null N(0,1) may be miscalibrated. Estimating null from data corrects for systematic biases. Local FDR assigns each experiment a 'probability of being noise'—essential for prioritizing follow-up in large experimentation portfolios.
Controlling the False Discovery Rate via Knockoffs
FDR control for variable selection in regression—where traditional p-values are unreliable due to correlated predictors. Constructs 'fake' knockoff variables mimicking correlation structure but independent of outcome. Enables FDR-controlled claims about which variables matter, not just whether there's signal. Bridge between multiple testing and high-dimensional regression.
Survey Sampling & Weighted Estimation
Reweight non-random samples for valid population inference
A Generalization of Sampling Without Replacement from a Finite Universe
Introduced the Horvitz-Thompson estimator: Ŷ = Σ(Yᵢ/πᵢ). Works for any probability sampling design by inverse probability weighting. This is the same math underlying propensity score weighting in causal inference—survey statisticians solved IPW in 1952; causal inference borrowed it three decades later.
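A tiny NumPy illustration of the formula with a simulated population: when heavier users are more likely to be sampled, weighting each sampled value by 1/πᵢ recovers the population total while the naive estimate does not.

```python
import numpy as np

rng = np.random.default_rng(13)
N = 10_000
revenue = rng.gamma(2.0, 50.0, size=N)                 # population values (unknown in practice)

# unequal-probability (Poisson) sampling: heavier users are more likely to be included
pi = np.clip(revenue / revenue.sum() * 2000, 0.01, 1.0)
sampled = rng.random(N) < pi
y, p = revenue[sampled], pi[sampled]

ht_total = np.sum(y / p)                               # Horvitz-Thompson: weight by 1/pi_i
naive = y.mean() * N                                   # ignores unequal inclusion -> biased up
print(f"true total  : {revenue.sum():,.0f}")
print(f"HT estimate : {ht_total:,.0f}")
print(f"naive N*ybar: {naive:,.0f}")
```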
Calibration Estimators in Survey Sampling
Unified decades of ad-hoc weighting methods—post-stratification, raking, regression estimation—under a single calibration framework. Weights minimize distance from design weights while matching known population totals. Foundation behind every 'weight to Census' adjustment in commercial panels and platform surveys.
On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection
Proved random probability sampling beats purposive 'representative' selection. Derived optimal allocation formula for stratified sampling: sample proportional to Nₕ × Sₕ (stratum size × stratum SD). Establishes why design-based inference via randomization provides the foundation for valid uncertainty quantification.
Poststratification into Many Categories Using Hierarchical Logistic Regression
Introduced MRP (Multilevel Regression with Poststratification) for small-area estimation when many cells are sparse. Borrows strength across similar cells via hierarchical model, then poststratifies to population proportions. Enables valid estimation for small subgroups from biased opt-in samples—validated by accurate election forecasts from Xbox data that was 93% male.
Doubly Robust Inference with Nonprobability Survey Samples
Doubly robust estimators for convenience samples with unknown selection mechanisms—the default for tech company data. Consistent if either propensity model or outcome model is correctly specified. Bridges survey sampling and causal inference, showing survey propensity weights and causal IPW solve mathematically identical problems.
Extreme Value Theory
Model tail risks, detect anomalies, and quantify rare events
Limiting Forms of the Frequency Distribution of the Largest or Smallest Member of a Sample
Proves that maxima of i.i.d. samples converge to exactly three distribution types—Gumbel (light tails), Fréchet (heavy tails), and Weibull (bounded tails). This 'three types theorem' is the foundation of all EVT. Every anomaly detector, VaR model, and tail risk estimator builds on this elegant result.
Statistical Inference Using Extreme Order Statistics
Introduced GPD (Generalized Pareto Distribution) and Peaks-Over-Threshold methodology. Exceedances beyond any sufficiently high threshold follow GPD regardless of original distribution. POT uses all extreme observations rather than just block maxima, enabling anomaly detection with 10-100x fewer observations. Foundation of SPOT, DSPOT, and all modern streaming EVT.
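A peaks-over-threshold sketch with scipy.stats.genpareto on simulated latencies (threshold choice and data are illustrative): fit the GPD to exceedances over a high threshold, then extrapolate an extreme tail quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(14)
latency = rng.lognormal(mean=3.0, sigma=0.8, size=200_000)    # e.g. request latencies (ms)

# keep exceedances over a high threshold and fit a Generalized Pareto Distribution
u = np.quantile(latency, 0.99)
exceed = latency[latency > u] - u
c, _, scale = stats.genpareto.fit(exceed, floc=0)             # fix location at 0

# tail quantile: P(X > x) = P(X > u) * P(exceedance > x - u | X > u)
p_u = exceed.size / latency.size
q = 1e-5                                                      # target tail probability
x_q = u + stats.genpareto.ppf(1 - q / p_u, c, loc=0, scale=scale)
print(f"threshold u = {u:.1f} ms, estimated 99.999th percentile ≈ {x_q:.1f} ms")
print(f"empirical 99.999th percentile ≈ {np.quantile(latency, 1 - q):.1f} ms")
```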
Models for Exceedances Over High Thresholds
THE implementation guide for POT modeling. Complete toolkit: MLE for GPD parameters, threshold selection via mean residual life plots, diagnostic methods, handling temporal dependence. Every EVT software package (R's extRemes, Python's scipy.stats) implements methods from this paper. Answers: How to choose threshold? How to validate model?
Estimation of Tail-Related Risk Measures for Heteroscedastic Financial Time Series: An Extreme Value Approach
Solved EVT's limitation for time-varying volatility via two-stage GARCH-EVT: filter through GARCH to remove heteroscedasticity, then apply GPD to standardized residuals. First rigorous formulas for conditional VaR and Expected Shortfall satisfying Basel requirements. Same framework applies to tail latency, fraud detection, any metric with volatility clustering.
Anomaly Detection in Streams with Extreme Value Theory
SPOT (Streaming Peaks-Over-Threshold) automatically detects anomalies in real-time with no manual threshold tuning—threshold emerges from EVT theory. DSPOT extends to non-stationary streams with concept drift. O(1) per observation, no distributional assumptions. Applications: DDoS detection, equipment failures, fraud, latency spikes. Open-source Python code included.
Item Response Theory
Model skill assessment, adaptive testing, and ML evaluation
Probabilistic Models for Some Intelligence and Attainment Tests
Introduced the one-parameter logistic (1PL/Rasch) model where response probability depends on person ability minus item difficulty. The 'specific objectivity' property enables comparing persons independent of which items they answered—the theoretical cornerstone of all computer adaptive testing (CAT) and Duolingo-style assessments.
Statistical Theories of Mental Test Scores
The 'bible of test theory'. Birnbaum's chapters introduced 2PL (adding discrimination α) and 3PL (adding guessing γ) models. 2PL identifies which items best differentiate abilities—high-discrimination items worth more for rankings. 3PL handles multiple-choice guessing and random baseline performance. The canonical model family for 50+ years.
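A tiny sketch of the three item response functions discussed in the two entries above (parameter values are illustrative): the 1PL uses only difficulty b, the 2PL adds discrimination a, and the 3PL adds a guessing floor c.

```python
import numpy as np

def irt_prob(theta, b, a=1.0, c=0.0):
    """P(correct | ability theta): 1PL with defaults, 2PL via a, 3PL via c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = 0.5                                  # learner ability on the logit scale
items = [
    {"b": -1.0},                             # easy Rasch/1PL item
    {"b": 0.5, "a": 2.5},                    # highly discriminating 2PL item
    {"b": 1.5, "a": 1.2, "c": 0.25},         # hard 4-option multiple-choice 3PL item
]
for item in items:
    print(item, "->", round(irt_prob(theta, **item), 3))
```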
Machine Learning–Driven Language Assessment
How Duolingo English Test works at scale. Uses ML/NLP to estimate Rasch difficulty from item text—skipping expensive human piloting. Achieves 0.96 internal consistency and r=0.77-0.78 with TOEFL/IELTS. Item exposure drops to 0.10% vs 20% in conventional CAT. Blueprint for building adaptive assessment without human norming.
β³-IRT: A New Item Response Model and its Applications
Extends IRT to continuous responses using Beta-distributed item characteristic curves. Treats ML model evaluation as psychometric problem: each test instance has latent difficulty, each model has latent ability. Identifies which benchmark examples genuinely discriminate strong from weak classifiers—critical for efficient benchmarking.
Building an Evaluation Scale using Item Response Theory
First systematic application of IRT to NLP evaluation. Shows high accuracy ≠ high ability when item difficulty ignored—80% on easy items may indicate lower ability than 70% on hard items. Foundation for IRT-based ML leaderboards accounting for difficulty. Directly applicable to crowdsourcing quality estimation.
Post-Selection Inference
Valid confidence intervals after model selection
Valid Post-Selection Inference
First practical PoSI framework for valid inference after arbitrary model selection. Key insight: treat as simultaneous inference by widening CIs to cover all 2^p possible coefficient estimates across submodels. Conservative but universally valid—works regardless of whether selection used stepwise, lasso, AIC, or informal judgment.
A Significance Test for the Lasso
Breakthrough bringing p-values to lasso regression via covariance test statistic. Under null, test statistic follows Exp(1)—though variables chosen adaptively, lasso shrinkage makes null distribution tractable. Works in high-dimensional settings (p > n). The first principled answer to 'which lasso-selected features are real.'
Exact Post-Selection Inference, with Application to the Lasso
THE foundational methods paper. Introduces polyhedral lemma: lasso selection event = response y falling into polyhedral set (Ay ≤ b). Conditioning yields truncated Gaussian distribution for exact finite-sample CIs accounting for selection. No asymptotics required. Implemented in selectiveInference R package.
Statistical Learning and Selective Inference
Accessible PNAS entry point to the field. Poses the question: 'Having mined data to find potential associations, how do we properly assess their strength?' Illustrates methods for forward stepwise, lasso, PCA with worked examples. Connects selective inference to replication crisis—cherry-picking requires higher significance bar.
Exact Post-Selection Inference for Sequential Regression Procedures
Extends polyhedral framework to forward stepwise and LAR—the most commonly used selection procedures. Proves these produce polyhedral selection events enabling exact conditional inference. Primary methods paper underlying selectiveInference R package: fs(), fsInf(), lar(), larInf(), fixedLassoInf().