Statistics
Foundational statistical methods for inference and uncertainty quantification • 53 papers
Bootstrap & Resampling Methods
Estimate uncertainty for any statistic without closed-form solutions
Bootstrap Methods: Another Look at the Jackknife
THE paper that created the bootstrap field. Shows that by drawing samples with replacement from observed data, you can estimate the sampling distribution of virtually any statistic—no closed-form solutions required. Its author, Bradley Efron, received the 2018 International Prize in Statistics for this work.
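A minimal NumPy sketch of the core idea (the data and the choice of statistic are illustrative): resample with replacement, recompute the statistic, and read a percentile interval off the resampled values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)      # illustrative skewed sample

def bootstrap_ci(data, stat=np.median, n_boot=5000, alpha=0.05, rng=rng):
    """Percentile bootstrap: resample with replacement, recompute the statistic."""
    n = len(data)
    boot_stats = np.array([
        stat(rng.choice(data, size=n, replace=True)) for _ in range(n_boot)
    ])
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return stat(data), (lo, hi)

print(bootstrap_ci(x))   # median of x with a 95% percentile interval
```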
Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy
The paper that made bootstrap accessible to practitioners. Shows how to apply bootstrap to real problems: bias estimation, prediction error, confidence intervals, time series, regression. Covers bootstrap CIs (percentile, BCa), when bootstrap fails, and practical diagnostics.
A Scalable Bootstrap for Massive Data
The Bag of Little Bootstraps (BLB) solves the fundamental problem that the standard bootstrap requires O(B×n) operations—infeasible for terabyte-scale data. BLB takes small subsamples of size n^0.6, runs a weighted bootstrap within each, and achieves the same statistical efficiency while being trivially parallelizable across Spark/MapReduce.
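A compact NumPy sketch of the BLB recipe above, with illustrative sizes and the mean as the statistic; each subsample loop is independent and would be farmed out to separate workers in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)         # stand-in for a massive dataset
n, b = x.size, int(x.size ** 0.6)                    # subsample size b = n^0.6
s, r = 20, 100                                       # s subsamples, r resamples each

se_estimates = []
for _ in range(s):                                   # embarrassingly parallel in practice
    sub = rng.choice(x, size=b, replace=False)       # small subsample
    stats = []
    for _ in range(r):
        # multinomial weights summing to n: a full-size bootstrap that only touches b points
        w = rng.multinomial(n, np.full(b, 1.0 / b))
        stats.append(np.average(sub, weights=w))     # weighted statistic (mean here)
    se_estimates.append(np.std(stats, ddof=1))       # per-subsample uncertainty estimate

print("BLB standard error of the mean:", np.mean(se_estimates))
```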
Estimating Uncertainty for Massive Data Streams
The Poisson bootstrap replaces multinomial resampling with independent Poisson(1) weights for each observation. Enables single-pass streaming computation where you don't need to know n in advance, and each data shard can be processed independently. Built into Google's production analysis primitives.
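A minimal NumPy sketch of the weighting trick (single shard, mean statistic, both illustrative): each observation gets independent Poisson(1) weights, so shards only need to accumulate weighted sums that are later added together.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=3.0, size=50_000)      # one shard of a larger stream

n_boot = 200
# Poisson(1) weights per observation replace multinomial resampling; no need to know
# the total n in advance, and shards can be processed independently.
w = rng.poisson(lam=1.0, size=(n_boot, x.size))
boot_means = (w * x).sum(axis=1) / w.sum(axis=1)

lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(f"mean = {x.mean():.3f}, 95% CI ≈ ({lo:.3f}, {hi:.3f})")
```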
Bootstrap-Based Improvements for Inference with Clustered Errors
The wild cluster bootstrap handles clustered/panel data (users within cities, sessions within users, days within experiments) and provides valid inference even with few clusters (5-30) where standard cluster-robust SEs severely over-reject. Essential for diff-in-diff, geographic experiments, and switchback designs.
Resampling-Free Bootstrap Inference for Quantiles
A Spotify paper that achieves an 828x speedup for quantile bootstrap by deriving the analytical distribution of bootstrap quantile indices. Enables bootstrap CIs for medians/percentiles on hundreds of millions of observations in milliseconds, and is already deployed in production.
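A rough illustration of the underlying idea, not the paper's exact derivation: under multinomial resampling, the number of bootstrap draws at or below the k-th order statistic is Binomial(n, k/n), so percentile-style bounds for a quantile can be read directly off the order statistics without simulating any resamples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.sort(rng.lognormal(size=100_000))              # illustrative data, sorted once
n, q, alpha = x.size, 0.5, 0.05
m = int(np.ceil(q * n))                               # index of the empirical q-quantile

# P(bootstrap q-quantile <= x_(k)) = P(Binomial(n, k/n) >= m) -- no resampling needed
k = np.arange(1, n + 1)
cdf = stats.binom.sf(m - 1, n, k / n)

lo = x[np.searchsorted(cdf, alpha / 2)]               # smallest k with cdf >= alpha/2
hi = x[np.searchsorted(cdf, 1 - alpha / 2)]
print(f"median = {x[m - 1]:.4f}, 95% bootstrap CI ≈ ({lo:.4f}, {hi:.4f})")
```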
Survival Analysis & Time-to-Event Models
Model censored time-to-event outcomes for churn, LTV, and engagement
Nonparametric Estimation from Incomplete Observations
Introduced the Kaplan-Meier estimator (product-limit estimator), the universal method for estimating survival curves from censored data. Every survival analysis begins here—essential for visualizing retention curves, comparing cohorts, and calculating median time-to-churn.
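A from-scratch sketch of the product-limit estimate on toy retention data (durations and churn flags are made up): at each event time, multiply the running survival probability by (1 − d/n) where d is the number of events and n the number still at risk.

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate: at each event time multiply by (1 - d / n_at_risk)."""
    time, event = np.asarray(time, float), np.asarray(event, bool)
    event_times = np.unique(time[event])
    surv, curve = 1.0, []
    for t in event_times:
        n_at_risk = np.sum(time >= t)            # still under observation just before t
        d = np.sum((time == t) & event)          # events (e.g., churns) at t
        surv *= 1.0 - d / n_at_risk
        curve.append((t, surv))
    return curve

# toy retention data: days observed, 1 = churned, 0 = still active (censored)
days  = [5, 8, 8, 12, 20, 20, 25, 30, 30, 30]
churn = [1, 1, 0,  1,  1,  0,  1,  0,  0,  0]
for t, s in kaplan_meier(days, churn):
    print(f"S({t:>2.0f}) = {s:.3f}")
```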
Regression Models and Life-Tables
Introduced the Cox proportional hazards model, enabling regression analysis on survival data while leaving the baseline hazard unspecified. Ranked 24th among the most-cited scientific papers of all time. The workhorse for identifying churn drivers and estimating treatment effects of retention interventions.
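A minimal sketch assuming the lifelines package; the simulated churn data and column names are illustrative.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
visits = rng.poisson(3, n)
promo = rng.integers(0, 2, n)
# simulate exponential survival times whose hazard depends on the covariates
rate = 0.05 * np.exp(-0.3 * visits - 0.5 * promo)
t = rng.exponential(1 / rate)
c = rng.exponential(40, n)                       # independent censoring times
df = pd.DataFrame({
    "duration": np.minimum(t, c),
    "churned": (t <= c).astype(int),
    "weekly_visits": visits,
    "on_promo": promo,
})

# h(t | x) = h0(t) * exp(x'beta); the baseline hazard h0 stays unspecified
cph = CoxPHFitter().fit(df, duration_col="duration", event_col="churned")
cph.print_summary()                              # hazard ratios exp(beta) per covariate
```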
A Proportional Hazards Model for the Subdistribution of a Competing Risk
Extended Cox regression to competing risks—situations where multiple mutually exclusive event types are possible. Answers 'what's the probability of Event A by time t, given Event B could happen first?' Critical for subscription dynamics where users can churn, upgrade, downgrade, or convert.
Random Survival Forests
Extended random forests to censored survival data, creating a nonparametric, ensemble-based alternative to Cox regression. Captures complex interactions without proportional hazards assumption. The go-to ML approach for churn prediction with high-dimensional feature sets.
DeepSurv: Personalized Treatment Recommender System Using a Cox Proportional Hazards Deep Neural Network
Married deep learning with Cox regression by replacing the linear predictor with a neural network. Includes framework for personalized treatment recommendations—identifying which retention interventions work best for which users. The bridge between survival analysis and modern recommender systems.
Bayesian Hierarchical Models
Pool information across groups with principled uncertainty quantification
Sampling-Based Approaches to Calculating Marginal Densities
Birth of modern Bayesian computation. Demonstrated that the Gibbs sampler could handle a vast range of Bayesian posterior computations by iteratively sampling from full conditional distributions. Before this paper, hierarchical models were limited to conjugate priors. After it, arbitrary model complexity became computationally feasible.
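A toy Gibbs sampler (not from the paper) showing the alternating-conditional mechanic: unknown normal mean and variance under flat/Jeffreys priors, where both full conditionals are easy to draw from.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=10.0, scale=2.0, size=200)         # observed data
n, ybar = y.size, y.mean()

# alternate draws from each full conditional: mu | sigma2, y and sigma2 | mu, y
n_iter, mu, sigma2 = 5000, y.mean(), y.var()
draws = np.empty((n_iter, 2))
for i in range(n_iter):
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))                      # mu | sigma2, y
    sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / np.sum((y - mu) ** 2))  # sigma2 | mu, y
    draws[i] = mu, sigma2

burned = draws[1000:]                                  # discard burn-in
print("posterior mean of mu     :", burned[:, 0].mean())
print("posterior mean of sigma^2:", burned[:, 1].mean())
```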
Data Analysis Using Stein's Estimator and Its Generalizations
Made James-Stein shrinkage practical and intuitive using baseball batting averages. Showed that shrinking individual estimates toward their collective mean reduces MSE by >50% vs raw sample means. Explains why hierarchical models work—borrowing information via shrinkage slashes estimation error for small cells.
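A NumPy sketch of the shrinkage rule on simulated data (the numbers are made up, not the Efron-Morris batting averages): shrink each raw mean toward the grand mean by a factor that grows as the observed spread shrinks relative to the sampling noise.

```python
import numpy as np

rng = np.random.default_rng(5)
k, sigma = 18, 0.07                        # 18 units, known sampling SD (illustrative)
theta = rng.normal(0.26, 0.03, size=k)     # true long-run means

mse_raw, mse_js = [], []
for _ in range(2000):
    x = rng.normal(theta, sigma)                       # noisy observed means
    grand = x.mean()
    shrink = max(0.0, 1.0 - (k - 3) * sigma**2 / np.sum((x - grand) ** 2))
    js = grand + shrink * (x - grand)                  # James-Stein (positive-part) estimate
    mse_raw.append(np.mean((x - theta) ** 2))
    mse_js.append(np.mean((js - theta) ** 2))

print("MSE of raw means  :", round(np.mean(mse_raw), 5))
print("MSE of James-Stein:", round(np.mean(mse_js), 5))
```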
Prior Distributions for Variance Parameters in Hierarchical Models
Identified that conventional inverse-gamma priors for variance parameters often dominate the posterior when group-level sample sizes are small. Introduced half-Cauchy and half-t priors as robust alternatives—now the default in Stan and PyMC. The fix for when hierarchical models return implausible variance estimates.
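A minimal sketch assuming the PyMC library (v4+ API), using the well-known eight-schools numbers: the group-level SD gets a half-Cauchy prior instead of the old inverse-gamma default.

```python
import numpy as np
import pymc as pm

# eight group-level estimates and their standard errors (the classic eight-schools data)
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
se = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=25.0)
    tau = pm.HalfCauchy("tau", beta=25.0)              # weakly informative prior on group SD
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=y.size)
    pm.Normal("obs", mu=theta, sigma=se, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print("posterior mean of tau:", float(idata.posterior["tau"].mean()))
```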
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
NUTS eliminated painful manual tuning that made HMC impractical for non-experts. By automatically determining trajectory lengths, achieved near-optimal efficiency without user intervention. Powers Stan, PyMC, NumPyro—the reason you can fit 50-parameter hierarchical MMMs without tuning anything.
Inferring Causal Impact Using Bayesian Structural Time-Series Models
CausalImpact combines Bayesian structural time-series with synthetic control to estimate counterfactual outcomes when experiments are impossible. Answers 'what would have happened without the intervention?' Google's most-used internal causal inference tool for measuring TV campaigns and regional launches.
Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects
Established modern Bayesian MMM paradigm now implemented in Google's LightweightMMM and Meridian. Key innovations: flexible functional forms for adstock decay and saturation, full Bayesian treatment propagating uncertainty to ROAS estimates. The starting point for marketing budget optimization at scale.
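A compact NumPy sketch of the two transforms—geometric adstock for carryover and a Hill curve for saturation; the decay, half-saturation, and shape values are illustrative and would carry priors in a full Bayesian MMM.

```python
import numpy as np

def geometric_adstock(spend, decay=0.6, max_lag=12):
    """Carryover: today's effective spend is a decaying sum of recent spend."""
    weights = decay ** np.arange(max_lag)
    padded = np.concatenate([np.zeros(max_lag - 1), spend])
    return np.array([
        np.dot(padded[t:t + max_lag][::-1], weights) for t in range(len(spend))
    ])

def hill_saturation(x, half_sat=100.0, shape=1.5):
    """Diminishing returns: response saturates as effective spend grows."""
    return x ** shape / (x ** shape + half_sat ** shape)

weekly_spend = np.array([0, 50, 120, 200, 80, 0, 0, 150, 300, 60], dtype=float)
effective = hill_saturation(geometric_adstock(weekly_spend))
print(np.round(effective, 3))                         # feeds into the regression on sales
```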
Generalized Linear Models
Model count data, overdispersion, excess zeros, and bounded rate outcomes
Generalized Linear Models
THE seminal paper that unified linear, logistic, and Poisson regression under a single framework. Showed any exponential family outcome can be modeled through a link function connecting the mean response to a linear predictor. Introduced iteratively reweighted least squares (IRLS) for maximum-likelihood estimation. Understanding this lets you choose the right GLM family for any outcome type.
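A minimal statsmodels example on simulated counts: the same GLM call handles other outcome types by swapping the family, and the fit itself runs IRLS under the hood.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))          # intercept + two covariates
mu = np.exp(X @ np.array([0.5, 0.3, -0.2]))           # log link: E[y] = exp(X beta)
y = rng.poisson(mu)

# swap Poisson() for Binomial(), Gamma(), Gaussian(), ... within the same framework
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.summary())
```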
Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method
Introduced quasi-likelihood—requiring only mean-variance relationship specification, not full distribution. Enables valid inference when count data has variance exceeding mean (overdispersion). Real behavioral data almost never follows Poisson assumptions; quasi-likelihood gives valid SEs by specifying variance = φ × mean.
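A sketch of the quasi-Poisson idea assuming statsmodels: refit the Poisson GLM but estimate the dispersion φ from the Pearson statistic, which leaves the point estimates unchanged and inflates the standard errors by √φ.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_obs = 2000
X = sm.add_constant(rng.normal(size=(n_obs, 1)))
mu = np.exp(1.0 + 0.4 * X[:, 1])
y = rng.negative_binomial(n=2, p=2 / (2 + mu))        # overdispersed counts: Var > mean

poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")   # Pearson-based phi

print("estimated dispersion phi:", round(float(quasi.scale), 2))
print("Poisson SEs      :", np.round(poisson.bse, 4))
print("quasi-Poisson SEs:", np.round(quasi.bse, 4))
```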
Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing
Developed Zero-Inflated Poisson (ZIP) model for data from a mixture: with probability p, outcome is always zero (structural zeros), otherwise follows Poisson. Essential for engagement metrics—separates 'never-users' from 'not-yet users' when modeling clicks, sessions, or purchases. Implemented in R's pscl package.
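A sketch assuming statsmodels' ZeroInflatedPoisson (an alternative to the R pscl implementation mentioned above), with data simulated for illustration: a logit model for the structural zeros and a Poisson model for the counts.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 3000
x = rng.normal(size=n)
X = sm.add_constant(x)

# simulate the mixture: structural zeros with probability p, otherwise Poisson counts
p_zero = 1 / (1 + np.exp(-(-0.5 + 0.8 * x)))          # "never-user" probability
lam = np.exp(0.7 + 0.3 * x)
y = np.where(rng.random(n) < p_zero, 0, rng.poisson(lam))

zip_model = sm.ZeroInflatedPoisson(y, X, exog_infl=X, inflation="logit")
res = zip_model.fit(method="bfgs", maxiter=500)
print(res.summary())                                   # inflation (logit) and count equations
```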
Beta Regression for Modelling Rates and Proportions
Proposed regression with beta-distributed responses for bounded (0,1) outcomes. Handles natural heteroskedasticity of rate data—variance highest near 0.5, decreasing toward boundaries. Linear regression on rates produces nonsensical predictions outside [0,1]. Essential for CTR, conversion rates, retention rates. Implemented in R's betareg package.
Regression-Based Tests for Overdispersion in the Poisson Model
Developed practical regression-based tests for overdispersion requiring only mean-variance specification. The optimal test reduces to a simple t-test from auxiliary OLS regression. Before fitting negative binomial for overdispersed counts, use this test to formally reject Poisson. Implemented in R's AER package via dispersiontest().
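A from-scratch version of the auxiliary regression (mirroring what AER's dispersiontest does, written here with statsmodels and simulated data): fit the Poisson GLM, then regress z_i = ((y_i − μ̂_i)² − y_i)/μ̂_i on μ̂_i without an intercept and read off the t-statistic.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n_obs = 2000
X = sm.add_constant(rng.normal(size=(n_obs, 1)))
mu = np.exp(1.0 + 0.5 * X[:, 1])
y = rng.negative_binomial(n=3, p=3 / (3 + mu))        # truly overdispersed: Var = mu + mu^2/3

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues

# auxiliary OLS without intercept: tests H0 Var = mu against Var = mu + alpha * mu^2
z = ((y - mu_hat) ** 2 - y) / mu_hat
aux = sm.OLS(z, mu_hat).fit()
print("alpha-hat:", round(float(aux.params[0]), 3), " t-stat:", round(float(aux.tvalues[0]), 2))
# a large positive t-statistic rejects the Poisson mean = variance restriction
```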
Mixed Effects & Multilevel Models
Handle user-level heterogeneity, repeated measures, and hierarchical platform data
Random-Effects Models for Longitudinal Data
Its lead author, Nan Laird, received the 2021 International Prize in Statistics. Unified empirical Bayes and maximum likelihood via the EM algorithm for unbalanced longitudinal data with subject-specific random effects. Solves the 'users have different numbers of sessions' problem—models user heterogeneity while borrowing strength across users through partial pooling.
Recovery of Inter-Block Information when Block Sizes are Unequal
Invented Restricted Maximum Likelihood (REML), which accounts for degrees of freedom lost to estimating fixed effects. Now the default estimation method in virtually every mixed model software. Critical for unbiased variance component estimates, especially important for power analysis in A/B testing.
That BLUP is a Good Thing: The Estimation of Random Effects
Unified BLUP (Best Linear Unbiased Prediction) theory—showing it's the same as Kalman filtering, kriging, and credibility theory. Explains why shrinkage toward the grand mean is optimal: users with little data shrink toward population mean, users with abundant data reflect their own history.
Fitting Linear Mixed-Effects Models Using lme4
One of the most cited statistical papers in history (~75,000 citations). Documents lme4's computational algorithms and formula syntax: (1|user_id) for random intercepts, (treatment|user_id) for random slopes, (1|user_id) + (1|market) for crossed effects. The implementation guide for mixed models in R.
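lme4 is R; a rough Python analogue with statsmodels' MixedLM on simulated data (column names are illustrative): (treatment | user_id) maps to groups= plus re_formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n_users, n_obs = 200, 8
user = np.repeat(np.arange(n_users), n_obs)
treatment = rng.integers(0, 2, size=user.size)
u_int = rng.normal(0, 1.0, n_users)[user]             # random intercept per user
u_slope = rng.normal(0, 0.5, n_users)[user]           # random treatment slope per user
y = 2.0 + u_int + (0.3 + u_slope) * treatment + rng.normal(0, 1.0, user.size)
df = pd.DataFrame({"y": y, "treatment": treatment, "user_id": user})

# roughly lme4's  y ~ treatment + (treatment | user_id)
model = smf.mixedlm("y ~ treatment", df, groups=df["user_id"], re_formula="~treatment")
result = model.fit(reml=True)                          # REML estimation, as in lme4's default
print(result.summary())
```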
On the Pooling of Time Series and Cross Section Data
Bridges econometrics and biostatistics: proves that the fixed-effects estimator coincides with random effects once the group means of the regressors are included as covariates. The 'Mundlak approach' offers a compromise—random effects for efficiency plus cluster means to allow correlation between effects and regressors. Critical for choosing between plm and lmer.
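A small simulated illustration of the Mundlak device, assuming statsmodels' MixedLM and made-up column names: adding the user-level mean of the regressor to a random-effects model recovers the within (fixed-effects) coefficient even though the user effect is correlated with x.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n_users, n_obs = 300, 6
user = np.repeat(np.arange(n_users), n_obs)
ability = rng.normal(0, 1, n_users)[user]             # unobserved user effect
x = 0.5 * ability + rng.normal(0, 1, user.size)       # regressor correlated with the effect
y = 1.0 + 0.8 * x + ability + rng.normal(0, 1, user.size)
df = pd.DataFrame({"y": y, "x": x, "user_id": user})

# Mundlak device: add the user-level mean of x as an extra covariate
df["x_bar"] = df.groupby("user_id")["x"].transform("mean")
mundlak = smf.mixedlm("y ~ x + x_bar", df, groups=df["user_id"]).fit()
print(mundlak.summary())                               # coefficient on x ≈ 0.8 (within effect)
```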
Multiple Testing & False Discovery Rate
Control error rates when testing many hypotheses simultaneously
Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
One of the most-cited statistics papers ever (~100,000 citations). Introduced FDR as alternative to FWER—controls expected proportion of false discoveries among rejections. The BH step-up procedure is now default in experimentation platforms at Google, Netflix, Meta. Essential when testing 500 experiments or 20 metrics per A/B test.
A Simple Sequentially Rejective Multiple Test Procedure
Dominant method for FWER control when you cannot tolerate any false positives. Step-down approach uniformly more powerful than Bonferroni with same guarantee, no dependence assumptions required. The right choice for guardrail metrics (revenue, latency, crash rates) where a single false positive could ship a harmful feature.
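Both of the procedures above are one call away in statsmodels (p-values here are simulated for illustration): BH for FDR on exploratory metrics, Holm for strict FWER on guardrails.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(12)
# 200 tests: 20 real effects with small p-values, 180 true nulls with uniform p-values
pvals = np.concatenate([rng.uniform(0, 0.005, 20), rng.uniform(0, 1, 180)])

rej_bh, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")   # Benjamini-Hochberg
rej_holm, _, _, _ = multipletests(pvals, alpha=0.05, method="holm")   # Holm step-down

print("raw p < 0.05   :", int((pvals < 0.05).sum()))
print("BH rejections  :", int(rej_bh.sum()))
print("Holm rejections:", int(rej_holm.sum()))
```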
A Direct Approach to False Discovery Rates
Introduced q-values—the FDR analogue of p-values. A q-value tells you the minimum FDR threshold at which a test becomes significant. Also introduced π₀ estimation (proportion of true nulls), which can boost power up to 8× compared to BH when many tests are truly non-null.
The Control of the False Discovery Rate in Multiple Testing Under Dependency
Proves BH controls FDR under positive regression dependence (PRDS), covering most real-world cases. For arbitrary dependence, provides the BY correction guaranteeing FDR control under any correlation structure. Essential since A/B test metrics are correlated, user segments overlap, and experimental units cluster.
Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis
Introduced empirical null and local FDR (lfdr). When testing thousands of hypotheses, the theoretical null N(0,1) may be miscalibrated. Estimating null from data corrects for systematic biases. Local FDR assigns each experiment a 'probability of being noise'—essential for prioritizing follow-up in large experimentation portfolios.
Controlling the False Discovery Rate via Knockoffs
FDR control for variable selection in regression—where traditional p-values are unreliable due to correlated predictors. Constructs 'fake' knockoff variables mimicking correlation structure but independent of outcome. Enables FDR-controlled claims about which variables matter, not just whether there's signal. Bridge between multiple testing and high-dimensional regression.
Survey Sampling & Weighted Estimation
Reweight non-random samples for valid population inference
A Generalization of Sampling Without Replacement from a Finite Universe
Introduced the Horvitz-Thompson estimator: Ŷ = Σ(Yᵢ/πᵢ). Works for any probability sampling design by inverse probability weighting. This is the same math underlying propensity score weighting in causal inference—survey statisticians solved IPW in 1952; causal inference borrowed it three decades later.
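A tiny NumPy illustration of the formula with a simulated population: when heavier users are more likely to be sampled, weighting each sampled value by 1/πᵢ recovers the population total while the naive estimate does not.

```python
import numpy as np

rng = np.random.default_rng(13)
N = 10_000
revenue = rng.gamma(2.0, 50.0, size=N)                 # population values (unknown in practice)

# unequal-probability (Poisson) sampling: heavier users are more likely to be included
pi = np.clip(revenue / revenue.sum() * 2000, 0.01, 1.0)
sampled = rng.random(N) < pi
y, p = revenue[sampled], pi[sampled]

ht_total = np.sum(y / p)                               # Horvitz-Thompson: weight by 1/pi_i
naive = y.mean() * N                                   # ignores unequal inclusion -> biased up
print(f"true total  : {revenue.sum():,.0f}")
print(f"HT estimate : {ht_total:,.0f}")
print(f"naive N*ybar: {naive:,.0f}")
```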
Calibration Estimators in Survey Sampling
Unified decades of ad-hoc weighting methods—post-stratification, raking, regression estimation—under a single calibration framework. Weights minimize distance from design weights while matching known population totals. Foundation behind every 'weight to Census' adjustment in commercial panels and platform surveys.
On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection
Proved random probability sampling beats purposive 'representative' selection. Derived optimal allocation formula for stratified sampling: sample proportional to Nₕ × Sₕ (stratum size × stratum SD). Establishes why design-based inference via randomization provides the foundation for valid uncertainty quantification.
Poststratification into Many Categories Using Hierarchical Logistic Regression
Introduced MRP (Multilevel Regression with Poststratification) for small-area estimation when many cells are sparse. Borrows strength across similar cells via hierarchical model, then poststratifies to population proportions. Enables valid estimation for small subgroups from biased opt-in samples—validated by accurate election forecasts from Xbox data that was 93% male.
Doubly Robust Inference with Nonprobability Survey Samples
Doubly robust estimators for convenience samples with unknown selection mechanisms—the default for tech company data. Consistent if either propensity model or outcome model is correctly specified. Bridges survey sampling and causal inference, showing survey propensity weights and causal IPW solve mathematically identical problems.
Extreme Value Theory
Model tail risks, detect anomalies, and quantify rare events
Limiting Forms of the Frequency Distribution of the Largest or Smallest Member of a Sample
Proves that maxima of i.i.d. samples converge to exactly three distribution types—Gumbel (light tails), Fréchet (heavy tails), and Weibull (bounded tails). This 'three types theorem' is the foundation of all EVT. Every anomaly detector, VaR model, and tail risk estimator builds on this elegant result.
Statistical Inference Using Extreme Order Statistics
Introduced GPD (Generalized Pareto Distribution) and Peaks-Over-Threshold methodology. Exceedances beyond any sufficiently high threshold follow GPD regardless of original distribution. POT uses all extreme observations rather than just block maxima, enabling anomaly detection with 10-100x fewer observations. Foundation of SPOT, DSPOT, and all modern streaming EVT.
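A peaks-over-threshold sketch with scipy.stats.genpareto on simulated latencies (threshold choice and data are illustrative): fit the GPD to exceedances over a high threshold, then extrapolate an extreme tail quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(14)
latency = rng.lognormal(mean=3.0, sigma=0.8, size=200_000)    # e.g. request latencies (ms)

# keep exceedances over a high threshold and fit a Generalized Pareto Distribution
u = np.quantile(latency, 0.99)
exceed = latency[latency > u] - u
c, _, scale = stats.genpareto.fit(exceed, floc=0)             # fix location at 0

# tail quantile: P(X > x) = P(X > u) * P(exceedance > x - u | X > u)
p_u = exceed.size / latency.size
q = 1e-5                                                      # target tail probability
x_q = u + stats.genpareto.ppf(1 - q / p_u, c, loc=0, scale=scale)
print(f"threshold u = {u:.1f} ms, estimated 99.999th percentile ≈ {x_q:.1f} ms")
print(f"empirical 99.999th percentile ≈ {np.quantile(latency, 1 - q):.1f} ms")
```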
Models for Exceedances Over High Thresholds
THE implementation guide for POT modeling. Complete toolkit: MLE for GPD parameters, threshold selection via mean residual life plots, diagnostic methods, handling temporal dependence. Every EVT software package (R's extRemes, Python's scipy.stats) implements methods from this paper. Answers: How to choose threshold? How to validate model?
Estimation of Tail-Related Risk Measures for Heteroscedastic Financial Time Series: An Extreme Value Approach
Solved EVT's limitation for time-varying volatility via two-stage GARCH-EVT: filter through GARCH to remove heteroscedasticity, then apply GPD to standardized residuals. First rigorous formulas for conditional VaR and Expected Shortfall satisfying Basel requirements. Same framework applies to tail latency, fraud detection, any metric with volatility clustering.
Anomaly Detection in Streams with Extreme Value Theory
SPOT (Streaming Peaks-Over-Threshold) automatically detects anomalies in real-time with no manual threshold tuning—threshold emerges from EVT theory. DSPOT extends to non-stationary streams with concept drift. O(1) per observation, no distributional assumptions. Applications: DDoS detection, equipment failures, fraud, latency spikes. Open-source Python code included.
Item Response Theory
Model skill assessment, adaptive testing, and ML evaluation
Probabilistic Models for Some Intelligence and Attainment Tests
Introduced the one-parameter logistic (1PL/Rasch) model where response probability depends on person ability minus item difficulty. The 'specific objectivity' property enables comparing persons independent of which items they answered—the theoretical cornerstone of all computer adaptive testing (CAT) and Duolingo-style assessments.
Statistical Theories of Mental Test Scores
The 'bible of test theory'. Birnbaum's chapters introduced 2PL (adding discrimination α) and 3PL (adding guessing γ) models. 2PL identifies which items best differentiate abilities—high-discrimination items worth more for rankings. 3PL handles multiple-choice guessing and random baseline performance. The canonical model family for 50+ years.
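A tiny sketch of the three item response functions discussed in the two entries above (parameter values are illustrative): the 1PL uses only difficulty b, the 2PL adds discrimination a, and the 3PL adds a guessing floor c.

```python
import numpy as np

def irt_prob(theta, b, a=1.0, c=0.0):
    """P(correct | ability theta): 1PL with defaults, 2PL via a, 3PL via c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = 0.5                                  # learner ability on the logit scale
items = [
    {"b": -1.0},                             # easy Rasch/1PL item
    {"b": 0.5, "a": 2.5},                    # highly discriminating 2PL item
    {"b": 1.5, "a": 1.2, "c": 0.25},         # hard 4-option multiple-choice 3PL item
]
for item in items:
    print(item, "->", round(irt_prob(theta, **item), 3))
```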
Machine Learning–Driven Language Assessment
How Duolingo English Test works at scale. Uses ML/NLP to estimate Rasch difficulty from item text—skipping expensive human piloting. Achieves 0.96 internal consistency and r=0.77-0.78 with TOEFL/IELTS. Item exposure drops to 0.10% vs 20% in conventional CAT. Blueprint for building adaptive assessment without human norming.
β³-IRT: A New Item Response Model and its Applications
Extends IRT to continuous responses using Beta-distributed item characteristic curves. Treats ML model evaluation as psychometric problem: each test instance has latent difficulty, each model has latent ability. Identifies which benchmark examples genuinely discriminate strong from weak classifiers—critical for efficient benchmarking.
Building an Evaluation Scale using Item Response Theory
First systematic application of IRT to NLP evaluation. Shows high accuracy ≠ high ability when item difficulty ignored—80% on easy items may indicate lower ability than 70% on hard items. Foundation for IRT-based ML leaderboards accounting for difficulty. Directly applicable to crowdsourcing quality estimation.
Post-Selection Inference
Valid confidence intervals after model selection
Valid Post-Selection Inference
First practical PoSI framework for valid inference after arbitrary model selection. Key insight: treat as simultaneous inference by widening CIs to cover all 2^p possible coefficient estimates across submodels. Conservative but universally valid—works regardless of whether selection used stepwise, lasso, AIC, or informal judgment.
A Significance Test for the Lasso
Breakthrough bringing p-values to lasso regression via covariance test statistic. Under null, test statistic follows Exp(1)—though variables chosen adaptively, lasso shrinkage makes null distribution tractable. Works in high-dimensional settings (p > n). The first principled answer to 'which lasso-selected features are real.'
Exact Post-Selection Inference, with Application to the Lasso
THE foundational methods paper. Introduces polyhedral lemma: lasso selection event = response y falling into polyhedral set (Ay ≤ b). Conditioning yields truncated Gaussian distribution for exact finite-sample CIs accounting for selection. No asymptotics required. Implemented in selectiveInference R package.
Statistical Learning and Selective Inference
Accessible PNAS entry point to the field. Poses the question: 'Having mined data to find potential associations, how do we properly assess their strength?' Illustrates methods for forward stepwise, lasso, PCA with worked examples. Connects selective inference to replication crisis—cherry-picking requires higher significance bar.
Exact Post-Selection Inference for Sequential Regression Procedures
Extends polyhedral framework to forward stepwise and LAR—the most commonly used selection procedures. Proves these produce polyhedral selection events enabling exact conditional inference. Primary methods paper underlying selectiveInference R package: fs(), fsInf(), lar(), larInf(), fixedLassoInf().