Reinforcement Learning
Make sequential decisions that improve outcomes over time • 53 papers
Multi-Armed Bandits
Balance exploring new options vs. exploiting known winners
Asymptotically Efficient Adaptive Allocation Rules
Establishes the fundamental logarithmic regret lower bound and characterizes the asymptotically optimal exploration-exploitation tradeoff.
Finite-time Analysis of the Multiarmed Bandit Problem
Introduces UCB1 with finite-time regret bounds; the practical workhorse algorithm still widely deployed (see the sketch after this list).
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
Comprehensive survey unifying stochastic and adversarial bandits; essential theoretical reference.
Optimal Best Arm Identification with Fixed Confidence
Establishes lower bounds and optimal algorithms for best-arm identification; key for A/B testing.
On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models
Characterizes the sample complexity of best-arm identification in fixed-budget and fixed-confidence settings.
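For concreteness, here is a minimal UCB1 sketch in the spirit of Auer et al., assuming Bernoulli rewards; `true_means` is a hypothetical stand-in for the environment, not any paper's reference code:

```python
import math
import random

def ucb1(true_means, horizon):
    k = len(true_means)
    counts = [0] * k      # pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # play each arm once to initialize
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_i)
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

random.seed(0)
print(ucb1([0.2, 0.5, 0.7], horizon=10_000))  # pulls concentrate on arm 2
```

The sqrt(2 ln t / n_i) bonus shrinks as an arm accumulates pulls, which is exactly the logarithmic exploration budget the Lai-Robbins lower bound says is unavoidable.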
Contextual Bandits
Personalize decisions based on user context
A Contextual-Bandit Approach to Personalized News Article Recommendation
Introduces LinUCB; deployed at Yahoo! and evaluated on 33M events; foundational industry paper (see the LinUCB sketch after this list).
Contextual Bandits with Linear Payoff Functions
Rigorous theoretical analysis of linear-payoff contextual bandits; introduces the SupLinUCB algorithm.
Thompson Sampling for Contextual Bandits with Linear Payoffs
First near-optimal regret bounds for Thompson Sampling in contextual settings.
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits
Efficient algorithm handling large policy classes for practical implementation.
Neural Contextual Bandits with UCB-based Exploration
Extends contextual bandits to neural network function approximation; bridges deep learning and bandit theory.
Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles
SquareCB algorithm achieves optimal regret using only regression oracles; practical for complex function classes.
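A minimal sketch of disjoint LinUCB from the Li et al. paper, assuming a small fixed arm set and d-dimensional context vectors; `alpha` (the confidence width) is a tuning knob, and the per-step matrix inverse is kept for readability where production code would use rank-one updates:

```python
import numpy as np

class LinUCB:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(d) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```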
Thompson Sampling & Bayesian Bandits
Drive exploration with posterior uncertainty
A Tutorial on Thompson Sampling
Definitive survey covering theory, applications, and extensions; essential practitioner reference (see the Beta-Bernoulli sketch after this list).
An Empirical Evaluation of Thompson Sampling
Demonstrates strong empirical performance; sparked renewed industry interest.
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
First proof of near-optimal O(√(KT log T)) regret bounds.
Learning to Optimize via Posterior Sampling
Theoretical foundations extending Thompson Sampling to general RL.
Learning to Optimize via Information-Directed Sampling
Generalizes Thompson Sampling using information ratio; provides tighter regret bounds for structured problems.
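The core of Thompson Sampling fits in a few lines; below is a Beta-Bernoulli sketch assuming binary rewards and uniform Beta(1,1) priors, with a hypothetical `true_means` again standing in for the environment:

```python
import random

def thompson_sampling(true_means, horizon):
    k = len(true_means)
    wins, losses = [0] * k, [0] * k
    for _ in range(horizon):
        # draw one plausible mean per arm from its posterior, play the argmax
        samples = [random.betavariate(wins[i] + 1, losses[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if random.random() < true_means[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses
```

Exploration falls out of posterior uncertainty: arms with wide posteriors occasionally sample high and get played, with no explicit bonus term.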
Off-Policy Evaluation
Evaluate new policies using historical data
Doubly Robust Policy Evaluation and Learning
Introduces the doubly robust estimator combining IPS with a direct reward model; robust to model misspecification (see the estimator sketches after this list).
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Principled framework for learning from logged data; widely used in industry.
The Self-Normalized Estimator for Counterfactual Learning
Addresses high variance in IPS with self-normalization for real systems.
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
Minimax optimal bounds providing theoretical foundation for modern OPE.
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
High-confidence bounds for off-policy evaluation; safety-critical applications in healthcare and education.
Towards Optimal Off-Policy Evaluation for Reinforcement Learning
Achieves minimax optimal rates for OPE in tabular MDPs; foundational theoretical result.
Off-policy Evaluation for Slate Recommendation
Extends OPE to slate/list recommendations; critical for search and recommendation systems.
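Sketches of the three estimators this literature builds on: inverse propensity scoring (IPS), its self-normalized variant, and the doubly robust combination. Inputs are hypothetical arrays of logged rewards, logging propensities p_log, target-policy probabilities p_eval, and reward-model predictions:

```python
import numpy as np

def ips(rewards, p_log, p_eval):
    w = p_eval / p_log                      # importance weights
    return np.mean(w * rewards)             # unbiased, but high variance

def snips(rewards, p_log, p_eval):
    w = p_eval / p_log
    return np.sum(w * rewards) / np.sum(w)  # self-normalization trades a
                                            # little bias for much less variance

def doubly_robust(rewards, p_log, p_eval, q_logged, v_eval):
    # q_logged: model's prediction for the logged action
    # v_eval:   model's expected reward under the target policy
    w = p_eval / p_log
    return np.mean(v_eval + w * (rewards - q_logged))
```

The doubly robust estimator stays unbiased if either the propensities or the reward model are correct, which is where its name comes from.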
Batch/Offline RL
Learn optimal decisions from logged data
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Comprehensive tutorial defining the field's challenges and research directions.
Off-Policy Deep Reinforcement Learning without Exploration (BCQ)
Identifies extrapolation error as core challenge; introduces Batch-Constrained Q-learning.
Conservative Q-Learning for Offline Reinforcement Learning (CQL)
Learns conservative Q-functions that lower-bound the true value; SOTA on D4RL benchmarks (see the tabular sketch after this list).
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Standard benchmark datasets enabling reproducible offline RL research.
Decision Transformer: Reinforcement Learning via Sequence Modeling
Frames offline RL as sequence modeling using transformers; avoids bootstrapping entirely.
Offline Reinforcement Learning with Implicit Q-Learning
Simple algorithm avoiding explicit policy constraint; strong performance on D4RL benchmarks.
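CQL's central idea compresses well into the tabular case: alongside the TD update, push Q-values down through a softmax over all actions and push the logged action back up, so out-of-distribution actions are never overvalued. A simplified sketch (real CQL uses neural Q-functions; the dataset format and hyperparameters here are hypothetical):

```python
import numpy as np

def cql_tabular(dataset, n_states, n_actions, alpha=1.0, gamma=0.99,
                lr=0.1, epochs=100):
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next in dataset:        # logged (s, a, r, s') tuples
            # gradient of alpha * (logsumexp_a' Q(s,a') - Q(s,a))
            soft = np.exp(Q[s] - Q[s].max())
            soft /= soft.sum()
            Q[s] -= lr * alpha * soft          # push all actions down...
            Q[s, a] += lr * alpha              # ...and the data action back up
            target = r + gamma * Q[s_next].max()
            Q[s, a] -= lr * (Q[s, a] - target) # ordinary TD step
    return Q
```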
RL for Bidding & Pricing
Automate bidding and pricing with RL
Real-Time Bidding by Reinforcement Learning in Display Advertising
Frames budget-constrained RTB as an MDP with neural-network value approximation.
Deep Reinforcement Learning for Sponsored Search Real-time Bidding
DRL for sponsored search deployed at Alibaba scale; handles non-stationary environments.
Budget Constrained Bidding by Model-free Reinforcement Learning
Model-free RL for budget-constrained bidding with practical reward function design.
Web-scale Bayesian Click-through Rate Prediction for Sponsored Search
Thompson Sampling at web-scale in Bing; demonstrates industrial viability of Bayesian methods.
Learning in Repeated Auctions with Budgets: Regret Minimization and Equilibrium
Regret bounds for bidding with budget constraints; foundational for pacing algorithms (see the pacing sketch after this list).
Optimal Auctions through Deep Learning
Neural networks learn near-optimal auction mechanisms; connects ML and mechanism design.
Contextual Bandits with Cross-Learning
Learning across related auctions; critical for ad platforms with correlated contexts.
Personalized Dynamic Pricing with Machine Learning: High-Dimensional Features and Heterogeneous Elasticity
Combines ML feature learning with dynamic pricing; optimal regret in high dimensions.
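A minimal pacing sketch in the spirit of the repeated-auctions-with-budgets line of work: bid value / (1 + lambda) and move the dual multiplier lambda toward the per-round spend target. The value and price distributions below are placeholders:

```python
import random

def paced_bidding(horizon, budget, lr=0.01):
    lam, spent = 0.0, 0.0
    target = budget / horizon                   # allowed spend per round
    for _ in range(horizon):
        value = random.random()                 # impression value estimate
        price = 0.8 * random.random()           # highest competing bid
        bid = value / (1.0 + lam)               # shade bids when pacing binds
        cost = price if bid > price and spent + price <= budget else 0.0
        spent += cost
        lam = max(0.0, lam + lr * (cost - target))  # dual ascent on the budget
    return spent, lam

random.seed(0)
print(paced_bidding(horizon=10_000, budget=1_000))
```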
RL for Recommendations
Optimize long-term user engagement in recommendation systems
Top-K Off-Policy Correction for a REINFORCE Recommender System
Deployed at YouTube; addresses large action spaces in slate recommendation with off-policy correction (see the weight sketch after this list).
SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
Decomposes slate Q-values for tractable optimization; deployed at Google.
Deep Reinforcement Learning for Page-wise Recommendations
DRL for whole-page recommendations considering item interactions and user browsing patterns.
Generative Adversarial User Model for Reinforcement Learning Based Recommendation System
Learns user simulator for offline RL training; addresses exploration challenges in recommendations.
Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems
Optimizes long-term user retention metrics beyond immediate clicks; deployed at JD.com.
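The key formula in the Top-K off-policy correction paper is a reweighted REINFORCE gradient; a sketch of the per-example weight, with hypothetical probabilities from the target policy pi and the logging policy beta:

```python
import numpy as np

def top_k_weights(pi, beta, k):
    iw = pi / beta                     # standard off-policy importance weight
    lam = k * (1.0 - pi) ** (k - 1)    # derivative of 1-(1-pi)^k, the chance
                                       # the item enters a size-k slate
    return iw * lam

pi = np.array([0.10, 0.02, 0.30])      # target-policy probabilities
beta = np.array([0.20, 0.05, 0.25])    # logging-policy probabilities
rewards = np.array([1.0, 0.0, 1.0])
# each logged example contributes weight * reward * grad log pi to the gradient
print(top_k_weights(pi, beta, k=5) * rewards)
```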
Safe & Constrained RL
Ensure safety constraints during learning and deployment
Constrained Policy Optimization
First practical algorithm for RL with safety constraints; foundational for safe RL (see the Lagrangian sketch after this list).
Safe Model-based Reinforcement Learning with Stability Guarantees
Provides formal safety guarantees using Lyapunov functions; critical for robotics applications.
Benchmarking Safe Exploration in Deep Reinforcement Learning
OpenAI Safety Gym benchmark suite; standard evaluation for safe RL algorithms.
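CPO itself solves a trust-region subproblem, but the constrained objective it targets (maximize return subject to a cost constraint J_C(pi) <= d) is often handled with the simpler Lagrangian relaxation sketched below; the rollout costs are hypothetical:

```python
def lagrangian_update(lam, avg_cost, cost_limit, lr=0.05):
    # dual ascent: grow lam while the cost constraint is violated
    return max(0.0, lam + lr * (avg_cost - cost_limit))

# each iteration the policy would be trained on the penalized reward
# r - lam * c, so a rising lam steadily prices the constraint in
lam = 0.0
for avg_cost in [0.9, 0.8, 0.6, 0.4]:   # hypothetical per-iteration costs
    lam = lagrangian_update(lam, avg_cost, cost_limit=0.5)
    print(f"lam = {lam:.3f}")
```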
Exploration & Sample Efficiency
Learn effective policies with minimal data
Deep Exploration via Bootstrapped DQN
Approximates posterior sampling for deep RL with an ensemble of bootstrapped Q-heads; deep exploration without an explicit posterior.
Unifying Count-Based Exploration and Intrinsic Motivation
Pseudo-counts for exploration in high-dimensional spaces; breakthrough for sparse-reward problems (see the count-bonus sketch after this list).
Curiosity-driven Exploration by Self-supervised Prediction
Intrinsic curiosity module using prediction error as exploration bonus; widely influential approach.
Model Based Reinforcement Learning for Atari
SimPLe uses learned world models to reach strong Atari performance from roughly 100K environment interactions, far less data than model-free baselines.
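A tabular analogue of the bonus these papers generalize: add beta / sqrt(N(s)) to the environment reward, where N(s) is the visit count; pseudo-counts from a density model replace N(s) when states never repeat. State keys here are hypothetical hashables:

```python
import math
from collections import defaultdict

class CountBonus:
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def shaped_reward(self, state, reward):
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])  # decays with visits
        return reward + bonus
```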
Multi-Agent & Game-Theoretic RL
Learning in strategic multi-player environments
A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning
PSRO framework unifying game theory and deep RL; foundational for competitive multi-agent systems (see the toy loop after this list).
Learning with Opponent-Learning Awareness
LOLA accounts for opponent adaptation during learning; key insight for non-stationary multi-agent settings.
Grandmaster level in StarCraft II using multi-agent reinforcement learning
AlphaStar achieves grandmaster level; landmark result in complex multi-agent real-time strategy.
Superhuman AI for multiplayer poker
Pluribus beats top humans in 6-player poker; breakthrough in imperfect-information games.
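A toy PSRO-style loop on matching pennies, with exact best responses and a uniform meta-solver standing in for the Nash meta-solver PSRO normally uses; each iteration grows both strategy populations by one best response:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row player's matching-pennies payoffs

def best_response(payoffs, opp_mix):
    return int(np.argmax(payoffs @ opp_mix))  # pure best reply to a mixture

row_pop, col_pop = [0], [0]                # populations of pure strategies
for _ in range(200):
    # simplified meta-strategy: uniform over each population
    row_mix = np.bincount(row_pop, minlength=2) / len(row_pop)
    col_mix = np.bincount(col_pop, minlength=2) / len(col_pop)
    row_pop.append(best_response(A, col_mix))
    col_pop.append(best_response(-A.T, row_mix))

# empirical mixture drifts toward the Nash equilibrium [0.5, 0.5]
print(np.bincount(row_pop, minlength=2) / len(row_pop))
```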