Reinforcement Learning

Make sequential decisions that improve outcomes over time • 53 papers

10 subtopics

Multi-Armed Bandits

Balance exploring new options vs. exploiting known winners

1985 2419 cited

Asymptotically Efficient Adaptive Allocation Rules

T.L. Lai, Herbert Robbins

Establishes the fundamental logarithmic regret lower bound; characterizes the optimal exploration-exploitation tradeoff.

2002 5589 cited

Finite-time Analysis of the Multiarmed Bandit Problem

Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer

Introduces UCB1 with finite-time regret bounds; the practical workhorse algorithm still widely deployed.
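A minimal sketch of the UCB1 index rule from this paper; the Bernoulli test arms and the `pull` callback are illustrative assumptions, not part of the original.

```python
import math
import random

def ucb1(n_arms, horizon, pull):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_pulls)."""
    counts = [0] * n_arms          # times each arm was played
    means = [0.0] * n_arms         # empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:            # initialization: play every arm once
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
    return means, counts

# toy usage: three Bernoulli arms with unknown success probabilities
probs = [0.3, 0.5, 0.7]
means, counts = ucb1(3, 10_000, lambda a: 1.0 if random.random() < probs[a] else 0.0)
print(counts)   # the 0.7 arm should dominate the pull counts
```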

2012 1524 cited

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Sébastien Bubeck, Nicolò Cesa-Bianchi

Comprehensive survey unifying stochastic and adversarial bandits; essential theoretical reference.

2016 412 cited

Optimal Best Arm Identification with Fixed Confidence

Aurélien Garivier, Emilie Kaufmann

Establishes lower bounds and optimal algorithms for best-arm identification; key for A/B testing.

2016 358 cited

On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models

Emilie Kaufmann, Olivier Cappé, Aurélien Garivier

Characterizes the sample complexity of best-arm identification in fixed-budget and fixed-confidence settings.

Contextual Bandits

Personalize decisions based on user context

2010 2403 cited

A Contextual-Bandit Approach to Personalized News Article Recommendation

Lihong Li, Wei Chu, John Langford, Robert E. Schapire

Introduces LinUCB; evaluated on over 33M events from Yahoo!'s front-page news recommendation system; foundational industry paper.
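A minimal sketch of disjoint-model LinUCB in the spirit of this paper, assuming per-arm context features and a fixed exploration parameter `alpha` (both illustrative choices):

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm,
    scored with an upper-confidence exploration bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]      # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]    # X^T r per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                               # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```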

2011 577 cited

Contextual Bandits with Linear Payoff Functions

Wei Chu, Lihong Li, Lev Reyzin, Robert Schapire

Rigorous analysis of linear-payoff contextual bandits; introduces the SupLinUCB algorithm with near-optimal regret guarantees.

2013 547 cited

Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal, Navin Goyal

First near-optimal regret bounds for Thompson Sampling in contextual settings.

2014 313 cited

Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert Schapire

Oracle-efficient algorithm achieving near-optimal regret for general policy classes via cost-sensitive classification oracles; makes agnostic contextual bandits practical.

2020 286 cited

Neural Contextual Bandits with UCB-based Exploration

Dongruo Zhou, Lihong Li, Quanquan Gu

Extends contextual bandits to neural network function approximation; bridges deep learning and bandit theory.

2020 189 cited

Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles

Dylan Foster, Alexander Rakhlin

SquareCB algorithm achieves optimal regret using only regression oracles; practical for complex function classes.
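A minimal sketch of SquareCB's inverse-gap-weighting step, assuming the regression oracle has already produced per-arm reward predictions `y_hat` (the learning-rate choice `gamma` is illustrative):

```python
import numpy as np

def squarecb_probs(y_hat, gamma):
    """Inverse-gap weighting: turn regression-oracle predictions into an
    exploration distribution; arms with small predicted gap get more mass."""
    y_hat = np.asarray(y_hat, dtype=float)
    k = len(y_hat)
    best = int(np.argmax(y_hat))
    probs = 1.0 / (k + gamma * (y_hat[best] - y_hat))   # small gap -> high prob
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()                      # remaining mass to greedy arm
    return probs

# example: four arms scored by any regression model
p = squarecb_probs([0.1, 0.4, 0.35, 0.2], gamma=20.0)
action = np.random.choice(len(p), p=p)
```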

Thompson Sampling & Bayesian Bandits

Explore by sampling actions from posterior beliefs over rewards

2018 478 cited

A Tutorial on Thompson Sampling

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen

Definitive survey covering theory, applications, and extensions; essential practitioner reference.
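A minimal Beta-Bernoulli Thompson Sampling sketch in the spirit of the tutorial; the uniform Beta(1, 1) priors and the `pull` callback are illustrative assumptions:

```python
import random

def thompson_bernoulli(n_arms, horizon, pull):
    """Beta-Bernoulli Thompson Sampling: sample a mean from each arm's
    Beta posterior, play the argmax, update the winner's posterior."""
    successes = [1] * n_arms   # Beta(1, 1) uniform priors
    failures = [1] * n_arms
    for _ in range(horizon):
        samples = [random.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        if pull(arm):          # Bernoulli reward: 1 on success, 0 otherwise
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```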

2011 998 cited

An Empirical Evaluation of Thompson Sampling

Olivier Chapelle, Lihong Li

Demonstrates strong empirical performance; sparked renewed industry interest.

2012 737 cited

Analysis of Thompson Sampling for the Multi-armed Bandit Problem

Shipra Agrawal, Navin Goyal

First rigorous finite-time regret analysis of Thompson Sampling, establishing logarithmic problem-dependent regret bounds.

2014 526 cited

Learning to Optimize via Posterior Sampling

Daniel Russo, Benjamin Van Roy

General Bayesian regret framework for posterior sampling; links Thompson Sampling to confidence-bound analysis across a broad class of online optimization problems.

2018 245 cited

Learning to Optimize via Information-Directed Sampling

Daniel Russo, Benjamin Van Roy

Generalizes Thompson Sampling using information ratio; provides tighter regret bounds for structured problems.

Off-Policy Evaluation

Evaluate new policies using historical data

2011 302 cited

Doubly Robust Policy Evaluation and Learning

Miroslav Dudík, John Langford, Lihong Li

Introduces doubly robust estimator combining IPS and direct method; robust to model misspecification.
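A minimal sketch of the doubly robust estimator, assuming logged propensities are available and `reward_model` is any fitted direct-method regressor (both names are illustrative):

```python
import numpy as np

def doubly_robust_value(logs, target_probs, reward_model):
    """Doubly robust OPE: direct-method prediction plus an
    importance-weighted correction on the logged action.

    logs: list of (context, action, reward, logging_propensity)
    target_probs(x): target policy's action distribution at context x
    reward_model(x, a): estimated mean reward (the direct method)
    """
    values = []
    for x, a, r, p_log in logs:
        pi = target_probs(x)
        dm = sum(pi[b] * reward_model(x, b) for b in range(len(pi)))  # direct method
        correction = pi[a] / p_log * (r - reward_model(x, a))          # IPS-style correction
        values.append(dm + correction)
    return float(np.mean(values))
```

The estimate stays consistent if either the reward model or the logged propensities are correct, which is the "doubly robust" property the blurb refers to.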

2015 124 cited

Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

Adith Swaminathan, Thorsten Joachims

Principled framework for learning from logged data; widely used in industry.

2015 200 cited

The Self-Normalized Estimator for Counterfactual Learning

Adith Swaminathan, Thorsten Joachims

Addresses high variance in IPS with self-normalization for real systems.
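A minimal sketch of the self-normalized IPS (SNIPS) estimate, assuming per-example target and logging propensities for the logged actions:

```python
import numpy as np

def snips_value(rewards, target_probs, logging_probs):
    """Self-normalized IPS: divide the importance-weighted reward sum by the
    sum of the weights instead of by n, trading a little bias for much lower variance."""
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    r = np.asarray(rewards)
    return float(np.sum(w * r) / np.sum(w))
```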

2017 20 cited

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudík

Establishes minimax lower bounds and introduces the adaptive SWITCH estimator; theoretical foundation for modern OPE.

2016 312 cited

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

Philip Thomas, Emma Brunskill

High-confidence bounds for off-policy evaluation; safety-critical applications in healthcare and education.

2019 156 cited

Towards Optimal Off-Policy Evaluation for Reinforcement Learning

Tengyang Xie, Yifei Ma, Yu-Xiang Wang

Achieves minimax optimal rates for OPE in tabular MDPs; foundational theoretical result.

2017 189 cited

Off-policy Evaluation for Slate Recommendation

Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni

Extends OPE to slate/list recommendations; critical for search and recommendation systems.

Batch/Offline RL

Learn optimal decisions from logged data

2020 786 cited

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, Justin Fu

Comprehensive tutorial defining the field's challenges and research directions.

2019

Off-Policy Deep Reinforcement Learning without Exploration (BCQ)

Scott Fujimoto, David Meger, Doina Precup

Identifies extrapolation error as core challenge; introduces Batch-Constrained Q-learning.

2020 532 cited

Conservative Q-Learning for Offline Reinforcement Learning (CQL)

Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine

Learns conservative Q-functions lower-bounding true value; SOTA on D4RL benchmarks.
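A NumPy sketch of a CQL(H)-style loss for discrete actions, assuming precomputed Bellman targets; real implementations use automatic differentiation, target networks, and samplers for continuous actions:

```python
import numpy as np

def cql_loss(q_values, data_actions, td_targets, alpha=1.0):
    """One-step CQL(H)-style loss on a batch with discrete actions.

    q_values: (batch, n_actions) current Q estimates
    data_actions: (batch,) actions actually taken in the logged data
    td_targets: (batch,) bootstrapped Bellman targets
    """
    idx = np.arange(len(data_actions))
    q_data = q_values[idx, data_actions]
    # conservative penalty: push down a soft maximum over all actions,
    # push up the Q-values of actions seen in the dataset
    logsumexp = np.log(np.sum(np.exp(q_values), axis=1))
    penalty = np.mean(logsumexp - q_data)
    bellman_error = np.mean((q_data - td_targets) ** 2)
    return alpha * penalty + 0.5 * bellman_error
```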

2020 328 cited

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine

Standard benchmark datasets enabling reproducible offline RL research.
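A minimal usage sketch, assuming the `gym` and `d4rl` packages are installed and the v2 MuJoCo datasets are available locally:

```python
# Load a D4RL offline dataset; names follow the D4RL convention,
# e.g. 'halfcheetah-medium-v2'.
import gym
import d4rl  # registers the offline environments with gym

env = gym.make('halfcheetah-medium-v2')
dataset = env.get_dataset()                 # dict of observations, actions, rewards, terminals
batch = d4rl.qlearning_dataset(env)         # (s, a, r, s', done) tuples for Q-learning
print(dataset['observations'].shape, batch['next_observations'].shape)
```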

2021 1245 cited

Decision Transformer: Reinforcement Learning via Sequence Modeling

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch

Frames offline RL as sequence modeling using transformers; avoids bootstrapping entirely.

2022 478 cited

Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, Sergey Levine

Simple algorithm that avoids querying out-of-sample actions by fitting value targets with expectile regression; strong performance on D4RL benchmarks.

RL for Bidding & Pricing

Automate bidding and pricing with RL

2017 153 cited

Real-Time Bidding by Reinforcement Learning in Display Advertising

Han Cai, Kan Ren, Weinan Zhang et al.

MDP framework for RTB with neural network value approximation for budget-constrained bidding.

2018 79 cited

Deep Reinforcement Learning for Sponsored Search Real-time Bidding

Jun Zhao, Guang Qiu et al. (Alibaba)

DRL for sponsored search deployed at Alibaba scale; handles non-stationary environments.

2018 79 cited

Budget Constrained Bidding by Model-free Reinforcement Learning

Di Wu, Xiujun Chen et al. (Alibaba)

Model-free RL for budget-constrained bidding with practical reward function design.

2010 457 cited

Web-scale Bayesian Click-through Rate Prediction for Sponsored Search

Thore Graepel, Joaquin Quiñonero Candela et al. (Microsoft)

Bayesian online learning for CTR prediction deployed at web scale in Bing; demonstrates industrial viability of Bayesian methods.

2019 187 cited

Learning in Repeated Auctions with Budgets: Regret Minimization and Equilibrium

Santiago Balseiro, Yonatan Gur

Regret bounds for bidding with budget constraints; foundational for pacing algorithms.
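A minimal sketch of an adaptive pacing rule in the spirit of Balseiro & Gur for repeated second-price auctions; the step size `epsilon`, the multiplier cap, and the affordability check are illustrative simplifications:

```python
import numpy as np

def adaptive_pacing(values, prices, budget, epsilon=0.01, mu_max=10.0):
    """Shade bids by a dual multiplier mu and adjust mu toward the
    per-auction spend target rho = budget / T."""
    T = len(values)
    rho = budget / T
    mu, spend, utility = 0.0, 0.0, 0.0
    for v, price in zip(values, prices):
        bid = v / (1.0 + mu)                                # shaded bid
        if bid > price and spend + price <= budget:         # win and can afford it
            spend += price
            utility += v - price
            z = price                                        # expenditure this round
        else:
            z = 0.0
        mu = min(max(mu + epsilon * (z - rho), 0.0), mu_max)  # dual (pacing) update
    return utility, spend

# toy usage with random valuations and competing prices
rng = np.random.default_rng(0)
u, s = adaptive_pacing(rng.uniform(0, 1, 1000), rng.uniform(0, 1, 1000), budget=50.0)
```

Overspending pushes the multiplier up (more bid shading); underspending relaxes it, which is the pacing behavior the blurb refers to.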

2019 312 cited

Optimal Auctions through Deep Learning

Paul Dütting, Zhe Feng, Harikrishna Narasimhan, David Parkes, Sai Srivatsa Ravindranath

Neural networks learn near-optimal auction mechanisms; connects ML and mechanism design.

2021 89 cited

Contextual Bandits with Cross-Learning

Santiago Balseiro, Negin Golrezaei, Mohammad Mahdian, Vahab Mirrokni, Jon Schneider

Learning across related auctions; critical for ad platforms with correlated contexts.

2021 156 cited

Personalized Dynamic Pricing with Machine Learning: High-Dimensional Features and Heterogeneous Elasticity

Gah-Yi Ban, N. Bora Keskin

Combines ML feature learning with dynamic pricing; optimal regret in high dimensions.

RL for Recommendations

Optimize long-term user engagement in recommendation systems

2019 423 cited

Top-K Off-Policy Correction for a REINFORCE Recommender System

Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed Chi

Deployed at YouTube; addresses large action spaces in slate recommendation with off-policy correction.
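A minimal sketch of the per-example weight in the top-K off-policy corrected REINFORCE gradient described in this paper; variable names are illustrative:

```python
def topk_correction_weight(pi_theta, beta, k):
    """Weight that scales reward * grad(log pi_theta) for one logged action:
    importance ratio times the top-K multiplier
    lambda_K(a|s) = K * (1 - pi_theta(a|s))**(K-1).

    pi_theta: target policy probability of the logged action
    beta: behavior (logging) policy probability of the same action
    """
    importance = pi_theta / beta
    lambda_k = k * (1.0 - pi_theta) ** (k - 1)
    return importance * lambda_k

# e.g. the weight for one logged action when recommending slates of K = 16 items
w = topk_correction_weight(pi_theta=0.02, beta=0.05, k=16)
```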

2019 198 cited

SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets

Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, Jim McFadden, Tushar Chandra, Craig Boutilier

Decomposes slate Q-values for tractable optimization; deployed at Google.

2018 356 cited

Deep Reinforcement Learning for Page-wise Recommendations

Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, Jiliang Tang

DRL for whole-page recommendations considering item interactions and user browsing patterns.

2019 187 cited

Generative Adversarial User Model for Reinforcement Learning Based Recommendation System

Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, Le Song

Learns user simulator for offline RL training; addresses exploration challenges in recommendations.

2019 245 cited

Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems

Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, Dawei Yin

Optimizes long-term user retention metrics beyond immediate clicks; deployed at JD.com.

Safe & Constrained RL

Ensure safety constraints during learning and deployment

2017 1245 cited

Constrained Policy Optimization

Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel

First practical algorithm for RL with safety constraints; foundational for safe RL.

2017 567 cited

Safe Model-based Reinforcement Learning with Stability Guarantees

Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, Andreas Krause

Provides formal safety guarantees using Lyapunov functions; critical for robotics applications.

2019 389 cited

Benchmarking Safe Exploration in Deep Reinforcement Learning

Alex Ray, Joshua Achiam, Dario Amodei

OpenAI Safety Gym benchmark suite; standard evaluation for safe RL algorithms.

Exploration & Sample Efficiency

Learn effective policies with minimal data

2016 876 cited

Deep Exploration via Bootstrapped DQN

Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy

Approximate posterior sampling for deep RL via an ensemble of bootstrapped Q-heads; enables temporally extended ("deep") exploration.

2016 923 cited

Unifying Count-Based Exploration and Intrinsic Motivation

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, Remi Munos

Pseudo-counts for exploration in high-dimensional spaces; breakthrough for sparse-reward problems.

2017 2156 cited

Curiosity-driven Exploration by Self-supervised Prediction

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, Trevor Darrell

Intrinsic curiosity module using prediction error as exploration bonus; widely influential approach.

2020 478 cited

Model-Based Reinforcement Learning for Atari

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, Henryk Michalewski

SimPLe uses learned video-prediction world models to play Atari from roughly 100K environment interactions, far fewer samples than contemporaneous model-free methods.

Multi-Agent & Game-Theoretic RL

Learning in strategic multi-player environments

2017 689 cited

A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, Thore Graepel

PSRO framework unifying game theory and deep RL; foundational for competitive multi-agent systems.

2018 534 cited

Learning with Opponent-Learning Awareness

Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, Igor Mordatch

LOLA accounts for opponent adaptation during learning; key insight for non-stationary multi-agent settings.

2019 2134 cited

Grandmaster level in StarCraft II using multi-agent reinforcement learning

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik et al.

AlphaStar achieves grandmaster level; landmark result in complex multi-agent real-time strategy.

2019 876 cited

Superhuman AI for multiplayer poker

Noam Brown, Tuomas Sandholm

Pluribus beats top humans in 6-player poker; breakthrough in imperfect-information games.

Must-read papers for tech economists and applied researchers