Reinforcement Learning
Make sequential decisions that improve outcomes over time • 53 papers
Multi-Armed Bandits
Balance exploring new options vs. exploiting known winners
Asymptotically Efficient Adaptive Allocation Rules
Establishes the fundamental logarithmic regret lower bound and characterizes the asymptotically optimal exploration-exploitation tradeoff.
Finite-time Analysis of the Multiarmed Bandit Problem
Introduces UCB1 with finite-time regret bounds; the practical workhorse algorithm still widely deployed (see the sketch after this list).
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
Comprehensive survey unifying stochastic and adversarial bandits; essential theoretical reference.
Optimal Best Arm Identification with Fixed Confidence
Establishes lower bounds and optimal algorithms for best-arm identification; key for A/B testing.
On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models
Characterizes the sample complexity of best-arm identification in fixed-budget and fixed-confidence settings.
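For concreteness, here is a minimal UCB1 sketch in the spirit of Auer et al., assuming Bernoulli rewards; `true_means` is a hypothetical stand-in for the environment, not any paper's reference code:

```python
import math
import random

def ucb1(true_means, horizon):
    k = len(true_means)
    counts = [0] * k      # pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # play each arm once to initialize
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_i)
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

random.seed(0)
print(ucb1([0.2, 0.5, 0.7], horizon=10_000))  # pulls concentrate on arm 2
```

The sqrt(2 ln t / n_i) bonus shrinks as an arm accumulates pulls, which is exactly the logarithmic exploration budget the Lai-Robbins lower bound says is unavoidable.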
Contextual Bandits
Personalize decisions based on user context
A Contextual-Bandit Approach to Personalized News Article Recommendation
Introduces LinUCB; deployed at Yahoo! and evaluated on 33M events; foundational industry paper (see the LinUCB sketch after this list).
Contextual Bandits with Linear Payoff Functions
Rigorous theoretical analysis of linear-payoff contextual bandits; introduces the SupLinUCB algorithm.
Thompson Sampling for Contextual Bandits with Linear Payoffs
First near-optimal regret bounds for Thompson Sampling in contextual settings.
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits
Efficient algorithm handling large policy classes for practical implementation.
Neural Contextual Bandits with UCB-based Exploration
Extends contextual bandits to neural network function approximation; bridges deep learning and bandit theory.
Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles
SquareCB algorithm achieves optimal regret using only regression oracles; practical for complex function classes.
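A minimal sketch of disjoint LinUCB from the Li et al. paper, assuming a small fixed arm set and d-dimensional context vectors; `alpha` (the confidence width) is a tuning knob, and the per-step matrix inverse is kept for readability where production code would use rank-one updates:

```python
import numpy as np

class LinUCB:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(d) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```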
Thompson Sampling & Bayesian Bandits
Drive exploration with posterior uncertainty
A Tutorial on Thompson Sampling
Definitive survey covering theory, applications, and extensions; essential practitioner reference (see the Beta-Bernoulli sketch after this list).
An Empirical Evaluation of Thompson Sampling
Demonstrates strong empirical performance; sparked renewed industry interest.
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
First proof of near-optimal O(√(KT log T)) regret bounds.
Learning to Optimize via Posterior Sampling
Theoretical foundations extending Thompson Sampling to general RL.
Learning to Optimize via Information-Directed Sampling
Generalizes Thompson Sampling using information ratio; provides tighter regret bounds for structured problems.
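The core of Thompson Sampling fits in a few lines; below is a Beta-Bernoulli sketch assuming binary rewards and uniform Beta(1,1) priors, with a hypothetical `true_means` again standing in for the environment:

```python
import random

def thompson_sampling(true_means, horizon):
    k = len(true_means)
    wins, losses = [0] * k, [0] * k
    for _ in range(horizon):
        # draw one plausible mean per arm from its posterior, play the argmax
        samples = [random.betavariate(wins[i] + 1, losses[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if random.random() < true_means[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses
```

Exploration falls out of posterior uncertainty: arms with wide posteriors occasionally sample high and get played, with no explicit bonus term.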
Off-Policy Evaluation
Evaluate new policies using historical data
Doubly Robust Policy Evaluation and Learning
Introduces the doubly robust estimator combining IPS with a direct reward model; robust to model misspecification (see the estimator sketches after this list).
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Principled framework for learning from logged data; widely used in industry.
The Self-Normalized Estimator for Counterfactual Learning
Addresses high variance in IPS with self-normalization for real systems.
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
Minimax optimal bounds providing theoretical foundation for modern OPE.
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
High-confidence bounds for off-policy evaluation; safety-critical applications in healthcare and education.
Towards Optimal Off-Policy Evaluation for Reinforcement Learning
Achieves minimax optimal rates for OPE in tabular MDPs; foundational theoretical result.
Off-policy Evaluation for Slate Recommendation
Extends OPE to slate/list recommendations; critical for search and recommendation systems.
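Sketches of the three estimators this literature builds on: inverse propensity scoring (IPS), its self-normalized variant, and the doubly robust combination. Inputs are hypothetical arrays of logged rewards, logging propensities p_log, target-policy probabilities p_eval, and reward-model predictions:

```python
import numpy as np

def ips(rewards, p_log, p_eval):
    w = p_eval / p_log                      # importance weights
    return np.mean(w * rewards)             # unbiased, but high variance

def snips(rewards, p_log, p_eval):
    w = p_eval / p_log
    return np.sum(w * rewards) / np.sum(w)  # self-normalization trades a
                                            # little bias for much less variance

def doubly_robust(rewards, p_log, p_eval, q_logged, v_eval):
    # q_logged: model's prediction for the logged action
    # v_eval:   model's expected reward under the target policy
    w = p_eval / p_log
    return np.mean(v_eval + w * (rewards - q_logged))
```

The doubly robust estimator stays unbiased if either the propensities or the reward model are correct, which is where its name comes from.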
Batch/Offline RL
Learn optimal decisions from logged data
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Comprehensive tutorial defining the field's challenges and research directions.
Off-Policy Deep Reinforcement Learning without Exploration (BCQ)
Identifies extrapolation error as core challenge; introduces Batch-Constrained Q-learning.
Conservative Q-Learning for Offline Reinforcement Learning (CQL)
Learns conservative Q-functions that lower-bound the true value; SOTA on D4RL benchmarks (see the tabular sketch after this list).
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Standard benchmark datasets enabling reproducible offline RL research.
Decision Transformer: Reinforcement Learning via Sequence Modeling
Frames offline RL as sequence modeling using transformers; avoids bootstrapping entirely.
Offline Reinforcement Learning with Implicit Q-Learning
Simple algorithm avoiding explicit policy constraint; strong performance on D4RL benchmarks.
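CQL's central idea compresses well into the tabular case: alongside the TD update, push Q-values down through a softmax over all actions and push the logged action back up, so out-of-distribution actions are never overvalued. A simplified sketch (real CQL uses neural Q-functions; the dataset format and hyperparameters here are hypothetical):

```python
import numpy as np

def cql_tabular(dataset, n_states, n_actions, alpha=1.0, gamma=0.99,
                lr=0.1, epochs=100):
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next in dataset:        # logged (s, a, r, s') tuples
            # gradient of alpha * (logsumexp_a' Q(s,a') - Q(s,a))
            soft = np.exp(Q[s] - Q[s].max())
            soft /= soft.sum()
            Q[s] -= lr * alpha * soft          # push all actions down...
            Q[s, a] += lr * alpha              # ...and the data action back up
            target = r + gamma * Q[s_next].max()
            Q[s, a] -= lr * (Q[s, a] - target) # ordinary TD step
    return Q
```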
RL for Bidding & Pricing
Automate bidding and pricing with RL
Real-Time Bidding by Reinforcement Learning in Display Advertising
Frames budget-constrained RTB as an MDP with neural-network value approximation.
Deep Reinforcement Learning for Sponsored Search Real-time Bidding
DRL for sponsored search deployed at Alibaba scale; handles non-stationary environments.
Budget Constrained Bidding by Model-free Reinforcement Learning
Model-free RL for budget-constrained bidding with practical reward function design.
Web-scale Bayesian Click-through Rate Prediction for Sponsored Search
Thompson Sampling at web-scale in Bing; demonstrates industrial viability of Bayesian methods.
Learning in Repeated Auctions with Budgets: Regret Minimization and Equilibrium
Regret bounds for bidding with budget constraints; foundational for pacing algorithms (see the pacing sketch after this list).
Optimal Auctions through Deep Learning
Neural networks learn near-optimal auction mechanisms; connects ML and mechanism design.
Contextual Bandits with Cross-Learning
Learning across related auctions; critical for ad platforms with correlated contexts.
Personalized Dynamic Pricing with Machine Learning: High-Dimensional Features and Heterogeneous Elasticity
Combines ML feature learning with dynamic pricing; optimal regret in high dimensions.
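A minimal pacing sketch in the spirit of the repeated-auctions-with-budgets line of work: bid value / (1 + lambda) and move the dual multiplier lambda toward the per-round spend target. The value and price distributions below are placeholders:

```python
import random

def paced_bidding(horizon, budget, lr=0.01):
    lam, spent = 0.0, 0.0
    target = budget / horizon                   # allowed spend per round
    for _ in range(horizon):
        value = random.random()                 # impression value estimate
        price = 0.8 * random.random()           # highest competing bid
        bid = value / (1.0 + lam)               # shade bids when pacing binds
        cost = price if bid > price and spent + price <= budget else 0.0
        spent += cost
        lam = max(0.0, lam + lr * (cost - target))  # dual ascent on the budget
    return spent, lam

random.seed(0)
print(paced_bidding(horizon=10_000, budget=1_000))
```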
RL for Recommendations
Optimize long-term user engagement in recommendation systems
Top-K Off-Policy Correction for a REINFORCE Recommender System
Deployed at YouTube; addresses large action spaces in slate recommendation with off-policy correction (see the weight sketch after this list).
SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
Decomposes slate Q-values for tractable optimization; deployed at Google.
Deep Reinforcement Learning for Page-wise Recommendations
DRL for whole-page recommendations considering item interactions and user browsing patterns.
Generative Adversarial User Model for Reinforcement Learning Based Recommendation System
Learns user simulator for offline RL training; addresses exploration challenges in recommendations.
Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems
Optimizes long-term user retention metrics beyond immediate clicks; deployed at JD.com.
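The key formula in the Top-K off-policy correction paper is a reweighted REINFORCE gradient; a sketch of the per-example weight, with hypothetical probabilities from the target policy pi and the logging policy beta:

```python
import numpy as np

def top_k_weights(pi, beta, k):
    iw = pi / beta                     # standard off-policy importance weight
    lam = k * (1.0 - pi) ** (k - 1)    # derivative of 1-(1-pi)^k, the chance
                                       # the item enters a size-k slate
    return iw * lam

pi = np.array([0.10, 0.02, 0.30])      # target-policy probabilities
beta = np.array([0.20, 0.05, 0.25])    # logging-policy probabilities
rewards = np.array([1.0, 0.0, 1.0])
# each logged example contributes weight * reward * grad log pi to the gradient
print(top_k_weights(pi, beta, k=5) * rewards)
```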
Safe & Constrained RL
Ensure safety constraints during learning and deployment
Constrained Policy Optimization
First practical algorithm for RL with safety constraints; foundational for safe RL (see the Lagrangian sketch after this list).
Safe Model-based Reinforcement Learning with Stability Guarantees
Provides formal safety guarantees using Lyapunov functions; critical for robotics applications.
Benchmarking Safe Exploration in Deep Reinforcement Learning
OpenAI Safety Gym benchmark suite; standard evaluation for safe RL algorithms.
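CPO itself solves a trust-region subproblem, but the constrained objective it targets (maximize return subject to a cost constraint J_C(pi) <= d) is often handled with the simpler Lagrangian relaxation sketched below; the rollout costs are hypothetical:

```python
def lagrangian_update(lam, avg_cost, cost_limit, lr=0.05):
    # dual ascent: grow lam while the cost constraint is violated
    return max(0.0, lam + lr * (avg_cost - cost_limit))

# each iteration the policy would be trained on the penalized reward
# r - lam * c, so a rising lam steadily prices the constraint in
lam = 0.0
for avg_cost in [0.9, 0.8, 0.6, 0.4]:   # hypothetical per-iteration costs
    lam = lagrangian_update(lam, avg_cost, cost_limit=0.5)
    print(f"lam = {lam:.3f}")
```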
Exploration & Sample Efficiency
Learn effective policies with minimal data
Deep Exploration via Bootstrapped DQN
Approximates posterior sampling for deep RL with an ensemble of bootstrapped Q-heads; deep exploration without an explicit posterior.
Unifying Count-Based Exploration and Intrinsic Motivation
Pseudo-counts for exploration in high-dimensional spaces; breakthrough for sparse-reward problems (see the count-bonus sketch after this list).
Curiosity-driven Exploration by Self-supervised Prediction
Intrinsic curiosity module using prediction error as exploration bonus; widely influential approach.
Model Based Reinforcement Learning for Atari
SimPLe uses learned world models to reach strong Atari performance from roughly 100K environment interactions, far less data than model-free baselines.
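A tabular analogue of the bonus these papers generalize: add beta / sqrt(N(s)) to the environment reward, where N(s) is the visit count; pseudo-counts from a density model replace N(s) when states never repeat. State keys here are hypothetical hashables:

```python
import math
from collections import defaultdict

class CountBonus:
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def shaped_reward(self, state, reward):
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])  # decays with visits
        return reward + bonus
```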
Multi-Agent & Game-Theoretic RL
Learning in strategic multi-player environments
A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning
PSRO framework unifying game theory and deep RL; foundational for competitive multi-agent systems (see the toy loop after this list).
Learning with Opponent-Learning Awareness
LOLA accounts for opponent adaptation during learning; key insight for non-stationary multi-agent settings.
Grandmaster level in StarCraft II using multi-agent reinforcement learning
AlphaStar achieves grandmaster level; landmark result in complex multi-agent real-time strategy.
Superhuman AI for multiplayer poker
Pluribus beats top humans in 6-player poker; breakthrough in imperfect-information games.
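A toy PSRO-style loop on matching pennies, with exact best responses and a uniform meta-solver standing in for the Nash meta-solver PSRO normally uses; each iteration grows both strategy populations by one best response:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row player's matching-pennies payoffs

def best_response(payoffs, opp_mix):
    return int(np.argmax(payoffs @ opp_mix))  # pure best reply to a mixture

row_pop, col_pop = [0], [0]                # populations of pure strategies
for _ in range(200):
    # simplified meta-strategy: uniform over each population
    row_mix = np.bincount(row_pop, minlength=2) / len(row_pop)
    col_mix = np.bincount(col_pop, minlength=2) / len(col_pop)
    row_pop.append(best_response(A, col_mix))
    col_pop.append(best_response(-A.T, row_mix))

# empirical mixture drifts toward the Nash equilibrium [0.5, 0.5]
print(np.bincount(row_pop, minlength=2) / len(row_pop))
```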