Inverse RL & Preference Learning

Learn what people want from their observed behavior • 44 papers

8 subtopics

Reward Inference from Behavior

Inferring objectives from actions

2000 41 cited

Algorithms for Inverse Reinforcement Learning

Andrew Y. Ng, Stuart Russell

Foundational IRL paper formalizing reward extraction from observed optimal behavior.

2004 2810 cited

Apprenticeship Learning via Inverse Reinforcement Learning

Pieter Abbeel, Andrew Y. Ng

Extends IRL to practical apprenticeship learning with feature expectation matching.

2008 2050 cited

Maximum Entropy Inverse Reinforcement Learning

Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, Anind K. Dey

Resolves IRL ambiguity using maximum entropy principle; now the standard IRL formulation.
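
A way to see the mechanics: the gradient of the MaxEnt demonstration likelihood is the gap between the expert's empirical feature expectations and the expected feature counts under the current soft-optimal policy. Below is a minimal numpy sketch for a small tabular MDP with a reward linear in state features; the inputs `P` (transition tensor), `phi` (state features), `f_expert` (expert feature expectations), and `p0` (start distribution) are illustrative placeholders, not artifacts from the paper.

```python
# Sketch of tabular MaxEnt IRL: fit reward weights so that the soft-optimal
# policy's expected feature counts match the expert's.
import numpy as np

def soft_value_iteration(P, r, gamma=0.95, iters=200):
    """Soft (log-sum-exp) Bellman backups; returns a stochastic policy pi[s, a]."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r[:, None] + gamma * P @ V                   # Q[s, a]
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
    return np.exp(Q - V[:, None])                        # pi(a|s) proportional to exp(Q)

def expected_feature_counts(P, pi, phi, p0, T=50):
    """Average per-step feature expectation over a T-step rollout of pi from p0."""
    d, counts = p0.copy(), np.zeros(phi.shape[1])
    for _ in range(T):
        counts += d @ phi
        d = np.einsum("s,sa,sat->t", d, pi, P)           # next-step state distribution
    return counts / T

def maxent_irl(P, phi, f_expert, p0, lr=0.1, epochs=100):
    theta = np.zeros(phi.shape[1])
    for _ in range(epochs):
        pi = soft_value_iteration(P, phi @ theta)        # reward r(s) = phi(s) . theta
        theta += lr * (f_expert - expected_feature_counts(P, pi, phi, p0))
    return theta
```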

2016 321 cited

Cooperative Inverse Reinforcement Learning

Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

Frames value alignment as cooperative game; foundational for AI safety.

2007 1156 cited

Bayesian Inverse Reinforcement Learning

Deepak Ramachandran, Eyal Amir

First Bayesian framework for IRL; provides posterior distributions over rewards capturing uncertainty.

2016 1089 cited

Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization

Chelsea Finn, Sergey Levine, Pieter Abbeel

First deep IRL method learning arbitrary neural network cost functions; enabled learning from raw images.

2017 467 cited

Inverse Reward Design

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan

Treats designed rewards as noisy observations of true objectives; addresses reward hacking and negative side effects.

Imitation Learning

Learn policies from expert demonstrations

1989 1426 cited

ALVINN: An Autonomous Land Vehicle in a Neural Network

Dean A. Pomerleau

Pioneering behavioral cloning; first end-to-end neural network steering for autonomous vehicles.
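
Behavioral cloning of this kind is just supervised learning on (observation, action) pairs. A minimal sketch with assumed toy data (`demo_obs` and `demo_actions` are synthetic stand-ins for logged camera frames and steering commands):

```python
# Behavioral cloning: fit a regressor from observations to the expert's actions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
demo_obs = rng.normal(size=(5000, 30 * 32))      # flattened toy "camera frames"
demo_actions = rng.uniform(-1, 1, size=5000)     # expert steering in [-1, 1]

policy = MLPRegressor(hidden_layer_sizes=(32,), max_iter=200)
policy.fit(demo_obs, demo_actions)               # imitate the expert mapping
steering = policy.predict(demo_obs[:1])          # query the cloned policy at test time
```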

2011 840 cited

A Reduction of Imitation Learning to No-Regret Online Learning (DAgger)

Stéphane Ross, Geoffrey J. Gordon, J. Andrew Bagnell

Addresses distribution shift in behavioral cloning by having the expert label states the learner itself visits; reduces imitation to no-regret online learning with error linear in the horizon, versus quadratic for naive cloning.
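
The fix is procedural: roll out the learner's own policy, ask the expert to label the states it actually visits, and retrain on the aggregated dataset. A schematic sketch, assuming a minimal `env` interface (`reset()`, `step(action) -> (obs, done)`), an `expert_action(obs)` oracle, and any supervised learner with `fit`/`predict`:

```python
# Schematic DAgger loop over an assumed environment/expert interface.
def dagger(env, expert_action, learner, n_iters=10, horizon=200):
    obs_data, act_data, policy = [], [], None
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(horizon):
            # Act with the current learner (expert on the first iteration),
            # but always store the expert's label for the visited state.
            act = expert_action(obs) if policy is None else policy.predict([obs])[0]
            obs_data.append(obs)
            act_data.append(expert_action(obs))
            obs, done = env.step(act)
            if done:
                break
        learner.fit(obs_data, act_data)     # retrain on the aggregated dataset
        policy = learner
    return policy
```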

2016 11 cited

Generative Adversarial Imitation Learning (GAIL)

Jonathan Ho, Stefano Ermon

GAN-style adversarial training that learns a policy directly from demonstrations, without recovering an explicit reward function.
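
One adversarial iteration, schematically: a discriminator is fit to separate expert state-action pairs from the current policy's, and its output defines a surrogate reward for an ordinary on-policy RL update. The sketch below uses logistic regression as the discriminator and leaves `policy_update` as an assumed stand-in for TRPO/PPO; it is illustrative, not the paper's implementation.

```python
# One GAIL-style iteration: train D(s, a), then reward the policy where D says "expert".
import numpy as np
from sklearn.linear_model import LogisticRegression

def gail_iteration(expert_sa, policy_sa, policy, policy_update):
    X = np.vstack([expert_sa, policy_sa])
    y = np.concatenate([np.ones(len(expert_sa)), np.zeros(len(policy_sa))])
    disc = LogisticRegression().fit(X, y)                 # D(s, a) ~ P(expert | s, a)
    d = disc.predict_proba(policy_sa)[:, 1].clip(1e-6, 1 - 1e-6)
    surrogate_reward = -np.log(1.0 - d)                   # high where D is fooled
    return policy_update(policy, policy_sa, surrogate_reward)
```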

1997 1823 cited

Learning from Demonstration

Stefan Schaal

Foundational work showing demonstrations accelerate RL; established paradigm for robot skill acquisition.

2018 456 cited

Behavioral Cloning from Observation

Faraz Torabi, Garrett Warnell, Peter Stone

Learning from state-only observations without action labels; enables learning from video demonstrations.

2016 5678 cited

End to End Learning for Self-Driving Cars

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski et al. (NVIDIA)

Industry-defining paper: a CNN maps raw pixels directly to steering commands; reported roughly 98% autonomy in on-road tests.

Revealed Preference at Scale

Rationalizing observed choices

1967 1146 cited

Construction of a Utility Function from Expenditure Data

Sydney N. Afriat

Foundational theorem: data is rationalizable iff it satisfies GARP; basis for computational revealed preference.
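
The rationalizability test is mechanical once the data are in hand: build the revealed-preference relation from prices and chosen bundles, take its transitive closure, and look for a cycle that involves a strict preference. A minimal sketch (input arrays are illustrative):

```python
# GARP check: observed demand data is rationalizable by a well-behaved utility
# function iff no revealed-preference cycle contains a strict preference.
import numpy as np

def satisfies_garp(prices, bundles):
    prices, bundles = np.asarray(prices, float), np.asarray(bundles, float)
    own = np.einsum("ti,ti->t", prices, bundles)     # p_t . x_t
    cross = prices @ bundles.T                       # cross[t, s] = p_t . x_s
    R = own[:, None] >= cross                        # x_t weakly directly revealed preferred to x_s
    Pstrict = own[:, None] > cross                   # ... strictly
    T = R.copy()                                     # transitive closure (boolean Floyd-Warshall)
    for k in range(len(own)):
        T |= T[:, [k]] & T[[k], :]
    return not np.any(T & Pstrict.T)                 # no t, s with x_t R* x_s and x_s P x_t
```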

1982 1154 cited

The Nonparametric Approach to Demand Analysis

Hal R. Varian

Makes Afriat's approach computationally tractable; shows how to test rationality and recover preferences nonparametrically from observed demand.

2016 646 cited

Revealed Preference Theory

Christopher P. Chambers, Federico Echenique

Comprehensive modern treatment covering GARP extensions, complexity, and mechanism design applications.

2008

Nonparametric Engel Curves and Revealed Preference

Richard Blundell, Martin Browning, Ian Crawford

Combines revealed preference with nonparametric estimation; sharp bounds on counterfactual demands.

1974 18756 cited

Conditional Logit Analysis of Qualitative Choice Behavior

Daniel McFadden

Nobel Prize-winning random utility framework; foundation of discrete choice models used throughout tech.
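
The model itself is compact: utility is linear in alternative attributes plus an i.i.d. extreme-value shock, which yields softmax choice probabilities. A sketch of the probabilities and the sample log-likelihood (the attribute matrices are hypothetical inputs):

```python
# Conditional logit: P(choose i) = exp(x_i . beta) / sum_j exp(x_j . beta).
import numpy as np

def choice_probabilities(X, beta):
    v = X @ beta                      # deterministic utility of each alternative
    v -= v.max()                      # numerical stability
    e = np.exp(v)
    return e / e.sum()

def log_likelihood(choice_sets, choices, beta):
    """choice_sets: list of attribute matrices; choices: index chosen in each set."""
    return sum(np.log(choice_probabilities(X, beta)[c])
               for X, c in zip(choice_sets, choices))
```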

2015 289 cited

Stochastic Choice and Revealed Perturbed Utility

Drew Fudenberg, Ryota Iijima, Tomasz Strzalecki

Axiomatic foundations for perturbed utility models; generalizes logit choice to capture bounded rationality.

2019 156 cited

Dynamic Random Utility

Mira Frick, Ryota Iijima, Tomasz Strzalecki

Extends random utility to sequential choice with preference correlation; applicable to session-based user modeling.

Human Feedback & RLHF

Train models to align with human preferences

2017 508 cited

Deep Reinforcement Learning from Human Preferences

Paul Christiano, Jan Leike, Tom Brown et al.

Foundational RLHF paper: learns a reward model from pairwise preference comparisons, using human feedback on less than 1% of the agent's interactions.
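
The reward model is fit to pairwise comparisons with a Bradley-Terry likelihood: the probability that segment A is preferred to segment B is sigmoid(r(A) - r(B)). A minimal numpy sketch with a linear reward on precomputed segment features (feature names and shapes are assumptions, not the paper's architecture):

```python
# Bradley-Terry preference loss for a linear reward model r(x) = w . phi(x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_loss_and_grad(w, feats_winner, feats_loser):
    """Negative log-likelihood of observed preferences and its gradient in w."""
    diff = feats_winner - feats_loser
    p = sigmoid(diff @ w)                          # P(winner preferred to loser)
    loss = -np.log(p + 1e-12).mean()
    grad = -((1 - p)[:, None] * diff).mean(axis=0)
    return loss, grad
```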

2022

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

Long Ouyang, Jeff Wu et al. (OpenAI)

1.3B InstructGPT outperforms 175B GPT-3 on human preferences; foundation for ChatGPT.

2022

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai et al. (Anthropic)

RLAIF using AI self-critique against constitutional principles; Claude's training methodology.

2023

Direct Preference Optimization (DPO)

Rafael Rafailov, Archit Sharma, Eric Mitchell et al.

Eliminates the separate reward model and RL loop; optimizes directly on preference pairs via a simple classification loss.
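
The entire objective is a logistic loss on log-probability-ratio margins between the chosen and rejected responses, measured against a frozen reference policy. A numpy sketch over per-example summed log-probabilities (assumed to be computed by the language model beforehand):

```python
# DPO loss: -log sigmoid( beta * [ (log pi(y_w) - log pi_ref(y_w))
#                                 - (log pi(y_l) - log pi_ref(y_l)) ] )
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)) + 1e-12).mean()
```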

2017 15678 cited

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

Stable policy-gradient algorithm with a clipped surrogate objective; the standard optimizer for RLHF fine-tuning of large language models.
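
The clipped surrogate is short enough to state directly: take the probability ratio between the new and old policies, clip it to [1 - eps, 1 + eps], and maximize the pessimistic minimum of the clipped and unclipped advantage-weighted terms. A sketch of the objective value, given per-sample log-probabilities and advantage estimates:

```python
# PPO clipped surrogate objective (to be maximized).
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```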

2020 1456 cited

Learning to Summarize from Human Feedback

Nisan Stiennon, Long Ouyang, Jeff Wu et al. (OpenAI)

Demonstrated reward model + PPO pipeline for text; direct precursor to InstructGPT methodology.

2024 234 cited

A General Theoretical Paradigm to Understand Learning from Human Preferences

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot et al. (DeepMind)

Unifies RLHF/DPO theoretically; Identity Preference Optimization fixes DPO overfitting issues.

2024 145 cited

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela

Aligns LLMs using binary good/bad signal via Kahneman-Tversky prospect theory; no preference pairs needed.

Preference Elicitation & Active Learning

Efficiently collect preference data from users

2009 567 cited

Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem

Yisong Yue, Thorsten Joachims

Introduced dueling bandits for pairwise preference learning; enables online learning without absolute labels.

2012 389 cited

The K-Armed Dueling Bandits Problem

Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims

Extended dueling bandits to K arms with regret bounds; Interleaved Filter algorithm for search evaluation.

2021 178 cited

Preference-based Online Learning with Dueling Bandits: A Survey

Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, Eyke Hüllermeier

Comprehensive 108-page survey of dueling bandits variants, algorithms, and applications.

2018 234 cited

Stagewise Safe Bayesian Optimization with Gaussian Processes

Yanan Sui, Vincent Zhuang, Joel Burdick, Yisong Yue

Safe preference-based optimization that separates safe-set expansion from utility maximization; applicable to clinical and robotics settings.

2015 678 cited

Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

Adith Swaminathan, Thorsten Joachims

Propensity-weighted learning from logged actions; foundation for offline policy learning in recommendations.
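
The core estimator reweights each logged reward by how much more (or less) likely the new policy is to take the logged action than the logging policy was; clipping the weights trades bias for variance. A minimal sketch (array names are illustrative):

```python
# Clipped inverse-propensity-score estimate of a new policy's value
# from logged bandit feedback.
import numpy as np

def ips_value(rewards, logging_propensities, new_policy_probs, clip=10.0):
    weights = new_policy_probs / logging_propensities
    return np.mean(rewards * np.minimum(weights, clip))
```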

Choice Modeling from Behavioral Data

Learn preferences from click logs and digital traces

2008 1234 cited

An Experimental Comparison of Click Position-Bias Models

Nick Craswell, Onno Zoeter, Michael Taylor, Bill Ramsey

Seminal click modeling paper; introduced cascade and position-based models; foundation for bias correction.
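
The position-based model factorizes a click into examination and relevance: P(click on document d at rank k) = theta_k * alpha_d, where theta_k is the probability rank k is examined and alpha_d is the document's attractiveness. A toy generative sketch with made-up parameters:

```python
# Position-based click model: click = examined(rank) AND attractive(document).
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.9, 0.6, 0.4, 0.25, 0.15])     # examination prob. by rank
alpha = {"doc_a": 0.8, "doc_b": 0.3}              # per-document attractiveness

def simulate_clicks(ranking):
    """ranking: list of document ids shown at ranks 1..K."""
    return [bool(rng.random() < theta[k] * alpha[d]) for k, d in enumerate(ranking)]

clicks = simulate_clicks(["doc_a", "doc_b", "doc_a", "doc_b", "doc_a"])
```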

2009 987 cited

A Dynamic Bayesian Network Click Model for Web Search Ranking

Olivier Chapelle, Ya Zhang

DBN click model capturing examination chains and satisfaction; enables unbiased relevance estimation.

2017 756 cited

Unbiased Learning-to-Rank with Biased Feedback

Thorsten Joachims, Adith Swaminathan, Tobias Schnabel

Counterfactual framework for unbiased LTR; Propensity-Weighted Ranking SVM; highly influential for debiasing.

2015 456 cited

Click Models for Web Search

Aleksandr Chuklin, Ilya Markov, Maarten de Rijke

Comprehensive survey of click models, estimation methods, and applications to search evaluation.

Value Alignment & AI Safety

Ensure AI systems pursue intended objectives

2016 2345 cited

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané

Taxonomy of five safety problems: side effects, reward hacking, scalable oversight, safe exploration, distributional shift.

2022 234 cited

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna et al. (DeepMind)

Demonstrates agents can pursue wrong goals even with correct specifications; distinct from reward hacking.

2023 189 cited

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, Jacob Hilton

First systematic study of Goodhart's Law in RLHF; provides predictable scaling for safe optimization bounds.

2023 145 cited

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner et al. (OpenAI)

Shows a GPT-2-level supervisor can elicit much of GPT-4's capability; core empirical work on scalable oversight for aligning superhuman models.

Personalization & User Modeling

Learn and represent individual user preferences

2009 14567 cited

Matrix Factorization Techniques for Recommender Systems

Yehuda Koren, Robert Bell, Chris Volinsky

Netflix Prize winners' tutorial covering latent factor models, implicit feedback, and temporal dynamics.
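
The workhorse model approximates each rating by an inner product of user and item factors, fit by SGD on the observed entries with L2 regularization. A minimal sketch (biases and temporal terms, which the paper also covers, are omitted):

```python
# Matrix factorization for explicit ratings: r_ui ~ p_u . q_i, trained by SGD.
import numpy as np

def factorize(ratings, n_users, n_items, k=20, lr=0.01, reg=0.05, epochs=20):
    """ratings: list of (user, item, rating) triples."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q
```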

2008 5678 cited

Collaborative Filtering for Implicit Feedback Datasets

Yifan Hu, Yehuda Koren, Chris Volinsky

Weighted matrix factorization for clicks/views; confidence-weighted preference learning; industry standard.

2009 3456 cited

BPR: Bayesian Personalized Ranking from Implicit Feedback

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme

Pairwise ranking optimization derived from Bayesian principles; among the first methods to optimize ranking directly for implicit feedback.
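
The update is pairwise: for a user u, an observed item i, and a sampled unobserved item j, push the score difference through a log-sigmoid and ascend its gradient. One SGD step on a matrix-factorization scorer (the factor matrices P, Q are assumed to come from a setup like the one sketched above):

```python
# One BPR-Opt SGD step: maximize ln sigmoid(x_uij) - regularization,
# where x_uij = score(u, i) - score(u, j).
import numpy as np

def bpr_step(P, Q, u, i, j, lr=0.05, reg=0.01):
    x_uij = P[u] @ Q[i] - P[u] @ Q[j]
    g = 1.0 / (1.0 + np.exp(x_uij))        # sigmoid(-x_uij), the gradient factor
    P[u] += lr * (g * (Q[i] - Q[j]) - reg * P[u])
    Q[i] += lr * (g * P[u] - reg * Q[i])
    Q[j] += lr * (-g * P[u] - reg * Q[j])
```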

Must-read papers for tech economists and applied researchers