Inverse RL & Preference Learning
Learn what people want from their observed behavior • 44 papers
Reward Inference from Behavior
Inferring objectives from actions
Algorithms for Inverse Reinforcement Learning
Foundational IRL paper formalizing reward extraction from observed optimal behavior.
Apprenticeship Learning via Inverse Reinforcement Learning
Extends IRL to practical apprenticeship learning with feature expectation matching.
Maximum Entropy Inverse Reinforcement Learning
Resolves IRL's reward ambiguity with the maximum entropy principle; now the standard IRL formulation (a minimal sketch follows this list).
Cooperative Inverse Reinforcement Learning
Frames value alignment as a cooperative game; foundational for AI safety.
Bayesian Inverse Reinforcement Learning
First Bayesian framework for IRL; provides posterior distributions over rewards capturing uncertainty.
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
First deep IRL method learning arbitrary neural network cost functions; enabled learning from raw images.
Inverse Reward Design
Treats designed rewards as noisy observations of true objectives; addresses reward hacking and negative side effects.
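To make the MaxEnt IRL entry above concrete, here is a minimal tabular sketch of the idea: trajectories are modeled as exponentially more likely the more reward they accumulate, and the reward weights are fit by matching expected feature counts to the demonstrations. The array shapes and function name are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def maxent_irl(P, features, demos, horizon, lr=0.1, iters=100):
    """Tabular MaxEnt IRL sketch.

    P        : (A, S, S) array, P[a, s, s'] = transition probability
    features : (S, D) state-feature matrix
    demos    : list of demonstrated state trajectories (lists of state indices)
    Returns  : (D,) reward weights theta, with R(s) = features[s] @ theta
    """
    A, S, _ = P.shape
    theta = np.zeros(features.shape[1])

    # Empirical feature expectations and start-state distribution from the demos.
    f_emp = np.mean([features[traj].sum(axis=0) for traj in demos], axis=0)
    start = np.bincount([traj[0] for traj in demos], minlength=S) / len(demos)

    for _ in range(iters):
        r = features @ theta

        # Backward pass: soft value iteration gives the stochastic policy
        # pi(a|s) proportional to exp(Q(s,a)) under the current reward.
        V = np.zeros(S)
        for _ in range(horizon):
            Q = r[:, None] + (P @ V).T          # (S, A)
            V = logsumexp(Q, axis=1)
        pi = np.exp(Q - V[:, None])

        # Forward pass: expected state visitation counts under that policy.
        d, mu = start.copy(), np.zeros(S)
        for _ in range(horizon):
            mu += d
            d = np.einsum('s,sa,ast->t', d, pi, P.transpose(1, 0, 2))

        # Gradient of the demo log-likelihood: demo features minus expected features.
        theta += lr * (f_emp - features.T @ mu)
    return theta
```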
Imitation Learning
Learn policies from expert demonstrations
ALVINN: An Autonomous Land Vehicle in a Neural Network
Pioneering behavioral cloning; first end-to-end neural network steering for autonomous vehicles.
A Reduction of Imitation Learning to No-Regret Online Learning (DAgger)
Solves distribution shift in behavioral cloning by reducing imitation to no-regret online learning, cutting compounding error from O(T²) to O(T); see the sketch after this list.
Generative Adversarial Imitation Learning (GAIL)
GAN-style adversarial training that learns a policy directly, without recovering a reward function.
Learning from Demonstration
Foundational work showing demonstrations accelerate RL; established paradigm for robot skill acquisition.
Behavioral Cloning from Observation
Learning from state-only observations without action labels; enables learning from video demonstrations.
End to End Learning for Self-Driving Cars
Industry-defining paper: CNNs mapping raw pixels to steering commands; reported 98% autonomy in on-road tests.
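As a companion to the DAgger entry, the sketch below shows the aggregate-and-retrain loop that fixes compounding error: roll out the current learner, label the states it actually visits with the expert, and retrain on everything collected so far. `env`, `expert_policy`, and `learner` are assumed interfaces for illustration, and the expert/learner mixing schedule of the original algorithm is omitted.

```python
import numpy as np

def dagger(env, expert_policy, learner, n_iters=10, rollout_len=200):
    """DAgger sketch with assumed env/expert/learner interfaces."""
    states, actions = [], []
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(rollout_len):
            states.append(s)
            actions.append(expert_policy(s))   # expert labels the learner's own states
            a = learner.predict(s)             # ...but the rollout follows the learner
            s, done = env.step(a)
            if done:
                s = env.reset()
        # Supervised learning on the aggregated dataset of all iterations.
        learner.fit(np.array(states), np.array(actions))
    return learner
```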
Revealed Preference at Scale
Rationalizing observed choices
Construction of a Utility Function from Expenditure Data
Foundational theorem: data is rationalizable iff it satisfies GARP; basis for computational revealed preference.
The Nonparametric Approach to Demand Analysis
Makes Afriat's theorem computationally tractable; shows how to test GARP and recover preferences nonparametrically (a sketch of the test follows this list).
Revealed Preference Theory
Comprehensive modern treatment covering GARP extensions, complexity, and mechanism design applications.
Nonparametric Engel Curves and Revealed Preference
Combines revealed preference with nonparametric estimation; sharp bounds on counterfactual demands.
Conditional Logit Analysis of Qualitative Choice Behavior
Nobel Prize-winning random utility framework; the foundation of discrete choice models widely used in industry and applied economics.
Stochastic Choice and Revealed Perturbed Utility
Axiomatic foundations for perturbed utility models; generalizes logit choice to capture bounded rationality.
Dynamic Random Utility
Extends random utility to sequential choice with preference correlation; applicable to session-based user modeling.
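The Afriat and Varian entries above reduce to a concrete procedure: build the revealed-preference relation from expenditure comparisons, take its transitive closure, and look for a cycle containing a strict step. Here is a minimal sketch of that GARP check; the array layout is an assumption for illustration.

```python
import numpy as np

def satisfies_garp(prices, quantities):
    """Test the Generalized Axiom of Revealed Preference on T observations.

    prices, quantities : (T, G) arrays of prices and chosen bundles.
    Returns True iff the data are consistent with maximization of some
    non-satiated utility function (Afriat's theorem, Varian's test).
    """
    T = len(prices)
    expend = prices @ quantities.T          # expend[i, j] = cost of bundle j at prices i

    # Direct revealed preference: i R0 j  iff  p_i.x_i >= p_i.x_j.
    R = expend.diagonal()[:, None] >= expend

    # Warshall transitive closure gives the full revealed-preference relation.
    for k in range(T):
        R = R | (R[:, [k]] & R[[k], :])

    # GARP: whenever i is revealed preferred to j, j must not be *strictly*
    # directly revealed preferred to i (p_j.x_j > p_j.x_i would be a violation).
    strict = expend.diagonal()[:, None] > expend
    return not bool(np.any(R & strict.T))
```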
Human Feedback & RLHF
Train models to align with human preferences
Deep Reinforcement Learning from Human Preferences
Foundational RLHF paper learning a reward model from pairwise preference comparisons, using human feedback on roughly 1% of the agent's interactions.
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
1.3B InstructGPT outperforms 175B GPT-3 on human preferences; foundation for ChatGPT.
Constitutional AI: Harmlessness from AI Feedback
RLAIF using AI self-critique against constitutional principles; Claude's training methodology.
Direct Preference Optimization (DPO)
Eliminates the explicit reward model and RL loop; optimizes on preferences directly via a simple classification loss (see the sketch after this list).
Proximal Policy Optimization Algorithms
Stable policy gradient algorithm with a clipped surrogate objective; the standard optimizer for the RL stage of RLHF in major LLMs.
Learning to Summarize from Human Feedback
Demonstrated reward model + PPO pipeline for text; direct precursor to InstructGPT methodology.
A General Theoretical Paradigm to Understand Learning from Human Preferences
Unifies RLHF/DPO theoretically; Identity Preference Optimization fixes DPO overfitting issues.
KTO: Model Alignment as Prospect Theoretic Optimization
Aligns LLMs using binary good/bad signal via Kahneman-Tversky prospect theory; no preference pairs needed.
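To make the DPO entry concrete, the sketch below writes its loss over per-example log-probabilities: each response's implicit reward is beta times its log-probability ratio against the frozen reference policy, and the loss is the Bradley-Terry negative log-likelihood that the chosen response beats the rejected one. In practice the log-probs come from the language model and gradients flow through them via autodiff; this numpy version only illustrates the objective.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss over a batch of preference pairs (objective only, no autodiff).

    logp_w, logp_l         : policy log-probs of chosen / rejected responses
    ref_logp_w, ref_logp_l : the same under the frozen reference policy
    beta                   : strength of the implicit KL constraint
    """
    # Implicit reward margin: beta * difference of log-probability ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))
```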
Preference Elicitation & Active Learning
Efficiently collect preference data from users
Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem
Introduced dueling bandits for pairwise preference learning; enables online learning without absolute labels.
The K-Armed Dueling Bandits Problem
Extended dueling bandits to K arms with regret bounds; Interleaved Filter algorithm for search evaluation.
Preference-based Online Learning with Dueling Bandits: A Survey
Comprehensive 108-page survey of dueling bandits variants, algorithms, and applications.
Stagewise Safe Bayesian Optimization with Gaussian Processes
Safe Bayesian optimization that separates safe-region expansion from exploitation; applicable to clinical and robotics settings.
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Propensity-weighted learning from logged actions; foundation for offline policy learning in recommendation systems (see the sketch after this list).
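The counterfactual-risk-minimization entry rests on a simple estimator: re-weight each logged reward by the ratio of the new policy's action probability to the logging policy's propensity. A minimal clipped-IPS sketch is below (the record format and `new_policy` interface are assumptions); CRM itself goes further by adding a variance penalty to this objective.

```python
import numpy as np

def ips_value(logged, new_policy, clip=10.0):
    """Clipped inverse-propensity estimate of a new policy's value (sketch).

    logged     : iterable of (context, action, reward, logging_prob) records
    new_policy : function mapping a context to an array of action probabilities
    clip       : cap on importance weights to keep the variance in check
    """
    terms = []
    for x, a, r, p_log in logged:
        w = new_policy(x)[a] / p_log        # importance weight of the logged action
        terms.append(min(w, clip) * r)      # clipped IPS term
    return float(np.mean(terms))
```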
Choice Modeling from Behavioral Data
Learn preferences from click logs and digital traces
An Experimental Comparison of Click Position-Bias Models
Seminal click modeling paper; introduced cascade and position-based models; foundation for bias correction.
A Dynamic Bayesian Network Click Model for Web Search Ranking
DBN click model capturing examination chains and satisfaction; enables unbiased relevance estimation.
Unbiased Learning-to-Rank with Biased Feedback
Counterfactual framework for unbiased LTR; introduces the Propensity-Weighted Ranking SVM; highly influential for debiasing (see the sketch after this list).
Click Models for Web Search
Comprehensive survey of click models, estimation methods, and applications to search evaluation.
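Tying the click-model and unbiased-LTR entries together: under the examination hypothesis, P(click) = P(examined at rank) × P(relevant), so clicks re-weighted by inverse examination propensities give unbiased relevance estimates. The sketch below illustrates that correction with an assumed log format; estimating the propensities themselves is what the click models above provide.

```python
def debiased_relevance(click_log, propensities):
    """Inverse-propensity-weighted relevance estimates under the position-based model.

    click_log    : list of (doc_id, rank, clicked) impression records (assumed format)
    propensities : propensities[k] = estimated probability of examination at rank k
    """
    weighted_clicks, impressions = {}, {}
    for doc, rank, clicked in click_log:
        impressions[doc] = impressions.get(doc, 0) + 1
        if clicked:
            weighted_clicks[doc] = weighted_clicks.get(doc, 0.0) + 1.0 / propensities[rank]
    # Average re-weighted clicks per impression approximates P(relevant | doc).
    return {doc: weighted_clicks.get(doc, 0.0) / n for doc, n in impressions.items()}
```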
Value Alignment & AI Safety
Ensure AI systems pursue intended objectives
Concrete Problems in AI Safety
Taxonomy of five safety problems: side effects, reward hacking, scalable oversight, safe exploration, distributional shift.
Goal Misgeneralization in Deep Reinforcement Learning
Demonstrates agents can pursue wrong goals even with correct specifications; distinct from reward hacking.
Scaling Laws for Reward Model Overoptimization
First systematic study of Goodhart's Law in RLHF; provides predictable scaling for safe optimization bounds.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Shows a GPT-2-level supervisor can elicit much of GPT-4's capability; core empirical work on scalable oversight for superhuman AI alignment.
Personalization & User Modeling
Learn and represent individual user preferences
Matrix Factorization Techniques for Recommender Systems
Netflix Prize winners' tutorial; latent factor models, implicit feedback, temporal dynamics; 14,000+ citations.
Collaborative Filtering for Implicit Feedback Datasets
Weighted matrix factorization for clicks/views; confidence-weighted preference learning; industry standard.
BPR: Bayesian Personalized Ranking from Implicit Feedback
Pairwise ranking optimization from Bayesian principles; first method optimizing ranking directly for implicit feedback data (one SGD step is sketched below).
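As a closing illustration for the implicit-feedback entries, here is one stochastic gradient step of BPR: sample a user, an interacted item, and a non-interacted item, and push the score of the former above the latter. Matrix shapes, function name, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def bpr_step(U, V, user, pos_item, neg_item, lr=0.05, reg=0.01):
    """One SGD step of Bayesian Personalized Ranking (sketch; updates U, V in place).

    U : (n_users, d) user factors, V : (n_items, d) item factors.
    BPR maximizes log sigma(x_ui - x_uj) for an observed item i and a sampled
    unobserved item j, i.e. it optimizes pairwise ranking rather than ratings.
    """
    u, i, j = U[user].copy(), V[pos_item].copy(), V[neg_item].copy()
    x_uij = u @ (i - j)                      # score difference
    sig = 1.0 / (1.0 + np.exp(x_uij))        # = sigma(-x_uij), the gradient scale

    U[user]     += lr * (sig * (i - j) - reg * u)
    V[pos_item] += lr * (sig * u       - reg * i)
    V[neg_item] += lr * (sig * (-u)    - reg * j)
```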