Reinforcement Learning Papers

Related papers for Reinforcement Learning (we mainly focus on single-agent).

Since a large number of new reinforcement learning papers appear at each conference every year, we can only list those we have read and consider insightful.

We have added some ICLR23 papers on RL.

Model Free (Online) RL

Classic Methods

Title	Method	Conference	on/off policy	Action Space	Policy	Description
Human-level control through deep reinforcement learning, [other link]	DQN	Nature15	off	Discrete	based on value function	use a deep neural network for Q-learning and reach human-level performance on Atari games; two main tricks: a replay buffer to improve sample efficiency, and a target network decoupled from the behavior network
Deep reinforcement learning with double q-learning	Double DQN	AAAI16	off	Discrete	based on value function	find that the Q function in DQN may overestimate; decouple calculating q function and choosing action with two neural networks
Dueling network architectures for deep reinforcement learning	Dueling DQN	ICML16	off	Discrete	based on value function	use the same neural network to approximate q function and value function for calculating advantage function
Prioritized Experience Replay	Priority Sampling	ICLR16	off	Discrete	based on value function	give different weights to the samples in the replay buffer (e.g. TD error)
Rainbow: Combining Improvements in Deep Reinforcement Learning	Rainbow	AAAI18	off	Discrete	based on value function	combine different improvements to DQN: Double DQN, Dueling DQN, Priority Sampling, Multi-step learning, Distributional RL, Noisy Nets
Policy Gradient Methods for Reinforcement Learning with Function Approximation	PG	NeurIPS99	on/off	Continuous or Discrete	function approximation	propose the Policy Gradient Theorem: how to calculate the gradient of the expected cumulative return with respect to the policy parameters
----	AC/A2C	----	on/off	Continuous or Discrete	parameterized neural network	AC: replace the return in PG with q function approximator to reduce variance; A2C: replace the q function in AC with advantage function to reduce variance
Asynchronous Methods for Deep Reinforcement Learning	A3C	ICML16	on/off	Continuous or Discrete	parameterized neural network	propose three tricks to improve performance: (i) use different agents to interact with the environment; (ii) value function and policy share network parameters; (iii) modify the loss function (mse of value function + pg loss + policy entropy)
Trust Region Policy Optimization	TRPO	ICML15	on	Continuous or Discrete	parameterized neural network	introduce trust region to policy optimization for guaranteed monotonic improvement
Proximal Policy Optimization Algorithms	PPO	arxiv17	on	Continuous or Discrete	parameterized neural network	replace the hard constraint of TRPO with a penalty by clipping the probability ratio
Deterministic Policy Gradient Algorithms	DPG	ICML14	off	Continuous	function approximation	consider deterministic policy for continuous action space and prove Deterministic Policy Gradient Theorem; use a stochastic behaviour policy for encouraging exploration
Continuous Control with Deep Reinforcement Learning	DDPG	ICLR16	off	Continuous	parameterized neural network	adapt the ideas of DQN to DPG: (i) deep neural network function approximators, (ii) replay buffer, (iii) fix the target q function at each epoch
Addressing Function Approximation Error in Actor-Critic Methods	TD3	ICML18	off	Continuous	parameterized neural network	adapt the ideas of Double DQN to DDPG: taking the minimum value between a pair of critics to limit overestimation
Reinforcement Learning with Deep Energy-Based Policies	SQL	ICML17	off	main for Continuous	parameterized neural network	consider max-entropy rl and propose soft q iteration as well as soft q learning
Soft Actor-Critic Algorithms and Applications, Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, [appendix]	SAC	ICML18	off	main for Continuous	parameterized neural network	build on the theoretical analysis of SQL and extend soft q iteration (soft q evaluation + soft q improvement); reparameterize the policy and use two parameterized value functions; propose SAC
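
The value targets described in the DQN, Double DQN, and TD3 rows above can be sketched in a few lines. This is a minimal illustration rather than any paper's reference implementation: the function names and the plain-list representation of Q-values are assumptions made for this example.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """DQN-style target: the target network both selects and evaluates
    the next action (max over its own estimates)."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN-style target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def clipped_double_q_target(reward, gamma, q1_next, q2_next, done):
    """TD3-style target: take the minimum of two critics' estimates
    for the next state-action pair to limit overestimation."""
    if done:
        return reward
    return reward + gamma * min(q1_next, q2_next)


# Hypothetical Q-value estimates for three discrete actions in the next state.
q_online = [1.0, 2.0, 0.5]
q_target = [3.0, 1.5, 0.2]
print(dqn_target(1.0, 0.9, q_target, done=False))
print(double_dqn_target(1.0, 0.9, q_online, q_target, done=False))
```

With these made-up numbers the DQN target is about 3.7, while the Double DQN target is about 2.35, because the action preferred by the online network is scored lower by the target network.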

Exploration

Title	Method	Conference	Description
Curiosity-driven Exploration by Self-supervised Prediction	ICM	ICML17	propose that curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills when rewards are sparse; formulate curiosity as the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model
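
The curiosity signal described for ICM can be sketched as the squared error between predicted and actual next-state features. The snippet below assumes the feature vectors are already computed; in ICM they come from an encoder trained with an inverse dynamics model and a forward model, both omitted here. The function name and the scaling factor `eta` are labels chosen for this example.

```python
def curiosity_reward(predicted_next_features, actual_next_features, eta=1.0):
    """Intrinsic reward as scaled squared prediction error in feature space.

    Both arguments are equal-length lists of floats; producing them
    (encoder and forward model) is outside the scope of this sketch.
    """
    squared_error = sum(
        (p - a) ** 2
        for p, a in zip(predicted_next_features, actual_next_features)
    )
    return 0.5 * eta * squared_error


# A transition the forward model predicts well yields no bonus;
# a surprising transition yields a positive bonus.
print(curiosity_reward([1.0, 2.0], [1.0, 2.0]))  # 0.0
print(curiosity_reward([1.0, 2.0], [1.0, 4.0]))  # 2.0
```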

Off-Policy Evaluation

Title	Method	Conference
Weighted importance sampling for off-policy learning with linear function approximation	WIS-LSTD	NeurIPS14
Importance Sampling Policy Evaluation with an Estimated Behavior Policy	RIS	ICML19
Off-Policy Evaluation for Large Action Spaces via Embeddings		ICML22
Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning	LDR2OPE	ICML22
On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation		ICML22
A Unified Off-Policy Evaluation Approach for General Value Function		NeurIPS22
The Pitfalls of Regularizations in Off-Policy TD Learning		NeurIPS22
Off-Policy Evaluation for Action-Dependent Non-Stationary Environments		NeurIPS22
Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions		NeurIPS22
Off-Policy Evaluation with Policy-Dependent Optimization Response		NeurIPS22
Variational Latent Branching Model for Off-Policy Evaluation		ICLR23

Soft RL

Title	Method	Conference	Description
A Max-Min Entropy Framework for Reinforcement Learning	MME	NeurIPS21	find that SAC may fail to explore states with low entropy (it tends to visit states with high entropy and increase their entropies); propose a max-min entropy framework to address this issue
Maximum Entropy RL (Provably) Solves Some Robust RL Problems	----	ICLR22	theoretically prove that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function
The Importance of Non-Markovianity in Maximum State Entropy Exploration		ICML22 oral
Communicating via Maximum Entropy Reinforcement Learning		ICML22

Bisimulation

Title	Method	Conference
Equivalence notions and model minimization in Markov decision processes		Artificial Intelligence, 2003
Metrics for Finite Markov Decision Processes		UAI04
Bisimulation metrics for continuous Markov decision processes		SIAM Journal on Computing, 2011
Scalable methods for computing state similarity in deterministic Markov Decision Processes		AAAI20
Learning Invariant Representations for Reinforcement Learning without Reconstruction	DBC	ICLR21

Current methods

Title	Method	Conference	Description
Provably efficient RL with Rich Observations via Latent State Decoding	Block MDP	ICML19
Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels	DrQ	ICLR20	propose to apply data augmentation with model-free methods to reach state-of-the-art performance in pixel-based tasks
Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO	----	ICLR20	show that the improvement of performance is related to code-level optimizations
What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study	----	ICLR21	do a large scale empirical study to evaluate different tricks for on-policy algorithms on MuJoCo
Mirror Descent Policy Optimization	MDPO	ICLR21
Randomized Ensemble Double Q-Learning: Learning Fast Without a Model	REDQ	ICLR21	consider three ingredients: (i) update q functions many times at every epoch; (ii) use an ensemble of Q functions; (iii) use the minimization across a random subset of Q functions from the ensemble for avoiding the overestimation; propose REDQ and achieve performance similar to model-based methods
Generalizable Episodic Memory for Deep Reinforcement Learning	GEM	ICML21	propose to integrate the generalization ability of neural networks and the fast retrieval manner of episodic memory
SO(2)-Equivariant Reinforcement Learning	Equi DQN, Equi SAC	ICLR22 Spotlight	consider learning transformation-invariant policies and value functions; define and analyze group equivariant MDPs
CoBERL: Contrastive BERT for Reinforcement Learning	CoBERL	ICLR22 Spotlight	propose Contrastive BERT for RL (CoBERL) that combines a new contrastive loss and a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency
Understanding and Preventing Capacity Loss in Reinforcement Learning	InFeR	ICLR22 Spotlight	propose that deep RL agents lose some of their capacity to quickly fit new prediction tasks during training; propose InFeR to regularize a set of network outputs towards their initial values
On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning	----	ICLR22 Spotlight	consider lottery ticket hypothesis in deep reinforcement learning
Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration	LOGO	ICLR22 Spotlight	consider the sparse reward challenges in RL; propose LOGO that exploits the offline demonstration data generated by a sub-optimal behavior policy; each step of LOGO contains a policy improvement step via TRPO and an additional policy guidance step by using the sub-optimal behavior policy
Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation	IV-RL	ICLR22 Spotlight	analyze the sources of uncertainty in the supervision of model-free DRL algorithms, and show that the variance of the supervision noise can be estimated with negative log-likelihood and variance ensembles
Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning	GPM	ICLR22 Spotlight	focus on generating consistent actions for model-free RL, and borrow ideas from Model-based planning and action-repeat; use the policy to generate multi-step actions
When should agents explore?	----	ICLR22 Spotlight	consider when to explore and propose to choose a heterogeneous mode-switching behavior policy
Maximizing Ensemble Diversity in Deep Reinforcement Learning	MED-RL	ICLR22
Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similarities	AMBS	ICLR22
Large Batch Experience Replay	LaBER	ICML22 oral	cast the replay buffer sampling problem as an importance sampling one for estimating the gradient and derive the theoretically optimal sampling distribution
Do Differentiable Simulators Give Better Gradients for Policy Optimization?	----	ICML22 oral	consider whether differentiable simulators give better policy gradients; show some pitfalls of First-order estimates and propose alpha-order estimates
Federated Reinforcement Learning: Communication-Efficient Algorithms and Convergence Analysis		ICML22 oral
An Analytical Update Rule for General Policy Optimization	----	ICML22 oral	provide a tighter bound for trust-region methods
Generalised Policy Improvement with Geometric Policy Composition	GSPs	ICML22 oral	propose the concept of geometric switching policy (GSP): given a set of policies that act in turn, sample a number from a geometric distribution for each policy and follow that policy for that many steps; consider policy improvement over non-Markov GSPs
Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error	----	ICML22	aim to better understand the relationship between the Bellman error and the accuracy of value functions through theoretical analysis and empirical study; point out that the Bellman error is a poor replacement for value error because (i) the magnitude of the Bellman error hides bias, and (ii) missing transitions break the Bellman equation
Adaptive Model Design for Markov Decision Process	----	ICML22	consider Regularized Markov Decision Process and formulate it as a bi-level problem
Stabilizing Off-Policy Deep Reinforcement Learning from Pixels	A-LIX	ICML22	propose that temporal-difference learning with a convolutional encoder and low-magnitude rewards causes instabilities, a phenomenon named catastrophic self-overfitting; propose to apply adaptive regularization to the encoder’s gradients that explicitly prevents catastrophic self-overfitting
Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach	----	ICML22	study PG from a perturbation perspective
Mirror Learning: A Unifying Framework of Policy Optimisation	Mirror Learning	ICML22	propose a novel unified theoretical framework named Mirror Learning to provide theoretical guarantees for General Policy Improvement (GPI) and Trust-Region Learning (TRL); propose an interesting, graph-theoretical perspective on mirror learning
Continuous Control with Action Quantization from Demonstrations	AQuaDem	ICML22	leverage the prior of human demonstrations for reducing a continuous action space to a discrete set of meaningful actions; point out that using a set of actions rather than a single one (Behavioral Cloning) makes it possible to capture the multimodality of behaviors in the demonstrations
Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory	----	ICML22	analyze Fitted Q Evaluation (FQE) with general differentiable function approximators, including neural function approximations by using the Z-estimation theory
A Temporal-Difference Approach to Policy Gradient Estimation		ICML22
The Primacy Bias in Deep Reinforcement Learning		ICML22
Optimizing Sequential Experimental Design with Deep Reinforcement Learning		ICML22	use DRL for solving the optimal design of sequential experiments
The Geometry of Robust Value Functions		ICML22	study the geometry of the robust value space for the more general Robust MDPs
Direct Behavior Specification via Constrained Reinforcement Learning		ICML22
Utility Theory for Markovian Sequential Decision Making	Affine-Reward MDPs	ICML22	extend von Neumann-Morgenstern (VNM) utility theorem to decision making setting
Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks	MeanQ	ICML22	consider variance reduction in Temporal-Difference Value Estimation; propose MeanQ to estimate target values by ensembling
Unifying Approximate Gradient Updates for Policy Optimization		ICML22
EqR: Equivariant Representations for Data-Efficient Reinforcement Learning		ICML22
Provable Reinforcement Learning with a Short-Term Memory		ICML22
Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration		ICML22
Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments		ICML22
Lagrangian Method for Q-Function Learning (with Applications to Machine Translation)		ICML22
Learning to Assemble with Large-Scale Structured Reinforcement Learning		ICML22
Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning		ICML22
Off-Policy Reinforcement Learning with Delayed Rewards		ICML22
Reachability Constrained Reinforcement Learning		ICML22
Reinforcement Learning with Neural Radiance Fields	NeRF-RL	NeurIPS22	propose to train an encoder that maps multiple image observations to a latent space describing the objects in the scene
Recursive Reinforcement Learning		NeurIPS22
Challenging Common Assumptions in Convex Reinforcement Learning		NeurIPS22
Explicable Policy Search		NeurIPS22
On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting	----	NeurIPS22	explore the theoretical connections between Reward Maximization (RM) and Distribution Matching (DM)
When to Ask for Help: Proactive Interventions in Autonomous Reinforcement Learning		NeurIPS22
Adaptive Bio-Inspired Fish Simulation with Deep Reinforcement Learning		NeurIPS22
Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space		NeurIPS22
Discovered Policy Optimisation		NeurIPS22
Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards		NeurIPS22
An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context		NeurIPS22
Large-Scale Retrieval for Reinforcement Learning		NeurIPS22
Sustainable Online Reinforcement Learning for Auto-bidding		NeurIPS22
LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward		NeurIPS22
DNA: Proximal Policy Optimization with a Dual Network Architecture		NeurIPS22
Faster Deep Reinforcement Learning with Slower Online Network	DQN Pro, Rainbow Pro	NeurIPS22	incentivize the online network to remain in the proximity of the target network
Online Reinforcement Learning for Mixed Policy Scopes		NeurIPS22
ProtoX: Explaining a Reinforcement Learning Agent via Prototyping		NeurIPS22
Hardness in Markov Decision Processes: Theory and Practice		NeurIPS22
Robust Phi-Divergence MDPs		NeurIPS22
On the convergence of policy gradient methods to Nash equilibria in general stochastic games		NeurIPS22
A Unified Off-Policy Evaluation Approach for General Value Function		NeurIPS22
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning		NeurIPS22
Continuous Deep Q-Learning in Optimal Control Problems: Normalized Advantage Functions Analysis		NeurIPS22
Parametrically Retargetable Decision-Makers Tend To Seek Power		NeurIPS22
Batch size-invariance for policy optimization		NeurIPS22
Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions		NeurIPS22
Adaptive Interest for Emphatic Reinforcement Learning		NeurIPS22
The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning		NeurIPS22
Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress	PVRL	NeurIPS22	focus on reincarnating RL from any agent to any other agent; present reincarnating RL as an alternative workflow or class of problem settings, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another
Bayesian Risk Markov Decision Processes		NeurIPS22
Explainable Reinforcement Learning via Model Transforms		NeurIPS22
PDSketch: Integrated Planning Domain Programming and Learning		NeurIPS22
Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier	SR-SAC, SR-SPR	ICLR23 oral	show that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge
Guarded Policy Optimization with Imperfect Online Demonstrations	TS2C	ICLR23 Spotlight	incorporate teacher intervention based on trajectory-based value estimation
Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes	PW-Net	ICLR23 Spotlight	focus on making an “interpretable-by-design” deep reinforcement learning agent which is forced to use human-friendly prototypes in its decisions for making its reasoning process clear; train a “wrapper” model called PW-Net that can be added to any pre-trained agent, which allows them to be interpretable-by-design
Pink Noise Is All You Need: Colored Noise Exploration in Deep Reinforcement Learning		ICLR23 Spotlight
DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems		ICLR23 Spotlight
Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting		ICLR23
Replay Memory as An Empirical MDP: Combining Conservative Estimation with Experience Replay		ICLR23
Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement	CCEM, GreedyAC	ICLR23	propose to iteratively take the top percentile of actions, ranked according to the learned action-values; leverage theory for CEM to validate that CCEM concentrates on maximally valued actions across states over time
Reward Design with Language Models	----	ICLR23	explore how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior
Solving Continuous Control via Q-learning	DecQN	ICLR23	combine value decomposition and bang-bang action space discretization with DQN to handle continuous control tasks; evaluate on DMControl, Meta-World, and Isaac Gym
Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees	WAE-MDP	ICLR23	minimize a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy
Quality-Similar Diversity via Population Based Reinforcement Learning		ICLR23
Human-level Atari 200x faster	MEME	ICLR23	outperform the human baseline across all 57 Atari games in 390M frames; four key components: (1) an approximate trust region method which enables stable bootstrapping from the online network, (2) a normalisation scheme for the loss and priorities which improves robustness when learning a set of value functions with a wide range of scales, (3) an improved architecture employing techniques from NFNets in order to leverage deeper networks without the need for normalization layers, and (4) a policy distillation method which serves to smooth out the instantaneous greedy policy over time.
Policy Expansion for Bridging Offline-to-Online Reinforcement Learning		ICLR23
Improving Deep Policy Gradients with Value Function Search	VFS	ICLR23	focus on improving value approximation and analyzing the effects on Deep PG primitives such as value prediction, variance reduction, and correlation of gradient estimates with the true gradient; show that value functions with better predictions improve Deep PG primitives, leading to better sample efficiency and policies with higher returns
Memory Gym: Partially Observable Challenges to Memory-Based Agents	Memory Gym	ICLR23	a benchmark for challenging Deep Reinforcement Learning agents to memorize events across long sequences, be robust to noise, and generalize; consists of the partially observable 2D and discrete control environments Mortar Mayhem, Mystery Path, and Searing Spotlights; [code]
Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality		ICLR23
Hybrid RL: Using both offline and online data can make RL efficient	Hy-Q	ICLR23	focus on a hybrid setting named Hybrid RL, where the agent has both an offline dataset and the ability to interact with the environment; extend fitted Q-iteration algorithm
POPGym: Benchmarking Partially Observable Reinforcement Learning	POPGym	ICLR23	a two-part library containing (1) a diverse collection of 15 partially observable environments, each with multiple difficulties and (2) implementations of 13 memory model baselines; [code]
Critic Sequential Monte Carlo	CriticSMC	ICLR23	combine sequential Monte Carlo with learned Soft-Q function heuristic factors
Revocable Deep Reinforcement Learning with Affinity Regularization for Outlier-Robust Graph Matching		ICLR23
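
As an illustration of the REDQ entry in the table above, the in-target minimization over a random subset of the Q-function ensemble can be sketched as follows. This is a simplified sketch: the entropy term that REDQ inherits from SAC is left out, and the function name, argument names, and the plain list of per-member Q estimates are assumptions made for this example.

```python
import random


def random_subset_min_target(reward, gamma, next_q_values, subset_size=2,
                             done=False, rng=None):
    """Bootstrapped target using the minimum over a random subset of an ensemble.

    next_q_values holds one estimate of Q(s', a') per ensemble member.
    Sampling a subset (instead of using all members) controls how
    pessimistic the target is.
    """
    if done:
        return reward
    rng = rng or random.Random()
    subset = rng.sample(next_q_values, subset_size)
    return reward + gamma * min(subset)


# Hypothetical ensemble of three Q estimates for the next state-action pair.
ensemble_estimates = [5.0, 3.0, 4.0]
target = random_subset_min_target(1.0, 0.5, ensemble_estimates,
                                  subset_size=2, rng=random.Random(0))
print(target)
```

With a subset of two out of these three estimates, the target is either 2.5 or 3.0 depending on which members are drawn. Using the full ensemble always gives the most pessimistic value, 2.5.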

Model Based (Online) RL

Classic Methods

Title	Method	Conference	Description
Value-Aware Loss Function for Model-based Reinforcement Learning	VAML	AISTATS17	propose to train the model using the difference between TD errors rather than the KL-divergence
Model-Ensemble Trust-Region Policy Optimization	ME-TRPO	ICLR18	analyze the behavior of vanilla MBRL methods with DNN; propose ME-TRPO with two ideas: (i) use an ensemble of models, (ii) use likelihood ratio derivatives; significantly reduce the sample complexity compared to model-free methods
Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning	MVE	ICML18	use a dynamics model to simulate the short-term horizon and Q-learning to estimate the long-term value beyond the simulation horizon; use the trained model and the policy to estimate k-step value function for updating value function
Iterative value-aware model learning	IterVAML	NeurIPS18	replace the supremum in VAML with the current estimate of the value function
Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion	STEVE	NeurIPS18	an extension to MVE; only utilize roll-outs without introducing significant errors
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models	PETS	NeurIPS18	propose PETS, which incorporates uncertainty via an ensemble of bootstrapped models
Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees	SLBO	ICLR19	propose a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees: provide a lower bound of the true return satisfying some properties s.t. optimizing this lower bound can actually optimize the true return
When to Trust Your Model: Model-Based Policy Optimization	MBPO	NeurIPS19	propose MBPO with monotonic model-based improvement; theoretically discuss how to choose k for model rollouts
Model Based Reinforcement Learning for Atari	SimPLe	ICLR20	the first model-based method to successfully handle the ALE benchmark, with several design choices: (i) deterministic model; (ii) well-designed loss functions; (iii) scheduled sampling; (iv) stochastic models
Bidirectional Model-based Policy Optimization	BMPO	ICML20	an extension to MBPO; consider both forward dynamics model and backward dynamics model
Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning	CaDM	ICML20	develop a context-aware dynamics model (CaDM) capable of generalizing across a distribution of environments with varying transition dynamics; introduce a backward dynamics model that predicts a previous state by utilizing a context latent vector
A Game Theoretic Framework for Model Based Reinforcement Learning	PAL, MAL	ICML20	develop a novel framework that casts MBRL as a game between a policy player and a model player; setup a Stackelberg game between the two players
Planning to Explore via Self-Supervised World Models	Plan2Explore	ICML20	propose a self-supervised reinforcement learning agent for addressing two challenges: quick adaptation and expected future novelty
Trust the Model When It Is Confident: Masked Model-based Actor-Critic	M2AC	NeurIPS20	an extension to MBPO; use model rollouts only when the model is confident
The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning	LoCA	NeurIPS20	propose LoCA to measure how quickly a method adapts its policy after the environment is changed from the first task to the second
Generative Temporal Difference Learning for Infinite-Horizon Prediction	GHM, or gamma-model	NeurIPS20	propose gamma-model to make long-horizon predictions without the need to repeatedly apply a single-step model
Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning	----	arXiv2012	study a number of design decisions for the predictive model in visual MBRL algorithms, focusing specifically on methods that use a predictive model for planning
Mastering Atari Games with Limited Data	EfficientZero	NeurIPS21	first achieve super-human performance on Atari games with limited data; propose EfficientZero with three components: (i) use self-supervised learning to learn a temporally consistent environment model, (ii) learn the value prefix in an end-to-end manner, (iii) use the learned model to correct off-policy value targets
On Effective Scheduling of Model-based Reinforcement Learning	AutoMBPO	NeurIPS21	an extension to MBPO; automatically schedule the real data ratio as well as other hyperparameters for MBPO
Model-Advantage and Value-Aware Models for Model-Based Reinforcement Learning: Bridging the Gap in Theory and Practice	----	arxiv22	bridge the gap in theory and practice of value-aware model learning (VAML) for model-based RL
Value Gradient weighted Model-Based Reinforcement Learning	VaGraM	ICLR22 Spotlight	consider the objective mismatch problem in MBRL; propose VaGraM by rescaling the MSE loss function with gradient information from the current value function estimate
Constrained Policy Optimization via Bayesian World Models	LAMBDA	ICLR22 Spotlight	consider Bayesian model-based methods for CMDP
On-Policy Model Errors in Reinforcement Learning	OPC	ICLR22	consider combining real-world data and a learned model in order to get the best of both worlds; propose to exploit the real-world data for on-policy predictions and use the learned model only to generalize to different actions; propose to use on-policy transition data on top of a separately learned model to enable accurate long-term predictions for MBRL
Temporal Difference Learning for Model Predictive Control	TD-MPC	ICML22	propose to use the model only to predict reward; use a policy to accelerate the planning
Causal Dynamics Learning for Task-Independent State Abstraction		ICML22
Mismatched no More: Joint Model-Policy Optimization for Model-Based RL	MnM	NeurIPS22	propose a model-based RL algorithm where the model and policy are jointly optimized with respect to the same objective, which is a lower bound on the expected return under the true environment dynamics, and becomes tight under certain assumptions
When to Update Your Model: Constrained Model-based Reinforcement Learning		NeurIPS22
Bayesian Optimistic Optimization: Optimistic Exploration for Model-Based Reinforcement Learning		NeurIPS22
Model-based Lifelong Reinforcement Learning with Bayesian Exploration		NeurIPS22
Plan to Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning		NeurIPS22
Data-Driven Model-Based Optimization via Invariant Representation Learning		NeurIPS22
Reinforcement Learning with Non-Exponential Discounting	----	NeurIPS22	propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions; derive a Hamilton–Jacobi–Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method
Making Better Decision by Directly Planning in Continuous Control		ICLR23
HiT-MDP: Learning the SMDP option framework on MDPs with Hidden Temporal Embeddings		ICLR23
Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning		ICLR23
Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective	ALM	ICLR23	propose a single objective that jointly optimizes the policy, the latent-space model, and the representations produced by the encoder: maximize predicted rewards while minimizing the errors in the predicted representations
SpeedyZero: Mastering Atari with Limited Data and Time	SpeedyZero	ICLR23	a distributed RL system built upon EfficientZero with Priority Refresh and Clipped LARS; lead to human-level performances on the Atari benchmark within 35 minutes using only 300k samples
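
The idea behind the MVE entry in the table above (simulate a short horizon with the learned model, then bootstrap with the Q-function) can be sketched as below. This is a simplified sketch, not the paper's full procedure: `model`, `policy`, and `q_fn` are hypothetical callables introduced for this example, and the dynamics are assumed deterministic.

```python
def value_expansion_target(state, model, policy, q_fn, horizon, gamma):
    """H-step value-expansion target.

    Assumed (hypothetical) interfaces:
      model(state, action) -> (next_state, reward)   learned dynamics
      policy(state)        -> action
      q_fn(state, action)  -> float                  learned Q estimate
    """
    total = 0.0
    discount = 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)  # imagined transition
        total += discount * reward
        discount *= gamma
    # Bootstrap beyond the simulated horizon with the learned Q-function.
    return total + discount * q_fn(state, policy(state))


# Toy setup: every imagined step yields reward 1, and Q is constant at 10.
toy_model = lambda s, a: (s + 1, 1.0)
toy_policy = lambda s: 0
toy_q = lambda s, a: 10.0
print(value_expansion_target(0, toy_model, toy_policy, toy_q, horizon=2, gamma=0.5))  # 4.0
```

In the toy setup the target is 1 + 0.5 * 1 + 0.25 * 10 = 4.0. With `horizon=0` the function reduces to the plain Q estimate, which is one way to see how the horizon trades model error against value-function error.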

World Models

Title	Method	Conference	Description
World Models, [NeurIPS version]	World Models	NeurIPS18	learn a compressed spatial and temporal representation of the environment in an unsupervised manner and use the world model to train a very compact and simple policy for solving the required task
Learning latent dynamics for planning from pixels	PlaNet	ICML19	propose PlaNet to learn the environment dynamics from images; the dynamics model consists of a transition model, an observation model, a reward model, and an encoder; use the cross entropy method for selecting actions for planning
Dream to Control: Learning Behaviors by Latent Imagination	Dreamer	ICLR20	solve long-horizon tasks from images purely by latent imagination; test in image-based MuJoCo; propose to use an agent to replace the control algorithm in the PlaNet
Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning	BIRD	NeurIPS20	propose to maximize the mutual information between imaginary and real trajectories so that the policy improvement learned from imaginary trajectories can be easily generalized to real trajectories
Planning to Explore via Self-Supervised World Models	Plan2Explore	ICML20	propose Plan2Explore for self-supervised exploration and fast adaptation to new tasks
Mastering Atari with Discrete World Models	DreamerV2	ICLR21	solve long-horizon tasks from images purely by latent imagination; test in image-based Atari
Temporal Predictive Coding For Model-Based Planning In Latent Space	TPC	ICML21	propose a temporal predictive coding approach for planning from high-dimensional observations and theoretically analyze its ability to prioritize the encoding of task-relevant information
Learning Task Informed Abstractions	TIA	ICML21	introduce the formalism of Task Informed MDP (TiMDP) that is realized by training two models that learn visual features via cooperative reconstruction, but one model is adversarially dissociated from the reward signal
Dreaming: Model-based Reinforcement Learning by Latent Imagination without Reconstruction	Dreaming	ICRA21	propose a decoder-free extension of Dreamer since the autoencoding based approach often causes object vanishing
Model-Based Reinforcement Learning via Imagination with Derived Memory	IDM	NeurIPS21	aim to improve the diversity of imagination for model-based policy optimization with the derived memory; point out that current methods cannot effectively enrich the imagination if the latent state is disturbed by random noise
Maximum Entropy Model-based Reinforcement Learning	MaxEnt Dreamer	NeurIPS21	create a connection between exploration methods and model-based reinforcement learning; apply maximum-entropy exploration for Dreamer
DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations	DreamerPro	ICML22	consider reconstruction-free MBRL; propose to learn the prototypes from the recurrent states of the world model, thereby distilling temporal structures from past observations and actions into the prototypes.
TransDreamer: Reinforcement Learning with Transformer World Models	TransDreamer	arxiv2202	replace the RNN in RSSM by a transformer
DreamingV2: Reinforcement Learning with Discrete World Models without Reconstruction	DreamingV2	arxiv2203	adopt both the discrete representation of DreamerV2 and the reconstruction-free objective of Dreaming
Masked World Models for Visual Control	MWM	arxiv2206	decouple visual representation learning and dynamics learning for visual model-based RL and use masked autoencoder to train visual representation
DayDreamer: World Models for Physical Robot Learning	DayDreamer	arxiv2206	apply Dreamer to 4 robots to learn online and directly in the real world, without any simulators
Towards Evaluating Adaptivity of Model-Based Reinforcement Learning Methods	----	ICML22	introduce an improved version of the LoCA setup and use it to evaluate PlaNet and DreamerV2
Reinforcement Learning with Action-Free Pre-Training from Videos	APV	ICML22	pre-train an action-free latent video prediction model using videos from different domains, and then fine-tune the pre-trained model on target domains
Denoised MDPs: Learning World Models Better Than the World Itself	Denoised MDP	ICML22	divide information into four categories: controllable/uncontrollable (whether it is affected by the action) and reward-relevant/irrelevant (whether it affects the return); propose to only consider information which is controllable and reward-relevant
Iso-Dream: Isolating Noncontrollable Visual Dynamics in World Models	Iso-Dream	NeurIPS22	consider noncontrollable dynamics independent of the action signals; encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches; optimize the behavior of the agent on the decoupled latent imaginations of the world model
Learning General World Models in a Handful of Reward-Free Deployments	CASCADE	NeurIPS22	introduce the reward-free deployment efficiency setting to facilitate generalization (exploration should be task agnostic) and scalability (exploration policies should collect large quantities of data without costly centralized retraining); propose an information theoretic objective inspired by Bayesian Active Learning by specifically maximizing the diversity of trajectories sampled by the population through a novel cascading objective
Learning Robust Dynamics through Variational Sparse Gating	VSG, SVSG, BBS	NeurIPS22	propose to sparsely update the latent states at each step; develop a new partially-observable and stochastic environment, called BringBackShapes (BBS)
Transformers are Sample Efficient World Models	IRIS	ICLR23 oral	use a discrete autoencoder and an autoregressive Transformer to build the world model and significantly improve the data efficiency in Atari (2 hours of real-time experience); [code]
Transformer-based World Models Are Happy With 100k Interactions	TWM	ICLR23	present a new autoregressive world model based on the Transformer-XL; obtain excellent results on the Atari 100k benchmark; [code]
Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting	DUTD	ICLR23	propose a new general method that dynamically adjusts the update-to-data (UTD) ratio during training based on under- and overfitting detection on a small subset of the continuously collected experience not used for training; apply this method in DreamerV2
Evaluating Long-Term Memory in 3D Mazes	Memory Maze	ICLR23	introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents, including an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation; [code]
Mastering Diverse Domains through World Models	DreamerV3	arxiv2301	propose DreamerV3 to handle a wide range of domains, including continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales
Reward Informed Dreamer for Task Generalization in Reinforcement Learning	RID	arxiv2303	propose Task Distribution Relevance to capture the relevance of the task distribution quantitatively; propose RID to use world models to improve task generalization via encoding reward signals into policies
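
The PlaNet entry above mentions selecting actions with the cross-entropy method (CEM). As a rough illustration only, the sketch below runs CEM over a short action sequence against a hypothetical toy objective (`toy_return`), not a learned latent dynamics model. The population size, elite count, and iteration count are arbitrary assumptions, not values taken from the paper.

```python
import random
import statistics


def toy_return(actions):
    # Hypothetical stand-in for the return a learned model would predict:
    # it is highest (zero) when every action equals 0.5.
    return -sum((a - 0.5) ** 2 for a in actions)


def cem_plan(score_fn, horizon=5, population=64, n_elite=8, iterations=15, seed=0):
    """Return the mean action sequence found by the cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian belief.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(population)
        ]
        # Keep the highest-scoring candidates (the "elites").
        candidates.sort(key=score_fn, reverse=True)
        elites = candidates[:n_elite]
        # Refit the per-step mean and standard deviation to the elites.
        for t in range(horizon):
            column = [c[t] for c in elites]
            mean[t] = statistics.fmean(column)
            std[t] = statistics.pstdev(column) + 1e-6
    return mean


plan = cem_plan(toy_return)
```

In a planning loop, typically only the first action of `plan` would be executed before replanning from the next observation.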

CodeBase

Title	Conference	Methods	Github
MBRL-Lib: A Modular Library for Model-based Reinforcement Learning	arxiv21	MBPO,PETS,PlaNet	link

(Model Free) Offline RL

Current Methods

Title	Method	Conference	Description
Off-Policy Deep Reinforcement Learning without Exploration	BCQ	ICML19	show that off-policy methods perform badly because of extrapolation error; propose batch-constrained reinforcement learning: maximizing the return as well as minimizing the mismatch between the state-action visitation of the policy and the state-action pairs contained in the batch
Conservative Q-Learning for Offline Reinforcement Learning	CQL	NeurIPS20	propose CQL with conservative q function, which is a lower bound of its true value, since standard off-policy methods will overestimate the value function
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems	----	arxiv20	tutorial about methods, applications and open problems of offline rl
Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble		NeurIPS21
A Minimalist Approach to Offline Reinforcement Learning	TD3+BC	NeurIPS21	propose to add a behavior cloning term to regularize the policy, and normalize the states over the dataset
DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization	DR3	ICLR22 Spotlight	consider the implicit regularization effect of SGD in RL; based on theoretical analyses, propose an explicit regularizer, called DR3, and combine with offline RL methods
Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning	PBRL	ICLR22 Spotlight	consider the distributional shift and extrapolation error in offline RL; propose PBRL with bootstrapping, for uncertainty quantification, and an OOD sampling method as a regularizer
COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation	COptiDICE	ICLR22 Spotlight	consider offline constrained reinforcement learning; propose COptiDICE to directly optimize the distribution of state-action pairs with constraints
Offline Reinforcement Learning with Value-based Episodic Memory	EVL, VEM	ICLR22	present a new offline V-learning method to learn the value function through the trade-offs between imitation learning and optimal value learning; use a memory-based planning scheme to enhance advantage estimation and conduct policy learning in a regression manner
Offline reinforcement learning with implicit Q-learning	IQL	ICLR22	propose to learn an optimal policy with in-sample learning, without ever querying the values of any unseen actions
Adversarially Trained Actor Critic for Offline Reinforcement Learning		ICML22 oral
Learning Bellman Complete Representations for Offline Policy Evaluation		ICML22 oral
Offline RL Policies Should Be Trained to be Adaptive	APE-V	ICML22 oral	show that learning from an offline dataset does not fully specify the environment; formally demonstrate the necessity of adaptation in offline RL using the Bayesian formalism and provide a practical algorithm for learning optimally adaptive policies; propose an ensemble-based offline RL algorithm that imbues policies with the ability to adapt within an episode
Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity		ICML22
How to Leverage Unlabeled Data in Offline Reinforcement Learning?		ICML22
On the Role of Discount Factor in Offline Reinforcement Learning		ICML22
Model Selection in Batch Policy Optimization		ICML22
Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics		ICML22
Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning		ICML22
Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning		ICML22
Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters		ICML22
Constrained Offline Policy Optimization		ICML22
DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning		NeurIPS22
Supported Policy Optimization for Offline Reinforcement Learning		NeurIPS22
Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters		NeurIPS22
Oracle Inequalities for Model Selection in Offline Reinforcement Learning		NeurIPS22
Mildly Conservative Q-Learning for Offline Reinforcement Learning		NeurIPS22
A Policy-Guided Imitation Approach for Offline Reinforcement Learning		NeurIPS22
Bootstrapped Transformer for Offline Reinforcement Learning		NeurIPS22
LobsDICE: Offline Learning from Observation via Stationary Distribution Correction Estimation		NeurIPS22
Latent-Variable Advantage-Weighted Policy Optimization for Offline RL		NeurIPS22
How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via f-Advantage Regression		NeurIPS22
NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning		NeurIPS22
When does return-conditioned supervised learning work for offline reinforcement learning?		NeurIPS22
Bellman Residual Orthogonalization for Offline Reinforcement Learning		NeurIPS22
Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes		ICLR23 oral
Confidence-Conditioned Value Functions for Offline Reinforcement Learning		ICLR23 oral
Extreme Q-Learning: MaxEnt RL without Entropy		ICLR23 oral
Sparse Q-Learning: Offline Reinforcement Learning with Implicit Value Regularization		ICLR23 oral
The In-Sample Softmax for Offline Reinforcement Learning		ICLR23 Spotlight
Benchmarking Offline Reinforcement Learning on Real-Robot Hardware		ICLR23 Spotlight
Decision S4: Efficient Sequence-Based RL via State Spaces Layers		ICLR23
Behavior Proximal Policy Optimization		ICLR23
Learning Achievement Structure for Structured Exploration in Domains with Sparse Reward		ICLR23
Explaining RL Decisions with Trajectories		ICLR23
User-Interactive Offline Reinforcement Learning		ICLR23
Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning		ICLR23
Offline RL for Natural Language Generation with Implicit Language Q Learning		ICLR23
In-sample Actor Critic for Offline Reinforcement Learning		ICLR23
Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting		ICLR23
Mind the Gap: Offline Policy Optimization for Imperfect Rewards		ICLR23
When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning	DOGE	ICLR23	train a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint
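
The TD3+BC entry above describes two ingredients: a behavior cloning term added to the policy objective, and state normalization over the dataset. A minimal sketch of both follows, using scalar states and actions in plain Python. The function names are hypothetical, the default `alpha=2.5` is an assumed value, and using the same Q values for the scaling factor and the objective is a simplification.

```python
import statistics


def normalize_states(states, eps=1e-3):
    """Standardize scalar states using the dataset mean and standard deviation."""
    mean = statistics.fmean(states)
    std = statistics.pstdev(states) + eps
    return [(s - mean) / std for s in states]


def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Actor loss: a scaled Q term plus a behavior cloning (squared error) term.

    q_values[i] is assumed to be Q(s_i, pi(s_i)) from some critic.
    """
    # Rescale the Q term by the mean absolute Q so both terms are comparable.
    mean_abs_q = max(statistics.fmean([abs(q) for q in q_values]), 1e-8)
    lam = alpha / mean_abs_q
    per_sample = [
        -lam * q + (pi_a - a) ** 2
        for q, pi_a, a in zip(q_values, policy_actions, dataset_actions)
    ]
    return statistics.fmean(per_sample)


loss = td3_bc_actor_loss(
    q_values=[2.0, -2.0],
    policy_actions=[0.5, 0.0],
    dataset_actions=[0.0, 0.0],
)
normalized = normalize_states([1.0, 3.0])
```

Minimizing this loss pushes the policy toward high-Q actions while the squared-error term keeps it close to the actions in the dataset.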

Combined with Diffusion Models

Title	Method	Conference	Description
Planning with Diffusion for Flexible Behavior Synthesis		ICML22 oral
Is Conditional Generative Modeling all you need for Decision Making?		ICLR23 oral
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning		ICLR23
Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling		ICLR23

Model Based Offline RL

Title	Method	Conference	Description
Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization	BREMEN	ICLR20	propose deployment efficiency, to count the number of changes in the data-collection policy during learning (offline: 1, online: no limit); propose BREMEN with an ensemble of dynamics models for off-policy and offline rl
MOPO: Model-based Offline Policy Optimization	MOPO	NeurIPS20	observe that existing model-based RL algorithms can improve the performance of offline RL compared with model free RL algorithms; design MOPO by extending MBPO on uncertainty-penalized MDPs (new_reward = reward - uncertainty)
MOReL: Model-Based Offline Reinforcement Learning	MOReL	NeurIPS20	present MOReL for model-based offline RL, including two steps: (a) learning a pessimistic MDP, (b) learning a near-optimal policy in this P-MDP
Model-Based Offline Planning	MBOP	ICLR21	learn a model for planning
Representation Balancing Offline Model-Based Reinforcement Learning	RepB-SDE	ICLR21	focus on learning the representation for a robust model of the environment under the distribution shift and extend RepBM to deal with the curse of horizon; propose RepB-SDE framework for off-policy evaluation and offline rl
Conservative Objective Models for Effective Offline Model-Based Optimization	COMs	ICML21	consider offline model-based optimization (MBO, optimize an unknown function only with some samples); add a regularizer (resembling adversarial training methods) to the objective for learning conservative objective models
COMBO: Conservative Offline Model-Based Policy Optimization	COMBO	NeurIPS21	try to optimize a lower bound of performance without considering uncertainty quantification; extend CQL with model-based methods
Weighted Model Estimation for Offline Model-Based Reinforcement Learning	----	NeurIPS21	address the covariate shift issue by re-weighting the model losses for different datapoints
Revisiting Design Choices in Model-Based Offline Reinforcement Learning	----	ICLR22 Spotlight	conduct a rigorous investigation into a series of these design choices for Model-based Offline RL
Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage	CPPO	ICLR22
Pareto Policy Pool for Model-based Offline Reinforcement Learning		ICLR22
Planning with Diffusion for Flexible Behavior Synthesis	Diffuser	ICML22 oral	first design a denoising diffusion model for trajectory data and an associated probabilistic framework for behavior synthesis
Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning		ICML22
Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief		NeurIPS22
A Unified Framework for Alternating Offline Model Training and Policy Learning		NeurIPS22
Bidirectional Learning for Offline Infinite-width Model-based Optimization		NeurIPS22
Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization		ICLR23
Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning		ICLR23
Efficient Offline Policy Optimization with a Learned Model		ICLR23
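
The MOPO entry above summarizes its reward shaping as `new_reward = reward - uncertainty`. The sketch below illustrates that idea under stated assumptions: uncertainty is approximated by the disagreement (population standard deviation) among scalar next-state predictions from a hypothetical ensemble of dynamics models, and `lam` is an assumed penalty coefficient. The estimator used in the paper may differ.

```python
import statistics


def ensemble_uncertainty(predictions):
    """Disagreement among ensemble members, used here as an uncertainty proxy."""
    return statistics.pstdev(predictions)


def penalized_reward(reward, predictions, lam=1.0):
    """Subtract a scaled uncertainty estimate from the model-predicted reward."""
    return reward - lam * ensemble_uncertainty(predictions)


# When the ensemble agrees, the reward is left unchanged.
confident = penalized_reward(1.0, [0.0, 0.0, 0.0])
# When the ensemble disagrees, the reward is reduced.
uncertain = penalized_reward(1.0, [1.0, 3.0], lam=0.5)
```

A policy trained on such penalized rewards is discouraged from visiting state-action pairs where the learned model is unreliable.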

Meta RL

Title	Method	Conference	Description
RL2: Fast reinforcement learning via slow reinforcement learning	RL2	arxiv16	view the learning process of the agent itself as an objective; structure the agent as a recurrent neural network to store past rewards, actions, observations and termination flags for adapting to the task at hand when deployed
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks	MAML	ICML17	propose a general framework for different learning problems, including classification, regression and reinforcement learning; the main idea is to optimize the parameters to quickly adapt to new tasks (with a few steps of gradient descent)
Meta reinforcement learning with latent variable gaussian processes	----	arxiv18
Learning to adapt in dynamic, real-world environments through meta-reinforcement learning	ReBAL, GrBAL	ICLR18	consider learning online adaptation in the context of model-based reinforcement learning
Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory	----	ICML18	extend various PAC-Bayes bounds to meta learning
Meta reinforcement learning of structured exploration strategies		NeurIPS18
Meta-learning surrogate models for sequential decision making		arxiv19
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables	PEARL	ICML19	encode past tasks’ experience with a probabilistic latent context and use an inference network to estimate the posterior
Fast context adaptation via meta-learning	CAVIA	ICML19	propose CAVIA as an extension to MAML that is less prone to meta-overfitting, easier to parallelise, and more interpretable; partition the model parameters into two parts: context parameters and shared parameters, and only update the former one in the test stage
Taming MAML: Efficient Unbiased Meta-Reinforcement Learning		ICML19
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning	Meta World	CoRL19	an environment for meta RL as well as multi-task RL
Guided meta-policy search	GMPS	NeurIPS19	consider the sample efficiency during the meta-training process by using supervised imitation learning
Meta-Q-Learning	MQL	ICLR20	an off-policy algorithm for meta RL that builds upon three simple ideas: (i) Q Learning with context variable represented by past trajectories is competitive with SOTA; (ii) Multi-task objective is useful for meta RL; (iii) Past data from the meta-training replay buffer can be recycled
Varibad: A very good method for bayes-adaptive deep RL via meta-learning	variBAD	ICLR20	represent a single MDP M using a learned, low-dimensional stochastic latent variable m; jointly meta-train a variational auto-encoder that can infer the posterior distribution over m in a new task, and a policy that conditions on this posterior belief over MDP embeddings
On the global optimality of model-agnostic meta-learning, [ICML version]	----	ICML20	characterize the optimality gap of the stationary points attained by MAML for both rl and sl
Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling	MIER	arxiv20
FOCAL: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization	FOCAL	ICLR21	first consider offline meta-reinforcement learning; propose FOCAL based on PEARL
Offline meta reinforcement learning with advantage weighting	MACAW	ICML21	introduce the offline meta reinforcement learning problem setting; propose an optimization-based meta-learning algorithm named MACAW that uses simple, supervised regression objectives for both the inner and outer loop of meta-training
Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture	LDM	NeurIPS21	aim to train an agent that prepares for unseen test tasks during training, propose to train a policy on mixture tasks along with original training tasks for preventing the agent from overfitting the training tasks
Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation	----	NeurIPS21	present a unified framework for estimating higher-order derivatives of value functions, based on the concept of off-policy evaluation, for gradient-based meta rl
Generalization of Model-Agnostic Meta-Learning Algorithms: Recurring and Unseen Tasks	----	NeurIPS21
Offline Meta Learning of Exploration, Offline Meta Reinforcement Learning -- Identifiability Challenges and Effective Data Collection Strategies	BOReL	NeurIPS21
On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning	SG-MRL	NeurIPS21
Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL	----	NeurIPS21
Generalization Bounds for Meta-Learning via PAC-Bayes and Uniform Stability	----	NeurIPS21	provide a generalization bound on meta-learning by combining the PAC-Bayes technique and uniform stability
Bootstrapped Meta-Learning	BMG	ICLR22 Oral	propose BMG to let the meta-learner teach itself for tackling ill-conditioning problems and myopic meta-objectives in meta learning; BMG introduces meta-bootstrap to mitigate myopia and formulates the meta-objective in terms of minimising distance to control curvature
Model-Based Offline Meta-Reinforcement Learning with Regularization	MerPO, RAC	ICLR22	empirically point out that offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets; consider how to learn an informative offline meta-policy in order to achieve the optimal tradeoff between “exploring” the out-of-distribution state-actions by following the meta-policy and “exploiting” the offline dataset by staying close to the behavior policy; propose MerPO which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions
Skill-based Meta-Reinforcement Learning	SiMPL	ICLR22	propose a method that jointly leverages (i) a large offline dataset of prior experience collected across many tasks without reward or task annotations and (ii) a set of meta-training tasks to learn how to quickly solve unseen long-horizon tasks.
Hindsight Foresight Relabeling for Meta-Reinforcement Learning	HFR	ICLR22	focus on improving the sample efficiency of the meta-training phase via data sharing; combine relabeling techniques with meta-RL algorithms in order to boost both sample efficiency and asymptotic performance
CoMPS: Continual Meta Policy Search	CoMPS	ICLR22	first formulate the continual meta-RL setting, where the agent interacts with a single task at a time and, once finished with a task, never interacts with it again
Learning a subspace of policies for online adaptation in Reinforcement Learning	----	ICLR22	consider the setting with just a single train environment; propose an approach where we learn a subspace of policies within the parameter space
Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models and Amortized Policy Search	GSSM	ICML22	consider model-based meta reinforcement learning, which consists of dynamics model learning and policy optimization; develop a graph structured dynamics model with superior generalization capability across tasks
Meta-Learning Hypothesis Spaces for Sequential Decision-making	Meta-KeL	ICML22	argue that two critical capabilities of transformers, reason over long-term dependencies and present context-dependent weights from self-attention, compose the central role of a Meta-Reinforcement Learner; propose Meta-KeL for meta-learning the hypothesis space of a sequential decision task
Biased Gradient Estimate with Drastic Variance Reduction for Meta Reinforcement Learning		ICML22
Transformers are Meta-Reinforcement Learners	TrMRL	ICML22	propose TrMRL, a memory-based meta-Reinforcement Learner which uses the transformer architecture to formulate the learning process
Offline Meta-Reinforcement Learning with Online Self-Supervision		ICML22
Distributional Meta-Gradient Reinforcement Learning		ICLR23
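
The MAML entry above states the main idea as optimizing parameters so that a few gradient steps adapt them to a new task. The toy sketch below illustrates that two-level structure on hypothetical one-dimensional tasks with loss `(theta - center) ** 2`, where the meta-gradient through one inner step can be written analytically. The task centers and step sizes are illustrative assumptions, and this is not a reinforcement learning implementation.

```python
def inner_adapt(theta, center, inner_lr):
    """One gradient step on the task loss (theta - center) ** 2."""
    grad = 2.0 * (theta - center)
    return theta - inner_lr * grad


def meta_gradient(theta, center, inner_lr):
    """Gradient of the post-adaptation loss with respect to the initial theta."""
    adapted = inner_adapt(theta, center, inner_lr)
    # Chain rule: d(adapted)/d(theta) = 1 - 2 * inner_lr for this quadratic loss.
    return 2.0 * (adapted - center) * (1.0 - 2.0 * inner_lr)


def maml_train(centers, inner_lr=0.1, outer_lr=0.1, steps=200, theta=0.0):
    """Find an initialization that adapts well, on average, across all tasks."""
    for _ in range(steps):
        grad = sum(meta_gradient(theta, c, inner_lr) for c in centers) / len(centers)
        theta -= outer_lr * grad
    return theta


theta_star = maml_train([-1.0, 0.0, 4.0])
```

For these symmetric quadratic tasks the learned initialization ends up at the mean of the task centers; with neural network policies the meta-gradient would instead be obtained by automatic differentiation through the inner update.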

Adversarial RL

Title	Method	Conference	Description
Adversarial Attacks on Neural Network Policies	----	ICLR 2017 workshop	first show that existing rl policies coupled with deep neural networks are vulnerable to adversarial noises in white-box and black-box settings
Delving into Adversarial Attacks on Deep Policies	----	ICLR 2017 workshop	show rl algorithms are vulnerable to adversarial noises; show adversarial training can improve robustness
Robust Adversarial Reinforcement Learning	RARL	ICML17	formulate the robust policy learning as a zero-sum, minimax objective function
Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning	Critical Point Attack, Antagonist Attack	AAAI20	critical point attack: build a model to predict the future environmental states and agent’s actions for attacking; antagonist attack: automatically learn a domain-agnostic model for attacking
Safe Reinforcement Learning in Constrained Markov Decision Processes	SNO-MDP	ICML20	explore and optimize Markov decision processes under unknown safety constraints
Robust Deep Reinforcement Learning Against Adversarial Perturbations on State Observations	SA-MDP	NeurIPS20	formalize adversarial attack on state observation as SA-MDP; propose some novel attack methods: Robust SARSA and Maximal Action Difference; propose a defence framework and some practical methods: SA-DQN, SA-PPO and SA-DDPG
Robust Reinforcement Learning on State Observations with Learned Optimal Adversary	ATLA	ICLR21	use rl algorithms to train an "optimal" adversary; alternately train the "optimal" adversary and the robust agent
Robust Deep Reinforcement Learning through Adversarial Loss	RADIAL-RL	NeurIPS21	propose a robust rl framework, which penalizes the overlap between output bounds of actions; propose a more efficient evaluation method (GWC) to measure attack agnostic robustness
Policy Smoothing for Provably Robust Reinforcement Learning	Policy Smoothing	ICLR22	introduce randomized smoothing into RL; propose an adaptive Neyman-Pearson lemma
CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing	CROP	ICLR22	present a framework of Certifying Robust Policies for RL (CROP) against adversarial state perturbations with two certification criteria: robustness of per-state actions and lower bound of cumulative rewards; theoretically prove the certification radius; conduct experiments to provide certification for six empirically robust RL algorithms on Atari
Policy Gradient Method For Robust Reinforcement Learning		ICML22
SAUTE RL: Toward Almost Surely Safe Reinforcement Learning Using State Augmentation		ICML22
Constrained Variational Policy Optimization for Safe Reinforcement Learning		ICML22
Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum		ICML22
Distributionally Robust Q-Learning		ICML22
Robust Meta-learning with Sampling Noise and Label Noise via Eigen-Reptile		ICML22
DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck		ICML22
Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning	----	SCIS 2023	summarize current optimization-based adversarial attacks in RL; propose a two-stage method: train a deceptive policy and mislead the victim to imitate the deceptive policy
On the Robustness of Safe Reinforcement Learning under Observational Perturbations		ICLR23
Consistent Attack: Universal Adversarial Perturbation on Embodied Vision Navigation	Reward UAP, Trajectory UAP	PRL 2023	extend universal adversarial perturbations to sequential decision making and propose Reward UAP as well as Trajectory UAP by utilizing the dynamics; experiment in Embodied Vision Navigation tasks

Generalisation in RL

Environments

Title	Method	Conference	Description
Quantifying Generalization in Reinforcement Learning	CoinRun	ICML19	introduce a new environment called CoinRun for generalisation in RL; empirically show L2 regularization, dropout, data augmentation and batch normalization can improve generalization in RL
Leveraging Procedural Generation to Benchmark Reinforcement Learning	Procgen Benchmark	ICML20	introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning

Methods

Title	Method	Conference	Description
Towards Generalization and Simplicity in Continuous Control	----	NeurIPS17	policies with simple linear and RBF parameterizations can be trained to solve a variety of widely studied continuous control tasks; training with a diverse initial state distribution induces more global policies with better generalization
Universal Planning Networks	UPN	ICML18	study a model-based architecture that performs a differentiable planning computation in a latent space jointly learned with forward dynamics, trained end-to-end to encode what is necessary for solving tasks by gradient-based planning
On the Generalization Gap in Reparameterizable Reinforcement Learning	----	ICML19	theoretically provide guarantees on the gap between the expected and empirical return for both intrinsic and external errors in reparameterizable RL
Investigating Generalisation in Continuous Deep Reinforcement Learning	----	arxiv19	study generalisation in Deep RL for continuous control
Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck	SNI	NeurIPS19	consider regularization techniques relying on the injection of noise into the learned function for improving generalization; aim to maintain the regularizing effect of the injected noise and mitigate its adverse effects on the gradient quality
Network randomization: A simple technique for generalization in deep reinforcement learning	Network Randomization	ICLR20	introduce a randomized (convolutional) neural network that randomly perturbs input observations, which enables trained agents to adapt to new domains by learning robust features invariant across varied and randomized environments
Observational Overfitting in Reinforcement Learning	observational overfitting	ICLR20	discuss realistic instances where observational overfitting may occur and its difference from other confounding factors, and design a parametric theoretical framework to induce observational overfitting that can be applied to any underlying MDP
Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning	CaDM	ICML20	decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it
Improving Generalization in Reinforcement Learning with Mixture Regularization	mixreg	NeurIPS20	train agents on a mixture of observations from different training environments and impose linearity constraints on the observation interpolations and the supervision (e.g. associated reward) interpolations
Instance based Generalization in Reinforcement Learning	IPAE	NeurIPS20	formalize the concept of training levels as instances and show that this instance-based view is fully consistent with the standard POMDP formulation; provide generalization bounds to the value gap in train and test environments based on the number of training instances, and use insights based on these to improve performance on unseen levels
Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning	PSM	ICLR21	incorporate the inherent sequential structure in reinforcement learning into the representation learning process to improve generalization; introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states
Generalization in Reinforcement Learning by Soft Data Augmentation	SODA	ICRA21	impose a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data
Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment	AugWM	ICML21	consider the setting named "dynamics generalization from a single offline environment" and concentrate on the zero-shot performance to unseen dynamics; propose dynamics augmentation for model based offline RL; propose a simple self-supervised context adaptation reward-free algorithm
Decoupling Value and Policy for Generalization in Reinforcement Learning	IDAAC	ICML21	decouples the optimization of the policy and value function, using separate networks to model them; introduce an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment
Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability	LEEP	NeurIPS21	generalisation in RL induces implicit partial observability; propose LEEP to use an ensemble of policies to approximately learn the Bayes-optimal policy for maximizing test-time performance
Automatic Data Augmentation for Generalization in Reinforcement Learning	DrAC	NeurIPS21	focus on automatic data augmentation based on two novel regularization terms for the policy and value function
When Is Generalizable Reinforcement Learning Tractable?	----	NeurIPS21	propose Weak Proximity and Strong Proximity for theoretically analyzing the generalisation of RL
A Survey of Generalisation in Deep Reinforcement Learning	----	arxiv21	provide a unifying formalism and terminology for discussing different generalisation problems
Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL	CTRL	ICLR22	consider zero-shot generalization (ZSG); use self-supervised learning to learn a representation across tasks
The Role of Pretrained Representations for the OOD Generalization of RL Agents	----	ICLR22	train 240 representations and 11,520 downstream policies and systematically investigate their performance under a diverse range of distribution shifts; find that a specific representation metric that measures the generalization of a simple downstream proxy task reliably predicts the generalization of downstream RL agents under the broad spectrum of OOD settings considered here
Generalisation in Lifelong Reinforcement Learning through Logical Composition	----	ICLR22	leverage logical composition in reinforcement learning to create a framework that enables an agent to autonomously determine whether a new task can be immediately solved using its existing abilities, or whether a task-specific skill should be learned
Local Feature Swapping for Generalization in Reinforcement Learning	CLOP	ICLR22	introduce a new regularization technique consisting of channel-consistent local permutations of the feature maps
A Generalist Agent	Gato	arxiv2205	slide
Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk	CPPO	IJCAI22	find the connection between modifying observations and dynamics, which are structurally different
CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer	CtrlFormer	ICML22	jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, where multitask representation can be learned and transferred without catastrophic forgetting
Learning Dynamics and Generalization in Reinforcement Learning	----	ICML22	show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization
Improving Policy Optimization with Generalist-Specialist Learning	GSL	ICML22	aim to use experiences from the specialists to aid the policy optimization of the generalist; identify the phenomenon of “catastrophic ignorance” in multi-task learning
DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck	DRIBO	ICML22	learn robust representations that encode only task-relevant information from observations based on the unsupervised multi-view setting; introduce a novel contrastive version of the Multi-View Information Bottleneck (MIB) objective for temporal data
Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning	GRADER	NeurIPS22	use the causal graph as a latent variable to reformulate the GCRL problem and then derive an iterative training framework from solving this problem
Rethinking Value Function Learning for Generalization in Reinforcement Learning	DCPG, DDCPG	NeurIPS22	consider training agents on multiple training environments to improve observational generalization performance; identify that the value network in the multiple-environment setting is more challenging to optimize; propose regularization methods that penalize large estimates of the value network to prevent overfitting
Masked Autoencoding for Scalable and Generalizable Decision Making	MaskDP	NeurIPS22	employ a masked autoencoder (MAE) to state-action trajectories for reinforcement learning (RL) and behavioral cloning (BC) and gain the capability of zero-shot transfer to new tasks
Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning	PIE-G	NeurIPS22	find that the early layers in an ImageNet pre-trained ResNet model could provide rather generalizable representations for visual RL
GALOIS: Boosting Deep Reinforcement Learning via Generalizable Logic Synthesis		NeurIPS22
Human-Timescale Adaptation in an Open-Ended Task Space	AdA	arXiv 2301	demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans
In-context Reinforcement Learning with Algorithm Distillation	AD	ICLR23 oral	propose Algorithm Distillation for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model
Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories		ICLR23
Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs		ICLR23	show that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better; find that increasing the capacity of the value and policy network is critical to achieve good performance
Investigating Multi-task Pretraining and Generalization in Reinforcement Learning	----	ICLR23	find that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better; this advantage can still be present after fine-tuning for 200M environment frames, not only in zero-shot transfer
Priors, Hierarchy, and Information Asymmetry for Skill Transfer in Reinforcement Learning		ICLR23
Cross-domain Random Pre-training with Prototypes for Reinforcement Learning	CRPTpro	arXiv2302	use prototypical representation learning with a novel intrinsic loss to pre-train an effective and generic encoder across different domains
Reward Informed Dreamer for Task Generalization in Reinforcement Learning	RID	arXiv2303	propose Task Distribution Relevance to capture the relevance of the task distribution quantitatively; propose RID to use world models to improve task generalization via encoding reward signals into policies
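
The mixreg row above describes interpolating observations together with their supervision signals. A minimal sketch of that interpolation, assuming flat list observations and scalar rewards; the names `mix_pair` and `sample_lambda` and the Beta parameter `alpha=0.2` are illustrative assumptions, not taken from the paper's code:

```python
import random


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing coefficient in [0, 1] from Beta(alpha, alpha) (assumed prior)."""
    return rng.betavariate(alpha, alpha)


def mix_pair(obs_i, obs_j, reward_i, reward_j, lam):
    """Convexly combine two observations and their rewards with the same coefficient."""
    mixed_obs = [lam * a + (1.0 - lam) * b for a, b in zip(obs_i, obs_j)]
    mixed_reward = lam * reward_i + (1.0 - lam) * reward_j
    return mixed_obs, mixed_reward


# Usage: mix two toy observations that stand in for different training environments.
obs, rew = mix_pair([0.0, 2.0], [4.0, 6.0], 1.0, 3.0, lam=0.25)
```

Using the same coefficient for the observation and the reward is what ties the two interpolations together; in practice the coefficient would be drawn with `sample_lambda` for each mixed pair.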

RL with Transformer

Title	Method	Conference	Description
Stabilizing transformers for reinforcement learning	GTrXL	ICML20	stabilizing training with a reordering of the layer normalization coupled with the addition of a new gating mechanism to key points in the submodules of the transformer
Decision Transformer: Reinforcement Learning via Sequence Modeling	DT	NeurIPS21	regard RL as a sequence generation task and use transformer to generate (return-to-go, state, action, return-to-go, ...); there is no explicit optimization process; evaluate on Offline RL
Offline Reinforcement Learning as One Big Sequence Modeling Problem	TT	NeurIPS21	regard RL as a sequence generation task and use transformer to generate (s_0^0, ..., s_0^N, a_0^0, ..., a_0^M, r_0, ...); use beam search for inference; evaluate on imitation learning, goal-conditioned RL and Offline RL
Can Wikipedia Help Offline Reinforcement Learning?	ChibiT	arxiv2201	demonstrate that pre-training on autoregressively modeling natural language provides consistent performance gains when compared to the Decision Transformer on both the popular OpenAI Gym and Atari
Online Decision Transformer	ODT	ICML22 oral	blends offline pretraining with online finetuning in a unified framework; use sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning
Prompting Decision Transformer for Few-shot Policy Generalization		ICML22
Multi-Game Decision Transformers	----	NeurIPS22	show that a single transformer-based model trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance
Bootstrapped Transformer for Offline Reinforcement Learning		NeurIPS22
Dichotomy of Control: Separating What You Can Control from What You Cannot		ICLR23 oral
Decision Transformer under Random Frame Dropping		ICLR23
Hyper-Decision Transformer for Efficient Online Policy Adaptation		ICLR23
Preference Transformer: Modeling Human Preferences using Transformers for RL		ICLR23
On the Data-Efficiency with Contrastive Image Transformation in Reinforcement Learning		ICLR23
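
The Decision Transformer row above describes sequences of the form (return-to-go, state, action, ...). A minimal sketch of how such a sequence could be assembled from one logged trajectory, assuming undiscounted returns; the function names and the tagged-tuple token format are hypothetical and only meant for illustration:

```python
def returns_to_go(rewards):
    """Return-to-go at step t is the sum of rewards from t to the end of the episode."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave(rtg, states, actions):
    """Build the (return-to-go, state, action) token order, one triple per timestep."""
    sequence = []
    for g, s, a in zip(rtg, states, actions):
        sequence.extend([("rtg", g), ("state", s), ("action", a)])
    return sequence


# Usage on a toy three-step trajectory.
rtg = returns_to_go([1.0, 0.0, 2.0])
tokens = interleave(rtg, ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

At evaluation time the first return-to-go is set to a desired target return and is reduced by each observed reward, which is why no explicit policy optimization step appears.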

Representation RL

Note: representation learning with MBRL is in the part World Models

Title	Method	Conference	Description
Diversity is All You Need: Learning Skills without a Reward Function	DIAYN	ICLR19	learn diverse skills in environments without any rewards by maximizing an information theoretic objective
CURL: Contrastive Unsupervised Representations for Reinforcement Learning	CURL	ICML20	extracts high-level features from raw pixels using contrastive learning and performs off-policy control on top of the extracted features
Learning Invariant Representations for Reinforcement Learning without Reconstruction	DBC	ICLR21	propose using Bisimulation to learn robust latent representations which encode only the task-relevant information from observations
Decoupling representation learning from reinforcement learning	ATC	ICML21	propose a new unsupervised task tailored to reinforcement learning named Augmented Temporal Contrast (ATC), which borrows ideas from Contrastive learning; benchmark several leading Unsupervised Learning algorithms by pre-training encoders on expert demonstrations and using them in RL agents
Pretraining representations for data-efficient reinforcement learning	SGI	NeurIPS21	consider pretraining with unlabeled data and finetuning on a small amount of task-specific data to improve the data efficiency of RL; employ a combination of latent dynamics modelling and unsupervised goal-conditioned RL
Understanding the World Through Action	----	CoRL21	discuss how self-supervised reinforcement learning combined with offline RL can enable scalable representation learning
URLB: Unsupervised Reinforcement Learning Benchmark	URLB	NeurIPS21	a benchmark for unsupervised reinforcement learning
The Information Geometry of Unsupervised Reinforcement Learning	----	ICLR22	show that unsupervised skill discovery algorithms based on mutual information maximization do not learn skills that are optimal for every possible reward function; provide a geometric perspective on some skill learning methods
The Unsurprising Effectiveness of Pre-Trained Vision Models for Control		ICML22 oral
a mixture of supervised and unsupervised reinforcement learning		NeurIPS22
Contrastive Learning as Goal-Conditioned Reinforcement Learning	Contrastive RL	NeurIPS22	show (contrastive) representation learning methods can be cast as RL algorithms in their own right
Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?	----	NeurIPS22	conduct an extensive comparison of various self-supervised losses under the existing joint learning framework for pixel-based reinforcement learning in many environments from different benchmarks, including one real-world environment
Unsupervised Reinforcement Learning with Contrastive Intrinsic Control	CIC	NeurIPS22	propose to maximize the mutual information between state transitions and latent skill vectors
Reinforcement Learning with Automated Auxiliary Loss Search	A2LS	NeurIPS22	propose to automatically search top-performing auxiliary loss functions for learning better representations in RL; define a general auxiliary loss space of size 7.5 × 10^20 based on the collected trajectory data and explore the space with an efficient evolutionary search strategy
Mask-based Latent Reconstruction for Reinforcement Learning	MLR	NeurIPS22	propose an effective self-supervised method to predict complete state representations in the latent space from the observations with spatially and temporally masked pixels
Look where you look! Saliency-guided Q-networks for visual RL tasks	SGQN	NeurIPS22	propose that a good visual policy should be able to identify which pixels are important for its decision; preserve this identification of important sources of information across images
Choreographer: Learning and Adapting Skills in Imagination		ICLR23 Spotlight
Flow-based Recurrent Belief State Learning for POMDPs	FORBES	ICML22	incorporate normalizing flows into the variational inference to learn general continuous belief states for POMDPs
Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training	VIP	ICLR23 Spotlight	cast representation learning from human videos as an offline goal-conditioned reinforcement learning problem; derive a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos
Latent Variable Representation for Reinforcement Learning		ICLR23
Spectral Decomposition Representation for Reinforcement Learning		ICLR23
Behavior Prior Representation learning for Offline Reinforcement Learning		ICLR23
Provable Unsupervised Data Sharing for Offline Reinforcement Learning		ICLR23
Become a Proficient Player with Limited Data through Watching Pure Videos		ICLR23
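
Several rows above (e.g. CURL, ATC, Contrastive RL) rely on a contrastive objective over paired embeddings. A minimal sketch of an InfoNCE-style loss, assuming plain dot-product similarity and already-encoded vectors; CURL itself uses a bilinear similarity and a momentum key encoder, which are omitted here, and the name `info_nce` is an illustrative assumption:

```python
import math


def info_nce(queries, keys):
    """Mean cross-entropy of matching query i to key i against all other keys."""
    losses = []
    for i, q in enumerate(queries):
        # Similarity of this query to every key; index i is the positive pair.
        logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Usage: aligned pairs give a lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a pixel-based agent the queries and keys would be encodings of two augmentations of the same observation, and the loss would be minimized jointly with the RL objective.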

Continual / Lifelong RL

Title	Method	Conference	Description
Revisiting Curiosity for Exploration in Procedurally Generated Environments		ICLR23

Tutorial and Lesson

Tutorial and Lesson
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Introduction to Reinforcement Learning with David Silver
Deep Reinforcement Learning, CS285
Deep Reinforcement Learning and Control, CMU 10703
RLChina

giakoumidis/Reinforcement-Learning-Papers

Folders and files

Latest commit

History

Repository files navigation

Reinforcement Learning Papers

Contents

Model Free (Online) RL

Classic Methods

Exploration

Off-Policy Evaluation

Soft RL

Bisimulation

Current methods

Model Based (Online) RL

Classic Methods

World Models

CodeBase

(Model Free) Offline RL

Current Methods

Combined with Diffusion Models

Model Based Offline RL

Meta RL

Adversarial RL

Genaralisation in RL

Environments

Methods

RL with Transformer

Representation RL

Continual / Lifelong RL

Tutorial and Lesson

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages