This codebase accompanies the paper "Benefits of Assistance over Reward Learning". This branch can be used to reproduce the experiments from the paper, and uses deep RL solvers from Stable Baselines 2 (based on TensorFlow 1).
The sb3 branch is a port of our codebase to the PyTorch-based Stable Baselines 3.
pip install -e .
python -m assistance_games.run
You should see the environment below:
Meal choice environment (Section 4.1):
# H comes home early
python -m assistance_games.run -e mealchoice -a pbvi
# H comes home late
python -m assistance_games.run -e mealchoice -k feedback_time:3 -a pbvi
Wormy apples environment (Section 4.2):
# Regular assistance
python -m assistance_games.run -e worms -a pbvi
# Two phase
python -m assistance_games.run -e worms -k two_phase:True -a pbvi
# With a lower discount
python -m assistance_games.run -e worms -a pbvi -k discount:0.9
python -m assistance_games.run -e worms -a pbvi -k two_phase:True,discount:0.9
Cake or pie environment (Section 4.3):
python -m assistance_games.run -e cake_or_pie -a dqn -nr -o pedagogic_human --seed 0
The five seeds used in the paper are 0 through 4.
Running headless: use the xvfb-run -a command, e.g. for the cake or pie environment:
xvfb-run -a python -m assistance_games.run -e cake_or_pie -a dqn -nr -o pedagogic_human --seed 0
Plotting deep RL training curves: use the src/assistance_games/plot_eval_stats.ipynb notebook.
Contains files with the core classes, such as:
- pomdp.py : Abstract class for POMDPs; inherits from gym.Env (see the sketch after this list).
- assistance.py : Defines the overarching abstract classes for assistance problems, as well as subclasses that provide additional methods that can be used to achieve speedups.
- reduction.py : Implements the reduction from assistance problems to POMDPs, as described in the paper. Multiple versions of the reduction are available, depending on how the resulting POMDP is meant to be used (e.g. do we need full transition matrices, or just a method that computes the next state given the current state and actions?).
- distributions.py : Defines classes for probability distributions. We rolled our own implementation instead of using an existing one.
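To make the shape of a POMDP-as-gym.Env concrete, here is a minimal, hypothetical sketch. The class, method, and parameter names below are illustrative assumptions, not the interface actually defined in pomdp.py.

```python
# Hypothetical sketch only: pomdp.py defines its own abstract interface; the
# names below (TwoStatePOMDP, obs_noise, etc.) are made up for illustration.
import numpy as np
import gym
from gym import spaces


class TwoStatePOMDP(gym.Env):
    """Tiny POMDP: the hidden state is 0 or 1, observations are noisy readings of it."""

    def __init__(self, obs_noise=0.1, horizon=10):
        self.observation_space = spaces.Discrete(2)
        self.action_space = spaces.Discrete(2)
        self.obs_noise = obs_noise
        self.horizon = horizon

    def reset(self):
        self.state = np.random.randint(2)
        self.t = 0
        return self._observe()

    def _observe(self):
        # Observation equals the hidden state with probability 1 - obs_noise.
        flip = np.random.rand() < self.obs_noise
        return (1 - self.state) if flip else self.state

    def step(self, action):
        # Reward 1 for guessing the hidden state, 0 otherwise.
        reward = float(action == self.state)
        self.t += 1
        done = self.t >= self.horizon
        return self._observe(), reward, done, {}
```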
Contains various implementations of assistance problems and POMDPs.
Rendering utils for envs.
Implements POMDP solvers that can be used to solve the environments.
- exact_vi : Exact solver, can only solve very small environments
- pbvi : Approximate anytime solver, relatively fast for medium-sized environments
- deep_rl_solve : Simple wrapper around stable_baselines.PPO2
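For reference, the sketch below shows the textbook point-based value iteration backup that pbvi-style solvers are based on, assuming a tabular POMDP given by arrays T, O, and R. It is not the implementation in this repo, and it omits the belief-set expansion step of full PBVI.

```python
# Textbook PBVI backup over a fixed set of belief points (a sketch, not the
# repo's solver). Assumes a tabular POMDP with:
#   T[a, s, s'] : transition probabilities
#   O[a, s', o] : observation probabilities for the *next* state s'
#   R[s, a]     : rewards
#   beliefs[b, s] : belief points to back up
import numpy as np


def pbvi(T, O, R, beliefs, discount=0.95, num_iter=50):
    """Returns a set of alpha-vectors approximating the optimal value function."""
    num_actions, num_states, _ = T.shape
    num_obs = O.shape[2]
    alphas = np.zeros((1, num_states))  # start from the zero value function

    for _ in range(num_iter):
        new_alphas = []
        for b in beliefs:
            best_value, best_alpha = -np.inf, None
            for a in range(num_actions):
                # Backed-up vector for taking action a at belief b.
                alpha_a = R[:, a].astype(float)
                for o in range(num_obs):
                    # candidates[i, s] = sum_{s'} T[a,s,s'] O[a,s',o] alphas[i,s']
                    candidates = discount * alphas @ (T[a] * O[a, :, o]).T
                    # Keep the candidate that is best at this belief point.
                    alpha_a = alpha_a + candidates[np.argmax(candidates @ b)]
                value = alpha_a @ b
                if value > best_value:
                    best_value, best_alpha = value, alpha_a
            new_alphas.append(best_alpha)
        alphas = np.unique(np.array(new_alphas), axis=0)
    return alphas
```

More iterations give a better approximation of the value function, which is why the note on PBVI below suggests increasing the iteration count when results vary between runs.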
Parser for .pomdp files (mostly for testing/benchmarking solvers).
Simple script to run an environment with a specific solver and evaluate the resulting policy.
Some utils.
Very basic smoke tests.
If running PBVI on your task gives different results each time, you might want to increase the number of iterations to make sure it finds the optimal solution. Some of the environments are sensitive to this.
Normally, in POMDPs, we use observation probabilities O(o | s). However, assistance problems often give rise to a very particular type of POMDP, in which the original state space is fully observable and the dynamics are deterministic, with the only uncertainty being over the reward; and since all the information about the reward is contained in the human's actions, we can instead treat just the human's actions as observations. This change is implemented in reduction.py and is automatically selected for you if you use run.py and pass the appropriate fully_observable and deterministic flags during environment creation (as is done in our environments). This greatly reduces the complexity of the exact solver and of PBVI. For example, in RedBlue, it reduces an exponent in the complexity from 24 to 2, changing the problem from intractable to solvable in seconds.