Draft: Adding stable baselines PPO agent #3
base: master
Conversation
Hey Corey, thanks a lot for your contribution! The main problem with encapsulating sb3 algorithms is that we're not capable of directly exposing internals (e.g. loss) the way Avalanche does, which limits the customization capabilities of Plugins. This is why a2c and dqn have been implemented from scratch. Regarding your changes and rightful concerns:
Overall, I think that if we do want to support sb3 algorithms we should make a separate class which addresses the problems you mentioned, specifying the limits in customization (e.g. no optimizer, criterion or loss fiddling), and your implementation could serve as a base for that. On the other hand, I'm unsure whether having a less flexible SBStrategy could have advantages over re-writing PPO to fit RLBaseStrategy guidelines and possibly adding missing features to our base class (like the Transition object). P.S. A workaround to the
Hey @coreylowman, thanks for the PR! I need a bit of time to look into it but I can give some high-level comments.
Like @NickLucche said, SB3 doesn't give access to some attributes that we use in Avalanche. Therefore, we will probably need a custom strategy. Note that we did some work in Avalanche and we are experimenting with templates, which allow defining different training loops, with different events/callbacks and internal state. I would prefer to limit the number of them since they add a bit of complexity, but I think RL is a place where we will need a lot of customization, which is now supported by the main avalanche branch.
Maybe a clean solution would be to provide all these attributes inside the experience and do the configuration before each experience. Alternatively, we can put the attributes in the benchmark object and pass that object to the strategy.
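A minimal sketch of the first option (configuring before each experience); the attribute and method names here are assumptions for illustration, not the actual avalanche-rl API:

from stable_baselines3 import PPO

class SB3PPOStrategy:
    # Illustrative only: `experience.environment` is an assumed attribute.
    def __init__(self):
        self.model = None

    def before_training_exp(self, experience, **kwargs):
        env = experience.environment
        if self.model is None:
            # Build the sb3 model lazily, once the env (and its spaces) is known.
            self.model = PPO("MlpPolicy", env, verbose=0)
        else:
            # Point the existing model at the new experience's environment.
            self.model.set_env(env)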
Of course, the ideal solution would be to have PPO implemented inside Avalanche so that we have full control of it and we can easily make it compatible with the rest of the library. The problem is that:
So, if the wrapper is not too expensive to maintain, I think it can be a good middle ground. Of course this only makes sense if we can support at least part of the avalanche features with the wrapper, such as:
Hey all, thanks for the feedback!
Since the strategy has access to the
I think so! We found that passing the vectorized transitions to a strategy, and letting it handle them however it wants, was easier for both parties.
Yes, the rollout function expects only vectorized environments now; I had changed make_eval_env() to vectorize too. We should annotate the env passed into rollout as VectorEnv instead of Env.
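For concreteness, a signature-level sketch of that annotation (parameter names follow the training-loop snippet later in this thread; the body is elided):

from gym.vector import VectorEnv

class SomeRLStrategy:
    # Only the signature matters here.
    def rollout(self, env: VectorEnv, n_rollouts: int, max_steps: int = -1):
        # env is always vectorized: step() takes a batch of actions and
        # returns batched observations, rewards and dones.
        ...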
I believe it's possible to make a generic wrapper for on-policy and off-policy algorithms separately, but not for a base algorithm, due to the differences in their collect_rollouts() functions. The code in
I think including the spaces in the experience would be useful, but most algorithms will need access to the spaces when they construct the model. In our framework we just pulled the spaces from the first experience in the scenario, which seemed to work. I don't think num_envs belongs there because that really depends on what computer you are running on.
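A small sketch of that pattern; the scenario/experience attribute names here are assumptions for illustration, not the actual avalanche-rl API:

def build_model_for_scenario(scenario, make_model):
    # Assumed attribute names (train_stream, environment); the real scenario API may differ.
    first_env = scenario.train_stream[0].environment
    # Most algorithms need the spaces up front to size the network.
    return make_model(first_env.observation_space, first_env.action_space)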
Agree I think many will be incompatible at this point. We could probably extract the nn.Module from the underlying SB3 objects, but getting hooks into the SB3 training function may be difficult.
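For reference, pulling the network out of an sb3 object is straightforward, since the algorithm's policy is a regular torch module; getting hooks into the training function is the hard part. A quick illustration, assuming sb3 1.x with gym:

import gym
import torch.nn as nn
from stable_baselines3 import PPO

model = PPO("MlpPolicy", gym.make("CartPole-v1"), verbose=0)

# The trainable network is a regular torch module, so plugins that only need
# the parameters (e.g. for regularization) could operate on it directly.
assert isinstance(model.policy, nn.Module)
params = list(model.policy.parameters())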
Ok, I've taken a more in-depth look at the PR and I would propose a few modifications in order to comply with the current structure of RLStrategy and use that as a superclass, while I would postpone the more "impactful" changes. Let me know if you want to help contribute to these changes:
for self.timestep in range(self.current_experience_steps.value):
    self.before_rollout(**kwargs)
    self.rollouts = self.rollout(
        env=self.environment, n_rollouts=self.rollouts_per_step,
        max_steps=self.max_steps_per_rollout)
    self.after_rollout(**kwargs)
    for self.update_step in range(self.updates_per_step):
        self.update(self.rollouts)
    # periodic evaluation
    self._periodic_eval(eval_streams, do_final=False)
@coreylowman let me know whether this proposal makes sense to you and whether you see additional difficulties for PPO and other on-policy methods. We could have a small hierarchy made of SB3OnPolicy and an SB3PPO subclass if necessary (although it would be cool to just implement the former and have all on-policy algorithms working).
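To make that concrete, a purely illustrative skeleton of such a hierarchy (class and attribute names are placeholders, and the RLBaseStrategy plumbing is omitted):

from stable_baselines3 import A2C, PPO

class SB3OnPolicyStrategy:            # would subclass RLBaseStrategy in practice
    """Generic wrapper around any sb3 on-policy algorithm."""

    def __init__(self, algo_cls, policy="MlpPolicy", **algo_kwargs):
        self.algo_cls = algo_cls      # e.g. PPO or A2C
        self.policy = policy
        self.algo_kwargs = algo_kwargs
        self.model = None             # built lazily once an environment is known

class SB3PPO(SB3OnPolicyStrategy):
    def __init__(self, **algo_kwargs):
        super().__init__(PPO, **algo_kwargs)

class SB3A2C(SB3OnPolicyStrategy):
    def __init__(self, **algo_kwargs):
        super().__init__(A2C, **algo_kwargs)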
Do you have a sense for how easy it would be to rework the existing algorithms into this SB3 structure? If you think the structure is good, it'd probably be less work overall to move the existing algorithms over to it now; perhaps do that update first, before any of the other updates for this PR? I think the rest of what you say makes sense! I do think just doing an OnPolicy class would be straightforward.
I think I'd rather keep sb3 and avalanche-rl algorithms separated, due to the limitations in flexibility of the former.
I'll post an update ASAP on this PPO thread implementing the points I mentioned; hopefully we will be able to generalize that to an on-policy class.
Ok, I've refactored your implementation in this branch: https://github.com/ContinualAI/avalanche-rl/blob/pr/avalanche_rl/training/strategies/sb3_onpolicy_strategy.py. Let me know what you think of it. There are many rough edges at the moment, but the core idea is the following: we handle the model and the rollouts/transitions collection from the "outside", while we let sb3 compute the loss and the optimization step internally. A few things still need to be worked out in order to have the bare-functionalities version.
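A rough standalone sketch of that split (not taken from the linked branch; it assumes sb3 1.x, a single CartPole env, and it simplifies the rollout-buffer bookkeeping, e.g. no timeout bootstrapping):

import gym
import numpy as np
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.logger import configure
from stable_baselines3.common.utils import obs_as_tensor
from stable_baselines3.common.vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
model = PPO("MlpPolicy", venv, n_steps=128, verbose=0)
model.set_logger(configure())  # train() logs internally, so it needs a logger

obs = venv.reset()
episode_starts = np.ones((venv.num_envs,), dtype=bool)
model.rollout_buffer.reset()

# 1) Collect transitions "from the outside" (this part would live in the strategy).
for _ in range(model.n_steps):
    with torch.no_grad():
        actions, values, log_probs = model.policy(obs_as_tensor(obs, model.device))
    actions = actions.cpu().numpy()
    new_obs, rewards, dones, infos = venv.step(actions)
    # sb3 stores discrete actions with an explicit trailing dimension.
    model.rollout_buffer.add(obs, actions.reshape(-1, 1), rewards,
                             episode_starts, values, log_probs)
    obs, episode_starts = new_obs, dones

# 2) Let sb3 compute the loss and run the optimization step internally.
with torch.no_grad():
    _, last_values, _ = model.policy(obs_as_tensor(obs, model.device))
model.rollout_buffer.compute_returns_and_advantage(last_values=last_values, dones=dones)
model.train()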
Here is a running example of a stable baselines 3 PPO agent. It requires a more custom strategy class, so this does not inherit from the RLBaseStrategy in the avalanche-rl repo.

Main changes:
- choose_actions(Observations) -> Actions, called in both training & evaluation to get actions for specific observations. This replaces RLBaseStrategy.sample_actions(), which was only used during training.
- receive_transitions(Vectorized[Transitions]), called during rollout with the result of calling env.step(actions). This replaces the Rollout object that was previously returned from rollout. This is also where strategies can put training logic.
- rollout() can now be used in both training_exp() and eval_exp(); it generates episode data rather than returning rollouts.
- Timestep as experience limits and periodic eval frequencies.

Things I'm unsure about:
- What has to be passed to BaseStrategy? Will there be support for Strategies that manage those things themselves rather than passing them to BaseStrategy?

My current thinking for the way ahead is:
- RLBaseStrategy, or a new base strategy class
- choose_actions() and receive_transitions() would be abstract methods of the new base class (a sketch of this interface follows below)
- concrete strategies implement choose_actions() and receive_transitions().
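A hedged sketch of that proposed interface; the class name and type hints are illustrative placeholders, since the PR does not pin down concrete types for observations and transitions:

from abc import ABC, abstractmethod
import numpy as np

class RLStrategyBase(ABC):
    # Placeholder name for "RLBaseStrategy, or a new base strategy class".

    @abstractmethod
    def choose_actions(self, observations: np.ndarray) -> np.ndarray:
        """Map (vectorized) observations to actions, in both training and evaluation."""

    @abstractmethod
    def receive_transitions(self, transitions) -> None:
        """Receive the vectorized result of env.step(actions) during rollout;
        per-strategy training logic goes here."""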