What is the role of the actor network in the training of a PPO agent? #918

101AlexMartin · 2024-02-23T23:40:16Z

Hi,

This question might be already answered, but I was unable to spot it.
After checking the ppo_agent.py, it seems unclear to me what is the role of the actor network in the training of a PPO agent. I see the value network is important for the calculation of the advantages, but I don't see how it interacts with the actor network.

Thanks in advance for the answer :)

Answered by SMH17

SMH17 · 2024-02-25T16:19:22Z

Basically, the actor network determines which actions the agent takes in its environment. Training adjusts its parameters so that actions with a higher estimated advantage become more likely.

The actor network defines the agent's policy. Once it has selected actions, the value network estimates the expected cumulative reward from the visited states (the value prediction; see the comment at the top of ppo_agent.py). Those estimates are used to judge how much better or worse the selected actions turned out than expected, and the actor network parameters are adjusted to make the better actions more likely.

In summary

The actor network (actor_net) takes the current state as input and selects an action in the environment.
The value network estimates the expected future reward from that state under the agent's current policy.
The advantage of each action is then computed with Generalized Advantage Estimation (GAE): a discounted sum of temporal-difference errors, each of which compares the reward received plus the discounted value estimate of the next state against the value estimate of the current state.
The policy parameters (the actor network weights, which determine the likelihood of each action) are updated according to the computed advantage estimates.
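
The advantage step in the summary above can be sketched in a few lines. This is a minimal pure-Python illustration, not the code in ppo_agent.py: the function names `gae_advantages` and `value_targets`, the default `gamma` and `lam`, and the single-trajectory layout are all assumptions made for the example.

```python
def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (hypothetical helper).

    rewards[t]       reward received after the action taken at step t
    values[t]        value-network estimate V(s_t)
    bootstrap_value  V of the state after the last step (0.0 if terminal)
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    # Walk backwards so each step can reuse the sum already built for later steps.
    for t in reversed(range(len(rewards))):
        # Temporal-difference error: what happened versus what was predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def value_targets(advantages, values):
    """Return targets for the value network: advantage plus the old estimate."""
    return [a + v for a, v in zip(advantages, values)]


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
# With gamma = lam = 1 the advantage is simply (return - value): [2.5, 1.5, 0.5]
print(gae_advantages(rewards, values, bootstrap_value=0.0, gamma=1.0, lam=1.0))
```

Note that nothing in this computation involves the actor parameters: the advantages are numbers derived from collected rewards and value predictions. The sketch also assumes a single trajectory with no episode boundary in the middle.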

5 replies

101AlexMartin Feb 25, 2024
Author

Thanks for your answer.
If I'm not mistaken, the only trace of the actor network in the training step of the PPO agent is the values of the actions and, consequently, the rewards, is that right? What I didn't fully understand was the relation between the loss function and the actor network parameters (i.e. line 947 of ppo_agent.py). For instance, I see a clear application of the value network in the training (see line 770 of ppo_agent.py), but I don't see any call to the actor network.
From your answer I understand (correct me if I'm wrong) that these actor parameters will be intrinsically present in the value_loss via the actual return obtained for the action proposed by the actor, and in the policy_loss via the advantage term, is that right?

SMH17 Feb 25, 2024

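The actor network parameters are involved in the policy loss, but not only through the advantage term. The advantages and returns are computed from the collected data and act as fixed numbers when the losses are differentiated.

The policy loss also uses the probability that the current policy assigns to the actions that were taken, compared with the probability recorded when they were collected. That probability comes from the actor network, so the network is evaluated through the agent's policy during training, which is probably why you don't see a call that names the actor network directly. The value loss, in turn, is what trains the value network.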
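
To make the dependence on the actor parameters concrete, here is a small sketch of the probability ratio used in PPO's policy objective. It is an illustration under assumptions, not the ppo_agent.py code: a two-action softmax over `logits` stands in for the actor network output, the helper names are hypothetical, and clipping is left out to keep the point visible.

```python
import math


def action_log_prob(logits, action):
    """Log-probability of `action` under a softmax over `logits`."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[action] - log_norm


def importance_ratio(current_logits, action, old_log_prob):
    """pi_current(action) / pi_old(action), computed in log space."""
    return math.exp(action_log_prob(current_logits, action) - old_log_prob)


# Recorded at collection time: the old policy picked action 0 with probability 0.5.
old_log_prob = action_log_prob([0.0, 0.0], action=0)
advantage = 1.0  # a constant computed from collected data, not from the actor

# Same actor parameters as at collection time: the ratio is 1.
ratio_same = importance_ratio([0.0, 0.0], 0, old_log_prob)

# Actor parameters nudged to favour action 0: the ratio, and with it the
# (unclipped) surrogate objective ratio * advantage, goes up.
ratio_updated = importance_ratio([0.1, 0.0], 0, old_log_prob)

print(ratio_same * advantage, ratio_updated * advantage)
```

The advantage only scales the objective. The gradient reaches the actor parameters through the ratio, which requires evaluating the current policy on the stored observations.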

101AlexMartin Feb 25, 2024
Author

What is the difference between the value_loss and the policy_loss? I see both of them are proportional to the advantage

SMH17 Feb 25, 2024

Both are needed for training, but they focus on different aspects.

value_loss compares the predicted value of a state with the actual return the agent obtained starting from it and following its current policy. Minimizing it reduces the error in the value function estimate; its purpose is an accurate value estimate, not a change in the action probabilities.

policy_loss weights the advantage (how much better or worse the outcome of a state-action pair was than the value estimate) by how much the current policy's probability of that action has changed since the data was collected. Minimizing it makes actions with positive advantage more likely and actions with negative advantage less likely.
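
The two losses can be put side by side in a short sketch. This is a simplified illustration with hypothetical function names and a default clip range of 0.2 chosen for the example; the losses in ppo_agent.py may add coefficients, clipping of the value loss, or regularization terms that are not shown here.

```python
def value_loss(predicted_values, returns):
    """Mean squared error between value predictions and return targets."""
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, returns)]
    return sum(errors) / len(errors)


def policy_loss(ratios, advantages, clip=0.2):
    """Negative PPO clipped surrogate objective, averaged over samples.

    ratios[i] is pi_current(a_i | s_i) / pi_old(a_i | s_i).
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(ratios)


# Depends only on the value predictions and the observed returns.
print(value_loss([1.0, 2.0], [1.0, 4.0]))

# Depends on the action probabilities (through the ratios) and the advantages.
print(policy_loss([1.0, 1.5], [1.0, -1.0]))
```

Both losses use quantities derived from the same returns, which is why they look related, but they are differentiated with respect to different outputs: the value prediction in one case and the action probabilities in the other.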

SMH17 Mar 25, 2024

@101AlexMartin You have closed it but not marked as "answered". XD

SMH17 · 2024-03-25T17:15:32Z

@101AlexMartin If you don't need further clarification about the role of the actor network in the training of a PPO agent, please mark the question as answered. Thanks.

0 replies
