
Some questions about the technology used in the CDT paper #24

Open
Eternity-Wang opened this issue Sep 13, 2024 · 4 comments

Comments

@Eternity-Wang

Hi, can you please explain in detail the reason for relabeling infeasible target return pairs ("Data augmentation by return relabeling" in the paper)? I'm very confused about its relationship with the outlier filtering mentioned in Section D.2 of the paper.

@liuzuxin
Owner

Hi @Eternity-Wang, the intuition behind it is this: during the agent's execution, the specified target reward return and cost return may be infeasible, which can confuse the model. Should it satisfy the target reward by increasing the cost, or satisfy the target cost by reducing the reward? We want the latter, so we synthesize this kind of data to make the model safer in such cases. Regarding the outliers, those are real trajectories whose reward and cost returns are outliers sampled by the model (which rarely happens). Note that in the data augmentation phase we do relabeling, meaning that these trajectories themselves do not have outlier reward and cost returns.
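To make the relabeling intuition concrete, here is a minimal Python sketch. It is not the repository's actual implementation; `traj_returns`, `cost_limit`, and `num_aug` are assumed inputs, and the "nearest safe trajectory" heuristic is only one plausible choice.

```python
# Hypothetical sketch of return-relabeling augmentation (not the repo's actual code).
# traj_returns: array of shape (N, 2), columns = (reward_return, cost_return) per trajectory.
import numpy as np

def augment_infeasible_pairs(traj_returns, cost_limit, num_aug, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    safe = traj_returns[traj_returns[:, 1] <= cost_limit]   # trajectories that respect the cost limit
    max_safe_reward = safe[:, 0].max()                       # best reward achievable while staying safe

    augmented = []
    for _ in range(num_aug):
        # Sample an infeasible target pair: cost within the limit, reward above what safe data achieves.
        target_cost = rng.uniform(0.0, cost_limit)
        target_reward = rng.uniform(max_safe_reward, 1.5 * max_safe_reward)

        # Relabel: attach the infeasible targets to the safe trajectory with the closest cost return,
        # so conditioning on such targets teaches the model to satisfy the cost and sacrifice reward.
        nearest_idx = int(np.argmin(np.abs(safe[:, 1] - target_cost)))
        augmented.append({"reward_target": target_reward,
                          "cost_target": target_cost,
                          "traj_index": nearest_idx})
    return augmented
```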

@Eternity-Wang
Author

Eternity-Wang commented Sep 14, 2024

Hi @liuzuxin, thanks for your quick and helpful reply. My understanding based on your description is as follows.

  1. If the target reward return and target cost return set during the execution phase are infeasible, e.g., the target reward is greater than the reward of the RF points under a certain cost threshold, only then do we need to modify them with the help of data augmentation to ensure that safety is prioritized.
  2. To be able to use data augmentation, we need to have access to the training dataset at execution time; otherwise it would not be possible to look for the nearest safe trajectory.
  3. Outlier filtering occurs after the dataset has been constructed but before training, whereas data augmentation occurs after training.

I hope my understanding matches what you intended to express.

@Ja4822
Collaborator

Ja4822 commented Sep 16, 2024

Hi @Eternity-Wang, it seems there may have been some misunderstanding. Just to clarify: both outlier filtering and data augmentation are performed prior to training. During the execution phase, neither the target reward return nor the target cost return is modified. The user sets the values of the target reward and cost return, and these are passed directly to the trained agent to evaluate its performance.
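For illustration, here is a minimal sketch of such an evaluation loop, assuming a Gymnasium-style environment that reports cost via `info["cost"]` and a hypothetical `agent.act(obs, target_reward, target_cost)` interface; it is not the repository's actual evaluation script.

```python
# Hypothetical evaluation loop: the user-chosen target returns are passed to the agent unchanged.
def evaluate(agent, env, target_reward, target_cost, max_steps=1000):
    obs, info = env.reset()
    ep_reward, ep_cost = 0.0, 0.0
    for _ in range(max_steps):
        # Condition the trained agent directly on the user-specified target returns.
        action = agent.act(obs, target_reward, target_cost)
        obs, reward, terminated, truncated, info = env.step(action)
        ep_reward += reward
        ep_cost += info.get("cost", 0.0)
        if terminated or truncated:
            break
    return ep_reward, ep_cost   # compare against the requested targets
```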

@Eternity-Wang
Author

Eternity-Wang commented Sep 17, 2024

Thank you for your helpful response; I think I now understand the stages at which outlier filtering and data augmentation are performed and the roles they play. But I still have some questions about the data augmentation, and I hope you can help me better understand the idea and insight you want to present:

  1. Why does it enable the policy to be safety-first?
  2. Can data augmentation be interpreted as a modification of the original RF value at a certain cost threshold k, since the algorithm seems to add augmented trajectories whose target cost return equals the cost threshold k but whose target reward return is greater than the original RF value?
