
Some questions about the technology used in the CDT paper #24

Open
Eternity-Wang opened this issue Sep 13, 2024 · 4 comments

Comments

@Eternity-Wang

Hi, can you please explain in detail the reason for relabeling infeasible target return pairs ("Data augmentation by return relabeling" in the paper)? I'm very confused about its relationship with the outlier filtering mentioned in Section D.2 of the paper.

@liuzuxin
Owner

Hi @Eternity-Wang, the intuition behind it is this: during the agent's execution, the specified target reward return and cost return may be infeasible, which can confuse the model. Should it satisfy the target reward by increasing the cost, or satisfy the target cost by reducing the reward? We want the latter, so we synthesize this kind of data to make the model safer in such cases. Regarding the outliers, those are real trajectories whose reward and cost returns are outliers sampled by the model (which rarely happens). Note that in the data augmentation phase we do relabeling, meaning that these trajectories themselves do not have outlier reward and cost returns.
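To make the relabeling intuition concrete, here is a minimal Python sketch. It is not the repository's actual implementation; `traj_returns`, `cost_limit`, and `num_aug` are assumed inputs, and the "nearest safe trajectory" heuristic is only one plausible choice.

```python
# Hypothetical sketch of return-relabeling augmentation (not the repo's actual code).
# traj_returns: array of shape (N, 2), columns = (reward_return, cost_return) per trajectory.
import numpy as np

def augment_infeasible_pairs(traj_returns, cost_limit, num_aug, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    safe = traj_returns[traj_returns[:, 1] <= cost_limit]   # trajectories that respect the cost limit
    max_safe_reward = safe[:, 0].max()                       # best reward achievable while staying safe

    augmented = []
    for _ in range(num_aug):
        # Sample an infeasible target pair: cost within the limit, reward above what safe data achieves.
        target_cost = rng.uniform(0.0, cost_limit)
        target_reward = rng.uniform(max_safe_reward, 1.5 * max_safe_reward)

        # Relabel: attach the infeasible targets to the safe trajectory with the closest cost return,
        # so conditioning on such targets teaches the model to satisfy the cost and sacrifice reward.
        nearest_idx = int(np.argmin(np.abs(safe[:, 1] - target_cost)))
        augmented.append({"reward_target": target_reward,
                          "cost_target": target_cost,
                          "traj_index": nearest_idx})
    return augmented
```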

@Eternity-Wang
Author

Eternity-Wang commented Sep 14, 2024

Hi @liuzuxin, thanks for your quick and helpful reply. My understanding based on your description is as follows.

  1. If the target reward return and target cost return set during the execution phase are infeasible, e.g., the target reward is greater than the reward of the RF points under a certain cost threshold, only then do we need to modify them with the help of data augmentation to ensure that safety is prioritized.
  2. To be able to use data augmentation, we need to have access to the training dataset at execution time; otherwise it would not be possible to look for the nearest safe trajectory.
  3. Outlier filtering occurs after the dataset has been constructed but before training, whereas data augmentation occurs after training.

I hope my understanding matches what you intended to express.

@Ja4822
Collaborator

Ja4822 commented Sep 16, 2024

Hi @Eternity-Wang, it seems there may have been some misunderstanding. Just to clarify: both outlier filtering and data augmentation are performed prior to training. During the execution phase, neither the target reward return nor the target cost return is modified. The user sets the values of the target reward and cost return, and these are passed directly to the trained agent to evaluate its performance.
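For illustration, here is a minimal sketch of such an evaluation loop, assuming a Gymnasium-style environment that reports cost via `info["cost"]` and a hypothetical `agent.act(obs, target_reward, target_cost)` interface; it is not the repository's actual evaluation script.

```python
# Hypothetical evaluation loop: the user-chosen target returns are passed to the agent unchanged.
def evaluate(agent, env, target_reward, target_cost, max_steps=1000):
    obs, info = env.reset()
    ep_reward, ep_cost = 0.0, 0.0
    for _ in range(max_steps):
        # Condition the trained agent directly on the user-specified target returns.
        action = agent.act(obs, target_reward, target_cost)
        obs, reward, terminated, truncated, info = env.step(action)
        ep_reward += reward
        ep_cost += info.get("cost", 0.0)
        if terminated or truncated:
            break
    return ep_reward, ep_cost   # compare against the requested targets
```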

@Eternity-Wang
Author

Eternity-Wang commented Sep 17, 2024

Thank you for your helpful response; I think I now understand the stages at which outlier filtering and data augmentation are performed and the roles they play. But I still have some questions about the data augmentation, and I hope you can help me better understand the idea and insight you want to present:

  1. Why does it enable the policy to be safety-first?
  2. Can data augmentation be interpreted as a modification of the original RF value at a certain cost threshold k, since the algorithm seems to add augmented trajectories whose target cost return equals the cost threshold k but whose target reward return is greater than the original RF value?
