Deep reinforcement learning from human preferences #49

nagataka opened this issue Sep 17, 2023 · 0 comments

Summary

Link

Deep reinforcement learning from human preferences

Author/Institution

OpenAI/DeepMind

What is this

  • Proposes a way to fit a reward function from human preferences, and then performs RL on the learned reward to optimize a policy that maximizes the cumulative (predicted) reward.

Comparison with previous research. What are the novelties/good points?

Key points

  • segments generally begin from different states
    • (This is a very important assumption for collecting a diverse set of segments)
  • The human overseer is given a visualization of two trajectory segments, in the form of short movie clips (between 1 and 2 seconds long in all of the experiments), and indicates that:
    • one segment is preferable,
    • the two segments are equally preferable, or
    • the two segments are incomparable
  • Compute the preference probability as a softmax over the summed predicted rewards of the two segments (see eq. 1, and the sketch after this list)
    • With 10% probability the rater's response is assumed to be uniformly random; this models human annotation error
  • Use an ensemble of reward predictors (ensemble of 3)
    • Calculate the standard deviation of the ensemble members' predictions and send the highest-variance clip pairs to the human raters (also sketched below)
  • The RL part itself is very basic/straightforward
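
A minimal sketch of how I read the preference model (eq. 1) and the ensemble-based query selection, in PyTorch. The class/function names, network shape, and tensor layout are my own placeholders rather than the paper's code; the assumption is that a segment's score is the sum of predicted rewards over its steps, and eq. 1 is the softmax over the two scores.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps one (observation, action) pair to a scalar reward estimate r_hat."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_prob(reward_net: RewardNet, seg1, seg2) -> torch.Tensor:
    """Eq.-1 style probability that segment 1 is preferred over segment 2.

    Each segment is an (obs, act) pair of shape (T, obs_dim) / (T, act_dim);
    the segment score is the sum of predicted rewards over its T steps.
    """
    score1 = reward_net(*seg1).sum()
    score2 = reward_net(*seg2).sum()
    p1 = torch.softmax(torch.stack([score1, score2]), dim=0)[0]
    # Assume a 10% chance the rater responds uniformly at random (label noise).
    return 0.9 * p1 + 0.1 * 0.5

def preference_loss(reward_net: RewardNet, seg1, seg2, mu1: float) -> torch.Tensor:
    """Cross-entropy between the predicted and the human-provided preference.

    mu1 is the probability the human assigns to segment 1: 1.0, 0.0, or 0.5
    for "equally preferable"; incomparable pairs are simply not used.
    """
    p1 = preference_prob(reward_net, seg1, seg2)
    return -(mu1 * torch.log(p1) + (1.0 - mu1) * torch.log(1.0 - p1))

def select_queries(ensemble, candidate_pairs, n_queries: int):
    """Send the clip pairs the ensemble (of 3) disagrees on most to the raters."""
    with torch.no_grad():
        stds = [
            torch.stack([preference_prob(m, s1, s2) for m in ensemble]).std().item()
            for s1, s2 in candidate_pairs
        ]
    ranked = sorted(range(len(candidate_pairs)), key=lambda i: stds[i], reverse=True)
    return [candidate_pairs[i] for i in ranked[:n_queries]]
```

Using the summed predicted rewards as segment scores makes eq. 1 a Bradley–Terry-style comparison model: the more one clip's predicted return exceeds the other's, the closer the predicted preference probability gets to 1.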

How did the authors prove the effectiveness of the proposal?

  • Experiments using OpenAI Gym (simulated robotics and Atari)
  • 700 queries to human raters for the simulated robotics tasks
  • 5,500 queries for the Atari tasks

Any discussions?

What should I read next?
