
[bug in the document] dataset format for RewardTrainer #2164

Closed
1 of 4 tasks
yananchen1989 opened this issue Oct 3, 2024 · 25 comments
Assignees
Labels
🐛 bug Something isn't working 📚 documentation Improvements or additions to documentation ⏳ needs more info Additional information or clarification is required to proceed

Comments

@yananchen1989

System Info

trl version > v0.11

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

According to the official document https://huggingface.co/docs/trl/main/en/reward_trainer , the [RewardTrainer] requires an [implicit prompt preference dataset].

However, the example script https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py#L18 uses trl-lib/ultrafeedback_binarized, which is not a so-called "implicit prompt preference dataset", since the prompt is explicitly provided in the dataset.

Could you look into this conflict?
Thanks.

Expected behavior

Alignment between the code and the documentation.

@yananchen1989 yananchen1989 added the 🐛 bug Something isn't working label Oct 3, 2024
@yananchen1989
Author

Here I test Anthropic/hh-rlhf and trl-lib/ultrafeedback_binarized as the dataset_name, but neither works.

(I did not change anything in reward_modeling.py, which is cloned directly from the trl repo.)

CUDA_VISIBLE_DEVICES=0 python ~/trl/examples/scripts/reward_modeling.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name  ${ds} \
    --output_dir Qwen2-0.5B-Reward-LoRA \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-4 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16

Traceback (most recent call last):
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 120, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2345, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/utils.py", line 362, in __call__
    raise ValueError(
ValueError: The features should include input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected
  0%| | 0/20100 [00:00<?, ?it/s]
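
For context on this error: in trl v0.11 the trainer's collator expects the dataset to already be tokenized into those four columns before training starts. A rough sketch of that kind of preprocessing (illustrative only, not the exact code of reward_modeling.py; it assumes chosen/rejected are plain strings):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

def tokenize_pair(example):
    # Tokenize the preferred and rejected texts separately so the collator can
    # pad each side and the trainer can score both with the same model.
    chosen = tokenizer(example["chosen"], truncation=True, max_length=2048)
    rejected = tokenizer(example["rejected"], truncation=True, max_length=2048)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

# dataset = dataset.map(tokenize_pair)

So the error indicates the data reached the collator without this tokenization step, which matches the script/version mismatch identified later in this thread.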

@yananchen1989
Author

On this page https://huggingface.co/docs/trl/v0.11.1/en/reward_trainer#reward-modeling

I see a conflict:
[screenshots from the documentation page]

@qgallouedec qgallouedec self-assigned this Oct 3, 2024
@qgallouedec qgallouedec added 📚 documentation Improvements or additions to documentation and removed 🐛 bug Something isn't working labels Oct 3, 2024
@qgallouedec
Member

In the official document https://huggingface.co/docs/trl/main/en/reward_trainer , the [RewardTrainer] requires an [implicit prompt preference dataset].
I see the example is using trl-lib/ultrafeedback_binarized which is not a so-called "implicit prompt preference dataset" as the prompt is explicitly provided in the dataset.

trl-lib/ultrafeedback_binarized is an implicit-prompt dataset precisely because it has no prompt column. You can see that the chosen and rejected conversations share a common start ({'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}); this shared user turn is the so-called implicit prompt:

>>> from datasets import load_dataset
>>> dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
>>> dataset.column_names
['chosen', 'rejected', 'score_chosen', 'score_rejected']
>>> dataset[0]
{'chosen': [{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}, {'content': "Sure, I'd be happy to help you write a version of the classic game Snake using the pygame library! ...", 'role': 'assistant'}],
'rejected': [{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}, {'content': 'Sure, here\'s an example of how to write a version of Snake game with a unique twist using the Pygame library:...', 'role': 'assistant'}], 'score_chosen': 6.0, 'score_rejected': 4.0}
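
To make the terminology concrete, here is a hand-written comparison of the two layouts (the values are shortened illustrations, not actual dataset rows):

# Explicit prompt: the prompt lives in its own column and the completions
# contain only the assistant turns.
explicit_prompt_row = {
    "prompt": [{"role": "user", "content": "Use the pygame library to write Snake..."}],
    "chosen": [{"role": "assistant", "content": "Sure, here's a version using pygame..."}],
    "rejected": [{"role": "assistant", "content": "Here's an attempt..."}],
}

# Implicit prompt: there is no prompt column; each completion repeats the shared
# user turn, and the prompt is "implicit" in that common prefix.
implicit_prompt_row = {
    "chosen": [
        {"role": "user", "content": "Use the pygame library to write Snake..."},
        {"role": "assistant", "content": "Sure, here's a version using pygame..."},
    ],
    "rejected": [
        {"role": "user", "content": "Use the pygame library to write Snake..."},
        {"role": "assistant", "content": "Here's an attempt..."},
    ],
}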

@qgallouedec
Member

qgallouedec commented Oct 4, 2024

Here I test Anthropic/hh-rlhf and trl-lib/ultrafeedback_binarized as the dataset_name, but neither works.

The provided code works fine on my side:

python ../examples/scripts/reward_modeling.py \
     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
     --dataset_name  trl-lib/ultrafeedback_binarized \
     --output_dir Qwen2-0.5B-Reward-LoRA \
     --per_device_train_batch_size 8 \
     --num_train_epochs 1 \
     --gradient_checkpointing True \
     --learning_rate 1.0e-4 \
     --logging_steps 25 \
     --eval_strategy steps \
     --eval_steps 50 \
     --max_length 2048 \
     --use_peft \
     --lora_r 32 \
     --lora_alpha 16

If the error persists, please provide your full system info (see bug issue template)

@qgallouedec
Member

The reward trainer's dataset support was recently updated (#2102). See the latest version of the doc for more info: https://huggingface.co/docs/trl/main/en/reward_trainer
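
For readers landing here later, a minimal sketch of the updated flow that doc describes, where the trainer takes the preference dataset directly (argument names follow the docs at the time of writing and may differ between releases):

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A reward model is a sequence classifier with a single score head.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Implicit-prompt preference dataset: only "chosen"/"rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="Qwen2-0.5B-Reward", per_device_train_batch_size=8),
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer releases rename this argument to processing_class
)
trainer.train()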

@yananchen1989
Author

trl-lib/ultrafeedback_binarized is an implicit-prompt dataset precisely because it has no prompt column. [...]

I see, got it. trl-lib/ultrafeedback_binarized and Anthropic/hh-rlhf are in the same boat.

@yananchen1989
Author

CUDA_VISIBLE_DEVICES=0 python /home/ubuntu/trl/examples/scripts/reward_modeling.py \
     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
     --dataset_name  trl-lib/ultrafeedback_binarized \
     --output_dir Qwen2-0.5B-Reward-LoRA \
     --per_device_train_batch_size 8 \
     --num_train_epochs 1 \
     --gradient_checkpointing True \
     --learning_rate 1.0e-4 \
     --logging_steps 25 \
     --eval_strategy steps \
     --eval_steps 50 \
     --max_length 2048 \
     --use_peft \
     --lora_r 16 \
     --lora_alpha 16

error:

Traceback (most recent call last):
  File "/home/ubuntu/trl/examples/scripts/reward_modeling.py", line 120, in <module>
    trainer.train()
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/accelerate/data_loader.py", line 550, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/trl/trainer/utils.py", line 362, in __call__
    raise ValueError(
ValueError: The features should include input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected
  0%| | 0/7767 [00:00<?, ?it/s]

trl version: 0.11.1

By the way, trl env does not work:

Traceback (most recent call last):
  File "/opt/conda/envs/trl11/bin/trl", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/trl11/lib/python3.11/site-packages/trl/commands/cli.py", line 38, in main
    raise ValueError(
ValueError: Please use one of the supported commands, got env - supported commands are ['sft', 'dpo', 'chat', 'kto']

@yananchen1989
Author

python version: 3.11.10

@qgallouedec
Member

qgallouedec commented Oct 5, 2024

I've downgraded to v0.11.1 and I still can't reproduce the error.

By the way, trl env does not work:

trl env requires trl>=0.12. Can you run transformers-cli env instead?

Can you also confirm that you have not modified the codebase?

@yananchen1989
Author

  • transformers version: 4.45.1
  • Platform: Linux-5.15.0-1061-aws-x86_64-with-glibc2.31
  • Python version: 3.11.10
  • Huggingface_hub version: 0.25.1
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A10G

@qgallouedec

@yananchen1989
Author

I ran git pull in /home/ubuntu/trl/, so everything there is up to date, including examples/scripts/reward_modeling.py.

@yananchen1989
Author

I installed trl via pip install -U trl.

@qgallouedec
Member

I still can't reproduce it. I reinstalled everything from scratch and it still works. Can you try the same? Also, try clearing your cache.

python3.11 -m venv env
source env/bin/activate
pip install trl[peft]==0.11.1
curl -O https://raw.githubusercontent.com/huggingface/trl/86ad7a7e85dc65c79bd9759097709a27ad1a58dd/examples/scripts/reward_modeling.py
python reward_modeling.py \
     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
     --dataset_name  trl-lib/ultrafeedback_binarized \
     --output_dir Qwen2-0.5B-Reward-LoRA \
     --per_device_train_batch_size 8 \
     --num_train_epochs 1 \
     --gradient_checkpointing True \
     --learning_rate 1.0e-4 \
     --logging_steps 25 \
     --eval_strategy steps \
     --eval_steps 50 \
     --max_length 2048 \
     --use_peft \
     --lora_r 32 \
     --lora_alpha 16
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/fsx/qgallouedec/trl/tmp/reward_modeling.py:108: UserWarning: You are using a `task_type` that is different than `SEQ_CLS` for PEFT. This will lead to silent bugs Make sure to pass --lora_task_type SEQ_CLS when using this script with PEFT.
  warnings.warn(
Filter: 100%|█████████████████████████████████████████████████████████████████| 62135/62135 [00:29<00:00, 2121.63 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1926.69 examples/s]
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/trl/trainer/reward_trainer.py:199: UserWarning: When using RewardDataCollatorWithPadding, you should set `remove_unused_columns=False` in your RewardConfig we have set it for you, but you should do it yourself in the future.
  warnings.warn(
  0%|                                                                                               | 0/7750 [00:00<?, ?it/s]You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2855: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/fsx/qgallouedec/trl/tmp/env/lib/python3.11/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.7755, 'grad_norm': 3.030179262161255, 'learning_rate': 9.967741935483872e-05, 'epoch': 0.0}                       
{'loss': 0.71, 'grad_norm': 4.013882160186768, 'learning_rate': 9.935483870967742e-05, 'epoch': 0.01}                        
  1%|▌                                                                                   | 50/7750 [00:49<2:23:07,  1.12s/it]┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ chosen_text                                       ┃ rejected_text                                      ┃ logits           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ <|im_start|>system                                │ <|im_start|>system                                 │ [0.5081, 0.4919] │
│ You are a helpful assistant.<|im_end|>            │ You are a helpful assistant.<|im_end|>             │                  │
│ <|im_start|>user                                  │ <|im_start|>user                                   │                  │
│ As an HR manager, you want to test a potential    │ As an HR manager, you want to test a potential     │                  │
│ employee's ability to solve puzzles to determine  │ employee's ability to solve puzzles to determine   │                  │
│ their suitability for a job. Write a Python       │ their suitability for a job. Write a Python script │                  │
│ script that generates a list of questions that    │ that generates a list of questions that require    │                  │
│ require logical reasoning to answer. Your list    │ logical reasoning to answer. Your list should      │                  │
│ should include questions related to mathematical  │ include questions related to mathematical puzzles, │                  │
│ puzzles, language puzzles, logic puzzles, lateral │ language puzzles, logic puzzles, lateral thinking  │                  │
│ thinking puzzles, and pattern recognition         │ puzzles, and pattern recognition puzzles. Use the  │                  │
│ puzzles. Use the following code as a starting     │ following code as a starting point:                │                  │
│ point:                                            │ questions = {                                      │                  │
│ questions = {                                     │     "Mathematical puzzles": ["If the value of x+y  │                  │
│     "Mathematical puzzles": ["If the value of x+y │ = 20 and x-y = 10, what is the value of x and y?", │                  │
│ = 20 and x-y = 10, what is the value of x and     │ "If a pizza has a radius of 8 inches and is cut    │                  │
│ y?", "If a pizza has a radius of 8 inches and is  │ into 6 equal slices, what is the area of each      │ 
...

@qgallouedec qgallouedec added 🐛 bug Something isn't working ⏳ needs more info Additional information or clarification is required to proceed labels Oct 7, 2024
@yananchen1989
Author

@qgallouedec

reward_modeling.py from your link https://raw.githubusercontent.com/huggingface/trl/86ad7a7e85dc65c79bd9759097709a27ad1a58dd/examples/scripts/reward_modeling.py does work fine.

But the script from https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py does not work.

I do see that there are a lot of differences between them.

@qgallouedec
Member

The latter is the script for the dev version. You can't use trl 0.11 with it.
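
One way to avoid this kind of mismatch is to fetch the example script from the git tag that matches the installed library instead of from main. A small sketch, assuming a released (non-dev) install and the repo's vX.Y.Z tag convention:

import urllib.request
import trl

# Build the raw-file URL for the tag matching the installed release, e.g. v0.11.1.
# Note: a dev install (e.g. 0.12.0.dev0) has no corresponding tag.
tag = f"v{trl.__version__}"
url = f"https://raw.githubusercontent.com/huggingface/trl/{tag}/examples/scripts/reward_modeling.py"
urllib.request.urlretrieve(url, "reward_modeling.py")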

@yananchen1989
Author

OK, then I will wait for the dev version to be released. Thanks. @qgallouedec

@yananchen1989
Author

Hi, just reopening this ticket.

Although trl-lib/ultrafeedback_binarized works fine with reward_modeling.py on trl version 0.11.2, I also see that something goes wrong when using the dataset Anthropic/hh-rlhf.

This dataset is used as an example in https://huggingface.co/docs/trl/v0.11.2/reward_trainer

error message:

Traceback (most recent call last):
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 140, in <module>
    dataset = dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 866, in map
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 867, in <dictcomp>
    k: dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3035, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3408, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3300, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/trl/examples/scripts/reward_modeling.py", line 141, in <lambda>
    lambda x: {"chosen": chosen_fn(x), "rejected": rejected_fn(x)}, num_proc=config.dataset_num_proc
  File "/usr/local/lib/python3.10/dist-packages/trl/extras/dataset_formatting.py", line 43, in format_dataset
    return tokenizer.apply_chat_template(examples[messages_field], tokenize=False)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 1875, in apply_chat_template
    rendered_chat = compiled_template.render(
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 4, in top-level template code
jinja2.exceptions.UndefinedError: 'str object' has no attribute 'role'

@yananchen1989 yananchen1989 reopened this Oct 9, 2024
@yananchen1989
Author

So only chat-format (conversational) preference datasets like trl-lib/ultrafeedback_binarized are supported in the following versions?

@qgallouedec
Member

No, the following works fine:

python reward_modeling.py \
     --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
     --dataset_name  Anthropic/hh-rlhf \
     --output_dir Qwen2-0.5B-Reward-LoRA \
     --per_device_train_batch_size 8 \
     --num_train_epochs 1 \
     --gradient_checkpointing True \
     --learning_rate 1.0e-4 \
     --logging_steps 25 \
     --eval_strategy steps \
     --eval_steps 50 \
     --max_length 2048 \
     --use_peft \
     --lora_r 32 \
     --lora_alpha 16

@yananchen1989
Author

@qgallouedec did you check out the v0.11-release branch?

@yananchen1989
Author

Did you check out the branch and pip install -e . from the source of that branch? And it works fine?

@qgallouedec
Member

Indeed in v0.11.2, the example assumes that the dataset is in conversational format.
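
For anyone stuck on v0.11.2, a rough workaround sketch is to convert the plain-text hh-rlhf transcripts into that conversational format before training (the parsing below is an illustration, not TRL code):

import re
from datasets import load_dataset

def to_messages(text):
    # hh-rlhf stores each transcript as a single string like
    # "\n\nHuman: ...\n\nAssistant: ...". Split on the speaker markers
    # and map them to chat roles.
    parts = re.split(r"\n\n(Human|Assistant): ", text)
    return [
        {"role": "user" if speaker == "Human" else "assistant", "content": content}
        for speaker, content in zip(parts[1::2], parts[2::2])
    ]

dataset = load_dataset("Anthropic/hh-rlhf", split="train")
dataset = dataset.map(
    lambda x: {"chosen": to_messages(x["chosen"]), "rejected": to_messages(x["rejected"])}
)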

@yananchen1989
Author

OK, so a plain-text format such as Anthropic/hh-rlhf is not supported anymore.

@qgallouedec
Member

No, the other way around: previously it was not supported, and now it is. The dev version is ahead of v0.11.2.

@yananchen1989
Author

yananchen1989 commented Oct 10, 2024

OK, I will wait for the new release and test it in the near future.
