
Onlinedpo Support rm with different vocab size #368

Draft · wants to merge 4 commits into main
Conversation

vwxyzjn (Collaborator) commented Sep 25, 2024
To test it out, run:

python mason.py \
    --cluster ai2/jupiter-cirrascale-2 --image costah/online_dpo_rm2 --pure_docker_mode \
    --workspace ai2/tulu-3-dev \
    --priority high \
    --budget ai2/allennlp \
    --preemptible \
    --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \
    open_instruct/online_dpo_vllm_thread.py \
    --exp_name "online_dpo_vllm_thread_different_rm" \
    --dataset_mixer '{"HuggingFaceH4/no_robots": 9500, "AI-MO/NuminaMath-TIR": 72441}' \
    --dataset_train_splits train \
    --dataset_eval_mixer '{"HuggingFaceH4/no_robots": 1.0}' \
    --dataset_eval_splits test \
    --max_token_length 2048 \
    --max_prompt_token_lenth 2048 \
    --learning_rate 8e-7 \
    --output_dir /output/ \
    --chat_template tulu \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --local_rollout_forward_batch_size 1 \
    --vllm_device cuda:7 \
    --num_epochs 1 \
    --num_mini_batches 1 \
    --total_episodes 300000 \
    --model_name_or_path allenai/open_instruct_dev \
    --model_revision finetune__meta-llama_Meta-Llama-3.1-8B__42__1726352218 \
    --reward_model_path Skywork/Skywork-Reward-Llama-3.1-8B \
    --non_stop_penalty \
    --stop_token eos \
    --penalty_reward_value -10.0 \
    --beta 0.03 \
    --num_evals 3 \
    --seed 3 \
    --response_length 1536 \
    --gradient_checkpointing \
    --with_tracking \
    --push_to_hub

It seems to work properly; see the W&B report:

https://wandb.ai/ai2-llm/open_instruct_internal/reports/online-DPO-with-different-RM-tokenizer--Vmlldzo5NDk2OTE4
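For context, supporting a reward model with a different vocab size/tokenizer generally means the policy's generated token ids cannot be fed to the RM directly: responses have to be decoded with the policy tokenizer and re-encoded with the RM tokenizer before scoring. Below is a minimal sketch of that idea, not the exact code in this PR; the model names are taken from the command above and `score_responses` is a hypothetical helper.

```python
# Sketch only: scoring policy rollouts with a reward model whose
# tokenizer/vocab differs from the policy's (e.g. Skywork RM vs. Llama policy).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

policy_tokenizer = AutoTokenizer.from_pretrained("allenai/open_instruct_dev")
rm_tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B")
if rm_tokenizer.pad_token is None:
    rm_tokenizer.pad_token = rm_tokenizer.eos_token

reward_model = AutoModelForSequenceClassification.from_pretrained(
    "Skywork/Skywork-Reward-Llama-3.1-8B", torch_dtype=torch.bfloat16
)
reward_model.eval()


def score_responses(prompt_response_ids: torch.Tensor) -> torch.Tensor:
    """Decode policy-tokenized sequences and re-tokenize them for the RM."""
    # 1) Back to text with the *policy* tokenizer that produced the ids.
    #    (In practice the RM's chat template would be re-applied here.)
    texts = policy_tokenizer.batch_decode(prompt_response_ids, skip_special_tokens=True)
    # 2) Re-encode with the *reward model* tokenizer, which has a different vocab.
    rm_inputs = rm_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # 3) Score with the RM; one scalar reward per sequence.
    with torch.no_grad():
        rewards = reward_model(**rm_inputs).logits.squeeze(-1)
    return rewards
```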
