TL;DR
Adds SFT training to Torchtitan, plus a small `greedy_packing` addition. Most of the code is borrowed from Verl and OpenRLHF.
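As a rough illustration of what the `greedy_packing` addition refers to, here is a minimal first-fit packing sketch (a hypothetical helper, not the code in this PR): short SFT samples are grouped so that each pack stays within `seq_len`, which cuts the amount of padding per batch.

```python
from typing import List


def greedy_pack(sample_lens: List[int], seq_len: int) -> List[List[int]]:
    """Greedily assign sample indices to packs whose total length stays <= seq_len."""
    packs: List[List[int]] = []
    pack_lens: List[int] = []
    # Longest-first is a common heuristic; the PR may simply pack in arrival order.
    for idx in sorted(range(len(sample_lens)), key=lambda i: -sample_lens[i]):
        length = sample_lens[idx]
        for p, used in enumerate(pack_lens):
            if used + length <= seq_len:
                packs[p].append(idx)       # first pack with enough room wins
                pack_lens[p] += length
                break
        else:
            packs.append([idx])            # no existing pack fits: open a new one
            pack_lens.append(length)
    return packs


# Example: four samples packed into 4096-token buffers -> [[3, 0], [2, 1]]
print(greedy_pack([1000, 1500, 2500, 3000], seq_len=4096))
```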
Changes
- `get_attention_masks` to support SFT masks (only landed on Llama3 for now)
- `HFTokenizer`: need to fix this later
- each SFT sample carries `input_ids`, `labels` (user tokens masked), `attention_masks`, and `position_ids` (see the sketch after this list)
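For the `labels` masking above, a minimal sketch of how one SFT sample could be assembled; the field names come from the Changes list, while `build_sft_sample`, the prompt/response split, and the use of `-100` (the default `ignore_index` of PyTorch cross-entropy) are illustrative assumptions rather than this PR's exact code.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_sft_sample(prompt_ids: list, response_ids: list) -> dict:
    """Build one SFT training sample with user/prompt tokens masked out of the loss."""
    input_ids = torch.tensor(prompt_ids + response_ids, dtype=torch.long)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX        # user tokens contribute no loss
    attention_masks = torch.ones_like(input_ids)    # 1 = real token, 0 = padding
    position_ids = torch.arange(input_ids.numel())  # restarts per sample when packing
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_masks": attention_masks,
        "position_ids": position_ids,
    }
```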
Run
CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" NGPU=4 ./run_train.sh \ --training.running_sft_training \ --model.flavor=debugmodel_varlen_attn \ --training.dataset_path=openai/gsm8k \ --sft_data_config.dataset_subset=mainmore test
(torch 2.10.0.dev20251124+cu129, and I am using cuDNN attention.)

`torch.compile` does not work in the no-padding case because the seq-len for each training step keeps changing. We could pad the buffer to `seqlen` when turning on `greedy_packing` (effectively `packing_on`++) to make compile happy.
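A minimal sketch of that padding idea, assuming a hypothetical helper and `pad_id`: padding every packed buffer up to a fixed `seq_len` keeps tensor shapes static, so `torch.compile` does not have to recompile when the packed length changes.

```python
import torch
import torch.nn.functional as F


def pad_pack_to_seqlen(input_ids: torch.Tensor, labels: torch.Tensor,
                       seq_len: int, pad_id: int = 0):
    """Right-pad a packed 1-D buffer to seq_len so every step sees the same shape."""
    pad = seq_len - input_ids.numel()
    if pad < 0:
        raise ValueError("packed buffer is longer than seq_len")
    input_ids = F.pad(input_ids, (0, pad), value=pad_id)
    labels = F.pad(labels, (0, pad), value=-100)  # padded positions never hit the loss
    return input_ids, labels
```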