[QEff. Finetuning]: Adding FinetuningPipeline (finetune_experiemental.py) and related code #791
Conversation
quic-akuruvil left a comment:
Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Is that fixed, tested, and verified here?
It's already merged through Tanisha's PR #747.
Tanisha observed this error due to an incorrect version of the accelerate package; it was resolved for her with the correct version. Inside the Docker image, accelerate is already installed, so she didn't observe it there.
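(For reference, the device placement being discussed looks roughly like the sketch below. It assumes the torch_qaic backend registers a `qaic` device with PyTorch; the import name and device string are illustrative, not confirmed from this PR.)

```python
import torch
import torch.nn as nn
import torch_qaic  # noqa: F401 -- assumption: importing registers the "qaic" device

device = torch.device("qaic:0")            # illustrative device string

model = nn.Linear(16, 4).to(device)       # model weights on QAIC
inputs = torch.randn(2, 16).to(device)    # inputs moved to the same device

# With both tensors on QAIC, the forward pass runs without the
# device-mismatch error described above.
outputs = model(inputs)
```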
Are the losses converging now? Let's check that for at least one of the test configs (llama-1b + gsm8k) and match the metrics for the train and eval loops.
Please include the metric values here for reference as well.
Yes, the loss is converging. Following are the metrics from fine-tuning `meta-llama/Llama-3.2-1B` for 5 epochs on a single SoC: {"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}. Training loss at the start of training: 1.5146.
@quic-swatia This looks good. Training entropy is not logged; please add that. And check for multi-SoC convergence as well.
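(A hedged sketch of how per-token entropy can be derived from the logits for logging, mirroring the `eval_entropy` metric reported above; this is illustrative, not the repository's implementation.)

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average entropy of the model's per-token predictive distribution,
    ignoring padded positions. logits: (batch, seq_len, vocab_size)."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq_len)
    mask = attention_mask.to(entropy.dtype)
    return (entropy * mask).sum() / mask.sum()

# Toy usage with random logits and no padding:
logits = torch.randn(2, 8, 100)
mask = torch.ones(2, 8)
print(mean_token_entropy(logits, mask))
```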
[QEff. Finetuning]: Adding FinetuningPipeline (finetune_experiemental.py) and related code (quic#791)

1) Added FinetuningPipeline (finetune_experiemental.py), which integrates all the components added for the HF trainer and enables running fine-tuning through it.
2) Added files to handle the PEFT and training configs.
3) Made changes in the config_manager and callbacks files.
4) Added unit tests for the FinetuningPipeline (test_finetune.py).
5) Updated tests in test_callback and test_config_manager based on the above changes.

Fine-tuning on openai/gsm8k for 5 epochs on a single SoC gave the following numbers:

{"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

{"train_runtime":32856.1501,"train_samples_per_second":1.137,"train_steps_per_second":1.137,"total_flos":3.8132170931712e+16,"train_loss":1.0178058738101043,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

Training loss at the start of training: 1.5146.

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Co-authored-by: Swati Allabadi <sallabad@qti.qualcomm.com>
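(For orientation, a minimal sketch of the kind of HF-Trainer + PEFT run the pipeline wires together, using the llama-1b + gsm8k test config mentioned above. It uses only stock transformers/peft/datasets APIs; all hyperparameters are illustrative, and this is not the pipeline's actual interface.)

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Wrap the base model with a LoRA adapter via PEFT (config values illustrative).
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

dataset = load_dataset("openai/gsm8k", "main")  # "train" and "test" splits

def tokenize(example):
    text = example["question"] + "\n" + example["answer"]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           num_train_epochs=5,
                           per_device_train_batch_size=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
metrics = trainer.evaluate()  # yields eval_loss etc., as in the numbers above
```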