[QEff. Finetuning]: Adding FinetuningPipeline (finetune_experiemental.py) and related code #791

Merged: quic-swatia merged 1 commit into quic:ft_experimental from quic-swatia:HFTrainer-MainPipeline on Feb 15, 2026

Conversation

quic-swatia (Contributor) commented Feb 11, 2026:

  1. Added FinetuningPipeline (finetune_experiemental.py), which integrates all the components added for the HF Trainer and enables running fine-tuning through it (a hedged sketch of such a pipeline follows after this description).
  2. Added files to handle the PEFT and training configs.
  3. Made changes in the config_manager and callbacks files.
  4. Added unit tests for the FinetuningPipeline (test_finetune.py).
  5. Updated tests in test_callback and test_config_manager based on the above changes.

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
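
For orientation, here is a minimal sketch of how a Trainer-based finetuning pipeline of this shape could look. The class name, dataset fields, and hyperparameters below are illustrative assumptions and do not mirror the actual contents of finetune_experiemental.py.

```python
# Hypothetical sketch of a Trainer-based finetuning pipeline of this shape.
# Names, dataset fields, and hyperparameters are illustrative assumptions and
# do not mirror the actual code in finetune_experiemental.py.
from dataclasses import dataclass

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


@dataclass
class FinetuningPipeline:
    model_name: str = "meta-llama/Llama-3.2-1B"
    dataset_name: str = "openai/gsm8k"
    output_dir: str = "./ft_output"

    def run(self) -> None:
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

        # Base model wrapped with a LoRA adapter built from the PEFT config.
        model = AutoModelForCausalLM.from_pretrained(self.model_name)
        model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))

        dataset = load_dataset(self.dataset_name, "main", split="train")

        def tokenize(batch):
            text = [q + "\n" + a for q, a in zip(batch["question"], batch["answer"])]
            return tokenizer(text, truncation=True, max_length=512)

        tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

        # Training config; in the real pipeline these values come from the config files.
        args = TrainingArguments(
            output_dir=self.output_dir,
            num_train_epochs=5,
            per_device_train_batch_size=1,
            logging_steps=50,
        )
        collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
        Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()


if __name__ == "__main__":
    FinetuningPipeline().run()
```

Run as-is it fine-tunes the defaults above; the real pipeline reads these values from the PEFT and training config files instead of hard-coding them.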
quic-swatia force-pushed the HFTrainer-MainPipeline branch from 47f4078 to 267ad3a on February 11, 2026 09:43

quic-akuruvil (Contributor) left a comment:


Please also update the documentation with sample commands showing users how to use the Finetune API, both with and without the sample config file (a hypothetical sketch of the with/without-config handling follows below).
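
Purely to illustrate what "with and without the sample config file" might look like at the command level, here is a hypothetical sketch; the --config flag, the defaults, and the JSON schema are assumptions made for illustration, not the actual Finetune API or CLI.

```python
# Hypothetical sketch only: the --config flag, the defaults, and the JSON
# schema are assumptions made for illustration, not the actual Finetune API.
import argparse
import json

DEFAULTS = {
    "model_name": "meta-llama/Llama-3.2-1B",
    "dataset_name": "openai/gsm8k",
    "num_train_epochs": 5,
}


def load_settings() -> dict:
    parser = argparse.ArgumentParser(description="Launch fine-tuning (illustrative).")
    # Without --config, built-in defaults are used; with it, the file's values win.
    parser.add_argument("--config", default=None, help="Path to a JSON config file.")
    args = parser.parse_args()

    settings = dict(DEFAULTS)
    if args.config is not None:
        with open(args.config) as f:
            settings.update(json.load(f))
    return settings


if __name__ == "__main__":
    print(load_settings())
```

Invoked with no arguments it falls back to the built-in defaults; invoked with --config sample_config.json the file's values override them.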


quic-akuruvil (Contributor) left a comment:


Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?
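
For reference, the invariant under discussion (model weights and batch tensors co-located on the accelerator) looks roughly like the sketch below; the torch_qaic import and the "qaic" device string are assumptions about the QAIC backend, with a CPU fallback so the snippet still runs elsewhere.

```python
# Generic sketch of keeping model and inputs on the same device.
# ASSUMPTION: the QAIC torch backend registers a "qaic" device; fall back to
# CPU if it is unavailable so the snippet remains runnable.
import torch

try:
    import torch_qaic  # noqa: F401  (assumed backend import; may not be installed)
    device = torch.device("qaic")
except (ImportError, RuntimeError):
    device = torch.device("cpu")

model = torch.nn.Linear(16, 2).to(device)            # toy stand-in for the LLM
batch = {"x": torch.randn(4, 16)}
batch = {k: v.to(device) for k, v in batch.items()}  # move every batch tensor

logits = model(batch["x"])
# Sanity check: parameters and inputs must report the same device type.
assert next(model.parameters()).device.type == batch["x"].device.type
```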

quic-swatia (Contributor, Author) commented:

> Please also update the documentation with sample commands showing users how to use the Finetune API, both with and without the sample config file.

It's already merged through Tanisha's PR #747.

quic-swatia (Contributor, Author) commented:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

quic-akuruvil (Contributor) commented Feb 11, 2026:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

Are the losses converging now? Let's check that for at least one of the test configs (llama-1b + gsm8k) and match the metrics for the train and eval loops.

quic-akuruvil (Contributor) commented:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

> Are the losses converging now? Let's check that for at least one of the test configs and match the metrics for the train and eval loops.

Please include the metric values here for reference as well.
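
One lightweight way to do this check is to read the Trainer's saved log history; the sketch below assumes the HF Trainer's standard trainer_state.json (written into the output/checkpoint directory), and the path is a placeholder.

```python
# Sketch: compare first vs. last logged training loss and the latest eval loss
# from trainer_state.json. The file path is a placeholder assumption.
import json

with open("ft_output/trainer_state.json") as f:
    state = json.load(f)

train_losses = [e["loss"] for e in state["log_history"] if "loss" in e]
eval_losses = [e["eval_loss"] for e in state["log_history"] if "eval_loss" in e]

print(f"train loss: first={train_losses[0]:.4f}  last={train_losses[-1]:.4f}")
if eval_losses:
    print(f"eval loss:  last={eval_losses[-1]:.4f}")

# Loose convergence check: training loss should have dropped noticeably.
assert train_losses[-1] < train_losses[0]
```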

quic-swatia merged commit ab5918e into quic:ft_experimental on Feb 15, 2026 (3 checks passed).
quic-swatia (Contributor, Author) commented:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

> Are the losses converging now? Let's check that for at least one of the test configs and match the metrics for the train and eval loops.

> Please include the metric values here for reference as well.

Yes, the loss is converging. Here are the metrics from fine-tuning meta-llama/Llama-3.2-1B for 5 epochs on a single SoC:

{"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

{"train_runtime":32856.1501,"train_samples_per_second":1.137,"train_steps_per_second":1.137,"total_flos":3.8132170931712e+16,"train_loss":1.0178058738101043,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

Training loss at the start of training: 1.5146

quic-akuruvil (Contributor) commented Feb 16, 2026:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

> Are the losses converging now? Let's check that for at least one of the test configs and match the metrics for the train and eval loops.

> Please include the metric values here for reference as well.

> Yes, the loss is converging. Here are the metrics from fine-tuning meta-llama/Llama-3.2-1B for 5 epochs on a single SoC: {"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365} {"train_runtime":32856.1501,"train_samples_per_second":1.137,"train_steps_per_second":1.137,"total_flos":3.8132170931712e+16,"train_loss":1.0178058738101043,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

> Training loss at the start of training: 1.5146

@quic-swatia This looks good. Training entropy is not logged; please add that, and check multi-SoC convergence as well. (A hedged sketch of one way to log train-side entropy follows below.)
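
One possible way to get a train-side entropy into the logs is sketched below: subclass the HF Trainer and extend compute_loss to also record mean prediction entropy. The compute_loss signature varies across transformers versions, and this is not the callback mechanism used in this PR.

```python
# Sketch: log mean token entropy during training by extending compute_loss.
# NOTE: this logs every step (throttle in practice) and uses **kwargs because
# compute_loss signatures differ across transformers versions. Not this PR's code.
import torch
import torch.nn.functional as F
from transformers import Trainer


class EntropyLoggingTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)  # labels are expected to be present in `inputs`
        loss = outputs.loss

        with torch.no_grad():
            log_probs = F.log_softmax(outputs.logits, dim=-1)  # (batch, seq, vocab)
            entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        self.log({"train_entropy": entropy.item()})

        return (loss, outputs) if return_outputs else loss
```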

quic-akuruvil pushed a commit to quic-akuruvil/efficient_transformers that referenced this pull request on Feb 16, 2026:
[QEff. Finetuning]: Adding FinetuningPipeline (finetune_experiemental.py) and related code (quic#791)

1) Added FinetuningPipeline (finetune_experiemental.py), which integrates
all the components added for the HF Trainer and enables running fine-tuning
through it.
2) Added files to handle the PEFT and training configs.
3) Made changes in the config_manager and callbacks files.
4) Added unit tests for the FinetuningPipeline (test_finetune.py).
5) Updated tests in test_callback and test_config_manager based on the above
changes.

Fine-tuning on openai/gsm8k for 5 epochs on a single SoC gave the following
numbers:

{"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

{"train_runtime":32856.1501,"train_samples_per_second":1.137,"train_steps_per_second":1.137,"total_flos":3.8132170931712e+16,"train_loss":1.0178058738101043,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

Training loss at the start of training: 1.5146

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Co-authored-by: Swati Allabadi <sallabad@qti.qualcomm.com>