[QEff. Finetuning]: Adding FinetuningPipeline (finetune_experiemental.py) and related code #791

Merged: quic-swatia merged 1 commit into quic:ft_experimental from quic-swatia:HFTrainer-MainPipeline on Feb 15, 2026

Conversation

quic-swatia (Contributor) commented Feb 11, 2026:

  1. Added FinetuningPipeline (finetune_experiemental.py), which integrates all the components added for the HF Trainer and enables running fine-tuning through it (a hedged sketch of such a pipeline follows after this description).
  2. Added files to handle the PEFT and training configs.
  3. Made changes in the config_manager and callbacks files.
  4. Added unit tests for the FinetuningPipeline (test_finetune.py).
  5. Updated tests in test_callback and test_config_manager based on the above changes.

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
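
For orientation, here is a minimal sketch of how a Trainer-based finetuning pipeline of this shape could look. The class name, dataset fields, and hyperparameters below are illustrative assumptions and do not mirror the actual contents of finetune_experiemental.py.

```python
# Hypothetical sketch of a Trainer-based finetuning pipeline of this shape.
# Names, dataset fields, and hyperparameters are illustrative assumptions and
# do not mirror the actual code in finetune_experiemental.py.
from dataclasses import dataclass

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


@dataclass
class FinetuningPipeline:
    model_name: str = "meta-llama/Llama-3.2-1B"
    dataset_name: str = "openai/gsm8k"
    output_dir: str = "./ft_output"

    def run(self) -> None:
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

        # Base model wrapped with a LoRA adapter built from the PEFT config.
        model = AutoModelForCausalLM.from_pretrained(self.model_name)
        model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))

        dataset = load_dataset(self.dataset_name, "main", split="train")

        def tokenize(batch):
            text = [q + "\n" + a for q, a in zip(batch["question"], batch["answer"])]
            return tokenizer(text, truncation=True, max_length=512)

        tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

        # Training config; in the real pipeline these values come from the config files.
        args = TrainingArguments(
            output_dir=self.output_dir,
            num_train_epochs=5,
            per_device_train_batch_size=1,
            logging_steps=50,
        )
        collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
        Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()


if __name__ == "__main__":
    FinetuningPipeline().run()
```

Run as-is it fine-tunes the defaults above; the real pipeline reads these values from the PEFT and training config files instead of hard-coding them.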
quic-swatia force-pushed the HFTrainer-MainPipeline branch from 47f4078 to 267ad3a on February 11, 2026 09:43

quic-akuruvil (Contributor) left a comment:


Please also update the documentation with sample commands showing users how to use the Finetune API, both with and without the sample config file (a hypothetical sketch of the with/without-config handling follows below).
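
Purely to illustrate what "with and without the sample config file" might look like at the command level, here is a hypothetical sketch; the --config flag, the defaults, and the JSON schema are assumptions made for illustration, not the actual Finetune API or CLI.

```python
# Hypothetical sketch only: the --config flag, the defaults, and the JSON
# schema are assumptions made for illustration, not the actual Finetune API.
import argparse
import json

DEFAULTS = {
    "model_name": "meta-llama/Llama-3.2-1B",
    "dataset_name": "openai/gsm8k",
    "num_train_epochs": 5,
}


def load_settings() -> dict:
    parser = argparse.ArgumentParser(description="Launch fine-tuning (illustrative).")
    # Without --config, built-in defaults are used; with it, the file's values win.
    parser.add_argument("--config", default=None, help="Path to a JSON config file.")
    args = parser.parse_args()

    settings = dict(DEFAULTS)
    if args.config is not None:
        with open(args.config) as f:
            settings.update(json.load(f))
    return settings


if __name__ == "__main__":
    print(load_settings())
```

Invoked with no arguments it falls back to the built-in defaults; invoked with --config sample_config.json the file's values override them.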


quic-akuruvil (Contributor) left a comment:


Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?
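
For reference, the invariant under discussion (model weights and batch tensors co-located on the accelerator) looks roughly like the sketch below; the torch_qaic import and the "qaic" device string are assumptions about the QAIC backend, with a CPU fallback so the snippet still runs elsewhere.

```python
# Generic sketch of keeping model and inputs on the same device.
# ASSUMPTION: the QAIC torch backend registers a "qaic" device; fall back to
# CPU if it is unavailable so the snippet remains runnable.
import torch

try:
    import torch_qaic  # noqa: F401  (assumed backend import; may not be installed)
    device = torch.device("qaic")
except (ImportError, RuntimeError):
    device = torch.device("cpu")

model = torch.nn.Linear(16, 2).to(device)            # toy stand-in for the LLM
batch = {"x": torch.randn(4, 16)}
batch = {k: v.to(device) for k, v in batch.items()}  # move every batch tensor

logits = model(batch["x"])
# Sanity check: parameters and inputs must report the same device type.
assert next(model.parameters()).device.type == batch["x"].device.type
```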

quic-swatia (Contributor, Author) commented:

> Please also update the documentation with sample commands showing users how to use the Finetune API, both with and without the sample config file.

It's already merged through Tanisha's PR #747.

quic-swatia (Contributor, Author) commented:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

quic-akuruvil (Contributor) commented Feb 11, 2026:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

Are the losses converging now? Let's check that for at least one of the test configs (llama-1b + gsm8k) and match the metrics for the train and eval loops.

quic-akuruvil (Contributor) commented:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

> Are the losses converging now? Let's check that for at least one of the test configs and match the metrics for the train and eval loops.

Please include the metric values here for reference as well.
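
One lightweight way to do this check is to read the Trainer's saved log history; the sketch below assumes the HF Trainer's standard trainer_state.json (written into the output/checkpoint directory), and the path is a placeholder.

```python
# Sketch: compare first vs. last logged training loss and the latest eval loss
# from trainer_state.json. The file path is a placeholder assumption.
import json

with open("ft_output/trainer_state.json") as f:
    state = json.load(f)

train_losses = [e["loss"] for e in state["log_history"] if "loss" in e]
eval_losses = [e["eval_loss"] for e in state["log_history"] if "eval_loss" in e]

print(f"train loss: first={train_losses[0]:.4f}  last={train_losses[-1]:.4f}")
if eval_losses:
    print(f"eval loss:  last={eval_losses[-1]:.4f}")

# Loose convergence check: training loss should have dropped noticeably.
assert train_losses[-1] < train_losses[0]
```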

quic-swatia merged commit ab5918e into quic:ft_experimental on Feb 15, 2026 (3 checks passed).
quic-swatia (Contributor, Author) commented:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

> Are the losses converging now? Let's check that for at least one of the test configs and match the metrics for the train and eval loops.

> Please include the metric values here for reference as well.

Yes, the loss is converging. Here are the metrics from fine-tuning meta-llama/Llama-3.2-1B for 5 epochs on a single SoC:

{"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

{"train_runtime":32856.1501,"train_samples_per_second":1.137,"train_steps_per_second":1.137,"total_flos":3.8132170931712e+16,"train_loss":1.0178058738101043,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

Training loss at the start of training: 1.5146

quic-akuruvil (Contributor) commented Feb 16, 2026:

> Also, when we tested earlier, the training pipeline was failing because the model and the inputs were on different devices; both need to be on QAIC. Has that been fixed, tested, and verified here?

> Tanisha observed this error because of an incorrect accelerate package version; it went away for her once the correct version was installed. Inside the Docker image, accelerate is already installed, so she didn't hit it there.

> Are the losses converging now? Let's check that for at least one of the test configs and match the metrics for the train and eval loops.

> Please include the metric values here for reference as well.

> Yes, the loss is converging. Here are the metrics from fine-tuning meta-llama/Llama-3.2-1B for 5 epochs on a single SoC: {"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365} {"train_runtime":32856.1501,"train_samples_per_second":1.137,"train_steps_per_second":1.137,"total_flos":3.8132170931712e+16,"train_loss":1.0178058738101043,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

> Training loss at the start of training: 1.5146

@quic-swatia This looks good. Training entropy is not logged; please add that, and check multi-SoC convergence as well. (A hedged sketch of one way to log train-side entropy follows below.)
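
One possible way to get a train-side entropy into the logs is sketched below: subclass the HF Trainer and extend compute_loss to also record mean prediction entropy. The compute_loss signature varies across transformers versions, and this is not the callback mechanism used in this PR.

```python
# Sketch: log mean token entropy during training by extending compute_loss.
# NOTE: this logs every step (throttle in practice) and uses **kwargs because
# compute_loss signatures differ across transformers versions. Not this PR's code.
import torch
import torch.nn.functional as F
from transformers import Trainer


class EntropyLoggingTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)  # labels are expected to be present in `inputs`
        loss = outputs.loss

        with torch.no_grad():
            log_probs = F.log_softmax(outputs.logits, dim=-1)  # (batch, seq, vocab)
            entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        self.log({"train_entropy": entropy.item()})

        return (loss, outputs) if return_outputs else loss
```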

quic-akuruvil pushed a commit to quic-akuruvil/efficient_transformers that referenced this pull request on Feb 16, 2026:
[QEff. Finetuning]: Adding FinetuningPipeline (finetune_experiemental.py) and related code (quic#791)

1) Added FinetuningPipeline (finetune_experiemental.py), which integrates
all the components added for the HF Trainer and enables running fine-tuning
through it.
2) Added files to handle the PEFT and training configs.
3) Made changes in the config_manager and callbacks files.
4) Added unit tests for the FinetuningPipeline (test_finetune.py).
5) Updated tests in test_callback and test_config_manager based on the above
changes.

Fine-tuning on openai/gsm8k for 5 epochs on a single SoC gave the following
numbers:

{"eval_loss":1.0224987268447876,"eval_runtime":484.8933,"eval_samples_per_second":2.72,"eval_steps_per_second":2.72,"eval_entropy":0.9871161538059735,"eval_num_tokens":6525025.0,"eval_mean_token_accuracy":0.7452040632806826,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

{"train_runtime":32856.1501,"train_samples_per_second":1.137,"train_steps_per_second":1.137,"total_flos":3.8132170931712e+16,"train_loss":1.0178058738101043,"epoch":5.0,"num_input_tokens_seen":6525025,"global_step":37365}

Training loss at the start of training: 1.5146

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Co-authored-by: Swati Allabadi <sallabad@qti.qualcomm.com>