
['QEff.finetuning'] Changing some params from training config to model config #747

Merged
quic-akuruvil merged 28 commits into quic:ft_experimental from tchawada:ft_config on Feb 5, 2026

Conversation

@tchawada (Contributor) commented Jan 21, 2026

This PR contains:
1. Documentation for the new finetune experimental stack.
2. Updates in config_manager.py.

@quic-swatia (Contributor):
Please correct the title of the PR. It seems to have been renamed to match one of the commits.

* **completion_template**: string pattern that tells the fine-tuning pipeline which part of the dataset should be treated as the target output (completion) for the model to learn.

**Note**: completion_func and completion_template cannot be used together. Please specify only one of these options at a time.
* **dataset_subset**: `default = "default"` → The subset of the dataset to use (useful for multi-configuration datasets).
Contributor:

Give more description for this: how to use it, and provide a sample value as an example.
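
For illustration, a minimal sketch of what `dataset_subset` corresponds to when loading a dataset (the `wikitext` dataset and its configuration name are assumptions for illustration, not part of this PR):

```python
# Hypothetical illustration: dataset_subset selects a named configuration of a
# multi-configuration Hugging Face dataset; single-config datasets use "default".
from datasets import load_dataset

# Config-less dataset -> dataset_subset: "default"
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")

# Multi-configuration dataset -> an explicit subset, e.g. "wikitext-2-raw-v1"
wiki = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
```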

dataset_name: "yahma/alpaca-cleaned"
train_split: "train"
test_split: "test"
max_seq_length: 512
Contributor:

Why does only the Alpaca dataset have max_seq_len?

Contributor (Author):

Since max_seq_len has a default value, it is not necessary to provide it in every config. I can still add it.

Contributor:

Okay, but what if the Alpaca dataset has some samples with seq_len > 512? On what basis is it set to 512 here?

Contributor (Author):

This is just an example; the user can modify it.

Contributor:

I think any default we set (for each dataset) should not be a random value; it should be close to the best value, dependent on the length of samples in the dataset or on the maximum the hardware can support, etc. This might need an analysis of the samples in the dataset. We might lose some samples if their length exceeds 512.

Contributor:

Anyway, we can work on setting defaults after further analysis later. For now this is okay to complete the end-to-end testing, but please note this down for future resolution.
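
A possible sketch of such an analysis (the tokenizer choice and field names are assumptions for illustration; the numbers it prints are not a recommendation):

```python
# Sketch: inspect the token-length distribution of a dataset to pick a
# max_seq_length that loses few samples. Tokenizer/model name is an assumption.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("yahma/alpaca-cleaned", split="train")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def token_length(row):
    text = row["instruction"] + "\n" + row.get("input", "") + "\n" + row["output"]
    return len(tok(text).input_ids)

lengths = np.array([token_length(r) for r in ds])
print("p50/p95/p99:", np.percentile(lengths, [50, 95, 99]), "max:", lengths.max())
print("samples longer than 512 tokens:", int((lengths > 512).sum()))
```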

@quic-akuruvil (Contributor) left a comment:

Add a sample working config in the configs/ folder.

@quic-meetkuma (Contributor) left a comment:

Looks good; further polishing is needed. Let us close this at the earliest.

PS: add a description to the PR.

errors,
n_epochs <= 0 and max_steps <= 0,
n_epochs <= 0,
"Either training.num_train_epochs > 0 or training.max_steps > 0 must be set.",
Contributor:

Why is max_steps removed? If it is not needed, update the comment as well.

Contributor (Author):

Its default value is -1 (i.e., run all steps); that's why I removed it.
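
A minimal sketch of the check behind the quoted error message (names taken from the quoted diff; the meaning of -1 is assumed from the reply above, and the combined form is shown only for reference):

```python
# Sketch: with max_steps defaulting to -1 ("no step cap, run full epochs"),
# the PR keeps only the num_train_epochs clause; the combined check looked
# roughly like this.
def validate_schedule(num_train_epochs: int, max_steps: int = -1) -> None:
    if num_train_epochs <= 0 and max_steps <= 0:
        raise ValueError(
            "Either training.num_train_epochs > 0 or training.max_steps > 0 must be set."
        )
```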

def test_config(config_path):
master_config = parse_arguments(args=[])
config_manager = ConfigManager(master_config)
master_config = parse_arguments()
Contributor:

As per the proposed flow, just pass config_path and pass None in place of master_config.
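
A sketch of what the test might look like under that flow (the attribute used to read the parsed config back is an assumption):

```python
# Sketch of the proposed flow: pass only the YAML path and let ConfigManager
# handle parsing/validation; no separate parse_arguments() call.
def test_config(config_path):
    config_manager = ConfigManager(config=None, config_path=config_path)
    master_config = config_manager.config  # attribute name is an assumption
    assert master_config is not None
```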

early_stopping:
early_stopping_patience: 3
early_stopping_threshold: 0.001
tensorboard:
Contributor:

Why is it removed?

Contributor (Author):

I think it's a mistake; I will add it back.
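
For reference, a sketch of what these fields would map to if the stack forwards them to a Hugging Face `Trainer`-style callback (the mapping is an assumption about the implementation, not taken from this PR):

```python
# Sketch: the early_stopping fields above as a transformers EarlyStoppingCallback.
from transformers import EarlyStoppingCallback

early_stopping_cb = EarlyStoppingCallback(
    early_stopping_patience=3,       # stop after 3 evaluations without improvement
    early_stopping_threshold=0.001,  # minimum improvement to count as progress
)
```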



def parse_arguments(config_path: Optional[str] = None, args: Optional[List[str]] = None) -> MasterConfig:
def parse_arguments() -> MasterConfig:
Contributor:

There is no need for this function, as it is not doing anything; argument parsing happens inside ConfigManager.

"""Manages configuration loading, validation, and updates."""

def __init__(self, config: MasterConfig):
def __init__(self, config: MasterConfig, config_path: Optional[str] = None):
Contributor:

The init should take config only if the user wants to override its values and pass it to ConfigManager. Similarly, for use cases where the user wants to use a config stored at config_path, the config_path argument is used.

For our use case, where we invoke finetuning from the CLI, ConfigManager should not be given anything because it parses CLI arguments within its init.

Accordingly, changes should be made in #731.

CC: @quic-swatia
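
A minimal sketch of the behaviour described above (the helper methods and overall structure are assumptions, not the actual implementation):

```python
# Sketch: ConfigManager chooses its source of truth based on what it is given.
from typing import Optional

class ConfigManager:
    """Sketch of the flow described above; helper bodies are placeholders."""

    def __init__(self, config: Optional["MasterConfig"] = None,
                 config_path: Optional[str] = None):
        if config is not None:
            self.config = config                         # programmatic override
        elif config_path is not None:
            self.config = self._load_yaml(config_path)   # config stored on disk
        else:
            self.config = self._parse_cli_args()         # CLI invocation

    def _load_yaml(self, path: str) -> "MasterConfig":
        raise NotImplementedError  # placeholder; real loader lives in the stack

    def _parse_cli_args(self) -> "MasterConfig":
        raise NotImplementedError  # placeholder; real argparse lives in the stack
```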


**Single device using CLI flags**
```bash
python finetune_experimental.py --device qaic --lora_r 16 --target_modules q_proj, v_proj --gradient_checkpointing True
```
Contributor:

The --enable-pp and --enable-ddp arguments should also be included and tested for functionality.


**Single device using YAML file**
```bash
python finetune_experimental.py --config configs/sample_config.yaml
```
Contributor:

`python -m QEfficient.cloud.finetune_experimental --config configs/sample_config.yaml`. I tried the above command and it fails due to a path issue when executed from the QEfficient base directory. Use this command with an absolute path instead.

Contributor:

Please test and verify all the commands specified here for any potential breakages.

ochougul and others added 17 commits February 5, 2026 09:25
carry over patch   quic#693

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com>
Added step wise instructions for MULTI NODE Finetuning.

---------

Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com>
Add support for multi-node Distributed Data Parallel (DDP) training to
the QEfficient finetuning pipeline. This enables scaling training across
multiple nodes while keeping the existing single-node behavior
unchanged.

Commands for DDP across 2 servers:
For the Master Addr or the Primary Machine, use node-rank as 0:
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4
--seed 0 --node-rank=0 --master_addr=<MASTER_NODE_IP> --master_port=8000
-m QEfficient.cloud.finetune --device qaic --enable_ddp --model_name
"meta-llama/Llama-3.2-1B" --dataset alpaca_dataset --train_batch_size 1
--val_batch_size 1 --num_epochs 1 --max_train_step 200 --max_eval_step
50

For Node 1, use node-rank as 1:
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4
--seed 0 --node-rank=1 --master_addr=<MASTER_NODE_IP> --master_port=8000
-m QEfficient.cloud.finetune --device qaic --enable_ddp --model_name
"meta-llama/Llama-3.2-1B" --dataset alpaca_dataset --train_batch_size 1
--val_batch_size 1 --num_epochs 1 --max_train_step 200 --max_eval_step
50

---------

Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com>
Handled the edge case where num samples in a dataset are less than 20.
Corrected the dataset link in grammar_dataset.py

Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com>
Added default NPI file for Gemma3.

1. Eliminates the need to provide NPI file as an extra argument by user.
NPI file added as default, no need to provide it explicitly in the
example script

---------

Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com>
Signed-off-by: Ann Kuruvilla <quic_akuruvil@quicinc.com>
Removed OpenGVLab/InternVL2_5-1B and OpenGVLab/InternVL3_5-1B test due
to a compiler issue to unblock the CI

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Updated Qeff version to mainline

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com>
The SW issue came with prompt + generation length > SW.

Fix
1. Cache updated with HybridSlidingWindowCache in cache utils

---------

Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
tchawada and others added 9 commits February 5, 2026 09:54
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Added Readme file for the parameters used in sample config.

---------

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com>
Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com>
Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: Ann Kuruvilla <quic_akuruvil@quicinc.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com>
Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: meetkuma <meetkuma@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Co-authored-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com>
Co-authored-by: Mohit Soni <quic_mohisoni@quicinc.com>
Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Co-authored-by: vtirumal <vtirumal@qti.qualcomm.com>
Co-authored-by: vjanfaza <vjanfaza@qti.qualcomm.com>
Co-authored-by: Ann Kuruvilla <quic_akuruvil@quicinc.com>
Co-authored-by: smedhe <smedhe@qti.qualcomm.com>
Co-authored-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Co-authored-by: Amit Raj <amitraj@qti.qualcomm.com>
Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>
Co-authored-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Co-authored-by: Meet Patel <meetkuma@qti.qualcomm.com>
Co-authored-by: Swati Allabadi <quic_sallabad@quicinc.com>
Co-authored-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
quic-akuruvil merged commit b78efe6 into quic:ft_experimental on Feb 5, 2026
2 of 3 checks passed
quic-akuruvil added a commit to quic-akuruvil/efficient_transformers that referenced this pull request Feb 9, 2026
…l config (quic#747)

This PR contains:
1. Documentation for the new finetune experimental stack.
2. Updates in config_manager.py.

---------

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com>
Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com>
Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com>
Signed-off-by: Ann Kuruvilla <quic_akuruvil@quicinc.com>
Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Signed-off-by: meetkuma <meetkuma@qti.qualcomm.com>
Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Co-authored-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com>
Co-authored-by: vjanfaza <vjanfaza@qti.qualcomm.com>
Co-authored-by: Ann Kuruvilla <quic_akuruvil@quicinc.com>
Co-authored-by: smedhe <smedhe@qti.qualcomm.com>
Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com>
Co-authored-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Co-authored-by: Mohit Soni <quic_mohisoni@quicinc.com>
Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com>
Co-authored-by: vtirumal <vtirumal@qti.qualcomm.com>
Co-authored-by: asmigosw <asmigosw@qti.qualcomm.com>
Co-authored-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Co-authored-by: Amit Raj <amitraj@qti.qualcomm.com>
Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Co-authored-by: Meet Patel <meetkuma@qti.qualcomm.com>
Co-authored-by: Swati Allabadi <quic_sallabad@quicinc.com>
Co-authored-by: Swati Allabadi <sallabad@qti.qualcomm.com>
quic-akuruvil added a commit to quic-akuruvil/efficient_transformers that referenced this pull request Feb 16, 2026
…l config (quic#747)

```yaml
ddp_broadcast_buffers: null
ddp_timeout: 1800
```
- **FSDP**: Fully Sharded Data Parallelism (FSDP) is supported for model sharding.
Contributor:

Please remove this from here. We have not done any experiments or added any support for FSDP in the pipeline yet.
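
For context, a sketch of what the ddp_* fields quoted above would map to if they are forwarded to Hugging Face TrainingArguments parameters of the same names (this mapping is an assumption, not confirmed by the PR):

```python
# Sketch: ddp_timeout / ddp_broadcast_buffers as TrainingArguments parameters.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",            # placeholder output directory
    ddp_timeout=1800,            # process-group timeout in seconds
    ddp_broadcast_buffers=None,  # null in YAML -> keep the DDP default
)
```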
