['QEff.finetuning'] Changing some params from training config to model config #747
quic-akuruvil merged 28 commits into quic:ft_experimental from
Conversation
Please correct the title of the PR. It seems to have been renamed to match one of the commit messages.
docs/source/hf_finetune.md
Outdated
| * **completion\_template**: string pattern that tells the fine-tuning pipeline which part of the dataset should be treated as the target output (completion) for the model to learn. | ||
| **Note** : completion_func and completion_template cannot be used together. Please specify only one of these options at a time. | ||
| * **dataset_subset**: `default = "default"` → The subset of the dataset to use (useful for multi-configuration datasets). |
Give more description for this: explain how to use it and include a sample value as an example.
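For illustration (not part of the PR yet), such a sample value might look like this in the YAML config; the key names follow the documentation snippet above, while the subset name and template string are placeholders rather than project defaults:

```yaml
# Illustrative only: key names follow the snippet above; the subset and template
# values are placeholders, not project defaults.
dataset_name: "yahma/alpaca-cleaned"
dataset_subset: "default"                        # configuration to load for multi-config datasets
completion_template: "### Response:\n{output}"   # pattern marking the target completion in each sample
```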
docs/source/hf_finetune.md
Outdated
| dataset_name: "yahma/alpaca-cleaned" | ||
| train_split: "train" | ||
| test_split: "test" | ||
| max_seq_length: 512 |
Why does only the Alpaca dataset config have max_seq_length?

Since max_seq_length has a default value, it is not necessary to provide it in every config, but I can still add it.

Okay, but what if the Alpaca dataset has samples with seq_len > 512? On what basis is it set to 512 here?

This is just an example; the user can modify it.

I think any default we set (for each dataset) should not be an arbitrary value. It should be close to the best value, based on the length of the samples in the dataset or on the maximum the hardware can support. This might require an analysis of the dataset samples; we might lose samples whose length exceeds 512.

Anyway, we can work on setting defaults after further analysis. For now this is okay to complete the end-to-end testing, but please note it down for future resolution.
| errors, | ||
| n_epochs <= 0 and max_steps <= 0, | ||
| n_epochs <= 0, | ||
| "Either training.num_train_epochs > 0 or training.max_steps > 0 must be set.", |
Why is max_steps removed? If it is not needed, update the error message as well.

Its default value is -1, which means run all steps; that's why I removed it.
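A minimal sketch of the validation intent discussed above, assuming max_steps defaults to -1 (train for the full number of epochs); this is illustrative, not the repository's exact code:

```python
# Illustrative only; names and defaults mirror the discussion above, not the actual module.
def validate_training_config(num_train_epochs: int, max_steps: int = -1) -> list:
    errors = []
    if num_train_epochs <= 0:
        errors.append("training.num_train_epochs must be > 0.")
    # -1 means "no step cap, train for the full number of epochs", so it is valid;
    # any other non-positive value is rejected.
    if max_steps != -1 and max_steps <= 0:
        errors.append("training.max_steps must be > 0, or -1 to train for full epochs.")
    return errors
```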
| def test_config(config_path): | ||
| master_config = parse_arguments(args=[]) | ||
| config_manager = ConfigManager(master_config) | ||
| master_config = parse_arguments() |
As per the proposed flow, just pass config_path and pass None in place of master_config.
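A hedged sketch of what that proposed test flow could look like; the ConfigManager arguments and the accessor used here follow the reviewer's suggestion and may not match the final implementation:

```python
# Hypothetical flow per the review comment; exact signature and attributes may differ.
def test_config(config_path):
    config_manager = ConfigManager(None, config_path=config_path)  # no pre-parsed master_config
    master_config = config_manager.config  # assumed accessor for the resolved MasterConfig
    assert master_config is not None
```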
| early_stopping: | ||
| early_stopping_patience: 3 | ||
| early_stopping_threshold: 0.001 | ||
| tensorboard: |
I think it's a mistake; I will add it back.
| def parse_arguments(config_path: Optional[str] = None, args: Optional[List[str]] = None) -> MasterConfig: | ||
| def parse_arguments() -> MasterConfig: |
No need for this function, since it is not doing anything; argument parsing happens inside ConfigManager.
| """Manages configuration loading, validation, and updates.""" | ||
| def __init__(self, config: MasterConfig): | ||
| def __init__(self, config: MasterConfig, config_path: Optional[str] = None): |
The init should take config only if the user wants to override its values and pass them to ConfigManager. In the same way, for use cases where the user wants to use a config stored at config_path, the config_path argument is used.
For our use case, where finetuning is invoked from the CLI, ConfigManager should not be given anything, because it parses the CLI arguments within its init.
Accordingly, changes should be made in #731.
CC: @quic-swatia
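To make the proposed contract concrete, a hedged sketch of the three invocation modes described above; the module path and exact ConfigManager signature are assumptions and may differ in the repository:

```python
# Illustrative only; the import path and argument names are assumptions based on the
# review discussion, not the repository's verified API.
# from QEfficient.finetune.experimental.config_manager import ConfigManager, MasterConfig

# 1. CLI entry point: give ConfigManager nothing; it parses sys.argv in its init.
cm_cli = ConfigManager()

# 2. Config stored on disk: pass only the path to the YAML file.
cm_yaml = ConfigManager(config_path="configs/sample_config.yaml")

# 3. Programmatic override: build a MasterConfig yourself and pass it in.
cm_override = ConfigManager(config=MasterConfig())
```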
| **Single device using CLI flags** | ||
| ```bash | ||
| python finetune_experimental.py --device qaic --lora_r 16 --target_modules q_proj, v_proj --gradient_checkpointing True |
The --enable-pp and --enable-ddp arguments should also be included and tested for their functionality.
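A hedged example of what such a command could look like with those flags added; the flag spellings (hyphen vs. underscore) and accepted values should be verified against the argument parser before documenting them:

```bash
# Illustrative command only; verify flag names against the parser before use.
python finetune_experimental.py --device qaic --lora_r 16 \
    --target_modules q_proj, v_proj \
    --gradient_checkpointing True \
    --enable_ddp --enable_pp
```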
docs/source/hf_finetune.md
Outdated
| **Single device using yaml file** | ||
| ```bash | ||
| python finetune_experimental.py --config configs/sample_config.yaml |
Use `python -m QEfficient.cloud.finetune_experimental --config configs/sample_config.yaml` instead. I tried the command above and it fails due to a path issue when executed from the QEfficient base directory; use this command with an absolute path.
Please test and verify all the commands specified here for any potential breakages.
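For clarity, the working invocation suggested above might look like this; the absolute path is a placeholder to be replaced with the actual location of the config on the user's machine:

```bash
# The relative path fails when run outside the repository root; pass an absolute path instead.
python -m QEfficient.cloud.finetune_experimental \
    --config /abs/path/to/QEfficient/configs/sample_config.yaml
```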
carry over patch quic#693 Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com>
Added step-wise instructions for multi-node finetuning. --------- Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com>
Add support for multi-node Distributed Data Parallel (DDP) training to the QEfficient finetuning pipeline. This enables scaling training across multiple nodes while keeping the existing single-node behavior unchanged.

Commands for DDP across 2 servers:

For the master address (primary machine), use node-rank 0:

QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4 --seed 0 --node-rank=0 --master_addr=<MASTER_NODE_IP> --master_port=8000 -m QEfficient.cloud.finetune --device qaic --enable_ddp --model_name "meta-llama/Llama-3.2-1B" --dataset alpaca_dataset --train_batch_size 1 --val_batch_size 1 --num_epochs 1 --max_train_step 200 --max_eval_step 50

For node 1, use node-rank 1:

QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4 --seed 0 --node-rank=1 --master_addr=<MASTER_NODE_IP> --master_port=8000 -m QEfficient.cloud.finetune --device qaic --enable_ddp --model_name "meta-llama/Llama-3.2-1B" --dataset alpaca_dataset --train_batch_size 1 --val_batch_size 1 --num_epochs 1 --max_train_step 200 --max_eval_step 50

--------- Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com>
Handled the edge case where the number of samples in a dataset is less than 20. Corrected the dataset link in grammar_dataset.py. Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com>
Added a default NPI file for Gemma3. This eliminates the need for the user to provide the NPI file as an extra argument; it is now used by default, so it no longer needs to be passed explicitly in the example script. --------- Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com> Signed-off-by: Ann Kuruvilla <quic_akuruvil@quicinc.com>
Removed OpenGVLab/InternVL2_5-1B and OpenGVLab/InternVL3_5-1B test due to a compiler issue to unblock the CI --------- Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Updated Qeff version to mainline --------- Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>
Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com>
The sliding-window (SW) issue occurred when prompt + generation length > SW. Fix: the cache was updated to use HybridSlidingWindowCache in cache utils. --------- Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com>
Added Readme file for the parameters used in sample config. --------- Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com> Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com> Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com> Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com> Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com> Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com> Signed-off-by: Ann Kuruvilla <quic_akuruvil@quicinc.com> Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com> Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com> Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com> Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Signed-off-by: meetkuma <meetkuma@qti.qualcomm.com> Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com> Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com> Co-authored-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com> Co-authored-by: Mohit Soni <quic_mohisoni@quicinc.com> Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com> Co-authored-by: vtirumal <vtirumal@qti.qualcomm.com> Co-authored-by: vjanfaza <vjanfaza@qti.qualcomm.com> Co-authored-by: Ann Kuruvilla <quic_akuruvil@quicinc.com> Co-authored-by: smedhe <smedhe@qti.qualcomm.com> Co-authored-by: asmigosw <asmigosw@qti.qualcomm.com> Co-authored-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com> Co-authored-by: Amit Raj <amitraj@qti.qualcomm.com> Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com> Co-authored-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Co-authored-by: Meet Patel <meetkuma@qti.qualcomm.com> Co-authored-by: Swati Allabadi <quic_sallabad@quicinc.com> Co-authored-by: Swati Allabadi <sallabad@qti.qualcomm.com>
…l config (quic#747) This PR contain: 1.documentation for new finetune experimental stack. 2. Updates inconfig_manager.py --------- Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com> Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com> Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com> Signed-off-by: Sharvari Medhe <smedhe@qti.qualcomm.com> Signed-off-by: Ann Kuruvilla <quic_akuruvil@quicinc.com> Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com> Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Signed-off-by: abhishek-singh591 <sabhis@qti.qualcomm.com> Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Signed-off-by: Tanisha Chawada <tchawada@qti.qualcomm.com> Signed-off-by: Mohit Soni <mohisoni@qti.qualcomm.com> Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com> Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com> Signed-off-by: Amit Raj <amitraj@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com> Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: meetkuma <meetkuma@qti.qualcomm.com> Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com> Co-authored-by: Onkar Chougule <168134249+ochougul@users.noreply.github.com> Co-authored-by: vjanfaza <vjanfaza@qti.qualcomm.com> Co-authored-by: Ann Kuruvilla <quic_akuruvil@quicinc.com> Co-authored-by: smedhe <smedhe@qti.qualcomm.com> Co-authored-by: Rishin Raj <rishinr@qti.qualcomm.com> Co-authored-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com> Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com> Co-authored-by: Mohit Soni <quic_mohisoni@quicinc.com> Co-authored-by: Mohit Soni <mohisoni@qti.qualcomm.com> Co-authored-by: vtirumal <vtirumal@qti.qualcomm.com> Co-authored-by: asmigosw <asmigosw@qti.qualcomm.com> Co-authored-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com> Co-authored-by: Amit Raj <amitraj@qti.qualcomm.com> Co-authored-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Co-authored-by: Meet Patel <meetkuma@qti.qualcomm.com> Co-authored-by: Swati Allabadi <quic_sallabad@quicinc.com> Co-authored-by: Swati Allabadi <sallabad@qti.qualcomm.com>
| ddp_broadcast_buffers: null | ||
| ddp_timeout: 1800 | ||
| ``` | ||
| - **FSDP**: Fully Sharded Data Parallelism (FSDP) is supported for model sharding. |
Please remove this from here. We have not done any experiments or added any support for FSDP in the pipeline yet.
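For reference, a commented version of the DDP keys shown in the snippet above; the meanings are assumptions based on the corresponding Hugging Face TrainingArguments fields and should be confirmed against the pipeline:

```yaml
# Assumed semantics; confirm against the finetuning pipeline before relying on them.
ddp_broadcast_buffers: null   # use the framework default for broadcasting buffers across ranks
ddp_timeout: 1800             # timeout for distributed collectives, in seconds
```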
This PR contains:
1. Documentation for the new finetune experimental stack.
2. Updates in config_manager.py