Skip to content

Commit f22161a

Browse files
ko3n1gcuichenxakoumpayaoyu-33chtruong814
authored
build: Bump PyT to 25.01 (#11973)
* f Signed-off-by: oliver könig <[email protected]> * f Signed-off-by: oliver könig <[email protected]> * try to fix checkpoint loading Signed-off-by: Chen Cui <[email protected]> * update ci checkpoint Signed-off-by: Chen Cui <[email protected]> * fix ckpt loading for nemo1 and t5 tests Signed-off-by: Chen Cui <[email protected]> * fix Signed-off-by: oliver könig <[email protected]> * f Signed-off-by: oliver könig <[email protected]> * f Signed-off-by: oliver könig <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * f Signed-off-by: oliver könig <[email protected]> * bump Signed-off-by: oliver könig <[email protected]> * Rename neva datamodule (#12121) * Rename dataset Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * Update Signed-off-by: yaoyu-33 <[email protected]> * pylink Signed-off-by: yaoyu-33 <[email protected]> * fix f string Signed-off-by: yaoyu-33 <[email protected]> * fix intern vit default factory Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Signed-off-by: Charlie Truong <[email protected]> * tests: Run FSDP2 on dual-gpu (#12145) Signed-off-by: oliver könig <[email protected]> * fix cmd Signed-off-by: Chen Cui <[email protected]> * Revert "build: Force re-install VCS dependencies (#12155)" (#12163) This reverts commit 4b19ade. Signed-off-by: Charlie Truong <[email protected]> * fix tests Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * fix style Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> * fix style Signed-off-by: dimapihtar <[email protected]> * remove unsed variable Signed-off-by: dimapihtar <[email protected]> * Fix nsys callback tests (#12177) * Fix nsys callback tests Signed-off-by: Hemil Desai <[email protected]> * Simplify tests Signed-off-by: Hemil Desai <[email protected]> * Fix Signed-off-by: Hemil Desai <[email protected]> --------- Signed-off-by: Hemil Desai <[email protected]> * Fixes for bumping pyt to 25.01 (#12165) * fix attention impl config Signed-off-by: Maanu Grover <[email protected]> * safeguard against empty key added by mcore Signed-off-by: Maanu Grover <[email protected]> * docu Signed-off-by: Alexandros Koumparoulis <[email protected]> * noqa? Signed-off-by: Alexandros Koumparoulis <[email protected]> * noqa? Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove unused import Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: akoumpa <[email protected]> * Use pip --no-deps --force-reinstall when building the test container (#12175) * Use pip --no-deps --force-reinstall when building the test container Signed-off-by: Charlie Truong <[email protected]> * Explicitly reinstall nvidia-resiliency Signed-off-by: Charlie Truong <[email protected]> * Do not have pytest capture output in lightning unit tests Signed-off-by: Charlie Truong <[email protected]> --------- Signed-off-by: Charlie Truong <[email protected]> * Set weights_only=False in torch.load in EMA callback and AdapterMixin (#12198) * Set weights_only=False in torch.load within EMA callbck Signed-off-by: Charlie Truong <[email protected]> * Set weights_only=False in torch.load call in adapter_mixins Signed-off-by: Charlie Truong <[email protected]> * Apply isort and black reformatting Signed-off-by: chtruong814 <[email protected]> * Fix lint errors in ema and adapter_mixins Signed-off-by: Charlie Truong <[email protected]> * Apply isort and black reformatting Signed-off-by: chtruong814 <[email protected]> * Fix undefined name error in adapter_mixins Signed-off-by: Charlie Truong <[email protected]> * Apply isort and black reformatting Signed-off-by: chtruong814 <[email protected]> * Remove unnecessary import of __futures__ in adapter_mixins Signed-off-by: Charlie Truong <[email protected]> * Fix more locations where weights_only=False needs to be passed to torch.load Signed-off-by: Charlie Truong <[email protected]> * Fix lint issues in nemo_model_checkpoint Signed-off-by: Charlie Truong <[email protected]> * Apply isort and black reformatting Signed-off-by: chtruong814 <[email protected]> * Correct noqa F821 line in nemo_model_checkpoint Signed-off-by: Charlie Truong <[email protected]> --------- Signed-off-by: Charlie Truong <[email protected]> Signed-off-by: chtruong814 <[email protected]> Co-authored-by: chtruong814 <[email protected]> * Fix nemo-run stdin exception (#12197) * Use pip --no-deps --force-reinstall when building the test container Signed-off-by: Charlie Truong <[email protected]> * Explicitly reinstall nvidia-resiliency Signed-off-by: Charlie Truong <[email protected]> * Do not have pytest capture output in lightning unit tests Signed-off-by: Charlie Truong <[email protected]> * mock invoke context Signed-off-by: Maanu Grover <[email protected]> * Revert "Do not have pytest capture output in lightning unit tests" This reverts commit 162c730. Signed-off-by: Maanu Grover <[email protected]> * tweak Signed-off-by: Maanu Grover <[email protected]> * more patches needed Signed-off-by: Maanu Grover <[email protected]> --------- Signed-off-by: Charlie Truong <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Co-authored-by: Charlie Truong <[email protected]> * Add legacy_ckpt arg to scripts/llm/gpt_distillation.py Signed-off-by: Charlie Truong <[email protected]> * Revert "Do not have pytest capture output in lightning unit tests" (#12202) This reverts commit 162c730. Signed-off-by: Maanu Grover <[email protected]> * Ckpt fixes pytorch update (#12228) * Fix checkpoint loading for ptq tests Signed-off-by: Charlie Truong <[email protected]> * Pass --ckpt_load_strictness log_all to L2_NeMo_2_SSM_Finetuning test Signed-off-by: Charlie Truong <[email protected]> * Fix tests in L2_NMT_Attention_is_All_You_Need_Finetuning by using default_factory Signed-off-by: Charlie Truong <[email protected]> --------- Signed-off-by: Charlie Truong <[email protected]> * bump modelopt Signed-off-by: oliver könig <[email protected]> * fix Signed-off-by: oliver könig <[email protected]> * fix Signed-off-by: oliver könig <[email protected]> * fix Signed-off-by: oliver könig <[email protected]> * Fix 2D bucketing test on Python 3.12 (#12265) * ci: Bump release workflows (#12259) Signed-off-by: oliver könig <[email protected]> * Fix 2D bucketing test on Python 3.12 Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: oliver könig <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]> Co-authored-by: oliver könig <[email protected]> * remove retro test Signed-off-by: oliver könig <[email protected]> * L2_VLM_HF_Transformer_SFT_FSDP2 Signed-off-by: oliver könig <[email protected]> * Fix distillation state-dict loading bug (#12270) * ci: Bump release workflows (#12259) Signed-off-by: oliver könig <[email protected]> * Add default param dtype in mistral configs (#12186) Signed-off-by: Alexandros Koumparoulis <[email protected]> * ci: Fix pypi link of dry-run (#12267) Signed-off-by: oliver könig <[email protected]> * Fix tiny missing asterisk bug Signed-off-by: Asha Anoosheh <[email protected]> --------- Signed-off-by: oliver könig <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: Asha Anoosheh <[email protected]> Co-authored-by: oliver könig <[email protected]> Co-authored-by: Alexandros Koumparoulis <[email protected]> * Add optimizer fix (#12253) * Add optimizer fix Signed-off-by: Boxiang Wang <[email protected]> * Update hf_auto_model_for_causal_lm.py Signed-off-by: BoxiangW <[email protected]> --------- Signed-off-by: Boxiang Wang <[email protected]> Signed-off-by: BoxiangW <[email protected]> * num gpus Signed-off-by: oliver könig <[email protected]> * Build bitsandbytes (#12279) Co-authored-by: Charlie Truong <[email protected]> * Fixing error when loading T5 checkpoint created with TE<1.13 (#12264) * fix error when loading checkpoint created with TE<1.13 * fix formatting * fix formatting * fix formatting * Apply isort and black reformatting Signed-off-by: huvunvidia <[email protected]> --------- Signed-off-by: huvunvidia <[email protected]> Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: huvunvidia <[email protected]> * no cancel for manual Signed-off-by: oliver könig <[email protected]> * Set L2_Speech_Batch_Size_OOMptimizer_Canary to be optional (#12299) Signed-off-by: Charlie Truong <[email protected]> * make expressions r-strings Signed-off-by: Alexandros Koumparoulis <[email protected]> * r-string Signed-off-by: Alexandros Koumparoulis <[email protected]> * r-string Signed-off-by: Alexandros Koumparoulis <[email protected]> * r-string Signed-off-by: Alexandros Koumparoulis <[email protected]> * r-string Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove parallelize_fn Signed-off-by: Alexandros Koumparoulis <[email protected]> * rstring Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * r-strings Signed-off-by: Alexandros Koumparoulis <[email protected]> * reduce grad-acc-steps Signed-off-by: Alexandros Koumparoulis <[email protected]> * r-string Signed-off-by: Alexandros Koumparoulis <[email protected]> * update tests Signed-off-by: Alexandros Koumparoulis <[email protected]> * force-switch from MegatronCheckpointIO to HFCheckpointIO Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix setup_environment call Signed-off-by: Alexandros Koumparoulis <[email protected]> * update test Signed-off-by: Alexandros Koumparoulis <[email protected]> * update verify_sft_checkpoint_structure to handle sharded checkpoints Signed-off-by: Alexandros Koumparoulis <[email protected]> * make io_bytes.pt optional Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * set broken tests to optional Signed-off-by: oliver könig <[email protected]> --------- Signed-off-by: oliver könig <[email protected]> Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: Charlie Truong <[email protected]> Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Signed-off-by: Hemil Desai <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: akoumpa <[email protected]> Signed-off-by: chtruong814 <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]> Signed-off-by: Asha Anoosheh <[email protected]> Signed-off-by: Boxiang Wang <[email protected]> Signed-off-by: BoxiangW <[email protected]> Signed-off-by: huvunvidia <[email protected]> Co-authored-by: Chen Cui <[email protected]> Co-authored-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Yu Yao <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Co-authored-by: Charlie Truong <[email protected]> Co-authored-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Co-authored-by: Hemil Desai <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: akoumpa <[email protected]> Co-authored-by: chtruong814 <[email protected]> Co-authored-by: Piotr Żelasko <[email protected]> Co-authored-by: Asha Anoosheh <[email protected]> Co-authored-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: BoxiangW <[email protected]> Co-authored-by: Huy Vu <[email protected]> Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: huvunvidia <[email protected]>
1 parent 4043a6a commit f22161a

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

42 files changed

+553
-337
lines changed

.github/workflows/_test_template.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -188,7 +188,7 @@ jobs:
188188
include-hidden-files: true
189189

190190
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
191-
if: failure() && inputs.IS_OPTIONAL == false && !contains(github.event.pull_request.labels.*.name, 'no-fail-fast')
191+
if: failure() && inputs.IS_OPTIONAL == false && github.event_name == 'pull_request' && !contains(github.event.pull_request.labels.*.name, 'no-fail-fast')
192192
- name: after_script
193193
if: always() && inputs.AFTER_SCRIPT != ':'
194194
run: |

.github/workflows/cicd-main.yml

Lines changed: 48 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -1455,16 +1455,17 @@ jobs:
14551455
AFTER_SCRIPT: |
14561456
rm -rf nemo_experiments
14571457
1458-
L2_VLM_HF_Transformer_SFT_FSDP2:
1458+
Optional_L2_VLM_HF_Transformer_SFT_FSDP2:
14591459
needs: [pre-flight, cicd-test-container-build]
14601460
uses: ./.github/workflows/_test_template.yml
1461-
if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'L2_VLM_HF_Transformer_SFT_FSDP2')
1461+
if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'Optional_L2_VLM_HF_Transformer_SFT_FSDP2')
14621462
with:
1463-
RUNNER: self-hosted-azure-gpus-1
1463+
RUNNER: self-hosted-azure
14641464
SCRIPT: |
14651465
TRANSFORMERS_OFFLINE=1 python tests/collections/vlm/hf/sft_fsdp2.py --model /home/TestData/vlm/qwen2-2b/ --max-steps 3
14661466
AFTER_SCRIPT: |
14671467
rm -rf nemo_experiments
1468+
IS_OPTIONAL: true
14681469

14691470
L2_HF_Transformer_PEFT_notebook:
14701471
needs: [pre-flight, cicd-test-container-build]
@@ -1603,16 +1604,17 @@ jobs:
16031604
AFTER_SCRIPT: |
16041605
rm -rf nemo_experiments
16051606
1606-
L2_HF_Transformer_SFT_FSDP2_2gpu:
1607+
Optional_L2_HF_Transformer_SFT_FSDP2_2gpu:
16071608
needs: [pre-flight, cicd-test-container-build]
16081609
uses: ./.github/workflows/_test_template.yml
1609-
if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'L2_HF_Transformer_SFT_FSDP2_2gpu')
1610+
if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'Optional_L2_HF_Transformer_SFT_FSDP2_2gpu')
16101611
with:
16111612
RUNNER: self-hosted-azure
16121613
SCRIPT: |
16131614
TRANSFORMERS_OFFLINE=1 python tests/collections/llm/hf/sft_fsdp2.py --model /home/TestData/nlp/hf_gemma/hf_gemma_2b --max-steps 10 --devices 2
16141615
AFTER_SCRIPT: |
16151616
rm -rf nemo_experiments
1617+
IS_OPTIONAL: true
16161618

16171619
L2_HF_Transformer_PT_2gpu:
16181620
needs: [pre-flight, cicd-test-container-build]
@@ -1696,16 +1698,17 @@ jobs:
16961698
AFTER_SCRIPT: |
16971699
rm -rf nemo_experiments
16981700
1699-
L2_HF_Transformer_SFT_TE_Acceleration:
1701+
Optional_L2_HF_Transformer_SFT_TE_Acceleration:
17001702
needs: [pre-flight, cicd-test-container-build]
17011703
uses: ./.github/workflows/_test_template.yml
1702-
if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'L2_HF_Transformer_SFT_TE_Acceleration')
1704+
if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'Optional_L2_HF_Transformer_SFT_TE_Acceleration')
17031705
with:
17041706
RUNNER: self-hosted-azure-gpus-1
17051707
SCRIPT: |
17061708
TRANSFORMERS_OFFLINE=1 python tests/collections/llm/hf/sft.py --model /home/TestData/akoumparouli/hf_mixtral_2l/ --model-accelerator te --max-steps 3
17071709
AFTER_SCRIPT: |
17081710
rm -rf nemo_experiments
1711+
IS_OPTIONAL: true
17091712

17101713
L2_HF_Transformer_PT_TE_Acceleration:
17111714
needs: [pre-flight, cicd-test-container-build]
@@ -2115,7 +2118,8 @@ jobs:
21152118
--devices 1 \
21162119
--max-steps 10 \
21172120
--experiment-dir /tmp/nlp_megatron_mamba_nemo-ux-mamba_cicd_test_sft/${{ github.run_id }} \
2118-
--model-path /home/TestData/nlp/megatron_mamba/model_optim_rng.pt
2121+
--model-path /home/TestData/nlp/megatron_mamba/model_optim_rng.pt \
2122+
--ckpt_load_strictness log_all
21192123
21202124
L2_NeMo_2_HF_MODEL_IMPORT:
21212125
needs: [pre-flight, cicd-test-container-build]
@@ -2253,7 +2257,7 @@ jobs:
22532257
SCRIPT: |
22542258
22552259
python tests/collections/llm/gpt_finetuning.py \
2256-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2260+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
22572261
--devices 2 \
22582262
--max_steps 3 \
22592263
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2263,7 +2267,7 @@ jobs:
22632267
--mbs 1
22642268
22652269
python tests/collections/llm/gpt_finetuning.py \
2266-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2270+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
22672271
--devices 2 \
22682272
--max_steps 6 \
22692273
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2281,7 +2285,7 @@ jobs:
22812285
SCRIPT: |
22822286
22832287
python tests/collections/llm/gpt_finetuning.py \
2284-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2288+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
22852289
--devices 2 \
22862290
--max_steps 3 \
22872291
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2291,7 +2295,7 @@ jobs:
22912295
--mbs 2
22922296
22932297
python tests/collections/llm/gpt_finetuning.py \
2294-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2298+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
22952299
--devices 2 \
22962300
--max_steps 6 \
22972301
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2309,7 +2313,7 @@ jobs:
23092313
SCRIPT: |
23102314
23112315
python tests/collections/llm/gpt_finetuning.py \
2312-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2316+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
23132317
--devices 2 \
23142318
--max_steps 3 \
23152319
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2319,7 +2323,7 @@ jobs:
23192323
--mbs 2
23202324
23212325
python tests/collections/llm/gpt_finetuning.py \
2322-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2326+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
23232327
--devices 2 \
23242328
--max_steps 6 \
23252329
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2337,7 +2341,7 @@ jobs:
23372341
SCRIPT: |
23382342
23392343
python tests/collections/llm/gpt_finetuning.py \
2340-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2344+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
23412345
--devices 2 \
23422346
--max_steps 3 \
23432347
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2347,7 +2351,7 @@ jobs:
23472351
--mbs 2
23482352
23492353
python tests/collections/llm/gpt_finetuning.py \
2350-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2354+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
23512355
--devices 2 \
23522356
--max_steps 6 \
23532357
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2365,7 +2369,7 @@ jobs:
23652369
SCRIPT: |
23662370
23672371
python tests/collections/llm/gpt_finetuning.py \
2368-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2372+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
23692373
--devices 2 \
23702374
--max_steps 3 \
23712375
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2375,7 +2379,7 @@ jobs:
23752379
--mbs 1 --packed
23762380
23772381
python tests/collections/llm/gpt_finetuning.py \
2378-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2382+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
23792383
--devices 2 \
23802384
--max_steps 6 \
23812385
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2393,7 +2397,7 @@ jobs:
23932397
SCRIPT: |
23942398
23952399
python tests/collections/llm/gpt_finetuning.py \
2396-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2400+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
23972401
--devices 2 \
23982402
--max_steps 3 \
23992403
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2403,7 +2407,7 @@ jobs:
24032407
--mbs 1
24042408
24052409
python tests/collections/llm/gpt_finetuning.py \
2406-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2410+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
24072411
--devices 2 \
24082412
--max_steps 6 \
24092413
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2421,7 +2425,7 @@ jobs:
24212425
SCRIPT: |
24222426
24232427
python tests/collections/llm/gpt_finetuning.py \
2424-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2428+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
24252429
--devices 2 \
24262430
--max_steps 3 \
24272431
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2431,7 +2435,7 @@ jobs:
24312435
--mbs 2
24322436
24332437
python tests/collections/llm/gpt_finetuning.py \
2434-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2438+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
24352439
--devices 2 \
24362440
--max_steps 6 \
24372441
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2449,7 +2453,7 @@ jobs:
24492453
SCRIPT: |
24502454
24512455
python tests/collections/llm/gpt_finetuning.py \
2452-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2456+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
24532457
--devices 2 \
24542458
--max_steps 3 \
24552459
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2459,7 +2463,7 @@ jobs:
24592463
--mbs 2
24602464
24612465
python tests/collections/llm/gpt_finetuning.py \
2462-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2466+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
24632467
--devices 2 \
24642468
--max_steps 6 \
24652469
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2477,7 +2481,7 @@ jobs:
24772481
SCRIPT: |
24782482
24792483
python tests/collections/llm/gpt_finetuning.py \
2480-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2484+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
24812485
--devices 2 \
24822486
--max_steps 3 \
24832487
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2487,7 +2491,7 @@ jobs:
24872491
--mbs 2
24882492
24892493
python tests/collections/llm/gpt_finetuning.py \
2490-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2494+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
24912495
--devices 2 \
24922496
--max_steps 6 \
24932497
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2505,7 +2509,7 @@ jobs:
25052509
SCRIPT: |
25062510
25072511
python tests/collections/llm/gpt_finetuning.py \
2508-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2512+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
25092513
--devices 2 \
25102514
--max_steps 3 \
25112515
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2515,7 +2519,7 @@ jobs:
25152519
--mbs 1 --packed
25162520
25172521
python tests/collections/llm/gpt_finetuning.py \
2518-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2522+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
25192523
--devices 2 \
25202524
--max_steps 6 \
25212525
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2533,7 +2537,7 @@ jobs:
25332537
SCRIPT: |
25342538
25352539
python tests/collections/llm/gpt_finetuning.py \
2536-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2540+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
25372541
--devices 2 \
25382542
--max_steps 3 \
25392543
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2543,7 +2547,7 @@ jobs:
25432547
--mbs 1 --packed
25442548
25452549
python tests/collections/llm/gpt_finetuning.py \
2546-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2550+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
25472551
--devices 2 \
25482552
--max_steps 6 \
25492553
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2560,7 +2564,7 @@ jobs:
25602564
RUNNER: self-hosted-azure
25612565
SCRIPT: |
25622566
python tests/collections/llm/gpt_finetuning.py \
2563-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2567+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
25642568
--devices 2 \
25652569
--max_steps 3 \
25662570
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2570,7 +2574,7 @@ jobs:
25702574
--mbs 1 --packed
25712575
25722576
python tests/collections/llm/gpt_finetuning.py \
2573-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2577+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
25742578
--devices 2 \
25752579
--max_steps 6 \
25762580
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2588,7 +2592,7 @@ jobs:
25882592
SCRIPT: |
25892593
25902594
python tests/collections/llm/gpt_finetuning.py \
2591-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2595+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
25922596
--devices 2 \
25932597
--max_steps 3 \
25942598
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2599,7 +2603,7 @@ jobs:
25992603
--dataset chat
26002604
26012605
python tests/collections/llm/gpt_finetuning.py \
2602-
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v2 \
2606+
--restore_path /home/TestData/nemo2_ckpt/llama_68M_v3 \
26032607
--devices 2 \
26042608
--max_steps 6 \
26052609
--experiment_dir /tmp/nemo2_gpt_finetune/${{ github.run_id }} \
@@ -2726,7 +2730,8 @@ jobs:
27262730
27272731
python tests/collections/llm/peft/lora_merge.py \
27282732
--lora_checkpoint_path=/home/TestData/nemo2_ckpt/llama_lora_ci_checkpoint_v2/ \
2729-
--output_path=/tmp/nemo2_lora_merge/${{ github.run_id }}
2733+
--output_path=/tmp/nemo2_lora_merge/${{ github.run_id }} \
2734+
--legacy_ckpt
27302735
27312736
L2_NEMO_2_LoRA_Export:
27322737
needs: [pre-flight, cicd-test-container-build]
@@ -2755,7 +2760,8 @@ jobs:
27552760
--devices 1 \
27562761
--top_p 0.0 \
27572762
--top_k 1 \
2758-
--num_tokens_to_generate 3
2763+
--num_tokens_to_generate 3 \
2764+
--legacy_ckpt
27592765
27602766
L2_NeMo_2_NeMo_Mcore_Mixtral_bitexact:
27612767
needs: [pre-flight, cicd-test-container-build]
@@ -2775,7 +2781,7 @@ jobs:
27752781
SCRIPT: |
27762782
python tests/collections/llm/test_hf_import.py --hf_model /home/TestData/nlp/megatron_llama/llama-ci-hf --output_path /tmp/nemo2_ckpt
27772783
2778-
python scripts/llm/ptq.py -nc /tmp/nemo2_ckpt -algo fp8 -out /tmp/nemo2_ptq_engine
2784+
python scripts/llm/ptq.py -nc /tmp/nemo2_ckpt -algo fp8 -out /tmp/nemo2_ptq_engine --ckpt_load_strictness log_all
27792785
27802786
AFTER_SCRIPT: |
27812787
rm -rf /tmp/nemo2_ckpt
@@ -2809,7 +2815,8 @@ jobs:
28092815
--warmup_steps 1 \
28102816
--val_check_interval 5 \
28112817
--log_interval 5 \
2812-
--limit_val_batches 2
2818+
--limit_val_batches 2 \
2819+
--legacy_ckpt
28132820
28142821
AFTER_SCRIPT: |
28152822
rm -rf /tmp/nemo2_ckpt
@@ -3058,9 +3065,9 @@ jobs:
30583065
- L2_VLM_HF_Transformer_PEFT
30593066
- L2_VLM_HF_Transformer_PEFT_FSDP
30603067
- L2_VLM_HF_Transformer_PEFT_4bit
3061-
- L2_VLM_HF_Transformer_SFT_FSDP2
3068+
# - Optional_L2_VLM_HF_Transformer_SFT_FSDP2
30623069
- L2_HF_Transformer_SFT_2gpu_nemorun
3063-
- L2_HF_Transformer_SFT_TE_Acceleration
3070+
# - Optional_L2_HF_Transformer_SFT_TE_Acceleration
30643071
- L2_HF_Transformer_PT
30653072
- L2_HF_Transformer_PT_nemorun
30663073
- L2_HF_Transformer_PT_2gpu
@@ -3110,7 +3117,7 @@ jobs:
31103117
- L2_NeMo_2_Export_In_Framework
31113118
- L2_NeMo_2_jit_callback
31123119
- L2_NeMo_2_LLAVA_NEXT_MOCK_TRAINING
3113-
- L2_HF_Transformer_SFT_FSDP2_2gpu
3120+
# - Optional_L2_HF_Transformer_SFT_FSDP2_2gpu
31143121
- L2_HF_Transformer_SFT_2gpu_nemorun_fsdp2
31153122
- L2_NeMo_2_VLLM_EXPORT
31163123
- L2_NeMo_2_EVAL

0 commit comments

Comments
 (0)