update network interface env #1319

lizamd · 2024-12-12T00:47:49Z

only set ifname envs for multi-node training
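
A minimal sketch of the idea in the description, not the PR's actual change: the interface variables are exported only when more than one node takes part in training. The PR text does not name the variables; `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` are assumed here because they are the usual socket-interface settings for NCCL/RCCL and Gloo. The helper names `ifname_env` and `apply_ifname_env` are hypothetical.

```python
import os
from typing import Dict, MutableMapping, Optional


def ifname_env(nnodes: int, ifname: str) -> Dict[str, str]:
    """Return the interface variables to set, or nothing for single-node runs."""
    if nnodes <= 1:
        # Single node: leave interface selection to the library defaults.
        return {}
    # Assumed variable names; the PR description only says "ifname envs".
    return {"NCCL_SOCKET_IFNAME": ifname, "GLOO_SOCKET_IFNAME": ifname}


def apply_ifname_env(
    nnodes: int,
    ifname: str,
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Update `env` (os.environ by default) with the multi-node settings."""
    target = os.environ if env is None else env
    target.update(ifname_env(nnodes, ifname))
    return target
```

Passing a plain dict as `env` makes it possible to check the result without touching the real process environment.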

* Updated training script for llama2 and llama3. Added README.
* modification based on review feedback
* removed experiement/ folder
* modified scripts further
* updated readme
* last update
* modified output format
* add copyright

Co-authored-by: Gurpreet Singh Dhami <[email protected]>

gurpreet-dhami and others added 30 commits October 8, 2024 21:04

rocm docker and scripts

aeed2cd

update mock data

12952bf

update

9713598

add jenkins pipeline

717e7cd

fix runtime error for rocm (#3)

0c4709e

update

6dbde6a

Merge remote-tracking branch 'upstream/main' into upstream_sync

e8c077c

Merge pull request #6 from ROCm/upstream_sync

42b34ba

Upstream sync

rope enable through TE (#4)

2b855ba

* rope enable
* add conditional check for TE
* add command line argument for rope enablement
* move rope enablement to rope utils
* enable fused rope by transformer engine

address review comments

2a3af13

update

967673b

Enable HuggingFaceTokenizer in preprocessing (#10)

a2d0bdf

Co-authored-by: Manjunath Siddaiah <[email protected]>

markers for failing tests on rocm

0d5c01e

mark failing tests on mi250

1efaebf

enable ci pipeline

8be4fae

address review comments

955bdfa

address review comments

328fe81

update dockerfile

b142a98

update script

a533cce

Merge pull request #12 from ROCm/unit_tests_rocm

399c8d9

unit tests rocm

fix docker filename

5ce68ce

Merge pull request #16 from ROCm/ci_pipeline_fix

3d9f229

fix docker filename

Merge branch 'rocm_dev' into rocm_megatron_lm_upstream_rocm_docker

90dbbfd

update

77113cc

remove commented lines

1e64046

Merge pull request #5 from ROCm/rocm_megatron_lm_upstream_rocm_docker

d9a6c85

rocm docker and test scripts

Use mi300 for ci pipeline (#21)

106519d

* Use mi300 for ci pipeline. Use mi300 and implemented the build only node and test node functionality.
* correct the artifact
* make improvements
* add unique docker image id
* update

fix string error (#25)

d24a364

Enabled TEGroupedMLP test. (#22)

b0d08df

* Enabled TEGroupedMLP test.
* Revert "Enabled TEGroupedMLP test." This reverts commit 93fb204.
* Enabled TEGroupedMLP test to run on ROCm.

wangye805 and others added 6 commits November 21, 2024 21:17

[ROCm] remove extra_state in state dict for TE DPA (#14)

190213a

changing repo from rocm/megatron-lm to rocm/megatron-lm-private (#28)

3b50a40

Update train_llama3.sh (#30)

8b5551e

* Update train_llama3.sh
* Update train_llama3.sh

Update train_llama2.sh (#31)

23b9ff1

* Update train_llama2.sh
* Update train_llama2.sh

Update readme.md (#29)

0b9998a

update network interface

20292f5
