Embedding #1209

Open
wants to merge 51 commits into
base: 24.09-beta
63be779
ADLR/megatron-lm!2033 - Online eval
trintamaki Sep 19, 2024
69d4c44
ADLR/megatron-lm!1973 - MMMU multi-image support
trintamaki Sep 19, 2024
7754f56
ADLR/megatron-lm!2113 - build: Use multi-stage for parallel builds
ko3n1g Sep 20, 2024
121d05e
ADLR/megatron-lm!2126 - Only print warning when relevant
deepakn94 Sep 21, 2024
3a0ca4b
ADLR/megatron-lm!2124 - tests: Fix location of megatron
ko3n1g Sep 21, 2024
ca219ed
ADLR/megatron-lm!2127 - ci: Bump sha
ko3n1g Sep 21, 2024
1899bb7
ADLR/megatron-lm!2128 - ci: Improve cherry pick workflow
ko3n1g Sep 22, 2024
26f5c32
ADLR/megatron-lm!2034 - ci: Introduce JET Python SDK
ko3n1g Sep 22, 2024
d626aeb
ADLR/megatron-lm!2130 - ci: Improve cherry pick MR description
ko3n1g Sep 22, 2024
ea83faa
ADLR/megatron-lm!2119 - Huvu/t5 te10 fix nemoci pr482
huvunvidia Sep 23, 2024
ae1bffb
ADLR/megatron-lm!2134 - ci: Set author and milestone for cherry-picks
ko3n1g Sep 23, 2024
baad0ad
ADLR/megatron-lm!2135 - ci: Send alerts on unit-tests-extended
ko3n1g Sep 23, 2024
460c6a9
ADLR/megatron-lm!2133 - tests: Minor improvements to JET
ko3n1g Sep 23, 2024
a0799f4
ADLR/megatron-lm!2136 - tests: Fix GPT test
ko3n1g Sep 23, 2024
4c3e06a
ADLR/megatron-lm!2139 - ci: Fix cherry-pick strings
ko3n1g Sep 23, 2024
71b2aa1
ADLR/megatron-lm!2110 - Use torch dataloader in multimodal evaluation
trintamaki Sep 23, 2024
3ab6da9
ADLR/megatron-lm!2137 - ci: Enable dev container for new features
ko3n1g Sep 23, 2024
ff89e91
ADLR/megatron-lm!2005 - Fix performance regression brought by torch.b…
xxuwenc Sep 24, 2024
6543004
ADLR/megatron-lm!2073 - Multimodal batched bug fix
trintamaki Sep 24, 2024
29793cf
ADLR/megatron-lm!1581 - Add MLA support into MCore
BoxiangW Sep 24, 2024
754e0f0
ADLR/megatron-lm!1995 - Add freeze options to pretrain_vlm
trintamaki Sep 25, 2024
10350b6
ADLR/megatron-lm!2145 - Improve logging when decreasing batch size
deepakn94 Sep 25, 2024
f54686a
ADLR/megatron-lm!2148 - Add model.eval() to run_text_generation_serve…
mathemakitten Sep 25, 2024
b301e5f
ADLR/megatron-lm!2111 - Mcore llama3.1 support
jon-barker Sep 26, 2024
dcdf804
ADLR/megatron-lm!2151 - ci: Run experimental UTs on dev image
ko3n1g Sep 26, 2024
32b395e
ADLR/megatron-lm!1953 - Mcore export to export models to TRTLLM (GPU …
shanmugamr1992 Sep 26, 2024
8fbb30c
ADLR/megatron-lm!2154 - ci: Prune docker cache of `mcore-docker-node-…
ko3n1g Sep 26, 2024
9f06f06
ADLR/megatron-lm!2155 - Resolve release test failure caused by Groupe…
xxuwenc Sep 26, 2024
015bffc
ADLR/megatron-lm!2156 - tests: Set better name for Wandb logging
ko3n1g Sep 26, 2024
34f0f98
ADLR/megatron-lm!1950 - Remove pkg_resources package
ksivaman Sep 27, 2024
8103c4c
ADLR/megatron-lm!2142 - ci: Onboard CW
ko3n1g Sep 27, 2024
30aafee
ADLR/megatron-lm!2158 - Small changes to export
shanmugamr1992 Sep 28, 2024
30445f8
ADLR/megatron-lm!2152 - Fix rope backward compatibility
BoxiangW Sep 30, 2024
c6a0ec8
ADLR/megatron-lm!2140 - [Bug fix] Don't trace graphs during inference
jiemingz Oct 1, 2024
d90956c
ADLR/megatron-lm!2109 - Adding more MR tests for T5 (e.g., transforme…
huvunvidia Oct 1, 2024
77f62d8
ADLR/megatron-lm!2164 - ci: Download artifacts
ko3n1g Oct 1, 2024
cf25e49
ADLR/megatron-lm!2165 - ci: Bump version
ko3n1g Oct 2, 2024
a792575
ADLR/megatron-lm!2153 - Add the interface to set TP communication boo…
erhoo82 Oct 3, 2024
74d4b1a
ADLR/megatron-lm!2095 - Add support for SigLIP vision encoder to mult…
Oct 3, 2024
d819b9c
ADLR/megatron-lm!2175 - adding cu_seqlens_padded support in MCore
Oct 4, 2024
4ded7ce
ADLR/megatron-lm!2181 - Fixing attention mask dimenions to support TE…
shanmugamr1992 Oct 4, 2024
c302697
ADLR/megatron-lm!2180 - rotary_scaling fix for llama3.1 and 3.2
yueshen2016 Oct 4, 2024
7619780
ADLR/megatron-lm!2185 - chore: Improve generator for launch scripts
ko3n1g Oct 4, 2024
52699ce
ADLR/megatron-lm!2160 - Adding Inference pipeline for T5
huvunvidia Oct 5, 2024
691b323
ADLR/megatron-lm!2182 - ci: Group runs by model
ko3n1g Oct 5, 2024
ce67659
ADLR/megatron-lm!1862 - Cpu init te
wdykas Oct 5, 2024
73ef715
ADLR/megatron-lm!2186 - ci: Run script after export
ko3n1g Oct 5, 2024
329c1d7
ADLR/megatron-lm!2089 - Fix upcycling issues.
RayWang96 Oct 7, 2024
99f63e8
ADLR/megatron-lm!2189 - tests: Fix ENV export
ko3n1g Oct 7, 2024
fd20cda
rebase to 24.09
Oct 8, 2024
93dab11
Merge branch 'embedding' of https://github.com/rachitgarg91/Megatron-…
Oct 8, 2024
50 changes: 25 additions & 25 deletions .gitlab-ci.yml
@@ -13,22 +13,28 @@ workflow:
FUNCTIONAL_TEST: "no"
- if: $CI_MERGE_REQUEST_LABELS =~ /Run tests/ && $CI_MERGE_REQUEST_TARGET_BRANCH_SHA != ""
variables:
FUNCTIONAL_TEST: "yes"
FUNCTIONAL_TEST_SCOPE: mr
UNIT_TEST_REPEAT: 5
UNIT_TEST_TIMEOUT: 50
FUNCTIONAL_TEST: "yes"
FUNCTIONAL_TEST_SCOPE: mr
FUNCTIONAL_TEST_CLUSTER_A100: ""
FUNCTIONAL_TEST_CLUSTER_H100: ""
- if: $CI_MERGE_REQUEST_LABELS =~ /Run nightly/ && $CI_MERGE_REQUEST_TARGET_BRANCH_SHA != ""
variables:
FUNCTIONAL_TEST: "yes"
FUNCTIONAL_TEST_SCOPE: nightly
UNIT_TEST_REPEAT: 5
UNIT_TEST_TIMEOUT: 50
FUNCTIONAL_TEST: "yes"
FUNCTIONAL_TEST_SCOPE: nightly
FUNCTIONAL_TEST_CLUSTER_A100: ""
FUNCTIONAL_TEST_CLUSTER_H100: ""
- if: $CI_MERGE_REQUEST_LABELS =~ /Run weekly/ && $CI_MERGE_REQUEST_TARGET_BRANCH_SHA != ""
variables:
FUNCTIONAL_TEST: "yes"
FUNCTIONAL_TEST_SCOPE: weekly
UNIT_TEST_REPEAT: 5
UNIT_TEST_TIMEOUT: 50
FUNCTIONAL_TEST: "yes"
FUNCTIONAL_TEST_SCOPE: weekly
FUNCTIONAL_TEST_CLUSTER_A100: ""
FUNCTIONAL_TEST_CLUSTER_H100: ""
- if: $CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_SHA != ""
variables:
FUNCTIONAL_TEST: "no"
@@ -58,29 +64,23 @@ variables:
- "mr"
- "nightly"
- "weekly"
- "pre-release"
- "release"
description: "Testsuite to run (only for FUNCTIONAL_TEST=yes)"
FUNCTIONAL_TEST_CLUSTER:
FUNCTIONAL_TEST_CLUSTER_A100:
value: "dgxa100_dracooci"
options:
- "dgxa100_dracooci"
- "dgxa100_dracooci-ord"
- "dgxh100_eos"
description: '"dgxa100_dracooci" for OCI-IAD, "dgxh100_eos" for EOS'
CONVERGENCE_TEST:
value: "no"
description: 'Cluster for A100 workloads'
FUNCTIONAL_TEST_CLUSTER_H100:
value: "dgxh100_eos"
options:
- "yes"
- "no"
description: To run a convergence test
CONVERGENCE_TEST_SCOPE:
value: "release"
options:
- "release"
- "pre-release"
description: "Test suite to run (only for CONVERGENCE_TEST=yes)"
CONVERGENCE_TEST_RUN_NAME:
value: "pre-release-$$CI_PIPELINE_ID"
description: "Run directory of convergence test"
- "dgxh100_coreweave"
- "dgxh100_eos"
description: 'Cluster for H100 workloads'
FUNCTIONAL_TEST_NAME:
description: "Name of functional test run (only for pre-release and release)"
PUBLISH:
value: "no"
options:
@@ -96,6 +96,7 @@ variables:

# CI wide variables
CI_MCORE_IMAGE: ${GITLAB_ENDPOINT}:5005/adlr/megatron-lm/mcore_ci
CI_MCORE_DEV_IMAGE: ${GITLAB_ENDPOINT}:5005/adlr/megatron-lm/mcore_ci_dev
CI_NEMO_IMAGE: ${GITLAB_ENDPOINT}:5005/adlr/megatron-lm/nemo_ci
LINTING_IMAGE: ${GITLAB_ENDPOINT}:5005/adlr/megatron-lm/mcore_linting
UNIT_TEST_TIMEOUT: 15
@@ -105,5 +106,4 @@ include:
- .gitlab/stages/00.pre.yml
- .gitlab/stages/01.tests.yml
- .gitlab/stages/02.functional-tests.yml
- .gitlab/stages/03.convergence-tests.yml
- .gitlab/stages/04.publish.yml
- .gitlab/stages/03.publish.yml
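The three label rules in the workflow hunk above differ only in the scope they select. A minimal sketch of that mapping follows, assuming GitLab's ordered rule evaluation (the first matching rule wins) and treating each label regex as a plain substring match. `resolve_scope` is a hypothetical helper for illustration, not part of this PR, and the `none` fallback stands in for rules that lie outside the visible hunk.

```shell
# Hypothetical helper (not part of this PR): pick the functional test scope
# from a comma-separated MR label string. `case` tries patterns top to bottom
# and stops at the first match, like the ordered workflow rules.
resolve_scope() {
    case "$1" in
        *"Run tests"*)   echo "mr" ;;
        *"Run nightly"*) echo "nightly" ;;
        *"Run weekly"*)  echo "weekly" ;;
        *)               echo "none" ;;   # assumption: no label rule matched
    esac
}

resolve_scope "bug,Run nightly"   # prints: nightly
```

Because evaluation stops at the first match, an MR carrying both `Run tests` and `Run weekly` would resolve to the `mr` scope in this sketch.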
19 changes: 14 additions & 5 deletions .gitlab/stages/00.pre.yml
@@ -76,9 +76,10 @@ clean_docker_node:
matrix:
- node: 8xL40S
- node: mcore-docker-node-small
- node: mcore-docker-node-jet
script:
- export DOCKER_HOST='unix:///var/run/docker.sock'
- docker system prune -a --filter "until=48h" -f || true
- docker system prune -a --filter "until=36h" -f || true

maybe_cherry_pick_commit:
rules:
@@ -101,8 +102,13 @@ maybe_cherry_pick_commit:
- git config --global user.email "[email protected]"
- git config --global user.name "Mcore Bot"
- |
LABELS=$(curl --header "PRIVATE-TOKEN: ${PROJECT_ACCESS_TOKEN_MCORE}" --url "https://${GITLAB_ENDPOINT}/api/v4/projects/${CI_PROJECT_ID}/merge_requests/${MR_ID}" | jq '.labels | join(",")' | tr -d '"')

MR=$(curl --header "PRIVATE-TOKEN: ${PROJECT_ACCESS_TOKEN_MCORE}" --url "https://${GITLAB_ENDPOINT}/api/v4/projects/${CI_PROJECT_ID}/merge_requests/${MR_ID}")

LABELS=$(echo -E $MR | jq '.labels | join(",")' | tr -d '"')
AUTHOR_ID=$(echo -E $MR | jq '.author.id' | tr -d '"')
AUTHOR_NAME=$(echo -E $MR | jq '.author.username' | tr -d '"')
TITLE=$(echo -E $MR | jq '.title' | tr -d '"')
MILESTONE_ID=$(echo -E $MR | jq '.milestone.id' | tr -d '"')
TARGET_BRANCHES=$(echo "$LABELS" | grep -o 'core_[^,]*')

if [[ $TARGET_BRANCHES == "" ]]; then
@@ -134,8 +140,11 @@ maybe_cherry_pick_commit:
--url https://${GITLAB_ENDPOINT}/api/v4/projects/${CI_PROJECT_ID}/merge_requests \
-d "source_branch=cherry-pick-$MR_ID-$RELEASE_BRANCH" \
-d "target_branch=$RELEASE_BRANCH" \
-d "title=Cherry-pick $MR_ID into $RELEASE_BRANCH" \
-d "labels=cherry-pick"
-d "title=Cherry pick \`$TITLE ($MR_ID)\` into \`$RELEASE_BRANCH\`" \
-d "labels=cherry-pick" \
-d "reviewer_ids=$AUTHOR_ID" \
-d "milestone_id=$MILESTONE_ID" \
-d "description=[🤖]: Hi @$AUTHOR_NAME 👋,<br><br>we've cherry picked \`$TITLE ($MR_ID)\` into \`$RELEASE_BRANCH\` for you! 🚀<br><br>Please review and approve this cherry pick by your convenience\!"

else
URL=https://${GITLAB_ENDPOINT}/ADLR/megatron-lm/-/merge_requests/$MR_ID
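The cherry-pick job above derives its target branches from the MR labels and builds the new MR title from the fetched fields. The two steps can be sketched in isolation as follows. The function names `target_branches` and `cherry_pick_title` are hypothetical and exist only for this illustration; the `grep -o` pattern and the title format are taken from the diff, and the sample label string is made up.

```shell
# Hypothetical helper: list the release-branch labels (those containing
# "core_...") from a comma-separated label string, one per line. Uses the
# same grep pattern as the job.
target_branches() {
    printf '%s\n' "$1" | grep -o 'core_[^,]*'
}

# Hypothetical helper: compose the cherry-pick MR title in the format the
# updated job sends to the merge request API.
cherry_pick_title() {
    title="$1"; mr_id="$2"; release_branch="$3"
    printf 'Cherry pick `%s (%s)` into `%s`\n' "$title" "$mr_id" "$release_branch"
}

# Example with invented labels: one title per matching release branch.
for branch in $(target_branches "bug,core_r0.8.0,Run tests,core_r0.9.0"); do
    cherry_pick_title "Fix rope backward compatibility" 2152 "$branch"
done
```

Note that `grep -o` matches substrings, so any label that merely contains `core_` would also be picked up; whether that matters depends on the label set in use.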
119 changes: 73 additions & 46 deletions .gitlab/stages/01.tests.yml
@@ -21,6 +21,10 @@ build_image:
FILE: Dockerfile.ci
BASE_IMAGE: nvcr.io/nvidia/pytorch:24.01-py3
TAG: mcore-docker-node-large
- IMAGE: CI_MCORE_DEV_IMAGE
FILE: Dockerfile.ci.dev
BASE_IMAGE: nvcr.io/nvidia/pytorch:24.01-py3
TAG: mcore-docker-node-large
- IMAGE: CI_NEMO_IMAGE
FILE: Dockerfile.ci
BASE_IMAGE: nvcr.io/nvidian/nemo:nightly
@@ -35,48 +39,44 @@ build_image:
variables:
STAGE: main
script:
- apk add bash
- |
set -x
env
eval "IMAGE=\$$IMAGE"

docker system prune -a --filter "until=24h" -f || true

if [[ "$CI_COMMIT_BRANCH" == "$CI_DEFAULT_BRANCH" ]]; then
ADDITIONAL_PARAMS="--pull"
fi

docker pull ${IMAGE}:${CI_PIPELINE_ID} || true
docker pull ${IMAGE}:${CI_MERGE_REQUEST_IID:-noop} || true
docker pull ${IMAGE}:buildcache || true

docker build \
--secret id=JET_INDEX_URLS \
--target $STAGE \
-f $FILE \
-t ${IMAGE}:${CI_PIPELINE_ID} \
-t ${IMAGE}:${CI_MERGE_REQUEST_IID:-noop} \
--build-arg CACHEBUST=$(cat /proc/sys/kernel/random/uuid) \
--cache-to type=inline \
--cache-from type=registry,ref=${IMAGE}:buildcache \
--cache-from type=registry,ref=${IMAGE}:${CI_PIPELINE_ID} \
--cache-from type=registry,ref=${IMAGE}:${CI_MERGE_REQUEST_IID:-noop} \
--build-arg FROM_IMAGE_NAME=$BASE_IMAGE \
${ADDITIONAL_PARAMS} .

docker push ${IMAGE}:${CI_PIPELINE_ID}
docker push ${IMAGE}:${CI_MERGE_REQUEST_IID:-noop}

if [[ "$CI_COMMIT_BRANCH" == "ci-nightly-a100" ]]; then
docker tag ${IMAGE}:${CI_PIPELINE_ID} ${IMAGE}:nightly
docker push ${IMAGE}:nightly
fi
bash -c '
set -x
env
eval "IMAGE=\$$IMAGE"

docker system prune -a --filter "until=24h" -f || true

docker buildx create --name container --driver=docker-container

ADDITIONAL_PARAMS=()

if [[ "$CI_COMMIT_BRANCH" == "$CI_DEFAULT_BRANCH" ]]; then
ADDITIONAL_PARAMS+=("--pull")
ADDITIONAL_PARAMS+=("--cache-to type=registry,ref=${IMAGE}-buildcache:main")
fi

if [[ "$CI_COMMIT_BRANCH" == "$CI_DEFAULT_BRANCH" ]]; then
docker tag ${IMAGE}:${CI_PIPELINE_ID} ${IMAGE}:buildcache
docker push ${IMAGE}:buildcache
fi
if [[ "$CI_COMMIT_BRANCH" == "ci-nightly-a100" ]]; then
ADDITIONAL_PARAMS+=("-t ${IMAGE}:nightly")
fi

DOCKER_BUILDKIT=1 docker build \
--secret id=JET_INDEX_URLS \
--target $STAGE \
-f $FILE \
-t ${IMAGE}:${CI_PIPELINE_ID} \
--builder=container \
--build-arg CACHEBUST=$(cat /proc/sys/kernel/random/uuid) \
--cache-to type=registry,ref=${IMAGE}-buildcache:${CI_PIPELINE_ID} \
--cache-to type=registry,ref=${IMAGE}-buildcache:${CI_MERGE_REQUEST_IID:-noop} \
--cache-from type=registry,ref=${IMAGE}-buildcache:main \
--cache-from type=registry,ref=${IMAGE}-buildcache:${CI_PIPELINE_ID} \
--cache-from type=registry,ref=${IMAGE}-buildcache:${CI_MERGE_REQUEST_IID:-noop} \
--build-arg FROM_IMAGE_NAME=$BASE_IMAGE \
--push \
${ADDITIONAL_PARAMS[@]} .
'
retry:
max: 2

@@ -85,13 +85,17 @@ unit_tests:
# the current code. This is a form of backwards compatibility testing
# and helps in providing stable interfaces.
extends: [.test_mr_rules]
image: ${CI_MCORE_IMAGE}:${CI_PIPELINE_ID}
image: ${IMAGE}:${CI_PIPELINE_ID}
needs: [build_image]
timeout: 180m
parallel:
matrix:
- TAG: latest
- TAG: 8fc755388a03bae05cb740857008b8916e01a63c
IMAGE: ${CI_MCORE_IMAGE}
# - TAG: latest
# IMAGE: ${CI_MCORE_DEV_IMAGE}
- TAG: core_r0.9.0
IMAGE: ${CI_MCORE_IMAGE}
tags: [8xL40S]
variables:
GIT_STRATEGY: clone
@@ -112,11 +116,14 @@ unit_tests:

for i in $(seq $UNIT_TEST_REPEAT); do
SEED=$((RANDOM % 9000 + 1000));
SKIPPED=()
ARGS=()
if [[ $TAG != latest ]]; then
SKIPPED+=(-m "not internal")
ARGS+=(-m "not internal")
fi
if [[ $IMAGE == ${CI_MCORE_DEV_IMAGE} ]]; then
ARGS+=(-m "experimental")
fi
timeout ${UNIT_TEST_TIMEOUT}m torchrun --nproc_per_node=8 -m pytest --random-order --random-order-seed ${SEED} -xvs --cov-report=term --cov-report=html --cov=megatron/core --no-cov-on-fail "${SKIPPED[@]}" tests/unit_tests
timeout ${UNIT_TEST_TIMEOUT}m torchrun --nproc_per_node=8 -m pytest --random-order --random-order-seed ${SEED} -xvs --cov-report=term --cov-report=html --cov=megatron/core --no-cov-on-fail "${ARGS[@]}" tests/unit_tests
done
artifacts:
paths:
@@ -125,10 +132,30 @@ unit_tests:
- if: $CI_PIPELINE_SOURCE == 'merge_request_event' && $CI_MERGE_REQUEST_TARGET_BRANCH_PROTECTED != "true"
allow_failure: true
when: always
- if: '$TAG != "latest"'
allow_failure: true
- when: always

unit-tests-results-notify:
extends: [.test_mr_rules]
image: ${CI_MCORE_IMAGE}:${CI_PIPELINE_ID}
needs: [unit_tests]
tags:
- mcore-docker-node-small
script:
- env
- export WEBHOOK_URL=${MCORE_NOTIFICATION_HOOK}
- export RO_API_TOKEN=${PROJECT_ACCESS_TOKEN_MCORE}
- export GITLAB_ENDPOINT
- export DATE=$(date +"%Y-%m-%d")
- bash tests/functional_tests/shell_test_utils/notify_unit_tests.sh ${CI_PIPELINE_ID}
artifacts:
when: always
paths:
- scripts
rules:
- if: $CI_PIPELINE_SOURCE == "schedule" && $CI_COMMIT_BRANCH == "ci-unit-test-extended"
when: always
- when: never

docs_build_test:
extends: [.test_mr_rules]
image: ${CI_MCORE_IMAGE}:${CI_PIPELINE_ID}
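The `unit_tests` job in this file now builds its extra pytest arguments from two conditions: the checked-out `TAG` and whether the job runs on the dev image. A rough sketch of that selection logic is shown here. `build_pytest_args` is a hypothetical name, and unlike the job (which appends to a bash array) this version prints one argument per line so that it stays portable across shells.

```shell
# Hypothetical helper mirroring the ARGS logic of the unit_tests job.
# $1 = TAG, $2 = image used by the job, $3 = dev image name.
build_pytest_args() {
    tag="$1"; image="$2"; dev_image="$3"
    if [ "$tag" != "latest" ]; then
        # Older tags skip tests marked as internal.
        printf '%s\n' "-m" "not internal"
    fi
    if [ "$image" = "$dev_image" ]; then
        # The dev image selects tests marked as experimental.
        printf '%s\n' "-m" "experimental"
    fi
}

# Example with placeholder image names.
build_pytest_args "core_r0.9.0" "mcore_ci" "mcore_ci_dev"
```

With the matrix shown in the diff, the two conditions do not overlap. If they ever did, two `-m` options would be passed; to my understanding pytest keeps only the last `-m` expression, so a single combined expression may be the safer choice in that case. This is an assumption worth verifying against the pytest version in the image.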