Commit 9bc50ea

akashveramd and huydhn authored
Implement ciflow/rocm on Torchtitan (#2114)
This PR implements ciflow/rocm on Torchtitan. The changes are part of integration_test_8gpu_features.yaml. The workflow still runs on pull_request (without any PR label) for CUDA. On push to main, on the cron schedule, and when the ciflow/8gpu label is added to a PR, the workflow runs for both CUDA and ROCm.

---------

Co-authored-by: Huy Do <[email protected]>
1 parent 995154f commit 9bc50ea

4 files changed: +88 -27 lines changed

.github/labeler.yml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+"ciflow/8gpu":
+- .ci/docker/**
+- .github/workflows/**
+- scripts/**
+- tests/**
+- torchtitan/**
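
To make the effect of these globs concrete, here is a rough shell sketch of the matching rule. It assumes the labeler applies `ciflow/8gpu` when any changed file matches any listed glob. The function name `wants_8gpu_label` and the example paths are hypothetical, and shell `case` patterns only approximate the labeler's real glob semantics.

```shell
# Hypothetical helper: prints "yes" if any given path falls under one of the
# directories listed for "ciflow/8gpu" in labeler.yml, otherwise "no".
# Assumption: any-file/any-glob matching; `*` in a case pattern also matches
# "/", which stands in for the `**` globs of the config.
wants_8gpu_label() {
  for path in "$@"; do
    case "$path" in
      .ci/docker/*|.github/workflows/*|scripts/*|tests/*|torchtitan/*)
        echo yes
        return 0
        ;;
    esac
  done
  echo no
}

# Example paths are made up for illustration.
wants_8gpu_label "docs/notes.md" "torchtitan/models/model.py"
wants_8gpu_label "README.md"
```

Under these assumptions, a PR touching only top-level docs would not receive the label automatically.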

.github/pytorch-probot.yml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+ciflow_push_tags:
+- ciflow/8gpu
+labeler_config: labeler.yml

.github/workflows/integration_test_8gpu_features.yaml

Lines changed: 3 additions & 27 deletions
@@ -3,6 +3,8 @@ name: 8 GPU Feature Tests
 on:
   push:
     branches: [ main ]
+    tags:
+      - ciflow/8gpu/*
     paths-ignore:
       - 'torchtitan/experiments/**'
   pull_request:
@@ -27,33 +29,7 @@ permissions:
 jobs:
   # Step 1: Dynamically compute the matrix based on conditions
   set-matrix:
-    runs-on: ubuntu-latest
-    outputs:
-      matrix: ${{ steps.set.outputs.matrix }}
-    steps:
-      - id: set
-        run: |
-          # Decide which matrix entries to include based on event type
-          if [[ "${{ github.event_name }}" == "push" && "${{ github.ref }}" == "refs/heads/main" ]] || [[ "${{ github.event_name }}" == "schedule" ]]; then
-            # Include both CUDA and ROCm
-            echo '{"include":[
-              {"name":"cuda","runner":"linux.g5.48xlarge.nvidia.gpu","gpu-arch-type":"cuda","gpu-arch-version":"12.6","docker-image":"torchtitan-ubuntu-20.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/cu126"},
-              {"name":"rocm","runner":"linux.rocm.gpu.gfx942.8","gpu-arch-type":"rocm","gpu-arch-version":"7.0","docker-image":"torchtitan-rocm-ubuntu-22.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/rocm7.0"}
-            ]}' > matrix.json
-          else
-            # Include only CUDA
-            echo '{"include":[
-              {"name":"cuda","runner":"linux.g5.48xlarge.nvidia.gpu","gpu-arch-type":"cuda","gpu-arch-version":"12.6","docker-image":"torchtitan-ubuntu-20.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/cu126"}
-            ]}' > matrix.json
-          fi
-
-          # Export matrix to job outputs
-          {
-            echo 'matrix<<EOF'
-            cat matrix.json
-            echo 'EOF'
-          } >> $GITHUB_OUTPUT
-
+    uses: ./.github/workflows/set-matrix.yaml
 
   # Step 2: Use the dynamic matrix in the build-test job
   build-test:
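
The new `tags` filter means a pushed tag under `ciflow/8gpu/` now starts this workflow alongside pushes to main. The sketch below approximates that filter in shell, assuming GitHub's documented rule that `*` in a ref filter does not match `/`. The helper name `matches_8gpu_tag_filter` and the tag suffix `2114` are illustrative assumptions, not taken from the workflow.

```shell
# Hypothetical helper: prints "true" if a full ref would pass the
# `tags: - ciflow/8gpu/*` push filter, otherwise "false".
# Assumption: `*` matches any characters except "/", so a ref with a
# further path segment after ciflow/8gpu/<x> is rejected first.
matches_8gpu_tag_filter() {
  case "$1" in
    refs/tags/ciflow/8gpu/*/*) echo false ;;
    refs/tags/ciflow/8gpu/*)   echo true ;;
    *)                         echo false ;;
  esac
}

matches_8gpu_tag_filter "refs/tags/ciflow/8gpu/2114"
matches_8gpu_tag_filter "refs/heads/main"
```

Branch refs such as `refs/heads/main` are still covered by the separate `branches` filter, so the two triggers do not overlap.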

.github/workflows/set-matrix.yaml

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+name: Set Matrix
+
+on:
+  workflow_call:
+    outputs:
+      matrix:
+        description: dynamically set matrix
+        value: ${{ jobs.set.outputs.matrix }}
+
+jobs:
+  set:
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set.outputs.matrix }}
+    env:
+      # Event flags evaluated by github actions before the step runs:
+      IS_MAIN_PUSH: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
+      IS_SCHEDULE: ${{ github.event_name == 'schedule' }}
+      IS_8GPU_TAG: ${{ startsWith(github.ref, 'refs/tags/ciflow/8gpu/') }}
+      TRIGGERED_8GPU_LABEL: ${{ github.event_name == 'pull_request' && github.event.action == 'labeled' }}
+
+    steps:
+      - id: set
+        run: |
+          # Define ROCm matrix
+          ROCM_MATRIX='{
+            "name": "rocm",
+            "runner": "linux.rocm.gpu.gfx942.8",
+            "gpu-arch-type": "rocm",
+            "gpu-arch-version": "7.0",
+            "docker-image": "torchtitan-rocm-ubuntu-22.04-clang12",
+            "index-url": "https://download.pytorch.org/whl/nightly/rocm7.0"
+          }'
+
+          # Define CUDA matrix
+          CUDA_MATRIX='{
+            "name": "cuda",
+            "runner": "linux.g5.48xlarge.nvidia.gpu",
+            "gpu-arch-type": "cuda",
+            "gpu-arch-version": "12.6",
+            "docker-image": "torchtitan-ubuntu-20.04-clang12",
+            "index-url": "https://download.pytorch.org/whl/nightly/cu126"
+          }'
+
+          # Use default value as 'false' for unset environment variables
+          IS_MAIN_PUSH="${IS_MAIN_PUSH:-false}"
+          IS_SCHEDULE="${IS_SCHEDULE:-false}"
+          IS_8GPU_TAG="${IS_8GPU_TAG:-false}"
+          TRIGGERED_8GPU_LABEL="${TRIGGERED_8GPU_LABEL:-false}"
+
+          # Decide which matrix entries to include based on event type
+          # Runs ROCm only for push tag OR when PR label gets triggered
+          if [[ "$IS_8GPU_TAG" == "true" || "$TRIGGERED_8GPU_LABEL" == "true" ]]; then
+            cat > matrix.json <<JSON
+          {"include": [$ROCM_MATRIX]}
+          JSON
+
+          # Runs CUDA and ROCm for normal PR (if PR label is present) OR for push to main, cron schedule
+          elif [[ ("$IS_MAIN_PUSH" == "true" || "$IS_SCHEDULE" == "true") ]]; then
+            cat > matrix.json <<JSON
+          {"include": [$CUDA_MATRIX,$ROCM_MATRIX]}
+          JSON
+
+          # Runs CUDA only as default (includes normal PR, if PR label is NOT present)
+          else
+            cat > matrix.json <<JSON
+          {"include": [$CUDA_MATRIX]}
+          JSON
+          fi
+
+          # Export matrix to job outputs
+          {
+            echo 'matrix<<EOF'
+            cat matrix.json
+            echo 'EOF'
+          } >> $GITHUB_OUTPUT
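
For review purposes, the branch order in the step above can be checked locally without running CI. The sketch below mirrors only the if/elif/else structure and prints the selected entry names instead of writing matrix.json. The function name `select_matrix` and its positional-argument interface are assumptions made for illustration; the workflow itself reads the four flags from environment variables.

```shell
# Hypothetical local harness mirroring the decision in set-matrix.yaml.
# Arguments (each "true" or "false", defaulting to "false" when omitted):
#   $1 = IS_MAIN_PUSH, $2 = IS_SCHEDULE, $3 = IS_8GPU_TAG, $4 = TRIGGERED_8GPU_LABEL
select_matrix() {
  if [ "${3:-false}" = "true" ] || [ "${4:-false}" = "true" ]; then
    # ciflow tag push or PR "labeled" event: ROCm only
    echo "rocm"
  elif [ "${1:-false}" = "true" ] || [ "${2:-false}" = "true" ]; then
    # push to main or cron schedule: CUDA and ROCm
    echo "cuda,rocm"
  else
    # everything else, including an ordinary pull_request run: CUDA only
    echo "cuda"
  fi
}

select_matrix false false true false   # ciflow/8gpu tag push
select_matrix true false false false   # push to main
select_matrix false false false false  # plain pull_request
```

Because the tag/label branch is tested first, it wins whenever it is true, regardless of the other flags. Note also that `TRIGGERED_8GPU_LABEL` as written in the workflow is true for any `labeled` pull_request event, since the expression does not inspect the label name.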
