
@Deepam02 Deepam02 commented Dec 4, 2025

What this PR does / why we need it:

This PR optimizes the CI pipeline by extracting Docker image building into a centralized reusable workflow. Previously, each E2E test job built images independently, resulting in redundant builds and wasted resources.

Changes:

  • Created images.yaml as a reusable workflow that builds all Volcano Docker images once
  • Saves built images as artifacts
  • Updated all 10 E2E workflows to download and load pre-built images instead of rebuilding
  • Uses standard docker save/docker load commands for artifact handling

Benefits:

  • Eliminates redundant image builds across parallel jobs
  • Significantly reduces CI execution time and resource usage
  • Faster feedback loop when retriggering failed pipelines
  • Images are built once and reused across all test jobs

Which issue(s) this PR fixes:

Fixes #4766

Special notes for your reviewer:

The implementation follows GitHub's reusable workflow pattern. All E2E workflows now have a needs: build-images dependency and download artifacts before running tests. The existing test infrastructure remains unchanged - only the image provisioning mechanism has been optimized.
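
A minimal sketch of what such a reusable build workflow could look like. Only the artifact name `volcano-images`, the `images` path, the 5-day retention, and the four image names come from this PR's diff context. The job layout, the action versions, the `IMAGE_PREFIX` value, and the tar file names are assumptions for illustration and may differ from the actual images.yaml.

```yaml
# .github/workflows/images.yaml (illustrative sketch, not the file from this PR)
name: Build images

on:
  workflow_call:   # allows other workflows to invoke this one as a job

jobs:
  build-images:
    runs-on: ubuntu-latest
    env:
      # Assumption: must match the image prefix the Makefile actually uses.
      IMAGE_PREFIX: volcanosh
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Build images
        run: |
          # Tag with the commit SHA so consumers can find matching images.
          TAG="$(git rev-parse --verify HEAD)"
          echo "TAG=${TAG}" >> "$GITHUB_ENV"
          # Assumption: the images target leaves the result in the local Docker daemon.
          make images TAG="${TAG}"

      - name: Save images to tar files
        run: |
          mkdir -p images
          for name in controller-manager scheduler webhook-manager agent; do
            docker save -o "images/vc-${name}.tar" "${IMAGE_PREFIX}/vc-${name}:${TAG}"
          done

      - name: Upload images
        uses: actions/upload-artifact@v4
        with:
          name: volcano-images
          path: images
          retention-days: 5
```

Artifacts uploaded this way are visible to later jobs in the same workflow run, which is what lets the test jobs download them instead of rebuilding.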

Copilot AI review requested due to automatic review settings December 4, 2025 19:31
@gemini-code-assist
Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

@volcano-sh-bot
Contributor

Welcome @Deepam02! It looks like this is your first PR to volcano-sh/volcano 🎉

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 4, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes the CI pipeline by introducing a centralized, reusable workflow for Docker image building, eliminating redundant image builds across parallel E2E test jobs. The optimization reduces CI execution time and resource consumption by building all Volcano images once and distributing them as artifacts.

Key Changes:

  • Created images.yaml reusable workflow that builds all four Volcano Docker images once and uploads them as artifacts with 5-day retention
  • Updated 10 E2E workflow files to depend on the centralized build job and load pre-built images instead of rebuilding
  • Modified test execution commands to use the common hack/run-e2e-kind.sh script with E2E_TYPE parameters

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| .github/workflows/images.yaml | New reusable workflow that builds and saves all Volcano Docker images as artifacts |
| .github/workflows/e2e_vcctl.yaml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh |
| .github/workflows/e2e_spark.yaml | Added build-images dependency, downloads artifacts, modified image loading for minikube environment |
| .github/workflows/e2e_sequence.yaml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh |
| .github/workflows/e2e_scheduling_basic.yaml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh |
| .github/workflows/e2e_scheduling_actions.yaml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh |
| .github/workflows/e2e_parallel_jobs.yaml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh |
| .github/workflows/e2e_hypernode.yaml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh |
| .github/workflows/e2e_dra.yml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh with DRA feature gate |
| .github/workflows/e2e_cronjob.yaml | Added build-images dependency, downloads artifacts, and updated test execution to use run-e2e-kind.sh |
| .github/workflows/e2e_admission.yaml | Added build-images dependency and downloads artifacts for both admission policy and webhook test jobs |

Comment on lines 103 to 106
# Use git SHA as TAG to match pre-built images
export TAG=$(git rev-parse --verify HEAD)
make TAG=${TAG} update-development-yaml

Copilot AI Dec 4, 2025

[nitpick] The TAG is being set here using git rev-parse --verify HEAD, but the workflow still has an earlier step (lines 39-48) that sets TAG from .release-version or defaults to 'latest'. While that earlier step's output is now unused (overridden here), consider removing that obsolete step in a future cleanup to avoid confusion about where TAG comes from.
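
One way to keep a single source for TAG is to compute it once and export it through `$GITHUB_ENV`, so every later step reads the same value. The sketch below is only an illustration of that idea: the workflow and job names are hypothetical, and it assumes the earlier `.release-version` step has been removed.

```yaml
# Illustrative sketch: derive TAG once from the commit SHA and reuse it
name: TAG example

on:
  workflow_call:

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set TAG from the commit SHA
        # Values written to GITHUB_ENV are visible to all later steps in this job.
        run: echo "TAG=$(git rev-parse --verify HEAD)" >> "$GITHUB_ENV"

      - name: Update development YAML
        run: make TAG="${TAG}" update-development-yaml
```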

@JesseStutler
Member

/cc @hajnalmt

Contributor

@hajnalmt hajnalmt left a comment

Thank you for the contribution @Deepam02 !

Please find my comments below, and let's see this in action once you have addressed them 😀
Also, please squash your commits; there is a merge commit at the end of your commit history. Rebase your local branch on master instead of merging master into it.

Thank you once more!

with:
  name: volcano-images
  path: images
  retention-days: 5
Contributor

I was wondering if this is maybe not enough, but let's stay with 5, you are probably right.

Author

Deepam02 commented Dec 8, 2025

Hi @hajnalmt
I have made all the suggested changes, squashed the commits, and rebased.
Sorry for the delay in getting this done, please review it again when you get time.

Contributor

@hajnalmt hajnalmt left a comment

/ok-to-test

Thank you for the change!
I still have some minor comments. Please resolve them too, but I think we can start the testing.

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Dec 8, 2025
Contributor

@hajnalmt hajnalmt left a comment

/hold

@Deepam02
As the CI started to run, I realized that the current structure is not enough, because every workflow now triggers its own image build.

What we want to achieve is that every workflow uses the output artifact of the images workflow. This means we need a separate e2e_test workflow that calls each of the current e2e workflows (in parallel) after the images workflow.

This also means turning the current e2e workflows into reusable ones,
so we need to change the on part to:

on:
  workflow_call:

instead of:

on:
  push:
    branches:
      - master
    tags:
  pull_request:

The new e2e_test workflow should have this trigger instead.

I hope it's not too big a change. I completely missed this in the original specification.
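
A rough sketch of the orchestrating workflow described above, assuming the images workflow and the e2e workflows all declare `on: workflow_call`. The job names are hypothetical, only two of the e2e workflows are shown, and the `tags:` filter from the original trigger is left out to keep the example short.

```yaml
# .github/workflows/e2e_test.yaml (illustrative sketch; file name taken from the comment above)
name: E2E tests

on:
  push:
    branches:
      - master
  pull_request:

jobs:
  build-images:
    # Runs the reusable images workflow once per workflow run.
    uses: ./.github/workflows/images.yaml

  e2e-scheduling-basic:
    needs: build-images   # waits for the image artifact
    uses: ./.github/workflows/e2e_scheduling_basic.yaml

  e2e-vcctl:
    needs: build-images   # starts in parallel with the other e2e jobs
    uses: ./.github/workflows/e2e_vcctl.yaml
```

Jobs that only share the `needs: build-images` dependency start in parallel once the build job finishes, which matches the "images first, then all e2e workflows" ordering.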

@volcano-sh-bot volcano-sh-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 8, 2025
Author

Deepam02 commented Dec 8, 2025

@hajnalmt

I've restructured the workflows as suggested:

  • Created a new e2e.yaml orchestrator that calls images.yaml first, then triggers all E2E workflows in parallel

CI should be green now. Please review when you have a chance.

Contributor

hajnalmt commented Dec 8, 2025

/unhold

Thank you! Let's see :)

@volcano-sh-bot volcano-sh-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 8, 2025
Contributor

@hajnalmt hajnalmt left a comment

/test all

@volcano-sh-bot
Contributor

@hajnalmt: No presubmit jobs available for volcano-sh/volcano@master

Contributor

hajnalmt commented Dec 8, 2025

/retest

Contributor

hajnalmt commented Dec 8, 2025

/ok-to-test cancel

Contributor

hajnalmt commented Dec 8, 2025

/cc @JesseStutler

Can you trigger testing somehow on this one? We probably need to remove the ok-to-test label and readd it since the workflow structure changed.

Contributor

hajnalmt commented Dec 8, 2025

/rerun-all

@JesseStutler
Member

> /cc @JesseStutler
>
> Can you trigger testing somehow on this one? We probably need to remove the ok-to-test label and readd it since the workflow structure changed.

The /ok-to-test label is enough.

with:
  go-version: 1.24.x

- name: Install musl
Member

So we don't need musl at the moment? @Deepam02 @hajnalmt

Contributor

We are currently not using musl at all.

The Dockerfiles do not take the CC variable as a build argument, so the build uses gcc (see the scheduler Dockerfile).

It is not even passed to the images target (see the images target).
I guess it was just blindly copied each time a new CI job was introduced.

It was probably added everywhere when Monokaix introduced the custom plugins:
https://github.com/volcano-sh/volcano/tree/master/example/custom-plugin
but it's hard to tell the reason behind it now.

Member

Yes, so should we keep musl if we don't know the reason?

Contributor

Since we are not using it, I think we should delete the step. It just takes up space and CI time for no reason.

- master
tags:
pull_request:
workflow_call:
Member

Sorry, I'm not so familiar with workflow_call. All the CIs can still be triggered through e2e.yaml, right?

Contributor

Yes, hopefully. If we retrigger with retest, it should also rerun only the failed tests.
But some testing probably needs to be done to be sure.

Member

[image] But I found that we still have these old pending checks waiting to be reported, even though we have already executed these CIs. We should not execute them twice, right?

Contributor

@hajnalmt hajnalmt Dec 9, 2025

This is there because the master branch triggers them. It should disappear I think after we manage to merge this PR.

@JesseStutler
Member

It seems that all the CIs have been triggered, but some failed, such as e2e-spark and e2e-vcctl, and some of the failed CIs can't even load the images @Deepam02. But why do we still have pending checks here?
[image]
Since these CIs have already run, we shouldn't need to execute the old independent CIs again, right?

@JesseStutler
Member

What's more, you still build all the images in each workflow:
[image]
What we want is to build the images only once and load them into the kind clusters in each workflow. @Deepam02
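
A minimal sketch of the "build once, load everywhere" consumer side, for illustration only. It assumes the artifact is named `volcano-images` and contains one tar file per image, that images are tagged with the commit SHA under a `volcanosh` prefix, and that kind is installed with a cluster named `integration` already running. The workflow name, cluster name, and prefix are hypothetical.

```yaml
# Illustrative reusable e2e workflow that loads pre-built images instead of building them
name: E2E Scheduling Basic

on:
  workflow_call:

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        # Needed for the test sources only; no image build happens in this job.
        uses: actions/checkout@v4

      - name: Download pre-built images
        uses: actions/download-artifact@v4
        with:
          name: volcano-images
          path: images

      - name: Load images into Docker
        run: |
          for tarball in images/*.tar; do
            docker load -i "${tarball}"
          done

      - name: Load images into the kind cluster
        run: |
          # Assumption: a kind cluster named "integration" was created in an earlier step.
          TAG="$(git rev-parse --verify HEAD)"
          for name in controller-manager scheduler webhook-manager agent; do
            kind load docker-image "volcanosh/vc-${name}:${TAG}" --name integration
          done
```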

Contributor

@hajnalmt hajnalmt left a comment

Currently the images are rebuilt because we check out the code again in the e2e_* workflows. Basically we are checking everything out twice, so all the timestamps differ from the ones in the build_images step. You should move everything to e2e.yaml; this can be done in the invoking job.

tar -xf musl-1.2.1.tar.gz && cd musl-1.2.1
./configure
make && sudo make install
- name: Checkout code
Contributor

Checkout code and Install dependencies can be moved to e2e.yaml too! We don't need them in every e2e test now.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wangyang0616 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


images:
for name in controller-manager scheduler webhook-manager agent; do\
docker buildx build -t "${IMAGE_PREFIX}/vc-$$name:$(TAG)" . -f ./installer/dockerfile/$$name/Dockerfile --output=type=${BUILDX_OUTPUT_TYPE} --platform ${DOCKER_PLATFORMS} --build-arg APK_MIRROR=${APK_MIRROR} --build-arg OPEN_EULER_IMAGE_TAG=${OPEN_EULER_IMAGE_TAG}; \
Contributor

@hajnalmt hajnalmt Dec 9, 2025

Alternatively, to avoid rebuilding the images, you can add a check here that skips the build if the tag already exists.
You can introduce a new variable to force the image build (like FORCE_REBUILD) that is true by default, and invoke the target with false (make images FORCE_REBUILD=false) in the e2e tests, so you stay backwards compatible with the old make target.
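
The suggestion above places the check inside the Makefile target. As an illustration of the same idea, the sketch below performs the "skip if the tag already exists" check in a workflow step instead, so it does not depend on a FORCE_REBUILD variable that does not exist yet. The workflow name, the `volcanosh` prefix, and the SHA-based tag are assumptions.

```yaml
# Illustrative sketch: build images only when they are missing from the local Docker daemon
name: Build images if missing

on:
  workflow_call:

jobs:
  images:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Build images only if missing
        run: |
          TAG="$(git rev-parse --verify HEAD)"
          missing=false
          for name in controller-manager scheduler webhook-manager agent; do
            # docker image inspect exits non-zero when the image is not present locally.
            if ! docker image inspect "volcanosh/vc-${name}:${TAG}" > /dev/null 2>&1; then
              missing=true
            fi
          done
          if [ "${missing}" = "true" ]; then
            make images TAG="${TAG}"
          else
            echo "All images for ${TAG} are already present, skipping the build"
          fi
```

A Makefile-level variant would wrap the `docker buildx build` loop in the same kind of existence check, guarded by the proposed variable.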

@JesseStutler JesseStutler added the area/ci Issues or PRs related to volcano CIs label Dec 10, 2025
Contributor

Should this file name be more explicit, docker_images.yaml for example?

Labels

area/ci Issues or PRs related to volcano CIs ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Improve ci with reusable workflows

5 participants