-
Notifications
You must be signed in to change notification settings - Fork 716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Remove the Training Operator V1 Source Code #2389
Open
andreyvelich
wants to merge
16
commits into
kubeflow:master
Choose a base branch
from
andreyvelich:remove-v1-code
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+659
−120,525
Open
Changes from all commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
1008fb6
Remove Training Operator V1 files
andreyvelich 135cf63
Refactor README and Makefile
andreyvelich 18997d6
Remove v1 GitHub workflows
andreyvelich d95e4c7
Remove finish action from Go
andreyvelich df9e872
Rename Kubeflow Training to Kubeflow Trainer
andreyvelich 90b6a72
Rename comment
andreyvelich 0ce4fdd
Add steps for free-up space
andreyvelich 7a7486e
Update Python tests
andreyvelich 7f22e3c
Remove integration workflow
andreyvelich 1a8bb05
Install SDK in Python integration tests
andreyvelich 5f603a5
Remove generate file
andreyvelich 6365757
Add Kubeflow Trainer logo
andreyvelich 6d8d277
Add tech diagram
andreyvelich ea38c98
Exclude images from pre-commit
andreyvelich 0c3ce17
Fix git diff in CI
andreyvelich 1dbb39b
Run go mod tidy
andreyvelich File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,12 @@ | ||
blank_issues_enabled: true | ||
|
||
contact_links: | ||
- name: Training Operator Documentation | ||
url: https://www.kubeflow.org/docs/components/training/ | ||
- name: Kubeflow Trainer Documentation | ||
url: https://www.kubeflow.org/docs/components/trainer/ | ||
about: Much help can be found in the docs | ||
- name: Kubeflow Training Operator Slack Channel | ||
- name: Kubeflow Trainer Slack Channel | ||
url: https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels | ||
about: Ask the Training Operator community on CNCF Slack | ||
- name: Kubeflow Training Operator Community Meeting | ||
about: Ask the Kubeflow Trainer community on CNCF Slack | ||
- name: Kubeflow Training and AutoML WG Community Meeting | ||
url: https://bit.ly/2PWVCkV | ||
about: Join the Kubeflow Training working group meeting |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file was deleted.
Oops, something went wrong.
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
name: Build and Publish Images | ||
|
||
on: | ||
- push | ||
- pull_request | ||
|
||
jobs: | ||
build-and-publish: | ||
name: Build and Publish Images | ||
runs-on: ubuntu-latest | ||
|
||
strategy: | ||
fail-fast: false | ||
matrix: | ||
include: | ||
- component-name: training-operator-v2 | ||
dockerfile: cmd/training-operator.v2alpha1/Dockerfile | ||
platforms: linux/amd64,linux/arm64,linux/ppc64le | ||
tag-prefix: v2alpha1 | ||
- component-name: model-initializer-v2 | ||
dockerfile: cmd/initializer_v2/model/Dockerfile | ||
platforms: linux/amd64,linux/arm64 | ||
tag-prefix: v2 | ||
- component-name: dataset-initializer-v2 | ||
dockerfile: cmd/initializer_v2/dataset/Dockerfile | ||
platforms: linux/amd64,linux/arm64 | ||
tag-prefix: v2 | ||
|
||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v4 | ||
|
||
- name: Docker Login | ||
# Trigger workflow only for kubeflow/training-operator repository with specific branch (master, release-*) or tag (v.*). | ||
if: >- | ||
github.repository == 'kubeflow/training-operator' && | ||
(github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release-') || startsWith(github.ref, 'refs/tags/v')) | ||
uses: docker/login-action@v3 | ||
with: | ||
username: ${{ secrets.DOCKERHUB_USERNAME }} | ||
password: ${{ secrets.DOCKERHUB_TOKEN }} | ||
|
||
- name: Publish Component ${{ matrix.component-name }} | ||
# Trigger workflow only for kubeflow/training-operator repository with specific branch (master, release-*) or tag (v.*). | ||
if: >- | ||
github.repository == 'kubeflow/training-operator' && | ||
(github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release-') || startsWith(github.ref, 'refs/tags/v')) | ||
id: publish | ||
uses: ./.github/workflows/template-publish-image | ||
with: | ||
image: docker.io/kubeflow/${{ matrix.component-name }} | ||
dockerfile: ${{ matrix.dockerfile }} | ||
platforms: ${{ matrix.platforms }} | ||
context: ${{ matrix.context }} | ||
tag-prefix: ${{ matrix.tag-prefix }} | ||
push: true | ||
|
||
- name: Test Build For Component ${{ matrix.component-name }} | ||
if: steps.publish.outcome == 'skipped' | ||
uses: ./.github/workflows/template-publish-image | ||
with: | ||
image: docker.io/kubeflow/${{ matrix.component-name }} | ||
dockerfile: ${{ matrix.dockerfile }} | ||
platforms: ${{ matrix.platforms }} | ||
context: ${{ matrix.context }} | ||
tag-prefix: ${{ matrix.tag-prefix }} | ||
push: false |
This file was deleted.
Oops, something went wrong.
File renamed without changes.
File renamed without changes.
This file was deleted.
Oops, something went wrong.
This file was deleted.
Oops, something went wrong.
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Electronic-Waste @kubeflow/wg-training-leads @kannon92 @astefanutti What do we think about renaming the image to:
That will keep us consistent with other controller managers like:
Kubernetes - https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
JobSet - https://github.com/kubernetes-sigs/jobset/blob/main/config/components/manager/manager.yaml#L17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Another question, do we really need to keep V2 version of our Kubernetes CRDs if it is a brand new API for Kubeflow Trainer project?
E.g.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
Do you indicate that we replace those resources API versions with v1alpha1 instead of v2alpha1?
In that case, it's fair. Actually, those resources have completely different concepts with v1 Operator APIs like PyTorchJob.
The only concern is how we can explain the corresponding API version and Operator version.
If we reversion TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2. Can this easily understand what is happening for the end users? Because the users can recognize only the API version, and the operator and project version are only in Documents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean if we make releases of this repo starting from v2.0.0 for TrainJob APIs, right ?
Actually, I had the same concerns that it might confuse users.
Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at
release-1.9
branch.Do we have any other ideas here @tenzen-y @johnugeorge @terrytangyuan @Electronic-Waste @franciscojavierarceo @seanlaii @astefanutti @kannon92 @vsoch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
I suggest that we use v2.0.0 for TrainJob APIs, since it will be more clear and straightforward. Users can have a simple impression on Kubeflow Trainer: v1 has CRDs like PyTorchJob and v2 has TrainJob and TrainingRuntime.
However, we are not ready for TrainJob APIs yet (it's not mature now). From my perspective, it's okay to release 1.9 with both v1 and v2alpha1 APIs. But we'd better switch to v2.0.0 once we deprecate the v1 APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Kubernetes, it is not standard practice to introduce a new CRD API starting at the
v2
version. Typically, a new API begins withv1alpha1
Looking ahead, once we release stable versions of TrainJob and TrainingRuntime, any breaking changes to the API would require introducing a
v3
release if we proceed withv2alpha1
now.To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.
@Electronic-Waste @kubeflow/wg-training-leads Why you don't like this idea ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. Maybe we need to let users know about the difference between
v1alpha1
andv1
APIs. But as @tenzen-y suggested, it's not straightforward and might be hard for end users to understand what had happened if we deprecatev1
APIs and switch tov1alpha1
.In my impression, we propose TrainJob APIs with "Kubeflow Training V2" slogan, which means a new start for training-operator (maybe known as Kubeflow Trainer in the future). So I prefer switching to v2.0.0 when the TrainJob APIs are stable. And I guess it will also be more straightforward for end users:)
Also cc👀 if you're interested in it @Doris-xm @truc0