[WIP] Remove the Training Operator V1 Source Code #2389

Open: wants to merge 16 commits into base: master
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug_report.yaml
@@ -1,11 +1,11 @@
 name: Bug Report
-description: Tell us about a problem you are experiencing with Training Operator
+description: Tell us about a problem you are experiencing with Kubeflow Trainer
 labels: ["kind/bug", "lifecycle/needs-triage"]
 body:
   - type: markdown
     attributes:
       value: |
-        Thanks for taking the time to fill out this Training Operator bug report!
+        Thanks for taking the time to fill out this Kubeflow Trainer bug report!
   - type: textarea
     id: problem
     attributes:
10 changes: 5 additions & 5 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -1,12 +1,12 @@
 blank_issues_enabled: true

 contact_links:
-  - name: Training Operator Documentation
-    url: https://www.kubeflow.org/docs/components/training/
+  - name: Kubeflow Trainer Documentation
+    url: https://www.kubeflow.org/docs/components/trainer/
     about: Much help can be found in the docs
-  - name: Kubeflow Training Operator Slack Channel
+  - name: Kubeflow Trainer Slack Channel
     url: https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels
-    about: Ask the Training Operator community on CNCF Slack
-  - name: Kubeflow Training Operator Community Meeting
+    about: Ask the Kubeflow Trainer community on CNCF Slack
+  - name: Kubeflow Training and AutoML WG Community Meeting
     url: https://bit.ly/2PWVCkV
     about: Join the Kubeflow Training working group meeting
8 changes: 4 additions & 4 deletions .github/ISSUE_TEMPLATE/feature_request.yaml
@@ -1,18 +1,18 @@
 name: Feature Request
-description: Suggest an idea for Training Operator
+description: Suggest an idea for Kubeflow Trainer
 labels: ["kind/feature", "lifecycle/needs-triage"]
 body:
   - type: markdown
     attributes:
       value: |
-        Thanks for taking the time to fill out this Training Operator feature request!
+        Thanks for taking the time to fill out this Kubeflow Trainer feature request!
   - type: textarea
     id: feature
     attributes:
       label: What you would like to be added?
       description: |
-        A clear and concise description of what you want to add to Training Operator.
-        Please consider to write Training Operator enhancement proposal if it is a large feature request.
+        A clear and concise description of what you want to add to Kubeflow Trainer.
+        Please consider to write Kubeflow Enhancement Proposal (KEP) if it is a large feature request.
     validations:
       required: true
   - type: textarea
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -12,4 +12,4 @@ Fixes #

 **Checklist:**

-- [ ] [Docs](https://www.kubeflow.org/docs/components/training/) included if any changes are user facing
+- [ ] [Docs](https://www.kubeflow.org/docs/components/trainer/) included if any changes are user facing
5 changes: 0 additions & 5 deletions .github/issue_label_bot.yaml

This file was deleted.

75 changes: 0 additions & 75 deletions .github/workflows/build-and-publish-images.yaml

This file was deleted.

67 changes: 67 additions & 0 deletions .github/workflows/build-and-push-images.yaml
@@ -0,0 +1,67 @@
+name: Build and Publish Images
+
+on:
+  - push
+  - pull_request
+
+jobs:
+  build-and-publish:
+    name: Build and Publish Images
+    runs-on: ubuntu-latest
+
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - component-name: training-operator-v2
+            dockerfile: cmd/training-operator.v2alpha1/Dockerfile
+            platforms: linux/amd64,linux/arm64,linux/ppc64le
+            tag-prefix: v2alpha1
Comment on lines +16 to +19
Member Author

@Electronic-Waste @kubeflow/wg-training-leads @kannon92 @astefanutti What do we think about renaming the image to:

docker.io/kubeflow/trainer-controller-manager:v2-<SHA>

That will keep us consistent with other controller managers like:
Kubernetes - https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
JobSet - https://github.com/kubernetes-sigs/jobset/blob/main/config/components/manager/manager.yaml#L17

Member Author

@tenzen-y Another question: do we really need to keep the V2 version of our Kubernetes CRDs if this is a brand-new API for the Kubeflow Trainer project?
E.g.

TrainJob
TrainingRuntime
ClusterTrainingRuntime

Member

> What do we think about renaming the image to:

SGTM

> @tenzen-y Another question: do we really need to keep the V2 version of our Kubernetes CRDs if this is a brand-new API for the Kubeflow Trainer project?

Do you mean that we would replace those resources' API versions with v1alpha1 instead of v2alpha1?
In that case, it's fair. Actually, those resources have completely different concepts from the v1 Operator APIs such as PyTorchJob.

The only concern is how we explain the relationship between the API version and the Operator version.
If we re-version TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2. Can end users easily understand what is happening? Users see only the API version; the operator and project versions appear only in the documentation.

Member Author

> If we re-version TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2.

Do you mean the case where we start releasing this repo from v2.0.0 for the TrainJob APIs?
Actually, I had the same concern that it might confuse users.

Another option could be to release v1.10.0, which includes these new APIs (TrainJob and TrainingRuntime).
In that case, we keep the CRD API versions and this repo's releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay on the release-1.9 branch.

Do we have any other ideas here @tenzen-y @johnugeorge @terrytangyuan @Electronic-Waste @franciscojavierarceo @seanlaii @astefanutti @kannon92 @vsoch?

Member (@Electronic-Waste, Jan 21, 2025)

> What do we think about renaming the image to

SGTM

> Another option could be to release v1.10.0, which includes these new APIs (TrainJob and TrainingRuntime).
> In that case, we keep the CRD API versions and this repo's releases consistent.
> The PyTorchJob, TFJob, and other APIs for Training Operator will stay on the release-1.9 branch.

I suggest that we use v2.0.0 for the TrainJob APIs, since it will be clearer and more straightforward. Users get a simple picture of Kubeflow Trainer: v1 has CRDs like PyTorchJob, and v2 has TrainJob and TrainingRuntime.

However, the TrainJob APIs are not ready yet (they are not mature). From my perspective, it's okay to release 1.9 with both v1 and v2alpha1 APIs. But we'd better switch to v2.0.0 once we deprecate the v1 APIs.

Member Author

In Kubernetes, it is not standard practice to introduce a new CRD API starting at the v2 version. Typically, a new API begins with v1alpha1.

Looking ahead, once we release stable versions of TrainJob and TrainingRuntime, any breaking changes to the API would require introducing a v3 release if we proceed with v2alpha1 now.
To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.

@Electronic-Waste @kubeflow/wg-training-leads Why don't you like this idea?

> Another option could be to release v1.10.0, which includes these new APIs (TrainJob and TrainingRuntime).
> In that case, we keep the CRD API versions and this repo's releases consistent.
> The PyTorchJob, TFJob, and other APIs for Training Operator will stay on the release-1.9 branch.
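
Editorial illustration of the versioning point raised in this thread: Kubernetes API version strings follow the pattern `vN`, `vNbetaM`, or `vNalphaM`. The sketch below is only an assumption-laden aid, not code from this PR; `ApiVersion`, `parse_api_version`, and `next_major_after_breaking_change` are hypothetical helper names. It shows why starting at `v2alpha1` means a later breaking change after stabilization would land on `v3`, whereas starting at `v1alpha1` would land on `v2`.

```python
import re
from typing import NamedTuple


class ApiVersion(NamedTuple):
    """Hypothetical container for a parsed Kubernetes-style API version."""
    major: int
    stage: str  # "alpha", "beta", or "stable"
    minor: int  # 0 when the version is stable (e.g. "v1")


# Matches "v1", "v1beta2", "v2alpha1", and similar strings.
_VERSION_RE = re.compile(r"^v(\d+)(?:(alpha|beta)(\d+))?$")


def parse_api_version(version: str) -> ApiVersion:
    """Split a version string such as "v2alpha1" into its parts."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a Kubernetes-style API version: {version!r}")
    major, stage, minor = match.groups()
    return ApiVersion(int(major), stage or "stable", int(minor) if minor else 0)


def next_major_after_breaking_change(version: str) -> str:
    """Major version a breaking change would need once `version` is stable."""
    return f"v{parse_api_version(version).major + 1}"


print(parse_api_version("v2alpha1"))
print(next_major_after_breaking_change("v2alpha1"))
print(next_major_after_breaking_change("v1alpha1"))
```

The helper only models the naming convention discussed here; it does not reflect how the project ultimately decided to version TrainJob or TrainingRuntime.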

Member

> To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.

SGTM. Maybe we need to let users know about the difference between the v1alpha1 and v1 APIs. But as @tenzen-y suggested, it's not straightforward, and it might be hard for end users to understand what happened if we deprecate the v1 APIs and switch to v1alpha1.

> @Electronic-Waste @kubeflow/wg-training-leads Why don't you like this idea?

My impression is that we proposed the TrainJob APIs under the "Kubeflow Training V2" slogan, which means a new start for training-operator (which may be known as Kubeflow Trainer in the future). So I prefer switching to v2.0.0 when the TrainJob APIs are stable. I guess it will also be more straightforward for end users :)

Also cc👀 if you're interested in it @Doris-xm @truc0

+          - component-name: model-initializer-v2
+            dockerfile: cmd/initializer_v2/model/Dockerfile
+            platforms: linux/amd64,linux/arm64
+            tag-prefix: v2
+          - component-name: dataset-initializer-v2
+            dockerfile: cmd/initializer_v2/dataset/Dockerfile
+            platforms: linux/amd64,linux/arm64
+            tag-prefix: v2
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Docker Login
+        # Trigger workflow only for kubeflow/training-operator repository with specific branch (master, release-*) or tag (v.*).
+        if: >-
+          github.repository == 'kubeflow/training-operator' &&
+          (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release-') || startsWith(github.ref, 'refs/tags/v'))
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Publish Component ${{ matrix.component-name }}
+        # Trigger workflow only for kubeflow/training-operator repository with specific branch (master, release-*) or tag (v.*).
+        if: >-
+          github.repository == 'kubeflow/training-operator' &&
+          (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release-') || startsWith(github.ref, 'refs/tags/v'))
+        id: publish
+        uses: ./.github/workflows/template-publish-image
+        with:
+          image: docker.io/kubeflow/${{ matrix.component-name }}
+          dockerfile: ${{ matrix.dockerfile }}
+          platforms: ${{ matrix.platforms }}
+          context: ${{ matrix.context }}
+          tag-prefix: ${{ matrix.tag-prefix }}
+          push: true
+
+      - name: Test Build For Component ${{ matrix.component-name }}
+        if: steps.publish.outcome == 'skipped'
+        uses: ./.github/workflows/template-publish-image
+        with:
+          image: docker.io/kubeflow/${{ matrix.component-name }}
+          dockerfile: ${{ matrix.dockerfile }}
+          platforms: ${{ matrix.platforms }}
+          context: ${{ matrix.context }}
+          tag-prefix: ${{ matrix.tag-prefix }}
+          push: false
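
Editorial illustration: the `if:` conditions in the workflow above decide whether each matrix entry is pushed or only test-built. The Python sketch below mirrors that decision so the rule is easier to read; `should_publish` and `build_mode` are hypothetical names that do not exist in the repository, and the sketch assumes the condition is evaluated exactly as written in the workflow.

```python
def should_publish(repository: str, ref: str) -> bool:
    """Mirror the `if:` expression on the Docker Login and Publish steps."""
    is_upstream = repository == "kubeflow/training-operator"
    is_release_ref = (
        ref == "refs/heads/master"
        or ref.startswith("refs/heads/release-")
        or ref.startswith("refs/tags/v")
    )
    return is_upstream and is_release_ref


def build_mode(repository: str, ref: str) -> str:
    """Return "push" when the publish step would run, else "test-build".

    The test-build step runs only when the publish step was skipped,
    so exactly one of the two modes applies to each run.
    """
    return "push" if should_publish(repository, ref) else "test-build"


# A pull_request run carries a ref like refs/pull/<number>/merge,
# so it falls through to the test build.
print(build_mode("kubeflow/training-operator", "refs/pull/2389/merge"))
print(build_mode("kubeflow/training-operator", "refs/heads/release-1.9"))
```

Under this reading, forks and pull requests always build without pushing, while pushes to master, release branches, and version tags in the upstream repository publish images.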
61 changes: 0 additions & 61 deletions .github/workflows/e2e-test-train-api.yaml

This file was deleted.

File renamed without changes.
106 changes: 0 additions & 106 deletions .github/workflows/integration-tests.yaml

This file was deleted.

14 changes: 0 additions & 14 deletions .github/workflows/pre-commit.yaml

This file was deleted.
