[WIP] Remove the Training Operator V1 Source Code #2389

andreyvelich · 2025-01-16T16:49:55Z

Part of: #2170

I removed the Training Operator v1 source code from the master branch, so we can keep our git history only with V2 changes and release the 2.0.
Also, I updated the Makefile, scripts, and ROADMAP.

TODO list:

Update the GitHub actions
Finish the README

/cc @kubeflow/wg-training-leads @Electronic-Waste @saileshd1402 @astefanutti @franciscojavierarceo @saileshd1402 @seanlaii

google-oss-prow · 2025-01-16T16:50:02Z

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: seanlaii, saileshd1402, astefanutti.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Part of: #2170

I removed the Training Operator v1 source code from the master branch, so we can keep our git history only with V2 changes and release the 2.0.
Also, I updated the Makefile, scripts, and ROADMAP.

TODO list:

Update the GitHub actions

Finish the README

/cc @kubeflow/wg-training-leads @Electronic-Waste @saileshd1402 @astefanutti @franciscojavierarceo @saileshd1402 @seanlaii

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow · 2025-01-16T16:50:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

review-notebook-app · 2025-01-16T16:50:03Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

andreyvelich · 2025-01-16T16:54:54Z

Makefile

@@ -1,8 +1,3 @@
-# Image URL to use all building/pushing image targets


I refactored Makefile to keep only instructions that we use today.
We can add more of them, if we require them in the future.

Also, I install envtest-setup and controller-gen to the local bin, similar like in other k8s projects.

@kubeflow/wg-training-leads Please let me know if that looks good.

andreyvelich · 2025-01-16T16:56:10Z

README.md

-| `latest` (master HEAD) | `v1`        | 1.27+              |
-
-## Reference
+## Kubeflow Training Operator V1


We should work on this message to explain that users can still use Training Operator v1 and we will resolve any bugs on the release branch. cc @kubeflow/wg-training-leads

andreyvelich · 2025-01-16T20:47:25Z

.github/workflows/template-publish-image/action.yaml

@@ -22,7 +22,7 @@ inputs:
  tag-prefix:
    required: false
    default: v1
-    description: Prefix for the image tag, e.g. v1 or v2alpha1
+    description: Prefix for the image tag, e.g. v2alpha1


@kubeflow/wg-training-leads @Electronic-Waste What do we think about these templates ?
Do we really need them if we only use them in a single GitHub action ?

I think, reserving these templates will be great because they could keep the CI test modularized and easy to read:)

It would be nice if we could explore options to re-use them for various projects (e.g. Katib).
Currently, this is only used in build-and-push-images.yaml

andreyvelich · 2025-01-16T20:55:06Z

.github/workflows/test-go.yaml

+        run: |
+          make test-integration ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }}
+
+      - name: Coveralls report


Our Coveralls report doesn't work, so we might need to re-visit this: https://coveralls.io/github/kubeflow/training-operator?branch=master

andreyvelich · 2025-01-18T00:10:00Z

.github/workflows/build-and-push-images.yaml

+          - component-name: training-operator-v2
+            dockerfile: cmd/training-operator.v2alpha1/Dockerfile
+            platforms: linux/amd64,linux/arm64,linux/ppc64le
+            tag-prefix: v2alpha1


@Electronic-Waste @kubeflow/wg-training-leads @kannon92 @astefanutti What do we think about renaming the image to:

docker.io/kubeflow/trainer-controller-manager:v2-<SHA>

That will keep us consistent with other controller managers like:
Kubernetes - https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
JobSet - https://github.com/kubernetes-sigs/jobset/blob/main/config/components/manager/manager.yaml#L17

@tenzen-y Another question, do we really need to keep V2 version of our Kubernetes CRDs if it is a brand new API for Kubeflow Trainer project?
E.g.

TrainJob TrainingRuntime ClusterTrainingRuntime

What do we think about renaming the image to:

SGTM

@tenzen-y Another question, do we really need to keep V2 version of our Kubernetes CRDs if it is a brand new API for Kubeflow Trainer project?

Do you indicate that we replace those resources API versions with v1alpha1 instead of v2alpha1?
In that case, it's fair. Actually, those resources have completely different concepts with v1 Operator APIs like PyTorchJob.

The only concern is how we can explain the corresponding API version and Operator version.
If we reversion TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2. Can this easily understand what is happening for the end users? Because the users can recognize only the API version, and the operator and project version are only in Documents.

If we reversion TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2.

Do you mean if we make releases of this repo starting from v2.0.0 for TrainJob APIs, right ?
Actually, I had the same concerns that it might confuse users.

Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.

Do we have any other ideas here @tenzen-y @johnugeorge @terrytangyuan @Electronic-Waste @franciscojavierarceo @seanlaii @astefanutti @kannon92 @vsoch?

What do we think about renaming the image to

SGTM

Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.

I suggest that we use v2.0.0 for TrainJob APIs, since it will be more clear and straightforward. Users can have a simple impression on Kubeflow Trainer: v1 has CRDs like PyTorchJob and v2 has TrainJob and TrainingRuntime.

However, we are not ready for TrainJob APIs yet (it's not mature now). From my perspective, it's okay to release 1.9 with both v1 and v2alpha1 APIs. But we'd better switch to v2.0.0 once we deprecate the v1 APIs.

In Kubernetes, it is not standard practice to introduce a new CRD API starting at the v2 version. Typically, a new API begins with v1alpha1

Looking ahead, once we release stable versions of TrainJob and TrainingRuntime, any breaking changes to the API would require introducing a v3 release if we proceed with v2alpha1 now.
To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.

@Electronic-Waste @kubeflow/wg-training-leads Why you don't like this idea ?

Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.

To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.

SGTM. Maybe we need to let users know about the difference between v1alpha1 and v1 APIs. But as @tenzen-y suggested, it's not straightforward and might be hard for end users to understand what had happened if we deprecate v1 APIs and switch to v1alpha1.

@Electronic-Waste @kubeflow/wg-training-leads Why you don't like this idea ?

In my impression, we propose TrainJob APIs with "Kubeflow Training V2" slogan, which means a new start for training-operator (maybe known as Kubeflow Trainer in the future). So I prefer switching to v2.0.0 when the TrainJob APIs are stable. And I guess it will also be more straightforward for end users:)

Also cc👀 if you're interested in it @Doris-xm @truc0

Signed-off-by: Andrey Velichkevich <[email protected]>

google-oss-prow bot added the do-not-merge/work-in-progress label Jan 16, 2025

google-oss-prow bot requested review from franciscojavierarceo, a team and Electronic-Waste January 16, 2025 16:49

google-oss-prow bot added the size/XXL label Jan 16, 2025

andreyvelich commented Jan 16, 2025

View reviewed changes

andreyvelich force-pushed the remove-v1-code branch from 6ad98f8 to 04c0a95 Compare January 16, 2025 17:18

andreyvelich commented Jan 16, 2025

View reviewed changes

andreyvelich force-pushed the remove-v1-code branch from 2996292 to b1098a5 Compare January 17, 2025 23:59

andreyvelich commented Jan 18, 2025

View reviewed changes

andreyvelich added 8 commits January 18, 2025 18:38

Remove Training Operator V1 files

584686a

Signed-off-by: Andrey Velichkevich <[email protected]>

Refactor README and Makefile

212b7f0

Signed-off-by: Andrey Velichkevich <[email protected]>

Remove v1 GitHub workflows

181a5b2

Signed-off-by: Andrey Velichkevich <[email protected]>

Remove finish action from Go

bd52b4e

Signed-off-by: Andrey Velichkevich <[email protected]>

Rename Kubeflow Training to Kubeflow Trainer

0959e9e

Signed-off-by: Andrey Velichkevich <[email protected]>

Rename comment

aae7fbe

Signed-off-by: Andrey Velichkevich <[email protected]>

Add steps for free-up space

2d3a72a

Signed-off-by: Andrey Velichkevich <[email protected]>

Update Python tests

100cb5b

Signed-off-by: Andrey Velichkevich <[email protected]>

andreyvelich force-pushed the remove-v1-code branch from f745c14 to 100cb5b Compare January 18, 2025 19:26

andreyvelich added 3 commits January 18, 2025 19:27

Remove integration workflow

f1372c8

Signed-off-by: Andrey Velichkevich <[email protected]>

Install SDK in Python integration tests

1da6379

Signed-off-by: Andrey Velichkevich <[email protected]>

Remove generate file

29cf414

Signed-off-by: Andrey Velichkevich <[email protected]>

This was referenced Jan 20, 2025

[WIP] Training: Initial Documentation for Kubeflow Trainer V2 kubeflow/website#3958

Open

Add Changelog for Training Operator v1.9.0 #2397

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Remove the Training Operator V1 Source Code #2389

[WIP] Remove the Training Operator V1 Source Code #2389

andreyvelich commented Jan 16, 2025 •

edited

Loading

google-oss-prow bot commented Jan 16, 2025

google-oss-prow bot commented Jan 16, 2025

review-notebook-app bot commented Jan 16, 2025

andreyvelich Jan 16, 2025

andreyvelich Jan 16, 2025

andreyvelich Jan 16, 2025

Electronic-Waste Jan 17, 2025

andreyvelich Jan 17, 2025

andreyvelich Jan 16, 2025

andreyvelich Jan 18, 2025

andreyvelich Jan 18, 2025

tenzen-y Jan 20, 2025

andreyvelich Jan 20, 2025

Electronic-Waste Jan 21, 2025 •

edited

Loading

andreyvelich Jan 21, 2025

Electronic-Waste Jan 21, 2025

		@@ -1,8 +1,3 @@
		# Image URL to use all building/pushing image targets

[WIP] Remove the Training Operator V1 Source Code #2389

Are you sure you want to change the base?

[WIP] Remove the Training Operator V1 Source Code #2389

Conversation

andreyvelich commented Jan 16, 2025 • edited Loading

google-oss-prow bot commented Jan 16, 2025

google-oss-prow bot commented Jan 16, 2025

review-notebook-app bot commented Jan 16, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Electronic-Waste Jan 21, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andreyvelich commented Jan 16, 2025 •

edited

Loading

Electronic-Waste Jan 21, 2025 •

edited

Loading