Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Remove the Training Operator V1 Source Code #2389

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Jan 16, 2025

Part of: #2170

I removed the Training Operator v1 source code from the master branch, so we can keep our git history only with V2 changes and release the 2.0.
Also, I updated the Makefile, scripts, and ROADMAP.

TODO list:

  • Update the GitHub actions
  • Finish the README

/cc @kubeflow/wg-training-leads @Electronic-Waste @saileshd1402 @astefanutti @franciscojavierarceo @saileshd1402 @seanlaii

Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: seanlaii, saileshd1402, astefanutti.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Part of: #2170

I removed the Training Operator v1 source code from the master branch, so we can keep our git history only with V2 changes and release the 2.0.
Also, I updated the Makefile, scripts, and ROADMAP.

TODO list:

  • Update the GitHub actions
  • Finish the README

/cc @kubeflow/wg-training-leads @Electronic-Waste @saileshd1402 @astefanutti @franciscojavierarceo @saileshd1402 @seanlaii

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@@ -1,8 +1,3 @@
# Image URL to use all building/pushing image targets
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored Makefile to keep only instructions that we use today.
We can add more of them, if we require them in the future.

Also, I install envtest-setup and controller-gen to the local bin, similar like in other k8s projects.

@kubeflow/wg-training-leads Please let me know if that looks good.

| `latest` (master HEAD) | `v1` | 1.27+ |

## Reference
## Kubeflow Training Operator V1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should work on this message to explain that users can still use Training Operator v1 and we will resolve any bugs on the release branch. cc @kubeflow/wg-training-leads

@@ -22,7 +22,7 @@ inputs:
tag-prefix:
required: false
default: v1
description: Prefix for the image tag, e.g. v1 or v2alpha1
description: Prefix for the image tag, e.g. v2alpha1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kubeflow/wg-training-leads @Electronic-Waste What do we think about these templates ?
Do we really need them if we only use them in a single GitHub action ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, reserving these templates will be great because they could keep the CI test modularized and easy to read:)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could explore options to re-use them for various projects (e.g. Katib).
Currently, this is only used in build-and-push-images.yaml

run: |
make test-integration ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }}

- name: Coveralls report
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our Coveralls report doesn't work, so we might need to re-visit this: https://coveralls.io/github/kubeflow/training-operator?branch=master

Comment on lines +16 to +19
- component-name: training-operator-v2
dockerfile: cmd/training-operator.v2alpha1/Dockerfile
platforms: linux/amd64,linux/arm64,linux/ppc64le
tag-prefix: v2alpha1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Electronic-Waste @kubeflow/wg-training-leads @kannon92 @astefanutti What do we think about renaming the image to:

docker.io/kubeflow/trainer-controller-manager:v2-<SHA>

That will keep us consistent with other controller managers like:
Kubernetes - https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
JobSet - https://github.com/kubernetes-sigs/jobset/blob/main/config/components/manager/manager.yaml#L17

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y Another question, do we really need to keep V2 version of our Kubernetes CRDs if it is a brand new API for Kubeflow Trainer project?
E.g.

TrainJob
TrainingRuntime
ClusterTrainingRuntime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we think about renaming the image to:

SGTM

@tenzen-y Another question, do we really need to keep V2 version of our Kubernetes CRDs if it is a brand new API for Kubeflow Trainer project?

Do you indicate that we replace those resources API versions with v1alpha1 instead of v2alpha1?
In that case, it's fair. Actually, those resources have completely different concepts with v1 Operator APIs like PyTorchJob.

The only concern is how we can explain the corresponding API version and Operator version.
If we reversion TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2. Can this easily understand what is happening for the end users? Because the users can recognize only the API version, and the operator and project version are only in Documents.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we reversion TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2.

Do you mean if we make releases of this repo starting from v2.0.0 for TrainJob APIs, right ?
Actually, I had the same concerns that it might confuse users.

Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.

Do we have any other ideas here @tenzen-y @johnugeorge @terrytangyuan @Electronic-Waste @franciscojavierarceo @seanlaii @astefanutti @kannon92 @vsoch?

Copy link
Member

@Electronic-Waste Electronic-Waste Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we think about renaming the image to

SGTM

Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.

I suggest that we use v2.0.0 for TrainJob APIs, since it will be more clear and straightforward. Users can have a simple impression on Kubeflow Trainer: v1 has CRDs like PyTorchJob and v2 has TrainJob and TrainingRuntime.

However, we are not ready for TrainJob APIs yet (it's not mature now). From my perspective, it's okay to release 1.9 with both v1 and v2alpha1 APIs. But we'd better switch to v2.0.0 once we deprecate the v1 APIs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Kubernetes, it is not standard practice to introduce a new CRD API starting at the v2 version. Typically, a new API begins with v1alpha1

Looking ahead, once we release stable versions of TrainJob and TrainingRuntime, any breaking changes to the API would require introducing a v3 release if we proceed with v2alpha1 now.
To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.

@Electronic-Waste @kubeflow/wg-training-leads Why you don't like this idea ?

Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.

SGTM. Maybe we need to let users know about the difference between v1alpha1 and v1 APIs. But as @tenzen-y suggested, it's not straightforward and might be hard for end users to understand what had happened if we deprecate v1 APIs and switch to v1alpha1.

@Electronic-Waste @kubeflow/wg-training-leads Why you don't like this idea ?

In my impression, we propose TrainJob APIs with "Kubeflow Training V2" slogan, which means a new start for training-operator (maybe known as Kubeflow Trainer in the future). So I prefer switching to v2.0.0 when the TrainJob APIs are stable. And I guess it will also be more straightforward for end users:)

Also cc👀 if you're interested in it @Doris-xm @truc0

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants