-
Notifications
You must be signed in to change notification settings - Fork 716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Remove the Training Operator V1 Source Code #2389
base: master
Are you sure you want to change the base?
Conversation
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: seanlaii, saileshd1402, astefanutti. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@@ -1,8 +1,3 @@ | |||
# Image URL to use all building/pushing image targets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored Makefile to keep only instructions that we use today.
We can add more of them, if we require them in the future.
Also, I install envtest-setup and controller-gen to the local bin, similar like in other k8s projects.
@kubeflow/wg-training-leads Please let me know if that looks good.
| `latest` (master HEAD) | `v1` | 1.27+ | | ||
|
||
## Reference | ||
## Kubeflow Training Operator V1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should work on this message to explain that users can still use Training Operator v1 and we will resolve any bugs on the release branch. cc @kubeflow/wg-training-leads
6ad98f8
to
04c0a95
Compare
@@ -22,7 +22,7 @@ inputs: | |||
tag-prefix: | |||
required: false | |||
default: v1 | |||
description: Prefix for the image tag, e.g. v1 or v2alpha1 | |||
description: Prefix for the image tag, e.g. v2alpha1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kubeflow/wg-training-leads @Electronic-Waste What do we think about these templates ?
Do we really need them if we only use them in a single GitHub action ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, reserving these templates will be great because they could keep the CI test modularized and easy to read:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we could explore options to re-use them for various projects (e.g. Katib).
Currently, this is only used in build-and-push-images.yaml
run: | | ||
make test-integration ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }} | ||
|
||
- name: Coveralls report |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our Coveralls report doesn't work, so we might need to re-visit this: https://coveralls.io/github/kubeflow/training-operator?branch=master
2996292
to
b1098a5
Compare
- component-name: training-operator-v2 | ||
dockerfile: cmd/training-operator.v2alpha1/Dockerfile | ||
platforms: linux/amd64,linux/arm64,linux/ppc64le | ||
tag-prefix: v2alpha1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Electronic-Waste @kubeflow/wg-training-leads @kannon92 @astefanutti What do we think about renaming the image to:
docker.io/kubeflow/trainer-controller-manager:v2-<SHA>
That will keep us consistent with other controller managers like:
Kubernetes - https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
JobSet - https://github.com/kubernetes-sigs/jobset/blob/main/config/components/manager/manager.yaml#L17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Another question, do we really need to keep V2 version of our Kubernetes CRDs if it is a brand new API for Kubeflow Trainer project?
E.g.
TrainJob
TrainingRuntime
ClusterTrainingRuntime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we think about renaming the image to:
SGTM
@tenzen-y Another question, do we really need to keep V2 version of our Kubernetes CRDs if it is a brand new API for Kubeflow Trainer project?
Do you indicate that we replace those resources API versions with v1alpha1 instead of v2alpha1?
In that case, it's fair. Actually, those resources have completely different concepts with v1 Operator APIs like PyTorchJob.
The only concern is how we can explain the corresponding API version and Operator version.
If we reversion TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2. Can this easily understand what is happening for the end users? Because the users can recognize only the API version, and the operator and project version are only in Documents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we reversion TrainJob with v1alpha1, we will say that the project (operator) version is v1, but API version is v2.
Do you mean if we make releases of this repo starting from v2.0.0 for TrainJob APIs, right ?
Actually, I had the same concerns that it might confuse users.
Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9
branch.
Do we have any other ideas here @tenzen-y @johnugeorge @terrytangyuan @Electronic-Waste @franciscojavierarceo @seanlaii @astefanutti @kannon92 @vsoch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we think about renaming the image to
SGTM
Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.
I suggest that we use v2.0.0 for TrainJob APIs, since it will be more clear and straightforward. Users can have a simple impression on Kubeflow Trainer: v1 has CRDs like PyTorchJob and v2 has TrainJob and TrainingRuntime.
However, we are not ready for TrainJob APIs yet (it's not mature now). From my perspective, it's okay to release 1.9 with both v1 and v2alpha1 APIs. But we'd better switch to v2.0.0 once we deprecate the v1 APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Kubernetes, it is not standard practice to introduce a new CRD API starting at the v2
version. Typically, a new API begins with v1alpha1
Looking ahead, once we release stable versions of TrainJob and TrainingRuntime, any breaking changes to the API would require introducing a v3
release if we proceed with v2alpha1
now.
To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.
@Electronic-Waste @kubeflow/wg-training-leads Why you don't like this idea ?
Another option could be to release v1.10.0 which includes this new APIs (TrainJob and TrainingRuntime).
In that case we will keep CRD APIs and this repo releases consistent.
The PyTorchJob, TFJob, and other APIs for Training Operator will stay at release-1.9 branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To maintain consistency and follow conventions, starting with v1alpha1 would be the better approach, I believe.
SGTM. Maybe we need to let users know about the difference between v1alpha1
and v1
APIs. But as @tenzen-y suggested, it's not straightforward and might be hard for end users to understand what had happened if we deprecate v1
APIs and switch to v1alpha1
.
@Electronic-Waste @kubeflow/wg-training-leads Why you don't like this idea ?
In my impression, we propose TrainJob APIs with "Kubeflow Training V2" slogan, which means a new start for training-operator (maybe known as Kubeflow Trainer in the future). So I prefer switching to v2.0.0 when the TrainJob APIs are stable. And I guess it will also be more straightforward for end users:)
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
f745c14
to
100cb5b
Compare
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Part of: #2170
I removed the Training Operator v1 source code from the
master
branch, so we can keep our git history only with V2 changes and release the 2.0.Also, I updated the Makefile, scripts, and ROADMAP.
TODO list:
/cc @kubeflow/wg-training-leads @Electronic-Waste @saileshd1402 @astefanutti @franciscojavierarceo @saileshd1402 @seanlaii