
chore(sdk): check that requested memory less than or equals according limits #11240

Open · @ntny wants to merge 1 commit into master from the validate-that-requested-resource-meet-limits branch

Conversation

@ntny (Contributor) commented Sep 23, 2024

Description of your changes: raise an error with a descriptive message if the requested resources (memory/CPU) exceed the corresponding limit while defining the pipeline.

Example:

from kfp.dsl import pipeline_task

# The memory request (2M) exceeds the memory limit (1M), so defining this task should raise an error:
task = pipeline_task.PipelineTask(...).set_memory_request('2M').set_memory_limit('1M')
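
For context, the kind of check being described could look roughly like the following (a minimal sketch with made-up helper names, unit table, and error message; illustrative only, not the PR's actual implementation):

# Normalize Kubernetes-style memory strings to bytes so that a request can be
# compared against a limit; covers the decimal and binary suffixes.
_MEMORY_UNITS = {
    '': 1,
    'K': 10**3, 'M': 10**6, 'G': 10**9, 'T': 10**12, 'P': 10**15, 'E': 10**18,
    'Ki': 2**10, 'Mi': 2**20, 'Gi': 2**30, 'Ti': 2**40, 'Pi': 2**50, 'Ei': 2**60,
}


def _memory_to_bytes(memory: str) -> float:
    """Converts a memory string such as '2M' or '1.5Gi' to a number of bytes."""
    # Check two-character suffixes (e.g. 'Mi') before one-character ones (e.g. 'M').
    for suffix, multiplier in sorted(_MEMORY_UNITS.items(), key=lambda kv: -len(kv[0])):
        if suffix and memory.endswith(suffix):
            return float(memory[:-len(suffix)]) * multiplier
    return float(memory)  # plain bytes, no suffix


def check_memory_request_within_limit(request: str, limit: str) -> None:
    """Raises ValueError if the requested memory exceeds the limit."""
    if _memory_to_bytes(request) > _memory_to_bytes(limit):
        raise ValueError(f'Memory request {request!r} exceeds memory limit {limit!r}.')


check_memory_request_within_limit('1M', '2M')    # fine: request <= limit
# check_memory_request_within_limit('2M', '1M')  # would raise ValueError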


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Hi @ntny. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ntny force-pushed the validate-that-requested-resource-meet-limits branch from 2887cb9 to 0d429d2 on September 25, 2024 09:05
@hbelmiro (Contributor) commented:

/ok-to-test
/rerun-all

@rimolive (Member) commented Oct 1, 2024

@ntny Please take a look at https://github.com/kubeflow/pipelines/blob/master/sdk/CONTRIBUTING.md#code-style to fix the failing YAPF tests.

@HumairAK added this to the KFP SDK 2.10 milestone on Oct 10, 2024
@HumairAK (Collaborator) commented:

Looks like there are SDK CI failures. @ntny, can you take a look?

@HumairAK modified the milestones: KFP SDK 2.10 → KFP SDK 2.11 on Oct 10, 2024
@ntny (Contributor, Author) commented Oct 10, 2024

> Looks like there are SDK CI failures. @ntny, can you take a look?

Sure, in a few days.

@ntny force-pushed the validate-that-requested-resource-meet-limits branch from 0d429d2 to 80df2b0 on October 14, 2024 15:12
@google-oss-prow bot added the size/L label and removed the size/M label on Oct 14, 2024
@ntny changed the title from "chore(sdk): check that requested cpu/memory less than or equals according limits" to "chore(sdk): check that requested memory less than or equals according limits" on Oct 14, 2024
@ntny (Contributor, Author) commented Oct 14, 2024

/rerun-all

@ntny force-pushed the validate-that-requested-resource-meet-limits branch from 80df2b0 to 15aef12 on October 14, 2024 15:33
@ntny (Contributor, Author) commented Oct 14, 2024

/rerun-all

1 similar comment

@ntny (Contributor, Author) commented Oct 14, 2024

For validation purposes, I had to restore the code fragment that converts requested memory values to floats (so that requests can be compared against limits).

@ntny force-pushed the validate-that-requested-resource-meet-limits branch from 15aef12 to e760a06 on October 14, 2024 17:52
@ntny (Contributor, Author) commented Oct 14, 2024

@HumairAK The latest CI check failed due to a timeout.
I suppose it might be unrelated to this PR.

@ntny (Contributor, Author) commented Oct 14, 2024

/retest

@github-actions bot added the ci-passed label (All CI tests on a pull request have passed) on Oct 14, 2024
@gregsheremeta (Contributor) commented Oct 14, 2024

Thanks for the PR.

This kind of thing -- baking Kubernetes logic into KFP -- makes me a bit nervous. I already don't like that the kfp-kubernetes SDK basically "chases" the official python client, and the slope here seems slippery.

Using a different example, today we allow mounting a secret as a volume. Is it KFP's responsibility to make sure that the secret name is a valid Kubernetes name, or should we just rely on the Kubernetes server to deal with that? Going further, should we be pre-checking that the secret exists, or should we just let Kubernetes tell us that the volume mount failed at pipeline run time?

I've been mostly thinking we should just leave these sorts of things up to Kubernetes, and treat kfp-kubernetes as basically a "dumb" yaml generator.

This change seems fairly innocuous, but it's more code to maintain, and we're on the hook if the Kubernetes behavior changes someday.

I'd be a little more receptive to the change if we could somehow use the official python client. Did you by any chance look into that?

What do others think?

@ntny (Contributor, Author) commented Oct 15, 2024

> This kind of thing -- baking Kubernetes logic into KFP -- makes me a bit nervous. [...] I'd be a little more receptive to the change if we could somehow use the official python client. Did you by any chance look into that?

Thanks, I agree with your points. My reasoning for this PR:

  1. pipeline_task already performs similar memory checks on input parameters, and that is also k8s-specific validation: https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/pipeline_task.py#L457. I am not sure my check adds anything different in that regard.
  2. Requesting resources beyond the limits looks abnormal, and this behavior isn’t specific to Kubernetes in my view. It seems to me that any API providing request/limit options should handle this validation.

But overall, I agree with your arguments about maintaining the code, and it’s up to you to decide here.
Regarding the Python Kubernetes client, thanks, I will take a look at it.

@gregsheremeta (Contributor) commented:

@chensun do you have any thoughts on the above two comments? I'm not sure how much "Kubernetes logic" (for lack of a better term) we should be baking into KFP. Anton has some good points.

@ntny (Contributor, Author) commented Nov 2, 2024

Hi @gregsheremeta!
Regarding using the official Python client for validation at the KFP SDK level: I might be mistaken, but in my view it is not suitable here, because the SDK generates its own specification (or, in previous versions, worked directly with Argo) and does not operate within a Kubernetes context. The official client, on the other hand, works with Kubernetes entities.
Also, the official Python client supports a dry-run pattern as an alternative form of validation. I'm not sure whether this would be useful for the entire pipeline, but in my opinion it would have to be implemented in the KFP driver rather than in the SDK, and as a consequence the validation would occur at a different stage of the run.
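
For reference, the dry-run pattern mentioned above looks roughly like this with the official kubernetes Python client (a minimal sketch assuming a reachable cluster and a configured kubeconfig; the pod manifest is a made-up example, not anything the KFP SDK generates):

# Server-side dry-run validation: the API server runs admission and validation
# checks without persisting the object, so a request larger than its limit is
# rejected before anything is scheduled.
from kubernetes import client, config

config.load_kube_config()  # assumes a configured kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name='resource-check'),
    spec=client.V1PodSpec(containers=[
        client.V1Container(
            name='main',
            image='busybox',
            resources=client.V1ResourceRequirements(
                requests={'memory': '2M'},
                limits={'memory': '1M'},  # request > limit: rejected by the API server
            ),
        )
    ]),
)

# dry_run='All' performs full server-side validation without creating the pod.
client.CoreV1Api().create_namespaced_pod(namespace='default', body=pod, dry_run='All')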

@chensun (Member) commented Nov 6, 2024

> I've been mostly thinking we should just leave these sorts of things up to Kubernetes, and treat kfp-kubernetes as basically a "dumb" yaml generator.

I personally lean towards keeping the SDK as a "dumb" yaml generator and not baking too much logic into it. Otherwise, the SDK might become a bottleneck, and we would need to keep chasing trivial changes from the Kubernetes side.

> pipeline_task already performs similar memory checks on input parameters, and that is also k8s-specific validation: https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/pipeline_task.py#L457. I am not sure my check adds anything different in that regard.

This check is something we inherited from the KFP v1 codebase; it's probably much easier to keep it as-is than to reason the other way around.
While I agree that client-side validation is nice to have and that catching errors earlier is often desirable UX, I do want to point out that such a check has limited use, and the potential downside is inconsistent UX. For input parameter validation, the SDK can only check when the value is available at compilation or submission time, which is not always the case. For example:

some_component().set_memory_limit('8G') # SDK validation can check and throw early error on incorrect format

upstream_task = upstream()
downstream().set_memory_limit(upstream_task.output) # SDK validation cannot do anything with the runtime value

I personally prefer having consistent behavior and error messages in both cases.
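
To spell out the second case (a sketch with illustrative component names, written against the KFP v2 dsl as I understand it; the component bodies are placeholders), the limit only materializes at run time, so a compile-time request-vs-limit comparison has nothing to inspect:

from kfp import dsl


@dsl.component
def choose_limit() -> str:
    # The value is computed at run time and does not exist at compile time.
    return '4G'


@dsl.component
def train():
    pass


@dsl.pipeline
def my_pipeline():
    limit_task = choose_limit()
    # The SDK only sees a placeholder for limit_task.output here, so it cannot
    # compare it against any memory request during compilation.
    train().set_memory_limit(limit_task.output)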

@chensun closed this on Nov 6, 2024
@chensun reopened this on Nov 6, 2024
Labels
ci-passed (All CI tests on a pull request have passed), ok-to-test, size/L

Projects
None yet

6 participants