chore(sdk): check that requested memory less than or equals according limits #11240
base: master
Hi @ntny. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
2887cb9 to 0d429d2
/ok-to-test |
@ntny Please take a look at https://github.com/kubeflow/pipelines/blob/master/sdk/CONTRIBUTING.md#code-style to fix the failing YAPF tests. |
Looks like there are SDK CI failures. @ntny, can you take a look? |
Sure, in a few days. |
0d429d2 to 80df2b0
/rerun-all |
80df2b0 to 15aef12
/rerun-all |
/rerun-all |
For validation purposes, I had to restore the code fragment for converting requested resource memory labels to float (for comparing limits and requests). |
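As a rough illustration of the conversion mentioned above (not the PR's actual code), a minimal Python sketch might turn a memory string into a float number of bytes so that a request and a limit can be compared. The names `memory_to_bytes` and `check_request_within_limit` are hypothetical, and the accepted suffix set is an assumption.

```python
import re

# Assumed suffix set: decimal (K, M, G, ...) and binary (Ki, Mi, Gi, ...) units.
# Lowercase "k" is accepted as well, since Kubernetes writes kilo that way.
_MULTIPLIERS = {
    "": 1,
    "k": 10**3, "K": 10**3, "M": 10**6, "G": 10**9,
    "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}

_QUANTITY = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|Pi|Ei|k|K|M|G|T|P|E)?$")


def memory_to_bytes(quantity: str) -> float:
    """Convert a memory string such as '512Mi' or '8G' to bytes (hypothetical helper)."""
    match = _QUANTITY.match(quantity.strip())
    if match is None:
        raise ValueError(f"Invalid memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return float(number) * _MULTIPLIERS[suffix or ""]


def check_request_within_limit(request: str, limit: str) -> None:
    """Raise ValueError if the requested memory exceeds the limit."""
    if memory_to_bytes(request) > memory_to_bytes(limit):
        raise ValueError(
            f"Memory request {request!r} exceeds memory limit {limit!r}")
```

Comparing in a single unit (bytes here) avoids mistakes when the request and the limit use different suffixes, for example `512Mi` against `1G`.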
Signed-off-by: ntny <[email protected]>
15aef12 to e760a06
@HumairAK The latest CI check failed due to a timeout. |
/retest |
Thanks for the PR. This kind of thing -- baking Kubernetes logic into KFP -- makes me a bit nervous. I already don't like that the kfp-kubernetes SDK basically "chases" the official python client, and the slope here seems slippery.
Using a different example, today we allow mounting a secret as a volume. Is it KFP's responsibility to make sure that the secret name is a valid Kubernetes name, or should we just rely on the Kubernetes server to deal with that? Going further, should we be pre-checking that the secret exists, or should we just let Kubernetes tell us that the volume mount failed at pipeline run time?
I've been mostly thinking we should just leave these sorts of things up to Kubernetes, and treat kfp-kubernetes as basically a "dumb" yaml generator. This change seems fairly innocuous, but it's more code to maintain, and we're on the hook if the Kubernetes behavior changes someday.
I'd be a little more receptive to the change if we could somehow use the official python client. Did you by any chance look into that? What do others think? |
Thanks, I agree with your points. My reasoning for this PR:
But overall, I agree with your arguments about maintaining the code, and it’s up to you to decide here. |
@chensun do you have any thoughts on the above two comments? I'm not sure how much "Kubernetes logic" (for lack of a better term) we should be baking into KFP. Anton has some good points. |
Hi @gregsheremeta! |
I personally lean towards keeping the SDK as a "dumb" YAML generator and not baking too much logic into it. Otherwise, the SDK might become a bottleneck, and we would need to keep chasing trivial changes from the Kubernetes side.
This check is something we inherited from the KFP v1 codebase, so it's probably much easier to keep it as-is than to reason the other way around.
some_component().set_memory_limit('8G')  # SDK validation can check and throw an early error on an incorrect format
upstream_task = upstream()
downstream().set_memory_limit(upstream_task.output)  # SDK validation cannot do anything with the runtime value
I personally prefer having consistent behavior and error messages in both cases. |
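To make the two cases in this comment concrete, here is a minimal, self-contained Python sketch. It is not KFP code: `RuntimeValue` is a hypothetical stand-in for something like an upstream task output, and `validate_memory_limit` and its accepted format are assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class RuntimeValue:
    """Hypothetical placeholder for a value only known when the pipeline runs."""
    source: str


# Assumed format: an integer followed by an optional unit suffix.
_MEMORY_FORMAT = re.compile(r"^\d+(Ki|Mi|Gi|Ti|Pi|Ei|K|M|G|T|P|E)?$")


def validate_memory_limit(value: Union[str, RuntimeValue]) -> bool:
    """Return True if the value was checked, False if the check had to be skipped."""
    if isinstance(value, RuntimeValue):
        # Nothing to inspect at pipeline definition time; any format error
        # would only surface later, from the cluster.
        return False
    if _MEMORY_FORMAT.match(value) is None:
        raise ValueError(f"Invalid memory limit format: {value!r}")
    return True
```

A literal such as `'8G'` is checked immediately, while `RuntimeValue('upstream_task.output')` passes through unchecked, which is the inconsistency in behavior and error messages that the comment points out.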
Description of your changes: raise an error with a description if the requested resources (memory/CPU) exceed the selected limit while defining the pipeline.
examples: