-
Notifications
You must be signed in to change notification settings - Fork 69
Add efa support in manifest for training jobs #345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| vcpu=str(self.vcpu) if self.vcpu else None, | ||
| memory=str(self.memory) if self.memory else None | ||
| memory=str(self.memory) if self.memory else None, | ||
| **{"vpc.amazonaws.com/efa": str(self.efa)} if self.efa else {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please add a comma at the of the line (here and below). It will prevent from unnecessary diffs in the future.
|
|
||
| limits_values = _get_limits(instance_type, vcpu_limit, memory_limit, acc_lim, accelerator_partition_type, acc_partition_lim) | ||
| if efa is not None: | ||
| requests_values["vpc.amazonaws.com/efa"] = efa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1/ If there is no values provided, we use values from the instance. But looks like here you set requests_values. It is not a symmetric operation. Can you check ident in this block
2/ Do you want to add EFA_RESOURCE_KEY constant?
| else: | ||
| efa_count = instance.get("efa", 0) | ||
| if efa_count > 0: | ||
| result["vpc.amazonaws.com/efa"] = efa_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this:
efa_count = efa or instance.get("efa", 0)
if efa_count > 0:
...
| default=None, | ||
| description="Limit for the amount of memory in GiB", | ||
| ) | ||
| efa: Optional[int] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to call this just efa or efas/efa_interfaces? Not sure if efa is generally used as a noun, so leaning towards efa_interfaces, though I would defer to an expert here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, are we not having a default option, something like network_interface_type = efa? Or is the default that efa is specified as an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if efa is not provided in the argument, the default network interface will be the socket
| raise ValueError('EFA request must equal EFA limit') | ||
| if efa_limit > max_efa_per_instance: | ||
| raise ValueError(f'Requested EFA limit ({efa_limit}) exceeds instance capacity ({max_efa_per_instance})') | ||
| if efa_request > max_efa_per_instance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if statement is unreachable.
| raise ValueError(f'Requested EFA ({efa_request}) exceeds instance capacity ({max_efa_per_instance})') | ||
| elif efa_request is not None and efa_request > max_efa_per_instance: | ||
| raise ValueError(f'Requested EFA ({efa_request}) exceeds instance capacity ({max_efa_per_instance})') | ||
| elif efa_limit is not None and efa_limit > max_efa_per_instance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check is if it's possible to set EFA limits without EFA requests or both need to be specified (and vice-versa).
What's changing and why?
Add support of efa in manifest for training jobs
Before/After UX
Before:
hyp create hyp-pytorch-job --job-name "<job_name>" --image "<image_name>" --accelerators 4 --vcpu 96.0 --memory 1152.0 --accelerators-limit 4 --vcpu-limit 96.0 --memory-limit 1152.0 --instance-type <instance_type>
After:
hyp create hyp-pytorch-job --job-name "<job_name>" --image "<image_name>" --accelerators 4 --vcpu 96.0 --memory 1152.0 --accelerators-limit 4 --vcpu-limit 96.0 --memory-limit 1152.0 --efa 16 --efa-limit 16 --instance-type <instance_type>
How was this change tested?
pip install . to include the changes locally, submit a job through cli, check the output with kubectl get hyperpodpytorchjob <job_name> -o yaml
Are unit tests added?
N/A
Are integration tests added?
N/A
Reviewer Guidelines
One of the following must be true: