Skip to content

Conversation

@yungwenh-aws
Copy link
Contributor

What's changing and why?

Add support of efa in manifest for training jobs

Before/After UX

Before:
hyp create hyp-pytorch-job --job-name "<job_name>" --image "<image_name>" --accelerators 4 --vcpu 96.0 --memory 1152.0 --accelerators-limit 4 --vcpu-limit 96.0 --memory-limit 1152.0 --instance-type <instance_type>

After:
hyp create hyp-pytorch-job --job-name "<job_name>" --image "<image_name>" --accelerators 4 --vcpu 96.0 --memory 1152.0 --accelerators-limit 4 --vcpu-limit 96.0 --memory-limit 1152.0 --efa 16 --efa-limit 16 --instance-type <instance_type>

How was this change tested?

pip install . to include the changes locally, submit a job through cli, check the output with kubectl get hyperpodpytorchjob <job_name> -o yaml

Are unit tests added?

N/A

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

vcpu=str(self.vcpu) if self.vcpu else None,
memory=str(self.memory) if self.memory else None
memory=str(self.memory) if self.memory else None,
**{"vpc.amazonaws.com/efa": str(self.efa)} if self.efa else {}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add a comma at the of the line (here and below). It will prevent from unnecessary diffs in the future.


limits_values = _get_limits(instance_type, vcpu_limit, memory_limit, acc_lim, accelerator_partition_type, acc_partition_lim)
if efa is not None:
requests_values["vpc.amazonaws.com/efa"] = efa

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/ If there is no values provided, we use values from the instance. But looks like here you set requests_values. It is not a symmetric operation. Can you check ident in this block
2/ Do you want to add EFA_RESOURCE_KEY constant?

else:
efa_count = instance.get("efa", 0)
if efa_count > 0:
result["vpc.amazonaws.com/efa"] = efa_count

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this:

efa_count = efa or instance.get("efa", 0)
if efa_count > 0:
  ...

default=None,
description="Limit for the amount of memory in GiB",
)
efa: Optional[int] = Field(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to call this just efa or efas/efa_interfaces? Not sure if efa is generally used as a noun, so leaning towards efa_interfaces, though I would defer to an expert here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are we not having a default option, something like network_interface_type = efa? Or is the default that efa is specified as an option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if efa is not provided in the argument, the default network interface will be the socket

raise ValueError('EFA request must equal EFA limit')
if efa_limit > max_efa_per_instance:
raise ValueError(f'Requested EFA limit ({efa_limit}) exceeds instance capacity ({max_efa_per_instance})')
if efa_request > max_efa_per_instance:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if statement is unreachable.

raise ValueError(f'Requested EFA ({efa_request}) exceeds instance capacity ({max_efa_per_instance})')
elif efa_request is not None and efa_request > max_efa_per_instance:
raise ValueError(f'Requested EFA ({efa_request}) exceeds instance capacity ({max_efa_per_instance})')
elif efa_limit is not None and efa_limit > max_efa_per_instance:
Copy link
Collaborator

@giuseppeporcelli giuseppeporcelli Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check is if it's possible to set EFA limits without EFA requests or both need to be specified (and vice-versa).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants