-
Notifications
You must be signed in to change notification settings - Fork 390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/udp: detect and use interface MTU to support large messages over UDP #10493
base: main
Are you sure you want to change the base?
Conversation
I have a small command-line test application, but I'm not sure of the best way to modify it into a unit test and integrate it into the code base. #include <rdma/fabric.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char** argv) {
int msg_size = 1000;
if (argc > 1) {
if (sscanf(argv[1], "%d", &msg_size) != 1 || msg_size <= 0) {
printf("failed to convert argument '%s' to a positive integer\n", argv[1]);
return 1;
}
}
struct fi_fabric_attr* fabric_attr = calloc(1, sizeof(struct fi_fabric_attr));
fabric_attr->prov_name = strdup("udp");
struct fi_ep_attr* ep_attr = calloc(1, sizeof(struct fi_ep_attr));
ep_attr->max_msg_size = msg_size;
struct fi_info* hints = fi_allocinfo();
hints->caps = FI_SEND | FI_RECV;
hints->mode = FI_CONTEXT | FI_CONTEXT2;
hints->addr_format = FI_SOCKADDR_IN;
hints->ep_attr = ep_attr;
hints->fabric_attr = fabric_attr;
struct fi_info* info = NULL;
struct fi_info* cur;
fi_getinfo(
FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION),
NULL,
NULL,
0,
hints,
&info);
for (cur = info; cur; cur = cur->next)
printf("domain %s, max_msg_size %d\n",
cur->domain_attr->name,
cur->ep_attr ? cur->ep_attr->max_msg_size : 0);
fi_freeinfo(info);
fi_freeinfo(hints);
return 0;
} |
e2be47f
to
a1a89ce
Compare
prov/udp/src/udpx_init.c
Outdated
ofi_mutex_lock(&init_lock); | ||
udpx_util_prov_init(version, node, service, flags); | ||
ofi_mutex_unlock(&init_lock); | ||
return util_getinfo(&udpx_util_prov, version, node, service, flags, | ||
hints, info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
udpx_util_prov is initialized once with the "node", "service" and "flags" from the first fi_getinfo() call. That likely won't work properly for later fi_getinfo calls with different parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course. Please check the updated code, which should address this.
src/linux/osd.c
Outdated
size_t filename_len = strlen(mtu_filename_prefix) + | ||
strlen(mtu_filename_prefix) + | ||
IF_NAMESIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please align the 2nd and 3rd lines with the 'strlen' of the first line.
src/linux/osd.c
Outdated
if (getline(&line, &len, fd) == -1) { | ||
goto err2; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the braces here.
src/linux/osd.c
Outdated
|
||
int ofi_ifaddr_get_mtu(const struct ifaddrs *ifa) | ||
{ | ||
FILE *fd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use fp
instead. fd
is usually used for the integer file descriptor returned from open
.
src/linux/osd.c
Outdated
goto err2; | ||
} | ||
|
||
if (sscanf(line, "%d", &mtu) != 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use fscanf
instead. that eliminates the need of calling getline
.
src/linux/osd.c
Outdated
snprintf(mtu_filename, filename_len, "%s%s%s", | ||
mtu_filename_prefix, ifa->ifa_name, mtu_filename_suffix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use asprintf
instead. That way, you don't need to calculate the size of the string, and thus don't need to define the variable for the prefix and suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions. All applied in the update.
prov/udp/src/udpx_attr.c
Outdated
ifaddrs.ifa_netmask = NULL; | ||
ifaddrs.ifa_name = info->domain_attr->name; | ||
ifaddrs.ifa_addr = info->src_addr; | ||
return ofi_ifaddr_get_mtu(&ifaddrs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The window version of ofi_ifaddr_get_mtu
returns ifaddrs.mtu
which is undefined here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the updated version.
ed17f9a
to
f9814bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small things, nothing glaring. Thanks for the patch!
As far as your unit test, I would forgo putting it into our testing as it's not really testing anything but rather just printing out the values as far as I can tell.
d218488
to
2d26352
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
You can include the man page updates in this PR btw, just have it in a separate patch. But you can also do it in a separate PR - up to you!
Our CI failed but it was unrelated to your changes. Rerunning for fun but don't consider it a blocker. Not sure what the AWS failure is though.
I pushed a commit to update the man page. |
struct fi_info* info; | ||
|
||
if (udpx_util_prov.info == NULL) { | ||
udpx_util_prov.info = &udpx_info; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is redundant since it is overwritten by L153.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But udpx_util_prov
is used in the call to ofi_ip_getinfo()
, which won't work correctly otherwise.
ofi_get_list_of_addr(&udpx_prov, "iface", &addr_list); | ||
for (cur = info; cur; cur = cur->next) | ||
set_mtu_from_addr_list(cur, &addr_list); | ||
*(struct fi_info**)&udpx_util_prov.info = info; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be simplified to udpx_util_prov.info = info;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The info
field of struct util_prov
is declared const
, and the compiler complains without the cast.
prov/udp/src/udpx_attr.c
Outdated
} | ||
} | ||
|
||
void udpx_util_prov_fin() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to udpx_util_prov_fini
to be consistent with other similar cases in the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done with next commit
@@ -2182,7 +2190,7 @@ void ofi_get_list_of_addr(const struct fi_provider *prov, const char *env_name, | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addr_entry->mtu
is unitialized in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly unfamiliar with Windows, which is where this code is used. Given what I see in include/windows/config.h
, it's not clear to me that this code will actually be compiled, but I'm probably missing something. In any case, initializing addr_entry->mtu
to -1 suffices to make the program correct, although the standard default MTU will then be assumed.
addr_entry = container_of(entry, | ||
struct ofi_addr_list_entry, | ||
entry); | ||
max_msg_size = UDPX_MAX_MSG_SIZE(addr_entry->mtu); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to handle the case that addr_entry->mtu == -1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That case is handled in line 125. (The value of UDPX_MAX_MSG_SIZE() is strictly less than its argument value.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mtu valid check should be direct, not hidden behind a macro which uses an incorrect value that's then checked. The code is assuming too much about an invalid mtu value and requires too much of the reader to figure out what's happening.
For each interface detected by the udp provider, determine the MTU of the interface, and use that value to set the max_msg_size field of the fi_ep_attr and fi_tx_attr values of the fi_info element. When the MTU cannot be determined, the MTU value assumed by previous code versions (1500) is used. Signed-off-by: Martin Pokorny <[email protected]>
Change "limitations" section to describe that the maximum transfer size depends on interface MTU, and add FI_UDP_IFACE parameter description. Signed-off-by: Martin Pokorny <[email protected]>
@@ -497,6 +498,11 @@ size_t ofi_ifaddr_get_speed(struct ifaddrs *ifa) | |||
return ifa->speed; | |||
} | |||
|
|||
int ofi_ifaddr_get_mtu(const struct ifaddrs *ifa) | |||
{ | |||
return ifa->mtu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation is off.
@@ -76,6 +76,11 @@ static inline size_t ofi_ifaddr_get_speed(struct ifaddrs *ifa) | |||
return 0; | |||
} | |||
|
|||
static inline int ofi_ifaddr_get_mtu(const struct ifaddrs *ifa) | |||
{ | |||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just return 1500 as the default? Setting mtu = -1 as a default looks like a recipe for problems and missed error checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback. One reason for using the value -1 is that it removes the need to put default MTU size values in a bunch more files. If there is there a good, single place to put such a definition, I must have missed it. I'm open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually disagree with Sean for once. Returning -1 is similar to an enosys indicating that finding the MTU size is not supported/possible. Returning 1500 seems like it's falsely stating that it was able to determine that to be the MTU size. I think it's more appropriate for the default to be in the caller where it defaults to 1500 if were unable to find it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you need to add checks everywhere that the mtu size is valid prior to using it. There are several places where mtu == -1, but that value is used anyway, with, maybe, checks coming later that the calculation is valid. If the mtu can't be determined, then remove that interface from the list. I can't tell that these changes work at all on windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaaand back to agreeing with Sean. Yes, proper error checking is absolutely necessary for that to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't removing an interface whose mtu can't be determined potentially break existing client code? Doesn't the code in the main branch implicitly assume mtu=1500 for any interface supporting udp to determine the endpoint max_msg_size
and inject_size
? Falling back to that behavior seems like it might be safer (from that perspective) than removing the interface from the list. From a correctness point of view, dropping the interface seems preferable. For my use case, either choice is fine, so I could use some advice about how to proceed.
.info = &udpx_info, | ||
.flags = 0, | ||
.info = NULL, | ||
.flags = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced tab with spaces here - revert change
set_mtu_from_addr_list(cur, &addr_list); | ||
*(struct fi_info**)&udpx_util_prov.info = info; | ||
ofi_free_list_of_addr(&addr_list); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't embed the entire function inside an if statement. Revert the if check and return, then outdent the rest of the code. Please add some spaces between lines.
|
||
if (udpx_util_prov.info == NULL) { | ||
udpx_util_prov.info = &udpx_info; | ||
info = fi_allocinfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must check for failure
if (max_msg_size > 0) { | ||
info->tx_attr->inject_size = max_msg_size; | ||
info->ep_attr->max_msg_size = max_msg_size; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max_msg_size > 0 should be an assert. We shouldn't leave these sizes unset.
((ifa->ifa_addr->sa_family != AF_INET) && | ||
(ifa->ifa_addr->sa_family != AF_INET6))) | ||
continue; | ||
if (ifa->ifa_flags & IFF_LOOPBACK) { | ||
mtu = ofi_ifaddr_get_mtu(ifa); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to loopback_mtu to make it clear how this is used. On first glance, this value looks like it's ignored.
@@ -2174,6 +2182,7 @@ void ofi_get_list_of_addr(const struct fi_provider *prov, const char *env_name, | |||
addr_entry->ipaddr.sin.sin_family = AF_INET; | |||
addr_entry->ipaddr.sin.sin_addr.s_addr = | |||
iptbl->table[i].dwAddr; | |||
addr_entry->mtu = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everywhere mtu = -1 should probably use a better default than an unusable value
@@ -655,6 +655,7 @@ struct ofi_addr_list_entry { | |||
char ipstr[INET6_ADDRSTRLEN]; | |||
union ofi_sock_ip ipaddr; | |||
size_t speed; | |||
int mtu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please separate out adding / setting the mtu to ofi_addr_list_entry into its own patch. That will make it clearer which changes are generic, versus tied specifically to udp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I'll do that with the next update.
udpx_util_prov.info = &udpx_info; | ||
info = fi_allocinfo(); | ||
ofi_ip_getinfo(&udpx_util_prov, version, NULL, NULL, 0, NULL, | ||
&info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way to handle this is to add a callback to ofi_ip_getinfo(). That function is already iterating over the interfaces. The only part that's missing is the ability of the provider to update the fi_info based on the specific interface. For most callers, the callback will be null. For UDP, the callback will receive the fi_info and ofi_addr_list_entry, so that it can update the attributes based on the mtu. The callback may need to be passed through to subfunctions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do that. It seems like a more extensive change to the code, but if you're looking over my shoulder, I'll give it a try.
I should have some time to pick up work on this PR again this week. I propose doing the following:
|
@mpokorny - There are other places where callbacks are used within the getinfo() path. See |
I have a question about a possible alternative implementation to take. I noticed that |
I'm not entirely certain, but reading the code, I don't think the fi_link_attr (via fi_info::nic) is being set for the tcp/udp providers. If that were available, it would be great to make use of it. Unfortunately, the current alter_defaults() callback occurs too early in the getinfo flow to be used. ofi_ip_getinfo() -> util_getinfo() This flow checks against the provider's template fi_info list. For each fi_info that matches, the fi_info is dup'ed, then possibly altered by the provider. This alter flow allows custom changes to the template fi_info based on the input hints. The NIC interface list is checked after this flow (back in ofi_ip_getinfo()). For every fi_info returned from the above flow, a new fi_info is created, one per interface. (The exception is when a src_addr limits us to a single NIC.) The places to add nic information would be adjacent to the calls to: util_set_netif_names() Sooo... here's what I would consider:
|
This commit supports a maximum UDP message size that depends on the MTU size of the interface. The use case is UDP messaging on an Ethernet network configured to use jumbo frames.
For each interface detected by the udp provider, determine the MTU of the interface, and use that value to set the
max_msg_size
field of thefi_ep_attr
andfi_tx_attr
values of the fi_info element. When the MTU cannot be determined, the MTU value assumed by previous code versions (1500) is used.