-
Notifications
You must be signed in to change notification settings - Fork 390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make util_ns fields: run and is_initialized atomics #10570
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Dariusz Sciebura <[email protected]>
return; | ||
|
||
ofi_atomic_initialize32(&ns->ref, 0); | ||
ns->listen_sock = INVALID_SOCKET; | ||
if (!ns->hostname) | ||
ns->hostname = OFI_NS_DEFAULT_HOSTNAME; | ||
ns->is_initialized = 1; | ||
ofi_atomic_set32(&ns->is_initialized, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name service is basically a simple way to support running fabtests. It's not something we expect to have enabled as a general purpose feature.
If it were, the changes here are not sufficient to support multiple threads. Locks would be needed. It's ultimately the responsibility of the caller to have appropriate serialization to the name service calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip about the name service. I found that it is by default enabled for verbs provider, which, in case of a multithreaded application, leads to data races triggered by e.g. parallel fi_close and util_ns_accept_handler execution. Don't you think, that by as you said: "not expecting this option to be enabled as a general purpose feature" you should default this option to OFF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe what sort of endpoints the app is using? It looks like the name service is related to dgram ep's, and only started if a verbs domain is opened for dgram ep's. Note that this can occur through the rxd provider for rdm ep's.
@j-xiong - I see that the name server is enabled by default, but disabled if specific, unrelated environment variables are set (OMPI_COMM_WORLD_RANK or PMI_RANK). Should the default be to disable it, and use an environment variable to enable instead?
I'm unsure how frequently the verbs dgram provider is used in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shefty The name server is needed for verbs dgram to resolve "node" into dst_addr in fi_getinfo(). By default this is a feature an application may use. MPIs (OpenMPI, MPICH, Intel MPI) don't use this feature so the name server can be safely disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My experiments that revealed the issue were made on a Mellanox RDMA card. Using that hardware and setting up the hints the way that rdm fabtests do, I ended up in getting an RXD protocol 'based' endpoint.
When checking my libfabric based app's test, I noticed that Thread Sanitizer reports data race in libfabric and ended up here. Turning off the 'name service' with the proper env variable solves the problem completely.
So, IMO it should be:
a) turned off by default if not necessary or
b) made thread safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsciebu - I agree. Can you open an issue to track this? The name service should be made thread safe, which I'm guessing won't be that difficult. Likely just needs a lock around most of the functions you identified.
Values of util_ns fields can be read/write in parallel from multiple threads. Making them atomics secures these kind of access.
Signed-off-by: Dariusz Sciebura [email protected]