Update networking.rst #12938

Draft
wants to merge 1 commit into
base: main

lisaschupper-cornelisnetworks

Updated Intel to Cornelis Networks for True Scale and Omni-Path.

Signed-off-by: LisaSchupp <[email protected]>
jsquyres previously approved these changes Nov 26, 2024
Member

@jsquyres jsquyres left a comment

Thank you!

@rhc54
Contributor

rhc54 commented Nov 26, 2024

Hmmm....just to be careful. When the split occurred, Intel also maintained an "OmniPath" device because that is what they continued to call their library. Caused some friction at the time. Has that been resolved - i.e., does Intel concur with this change? Do we need to differentiate any legacy devices that are still out there - i.e., do we need to add clarifying language so an Intel device user that employs "OmniPath" software doesn't get confused by this doc?

@jsquyres jsquyres marked this pull request as draft November 26, 2024 22:54
@jsquyres
Member

jsquyres commented Nov 26, 2024

@rhc54 Good question; I wasn't aware of the history. I converted this PR to Draft to ensure we don't merge it before we figure this out.

@lisaschupper-cornelisnetworks Can you answer?

@rhc54
Contributor

rhc54 commented Nov 26, 2024

Things got more than a little confusing after the divestiture. Intel continued to use the PSM name/terminology for their software library, which caused a lot of confusion since the marketplace had largely equated OmniPath device with PSM software. In fact, Intel often referred to both device and software as "OmniPath". So as Intel continued to market devices that used PSM and PSM-2, there was a lot of brand confusion over what constituted OmniPath. Not sure how all that eventually got resolved.

Anyway, I'm no legal expert and it has been a few years since all this went down. My point was only to raise awareness to be careful that the changes don't create more confusion in the user community. A quick web search indicates that Intel continues to market PSM-based devices - not sure how you folks are trying to differentiate your "PSM" from theirs in the market. Do we need to add verbiage here?

@lisaschupper-cornelisnetworks
Author

Omni-Path is owned and trademarked by Cornelis Networks. Intel does not have any product named Omni-Path or OmniPath.
- PSM was used by True Scale (also owned by Cornelis).
- PSM2 is used by Omni-Path.
- PSM3 was created and used by Intel for an Ethernet-based software product.

Hope this answers your questions. Let me know if you need more information.
Thank you.

@rhc54
Contributor

rhc54 commented Nov 27, 2024

Thanks - appreciate the clarification. However, I probably didn't express myself clearly enough.

There has been a fair amount of confusion in the community regarding "PSM". As far as I can see, the problem hasn't really been resolved - we have two incompatible library families, both named "PSM". I think your clarifications here are fine, and I grok that you are correcting the named affiliations, but I'm wondering if this really goes far enough.

I guess what I'm trying to lead to probably merits its own PR. The problem is that we have had users who provide the specified configure flag and either point at the Intel version of the library, or configure finds the Intel version in a default location. I'm not sure our configure logic is smart enough to distinguish between the two library families - at least, the users report problems at runtime when this confusion occurs.

I'm not sure of the current direction of Intel's PSM-based product line (rumblings of a "Slingshot-like" product used to circulate), but the fact that the confusion has been reported a couple of times indicates that we might see this problem a bit more (appears that some orgs utilize the Intel Ethernet-based product in some capacity in their systems).

It would be nice if the docs could at least provide some insight into the "whose PSM?" problem so people are aware of the potential confusion - at least until the configure logic can be hardened to avoid mistakenly allowing the wrong "PSM" to be used. Might avoid some user hair pulling - and OMPI spending time trying to figure out their problem.

If that isn't possible (and I don't know the proper wording for it), then perhaps someone could follow-up soon with a review and an update (if needed) to the configure logic so we properly ignore the Intel "PSM" variant when configuring for Cornelis devices? If Intel wants/needs to specifically add support for their "PSM" product, then they can add their own configure option and/or logic for that purpose.
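
For anyone hitting the situation described above, here is a minimal sketch of how to see which PSM-named libraries are registered in the default linker cache before running configure. It assumes a Linux system where `ldconfig -p` is available. The helper name `list_psm_libs` is hypothetical, and the filter matches on the library name only, so it cannot tell whose library it is.

```shell
# Filter `ldconfig -p`-style output on stdin down to PSM-named libraries,
# printing "soname => path" for each match.
list_psm_libs() {
    awk '$1 ~ /^libpsm/ { print $1, "=>", $NF }'
}

# Usage (assumes a Linux system with ldconfig):
#   ldconfig -p | list_psm_libs
```

If more than one candidate shows up, pointing configure at the intended install explicitly with `--with-psm2=DIR` and `--with-psm2-libdir=DIR` avoids relying on the default search path.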

@jsquyres
Member

jsquyres commented Nov 27, 2024

Sidenote: as a non-PSM user myself, I do remember complaining to Intel when they made PSM3 in libfabric, indicating that exactly this kind of confusion will occur with customers.

Are you saying that there are multiple different libraries out there named "PSM" that are incompatible with each other? I'm not talking about PSM1 vs. PSM2 vs. PSM3 -- but Intel "PSM" vs. Cornelis "PSM", and they're different, with different APIs, headers, and/or functionality? Such that Open MPI's ./configure --with-psm2=/path/to/some/PSM/install may or may not work?

@jsquyres
Member

jsquyres commented Nov 27, 2024

Let me clarify my question:

Open MPI has --with-psm2. Which of the following does it work with:

  1. PSM (i.e., PSM v1)
  2. PSM2
  3. PSM3

EDIT: Open MPI only has --with-psm2 -- it does not have --with-psm. So maybe this isn't a huge deal in terms of configure. Does the documentation make this point clear? I.e., that we're always talking about PSM2 in terms of what Open MPI supports?

$ ./configure --help |& grep psm
  --disable-psm2-version-check
                          Disable PSM2 version checking. Not recommended to
  --with-psm2(=DIR)       Build PSM2 (Intel PSM2) support, optionally adding
  --with-psm2-libdir=DIR  Search for PSM (Intel PSM2) libraries in DIR

I note that configure --help explicitly states that PSM2 is an Intel product. @lisaschupper-cornelisnetworks Do you need to update that, too?

@lisaschupper-cornelisnetworks
Author

lisaschupper-cornelisnetworks commented Nov 27, 2024

I believe PSM was superseded by PSM2 when the True Scale product reached end of life.

Omni-Path PSM2 is still used. So Open MPI is correct.

As for PSM3, Intel would need to provide that information. It may not be part of Open MPI at all.

Yes, configure --help should be updated as well. Thank you. (We have more to do finding and replacing Intel with Cornelis Networks in the documentation.)

@jsquyres
Member

jsquyres commented Nov 27, 2024

@lisaschupper-cornelisnetworks I see a fair number of mentions of "PSM" and "PSM2" in the docs. Do you want to update this PR to be more comprehensive and always state PSM2? I also see at least one reference to Intel True Scale and Intel Omni-Path.

If the hood is already up here, it would probably be good to fix all the docs to make them correct. Not just a few minor changes here and there (and leave a bunch of stuff as old / stale / outdated).

> The documentation shows both --with-psm and --with-psm2 options, which sounds like it may be out-of-date?

Yep, let's fix this kind of thing, too!

> Regardless, the prior issue reports indicated that our configure "found" a libpsm2 out there that was not compatible with the system's device - but configure still passed, and the component disqualified itself at runtime when it found the library incompatible.

That's horrible as well.

> At the least, we probably need to document all this so people realize the potential for confusion

10000% agree.

@lisaschupper-cornelisnetworks Can all of these issues be addressed?

@rhc54
Contributor

rhc54 commented Nov 27, 2024

> Are you saying that there are multiple different libraries out there named "PSM" that are incompatible with each other? I'm not talking about PSM1 vs. PSM2 vs. PSM3 -- but Intel "PSM" vs. Cornelis "PSM", and they're different, with different APIs, headers, and/or functionality? Such that Open MPI's ./configure --with-psm=/path/to/some/PSM/install may or may not work?

I haven't tested it myself, and it has been a few years since I was involved in all this - so my info may not be accurate any more. What I can say is that Intel and Cornelis "PSM" were initially one-and-the-same at the time of the split. This covered both PSM1 and PSM2 libraries (since both families had been released). There was considerable debate at the time over who retained control over those - I gather from this thread that Cornelis may have eventually assumed control, but that is unclear.

The follow-on Intel devices initially used PSM2, but now have moved on to PSM3 and (most recently) PSM4. I don't know the internals, but would imagine they may be quite different. I don't know if Cornelis has made incompatible changes to PSM2 such that Intel devices can no longer use it.

The documentation shows both --with-psm and --with-psm2 options, which sounds like it may be out-of-date? Regardless, the prior issue reports indicated that our configure "found" a libpsm2 out there that was not compatible with the system's device - but configure still passed, and the component disqualified itself at runtime when it found the library incompatible. The user therefore silently devolved to using a low-performance network - which led to all the hair pulling to try and figure out why.

At the least, we probably need to document all this so people realize the potential for confusion, especially if they install an Intel Ethernet device (which I assume would be solely used for things like data storage support) and a Cornelis fabric device. Should probably also provide enough info so they understand that they need to take some care as to which PSM2 library they are installing, and would be nice if the configure logic could differentiate them. Might mean we need to add configure options that make it clear "I want the Cornelis PSM library (whatever version it is)" vs the Intel one.
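
Until configure can tell the libraries apart, the silent fallback described above can at least be made visible. A minimal sketch, assuming an Open MPI installation with `ompi_info` and `mpirun` on the PATH; the helper name `has_psm2_mtl` and the application `./my_app` are hypothetical.

```shell
# Succeed if an ompi_info-style component listing on stdin
# includes the PSM2 MTL component.
has_psm2_mtl() {
    grep -Eq 'MCA mtl: psm2( |$)'
}

# Usage (assumes ompi_info and mpirun are on PATH):
#   ompi_info | has_psm2_mtl && echo "psm2 MTL was built"
#
# Request PSM2 explicitly instead of letting Open MPI choose a transport:
#   mpirun --mca pml cm --mca mtl psm2 -np 2 ./my_app
```

Naming the `cm` PML and the `psm2` MTL explicitly should make the job abort when the PSM2 component disqualifies itself, rather than quietly selecting a slower transport.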

@lisaschupper-cornelisnetworks
Author

This first commit was a training exercise for me as I am new to GitHub PRs. 
My assignment is to be sure that Intel is scrubbed and replaced with Cornelis Networks. 
For actual updates to the content, I would need to defer to other subject matter experts in my company.
This has been an eye-opening conversation and I thank you both.

@rhc54
Contributor

rhc54 commented Nov 27, 2024

And let's not forget - there are still quite a few Intel OmniPath devices (i.e., sold prior to the split) out there. May be getting old in the teeth, but still serviceable. Do we need to provide some language here so those people know their device is still supported (as they may not know anything about Cornelis)?

@rhc54
Contributor

rhc54 commented Nov 27, 2024

> This first commit was a training exercise for me as I am new to GitHub PRs. My assignment is to be sure that Intel is scrubbed and replaced with Cornelis Networks. For actual updates to the content, I would need to defer to other subject matter experts in my company. This has been an eye-opening conversation and I thank you both.

Fully understand - and welcome! Didn't mean to muddy the waters so much - as I noted, much of this conversation may be more appropriate for a follow-on PR or two.

@lisaschupper-cornelisnetworks
Author

lisaschupper-cornelisnetworks commented Nov 27, 2024

> And let's not forget - there are still quite a few Intel OmniPath devices (i.e., sold prior to the split) out there. May be getting old in the teeth, but still serviceable. Do we need to provide some language here so those people know their device is still supported (as they may not know anything about Cornelis)?

Cornelis is responsible for these older devices. That was part of the divestiture agreement.
I believe these customers were notified immediately.

@jsquyres
Member

@lisaschupper-cornelisnetworks Haha! I echo what Ralph said -- welcome! And sorry to make you jump right into the deep end of complicated discussions for what should have been a simple issue to fix! 😂

@lisaschupper-cornelisnetworks
Author

lisaschupper-cornelisnetworks commented Nov 27, 2024

No worries. I like this kind of interaction. How else will I learn?

@jsquyres
Member

How about this:

  • For this PR:
    • Let's clean up all the "PSM" references in the docs (i.e., say PSM2 everywhere that we're talking about OMPI functionality).
    • Re-brand PSM2 as Cornelis as appropriate.
    • Add a .. note:: somewhere relevant about the whole Intel / Cornelis history to clarify what Open MPI supports in terms of products, including legacy Intel products. Might even be worth mentioning that Open MPI only supports PSM2 -- not PSM (v1) or anything after PSM2.
  • Feel free to include in this PR, or do in follow-on / future PRs:
    • Fix branding language in configure --help output
    • Fix up configure logic to avoid the problem @rhc54 said a customer recently ran into (configure "found" a libpsm2 out there that was not compatible with the system's device - but configure still passed, and the component disqualified itself at runtime when it found the library incompatible)

Does that sound reasonable?

@jsquyres jsquyres dismissed their stale review November 27, 2024 18:48

We revamped the scope of this PR after the initial approval

@lisaschupper-cornelisnetworks
Author

I do not have the authority to make these decisions. Bringing in another person to review the conversation and proposal. Thanks for your guidance @jsquyres.

@BrendanCunningham - Please review this PR and the proposed changes and let me know how we should proceed. Thanks!

@rhc54
Contributor

rhc54 commented Nov 27, 2024

> @BrendanCunningham - Please review this PR and the proposed changes and let me know how we should proceed.

Hi Brendan! Been a long time!

> Does that sound reasonable?

Sounds reasonable to me, but I defer to Brendan's input.

@jsquyres
Member

Ping @BrendanCunningham
