@ecolternv
Contributor

Description

Add a project design document for NVLink and topology-aware scheduling support.

Issue #206

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ecolternv ecolternv requested a review from a team January 8, 2026 21:36
@ecolternv ecolternv force-pushed the ecolter/nvlink-project-design branch from 8ba0fce to 8ff847e Compare January 14, 2026 20:29
@RyaliNvidia
Contributor

This is a more general question about the feature: once it is in place, how will you verify that NVLink is performing as expected? What kind of tests can you run to confirm that NVLink is actually being used, and that it performs as well as expected? The implementation could be correct while NVLink itself underperforms, and it would be good to be able to tell those two cases apart.
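
As a first sanity check that GPUs are actually connected by NVLink, one option is to inspect the output of `nvidia-smi topo -m`, where entries of the form `NV#` mark GPU pairs connected through NVLink. The sketch below parses such a matrix from text. The `nvlink_pairs` helper is illustrative only, and the parsing assumes one row per GPU with whitespace-separated cells and a header line that does not start with `GPU`. It does not measure bandwidth, so a throughput comparison would still be needed to judge performance.

```python
import re


def nvlink_pairs(topo_text):
    """Return {(gpu_a, gpu_b): link_count} for GPU pairs marked NV# in the matrix.

    Assumes the text looks like `nvidia-smi topo -m` output: each GPU row
    starts with its label (GPU0, GPU1, ...) followed by one cell per GPU.
    """
    rows = [
        line.split()
        for line in topo_text.splitlines()
        if re.match(r"GPU\d+\s", line)
    ]
    names = [row[0] for row in rows]
    pairs = {}
    for i, row in enumerate(rows):
        # The first len(names) cells after the row label are the GPU columns.
        for j, cell in enumerate(row[1:len(names) + 1]):
            match = re.fullmatch(r"NV(\d+)", cell)
            # Record each pair once (upper triangle of the matrix).
            if match and i < j:
                pairs[(names[i], names[j])] = int(match.group(1))
    return pairs
```

An empty result for GPUs that the scheduler placed together would suggest the topology constraint is not having the intended effect.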

@RyaliNvidia
Contributor

RyaliNvidia commented Jan 15, 2026

Is NVLink a feature we want users to opt into, so that even when it is available they can choose not to use it? They may want to build confidence that NVLink improves their performance, which means running side-by-side tests with and without it.

This would also give us a KPI on how many people switch to NVLink, and on how useful the feature is once deployed.

@RyaliNvidia RyaliNvidia reopened this Jan 15, 2026
# Topology keys that appear first are the finest grain
# (Ie multiple racks belong in the same spine)
"topology_keys": [
"topology.kubernetes.io/rack",

Consider making this a key-value pair, where the key is the user-facing field and the value is the corresponding key in Kubernetes, such as "topology.kubernetes.io/rack".
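
One way to read this suggestion, as a minimal sketch: the `topology_keys` list becomes an ordered mapping from a user-facing name to the Kubernetes key. The user-facing names (`rack`, `spine`), the `topology.kubernetes.io/spine` key, and the `resolve_topology_key` helper are hypothetical and not taken from the design document.

```python
# Hypothetical mapping: user-facing field -> Kubernetes key.
# Dicts preserve insertion order, so the finest-grained level still comes first.
TOPOLOGY_KEYS = {
    "rack": "topology.kubernetes.io/rack",
    "spine": "topology.kubernetes.io/spine",  # assumed key, for illustration only
}


def resolve_topology_key(user_field: str) -> str:
    """Translate a user-facing topology field into its Kubernetes key."""
    try:
        return TOPOLOGY_KEYS[user_field]
    except KeyError:
        known = ", ".join(TOPOLOGY_KEYS)
        raise ValueError(
            f"unknown topology field {user_field!r}; expected one of: {known}"
        ) from None
```

With a mapping like this, users would write `rack` in their workflow spec and never see the Kubernetes key directly.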

@RyaliNvidia
Contributor

Another thing to consider: can we change the resources list to show zones, racks, and other topology levels visually, so users can see why their workflow isn't scheduling even though resources are available?
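
A rough sketch of the data such a view would need: free GPUs grouped per topology group, so it becomes visible when the total is sufficient but no single rack can host the workflow. The node record shape (`labels`, `free_gpus`) and the `free_gpus_by_group` and `explain_pending` helpers are all hypothetical.

```python
from collections import defaultdict


def free_gpus_by_group(nodes, topology_key):
    """Sum free GPUs per value of the given topology label."""
    totals = defaultdict(int)
    for node in nodes:
        group = node["labels"].get(topology_key, "<unlabeled>")
        totals[group] += node["free_gpus"]
    return dict(totals)


def explain_pending(nodes, topology_key, gpus_needed):
    """Describe whether any single topology group can fit the request."""
    per_group = free_gpus_by_group(nodes, topology_key)
    total = sum(per_group.values())
    if any(free >= gpus_needed for free in per_group.values()):
        return "fits"
    if total >= gpus_needed:
        return (
            f"{total} GPUs free overall, "
            f"but no single group has {gpus_needed}: {per_group}"
        )
    return f"only {total} GPUs free overall"
```

The middle case is the one the comment is about: enough resources exist in aggregate, but they are split across groups.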
