Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]feature: Create hyperNode cr by node label #3890

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

JesseStutler
Copy link
Member

@JesseStutler JesseStutler commented Dec 13, 2024

feature: #3888

The hypernode controller will list/watch node(only node with label volcano.sh/hypernodes will be list/watched) and hypernode, the format of label volcano.sh/hypernode is "hypernode-0,hypernode-1,...,hypernode-0`, use commas to separate the hypernodes of different layers that the node is connected to, from near to far, the front hypernode means the closer to the node.

Take this picture as an example:
image

Create Hypernodes Tree

  1. The node-0 has the label volcano.sh/hypernodes:"s0,s4,s6", then hypernode controller judge there is no s0,s4,s6 hypernodes created, it will create them sequentially( use hyperNodeAction)
  2. The node-1 then has the same label, the hypernode s0 will add the member node-1, there are no operations with s4 and s6.
  3. same as the previous...

Update Hypernode Tree

Currently, we only allow node label to change tier 1 hypernode, e.g., node-0's label can only from "s0,s4,s6" to "s1,s4,s6", that is, only allowed to change the node's parent hypernode, we don't allow to change the hypernode tier>1, like from "s0,s4,s6" to "s0,s5,s6". And we don't allow to add more hypernode or delete some hypernodes, like "s0,s4,s6" to "s0,s4" or from "s0,s4,s6" to "s0,s4,s6,s7", it may cause some strange hypernode tree structure, and then causing strange scheduling.


Now, I only designed the hypernode controller to have one goroutine to handle the node's events. If there are multiple goroutines processing events at the same time, it may cause multiple create requests of the same hypernode to be issued at the same time. Even if there is only one goroutine processing events, when the hypernode create/update requests is already sent to the api-server, the controller may still not have list/watch add/update/delete hypernode events to the hypernode controller yet, so I added hyperNodesCache. After the hypernode create/update/delete request is sent, the local cache will be updated.After hypernode create/update/delete events arriving, the controller will overwride the hypernode in cache.

@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 13, 2024
@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 13, 2024
@JesseStutler
Copy link
Member Author

Please review it and give me some feedback, thanks! @Monokaix @william-wang @wangyang0616 @penggu

@JesseStutler JesseStutler force-pushed the network_topo_controller branch from 808109d to 8cd0edd Compare December 13, 2024 15:35
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JesseStutler JesseStutler force-pushed the network_topo_controller branch 2 times, most recently from 4b37f25 to ce1bf5c Compare December 16, 2024 03:03
@hwdef
Copy link
Member

hwdef commented Dec 16, 2024

Wouldn't it be better if label volcano.sh/hypernodes had only one value, so it's even less error prone.
We can design another CRD, such as hyperSwitch
In this case.
The node-0 has the label volcano.sh/hypernodes:"s0"
hyperSwitch-s0 has the label volcano/hyperswitch:"s4"
hyperSwitch-s4 has the label volcano/hyperswitch:"s6"
Or we can design a field for hyperSwitch, such as parent,
hyperSwitch-s0's parent is s4
hyperSwitch-s4's parent is s6

Each key corresponds to a value, which looks clearer. HyperSwitch CRD can also support more scenarios in the future.
For example, when a switch fails, we can quickly take it offline.

@JesseStutler
Copy link
Member Author

Wouldn't it be better if label volcano.sh/hypernodes had only one value, so it's even less error prone. We can design another CRD, such as hyperSwitch In this case. The node-0 has the label volcano.sh/hypernodes:"s0" hyperSwitch-s0 has the label volcano/hyperswitch:"s4" hyperSwitch-s4 has the label volcano/hyperswitch:"s6" Or we can design a field for hyperSwitch, such as parent, hyperSwitch-s0's parent is s4 hyperSwitch-s4's parent is s6

Each key corresponds to a value, which looks clearer. HyperSwitch CRD can also support more scenarios in the future. For example, when a switch fails, we can quickly take it offline.

Add a CRD will become more complicated, after discussing with @Monokaix @penggu, we decide to redesign this label later, in this version of network topology, this PR may not be get merged.

@JesseStutler JesseStutler force-pushed the network_topo_controller branch from ce1bf5c to 840a08f Compare December 16, 2024 11:40
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 20, 2024
@volcano-sh-bot
Copy link
Contributor

@JesseStutler: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants