feat: Supports node black list for load balance #1985
What problem does this PR solve?
#1976
In addition, this PR fixes two issues in cluster_balance_policy.
1.
incubator-pegasus/src/meta/cluster_balance_policy.cpp
Lines 201 to 202 in cd1682d
std::move(info) is called before the variable info is used, so the computed skew is wrong.
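The ordering problem can be illustrated with a minimal sketch. The referenced lines are not reproduced here, so `table_info`, `skew_of`, and `store_and_get_skew` are hypothetical names, and the skew definition (largest count minus smallest count) is a simplification assumed for illustration only.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-table data the balancer collects.
struct table_info {
    std::vector<int> replica_counts; // replica count on each node
};

// Simplified skew for this sketch: largest count minus smallest count.
int skew_of(const table_info &info)
{
    if (info.replica_counts.empty()) {
        return 0;
    }
    auto mm = std::minmax_element(info.replica_counts.begin(), info.replica_counts.end());
    return *mm.second - *mm.first;
}

// Read everything needed from info first, then move it into the container.
// Swapping the two statements would read a moved-from object, whose contents
// are valid but unspecified.
int store_and_get_skew(std::vector<table_info> &sink, table_info info)
{
    const int skew = skew_of(info);
    sink.push_back(std::move(info));
    return skew;
}
```

The general rule is that any value derived from an object should be computed before that object is passed to std::move.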
2.
incubator-pegasus/src/meta/load_balance_policy.h
Line 113 in cd1682d
incubator-pegasus/src/meta/greedy_load_balancer.h
Lines 77 to 78 in cd1682d
The variable _balancer_ignored_apps is not static, so _app_balance_policy and _cluster_balance_policy each hold a separate copy of _balancer_ignored_apps. As a result, setting _balancer_ignored_apps only takes effect on _app_balance_policy.
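The effect of a per-instance member versus a shared one can be sketched as follows. The class names `policy_separate` and `policy_shared` are hypothetical and do not mirror the real class hierarchy; a static member is shown as one possible way to share the set, not necessarily the approach taken in this PR. Locking is omitted for brevity.

```cpp
#include <set>

// Per-instance member: every policy object owns its own ignored set.
class policy_separate
{
public:
    void ignore_app(int app_id) { _ignored_apps.insert(app_id); }
    bool is_ignored(int app_id) const { return _ignored_apps.count(app_id) > 0; }

private:
    std::set<int> _ignored_apps;
};

// Static member: all policy objects see the same ignored set.
class policy_shared
{
public:
    void ignore_app(int app_id) { _ignored_apps.insert(app_id); }
    bool is_ignored(int app_id) const { return _ignored_apps.count(app_id) > 0; }

private:
    static std::set<int> _ignored_apps;
};

// Out-of-class definition of the shared storage.
std::set<int> policy_shared::_ignored_apps;
```

With `policy_separate`, an app ignored through one object stays visible to the other; with `policy_shared`, both objects agree.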
Both issues are fixed in this PR.
What is changed and how does it work?
meta.lb.ignored_nodes_list <get|set|clear> [node_addr1,nodes_addr2..]
Supports get, set, and clear commands.
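A rough sketch of how the three sub-commands could be dispatched is shown below. `handle_ignored_nodes`, its reply strings, and the choice to skip empty list items are assumptions made for illustration; they are not the handler registered by the meta server.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical handler: args[0] is the sub-command, args[1] (for "set") is a
// comma-separated node address list. Returns a textual reply.
std::string handle_ignored_nodes(std::set<std::string> &ignored,
                                 const std::vector<std::string> &args)
{
    const std::string usage = "usage: <get|set|clear> [node_addr1,node_addr2...]";
    if (args.empty()) {
        return usage;
    }
    if (args[0] == "get" && args.size() == 1) {
        std::string out;
        for (const auto &node : ignored) {
            if (!out.empty()) {
                out += ",";
            }
            out += node;
        }
        return out;
    }
    if (args[0] == "clear" && args.size() == 1) {
        ignored.clear();
        return "OK";
    }
    if (args[0] == "set" && args.size() == 2) {
        std::set<std::string> parsed;
        std::istringstream ss(args[1]);
        std::string item;
        while (std::getline(ss, item, ',')) {
            if (!item.empty()) {
                parsed.insert(item); // skip empty items such as "a,,b"
            }
        }
        ignored.swap(parsed); // "set" replaces the previous list
        return "OK";
    }
    return usage;
}
```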
The number of blacklisted nodes must not exceed the number of alive_nodes minus 2, otherwise balancing will not be possible.
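The size limit above amounts to a simple check, sketched here. `ignored_nodes_count_ok` is a hypothetical helper name; the guard for fewer than two alive nodes is added only to avoid unsigned underflow in the subtraction.

```cpp
#include <cstddef>

// Returns true when the blacklist leaves at least two alive nodes to balance
// between, i.e. ignored_count <= alive_count - 2.
bool ignored_nodes_count_ok(std::size_t ignored_count, std::size_t alive_count)
{
    return alive_count >= 2 && ignored_count <= alive_count - 2;
}
```

For example, with 6 alive nodes, blacklisting 4 passes the check and blacklisting 5 does not.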
app_balance_policy
No increase or decrease in node slicing is involved, so no restriction is applied.
The balancing strategy is the same for both phases, so the restriction method is the same.
The difference between copy primary and copy secondary is simply that the queue for copy primary is sorted based on the
number of primary slices of the table on each node. Copy secondary is sorted based on the number of all slices of the table
on each node.
Therefore, it is sufficient to exclude the blacklisted nodes when choosing id_min/id_max.
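Excluding blacklisted nodes when choosing id_min/id_max could look roughly like the sketch below. The container types, `min_max_nodes`, and `pick_min_max` are hypothetical; the real policy keeps its own node ordering and data structures.

```cpp
#include <map>
#include <set>
#include <string>

// Result of the selection; found is false when every node is blacklisted.
struct min_max_nodes {
    bool found = false;
    std::string id_min;
    std::string id_max;
};

// counts maps node address -> number of slices considered by the current phase
// (primary slices for copy primary, all slices for copy secondary).
min_max_nodes pick_min_max(const std::map<std::string, int> &counts,
                           const std::set<std::string> &ignored)
{
    min_max_nodes result;
    int min_count = 0;
    int max_count = 0;
    for (const auto &kv : counts) {
        if (ignored.count(kv.first) > 0) {
            continue; // blacklisted nodes are never chosen as source or target
        }
        if (!result.found || kv.second < min_count) {
            min_count = kv.second;
            result.id_min = kv.first;
        }
        if (!result.found || kv.second > max_count) {
            max_count = kv.second;
            result.id_max = kv.first;
        }
        result.found = true;
    }
    return result;
}
```

Because blacklisted nodes are skipped during selection, their slice counts are left untouched while the remaining nodes are balanced against each other.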
cluster_balance_policy
No increase or decrease in node slicing is involved, so no restriction is applied.
The balancing strategy is the same for both phases, so the restriction method is the same.
The only difference between copy primary and copy secondary is which slices are counted: copy primary counts the primary slices on each node, whereas copy secondary counts the secondary slices (excluding primaries).
The strategy to implement node blacklisting is:
Checklist
Tests
Use node restart and the command remote_command -t meta-server meta.lb.assign_secondary_black_list $address_list
The initial state of the cluster is:
Blacklist 172.17.0.2:34801 and 172.17.0.2:34806, then run load balancing. The final state is:
The number of slices on the two blacklisted nodes, 172.17.0.2:34801 and 172.17.0.2:34806, did not change, and the other four nodes reached a balanced state. After clearing ignored_nodes_list and balancing again, the result is:
The initial state of the cluster is:
Blacklist 172.17.0.2:34801 and 172.17.0.2:34806, then run load balancing. The final state is:
The number of slices on the two blacklisted nodes, 172.17.0.2:34801 and 172.17.0.2:34806, did not change, and the other four nodes reached a balanced state. After clearing ignored_nodes_list and balancing again, the result is: