cli: fully and properly drain target node of decommission #141411
+48
−29
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
With this patch, at the end of decommissioning, we call the drain step as we
would for
./cockroach node drain
:In particular, note how the first invocation returns a RemainingIndicator of
26, so before this patch, we had initiated draining, but it hadn't fully completed.
I thought for a while that this could explain #140774, i.e. that #138732 was
insufficient as it did not guarantee that the node had actually drained fully
by the time it was marked as fully decommissioned and the
node decommission
had returned. But I found that fully draining did not fix the test, and
ultimately tracked the issue down to a test infra problem. Still, this PR is
a good change, that brings the drain experience in decommission on par with
the standalone CLI.
See #140774.
I verified that the modified decommission/drains roachtest passes via
Touches #140774.
Touches #139411.
Touches #139413.
PR #138732 already fixed most of the drain issues, but since the
decommissioning process still went ahead and shut the node out
from the cluster, SQL connections that drain was still waiting
for would likely hit errors (since the gateway node would not
be able to connect to the rest of the cluster any more due to
having been flipped to fully decommissioned). So there's a new
release note for the improvement in this PR, which avoids that.
Release note (bug fix): previously, a node that was drained as part
of decommissioning may have interrupted SQL connections that were
still active during drain (and for which drain would have been
expected to wait).
Epic: None