cli: fully and properly drain target node of decommission
With this patch, at the end of decommissioning, we call the drain step as we would for `./cockroach node drain`:

```
[...]
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     2         true                decommissioning  false        ready      0
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     1         true                decommissioning  false        ready      0
......
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     0         true                decommissioning  false        ready      0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of 26. This explains the failure in cockroachdb#140774 - cockroachdb#138732 was insufficient as it did not guarantee that the node had actually drained fully by the time it was marked as fully decommissioned and the `node decommission` had returned.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
^-- backport to 25.1-rc would fix it.

Touches cockroachdb#139411.
^-- backport to 25.1 will fix it.

Fixes cockroachdb#139413.

Release note (ops change): the node decommission cli command now waits until the target node is drained before marking it as fully decommissioned. Previously, it would start drain but not wait, leaving the target node briefly in a state where it would be unable to communicate with the cluster but would still accept client requests (which would then hang or hit unexpected errors). Note that a previous release note claimed to fix the same defect, but in fact only reduced the likelihood of its occurrence. As of this release note, this problem has truly been addressed.

Epic: None
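
For illustration only, the wait-until-drained behavior described in this message can be sketched as a loop that re-issues the drain request until the reported remaining count reaches zero. This is a minimal sketch, not the code from this patch: `drainFunc`, `drainUntilComplete`, and the attempt budget are hypothetical names and parameters, and the fake responses in `main` simply mirror the `remaining: 26` and `remaining: 0` lines in the log above.

```go
package main

import (
	"errors"
	"fmt"
)

// drainFunc is a hypothetical stand-in for one drain request sent to the
// target node. It returns the remaining-work count the node reports.
type drainFunc func() (remaining uint64, err error)

// drainUntilComplete re-issues the drain request until the node reports
// zero remaining work or the (assumed) attempt budget is exhausted.
// It returns the number of requests that were made.
func drainUntilComplete(drain drainFunc, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		remaining, err := drain()
		if err != nil {
			return attempt, err
		}
		if remaining == 0 {
			// Only now is it safe to treat the node as fully drained.
			return attempt, nil
		}
	}
	return maxAttempts, errors.New("drain did not complete within the attempt budget")
}

func main() {
	// Simulated responses: the first request reports 26, the second 0.
	responses := []uint64{26, 0}
	i := 0
	fake := func() (uint64, error) {
		r := responses[i]
		i++
		return r, nil
	}
	attempts, err := drainUntilComplete(fake, 10)
	fmt.Println(attempts, err)
}
```

The point of the sketch is that a single drain request is not enough: a nonzero remaining count on the first response means the caller must keep asking before it reports the node as drained.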