roachtest: decommission/mixed-versions failed #139413
Comments
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ 758fe6af0492e04f677196745f28a546b3a657cc:
This looks like it could be an issue in the
My guess is that this isn't a new bug. Rather, I think it might have been uncovered by a fix that allowed these post-assertion failures to run against secure and insecure clusters: 6efeadc (Edit: maybe not, as we seem to also be seeing this on release branches)
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ 12a979cd5f2ac1d37502a1353596a1d77751ab5d:
roachtest.decommission/mixed-versions failed with artifacts on master @ 11265b25b659c9858345b484f73740359a50613b:
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ 8655f5db5fe8ff856d90f71211a3158f60f1769f:
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ 82b1a2b174dcedfd23b0a3a40c6069143f4b038f:
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ bbaa2e50b2fe789527aac09b99fa5eee432e7695:
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ ecfb74b7ad75327f11814966a2cdab1b9793a549:
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ 5e499490a29405b71d5b2d568a43fb7e6b14fe56:
With this patch, at the end of decommissioning, we call the drain step as we would for `./cockroach node drain`:

```
[...]
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     2         true                decommissioning  false        ready      0
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     1         true                decommissioning  false        ready      0
......
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     0         true                decommissioning  false        ready      0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of 26. This explains the failure in cockroachdb#140774 - cockroachdb#138732 was insufficient as it did not guarantee that the node had actually drained fully by the time it was marked as fully decommissioned and the `node decommission` had returned.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
^-- backport to 25.1-rc would fix it.

Touches cockroachdb#139411.
^-- backport to 25.1 will fix it.

Fixes cockroachdb#139413.

Release note (ops change): the node decommission cli command now waits until the target node is drained before marking it as fully decommissioned. Previously, it would start drain but not wait, leaving the target node briefly in a state where it would be unable to communicate with the cluster but would still accept client requests (which would then hang or hit unexpected errors). Note that a previous release note claimed to fix the same defect, but in fact only reduced the likelihood of its occurrence. As of this release note, this problem has truly been addressed.

Epic: None
cc @cockroachdb/test-eng
With this patch, at the end of decommissioning, we call the drain step as we would for `./cockroach node drain`:

```
[...]
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     2         true                decommissioning  false        ready      0
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     1         true                decommissioning  false        ready      0
......
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     0         true                decommissioning  false        ready      0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of 26, so before this patch, we had initiated draining, but it hadn't fully completed. I thought for a while that this could explain cockroachdb#140774, i.e. that cockroachdb#138732 was insufficient as it did not guarantee that the node had actually drained fully by the time it was marked as fully decommissioned and the `node decommission` had returned. But I found that fully draining did not fix the test, and ultimately tracked the issue down to a test infra problem. Still, this PR is a good change, that brings the drain experience in decommission on par with the standalone CLI.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
Touches cockroachdb#139411.
Touches cockroachdb#139413.

PR cockroachdb#138732 already fixed most of the drain issues, but since the decommissioning process still went ahead and shut the node out from the cluster, SQL connections that drain was still waiting for would likely hit errors (since the gateway node would not be able to connect to the rest of the cluster any more due to having been flipped to fully decommissioned). So there's a new release note for the improvement in this PR, which avoids that.

Release note (bug fix): previously, a node that was drained as part of decommissioning may have interrupted SQL connections that were still active during drain (and for which drain would have been expected to wait).

Epic: None
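The ordering described in the commit messages above (keep issuing drain requests until the remaining indicator reaches 0, and only then mark the node as fully decommissioned) can be illustrated with a small sketch. This is not CockroachDB's implementation: `drain_once`, `mark_decommissioned`, `drain_until_complete`, `decommission_with_drain`, and the `max_attempts` cutoff are all hypothetical names introduced here for illustration, and the fake replies simply mirror the `remaining: 26` and `remaining: 0` lines from the output above.

```python
from typing import Callable, List


def drain_until_complete(
    drain_once: Callable[[], int], max_attempts: int = 100
) -> List[int]:
    """Call drain_once until it reports 0 remaining; return every count seen.

    drain_once is a hypothetical stand-in for a single drain request that
    returns how much work the node still has left to shed.
    """
    history: List[int] = []
    for _ in range(max_attempts):
        remaining = drain_once()
        history.append(remaining)
        if remaining == 0:
            return history
    # Assumption for this sketch: give up after a bounded number of attempts.
    raise TimeoutError(f"drain still reported work after {max_attempts} attempts")


def decommission_with_drain(
    drain_once: Callable[[], int], mark_decommissioned: Callable[[], None]
) -> List[int]:
    """Mark the node as decommissioned only after the drain has completed."""
    history = drain_until_complete(drain_once)
    mark_decommissioned()  # hypothetical final step, reached only at remaining == 0
    return history


if __name__ == "__main__":
    # Fake node whose first drain request still has 26 items remaining.
    replies = iter([26, 0])
    seen = decommission_with_drain(
        lambda: next(replies),
        lambda: print("node marked as fully decommissioned"),
    )
    print("remaining counts observed:", seen)
```

Stopping after the first request would correspond to the earlier behavior described above, where draining had been initiated but had not necessarily finished when the node was flipped to fully decommissioned.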
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.decommission/mixed-versions failed with artifacts on master @ 758fe6af0492e04f677196745f28a546b3a657cc:
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=4
encrypted=false
mvtDeploymentMode=separate-process
mvtVersions=v23.1.21 → v23.2.6 → v24.1.8 → v24.3.2 → master
runtimeAssertionsBuild=true
ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Same failure on other branches
This test on roachdash | Improve this report!
Jira issue: CRDB-46640