roachtest: decommission/mixed-versions failed #140774
roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ e32dca763783603b41fc08dbc3cf40f2cc0c1ec8:
Parameters:
Same failure on other branches
The first failure is because of:
which isn't concerning. I'll remove the release blocker given this and Tobi's comment above.
Investigating the latest failure. It occurs at the end of the test: we're fully upgraded, and just finished decommissioning n1: cockroach/pkg/cmd/roachtest/tests/mixed_version_decommission.go Lines 126 to 136 in e32dca7
Then the test teardown hits this:
We then hit an error here: cockroach/pkg/cmd/roachtest/test_runner.go Lines 1535 to 1546 in e32dca7
which suggests n1 was chosen as
The tail output of the decommissioning command:
Before the "No more data reported...", the CLI drained n1: Lines 613 to 622 in e32dca7
so it is puzzling that a moment later, it returns

The drain on n1, according to the logs, is completed by 06:49:07:
The decommission is finished in the same second:
I ran out of time but will poke more tomorrow.
roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ 76c2bc942ee1d8d8d62e32047bcf3cacaa21fdc1:
Parameters:
Same failure on other branches
Picking up where I left off above. Lines 2139 to 2155 in e32dca7
This indicates that once a server has completed draining, the

I'll fix this.
With this patch, at the end of decommissioning, we call the drain step as we would for `./cockroach node drain`:

```
[...]
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     2         true                decommissioning  false        ready      0
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     1         true                decommissioning  false        ready      0
......
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     0         true                decommissioning  false        ready      0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully
No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of 26. This explains the failure in cockroachdb#140774: cockroachdb#138732 was insufficient, as it did not guarantee that the node had actually drained fully by the time it was marked as fully decommissioned and `node decommission` had returned. See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774. ^-- backport to 25.1-rc would fix it.
Touches cockroachdb#139411. ^-- backport to 25.1 will fix it.
Fixes cockroachdb#139413.

Release note (ops change): the `node decommission` CLI command now waits until the target node is drained before marking it as fully decommissioned. Previously, it would start the drain but not wait for it to finish, leaving the target node briefly in a state where it would be unable to communicate with the cluster but would still accept client requests (which would then hang or hit unexpected errors). Note that a previous release note claimed to fix the same defect, but in fact only reduced the likelihood of its occurrence. As of this release note, the problem has truly been addressed.

Epic: None
I sent a PR to fix this (#141411) but guess what, decommission/mixed-versions failed in the same way. This time, drain has definitely succeeded:
and yet, a moment later, it happily returns

Even just starting to drain flips this bool:

and this is what the health check is sensitive to (not to mention that it is also sensitive to the gRPC server mode):

n1 also clearly logs that it's done:
By the first message, we've already flipped the bool. I definitely fixed a real bug in the PR, but there seems to be more! In the above example, the drain is definitely done at 9:38:02.337, but then 700ms later the health endpoint still said it wasn't draining:
So increasingly it looks like we're hitting the wrong health endpoint here. In the iterations of the test that pass, the

Looking into how these commands differ on the good vs bad runs, I see this on a "good" run:
n1 is queried at :26257, and as expected it refuses to do anything. The above command is in service of getting a cookie; we eventually get one from n2, but then when we use that cookie to hit n1's health status, that fails too:
But in the bad runs, we always do end up getting a cookie from n1:
but... this isn't the same n1! Note the port, 29000. There are some tenant shenanigans going on here. Sure enough, the good run (run_5) is

I added an additional path to PR #138732 and can confirm that the failing cases correspond exactly to querying the health check on port

I find it difficult to parse exactly where we're making the choice to query the SQL pod and not the KV pod. The call stack is

which gets the admin addresses here

and this determines the URL

meaning that we need to look into

which might pick the SQL pod port, since at the caller we invoke

and that method probably does return the virtual cluster name when one exists:

I'll leave it at that and throw this over the fence to T-testeng. We need to be more deliberate in these post-test assertions, and likely want to always test the health of the KV layer. I'll still go through with the PR to more fully drain, but it didn't cause this issue, since commencing drain already disables the health endpoint.
cc @cockroachdb/test-eng
This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
With this patch, at the end of decommissioning, we call the drain step as we would for `./cockroach node drain`:

```
[...]
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     2         true                decommissioning  false        ready      0
.....
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     1         true                decommissioning  false        ready      0
......
id  is_live  replicas  is_decommissioning  membership       is_draining  readiness  blocking_ranges
1   true     0         true                decommissioning  false        ready      0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully
No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of 26, so before this patch, we had initiated draining, but it hadn't fully completed. I thought for a while that this could explain cockroachdb#140774, i.e. that cockroachdb#138732 was insufficient as it did not guarantee that the node had actually drained fully by the time it was marked as fully decommissioned and `node decommission` had returned. But I found that fully draining did not fix the test, and ultimately tracked the issue down to a test infra problem. Still, this PR is a good change that brings the drain experience in decommission on par with the standalone CLI.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
Touches cockroachdb#139411.
Touches cockroachdb#139413.
PR cockroachdb#138732 already fixed most of the drain issues, but since the decommissioning process still went ahead and shut the node out from the cluster, SQL connections that drain was still waiting for would likely hit errors (since the gateway node would not be able to connect to the rest of the cluster any more due to having been flipped to fully decommissioned). So there's a new release note for the improvement in this PR, which avoids that. Release note (bug fix): previously, a node that was drained as part of decommissioning may have interrupted SQL connections that were still active during drain (and for which drain would have been expected to wait). Epic: None
141414: roachtest: log queried URL in HealthStatus r=tbg a=tbg

See #140774 (comment).

Release note: none
Epic: none

Co-authored-by: Tobias Grieger <[email protected]>
The mixed version framework will set
This is what

So if we don't want the tenant, I think it should be an easy change to just add the virtual cluster option:
I'll give it a try and send out a patch if it fixes it.
Ack, thanks for pointing it out. I think this is a bug; it shouldn't be fetching the cookies so often. I'll look into it.
roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ e32dca763783603b41fc08dbc3cf40f2cc0c1ec8:
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=4
encrypted=false
mvtDeploymentMode=shared-process
mvtVersions=v24.1.6 → v24.2.2 → v24.3.0 → release-25.1.0-rc
runtimeAssertionsBuild=false
ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Same failure on other branches
Jira issue: CRDB-47357