roachtest: decommission/mixed-versions failed #139413

Open
cockroach-teamcity opened this issue Jan 19, 2025 · 12 comments
Labels
B-runtime-assertions-enabled branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-testeng TestEng Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 19, 2025

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ 758fe6af0492e04f677196745f28a546b3a657cc:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v23.1.21 → v23.2.6 → v24.1.8 → v24.3.2 → master
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-46640

@cockroach-teamcity cockroach-teamcity added B-runtime-assertions-enabled branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jan 19, 2025
@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ 758fe6af0492e04f677196745f28a546b3a657cc:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.2.5 → v24.3.1 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@stevendanna
Collaborator

This looks like it could be an issue in the /health endpoint for tenants. Namely, the health endpoint is telling us that n1 is OK, but it has actually been decommissioned.
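
To make the distinction concrete, here is a minimal Go sketch of how a harness might pick nodes for post-test assertions by combining a health result with membership status, rather than trusting health alone. The `nodeStatus` type, its fields, and the membership strings are hypothetical illustrations, not roachtest's actual API.

```go
package main

import "fmt"

// nodeStatus is a hypothetical summary of what a test harness knows about a node.
type nodeStatus struct {
	ID         int
	HealthOK   bool   // what the health endpoint reported
	Membership string // assumed values: "active", "decommissioning", "decommissioned"
}

// eligibleNodes returns the IDs of nodes that report healthy AND are still
// active members. A healthy-looking but decommissioned node is skipped.
func eligibleNodes(nodes []nodeStatus) []int {
	var ids []int
	for _, n := range nodes {
		if n.HealthOK && n.Membership == "active" {
			ids = append(ids, n.ID)
		}
	}
	return ids
}

func main() {
	nodes := []nodeStatus{
		{ID: 1, HealthOK: true, Membership: "decommissioned"},
		{ID: 2, HealthOK: true, Membership: "active"},
		{ID: 3, HealthOK: false, Membership: "active"},
	}
	fmt.Println(eligibleNodes(nodes))
}
```

In this sketch n1 is excluded even though its health check passes, which is the case described above.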

@stevendanna stevendanna added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 20, 2025
@stevendanna
Collaborator

stevendanna commented Jan 20, 2025

My guess is that this isn't a new bug. Rather, I think it might have been uncovered by a fix that allowed these post-test assertions to run against both secure and insecure clusters: 6efeadc (Edit: maybe not, as we also seem to be seeing this on release branches)

@stevendanna stevendanna added P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 20, 2025
@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ 12a979cd5f2ac1d37502a1353596a1d77751ab5d:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.10 → v24.2.2 → v24.3.0 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.decommission/mixed-versions failed with artifacts on master @ 11265b25b659c9858345b484f73740359a50613b:

(mixedversion.go:804).Run: mixed-version test failure while running step 4 (run "preload data"): failed to import fixtures: full command output in run_063233.975598097_n2_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/decommission/mixed-versions/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=system-only
  • mvtVersions=v23.2.3 → v24.1.0 → v24.2.8 → v24.3.3 → master
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ 8655f5db5fe8ff856d90f71211a3158f60f1769f:

(mixedversion.go:804).Run: mixed-version test failure while running step 9 (run "preload data"): failed to import fixtures: full command output in run_063838.641880966_n2_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.3 → v24.2.4 → v24.3.2 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ 82b1a2b174dcedfd23b0a3a40c6069143f4b038f:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v23.2.5 → v24.1.10 → v24.2.3 → v24.3.2 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ bbaa2e50b2fe789527aac09b99fa5eee432e7695:

(test_runner.go:1478).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1478).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v23.2.12 → v24.1.6 → v24.3.0 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ ecfb74b7ad75327f11814966a2cdab1b9793a549:

(test_runner.go:1478).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1478).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.1 → v24.3.3 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on master @ 5e499490a29405b71d5b2d568a43fb7e6b14fe56:

(test_runner.go:1478).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1478).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.7 → v24.3.4 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@tbg tbg self-assigned this Feb 13, 2025
tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26. This explains the failure in cockroachdb#140774 - cockroachdb#138732 was insufficient as it did
not guarantee that the node had actually drained fully by the time it was
marked as fully decommissioned and the `node decommission` had returned.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
^-- backport to 25.1-rc would fix it.
Touches cockroachdb#139411.
^-- backport to 25.1 will fix it.
Fixes cockroachdb#139413.

Release note (ops change): the `node decommission` CLI command now waits until
the target node is drained before marking it as fully decommissioned.
Previously, it would start drain but not wait, leaving the target node briefly
in a state where it would be unable to communicate with the cluster but would
still accept client requests (which would then hang or hit unexpected errors).
Note that a previous release note claimed to fix the same defect, but in fact
only reduced the likelihood of its occurrence. As of this release note, this
problem has truly been addressed.
Epic: None
@tbg tbg removed their assignment Feb 13, 2025
@tbg tbg added T-testeng TestEng Team and removed T-kv KV Team labels Feb 13, 2025

blathers-crl bot commented Feb 13, 2025

cc @cockroachdb/test-eng

@tbg
Member

tbg commented Feb 13, 2025

See #140774 (comment)

tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26, so before this patch, we had initiated draining, but it hadn't fully completed.
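
As an illustration of the "keep draining until nothing remains" behavior shown in the output above, here is a minimal Go sketch. It is not the code from this patch: `drainFunc`, `drainUntilDone`, and `fakeDrain` are hypothetical names, and the real drain request is replaced by a stub that replays remaining-work values.

```go
package main

import (
	"errors"
	"fmt"
)

// drainFunc is a hypothetical stand-in for one drain request to a node.
// It returns how much work was still outstanding when the request ran.
type drainFunc func() (remaining uint64, err error)

// drainUntilDone repeats drain requests until one reports zero remaining
// work, or until maxAttempts is exhausted. It returns the number of
// requests that were sent.
func drainUntilDone(drain drainFunc, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		remaining, err := drain()
		if err != nil {
			return attempt, fmt.Errorf("drain attempt %d: %w", attempt, err)
		}
		fmt.Printf("node is draining... remaining: %d\n", remaining)
		if remaining == 0 {
			return attempt, nil
		}
	}
	return maxAttempts, errors.New("node did not finish draining")
}

// fakeDrain returns a drainFunc that replays the given remaining values
// and reports zero once they are used up.
func fakeDrain(values ...uint64) drainFunc {
	i := 0
	return func() (uint64, error) {
		if i >= len(values) {
			return 0, nil
		}
		v := values[i]
		i++
		return v, nil
	}
}

func main() {
	// Mirrors the log above: the first request reports 26, the second 0.
	attempts, err := drainUntilDone(fakeDrain(26, 0), 10)
	fmt.Println(attempts, err)
}
```

A caller that sends only the first request and proceeds would move on while 26 units of work were still outstanding; the loop only returns successfully once a request reports zero.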

I thought for a while that this could explain cockroachdb#140774, i.e. that cockroachdb#138732 was
insufficient as it did not guarantee that the node had actually drained fully
by the time it was marked as fully decommissioned and the `node decommission`
had returned. But I found that fully draining did not fix the test, and
ultimately tracked the issue down to a test infra problem. Still, this PR is
a good change that brings the drain experience in decommission on par with
the standalone CLI.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
Touches cockroachdb#139411.
Touches cockroachdb#139413.

PR cockroachdb#138732 already fixed most of the drain issues, but since the
decommissioning process still went ahead and shut the node out
from the cluster, SQL connections that drain was still waiting
for would likely hit errors (since the gateway node would not
be able to connect to the rest of the cluster any more due to
having been flipped to fully decommissioned). So there's a new
release note for the improvement in this PR, which avoids that.

Release note (bug fix): previously, a node that was drained as part
of decommissioning may have interrupted SQL connections that were
still active during drain (and for which drain would have been
expected to wait).
Epic: None