roachtest: decommission/mixed-versions failed #139411

Open
cockroach-teamcity opened this issue Jan 19, 2025 · 13 comments
Labels
A-kv-server Relating to the KV-level RPC server branch-release-25.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-testeng TestEng Team X-duplicate Closed as a duplicate of another issue.

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 19, 2025

roachtest.decommission/mixed-versions failed with artifacts on release-25.1 @ 6123ebaa10c3847899aa81b09ec4cb8d08c0bd6d:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.8 → v24.3.2 → release-25.1
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-46638

@cockroach-teamcity cockroach-teamcity added branch-release-25.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jan 19, 2025
@stevendanna
Collaborator

Looks like a duplicate of #139411 but going to leave this open until that is solved as this will likely keep failing.

@stevendanna stevendanna added the X-duplicate Closed as a duplicate of another issue. label Jan 20, 2025
@arulajmani
Collaborator

@tbg maybe after a few more weeks of baking we should consider backporting #138732? Wdyt?

@arulajmani arulajmani added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 21, 2025
@stevendanna
Collaborator

@arulajmani I'm not sure that just that will do it. Note that #139413 is the same failure on master after that was merged afaik.

@arulajmani
Collaborator

I missed that failure, my bad.

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on release-25.1 @ 0fb79d22b0fc0719f28a1e3993be2c4babf5155e:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.3.1 → release-25.1
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.decommission/mixed-versions failed with artifacts on release-25.1 @ 581433a5f096e05cb0654a114120c9f782f06588:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v23.1.3 → v23.2.6 → v24.1.4 → v24.3.0 → release-25.1
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.decommission/mixed-versions failed with artifacts on release-25.1 @ 38229635f670bbdd2b22640a55a076d3ae373ead:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.11 → v24.2.4 → v24.3.1 → release-25.1
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on release-25.1 @ 7c11725f143617e0d0dba932ca8eb2a23bbcaa8b:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.5 → v24.2.9 → v24.3.3 → release-25.1
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.decommission/mixed-versions failed with artifacts on release-25.1 @ 9893597b3d339f77fbcb6c59e9881c86cb384396:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.2.9 → v24.3.3 → release-25.1
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.decommission/mixed-versions failed with artifacts on release-25.1 @ e27a850a967ee653acf52e0906e03da308631212:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v23.1.27 → v23.2.8 → v24.1.4 → v24.3.2 → release-25.1
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@iskettaneh iskettaneh added the P-2 Issues/test failures with a fix SLA of 3 months label Feb 10, 2025
@tbg tbg assigned tbg and unassigned stevendanna Feb 13, 2025
tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26. This explains the failure in cockroachdb#140774 - cockroachdb#138732 was insufficient as it did
not guarantee that the node had actually drained fully by the time it was
marked as fully decommissioned and the `node decommission` had returned.
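
As a rough illustration of the "keep draining until nothing remains" behavior described above, here is a minimal Go sketch. It is not the code from the patch: `drainStep`, `drainUntilComplete`, and `fakeNode` are hypothetical names, and the fake node simply replays the remaining counts seen in the output above (26, then 0).

```go
package main

import (
	"errors"
	"fmt"
)

// drainStep is a hypothetical stand-in for one drain request to a node.
// It returns how much work was still outstanding when the call was made.
type drainStep func() (remaining int, err error)

// drainUntilComplete repeats drain steps until one reports zero remaining
// work, or until maxAttempts is exhausted. It returns the number of calls made.
func drainUntilComplete(step drainStep, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		remaining, err := step()
		if err != nil {
			return attempt, err
		}
		if remaining == 0 {
			fmt.Println("node is draining... remaining: 0 (complete)")
			return attempt, nil
		}
		fmt.Printf("node is draining... remaining: %d\n", remaining)
	}
	return maxAttempts, errors.New("drain did not complete within the attempt budget")
}

// fakeNode simulates a node that reports the given values in order, then 0.
func fakeNode(reports ...int) drainStep {
	i := 0
	return func() (int, error) {
		if i < len(reports) {
			r := reports[i]
			i++
			return r, nil
		}
		return 0, nil
	}
}

func main() {
	// A single drain call would have stopped after seeing 26 remaining;
	// the loop only returns once a call reports 0.
	calls, err := drainUntilComplete(fakeNode(26), 10)
	fmt.Println(calls, err)
}
```

The attempt budget is an assumption of this sketch, added so the loop cannot spin forever; the source does not say how the real command bounds its wait.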

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
^-- backport to 25.1-rc would fix it.
Touches cockroachdb#139411.
^-- backport to 25.1 will fix it.
Fixes cockroachdb#139413.

Release note (ops change): the node decommission CLI command now waits until
the target node is drained before marking it as fully decommissioned.
Previously, it would start the drain but not wait, leaving the target node briefly
in a state where it would be unable to communicate with the cluster but would
still accept client requests (which would then hang or hit unexpected errors).
Note that a previous release note claimed to fix the same defect, but in fact
only reduced the likelihood of its occurrence. As of this release note, this
problem has truly been addressed.
Epic: None
@tbg tbg removed their assignment Feb 13, 2025
@tbg tbg added the T-testeng TestEng Team label Feb 13, 2025

blathers-crl bot commented Feb 13, 2025

cc @cockroachdb/test-eng


blathers-crl bot commented Feb 13, 2025

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg
Member

tbg commented Feb 13, 2025

See #140774 (comment)

@tbg tbg removed the T-kv KV Team label Feb 13, 2025
tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26, so before this patch, we had initiated draining, but it hadn't fully completed.

I thought for a while that this could explain cockroachdb#140774, i.e. that cockroachdb#138732 was
insufficient as it did not guarantee that the node had actually drained fully
by the time it was marked as fully decommissioned and the `node decommission`
had returned. But I found that fully draining did not fix the test, and
ultimately tracked the issue down to a test infra problem. Still, this PR is
a good change that brings the drain experience in decommission on par with
the standalone CLI.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
Touches cockroachdb#139411.
Touches cockroachdb#139413.

PR cockroachdb#138732 already fixed most of the drain issues, but since the
decommissioning process still went ahead and shut the node out
from the cluster, SQL connections that drain was still waiting
for would likely hit errors (since the gateway node would not
be able to connect to the rest of the cluster any more due to
having been flipped to fully decommissioned). So there's a new
release note for the improvement in this PR, which avoids that.
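
The ordering problem described above can be modeled with a toy Go sketch. Everything here is an assumption made for illustration: the `node` type, its fields, and the batch size of 10 are hypothetical and do not reflect CockroachDB's real data structures or its drain accounting. The model only shows why marking a node as fully decommissioned before the drain has finished interrupts whatever connections are still open.

```go
package main

import "fmt"

// node is a toy model, not CockroachDB's real representation of a node.
type node struct {
	activeSQLConns int  // connections the drain is still waiting for
	decommissioned bool // true once the node is shut out of the cluster
	interrupted    int  // connections cut off instead of closed gracefully
}

// drainOnce gracefully closes up to batch connections and returns how many
// were still open when the call began.
func (n *node) drainOnce(batch int) int {
	remaining := n.activeSQLConns
	if batch > n.activeSQLConns {
		batch = n.activeSQLConns
	}
	n.activeSQLConns -= batch
	return remaining
}

// markDecommissioned shuts the node out of the cluster. Any connection that
// is still open at that point is counted as interrupted.
func (n *node) markDecommissioned() {
	n.decommissioned = true
	n.interrupted += n.activeSQLConns
	n.activeSQLConns = 0
}

// decommission starts a drain and, if waitForDrain is set, keeps draining
// until a call reports nothing remaining before flipping the membership.
func decommission(n *node, waitForDrain bool) int {
	remaining := n.drainOnce(10)
	for waitForDrain && remaining > 0 {
		remaining = n.drainOnce(10)
	}
	n.markDecommissioned()
	return n.interrupted
}

func main() {
	fmt.Println("without waiting:", decommission(&node{activeSQLConns: 26}, false))
	fmt.Println("with waiting:   ", decommission(&node{activeSQLConns: 26}, true))
}
```

In this model the first call leaves 16 connections interrupted and the second leaves none, which is the difference the release note below describes.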

Release note (bug fix): previously, a node that was drained as part
of decommissioning may have interrupted SQL connections that were
still active during drain (and for which drain would have been
expected to wait).
Epic: None