
roachtest: decommission/mixed-versions failed #140774

Open
cockroach-teamcity opened this issue Feb 8, 2025 · 10 comments
Labels
branch-release-25.1.0-rc Used to mark GA and release blockers and technical advisories for 25.1.0-rc
C-test-failure Broken test (automatically or manually discovered).
O-roachtest
O-robot Originated from a bot.
T-testeng TestEng Team
X-duplicate Closed as a duplicate of another issue.

Comments

@cockroach-teamcity (Member) commented Feb 8, 2025

roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ e32dca763783603b41fc08dbc3cf40f2cc0c1ec8:

(mixedversion.go:804).Run: mixed-version test failure while running step 40 (run "test decommission"): full command output in run_063957.844480889_n2_v2430cockroach-node-.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=shared-process
  • mvtVersions=v24.1.6 → v24.2.2 → v24.3.0 → release-25.1.0-rc
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-47357

@cockroach-teamcity added the branch-release-25.1.0-rc, C-test-failure, O-roachtest, O-robot, release-blocker, and T-kv (KV Team) labels Feb 8, 2025
@cockroach-teamcity (Member Author)

roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ e32dca763783603b41fc08dbc3cf40f2cc0c1ec8:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.3.1 → release-25.1.0-rc
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@tbg added the X-duplicate (Closed as a duplicate of another issue) label Feb 10, 2025
@tbg (Member) commented Feb 10, 2025

Second failure looks like a duplicate of #139411. It looks like nobody has investigated in detail yet, but it may be connected to #138732.

@tbg self-assigned this Feb 10, 2025
@arulajmani (Collaborator):

The first failure is because of:

n1 has 33 replicas blocked with error: "1 matching stores are currently throttled: [recv msg error: rpc error: code = DeadlineExceeded desc = giving up during snapshot reservation due to cluster setting "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded]"

which isn't concerning. I'll remove the release blocker given this and Tobi's comment above.

@arulajmani removed the release-blocker label Feb 12, 2025
@tbg (Member) commented Feb 12, 2025

Investigating the latest failure.

It occurs at the end of the test: we're fully upgraded and have just finished decommissioning n1:

// Make sure we can fully decommission a node after the upgrade is complete.
sleepDur := 2 * suspectDuration
t.L().Printf("sleeping for %s", sleepDur)
sleepCtx(ctx, sleepDur)
t.L().Printf("fully decommissioning n1 via n2")
if err := fullyDecommission(ctx, c, n1, n2, test.DefaultCockroachPath); err != nil {
	t.Fatal(err)
}
}

Then the test teardown hits this:

failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(1) attached stack trace
  -- stack trace:
  | main.(*testRunner).postTestAssertions.func1
  |     pkg/cmd/roachtest/test_runner.go:1475
  | main.(*testRunner).postTestAssertions.func2.1
  |     pkg/cmd/roachtest/test_runner.go:1531
  | main.(*testRunner).postTestAssertions.func2
  |     pkg/cmd/roachtest/test_runner.go:1533

We then hit an error here:

// Detect replica divergence (i.e. ranges in which replicas have arrived
// at the same log position with different states).
if t.spec.SkipPostValidations&registry.PostValidationReplicaDivergence == 0 {
	func() {
		// NB: the consistency checks should run at the system tenant level.
		db := c.Conn(ctx, t.L(), validationNode, option.VirtualClusterName(install.SystemInterfaceName))
		defer db.Close()
		if err := c.assertConsistentReplicas(ctx, db, t); err != nil {
			postAssertionErr(errors.WithDetail(err, "consistency check failed"))
		}
	}()
}

which suggests n1 was chosen as validationNode, and this is confirmed by logs:

test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n1:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n2:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n3:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n4:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1522: running validation checks on node 1 (<10m)

The tail output of the decommissioning command:

$ tail run_064633.831009390_n2_cockroach-node-decom.log
1	true	11	true	decommissioning	false	ready	0
............
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0

No more data reported on target nodes. Please verify cluster health before removing the nodes.
run_064633.831009390_n2_cockroach-node-decom: 2025/02/10 06:49:07 cluster.go:2527: > result: <ok>

Before the "No more data reported...", the cli drained n1:

cockroach/pkg/cli/node.go

Lines 613 to 622 in e32dca7

drainReq := &serverpb.DrainRequest{
	Shutdown: false,
	DoDrain:  true,
	NodeId:   targetNode.String(),
}
stream, err := c.Drain(ctx, drainReq)
if err != nil {
	fmt.Fprintln(stderr)
	return errors.Wrapf(err, "while trying to drain n%d", targetNode)
}

so it is puzzling that a moment later, it returns 200 ok from its health?ready=1 endpoint.

The drain on n1, according to logs, is completed by 06:49:07:

I250210 **06:49:07.**180951 49415 1@server/drain.go:195 ⋮ [T1,Vsystem,n1] 309 drain request completed without server shutdown

The decommission is finished in the same second:

run_064633.831009390_n2_cockroach-node-decom: 2025/02/10 06:49:07 cluster.go:2527: > result:

I ran out of time but will poke more tomorrow.

@cockroach-teamcity (Member Author)

roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ 76c2bc942ee1d8d8d62e32047bcf3cacaa21fdc1:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.5 → v24.3.1 → release-25.1.0-rc
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@tbg (Member) commented Feb 13, 2025

Picking up where I left off above.

cockroach/pkg/server/admin.go

Lines 2139 to 2155 in e32dca7

// checkReadinessForHealthCheck returns a gRPC error.
func (s *systemAdminServer) checkReadinessForHealthCheck(ctx context.Context) error {
	if err := s.grpc.health(ctx); err != nil {
		return err
	}
	status := s.nodeLiveness.GetNodeVitalityFromCache(roachpb.NodeID(s.serverIterator.getID()))
	if !status.IsLive(livenesspb.AdminHealthCheck) {
		return grpcstatus.Errorf(codes.Unavailable, "node is not healthy")
	}
	if !s.sqlServer.isReady.Load() {
		return grpcstatus.Errorf(codes.Unavailable, "node is not accepting SQL clients")
	}
	return nil
}

This indicates that once a server has completed draining, the health?ready=1 endpoint should no longer return successfully.
So it seems n1 hadn't completed draining. Looking at the Drain API, I notice that you have to poll until the "RemainingIndicator" drops to zero; on the first call, the server merely begins draining.

I'll fix this.
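
To make that concrete, here is a minimal sketch of the polling loop I have in mind (not the actual fix): the DrainRequest construction mirrors the node.go excerpt quoted above, while the streaming Recv loop, the DrainRemainingIndicator field name, and the helper name drainUntilDone are assumptions about the API shape.

```
// Sketch only; package declaration omitted, meant to read alongside the
// node.go excerpt quoted above.
import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
	"github.com/cockroachdb/errors"
)

// drainUntilDone (hypothetical helper) re-issues the Drain RPC until the
// server reports no remaining work. DrainRemainingIndicator is an assumption
// for the response field that carries the "remaining" counter.
func drainUntilDone(
	ctx context.Context, c serverpb.AdminClient, targetNode roachpb.NodeID,
) error {
	for {
		stream, err := c.Drain(ctx, &serverpb.DrainRequest{
			Shutdown: false,
			DoDrain:  true,
			NodeId:   targetNode.String(),
		})
		if err != nil {
			return errors.Wrapf(err, "while trying to drain n%d", targetNode)
		}
		// Consume the streaming response; the last message carries the final
		// remaining-work indicator for this drain attempt.
		var remaining uint64
		for {
			resp, err := stream.Recv()
			if err == io.EOF {
				break
			}
			if err != nil {
				return err
			}
			remaining = resp.DrainRemainingIndicator
		}
		if remaining == 0 {
			return nil // the server reports the drain as complete
		}
		fmt.Fprintf(os.Stderr, "node is draining... remaining: %d\n", remaining)
	}
}
```

Whatever the exact shape, the key point is that `node decommission` should not return until a loop like this observes a zero remaining indicator.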

tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26. This explains the failure in cockroachdb#140774 - cockroachdb#138732 was insufficient as it did
not guarantee that the node had actually drained fully by the time it was
marked as fully decommissioned and the `node decommission` had returned.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
^-- backport to 25.1-rc would fix it.
Touches cockroachdb#139411.
^-- backport to 25.1 will fix it.
Fixes cockroachdb#139413.

Release note (ops change): the node decommission cli command now waits until
the target node is drained before marking it as fully decommissioned.
Previously, it would start drain but not wait, leaving the target node briefly
in a state where it would be unable to communicate with the cluster but would
still accept client requests (which would then hang or hit unexpected errors).
Note that a previous release note claimed to fix the same defect, but in fact
only reduced the likelihood of its occurrence. As of this release note, this
problem has truly been addressed.
Epic: None
@tbg (Member) commented Feb 13, 2025

I sent a PR to fix this (#141411) but guess what, decommission/mixed-versions failed in the same way. This time, drain has definitely succeeded:

draining node n1
id      is_live replicas        is_decommissioning      membership      is_draining     readiness       blocking_ranges
1       true    0       true    decommissioning false   ready   0
node is draining... remaining: 170
node is draining... remaining: 0 (complete)
node n1 drained successfully

and yet, a moment later, it happily returns 200 ok from /health?ready=true.

Even just starting to drain flips this bool:

https://github.com/tbg/cockroach/blob/ea532f15f8fcf8e5fd96042053018430f4dd2452/pkg/server/drain.go#L373-L387

and this is what the health check is sensitive to (not to mention that it is also sensitive to the gRPC server's serving mode):

https://github.com/tbg/cockroach/blob/ea532f15f8fcf8e5fd96042053018430f4dd2452/pkg/server/admin.go#L2139-L2156
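
In other words, the very first step of the drain does something equivalent to the following paraphrase (field and method names are assumptions, mirrored from the isReady.Load() check quoted earlier), and that is exactly the flag the health check reads back:

```
// Hedged paraphrase of the drain.go lines linked above, not a verbatim quote:
// as soon as a drain begins, the SQL server's readiness flag flips, so
// /health?ready=1 should start failing from that moment on.
s.sqlServer.isReady.Store(false)
```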

n1 also clearly logs that it's done:

I250213 09:38:02.336614 46067 server/drain.go:482 ⋮ [T1,Vsystem,n1] 738  SQL server drained successfully; SQL queries cannot execute any more
I250213 09:38:02.336642 46067 server/drain.go:361 ⋮ [T1,Vsystem,n1] 739  done draining clients
I250213 09:38:02.337939 46067 1@server/drain.go:312 ⋮ [T1,Vsystem,n1] 740  drain remaining: 0
I250213 09:38:02.337972 46067 1@server/drain.go:194 ⋮ [T1,Vsystem,n1] 741  drain request completed without server shutdown

By the time of the first message, the bool has already been flipped.

I definitely fixed a real bug in the PR, but there seems to be more! In the above example, the drain is unquestionably complete at 9:38:02.337, yet 700ms later the health endpoint still reported the node as not draining:

test-post-assertions: 2025/02/13 09:38:03 test_runner.go:1509: n1:/health?ready=1 status=200 ok

So increasingly it looks like we're hitting the wrong health endpoint here.

In the iterations of the test that pass, the test-post-assertions file follows a pattern in which the ./cockroach auth-session command, which precedes the health check, doesn't even work.

Looking into how these commands differ on the good vs bad runs, I see this on a "good" run:

less ../run_5/run_092952.841408532_n1_cockroach-authsessio.log
run_092952.841408532_n1_cockroach-authsessio: 2025/02/13 09:29:52 cluster.go:2575: > ./cockroach auth-session login root --url={pgurl:1} --certs-dir ./certs --only-cookie
tobias-1739437858-06-n4cpu4:[1]: ./cockroach auth-session lo...
run_092952.841408532_n1_cockroach-authsessio: 2025/02/13 09:29:52 cluster_synced.go:832: Node 1 expanded cmd: ./cockroach auth-session login root --url='postgres://roachprod:[email protected]:26257?sslcert=.%2Fcerts%2Fclient.roachprod.crt&sslkey=.%2Fcerts%2Fclient.roachprod.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' --certs-dir ./certs --only-cookie
ERROR: server is not accepting clients, try another node
SQLSTATE: 57P01
Failed running "auth-session login"

n1 is queried at :26257, and as expected it refuses to do anything. The above command is in service of getting a cookie; we eventually get one from n2, but when we then use that cookie to hit n1's health status, that fails too:

test-post-assertions: 2025/02/13 09:30:07 test_runner.go:1502: n1:/health?ready=1 status=503 body={
  "error": "node is shutting down",
  "code": 14,
  "message": "node is shutting down",
  "details": [
  ]
}

But in the bad runs, we always do end up getting a cookie from n1 (see footnote 1):

run_093802.564523804_n1_cockroach-authsessio: 2025/02/13 09:38:02 cluster.go:2575: > ./cockroach auth-session login root --url={pgurl:1} --certs-dir ./certs --only-cookie
tobias-1739437858-01-n4cpu4:[1]: ./cockroach auth-session lo...
run_093802.564523804_n1_cockroach-authsessio: 2025/02/13 09:38:02 cluster_synced.go:832: Node 1 expanded cmd: ./cockroach auth-session login root --url='postgres://roachprod:[email protected]:29000?sslcert=.%2Fcerts%2Fclient.roachprod.crt&sslkey=.%2Fcerts%2Fclient.roachprod.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' --certs-dir ./certs --only-cookie
session=CIGApKWjwfzCDhIQNA5D5nIb63htqpJFwYUdJg==; Path=/; HttpOnly
run_093802.564523804_n1_cockroach-authsessio: 2025/02/13 09:38:03 cluster.go:2609: > result: <ok>

but... this isn't the same n1! Note the port, 29000. There are some tenant shenanigans going on here.

Sure enough, the good run (run_5) uses the system-only deployment mode, while the bad run is separate-process. On n1 we have a SQL pod on :29000 (with :29001 for HTTP), but its health doesn't matter. I bet the health check also uses :29001, so it accidentally queries the SQL pod, which isn't even draining, and then hits the host cluster node with the divergence check, which promptly fails to connect.

I added an additional path to PR #138732 and can confirm that the failing cases correspond exactly to querying the health check on port :29001.

I find it difficult to parse exactly where we're making the choice to query the SQL pod and not the KV pod.

The call stack is

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/test_runner.go#L1486-L1491

which gets the admin addresses here

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L1529-L1534

and this determines the URL

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L1539-L1542

meaning that we need to look into c.ExternalAdminUIAddr. Now, roachprod adminui tobias-1739443754-03-n4cpu4:1 does stick to :26257, and tracing the ExternalAdminUIAddr call all the way down, I also don't see an obvious place where we prefer the tenant. The one place where this might happen (though I can't quite tell) is here:

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/roachprod/roachprod.go#L1154-L1162

which might pick the SQL pod port, since at the caller we invoke c.virtualCluster

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L2830-L2833

and that method probably does return the virtual cluster name when one exists:

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L2811-L2818

I'll leave it at that and throw this over the fence to T-testeng. We need to be more deliberate in these post-test assertions, and likely want to always test the health of the KV layer.

I'll still go through with the PR to more fully drain, but it didn't cause this issue, since commencing drain already disables the health endpoint.

Footnotes

  1. (side note: the test harness is a little annoying here - the post assertions launch the health check for each node concurrently, each of these goroutines then tries to acquire a cookie by iterating through all the nodes in order, so n1 gets hammered with O(nodes) cookie requests in a retry loop. Looks strange in the logs and confused me for a while but otherwise benign)

@tbg removed their assignment Feb 13, 2025
@tbg added the T-testeng (TestEng Team) label Feb 13, 2025

blathers-crl bot commented Feb 13, 2025

cc @cockroachdb/test-eng


blathers-crl bot commented Feb 13, 2025

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@exalate-issue-sync bot removed the T-kv (KV Team) label Feb 13, 2025
tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26, so before this patch, we had initiated draining, but it hadn't fully completed.

I thought for a while that this could explain cockroachdb#140774, i.e. that cockroachdb#138732 was
insufficient as it did not guarantee that the node had actually drained fully
by the time it was marked as fully decommissioned and the `node decommission`
had returned. But I found that fully draining did not fix the test, and
ultimately tracked the issue down to a test infra problem. Still, this PR is
a good change that brings the drain experience in decommission on par with
the standalone CLI.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
Touches cockroachdb#139411.
Touches cockroachdb#139413.

PR cockroachdb#138732 already fixed most of the drain issues, but since the
decommissioning process still went ahead and shut the node out
from the cluster, SQL connections that drain was still waiting
for would likely hit errors (since the gateway node would not
be able to connect to the rest of the cluster any more due to
having been flipped to fully decommissioned). So there's a new
release note for the improvement in this PR, which avoids that.

Release note (bug fix): previously, a node that was drained as part
of decommissioning may have interrupted SQL connections that were
still active during drain (and for which drain would have been
expected to wait).
Epic: None
craig bot pushed a commit that referenced this issue Feb 13, 2025
141414: roachtest: log queried URL in HealthStatus r=tbg a=tbg

See #140774 (comment).

Release note: none
Epic: none

Co-authored-by: Tobias Grieger <[email protected]>
@DarrylWong (Contributor) commented Feb 13, 2025

> I also don't see an obvious place where we prefer the tenant.

The mixed-version framework will set c.defaultVirtualCluster to the tenant when it's running in separate-process mode:

h.runner.cluster.SetDefaultVirtualCluster(s.name)

This is what AdminUIAddr checks when deciding whether to return the tenant address.

So if we don't want the tenant, I think it should be an easy change to just add the virtual cluster option:

adminAddrs, err := c.ExternalAdminUIAddr(ctx, l, nodes, option.VirtualClusterName(install.SystemInterfaceName))

I'll give it a try and send out a patch if it fixes it.
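
For illustration, a hedged sketch of how that option could thread through a post-test health-check helper; only the ExternalAdminUIAddr call with option.VirtualClusterName(install.SystemInterfaceName) is taken from the snippet above, while the helper name systemHealthURLs, the node list, and the URL scheme are assumptions:

```
import (
	"context"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
	"github.com/cockroachdb/cockroach/pkg/roachprod/install"
	"github.com/cockroachdb/cockroach/pkg/roachprod/logger"
)

// systemHealthURLs (hypothetical helper) builds /health?ready=1 URLs that
// always target the system interface (the KV layer), even when the cluster's
// default virtual cluster is a separate-process tenant.
func systemHealthURLs(
	ctx context.Context, l *logger.Logger, c cluster.Cluster,
) ([]string, error) {
	adminAddrs, err := c.ExternalAdminUIAddr(
		ctx, l, c.All(), option.VirtualClusterName(install.SystemInterfaceName),
	)
	if err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(adminAddrs))
	for _, addr := range adminAddrs {
		// Scheme assumed; secure roachtest clusters serve the admin UI over HTTPS.
		urls = append(urls, fmt.Sprintf("https://%s/health?ready=1", addr))
	}
	return urls, nil
}
```

Whether the option is applied at the ExternalAdminUIAddr call site or deeper in the health-check plumbing is a design choice for the patch.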


> (side note: the test harness is a little annoying here - the post assertions launch the health check for each node concurrently, each of these goroutines then tries to acquire a cookie by iterating through all the nodes in order, so n1 gets hammered with O(nodes) cookie requests in a retry loop. Looks strange in the logs and confused me for a while but otherwise benign)

Ack, thanks for pointing it out. I think this is a bug; it shouldn't be fetching the cookies so often. I'll look into it.
