
roachtest: decommission/mixed-versions failed #140774

Open
cockroach-teamcity opened this issue Feb 8, 2025 · 10 comments
Labels
branch-release-25.1.0-rc Used to mark GA and release blockers and technical advisories for 25.1.0-rc
C-test-failure Broken test (automatically or manually discovered).
O-roachtest
O-robot Originated from a bot.
T-testeng TestEng Team
X-duplicate Closed as a duplicate of another issue.

Comments

@cockroach-teamcity (Member) commented Feb 8, 2025

roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ e32dca763783603b41fc08dbc3cf40f2cc0c1ec8:

(mixedversion.go:804).Run: mixed-version test failure while running step 40 (run "test decommission"): full command output in run_063957.844480889_n2_v2430cockroach-node-.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=shared-process
  • mvtVersions=v24.1.6 → v24.2.2 → v24.3.0 → release-25.1.0-rc
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-47357

@cockroach-teamcity added the branch-release-25.1.0-rc, C-test-failure, O-roachtest, O-robot, release-blocker, and T-kv (KV Team) labels Feb 8, 2025
@cockroach-teamcity (Member Author)

roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ e32dca763783603b41fc08dbc3cf40f2cc0c1ec8:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.3.1 → release-25.1.0-rc
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@tbg added the X-duplicate (Closed as a duplicate of another issue) label Feb 10, 2025
@tbg (Member) commented Feb 10, 2025

Second failure looks like a duplicate of #139411. It looks like nobody has investigated in detail yet, but it may be connected to #138732.

@tbg self-assigned this Feb 10, 2025
@arulajmani (Collaborator):

The first failure is because of:

n1 has 33 replicas blocked with error: "1 matching stores are currently throttled: [recv msg error: rpc error: code = DeadlineExceeded desc = giving up during snapshot reservation due to cluster setting "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded]"

which isn't concerning. I'll remove the release blocker given this and Tobi's comment above.

@arulajmani removed the release-blocker label Feb 12, 2025
@tbg (Member) commented Feb 12, 2025

Investigating the latest failure.

It occurs at the end of the test: we're fully upgraded and have just finished decommissioning n1:

// Make sure we can fully decommission a node after the upgrade is complete.
sleepDur := 2 * suspectDuration
t.L().Printf("sleeping for %s", sleepDur)
sleepCtx(ctx, sleepDur)
t.L().Printf("fully decommissioning n1 via n2")
if err := fullyDecommission(ctx, c, n1, n2, test.DefaultCockroachPath); err != nil {
	t.Fatal(err)
}
}

Then the test teardown hits this:

failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(1) attached stack trace
  -- stack trace:
  | main.(*testRunner).postTestAssertions.func1
  |     pkg/cmd/roachtest/test_runner.go:1475
  | main.(*testRunner).postTestAssertions.func2.1
  |     pkg/cmd/roachtest/test_runner.go:1531
  | main.(*testRunner).postTestAssertions.func2
  |     pkg/cmd/roachtest/test_runner.go:1533

We then hit an error here:

// Detect replica divergence (i.e. ranges in which replicas have arrived
// at the same log position with different states).
if t.spec.SkipPostValidations&registry.PostValidationReplicaDivergence == 0 {
	func() {
		// NB: the consistency checks should run at the system tenant level.
		db := c.Conn(ctx, t.L(), validationNode, option.VirtualClusterName(install.SystemInterfaceName))
		defer db.Close()
		if err := c.assertConsistentReplicas(ctx, db, t); err != nil {
			postAssertionErr(errors.WithDetail(err, "consistency check failed"))
		}
	}()
}

which suggests n1 was chosen as validationNode, and this is confirmed by logs:

test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n1:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n2:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n3:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1506: n4:/health?ready=1 status=200 ok
test-post-assertions: 2025/02/10 06:49:08 test_runner.go:1522: running validation checks on node 1 (<10m)

The tail output of the decommissioning command:

$ tail run_064633.831009390_n2_cockroach-node-decom.log
1	true	11	true	decommissioning	false	ready	0
............
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0

No more data reported on target nodes. Please verify cluster health before removing the nodes.
run_064633.831009390_n2_cockroach-node-decom: 2025/02/10 06:49:07 cluster.go:2527: > result: <ok>

Before the "No more data reported...", the cli drained n1:

cockroach/pkg/cli/node.go

Lines 613 to 622 in e32dca7

drainReq := &serverpb.DrainRequest{
	Shutdown: false,
	DoDrain:  true,
	NodeId:   targetNode.String(),
}
stream, err := c.Drain(ctx, drainReq)
if err != nil {
	fmt.Fprintln(stderr)
	return errors.Wrapf(err, "while trying to drain n%d", targetNode)
}

so it is puzzling that a moment later, it returns 200 ok from its health?ready=1 endpoint.

The drain on n1, according to logs, is completed by 06:49:07:

I250210 **06:49:07.**180951 49415 1@server/drain.go:195 ⋮ [T1,Vsystem,n1] 309 drain request completed without server shutdown

The decommission is finished in the same second:

run_064633.831009390_n2_cockroach-node-decom: 2025/02/10 06:49:07 cluster.go:2527: > result:

I ran out of time but will poke more tomorrow.

@cockroach-teamcity (Member Author)

roachtest.decommission/mixed-versions failed with artifacts on release-25.1.0-rc @ 76c2bc942ee1d8d8d62e32047bcf3cacaa21fdc1:

(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
(test_runner.go:1475).func1: failed during post test assertions (see test-post-assertions.log): pq: server is not accepting clients, try another node
test artifacts and logs in: /artifacts/decommission/mixed-versions/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.5 → v24.3.1 → release-25.1.0-rc
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@tbg (Member) commented Feb 13, 2025

Picking up where I left off above.

cockroach/pkg/server/admin.go

Lines 2139 to 2155 in e32dca7

// checkReadinessForHealthCheck returns a gRPC error.
func (s *systemAdminServer) checkReadinessForHealthCheck(ctx context.Context) error {
	if err := s.grpc.health(ctx); err != nil {
		return err
	}
	status := s.nodeLiveness.GetNodeVitalityFromCache(roachpb.NodeID(s.serverIterator.getID()))
	if !status.IsLive(livenesspb.AdminHealthCheck) {
		return grpcstatus.Errorf(codes.Unavailable, "node is not healthy")
	}
	if !s.sqlServer.isReady.Load() {
		return grpcstatus.Errorf(codes.Unavailable, "node is not accepting SQL clients")
	}
	return nil
}

This indicates that once a server has completed draining, the health?ready=1 endpoint should no longer return successfully.
So it seems n1 hadn't completed draining. Looking at the Drain API, I notice that you have to poll until the "RemainingIndicator" drops to zero; on the first call, the server merely begins draining.

I'll fix this.
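
To make that concrete, here is a minimal sketch of the polling loop I have in mind (not the actual fix): the DrainRequest construction mirrors the node.go excerpt quoted above, while the streaming Recv loop, the DrainRemainingIndicator field name, and the helper name drainUntilDone are assumptions about the API shape.

```
// Sketch only; package declaration omitted, meant to read alongside the
// node.go excerpt quoted above.
import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
	"github.com/cockroachdb/errors"
)

// drainUntilDone (hypothetical helper) re-issues the Drain RPC until the
// server reports no remaining work. DrainRemainingIndicator is an assumption
// for the response field that carries the "remaining" counter.
func drainUntilDone(
	ctx context.Context, c serverpb.AdminClient, targetNode roachpb.NodeID,
) error {
	for {
		stream, err := c.Drain(ctx, &serverpb.DrainRequest{
			Shutdown: false,
			DoDrain:  true,
			NodeId:   targetNode.String(),
		})
		if err != nil {
			return errors.Wrapf(err, "while trying to drain n%d", targetNode)
		}
		// Consume the streaming response; the last message carries the final
		// remaining-work indicator for this drain attempt.
		var remaining uint64
		for {
			resp, err := stream.Recv()
			if err == io.EOF {
				break
			}
			if err != nil {
				return err
			}
			remaining = resp.DrainRemainingIndicator
		}
		if remaining == 0 {
			return nil // the server reports the drain as complete
		}
		fmt.Fprintf(os.Stderr, "node is draining... remaining: %d\n", remaining)
	}
}
```

Whatever the exact shape, the key point is that `node decommission` should not return until a loop like this observes a zero remaining indicator.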

tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26. This explains the failure in cockroachdb#140774 - cockroachdb#138732 was insufficient as it did
not guarantee that the node had actually drained fully by the time it was
marked as fully decommissioned and the `node decommission` had returned.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
^-- backport to 25.1-rc would fix it.
Touches cockroachdb#139411.
^-- backport to 25.1 will fix it.
Fixes cockroachdb#139413.

Release note (ops change): the node decommission cli command now waits until
the target node is drained before marking it as fully decommissioned.
Previously, it would start drain but not wait, leaving the target node briefly
in a state where it would be unable to communicate with the cluster but would
still accept client requests (which would then hang or hit unexpected errors).
Note that a previous release note claimed to fix the same defect, but in fact
only reduced the likelihood of its occurrence. As of this release note, this
problem has truly been addressed.
Epic: None
@tbg (Member) commented Feb 13, 2025

I sent a PR to fix this (#141411) but guess what, decommission/mixed-versions failed in the same way. This time, drain has definitely succeeded:

draining node n1
id      is_live replicas        is_decommissioning      membership      is_draining     readiness       blocking_ranges
1       true    0       true    decommissioning false   ready   0
node is draining... remaining: 170
node is draining... remaining: 0 (complete)
node n1 drained successfully

and yet, a moment later, it happily returns 200 ok from /health?ready=true.

Even just starting to drain flips this bool:

https://github.com/tbg/cockroach/blob/ea532f15f8fcf8e5fd96042053018430f4dd2452/pkg/server/drain.go#L373-L387

and this is what the health check is sensitive to (not to mention that it is also sensitive to the gRPC server's serving mode):

https://github.com/tbg/cockroach/blob/ea532f15f8fcf8e5fd96042053018430f4dd2452/pkg/server/admin.go#L2139-L2156
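
In other words, the very first step of the drain does something equivalent to the following paraphrase (field and method names are assumptions, mirrored from the isReady.Load() check quoted earlier), and that is exactly the flag the health check reads back:

```
// Hedged paraphrase of the drain.go lines linked above, not a verbatim quote:
// as soon as a drain begins, the SQL server's readiness flag flips, so
// /health?ready=1 should start failing from that moment on.
s.sqlServer.isReady.Store(false)
```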

n1 also clearly logs that it's done:

I250213 09:38:02.336614 46067 server/drain.go:482 ⋮ [T1,Vsystem,n1] 738  SQL server drained successfully; SQL queries cannot execute any more
I250213 09:38:02.336642 46067 server/drain.go:361 ⋮ [T1,Vsystem,n1] 739  done draining clients
I250213 09:38:02.337939 46067 1@server/drain.go:312 ⋮ [T1,Vsystem,n1] 740  drain remaining: 0
I250213 09:38:02.337972 46067 1@server/drain.go:194 ⋮ [T1,Vsystem,n1] 741  drain request completed without server shutdown

By the time of the first message, the bool has already been flipped.

I definitely fixed a real bug in the PR, but there seems to be more! In the above example, the drain is unquestionably complete at 9:38:02.337, yet 700ms later the health endpoint still reported the node as not draining:

test-post-assertions: 2025/02/13 09:38:03 test_runner.go:1509: n1:/health?ready=1 status=200 ok

So increasingly it looks like we're hitting the wrong health endpoint here.

In the iterations of the test that pass, the test-post-assertions file follows a pattern in which the ./cockroach auth-session command, which precedes the health check, doesn't even work.

Looking into how these commands differ on the good vs bad runs, I see this on a "good" run:

less ../run_5/run_092952.841408532_n1_cockroach-authsessio.log
run_092952.841408532_n1_cockroach-authsessio: 2025/02/13 09:29:52 cluster.go:2575: > ./cockroach auth-session login root --url={pgurl:1} --certs-dir ./certs --only-cookie
tobias-1739437858-06-n4cpu4:[1]: ./cockroach auth-session lo...
run_092952.841408532_n1_cockroach-authsessio: 2025/02/13 09:29:52 cluster_synced.go:832: Node 1 expanded cmd: ./cockroach auth-session login root --url='postgres://roachprod:[email protected]:26257?sslcert=.%2Fcerts%2Fclient.roachprod.crt&sslkey=.%2Fcerts%2Fclient.roachprod.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' --certs-dir ./certs --only-cookie
ERROR: server is not accepting clients, try another node
SQLSTATE: 57P01
Failed running "auth-session login"

n1 is queried at :26257, and as expected it refuses to do anything. The above command is in service of getting a cookie; we eventually get one from n2, but when we then use that cookie to hit n1's health status, that fails too:

test-post-assertions: 2025/02/13 09:30:07 test_runner.go:1502: n1:/health?ready=1 status=503 body={
  "error": "node is shutting down",
  "code": 14,
  "message": "node is shutting down",
  "details": [
  ]
}

But in the bad runs, we always do end up getting a cookie from n1 (see footnote 1):

run_093802.564523804_n1_cockroach-authsessio: 2025/02/13 09:38:02 cluster.go:2575: > ./cockroach auth-session login root --url={pgurl:1} --certs-dir ./certs --only-cookie
tobias-1739437858-01-n4cpu4:[1]: ./cockroach auth-session lo...
run_093802.564523804_n1_cockroach-authsessio: 2025/02/13 09:38:02 cluster_synced.go:832: Node 1 expanded cmd: ./cockroach auth-session login root --url='postgres://roachprod:[email protected]:29000?sslcert=.%2Fcerts%2Fclient.roachprod.crt&sslkey=.%2Fcerts%2Fclient.roachprod.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' --certs-dir ./certs --only-cookie
session=CIGApKWjwfzCDhIQNA5D5nIb63htqpJFwYUdJg==; Path=/; HttpOnly
run_093802.564523804_n1_cockroach-authsessio: 2025/02/13 09:38:03 cluster.go:2609: > result: <ok>

but... this isn't the same n1! Note the port, 29000. There are some tenant shenanigans going on here.

Sure enough, the good run (run_5) uses the system-only deployment mode, while the bad run is separate-process. On n1 we have a SQL pod on :29000 (with :29001 for HTTP), but its health doesn't matter. I bet the health check also uses :29001, so it accidentally queries the SQL pod, which isn't even draining, and then hits the host cluster node with the divergence check, which promptly fails to connect.

I added an additional path to PR #138732 and can confirm that the failing cases correspond exactly to querying the health check on port :29001.

I find it difficult to parse exactly where we're making the choice to query the SQL pod and not the KV pod.

The call stack is

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/test_runner.go#L1486-L1491

which gets the admin addresses here

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L1529-L1534

and this determines the URL

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L1539-L1542

meaning that we need to look into c.ExternalAdminUIAddr. Now, roachprod adminui tobias-1739443754-03-n4cpu4:1 does stick to :26257, and tracing the ExternalAdminUIAddr call all the way down, I also don't see an obvious place where we prefer the tenant. The one place where this might happen (though I can't quite tell) is here:

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/roachprod/roachprod.go#L1154-L1162

which might pick the SQL pod port, since at the caller we invoke c.virtualCluster

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L2830-L2833

and that method probably does return the virtual cluster name when one exists:

https://github.com/tbg/cockroach/blob/770079966b9d0559c00f9289ce2899206092d57b/pkg/cmd/roachtest/cluster.go#L2811-L2818

I'll leave it at that and throw this over the fence to T-testeng. We need to be more deliberate in these post-test assertions, and likely want to always test the health of the KV layer.

I'll still go through with the PR to more fully drain, but it didn't cause this issue, since commencing drain already disables the health endpoint.

Footnotes

  1. (side note: the test harness is a little annoying here - the post assertions launch the health check for each node concurrently, each of these goroutines then tries to acquire a cookie by iterating through all the nodes in order, so n1 gets hammered with O(nodes) cookie requests in a retry loop. Looks strange in the logs and confused me for a while but otherwise benign)

@tbg removed their assignment Feb 13, 2025
@tbg added the T-testeng (TestEng Team) label Feb 13, 2025

blathers-crl bot commented Feb 13, 2025

cc @cockroachdb/test-eng


blathers-crl bot commented Feb 13, 2025

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@exalate-issue-sync bot removed the T-kv (KV Team) label Feb 13, 2025
tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2025
With this patch, at the end of decommissioning, we call the drain step as we
would for `./cockroach node drain`:

```
[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.
```

In particular, note how the first invocation returns a RemainingIndicator of
26, so before this patch, we had initiated draining, but it hadn't fully completed.

I thought for a while that this could explain cockroachdb#140774, i.e. that cockroachdb#138732 was
insufficient as it did not guarantee that the node had actually drained fully
by the time it was marked as fully decommissioned and the `node decommission`
had returned. But I found that fully draining did not fix the test, and
ultimately tracked the issue down to a test infra problem. Still, this PR is
a good change that brings the drain experience in decommission on par with
the standalone CLI.

See cockroachdb#140774.

I verified that the modified decommission/drains roachtest passes via

```
./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive
```

Touches cockroachdb#140774.
Touches cockroachdb#139411.
Touches cockroachdb#139413.

PR cockroachdb#138732 already fixed most of the drain issues, but since the
decommissioning process still went ahead and shut the node out
from the cluster, SQL connections that drain was still waiting
for would likely hit errors (since the gateway node would not
be able to connect to the rest of the cluster any more due to
having been flipped to fully decommissioned). So there's a new
release note for the improvement in this PR, which avoids that.

Release note (bug fix): previously, a node that was drained as part
of decommissioning may have interrupted SQL connections that were
still active during drain (and for which drain would have been
expected to wait).
Epic: None
craig bot pushed a commit that referenced this issue Feb 13, 2025
141414: roachtest: log queried URL in HealthStatus r=tbg a=tbg

See #140774 (comment).

Release note: none
Epic: none

Co-authored-by: Tobias Grieger <[email protected]>
@DarrylWong (Contributor) commented Feb 13, 2025

> I also don't see an obvious place where we prefer the tenant.

The mixed-version framework will set c.defaultVirtualCluster to the tenant when it's running in separate-process mode:

h.runner.cluster.SetDefaultVirtualCluster(s.name)

This is what AdminUIAddr checks when deciding whether to return the tenant address.

So if we don't want the tenant, I think it should be an easy change to just add the virtual cluster option:

adminAddrs, err := c.ExternalAdminUIAddr(ctx, l, nodes, option.VirtualClusterName(install.SystemInterfaceName))

I'll give it a try and send out a patch if it fixes it.
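
For illustration, a hedged sketch of how that option could thread through a post-test health-check helper; only the ExternalAdminUIAddr call with option.VirtualClusterName(install.SystemInterfaceName) is taken from the snippet above, while the helper name systemHealthURLs, the node list, and the URL scheme are assumptions:

```
import (
	"context"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
	"github.com/cockroachdb/cockroach/pkg/roachprod/install"
	"github.com/cockroachdb/cockroach/pkg/roachprod/logger"
)

// systemHealthURLs (hypothetical helper) builds /health?ready=1 URLs that
// always target the system interface (the KV layer), even when the cluster's
// default virtual cluster is a separate-process tenant.
func systemHealthURLs(
	ctx context.Context, l *logger.Logger, c cluster.Cluster,
) ([]string, error) {
	adminAddrs, err := c.ExternalAdminUIAddr(
		ctx, l, c.All(), option.VirtualClusterName(install.SystemInterfaceName),
	)
	if err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(adminAddrs))
	for _, addr := range adminAddrs {
		// Scheme assumed; secure roachtest clusters serve the admin UI over HTTPS.
		urls = append(urls, fmt.Sprintf("https://%s/health?ready=1", addr))
	}
	return urls, nil
}
```

Whether the option is applied at the ExternalAdminUIAddr call site or deeper in the health-check plumbing is a design choice for the patch.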


> (side note: the test harness is a little annoying here - the post assertions launch the health check for each node concurrently, each of these goroutines then tries to acquire a cookie by iterating through all the nodes in order, so n1 gets hammered with O(nodes) cookie requests in a retry loop. Looks strange in the logs and confused me for a while but otherwise benign)

Ack, thanks for pointing it out. I think this is a bug; it shouldn't be fetching the cookies so often. I'll look into it.
