MgmtRestore nemesis is failing with Data verification stress command, triggered by the 'mgmt_restore' nemesis, has failed assertion #4092

Open
dimakr opened this issue Oct 29, 2024 · 3 comments
Labels: qa (should be used for qa team testing tasks)

dimakr commented Oct 29, 2024

Starting from the 2024.1.12 patch release, the longevity-50gb-3days-test SCT test started to fail during the disrupt_mgmt_restore disruption with the error:

2024-10-29 11:31:45.988: (DisruptionEvent Severity.ERROR) period_type=end event_id=01af3296-0b08-40ae-afc6-e9ad49a4d221 duration=8m51s: nemesis_name=MgmtRestore target_node=Node longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-7 [54.82.73.101 | 10.12.8.71] errors=Data verification stress command, triggered by the 'mgmt_restore' nemesis, has failed
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5129, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2942, in disrupt_mgmt_restore
assert is_passed, (
AssertionError: Data verification stress command, triggered by the 'mgmt_restore' nemesis, has failed
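
For context, the assertion fires after the restore task itself completes: the nemesis launches a read ("verification") stress command against the restored keyspace and then checks its outcome. A rough sketch of that flow, with hypothetical helper names (only `is_passed` and the assertion message are taken from the traceback above):

```python
# Illustrative sketch only; helper names are assumptions, not the actual SCT API.
# Only `is_passed` and the assertion message below come from the traceback.
def disrupt_mgmt_restore(self):
    # ... take a backup with Scylla Manager, run the restore task, wait for it to finish ...
    confirmation_stress_cmd = "cassandra-stress read no-warmup cl=QUORUM n=<rows> ..."  # built per restored keyspace
    stress_queue = self.tester.run_stress_thread(stress_cmd=confirmation_stress_cmd)    # hypothetical call
    is_passed = self.tester.verify_stress_thread(stress_queue)                          # hypothetical call
    assert is_passed, (
        "Data verification stress command, triggered by the 'mgmt_restore' nemesis, has failed"
    )
```

In other words, the assertion does not mean the restore task failed; it fires when the follow-up read stress does not produce successful results.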

In this test the disrupt_mgmt_restore disruption is performed twice in a row; the first run succeeds and the second one fails, e.g.:

disrupt_mgmt_restore	longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-7	Failed	2024-10-29 11:22:54	2024-10-29 11:31:45
disrupt_terminate_kubernetes_host_then_replace_scylla_node	longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-14	Skipped	2024-10-29 11:20:21	2024-10-29 11:20:22
disrupt_mgmt_restore	longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-12	Succeeded	2024-10-29 10:36:58	2024-10-29 11:07:43

Up until the 2024.1.12 release there was no problem with this sequence.

According to the SCT logs, the second restore task finished successfully, but the verification stress command that follows it failed:

< t:2024-10-29 11:31:45,048 f:base.py         l:147  c:RemoteLibSSH2CmdRunner p:ERROR > <10.12.8.211>: Error executing command: "sudo  docker exec a8bc6b748e1ad392c8ab0b33457e4ac6da8742022bd9096967673c413f331abe /bin/sh -c 'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=PD3KR3TC3TA9EFFYOK6F; cassandra-stress read no-warmup cl=QUORUM n=1747626 -schema '"'"'keyspace=5gb_sizetiered_6_0 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)'"'"' -mode cql3 native   user=cassandra password=cassandra -rate threads=50 -col '"'"'size=FIXED(64) n=FIXED(16)'"'"' -pop seq=1..1747626 -transport '"'"'truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra'"'"' -node ip-10-12-10-151.ec2.internal,ip-10-12-8-150.ec2.internal,ip-10-12-8-71.ec2.internal,ip-10-12-8-65.ec2.internal,ip-10-12-10-209.ec2.internal,ip-10-12-11-81.ec2.internal -errors skip-unsupported-columns'"; Exit status: 137
< t:2024-10-29 11:31:45,048 f:base.py         l:150  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.8.211>: STDOUT:  23.0,  0.04753,      0,      0,       0,       0,       0,       0
< t:2024-10-29 11:31:45,048 f:base.py         l:150  c:RemoteLibSSH2CmdRunner p:DEBUG > total,        740364,   34941,   34941,   34941,     1.4,     1.2,     2.5,     3.1,     4.1,     6.2,   24.0,  0.04562,      0,      0,       0,       0,       0,       0
< t:2024-10-29 11:31:45,048 f:base.py         l:152  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.8.211>: STDERR: Failed to connect over JMX; not collecting these stats

< t:2024-10-29 11:31:45,060 f:grafana.py      l:80   c:sdcm.sct_events.grafana p:DEBUG > GrafanaEventAggregator start a new time window (90 sec)
< t:2024-10-29 11:31:45,169 f:tester.py       l:2143 c:LongevityTest        p:WARNING > There is no stress results, probably stress thread has failed.
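
Exit status 137 is 128 + 9, i.e. the cassandra-stress process inside the loader container was terminated with SIGKILL rather than exiting on its own (typically the result of an OOM kill or of the framework force-terminating a command that exceeded its timeout). A small, generic helper for decoding such shell-style exit codes (illustration only, not part of SCT):

```python
import signal

def describe_exit_status(status: int) -> str:
    """Shell convention: a status above 128 means the process was killed by signal (status - 128)."""
    if status > 128:
        sig = signal.Signals(status - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with code {status}"

print(describe_exit_status(137))  # killed by SIGKILL (signal 9)
```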

Judging by the loader node logs, the stress command did not finish (got stuck) on one of the loader nodes, which is probably what caused the verification stress command to fail. For example, in build #31 of the test, the loader node-1 log cassandra-stress-read-l1-c0-k1-e37ba171-92fa-4c59-be6f-17c4e3e07e60.log shows that the stress command execution never completed (below is the very end of the log):

Running READ with 50 threads for 1747626 iteration
...
===== Using optimized driver!!! =====
...
Connected to cluster: longevity-tls-50gb-3d-2024-1-db-cluster-7abc7f75, max pending requests per connection null, max connections per host 8
...
WARN  [main] 2024-10-29 11:31:20,023 HostConnectionPool.java:183 - Not using advanced port-based shard awareness with ip-10-12-10-209.ec2.internal/10.12.10.209:9042 because we're missing port-based shard awareness port on the server
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
Failed to connect over JMX; not collecting these stats
total,          8080,    8080,    8080,    8080,     4.7,     2.7,    15.2,    26.3,    40.0,    53.4,    1.0,  0.00000,      0,      0,       0,       0,       0,       0
total,         25107,   17027,   17027,   17027,     2.9,     2.1,     7.2,    14.4,    28.3,    35.2,    2.0,  0.24608,      0,      0,       0,       0,       0,       0
total,         45447,   20340,   20340,   20340,     2.4,     1.9,     5.4,     8.3,    12.4,    19.4,    3.0,  0.19124,      0,      0,       0,       0,       0,       0
total,         68212,   22765,   22765,   22765,     2.2,     1.8,     4.5,     6.4,     8.7,    10.8,    4.0,  0.15747,      0,      0,       0,       0,       0,       0
total,         92353,   24141,   24141,   24141,     2.0,     1.7,     3.9,     6.1,    24.3,    27.6,    5.0,  0.13530,      0,      0,       0,       0,       0,       0
total,        119421,   27068,   27068,   27068,     1.8,     1.6,     3.3,     4.5,     7.2,     9.4,    6.0,  0.12226,      0,      0,       0,       0,       0,       0
total,        152720,   33299,   33299,   33299,     1.5,     1.3,     2.6,     3.4,     6.0,     8.9,    7.0,  0.12410,      0,      0,       0,       0,       0,       0
total,        186701,   33981,   33981,   33981,     1.4,     1.3,     2.5,     3.2,     4.6,     6.4,    8.0,  0.11757,      0,      0,       0,       0,       0,       0
total,        220423,   33722,   33722,   33722,     1.5,     1.3,     2.5,     3.3,     4.5,     5.7,    9.0,  0.10858,      0,      0,       0,       0,       0,       0
total,        254412,   33989,   33989,   33989,     1.4,     1.3,     2.5,     3.2,     4.5,     6.0,   10.0,  0.10024,      0,      0,       0,       0,       0,       0
total,        288271,   33859,   33859,   33859,     1.4,     1.3,     2.5,     3.2,     4.3,     5.3,   11.0,  0.09255,      0,      0,       0,       0,       0,       0
total,        322536,   34265,   34265,   34265,     1.4,     1.3,     2.5,     3.2,     4.8,     6.7,   12.0,  0.08599,      0,      0,       0,       0,       0,       0
total,        357208,   34672,   34672,   34672,     1.4,     1.3,     2.5,     3.2,     4.2,     6.4,   13.0,  0.08033,      0,      0,       0,       0,       0,       0
total,        392007,   34799,   34799,   34799,     1.4,     1.2,     2.5,     3.2,     4.6,     8.3,   14.0,  0.07533,      0,      0,       0,       0,       0,       0
total,        426952,   34945,   34945,   34945,     1.4,     1.2,     2.5,     3.2,     4.4,     5.6,   15.0,  0.07086,      0,      0,       0,       0,       0,       0
total,        461679,   34727,   34727,   34727,     1.4,     1.2,     2.5,     3.2,     4.4,     6.4,   16.0,  0.06679,      0,      0,       0,       0,       0,       0
total,        496495,   34816,   34816,   34816,     1.4,     1.3,     2.5,     3.1,     4.2,     6.3,   17.0,  0.06316,      0,      0,       0,       0,       0,       0
total,        530839,   34344,   34344,   34344,     1.4,     1.2,     2.5,     3.3,     5.1,    32.0,   18.0,  0.05997,      0,      0,       0,       0,       0,       0
total,        565681,   34842,   34842,   34842,     1.4,     1.2,     2.5,     3.1,     4.3,     6.9,   19.0,  0.05699,      0,      0,       0,       0,       0,       0
total,        600299,   34618,   34618,   34618,     1.4,     1.3,     2.5,     3.2,     4.5,     5.9,   20.0,  0.05425,      0,      0,       0,       0,       0,       0
total,        635197,   34898,   34898,   34898,     1.4,     1.2,     2.5,     3.2,     4.5,     7.0,   21.0,  0.05180,      0,      0,       0,       0,       0,       0
total,        670344,   35147,   35147,   35147,     1.4,     1.2,     2.5,     3.1,     4.3,     6.5,   22.0,  0.04958,      0,      0,       0,       0,       0,       0
total,        705423,   35079,   35079,   35079,     1.4,     1.2,     2.4,     3.2,     4.9,     6.4,   23.0,  0.04753,      0,      0,       0,       0,       0,       0
total,        740364,   34941,   34941,   34941,     1.4,     1.2,     2.5,     3.1,     4.1,     6.2,   24.0,  0.04562,      0,      0,       0,       0,       0,       0
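
Some quick arithmetic on this log supports the "killed mid-run" reading: at the observed steady rate of roughly 35,000 ops/s, the 1,747,626-iteration read should finish in about 50 seconds, yet the output stops at the 24-second mark with only 740,364 ops completed and no final summary, which is consistent with the process being terminated (exit status 137) before it could finish and report results.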

Packages

Scylla version: 2024.1.12-20241023.6140bb5b2d0a with build-id 8c924fad33d4abafe401b24355fd2ad7458df89b
Kernel Version: 5.15.0-1071-aws

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-9 (3.91.76.239 | 10.12.10.91) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-8 (54.208.202.249 | 10.12.9.124) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-7 (54.82.73.101 | 10.12.8.71) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-6 (54.234.31.94 | 10.12.11.214) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-5 (3.82.60.181 | 10.12.10.79) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-4 (54.235.22.198 | 10.12.8.150) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-3 (54.196.216.38 | 10.12.11.173) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-2 (34.224.67.186 | 10.12.10.151) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-15 (52.91.185.105 | 10.12.11.81) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-14 (54.210.182.207 | 10.12.10.209) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-13 (54.198.130.153 | 10.12.9.218) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-12 (100.27.189.179 | 10.12.8.65) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-11 (54.152.25.50 | 10.12.10.129) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-10 (3.91.190.175 | 10.12.9.234) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-7abc7f75-1 (34.201.103.7 | 10.12.10.252) (shards: 14)

OS / Image: ami-08b1ade2fc79cebbf (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 7abc7f75-4694-44d1-b32e-a45482a28938
Test name: enterprise-2024.1/longevity/longevity-50gb-3days-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 7abc7f75-4694-44d1-b32e-a45482a28938
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7abc7f75-4694-44d1-b32e-a45482a28938

Logs:

Jenkins job URL
Argus

dimakr added the qa label Oct 29, 2024
Michal-Leszczynski (Collaborator) commented

@mikliapko is this issue more SM related or test setup related?

Michal-Leszczynski (Collaborator) commented

I wonder if this affects the SM 3.4.0 release. Those tests were run against SM 3.3.0, right?

mikliapko commented

> @mikliapko is this issue more SM related or test setup related?

It's hard to say without digging deeper.
The test uses Manager 3.3.3. I suppose we can skip it for now and release 3.4.0 first, since it's not something that was introduced in 3.4.0.
