Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test TestAuthLeaseTimeToLive #18585

Closed
lucasrod16 opened this issue Sep 13, 2024 · 8 comments
Closed

Flaky test TestAuthLeaseTimeToLive #18585

lucasrod16 opened this issue Sep 13, 2024 · 8 comments

Comments

@lucasrod16
Copy link
Contributor

Which Github Action / Prow Jobs are flaking?

e2e-386

Which tests are flaking?

TestAuthLeaseTimeToLive

Github Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/18574/pull-etcd-e2e-386/1834665788858961920

Reason for failure (if possible)

auth_test.go:847: 
        	Error Trace:	/home/prow/go/src/github.com/etcd-io/etcd/tests/common/auth_test.go:847
        	            				/home/prow/go/src/github.com/etcd-io/etcd/tests/framework/testutils/execute.go:38
        	            				/home/prow/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_386.s:1363
        	Error:      	An error is expected but got nil.
        	Test:       	TestAuthLeaseTimeToLive
        	Messages:   	test-user must not be able to access to the lease, because it's attached to the key bar

Anything else we need to know?

No response

@jmhbnz
Copy link
Member

jmhbnz commented Sep 26, 2024

Discussed during sig-etcd triage meeting. Next step for this would be to try and establish a flake rate percentage using stress.

@ghouscht
Copy link
Contributor

ghouscht commented Sep 27, 2024

Discussed during sig-etcd triage meeting. Next step for this would be to try and establish a flake rate percentage using stress.

I keep reading about this stress tool in etcd issues/repo. One example is in the contributing guide here. However I'm not really able to figure out which stress tool is meant by that. Could you maybe clarify this (sorry if this is a dumb question, I'm not an active maintainer/contributer of etcd)?
I'm aware of the following stress tools but neither of them seems to be the right one (at least they don't support the flags/arguments as documented in the contributing guide):

Might also be good to add this to the contributing guide afterwards so other newbies like me aren't lost 🙂 (I can take care of that!)

Sorry for the question being a bit off-topic, I'd really like to help a bit with this issue but once again I'm confused about the referred to stress tooling. Maybe I'm also overthinking this a bit and you simply meant to create some system load with stress and then try to reproduce the test failure?

@lucasrod16
Copy link
Contributor Author

I keep reading about this stress tool in etcd issues/repo. One example is in the contributing guide here. However I'm not really able to figure out which stress tool is meant by that. Could you maybe clarify this (sorry if this is a dumb question, I'm not an active maintainer/contributer of etcd)? I'm aware of the following stress tools but neither of them seems to be the right one (at least they don't support the flags/arguments as documented in the contributing guide):

* https://github.com/resurrecting-open-source-projects/stress

* https://github.com/ColinIanKing/stress-ng

Might also be good to add this to the contributing guide afterwards so other newbies like me aren't lost 🙂 (I can take care of that!)

Sorry for the question being a bit off-topic, I'd really like to help a bit with this issue but once again I'm confused about the referred to stress tooling. Maybe I'm also overthinking this a bit and you simply meant to create some system load with stress and then try to reproduce the test failure?

@ghouscht this is the stress tool being referenced: https://pkg.go.dev/golang.org/x/tools/cmd/stress

You can install it by running:

go install golang.org/x/tools/cmd/stress@latest

It would be helpful to clarify in the contributing guide

@ghouscht
Copy link
Contributor

/assign

Thank you, I opened a small PR that improves the contributing guide with the above info. Now that I know which stress tool to use I'll try to investigate this issue.

@ghouscht
Copy link
Contributor

While preparing to run the tests with stress I ran them a few times manually (in e2e mode) and I guess I was lucky and saw this failure (same as described in this issue):

➜  common git:(issue-18585) go test -v -count 1 -run "^TestAuthLeaseTimeToLive$" -tags=e2e 
=== RUN   TestAuthLeaseTimeToLive
    before.go:36: Changing working directory to: /var/folders/p4/ddf4p2tn7n5_62ynylp1hbf80000gn/T/TestAuthLeaseTimeToLive870828400/001
    logger.go:146: 2024-09-30T10:02:31.369+0200 INFO    starting server...      {"name": "TestAuthLeaseTimeToLive-test-0"}
    logger.go:146: 2024-09-30T10:02:31.369+0200 INFO    spawning process        {"args": ["/Users/thomas/Documents/github.com/ghouscht/etcd/bin/etcd", "--name=TestAuthLeaseTimeToLive-test-0", "--listen-client-urls=http://localhost:20000", "--advertise-client-urls=http://localhost:20000", "--listen-peer-urls=http://localhost:20001", "--initial-advertise-peer-urls=http://localhost:20001", "--initial-cluster-token=new", "--data-dir", "/var/folders/p4/ddf4p2tn7n5_62ynylp1hbf80000gn/T/TestAuthLeaseTimeToLive870828400/002", "--snapshot-count=10000", "--strict-reconfig-check=false", "--auth-token=jwt,pub-key=/Users/thomas/Documents/github.com/ghouscht/etcd/tests/fixtures/server.crt,priv-key=/Users/thomas/Documents/github.com/ghouscht/etcd/tests/fixtures/server.key.insecure,sign-method=RS256,ttl=1s", "--initial-cluster-token=new", "--initial-cluster=TestAuthLeaseTimeToLive-test-0=http://localhost:20001", "--initial-cluster-state=new"], "working-dir": "/private/var/folders/p4/ddf4p2tn7n5_62ynylp1hbf80000gn/T/TestAuthLeaseTimeToLive870828400/001", "name": "TestAuthLeaseTimeToLive-test-0", "environment-variables": ["PATH=/opt/homebrew/Cellar/go/1.23.1/libexec/bin:/Users/thomas/go/bin:/Users/thomas/.krew/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Users/thomas/go/bin:/Users/thomas/.krew/bin:/Applications/iTerm.app/Contents/Resources/utilities", "ETCD_UNSUPPORTED_ARCH=arm64", "ETCD_VERIFY=all"]}
    logger.go:146: 2024-09-30T10:02:31.978+0200 INFO    started server. {"name": "TestAuthLeaseTimeToLive-test-0", "pid": 72953}
    auth_test.go:847: 
                Error Trace:    /Users/thomas/Documents/github.com/ghouscht/etcd/tests/common/auth_test.go:847
                                                        /Users/thomas/Documents/github.com/ghouscht/etcd/tests/framework/testutils/execute.go:38
                                                        /opt/homebrew/Cellar/go/1.23.1/libexec/src/runtime/asm_arm64.s:1223
                Error:          An error is expected but got nil.
                Test:           TestAuthLeaseTimeToLive
                Messages:       test-user must not be able to access to the lease, because it's attached to the key bar
    logger.go:146: 2024-09-30T10:02:33.098+0200 INFO    closing test cluster...
    logger.go:146: 2024-09-30T10:02:33.098+0200 INFO    stopping server...      {"name": "TestAuthLeaseTimeToLive-test-0"}
    logger.go:146: 2024-09-30T10:02:33.134+0200 INFO    stopped server. {"name": "TestAuthLeaseTimeToLive-test-0"}
    logger.go:146: 2024-09-30T10:02:33.134+0200 INFO    closing server...       {"name": "TestAuthLeaseTimeToLive-test-0"}
    logger.go:146: 2024-09-30T10:02:33.134+0200 INFO    removing directory      {"data-dir": "/var/folders/p4/ddf4p2tn7n5_62ynylp1hbf80000gn/T/TestAuthLeaseTimeToLive870828400/002"}
    logger.go:146: 2024-09-30T10:02:33.135+0200 INFO    closed test cluster.
--- FAIL: TestAuthLeaseTimeToLive (1.77s)
FAIL
exit status 1
FAIL    go.etcd.io/etcd/tests/v3/common 2.133s

However, running the tests in e2e mode with stress is not possible without modification because each test run tries to start its own etcd server, binding the same port which obviusly doesn't work. In integration mode I wasn't able to reproduce this issue (8 parallel runs, >10k executed with 0 failures).

To be able to run the tests in e2e mode with stress I made the following modifications in tests/framework/e2e/e2e.go to "randomize" the ports of the etcd server (obviously not a very good solution but good enough as a quick fix) :

 func (e e2eRunner) NewCluster(ctx context.Context, t testing.TB, opts ...config.ClusterOption) intf.Cluster {
        cfg := config.NewClusterConfig(opts...)
+
+       lis, err := net.Listen("tcp", ":0")
+       if err != nil {
+               panic(err)
+       }
+       lis.Close()
+
+       _, port, err := net.SplitHostPort(lis.Addr().String())
+       if err != nil {
+               panic(err)
+       }
+
+       t.Logf("using base port: %s", port)
+
+       p, _ := strconv.Atoi(port)
+
        e2eConfig := NewConfig(
                WithClusterSize(cfg.ClusterSize),
                WithQuotaBackendBytes(cfg.QuotaBackendBytes),
                WithStrictReconfigCheck(cfg.StrictReconfigCheck),
                WithAuthTokenOpts(cfg.AuthToken),
                WithSnapshotCount(cfg.SnapshotCount),
+               WithBasePort(p),
        )
 
        if cfg.ClusterContext != nil {

In 1021 runs, I saw 37 failures in total, 3 of them were due to the described issue and the remaining 34 were due to port conflicts (race condition beteween asking the kernel for a free port and actually using it in the parallel test runs). So that is roughly a 0.3% failure rate for the given test

@ghouscht
Copy link
Contributor

ghouscht commented Sep 30, 2024

Some more observations I made in the meantime. When this happens etcdctl logs:

{"level":"warn","ts":"2024-09-13T19:12:48.002655Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xa264488/localhost:20000","method":"/etcdserverpb.Lease/LeaseTimeToLive","attempt":0,"error":"rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}

and the etcd server:

{"level":"warn","ts":"2024-09-13T19:12:48.002305Z","caller":"auth/jwt.go:66","msg":"failed to parse a JWT token","error":"Token is expired"}
{"level":"warn","ts":"2024-09-13T19:12:48.002422Z","caller":"auth/store.go:1076","msg":"invalid auth token","token":"eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3MjYyNTQ3NjgsInJldmlzaW9uIjo3LCJ1c2VybmFtZSI6InRlc3QtdXNlciJ9.IDVFHDy0c481j9Le0KhBbf_7hsXm-MUq0XzF9H9QxUVFHQ2HNtksLo8_3BlV4Y3KF1VphtkD1_u6jAkVrq6UyRRxCm5j8Td871GhmVp5ci0bZwtNqRC0P3mw-GzuEWmcrIxSA_O7HgwFgfh8qEECF9IerthDmTrDdGfTx14og9wA5rlmKtt7W8DSkTsebSIWUo2RZEhVh37adSzxJ3TCmXescfNeGybDtvS4mBNXh7KslP-5o_jcfWJBWZNWe4eejjLHfj2cGkCsk4pqQy4TggaEKOVYn8_-xuCaQyowmTafkZtipNvNzruPkjG8cJwG_AfXUucNPeSDkX1sXBIbVQ"}

This can be seen in the prow job output as well. Not sure why the token is not valid but I think in this case etcdctl's exit code is 0 which leads to the error in the test being nil. I suppose that the response to the testUserAuthClient.TimeToLive (*clientv3.LeaseTimeToLiveResponse) call is then composed of all struct fields being the zero value of the corresponding Go type because the e2e test infrastructure tries to unmarshal the logged JSON error (which obviously doesn't match the JSON struct tags of the response).

Edit: Apparently the TimeToLive etcdctl cmd tries to match the etcdctl output against id (https://github.com/ghouscht/etcd/blob/issue-18585/tests/framework/e2e/etcdctl.go#L438 in e2e mode). In case the cmd fails due to an invalid auth token etcdctl logs: ... etcdserver: invalid auth token"} which is a match and thus the TimeToLive function does not return an error in e2e mode. Obviously this is entirely different in integration mode and because of that I was never able to reproduce the issue with that mode.

I think increasing the token TTL a bit should solve the issue. We should probably also think of improving the expect.ExpectedResponse matching for such simple terms like id - wdyt?

@ghouscht
Copy link
Contributor

ghouscht commented Oct 7, 2024

Hey @lucasrod16 I think now that my PR was merged I think we can close this issue, do you agree? I also had a quick look at the etcd test grid and to me the test seems to be stable now.

@lucasrod16
Copy link
Contributor Author

@ghouscht nice work! Closing this issue, fixed in #18663

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants