PKI: Progressive performance degradation upon CA/issuer rotation #29083
I realize that the example of 60 CA/issuer rotations as quickly as Vault is capable is unrealistic; however, the fact that performance never recovers indicates that this will be encountered in the future.

I've also tested this on a fresh EKS cluster with the ebs-csi-provisioner platform storage backend and an actual AWS KMS key, and observed the same effect.
Hi @aescaler-raft, thanks for filing the issue. I could have sworn we had an open issue around this problem already, but my search turned up empty. This is a known issue around having many issuers and rebuilding all the CRLs, which always happens when a new issuer is created. This shouldn't have a huge impact on day-to-day operations if the issuer count is kept low, which we highly recommend for various reasons; see https://developer.hashicorp.com/vault/docs/secrets/pki/considerations#one-ca-certificate-one-secrets-engine. I'll keep the issue open for visibility, and as another reminder that we need to make the CRL building smarter and more efficient within the PKI engine.
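For reference, the "one CA certificate, one secrets engine" pattern from the linked considerations page can be sketched as rotating onto a fresh mount rather than accumulating issuers in one mount. This is a minimal illustration, not an official HashiCorp procedure; the date-based mount-naming scheme is an assumption, and the Vault calls only run when `VAULT_ADDR` and `VAULT_TOKEN` are set:

```shell
# Sketch: each rotation gets its own PKI mount (one CA per engine).
# Mount naming is an assumption for illustration.
NEXT_MOUNT="pki-$(date +%Y%m%d)"
echo "next mount: $NEXT_MOUNT"

# Requires a reachable Vault and a token; skipped otherwise.
if [ -n "${VAULT_ADDR:-}" ] && [ -n "${VAULT_TOKEN:-}" ]; then
  vault secrets enable -path="$NEXT_MOUNT" pki
  vault write "$NEXT_MOUNT/root/generate/internal" \
    common_name="TEST CURL ROOT" key_type=rsa key_bits=2048
  # After consumers are repointed at $NEXT_MOUNT, retire the old mount:
  # vault secrets disable pki-<previous>
fi
```

Because each mount holds a single issuer, the per-issuer CRL rebuild cost described above stays constant instead of growing with every rotation.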
Hi @stevendpclark, thanks for validating my observations and providing a recommended path forward. Regarding the documentation: is there any specific reason this isn't documented on the page linked?
Hi @aescaler-raft, I really appreciate the offer to work on this. This won't be a trivial fix to address within the PKI engine at this stage. Historically, the PKI engine only supported a single issuer; we added multi-issuer support to help with rotating that issuer, but it was never meant to hold a large number of distinct issuers within the mount itself. CRL rebuilding on initial CA creation is one item, but depending on how many issuers you are talking about, the following items might also need to be tackled, off the top of my head (this isn't an exhaustive list):
So this will be a pretty significant effort to make the PKI engine do what you want.
I can't think of any particular reason; it should probably be called out in that same section as another reason we do not recommend running a large number of issuers within an individual mount.
The history of the PKI engine certainly resonates with me. To meet my customer's requirements, I don't see any options other than:

--or--

I'm partial to contributing directly to Vault, but this of course assumes appetite from maintainers and contributors such as yourself. Please advise.
I've brought it up internally for discussion. Out of curiosity, what sort of time frame would you need this for?
@stevendpclark before Q2 of 2025 would be ideal. |
Describe the bug
An HA Vault PKI engine with Raft storage experiences progressive and permanent performance degradation upon CA/issuer generation (root) or import (root or intermediate) operations.
Separate PKI engines are not impacted by one another's degradation (though each experiences the same progressive and permanent degradation from its own issuer rotations).
Performance can be restored by disabling and then re-enabling the engine at the previous endpoint.
Testing parameters:
- `internal`
- `v1.24.15+rke2r1` (RKE2 Kubernetes version)
This performance issue is not experienced for CA CSR generation.
To me, this indicates that this issue is not tied to the private key, but to the issuer.
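The root-vs-CSR contrast can be checked directly on the same mount: `root/generate/internal` creates a new issuer, while `intermediate/generate/internal` only returns a CSR without creating one. A rough sketch, assuming `VAULT_ADDR` and `VAULT_TOKEN` are exported and a `pki` mount exists (the guard makes it a no-op otherwise):

```shell
# Sketch: time an issuer-creating call against a CSR-only call.
payload='{"common_name": "TEST CSR", "key_type": "rsa", "key_bits": 2048}'
if [ -n "${VAULT_ADDR:-}" ] && [ -n "${VAULT_TOKEN:-}" ]; then
  # Creates an issuer; per this report, slows down as issuers accumulate.
  time curl -sk -X POST "$VAULT_ADDR/v1/pki/root/generate/internal" \
    -H "X-Vault-Token: $VAULT_TOKEN" --data "$payload" >/dev/null
  # Returns a CSR only; reported as unaffected by the degradation.
  time curl -sk -X POST "$VAULT_ADDR/v1/pki/intermediate/generate/internal" \
    -H "X-Vault-Token: $VAULT_TOKEN" --data "$payload" >/dev/null
fi
```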
Observations:
- `%iowait` less than ~0.07 on all pods
- `/proc/cpuinfo` shows the `aes` flag is present, confirming support for hardware-accelerated crypto
- `sudo cat /proc/cpuinfo | grep rand` showed `rdrand` available on all nodes
- `sudo cat /proc/sys/kernel/random/entropy_avail` showed >3500 on all cluster nodes
- Installed the `rng-tools` package to test `/dev/random` and `/dev/urandom`
- `rngtest -c 10000 </dev/urandom` showed >1000x speed over `rngtest -c 10000 </dev/random` (expected)
- Enabled the `rngd` service with `sudo systemctl enable --now rngd`; observed significant performance improvement in `/dev/random` speed with `rngtest`
- `rngtest` on `/dev/random` and `/dev/urandom` is now within one order of magnitude (26s vs 2.6s, respectively)

Tracing:
On a Vault cluster that is experiencing PKI engine performance degradation, I called the trace endpoint `/v1/sys/pprof/trace` with the parameter `seconds=60`, after which I called the PKI engine's generate-root endpoint. Analysis under `go tool trace <file>` showed high latency in the synchronization blocking profile, specifically in this goroutine: `github.com/hashicorp/raft.(*raftState).goFunc.func1`. On the synchronization blocking profile page, I noticed the graph showed edges with times greater than 60s, and nodes showed `0 of <time> (<percent>%)` where `<time>` and `<percent>` are non-zero.

To Reproduce
Steps to reproduce the behavior:
1. `vault secrets enable pki`
2. `for i in $(seq 1 60); do time curl -k -X POST "https://<vault_url>/v1/pki/root/generate/internal" -H "X-Vault-Token: <vault_token>" --data '{"common_name": "TEST CURL ROOT", "key_type": "rsa", "key_bits": 2048}'; done`
3. `sleep 600; time curl -k -X POST "https://<vault_url>/v1/pki/root/generate/internal" -H "X-Vault-Token: <vault_token>" --data '{"common_name": "TEST CURL ROOT", "key_type": "rsa", "key_bits": 2048}'`
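To make the degradation curve visible while reproducing, the rotation loop can be adapted to log each call's latency to a CSV (a trace can be captured in parallel, as described under Tracing). A sketch, assuming `VAULT_ADDR` and `VAULT_TOKEN` are exported; the Vault calls are a no-op when they are unset:

```shell
# Sketch: log per-rotation latency so the slowdown can be plotted.
log_rotation_times() {
  # $1 = number of rotations (defaults to 60, as in the repro above)
  for i in $(seq 1 "${1:-60}"); do
    t=$(curl -sk -o /dev/null -w '%{time_total}' -X POST \
        "$VAULT_ADDR/v1/pki/root/generate/internal" \
        -H "X-Vault-Token: $VAULT_TOKEN" \
        --data '{"common_name": "TEST CURL ROOT", "key_type": "rsa", "key_bits": 2048}')
    echo "$i,$t"   # CSV row: rotation number, seconds elapsed
  done
}

if [ -n "${VAULT_ADDR:-}" ] && [ -n "${VAULT_TOKEN:-}" ]; then
  log_rotation_times 60 | tee rotation_times.csv
  # While the calls are slow, capture a 60s execution trace for analysis:
  # curl -sk -H "X-Vault-Token: $VAULT_TOKEN" \
  #   "$VAULT_ADDR/v1/sys/pprof/trace?seconds=60" -o trace.out
  # go tool trace trace.out
fi
```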
Expected behavior
Vault PKI engine performance should not degrade, or at least recover.
Environment
- Vault server version (retrieve with `vault status`): 1.18.1, built 2024-10-29T14:21:31Z
- Vault CLI version (retrieve with `vault version`): Vault v1.18.1 (f479e5c), built 2024-10-29T14:21:31Z
- Vault server configuration file(s)
Additional context
Attempted remediations:
- `vault write sys/mounts/pki/config/tune options=worker_count=4` - no effect
- `vault secrets tune -max-lease-ttl=1h pki` - no effect
- `vault write pki/tidy tidy_cert_store=true tidy_revoked_certs=true` - no effect
- `vault operator raft snapshot save snapshot.snap; vault operator raft snapshot restore snapshot.snap`
- `vault secrets disable pki; vault secrets enable -path=pki pki` - greatest effect, non-permanent