fix(vault): fix vault config neg_ttl behavior #14157

Open
wants to merge 3 commits into
base: master


cshuaimin

Per the documentation, neg_ttl specifies how long a vault miss is cached. However, the current implementation does not cache a secret miss for this duration; it refetches the secret from the vault backend every minute.

This PR first fixes the check in the secret rotation timer so that negatively cached values are no longer refetched unconditionally, but only after neg_ttl has elapsed. It then changes the shdict TTL for negative cache entries from neg_ttl to neg_ttl + SECRETS_CACHE_MIN_TTL; otherwise the entry would expire from the shdict and there would be no chance to refresh it after neg_ttl.
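
The two changes described above can be illustrated with a small sketch. This is not the code in kong/pdk/vault.lua: the class and method names are hypothetical, the shared dict is replaced by a plain Python dict, and the value of SECRETS_CACHE_MIN_TTL is an assumed placeholder. It only shows why the store TTL has to outlive neg_ttl for the rotation timer to get a chance to refresh a miss.

```python
from dataclasses import dataclass
from typing import List

# Assumed placeholder value, for illustration only.
SECRETS_CACHE_MIN_TTL = 120


@dataclass
class NegativeEntry:
    cached_at: float   # when the miss was recorded
    expires_at: float  # when the store evicts the entry


class NegativeCache:
    """Toy stand-in for the shared dict; the current time is passed in explicitly."""

    def __init__(self, neg_ttl: float):
        self.neg_ttl = neg_ttl
        self.entries = {}

    def record_miss(self, ref: str, now: float) -> None:
        # Keep the entry past neg_ttl so the rotation timer can still see it.
        expires_at = now + self.neg_ttl + SECRETS_CACHE_MIN_TTL
        self.entries[ref] = NegativeEntry(cached_at=now, expires_at=expires_at)

    def evict_expired(self, now: float) -> None:
        self.entries = {k: e for k, e in self.entries.items() if e.expires_at > now}

    def due_for_refetch(self, now: float) -> List[str]:
        # Only misses older than neg_ttl are retried, not every rotation tick.
        self.evict_expired(now)
        return [k for k, e in self.entries.items()
                if now - e.cached_at >= self.neg_ttl]
```

With neg_ttl set to 300, a miss recorded at time 0 is not due at time 60, becomes due at time 300, and is evicted from the store at time 420 if nothing refreshed it in between.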

Summary

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary.
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix FTI-6240

@CLAassistant

CLAassistant commented Jan 14, 2025

CLA assistant check
All committers have signed the CLA.


github-actions bot added the core/pdk and cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee) labels on Jan 14, 2025
@bungle
Member

bungle commented Jan 14, 2025

> Per the documentation, neg_ttl specifies how long a vault miss is cached. However, the current implementation does not cache a secret miss for this duration; it refetches the secret from the vault backend every minute.

I think the reason was that we don't have a clear picture of whether something is a miss or something else, such as a network error. That is why we decided to refetch misses on every rotation cycle. We have also talked about allowing n failures, or growing the interval gradually on continuous failures, and ultimately removing the secret from rotation.

I do not have a strong feeling about either direction, though.
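
The idea of growing the interval on continuous failures and eventually dropping the secret from rotation could look roughly like the sketch below. The function name, the failure cutoff, and the delay cap are all hypothetical; only the one-minute base interval comes from this thread.

```python
ROTATION_INTERVAL = 60  # seconds; the thread says misses are refetched every minute
MAX_FAILURES = 8        # hypothetical cutoff before giving up on a secret
MAX_DELAY = 3600        # hypothetical upper bound on the retry delay, in seconds


def next_retry_delay(failures: int):
    """Return seconds until the next attempt, or None to drop the secret from rotation.

    `failures` is the number of consecutive failed fetches so far.
    """
    if failures >= MAX_FAILURES:
        return None
    # Double the delay for every consecutive failure, up to the cap.
    return min(ROTATION_INTERVAL * 2 ** failures, MAX_DELAY)
```

A successful fetch would reset the failure counter to zero, which brings the secret back to the normal rotation interval.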

@cshuaimin
Author

Yes, you are right. If there is a network error when fetching from the vault, the result should not be cached for a long time but retried every minute (or with exponential backoff).
How about handling these cases separately, i.e. adding a new error entry to the cache on failure? We would still refresh error entries every minute, but for truly missing secrets we would respect the neg_ttl config.
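
A minimal sketch of that proposal, assuming the fetch result has already been classified into one of three outcomes. The enum and function names are hypothetical and do not correspond to anything in kong/pdk/vault.lua.

```python
from enum import Enum


class Outcome(Enum):
    FOUND = "found"      # the vault returned a value
    MISSING = "missing"  # the vault answered that the secret does not exist
    ERROR = "error"      # network failure, timeout, or another transient problem


ROTATION_INTERVAL = 60  # seconds; errors keep being retried every minute


def refresh_after(outcome: Outcome, ttl: float, neg_ttl: float) -> float:
    """Return how many seconds to wait before fetching the secret again."""
    if outcome is Outcome.FOUND:
        return ttl
    if outcome is Outcome.MISSING:
        # A true miss respects the configured neg_ttl.
        return neg_ttl
    # Errors are not trusted as misses, so they are retried on the next cycle.
    return ROTATION_INTERVAL
```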

@cshuaimin
Author

Fixed the test and rebased onto master in the force push.

@bungle
Member

bungle commented Jan 15, 2025

> add a new error entry on failure to cache. We still refresh error caches every minute, but for true missing cache we respect neg_ttl config.

Yes, but how do we know whether it was an error or a missing vault key? We may need to consult each vault implementation about that, if it is even possible. For example, does a 404 come from a misconfigured proxy or from the vault itself? We may need to check the payload, and since there is no standard, each vault may be different. But sure, if you want to explore this option, I have nothing against it.
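
One way to picture the per-vault payload check is a registry of detector functions, where a 404 only counts as a miss if the vault-specific detector recognizes the response body. Everything here is hypothetical: "examplevault" is a made-up backend and its `{"code": "not_found"}` body is an invented payload shape, not the format of any real vault.

```python
import json


def is_json_not_found(body: str) -> bool:
    """Detector for the made-up 'examplevault' backend."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an HTML error page from a proxy.
        return False
    return isinstance(payload, dict) and payload.get("code") == "not_found"


# Hypothetical registry: one miss detector per vault implementation.
MISS_DETECTORS = {"examplevault": is_json_not_found}


def classify(vault: str, status: int, body: str) -> str:
    """Classify a fetch response as 'found', 'missing', or 'error'."""
    if status == 200:
        return "found"
    detector = MISS_DETECTORS.get(vault)
    if status == 404 and detector is not None and detector(body):
        return "missing"
    # Unknown vaults and unrecognized payloads are treated as errors.
    return "error"
```

Defaulting to "error" keeps the existing retry-every-cycle behavior for any vault that has no detector, so only confirmed misses would be held for neg_ttl.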

Member

@windmgc windmgc left a comment


Approve with nitpicks

I think the simpler way would be to just cache the negative value for neg_ttl, regardless of what kind of error is encountered. IMO this is more aligned with what the name of the config field, "neg_ttl", describes.

kong/pdk/vault.lua (resolved)
kong/pdk/vault.lua (outdated, resolved)
Labels
cherry-pick kong-ee, core/pdk, size/M

4 participants