Thinking more deeply about this, I think we have a general race condition. Let's pretend:
- The constraint controller is fast.
- The CT controller is slow (likely, since it may require compilation of Rego/CEL).
- There is some constraint Foo, backed by some template FooTemplate.
- There are 3 events for FooTemplate: `no-generate-vap`, `generate-vap`, `no-generate-vap`.
- The FooTemplate reconciler is slow enough that only two events get reconciled: `no-generate-vap`, `no-generate-vap`.
- The constraint controller gets a reconcile while `generate-vap` is set.
If this sequence of events happens, you could get a dangling VAP binding, because when the constraint controller reconciles, it runs `client.Get()` against the informer cache and sees the annotation telling it to create a binding; but the constraint template controller sees no change (from its view, `no-generate-vap` has always been set), so it never triggers the reconcile that would clean the binding up.
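To make the failure mode concrete, here is a minimal sketch of the racy pattern, assuming the constraint controller reads the annotation straight off the cached template object. The annotation key and all names are illustrative, not Gatekeeper's actual identifiers:

```go
package constraint

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the backing template for this constraint from the informer cache.
	ct := &unstructured.Unstructured{}
	ct.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "templates.gatekeeper.sh", Version: "v1", Kind: "ConstraintTemplate",
	})
	if err := r.Get(ctx, types.NamespacedName{Name: req.Name}, ct); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Race: this read can observe the transient generate-vap state that the
	// (slower) CT controller never reconciled, so the binding created below
	// dangles forever. "gatekeeper.sh/generate-vap" is a hypothetical key.
	if ct.GetAnnotations()["gatekeeper.sh/generate-vap"] == "true" {
		// ... create the ValidatingAdmissionPolicyBinding for this constraint ...
	}
	return ctrl.Result{}, nil
}
```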
I think I see a way out of this, but it's not super pretty:
1. Create a shared cache between the constraint template and constraint controllers that stores the CT controller's opinion as to whether a constraint kind should/should not generate VAP objects (see the sketch after this list).
2. Create a function on the CT reconciler called `shouldGenerateVAP(kind string, shouldGenerate bool)`; when `shouldGenerate` changes for a particular kind, that's what updates the cache and triggers a re-reconcile of all constraints for the given template.
3. Call `shouldGenerateVAP()` where we are currently running list calls (might need to think about the timing).
4. The constraint controller reads from the cache, rather than from the object.
5. Because the results of reconciliation depend on internal state, we should run reconciliation as a singleton process (i.e. throw it on the status/audit Pod); this will prevent thrashing of the k8s resources whenever the pods' watch events are out of sync.
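A rough sketch of what steps 1–4 could look like, assuming a standalone cache package; all names here are hypothetical, not an existing Gatekeeper API:

```go
package vapcache

import "sync"

// Cache holds the CT controller's opinion, per constraint kind, on whether
// VAP objects should be generated. The constraint controller reads from
// here instead of from the annotation on the cached object.
type Cache struct {
	mu       sync.RWMutex
	generate map[string]bool // constraint kind -> should generate VAP objects
	// onChange lets the CT controller trigger a re-reconcile of all
	// constraints for the kind whose opinion just flipped.
	onChange func(kind string)
}

func New(onChange func(kind string)) *Cache {
	return &Cache{generate: map[string]bool{}, onChange: onChange}
}

// ShouldGenerateVAP records the CT controller's current opinion for a kind.
// When the opinion changes (or is seen for the first time), it fires
// onChange so every constraint backed by that template gets re-reconciled.
func (c *Cache) ShouldGenerateVAP(kind string, shouldGenerate bool) {
	c.mu.Lock()
	prev, seen := c.generate[kind]
	c.generate[kind] = shouldGenerate
	c.mu.Unlock() // release before onChange in case it re-enters the cache
	if !seen || prev != shouldGenerate {
		c.onChange(kind)
	}
}

// Get is what the constraint controller consults instead of client.Get().
func (c *Cache) Get(kind string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.generate[kind]
}
```

Because both controllers share one authoritative view, the ordering of watch events on the annotation no longer matters: the binding decision always reflects the state the CT controller actually reconciled.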
(5) is probably a good thing to address generally, since there is also a chance of thrashing on G8r upgrade. I'd suggest that for any controller that creates a downstream object (CRDs, VAP objects), the creation runs as a singleton (we can carve it out as an operation, perhaps, like audit or status updates). The controllers should still run for the purposes of caching state (e.g. loading constraint templates into CF), and they should still write out pod status objects.
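A hedged sketch of what carving creation out as an operation might look like; the `assignedOps` mechanism and the "generate" operation name are assumptions, not existing Gatekeeper flags:

```go
package generate

import "context"

type TemplateReconciler struct {
	// assignedOps mirrors whatever mechanism assigns operations (audit,
	// status, webhook, ...) to a pod at startup.
	assignedOps map[string]bool
}

func (r *TemplateReconciler) reconcile(ctx context.Context) error {
	// Every pod still caches state (e.g. loads the template into the
	// constraint framework) and writes its own pod status object...
	if err := r.cacheState(ctx); err != nil {
		return err
	}
	if err := r.writePodStatus(ctx); err != nil {
		return err
	}
	// ...but only the pod assigned the hypothetical "generate" operation
	// creates/deletes downstream objects (CRDs, VAP objects), so pods with
	// out-of-sync watch events can't thrash those resources.
	if !r.assignedOps["generate"] {
		return nil
	}
	return r.ensureDownstreamObjects(ctx)
}

// Stubs standing in for the real controller logic.
func (r *TemplateReconciler) cacheState(ctx context.Context) error              { return nil }
func (r *TemplateReconciler) writePodStatus(ctx context.Context) error          { return nil }
func (r *TemplateReconciler) ensureDownstreamObjects(ctx context.Context) error { return nil }
```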
This is a medium-to-large body of work. I would recommend we do this before graduating to beta (or at least before allowing users to set failure policy to `Fail`). However, I don't think we need to block the current PR on this because:

- this is an in-development feature that is off by default, and
- we currently fail open, so a dangling binding will not cause any issues.
Originally posted by @maxsmythe in #3266 (comment)