📖 Add a design for a priority queue #3013

Open: wants to merge 1 commit into base: main


alvaroaleman
Member

@alvaroaleman alvaroaleman commented Nov 17, 2024

This change describes the motivation and implementation details for a priority queue in controller-runtime.

Ref #2374

POC of the changes is in #3014

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 17, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 17, 2024
@alvaroaleman
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2024
// Adding an item that is already there may update its wait
// period to the lowest of existing and new wait period or
// its priority to the highest of existing and new priority.
AddWithOpts(o AddOpts, items ...T)
Member Author

The main alternative to this would be to have AddWithPriority, AddAfterWithPriority, AddRatelimitedWithPriority, etc. I found this easier, but I don't have a strong opinion either way.

Member

Definitely a lot easier to extend without breaking changes

reasons to prioritize some events") will always require implementation of a custom
handler or eventsource in order to inject the appropriate priority.

## Implementation stages

Will the default controller be modified to make use of this new queue (if used), or will it rely on using a custom controller implementation?

Member Author

Yes, it will be updated to make use of it. The only thing it needs is to re-use the priority, though. Once we make it the default, I would also like to add a Priority parameter to the reconcile.Result, but really only once it's enabled by default.

Member

Maybe also to the Request?

Member Author

> Maybe also to the Request?

Ah, interesting thought. Is the idea because you want to be able to set the priority in a handler?

The way this currently works is that the workqueue doesn't have any understanding of the request object (and we should keep it that way IMHO). We could probably provide a thin wrapper for it that uses the priority from the request if the AddWithOpts call doesn't have one yet, and inject that wrapper in the builder if it's typed to reconcile.Request?

Member

I was thinking that the priority could also be useful information for the Reconcile func. But maybe it's a bad idea, not sure 😀

Member Author

Ah. Because then it can use that as an input when returning a priority?

Member

@sbueringer sbueringer Nov 20, 2024

That's one use case. Another one would be to reconcile Requests of different priorities differently :)

Maybe it makes sense in some cases to pass down the priority (if some other components are involved in reconciliation).

Maybe a controller would act differently if it can infer from the priority whether this is just a periodic resync (or something similar) or an actual change. Or, more generally, it could give higher-priority Requests precedence if it has to schedule some "tasks" in other systems.

But I'm really not sure if this is a good idea or opening Pandora's box :)
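
The thin wrapper floated earlier in this thread could look roughly like the sketch below. Everything here is hypothetical: a `Priority` field on the request does not exist in the design, `AddOpts.Priority` is made a pointer only so that "not set" can be told apart from 0, and `requestPriorityQueue` and `recorder` are illustration names. The point is that the inner queue never needs to understand the request type.

```go
package main

import "fmt"

// AddOpts uses a pointer for Priority (an assumption of this sketch)
// so the wrapper can detect that the caller did not set one.
type AddOpts struct {
	Priority *int
}

// Request is a hypothetical request type that carries a priority.
type Request struct {
	Name     string
	Priority int
}

// PriorityQueue is the part of the queue interface the wrapper needs.
type PriorityQueue[T comparable] interface {
	AddWithOpts(o AddOpts, items ...T)
}

// requestPriorityQueue fills in a missing priority from each request
// before delegating to the wrapped queue.
type requestPriorityQueue struct {
	inner PriorityQueue[Request]
}

func (w requestPriorityQueue) AddWithOpts(o AddOpts, items ...Request) {
	if o.Priority != nil {
		// The caller chose a priority explicitly; pass it through.
		w.inner.AddWithOpts(o, items...)
		return
	}
	for _, item := range items {
		p := item.Priority
		w.inner.AddWithOpts(AddOpts{Priority: &p}, item)
	}
}

// recorder is a fake inner queue that records the priority it receives.
type recorder struct {
	got map[string]int
}

func (r *recorder) AddWithOpts(o AddOpts, items ...Request) {
	for _, item := range items {
		if o.Priority != nil {
			r.got[item.Name] = *o.Priority
		}
	}
}

func main() {
	rec := &recorder{got: map[string]int{}}
	q := requestPriorityQueue{inner: rec}

	// No priority in the options: the request's own priority is used.
	q.AddWithOpts(AddOpts{}, Request{Name: "a", Priority: 7})

	// Explicit priority in the options: it takes precedence.
	explicit := -100
	q.AddWithOpts(AddOpts{Priority: &explicit}, Request{Name: "b", Priority: 7})

	fmt.Println(rec.got["a"], rec.got["b"])
}
```

One caveat with this shape: a priority stored inside a comparable request becomes part of the item's identity, so two requests for the same object with different priorities would no longer deduplicate.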

// object is already in the workqueue, the priority will be updated
// to the highest of existing and new priority.
Priority int
}

I’m wondering if we need to consider starvation scenarios where some item is never retrieved because there’s always something of higher priority in the queue. It might be fair to say this is “user error”, but is there a way to detect / avoid / recover?

Member Author

Interesting point. I don't think we can avoid or recover, but maybe detect. I'll think about it.

Member Author

I thought a bit about this. This can only happen if the controller is unable to drain its queue, which would be a problem by itself regardless of the priority queue. I would generally expect alerts on this.

The implementation of the queue I have in the other PR happens to store timestamps for when an object was added, so in theory we could use that to emit metrics or logs if an object is in the queue for too long. In practice, however, defining what "too long" is seems difficult because it varies. I tend to think that the queue depth metric is overall good enough to deal with this problem. WDYT?

Member

Queue depth is definitely good enough in general to detect that the queue cannot be drained. Isn't there also a metric for the longest time something is in the queue?

Independent of that we could also consider increasing priority after each resync, so it eventually reaches or maybe even exceeds (not sure) the priority of regular events

Member Author

> Independent of that we could also consider increasing priority after each resync

I wouldn't want something like that in the queue (or probably even by default), because that seems a bit too magic

Member

Fine for me!

Member

@sbueringer sbueringer left a comment

Overall lgtm. Looking forward to using this 😀

In order to fix the issue described in point one of the motivation section,
we have to be able to differentiate events stemming from the initial list
during startup and from resyncs from other events. In both these cases, the
informer emits an artificial create. The suggestion is to use a heuristic that
Member

I think resyncs are updates (without resourceVersion changes)

Maybe we could make this better mid-term by flagging these events explicitly in client-go

Member Author

> I think resyncs are updates (without resourceVersion changes)

You are right, fascinating, TILd. Will update

> Maybe we could make this better mid-term by flagging these events explicitly in client-go

That would make sense but I think it might be difficult due to how client-go is factored (But would love to be wrong on this).
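
The correction in this thread (resyncs arrive as updates whose resourceVersion is unchanged) suggests a simple check in an update handler. The sketch below uses a stand-in `object` type instead of real Kubernetes metadata, and `lowPriority`, `defaultPriority`, `isResync`, and `priorityForUpdate` are hypothetical names and values, not part of the design.

```go
package main

import "fmt"

// object is a minimal stand-in for a Kubernetes object; real code would
// read the resourceVersion through the object's metadata.
type object struct {
	Name            string
	ResourceVersion string
}

const (
	defaultPriority = 0
	lowPriority     = -100 // hypothetical value for resync-triggered events
)

// isResync reports whether an update event looks like a periodic resync:
// the informer re-delivers the cached object, so the resourceVersion of
// old and new is identical.
func isResync(oldObj, newObj object) bool {
	return oldObj.ResourceVersion == newObj.ResourceVersion
}

// priorityForUpdate picks the priority an update handler would enqueue with.
func priorityForUpdate(oldObj, newObj object) int {
	if isResync(oldObj, newObj) {
		return lowPriority
	}
	return defaultPriority
}

func main() {
	cached := object{Name: "pod-a", ResourceVersion: "41"}
	changed := object{Name: "pod-a", ResourceVersion: "42"}

	fmt.Println(priorityForUpdate(cached, cached), priorityForUpdate(cached, changed))
}
```

This only covers resyncs; creates emitted for the initial list during startup still need a separate heuristic.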

designs/priorityqueue.md (outdated)
// GetWithPriority returns an item and its priority. It allows
// a controller to re-use the priority if it enqueues an item
// again.
GetWithPriority() (item T, priority int, shutdown bool)
Member

What's the story for the shutdown return parameter? :)

Member Author

I just copied it from the existing Get; we use it to exit the controller:

	if shutdown {
		// Stop working
		return false
	}

Member

Makes sense. Probably worth extending the godoc of this func to explain it a bit (~ the same comment as on Typed.Get)
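
As an illustration of how the three return values are meant to be used together, here is a sketch of a worker step that stops on `shutdown` and re-uses the dequeued priority when it requeues. `GetWithPriority` and `AddWithOpts` follow the signatures quoted in this PR and `Done` mirrors the existing workqueue interface; `sliceQueue`, `processNext`, and the `reconcile` callback are assumptions, and the toy queue reports shutdown once it is empty, whereas a real queue would block until an item arrives or the queue is shut down.

```go
package main

import "fmt"

type AddOpts struct {
	Priority int
}

// priorityQueue is the part of the proposed interface a worker needs.
type priorityQueue[T comparable] interface {
	GetWithPriority() (item T, priority int, shutdown bool)
	AddWithOpts(o AddOpts, items ...T)
	Done(item T)
}

type queued[T comparable] struct {
	item     T
	priority int
}

// sliceQueue is a toy FIFO used only to exercise the loop.
type sliceQueue[T comparable] struct {
	items []queued[T]
}

func (q *sliceQueue[T]) GetWithPriority() (T, int, bool) {
	if len(q.items) == 0 {
		var zero T
		return zero, 0, true // toy behavior: empty means shut down
	}
	head := q.items[0]
	q.items = q.items[1:]
	return head.item, head.priority, false
}

func (q *sliceQueue[T]) AddWithOpts(o AddOpts, items ...T) {
	for _, item := range items {
		q.items = append(q.items, queued[T]{item: item, priority: o.Priority})
	}
}

func (q *sliceQueue[T]) Done(T) {}

// processNext handles one item and reports whether the worker should keep going.
func processNext[T comparable](q priorityQueue[T], reconcile func(T) (requeue bool)) bool {
	item, priority, shutdown := q.GetWithPriority()
	if shutdown {
		// Stop working
		return false
	}
	defer q.Done(item)

	if reconcile(item) {
		// Re-use the priority the item was dequeued with.
		q.AddWithOpts(AddOpts{Priority: priority}, item)
	}
	return true
}

func main() {
	q := &sliceQueue[string]{}
	q.AddWithOpts(AddOpts{Priority: -100}, "pod-a")

	attempts := 0
	reconcile := func(string) bool {
		attempts++
		return attempts == 1 // requeue once, then succeed
	}
	for processNext[string](q, reconcile) {
	}
	fmt.Println("attempts:", attempts)
}
```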
