-
Notifications
You must be signed in to change notification settings - Fork 87
Discoverer should not retry deleting a service/endpoint that does not exist #134
Comments
How many threads are set up on your discoverer? Could you retest with a higher number? Just curious how that might have (a positive) effect on the performance of the queue. |
Originally had 2, bumped it up to 16 but still seeing this issue. Will try bumping up even higher, but not sure how much it'll help. |
Having more should process the queue faster since they should operate in parallel. But just a thought before going down a design route to see how to resolve. |
I haven't thought through the implications, but I think we should be able to swallow the "does not exist" error when we are trying to delete, instead of raising it up to the queue for re-queuing. What's unclear to me is if we can naturally get into this state by somehow processing a delete notification before processing the corresponding add, and thus end up with an extra service that should have been deleted. |
Ohh I understand now, I didn't realize that we were re-queuing each time on error. I think we could handle the "does not exist" or put a max retry on the queue, once an item is attempted 2 times, then just drop from the queue. That would solve the same scenario where maybe an update or add errors (for whatever reason) and we keep trying. |
Yeah, I actually think we need both... It doesn't look like we have a max retry on the queue item. I'll pick this up and look into it more. |
Not sure if we'd want a |
Now that we know we were not actually retrying before #141, the issue title was somewhat misleading. What was actually going on is that the namespace is deleted from under the discoverer on both clusters, resulting in the API server from the backend cluster sending a long list of DELETE watch notifications to the discoverer. This is the expected behavior. Now that we have added retries though, we should probably not attempt to retry a deletion for a service or endpoint that does not exist. Another idea (which needs a bit more thought): Have a separate queue per namespace, so that the operations performed by one team do not affect other teams. This would be a larger change that requires discussion and design though (cc @rosskukulinski @stevesloka). |
The more I think about this, the more I am second guessing whether we should do it. It seems like the deletion of a namespace with thousands of services is an unlikely scenario when running Gimbal in production. Is that a reasonable statement? @stevesloka @rosskukulinski |
My guess that This was my reason for the re-queue param, you may not want to retry more than once if something fails. |
While doing performance testing, I noticed that the discovery components are wasting cycles trying to delete services/endpoints that no longer exist. This is currently impacting other tests (which can be thought of as other gimbal teams) because the discoverer is busy trying to delete non-existent services, instead of discovering new services for other teams.
How does this happen? The performance test creates a namespace where it creates hundreds of services and endpoints in the backend cluster. These are then discovered by Gimbal and everything is fine.
Once the test is done, instead of deleting services/endpoints one by one, the performance test script destroys the test namespace on both the gimbal cluster and the backend cluster. The deletion of the namespace in the backend cluster results in watch notifications sent to the discoverers about the deletion of services. The problem is that the services/endpoints have already been deleted in the gimbal cluster, because the entire namespace was deleted.
The text was updated successfully, but these errors were encountered: