-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove TRAVIS_WORKER_GCE_RATE_LIMIT_*
from gce-production-4 worker.env
#526
Conversation
The main risk is that if we do hit our quota, there will be requeues. And because there is afaik currently no backoff or limit to requeues, we could get into a hairy situation with constant requeueing. At least I think that is a thing that could in theory happen. Might be worth improving our handling of errors/requeues in worker before we roll this out. |
@igorwwwwwwwwwwwwwwwwwwww In other words--isn't this kind of what requeues are for? I suppose it depends on which step in the process a job is in--if we hit a rate limit error on an instance creation request, then a requeue seems as good as a sleep-and-try-again. If we hit it at, say, instance deletion, then obviously a requeue is not what we want. (I haven't checked the code, but I sure hope we don't trigger a requeue in that case.) If there's any possibility of requeues making the problem exponentially worse, then that's a very real problem and we should definitely modify worker to handle it. Otherwise--I think I'd be in favor of just trying without the rate limiter and seeing what happens. Thoughts? |
Yes! :) However, afaik there is no backoff currently. So that is what I would add. There are some tricks with rabbitmq TTLs and dead letter exchanges that can be used to implement backoff (example). Thing is, I'm not sure if this is really a problem in practice. I'm happy to try removing the limit first and only implementing backoff if it actually becomes a problem. |
I just thought I'd mention that we tried this before in June and we had to revert it: #465 |
LOL!! |
"Compute API, requests per 100 seconds" iirc. We got that quota raised a couple weeks ago, but it might still be too low for our peak load. |
I looked into this a bit more, and learned things. So apparently requeues do not go directly through rabbitmq, but perform a whole round trip of hub => scheduler => rabbit => worker. This means we have more flexibility in how we implement backoff. I've created a ticket (https://github.com/travis-ci/reliability/issues/166) and will be taking a look at the possibility of implementing backoff in scheduler. |
Outdated and now-invalid! 😿 💔 |
What is the problem that this PR is trying to fix?
The PR is a step towards removing a dependency on an external Redis.
What approach did you choose and why?
If
RATE_LIMIT_REDIS_URL
is not present in the environment, worker falls back to a null rate limiter.By removing
RATE_LIMIT_REDIS_URL
there should be no rate limiting at all.What feedback would you like, if any?
What are the risks here? What actually happens if we exceed a rate limit?
@igorwwwwwwwwwwwwwwwwwwww say:
Addresses travis-ci/worker#491