
Remove TRAVIS_WORKER_GCE_RATE_LIMIT_* from gce-production-4 worker.env #526

Closed
wants to merge 3 commits

Conversation

@soulshake (Contributor) commented Aug 29, 2018

What is the problem that this PR is trying to fix?

The PR is a step towards removing a dependency on an external Redis.

What approach did you choose and why?

If RATE_LIMIT_REDIS_URL is not present in the environment, worker falls back to a null rate limiter.

By removing RATE_LIMIT_REDIS_URL there should be no rate limiting at all.
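
For illustration only, a minimal Go sketch of the kind of fallback described above. The identifiers (`RateLimiter`, `nullRateLimiter`, `buildRateLimiter`) and the bare `RATE_LIMIT_REDIS_URL` variable name are placeholders, not necessarily the actual code in travis-ci/worker:

```go
// Hypothetical sketch: pick a rate limiter based on the environment.
// No Redis URL configured means no rate limiting at all.
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// RateLimiter gates outgoing cloud API calls.
type RateLimiter interface {
	// RateLimit reports whether another call may proceed right now.
	RateLimit(ctx context.Context, name string, maxCalls uint64, per time.Duration) (bool, error)
}

// nullRateLimiter never limits anything.
type nullRateLimiter struct{}

func (nullRateLimiter) RateLimit(ctx context.Context, name string, maxCalls uint64, per time.Duration) (bool, error) {
	return true, nil
}

// redisRateLimiter stands in for a Redis-backed implementation.
type redisRateLimiter struct{ url string }

func (r redisRateLimiter) RateLimit(ctx context.Context, name string, maxCalls uint64, per time.Duration) (bool, error) {
	// A real implementation would check and increment a counter in Redis here.
	return true, nil
}

func buildRateLimiter() RateLimiter {
	// In worker.env the variable carries the TRAVIS_WORKER_GCE_ prefix per the PR title;
	// the short form from the description is used here for illustration.
	redisURL := os.Getenv("RATE_LIMIT_REDIS_URL")
	if redisURL == "" {
		return nullRateLimiter{}
	}
	return redisRateLimiter{url: redisURL}
}

func main() {
	limiter := buildRateLimiter()
	ok, _ := limiter.RateLimit(context.Background(), "gce-api", 20, 100*time.Second)
	fmt.Println("allowed:", ok)
}
```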

What feedback would you like, if any?

What are the risks here? What actually happens if we exceed a rate limit?

@igorwwwwwwwwwwwwwwwwwwww says:

> According to the Quotas page, we are well within current quota limits. According to our Redis rate limiting metrics, we are still heavily rate limiting ourselves every now and then, but mostly we are not applying any rate limits.

Addresses travis-ci/worker#491

@igorwwwwwwwwwwwwwwwwwwww (Contributor)

The main risk is that if we do hit our quota, there will be requeues. And because there is afaik currently no backoff or limit on requeues, we could get into a hairy situation with constant requeueing.

At least I think that is a thing that could in theory happen. Might be worth improving our handling of errors/requeues in worker before we roll this out.

@soulshake (Contributor, Author)

@igorwwwwwwwwwwwwwwwwwwww
How should worker react when it hits a rate limit error? Sleep and try again? If so, wouldn't a requeue have the same effect?

In other words--isn't this kind of what requeues are for?

I suppose it depends on which step in the process a job is in--if we hit a rate limit error on an instance creation request, then a requeue seems as good as a sleep-and-try-again. If we hit it at, say, instance deletion, then obviously a requeue is not what we want. (I haven't checked the code, but I sure hope we don't trigger a requeue in that case.)

If there's any possibility of requeues making the problem exponentially worse, then that's a very real problem and we should definitely modify worker to handle it. Otherwise--I think I'd be in favor of just trying without the rate limiter and seeing what happens.

Thoughts?

@igorwwwwwwwwwwwwwwwwwwww (Contributor) commented Aug 30, 2018

> How should worker react when it hits a rate limit error? Sleep and try again? If so, wouldn't a requeue have the same effect?
>
> In other words--isn't this kind of what requeues are for?

Yes! :) However, afaik there is no backoff currently, so that is what I would add. There are some tricks with RabbitMQ TTLs and dead letter exchanges that can be used to implement backoff (example).

Thing is, I'm not sure if this is really a problem in practice. I'm happy to try removing the limit first and only implementing backoff if it actually becomes a problem.
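
For reference, a rough sketch of the TTL/dead-letter-exchange trick mentioned above, using the streadway/amqp Go client. The queue names, TTL value, and topology are made up for illustration and are not worker's actual setup:

```go
// Delayed requeueing via a TTL'd retry queue whose dead-letter exchange
// points back at the main work queue. Nothing consumes the retry queue;
// messages sit there until the TTL expires, then get redelivered.
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Main queue that workers consume from.
	if _, err := ch.QueueDeclare("builds", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// Retry queue: after the per-queue TTL expires, messages are
	// dead-lettered back onto the default exchange with routing key "builds".
	if _, err := ch.QueueDeclare("builds.retry", true, false, false, false, amqp.Table{
		"x-message-ttl":             int32(30000), // 30s delay before redelivery
		"x-dead-letter-exchange":    "",           // default exchange
		"x-dead-letter-routing-key": "builds",
	}); err != nil {
		log.Fatal(err)
	}

	// To "requeue with backoff", publish the failed job to builds.retry
	// instead of re-publishing it to builds immediately.
	err = ch.Publish("", "builds.retry", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        []byte(`{"job_id": 12345}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```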

@bogdanap (Contributor)

I just thought I'd mention that we tried this before in June and we had to revert it: #465

@soulshake (Contributor, Author) commented Aug 30, 2018

LOL!!
Thank you @bogdanap
Does anyone remember more details about how/which rate limits we were hitting?

@igorwwwwwwwwwwwwwwwwwwww (Contributor) commented Aug 30, 2018

"Compute API, requests per 100 seconds" iirc. We got that quota raised a couple weeks ago, but it might still be too low for our peak load.

@igorwwwwwwwwwwwwwwwwwwww (Contributor)

I looked into this a bit more, and learned things.

So apparently requeues do not go directly through rabbitmq, but perform a whole round trip of hub => scheduler => rabbit => worker. This means we have more flexibility in how we implement backoff.

I've created a ticket (https://github.com/travis-ci/reliability/issues/166) and will be taking a look at the possibility of implementing backoff in scheduler.
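
As a rough illustration of what such backoff could look like: scheduler is a separate codebase, so the Go snippet below is only a sketch of the delay calculation (with made-up parameters), not actual scheduler code.

```go
// Illustrative only: exponential backoff with full jitter for requeue delays.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// requeueDelay returns how long to wait before requeueing a job that has
// already been requeued `attempt` times, capped at maxDelay.
func requeueDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	backoff := float64(base) * math.Pow(2, float64(attempt))
	if backoff > float64(maxDelay) {
		backoff = float64(maxDelay)
	}
	// Full jitter: pick a random delay in [0, backoff) to avoid
	// thundering-herd requeues when many jobs hit the quota at once.
	return time.Duration(rand.Float64() * backoff)
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, requeueDelay(attempt, 10*time.Second, 5*time.Minute))
	}
}
```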

@meatballhat (Contributor)

Outdated and now-invalid! 😿 💔

@meatballhat deleted the aj-null-rate-limiter branch on November 19, 2018 15:32