
Remove TRAVIS_WORKER_GCE_RATE_LIMIT_* from gce-production-4 worker.env #526

Closed
wants to merge 3 commits

Conversation

@soulshake (Contributor) commented Aug 29, 2018

What is the problem that this PR is trying to fix?

The PR is a step towards removing a dependency on an external Redis.

What approach did you choose and why?

If RATE_LIMIT_REDIS_URL is not present in the environment, worker falls back to a null rate limiter.

By removing RATE_LIMIT_REDIS_URL there should be no rate limiting at all.
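
For illustration only, a minimal Go sketch of the kind of fallback described above. The identifiers (`RateLimiter`, `nullRateLimiter`, `buildRateLimiter`) and the bare `RATE_LIMIT_REDIS_URL` variable name are placeholders, not necessarily the actual code in travis-ci/worker:

```go
// Hypothetical sketch: pick a rate limiter based on the environment.
// No Redis URL configured means no rate limiting at all.
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// RateLimiter gates outgoing cloud API calls.
type RateLimiter interface {
	// RateLimit reports whether another call may proceed right now.
	RateLimit(ctx context.Context, name string, maxCalls uint64, per time.Duration) (bool, error)
}

// nullRateLimiter never limits anything.
type nullRateLimiter struct{}

func (nullRateLimiter) RateLimit(ctx context.Context, name string, maxCalls uint64, per time.Duration) (bool, error) {
	return true, nil
}

// redisRateLimiter stands in for a Redis-backed implementation.
type redisRateLimiter struct{ url string }

func (r redisRateLimiter) RateLimit(ctx context.Context, name string, maxCalls uint64, per time.Duration) (bool, error) {
	// A real implementation would check and increment a counter in Redis here.
	return true, nil
}

func buildRateLimiter() RateLimiter {
	// In worker.env the variable carries the TRAVIS_WORKER_GCE_ prefix per the PR title;
	// the short form from the description is used here for illustration.
	redisURL := os.Getenv("RATE_LIMIT_REDIS_URL")
	if redisURL == "" {
		return nullRateLimiter{}
	}
	return redisRateLimiter{url: redisURL}
}

func main() {
	limiter := buildRateLimiter()
	ok, _ := limiter.RateLimit(context.Background(), "gce-api", 20, 100*time.Second)
	fmt.Println("allowed:", ok)
}
```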

What feedback would you like, if any?

What are the risks here? What actually happens if we exceed a rate limit?

@igorwwwwwwwwwwwwwwwwwwww says:

> According to the Quotas page, we are well within current quota limits. According to our Redis rate limiting metrics, we are still heavily rate limiting ourselves every now and then, but mostly we are not applying any rate limits.

Addresses travis-ci/worker#491

@igorwwwwwwwwwwwwwwwwwwww (Contributor)

The main risk is that if we do hit our quota, there will be requeues. And because there is afaik currently no backoff or limit on requeues, we could get into a hairy situation with constant requeueing.

At least I think that is a thing that could in theory happen. Might be worth improving our handling of errors/requeues in worker before we roll this out.

@soulshake (Contributor, Author)

@igorwwwwwwwwwwwwwwwwwwww
How should worker react when it hits a rate limit error? Sleep and try again? If so, wouldn't a requeue have the same effect?

In other words--isn't this kind of what requeues are for?

I suppose it depends on which step in the process a job is in--if we hit a rate limit error on an instance creation request, then a requeue seems as good as a sleep-and-try-again. If we hit it at, say, instance deletion, then obviously a requeue is not what we want. (I haven't checked the code, but I sure hope we don't trigger a requeue in that case.)

If there's any possibility of requeues making the problem exponentially worse, then that's a very real problem and we should definitely modify worker to handle it. Otherwise--I think I'd be in favor of just trying without the rate limiter and seeing what happens.

Thoughts?

@igorwwwwwwwwwwwwwwwwwwww (Contributor) commented Aug 30, 2018

> How should worker react when it hits a rate limit error? Sleep and try again? If so, wouldn't a requeue have the same effect?
>
> In other words--isn't this kind of what requeues are for?

Yes! :) However, afaik there is no backoff currently, so that is what I would add. There are some tricks with RabbitMQ TTLs and dead letter exchanges that can be used to implement backoff (example).

Thing is, I'm not sure if this is really a problem in practice. I'm happy to try removing the limit first and only implementing backoff if it actually becomes a problem.
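
For reference, a rough sketch of the TTL/dead-letter-exchange trick mentioned above, using the streadway/amqp Go client. The queue names, TTL value, and topology are made up for illustration and are not worker's actual setup:

```go
// Delayed requeueing via a TTL'd retry queue whose dead-letter exchange
// points back at the main work queue. Nothing consumes the retry queue;
// messages sit there until the TTL expires, then get redelivered.
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Main queue that workers consume from.
	if _, err := ch.QueueDeclare("builds", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// Retry queue: after the per-queue TTL expires, messages are
	// dead-lettered back onto the default exchange with routing key "builds".
	if _, err := ch.QueueDeclare("builds.retry", true, false, false, false, amqp.Table{
		"x-message-ttl":             int32(30000), // 30s delay before redelivery
		"x-dead-letter-exchange":    "",           // default exchange
		"x-dead-letter-routing-key": "builds",
	}); err != nil {
		log.Fatal(err)
	}

	// To "requeue with backoff", publish the failed job to builds.retry
	// instead of re-publishing it to builds immediately.
	err = ch.Publish("", "builds.retry", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        []byte(`{"job_id": 12345}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```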

@bogdanap (Contributor)

I just thought I'd mention that we tried this before in June and we had to revert it: #465

@soulshake (Contributor, Author) commented Aug 30, 2018

LOL!!
Thank you @bogdanap
Does anyone remember more details about how/which rate limits we were hitting?

@igorwwwwwwwwwwwwwwwwwwww (Contributor) commented Aug 30, 2018

"Compute API, requests per 100 seconds" iirc. We got that quota raised a couple weeks ago, but it might still be too low for our peak load.

@igorwwwwwwwwwwwwwwwwwwww (Contributor)

I looked into this a bit more, and learned things.

So apparently requeues do not go directly through rabbitmq, but perform a whole round trip of hub => scheduler => rabbit => worker. This means we have more flexibility in how we implement backoff.

I've created a ticket (https://github.com/travis-ci/reliability/issues/166) and will be taking a look at the possibility of implementing backoff in scheduler.
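
As a rough illustration of what such backoff could look like: scheduler is a separate codebase, so the Go snippet below is only a sketch of the delay calculation (with made-up parameters), not actual scheduler code.

```go
// Illustrative only: exponential backoff with full jitter for requeue delays.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// requeueDelay returns how long to wait before requeueing a job that has
// already been requeued `attempt` times, capped at maxDelay.
func requeueDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	backoff := float64(base) * math.Pow(2, float64(attempt))
	if backoff > float64(maxDelay) {
		backoff = float64(maxDelay)
	}
	// Full jitter: pick a random delay in [0, backoff) to avoid
	// thundering-herd requeues when many jobs hit the quota at once.
	return time.Duration(rand.Float64() * backoff)
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, requeueDelay(attempt, 10*time.Second, 5*time.Minute))
	}
}
```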

@meatballhat (Contributor)

Outdated and now-invalid! 😿 💔

@meatballhat deleted the aj-null-rate-limiter branch on November 19, 2018 15:32