-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retries to allow for rate limits on WIF #2
Conversation
62f46fa
to
7b64e60
Compare
3092fa1
to
6c50d93
Compare
I think this is what you had in mind - it'll retry the token request once, with 3-5 second gap before attempts for jitter. LMK if this is what you expected and I'll run some tests. |
del chunk, buf # pyright: ignore[reportUnboundVariable] | ||
del chunk, buf # pyright: ignore[reportPossiblyUnboundVariable] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was breaking PR pyright :(
@@ -769,7 +769,7 @@ def write(self, b: bytes) -> int: # type: ignore | |||
size = self._upload_buf(mv) | |||
self._buf = bytearray(mv[size:]) | |||
finally: | |||
del mv # pyright: ignore[reportUnboundVariable] | |||
del mv # pyright: ignore[reportPossiblyUnboundVariable] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was breaking PR pyright :(
@@ -19,3 +19,4 @@ dependencies: | |||
- boto3==1.15.18 | |||
- lxml-stubs==0.4.0 | |||
- xmltodict==0.13.0 | |||
- tenacity==8.2.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same version as we use in the monorepo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise LG tho
retry=retry_if_exception_type(ValueError), | ||
stop=stop_after_attempt(1), | ||
wait=wait_fixed(3) + wait_random(0, 2), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO we should:
- try more than once/twice (is that one retry, or one total attempt?)
- back off expontentially e.g. something like:
wait_random_exponential(multiplier=1, max=60)
(cribbed from here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took this from our existing approach that @l1n suggested - happy to retry more. how about trying ~4 times with exponential backoff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and yeah, it's one retry for a max of two attempts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly either way sgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was :sus: about 1 retry, so more sgtm. adding in, will test manually.
ok, i dove in here a bit more and i notice i'm confused:
We can calibrate this if we want to, e.g. adding more retries, but it seems like 60 is probably enough, and the root cause here was quota issues? |
oh humm good find. I think we should test this out to make sure. maybe mock out some inner function and raise inside it... |
ok, confirmed this does retry with exponential backoff and jitter. i did this in the maximally hacky way:
going to close this out since adding additional retries here seems unnecessary <3 Diff to retry failed login + give logs on retries
Script to trigger retriesRun this in the root directory of the blobfile repo (i.e. if you clone it to
LogsExited after waiting ~30 seconds
|
nice writeup! |
This matches what we did in the monorepo for fixing WIF in tensorstore (https://github.com/anthropics/anthropic/pull/28039/files) but moves it into Blobfile to ensure all users can take advantage.