Add retries to allow for rate limits on WIF #2

Closed
wants to merge 1 commit into from

@steveryb steveryb commented Feb 1, 2024

This matches what we did in the monorepo for fixing WIF in tensorstore (https://github.com/anthropics/anthropic/pull/28039/files) but moves it into Blobfile to ensure all users can take advantage.

@steveryb steveryb requested a review from keriwarr February 2, 2024 00:07
@steveryb steveryb marked this pull request as ready for review February 2, 2024 00:07

steveryb commented Feb 2, 2024

I think this is what you had in mind - it'll retry the token request once, with a 3-5 second gap between attempts for jitter. LMK if this is what you expected and I'll run some tests.
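For readers who want to see the described behavior in isolation, here is a minimal standard-library sketch of "retry once, after a fixed wait plus random jitter". It is not the code from this PR: the function name `call_with_one_retry`, its parameters, and the injectable `sleep` are assumptions made for illustration, and retrying on `ValueError` simply mirrors the exception type named in the review snippet.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_one_retry(
    fn: Callable[[], T],
    base_wait: float = 3.0,
    max_jitter: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on ValueError, wait 3-5 s (fixed base plus jitter) and try once more."""
    try:
        return fn()
    except ValueError:
        # First attempt failed: wait base_wait + U(0, max_jitter) seconds.
        sleep(base_wait + random.uniform(0.0, max_jitter))
    # Second and final attempt; any exception raised here reaches the caller.
    return fn()
```

Passing `sleep` as a parameter is only a convenience so the wait can be observed or skipped in tests.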

- del chunk, buf # pyright: ignore[reportUnboundVariable]
+ del chunk, buf # pyright: ignore[reportPossiblyUnboundVariable]
Author

was breaking PR pyright :(

@@ -769,7 +769,7 @@ def write(self, b: bytes) -> int: # type: ignore
size = self._upload_buf(mv)
self._buf = bytearray(mv[size:])
finally:
- del mv # pyright: ignore[reportUnboundVariable]
+ del mv # pyright: ignore[reportPossiblyUnboundVariable]
Author

was breaking PR pyright :(

@@ -19,3 +19,4 @@ dependencies:
   - boto3==1.15.18
   - lxml-stubs==0.4.0
   - xmltodict==0.13.0
+  - tenacity==8.2.2
Author

same version as we use in the monorepo.

@keriwarr keriwarr left a comment

otherwise LG tho

retry=retry_if_exception_type(ValueError),
stop=stop_after_attempt(1),
wait=wait_fixed(3) + wait_random(0, 2),
)
@keriwarr keriwarr Feb 2, 2024

IMHO we should:

  • try more than once/twice (is that one retry, or one total attempt?)
  • back off exponentially, e.g. something like: wait_random_exponential(multiplier=1, max=60) (cribbed from here)
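As a rough illustration of the two suggestions above, the sketch below uses only the standard library. It approximates the idea behind a random exponential wait (draw uniformly between 0 and a ceiling that doubles per attempt, capped at a maximum); it is not tenacity's implementation. The names `random_exponential_wait` and `call_with_backoff` are hypothetical, and the default of 4 total attempts is just an example value. Note that `max_attempts` counts total attempts, so 4 attempts means at most 3 retries.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def random_exponential_wait(
    attempt: int, multiplier: float = 1.0, max_wait: float = 60.0
) -> float:
    """Seconds to wait after the given 1-based failed attempt."""
    # The ceiling doubles with every attempt and is capped at max_wait.
    ceiling = min(max_wait, multiplier * 2 ** (attempt - 1))
    # Jitter: pick anywhere between 0 and the ceiling.
    return random.uniform(0.0, ceiling)


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to max_attempts times in total, backing off between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ValueError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(random_exponential_wait(attempt))
    raise AssertionError("unreachable")
```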

@steveryb steveryb Feb 2, 2024

took this from our existing approach that @l1n suggested - happy to retry more. how about trying ~4 times with exponential backoff?

Author

and yeah, it's one retry for a max of two attempts.

Contributor

honestly either way sgtm

Author

i was :sus: about 1 retry, so more sgtm. adding in, will test manually.

steveryb commented Feb 2, 2024

ok, i dove in here a bit more and i notice i'm confused:

We can calibrate this if we want to, e.g. adding more retries, but it seems like 60 is probably enough, and the root cause here was quota issues?

keriwarr commented Feb 2, 2024

oh humm good find. I think we should test this out to make sure. maybe mock out some inner function and raise inside it...
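One possible shape for such a test, sketched with `unittest.mock` from the standard library: the inner call is replaced by a mock that raises a couple of times before succeeding, and the sleep function is mocked so the test runs instantly. `fetch_with_retries` is a hypothetical stand-in for the code under test, not a blobfile function, and `ConnectionError` is an arbitrary choice of retryable error.

```python
import time
from unittest import mock


def fetch_with_retries(fetch, max_attempts=5, sleep=time.sleep):
    """Hypothetical stand-in for the code under test: retry fetch() with doubling waits."""
    wait = 0.1
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # last attempt failed: give up
            sleep(wait)
            wait *= 2
    raise ValueError("max_attempts must be at least 1")


def test_retries_then_succeeds():
    # The mocked inner function raises twice, then returns a token.
    fetch = mock.Mock(
        side_effect=[ConnectionError("boom"), ConnectionError("boom"), "token"]
    )
    sleep = mock.Mock()
    assert fetch_with_retries(fetch, sleep=sleep) == "token"
    assert fetch.call_count == 3
    # Waits double between attempts: 0.1 s, then 0.2 s.
    assert [c.args[0] for c in sleep.call_args_list] == [0.1, 0.2]
```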

steveryb commented Feb 9, 2024

ok, confirmed this does retry with exponential backoff and jitter. i did this in the maximally hacky way:

  1. ensured that i didn't have valid gcloud credentials, and verified this gives blobfile a 400 when it tries to get them
  2. edited the code to have it retry that 400, and give me logs about it (see diff at bottom)
  3. imported the library, tried to see if a file existed, watched it retry

going to close this out since adding additional retries here seems unnecessary <3

Diff to retry failed login + give logs on retries

diff --git a/blobfile/_common.py b/blobfile/_common.py
index cdbf451..c5bab1b 100644
--- a/blobfile/_common.py
+++ b/blobfile/_common.py
@@ -49,7 +49,7 @@ HOSTNAME_STATUS_UNKNOWN = 2
 
 GCP_BASE_URL = "https://storage.googleapis.com"
 
-DEFAULT_RETRY_CODES = (408, 429, 500, 502, 503, 504)
+DEFAULT_RETRY_CODES = (408, 429, 500, 502, 503, 504, 400)
 
 
 # https://github.com/christopher-hesse/blobfile/issues/153
@@ -467,6 +467,7 @@ def _read_with_deadline(
 
 def execute_request(conf: Config, build_req: Callable[[], Request]) -> "urllib3.BaseHTTPResponse":
     for attempt, backoff in enumerate(exponential_sleep_generator()):
+        print("Trying request")
         req = build_req()
         url = req.url
         if req.params is not None:
@@ -561,6 +562,7 @@ def execute_request(conf: Config, build_req: Callable[[], Request]) -> "urllib3.
                     message=message, request=req, response=resp
                 )
                 if resp.status not in req.retry_codes:
+                    print("Error", resp.status)
                     raise err
         except (
             urllib3.exceptions.ConnectTimeoutError,
diff --git a/blobfile/_gcp.py b/blobfile/_gcp.py
index e8eb80b..8fb67d2 100644
--- a/blobfile/_gcp.py
+++ b/blobfile/_gcp.py
@@ -440,7 +440,7 @@ def _get_service_account_access_token(conf: Config, creds: Dict[str, Any]):
         req = _create_access_token_request(
             creds=creds, scopes=["https://www.googleapis.com/auth/devstorage.full_control"]
         )
-        req.success_codes = (200, 400)
+        req.success_codes = (200,)
         return req
 
     resp = common.execute_request(conf, build_req)
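
Why the diff touches both constants can be seen in a small sketch of the decision it implies. This is a simplified model based only on the lines visible in the diff (a `success_codes` check and the `if resp.status not in req.retry_codes: raise err` branch); the function `classify_status` is hypothetical and does not exist in blobfile.

```python
from typing import Tuple

DEFAULT_RETRY_CODES: Tuple[int, ...] = (408, 429, 500, 502, 503, 504)


def classify_status(
    status: int,
    success_codes: Tuple[int, ...] = (200,),
    retry_codes: Tuple[int, ...] = DEFAULT_RETRY_CODES,
) -> str:
    """Return what a retry loop would do with this HTTP status."""
    if status in success_codes:
        return "success"  # response is handed back to the caller
    if status in retry_codes:
        return "retry"  # sleep, then try again
    return "raise"  # neither accepted nor retryable


# Before the experimental diff: the token request accepts 400 as a response.
print(classify_status(400, success_codes=(200, 400)))
# After the diff: 400 is no longer accepted and is listed as retryable.
print(classify_status(400, retry_codes=DEFAULT_RETRY_CODES + (400,)))
```

Under this model the first call prints `success` and the second prints `retry`, which is why the experiment needed both changes to make the 400 go through the retry path.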

Script to trigger retries

Run this in the root directory of the blobfile repo (i.e. if you clone it to ~/code/blobfile, put the file there and run python <yourname>.py).

import blobfile
with blobfile.BlobFile("gs://anthropic-sodium-us-east5/foo") as f:
	f.exists()

Logs

Exited after waiting ~30 seconds

blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Pragma: no-cache, Date: Fri, 09 Feb 2024 01:34:59 GMT, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 0, sleeping for 0.0 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Date: Fri, 09 Feb 2024 01:34:59 GMT, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Pragma: no-cache, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 1, sleeping for 0.0 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Expires: Mon, 01 Jan 1990 00:00:00 GMT, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Pragma: no-cache, Date: Fri, 09 Feb 2024 01:34:59 GMT, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 2, sleeping for 0.1 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Pragma: no-cache, Date: Fri, 09 Feb 2024 01:34:59 GMT, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 3, sleeping for 0.2 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Pragma: no-cache, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Date: Fri, 09 Feb 2024 01:35:00 GMT, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 4, sleeping for 0.2 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Date: Fri, 09 Feb 2024 01:35:00 GMT, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Pragma: no-cache, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 5, sleeping for 0.6 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Pragma: no-cache, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Date: Fri, 09 Feb 2024 01:35:01 GMT, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 6, sleeping for 1.0 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Expires: Mon, 01 Jan 1990 00:00:00 GMT, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Date: Fri, 09 Feb 2024 01:35:02 GMT, Pragma: no-cache, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 7, sleeping for 1.5 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Pragma: no-cache, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Date: Fri, 09 Feb 2024 01:35:03 GMT, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 8, sleeping for 12.7 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Expires: Mon, 01 Jan 1990 00:00:00 GMT, Date: Fri, 09 Feb 2024 01:35:16 GMT, Pragma: no-cache, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 9, sleeping for 8.3 seconds before retrying
Trying request
blobfile: error message=unexpected status 400, request=<Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None>, status=400, error=invalid_grant, error_description=reauth related error (invalid_rapt), error_headers=Pragma: no-cache, Date: Fri, 09 Feb 2024 01:35:25 GMT, Cache-Control: no-cache, no-store, max-age=0, must-revalidate, Expires: Mon, 01 Jan 1990 00:00:00 GMT, Content-Type: application/json; charset=utf-8, Vary: X-Origin, Vary: Referer, Vary: Origin,Accept-Encoding, Server: scaffolding on HTTPServer2, X-XSS-Protection: 0, X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000, Accept-Ranges: none, Transfer-Encoding: chunked when executing http request <Request method=POST url=https://www.googleapis.com/oauth2/v4/token params=None> attempt 10, sleeping for 45.8 seconds before retrying

@steveryb steveryb closed this Feb 9, 2024

keriwarr commented Feb 9, 2024

nice writeup!
