PyPI datasource fall back to simple API if JSON API fails #26483

Open
rarkins opened this issue Jan 3, 2024 · 2 comments · May be fixed by #31771 or #32024
Labels
datasource:pypi · priority-3-medium (Default priority, "should be done" but isn't prioritised ahead of others) · type:refactor (Refactoring or improving of existing code)

Comments

@rarkins
Collaborator

rarkins commented Jan 3, 2024

Describe the proposed change(s).

Similar to how we query rubygems.org, we should use the simple API first to get the list of releases and check for changes, and only then check whether the richer JSON API is available for querying metadata.

I.e. instead of falling back to the simple API as we do today, we should query simple first and fall back to the JSON API, or use the JSON API to enrich the simple results.

Algorithm:

  • Query simple API first
  • If simple fully fails, try the JSON API and return results
  • If simple succeeds, check JSON API cache to see if we have metadata for all releases. If we do, no need to query the JSON API and we can return the combined results
  • If the JSON API cache does not have metadata for all, query it, cache the new results, and return the merged data
  • If JSON API fails, but simple worked, return simple results
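
The steps above could be sketched roughly as follows. This is only an illustration in Python, not Renovate's actual code: `get_releases`, `fetch_simple`, `fetch_json` and `json_cache` are hypothetical names, the fetchers are passed in as plain callables, and the cache is assumed to be a dict keyed by package name. One assumption beyond the list: when the JSON API fails after simple succeeded, any metadata already in the cache is still attached to the simple results.

```python
from typing import Callable, Dict, List

# Hypothetical shape: version string -> metadata dict (empty if unknown).
Releases = Dict[str, dict]


def get_releases(
    package: str,
    fetch_simple: Callable[[str], List[str]],  # returns versions, raises on failure
    fetch_json: Callable[[str], Releases],     # returns metadata, raises on failure
    json_cache: Dict[str, Releases],           # assumed per-package metadata cache
) -> Releases:
    # 1. Query the simple API first.
    try:
        versions = fetch_simple(package)
    except Exception:
        # 2. Simple fully failed: try the JSON API and return its results.
        return fetch_json(package)

    # 3. Simple succeeded: if the cache covers every release, skip the JSON API.
    cached = json_cache.get(package, {})
    if all(v in cached for v in versions):
        return {v: cached[v] for v in versions}

    # 4. Cache is incomplete: query the JSON API, cache it, return merged data.
    try:
        fresh = fetch_json(package)
    except Exception:
        # 5. JSON API failed but simple worked: return the simple results
        #    (assumption: keep whatever metadata the cache already had).
        return {v: cached.get(v, {}) for v in versions}

    merged = {**cached, **fresh}
    json_cache[package] = merged
    return {v: merged.get(v, {}) for v in versions}
```

Passing the fetchers in as arguments keeps the fallback order testable without any network access.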
@rarkins rarkins added priority-3-medium Default priority, "should be done" but isn't prioritised ahead of others datasource:pypi labels Jan 3, 2024
@HonkingGoose HonkingGoose added the type:refactor Refactoring or improving of existing code label Jul 30, 2024
@samgiz samgiz linked a pull request Oct 3, 2024 that will close this issue
6 tasks
@rarkins
Collaborator Author

rarkins commented Oct 18, 2024

It seems we could do an "isSimpleIndex()" type check, e.g. compare https://download.pytorch.org/whl/cpu/ and https://pypi.org/simple/. However, the latter is 28MB in size. We could potentially just send a HEAD request to see if it's HTML and/or fetch the first X kilobytes and run a simple regex test. Unfortunately, we can't be sure that all PyPI registries support HEAD either.

Here's PyPI's simple index HEAD'd:

HTTP/1.1 200 OK
Connection: close
Content-Length: 29655064
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Content-Type, If-Match, If-Modified-Since, If-None-Match, If-Unmodified-Since
Access-Control-Allow-Methods: GET
Access-Control-Max-Age: 86400
Access-Control-Expose-Headers: X-PyPI-Last-Serial
X-PyPI-Last-Serial: 25523517
Cache-Control: max-age=600, public
ETag: "rc/wt+NfDaQWSmuG7IPFrg"
Content-Security-Policy: default-src 'none'; sandbox allow-top-navigation
Referrer-Policy: origin-when-cross-origin
Accept-Ranges: bytes
Date: Fri, 18 Oct 2024 05:37:30 GMT
X-Served-By: cache-iad-kjyo7100076-IAD, cache-bma1670-BMA
X-Cache: HIT, HIT
X-Cache-Hits: 5284, 0
X-Timer: S1729229850.248675,VS0,VE1
Content-Type: text/html
Vary: Accept, Accept-Encoding
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Permitted-Cross-Domain-Policies: none
Permissions-Policy: publickey-credentials-create=(self),publickey-credentials-get=(self),accelerometer=(),ambient-light-sensor=(),autoplay=(),battery=(),camera=(),display-capture=(),document-domain=(),encrypted-media=(),execution-while-not-rendered=(),execution-while-out-of-viewport=(),fullscreen=(),gamepad=(),geolocation=(),gyroscope=(),hid=(),identity-credentials-get=(),idle-detection=(),local-fonts=(),magnetometer=(),microphone=(),midi=(),otp-credentials=(),payment=(),picture-in-picture=(),screen-wake-lock=(),serial=(),speaker-selection=(),storage-access=(),usb=(),web-share=(),xr-spatial-tracking=()

Here's pytorch's:

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 2123
Connection: close
Date: Fri, 18 Oct 2024 05:38:59 GMT
Last-Modified: Fri, 18 Oct 2024 05:36:48 GMT
ETag: "9134fda89ba79a6952cef4713acd5e62"
x-amz-server-side-encryption: AES256
Cache-Control: no-cache,no-store,must-revalidate
x-amz-version-id: 8A1SuuzbJjFXnbG4ghIcKp7Aj2F1lIQb
Accept-Ranges: bytes
Server: AmazonS3
X-Cache: Miss from cloudfront
Via: 1.1 cdd16a503d54c28f3f13bc34669e77be.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: ARN53-P1
X-Amz-Cf-Id: H6A7nsx-YbXElg5qTMZglnmaOABGR52f2yuXL9EWNvQRrYXsB3GgQg==
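
Both responses above report `Content-Type: text/html` and `Accept-Ranges: bytes`, so a HEAD check followed by a small ranged GET looks feasible for these two hosts. A rough sketch of what such an "isSimpleIndex()" probe might look like, written in Python purely for illustration: the function names, the 4096-byte limit and the anchor-tag regex are all assumptions, and other registries may ignore `Range` or reject HEAD, which is why HEAD failures are tolerated here.

```python
import re
import urllib.request
from typing import Optional

# Assumed heuristic: a simple index is an HTML page containing anchor links.
ANCHOR_RE = re.compile(rb"<a\s[^>]*href=", re.IGNORECASE)


def is_html_content_type(content_type: Optional[str]) -> bool:
    """True if the Content-Type header value denotes text/html."""
    if content_type is None:
        return False
    return content_type.split(";")[0].strip().lower() == "text/html"


def looks_like_simple_index(content_type: Optional[str], body_prefix: bytes) -> bool:
    """Pure check on a header value plus the first bytes of the body."""
    return is_html_content_type(content_type) and ANCHOR_RE.search(body_prefix) is not None


def probe_simple_index(url: str, max_bytes: int = 4096, timeout: float = 10.0) -> bool:
    """HEAD first; then read at most max_bytes of the body and test it."""
    try:
        head = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(head, timeout=timeout) as resp:
            if not is_html_content_type(resp.headers.get("Content-Type")):
                return False
    except OSError:
        # HEAD may be unsupported or rejected; fall through to the ranged GET.
        pass

    get = urllib.request.Request(url, headers={"Range": f"bytes=0-{max_bytes - 1}"})
    with urllib.request.urlopen(get, timeout=timeout) as resp:
        # read(max_bytes) caps the download even if the server ignores Range.
        prefix = resp.read(max_bytes)
        return looks_like_simple_index(resp.headers.get("Content-Type"), prefix)
```

Capping the read means the 28MB PyPI index would never be downloaded in full, whether or not the server honours the `Range` header.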

@rarkins rarkins linked a pull request Oct 19, 2024 that will close this issue
6 tasks
@rarkins rarkins changed the title from "PyPI datasource try simple first" to "PyPI datasource fall back to simple API if JSON API fails" Oct 19, 2024
@rarkins
Collaborator Author

rarkins commented Oct 19, 2024

Update: change of approach in #32024
