Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AllowOffsiteMiddleware: allow all provider requests #108

Merged
merged 1 commit into from
Dec 24, 2024

Conversation

Gallaecio
Copy link
Contributor

No description provided.

Comment on lines +145 to +149
if "zyte_api" in request.meta:
# The request looks like a dependency injection request, and any
# domain-based filtering should have been handled in the original
# request handling, before dependency injection.
return True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement is a bit hacky. I wonder if we should implement something in scrapy-poet or in scrapy-zyte-api to allow a better way to check if a request is coming from a provider, for downloader middlewares that are only supposed to work on original requests and should not affect provider requests.

Copy link
Contributor

@kmike kmike Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Sorry, could you please elaborate - why should AllowOffsiteMiddleware skip all provider requests, even if they are offsite?

It also looks like "zyte_api" in request.meta check is true for regular requests which use Zyte API - does it mean that the middleware is now disabled for all requests, if transparent mode is configured? Or is there some trick with middleware order, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should AllowOffsiteMiddleware skip all provider requests, even if they are offsite?

Provider requests are triggered by regular requests. Regular requests reach the middleware first (due to the middleware order compared to the provider middleware), and if a regular request is dropped by the middleware, the provider requests that it would trigger are never triggered. If the middleware does allow a regular request, currently the middleware then gets the provider requests, and drops them because they do not have the "allow_offsite": True meta.

I believe, for this specific middleware, that the intention is for the "allow_offsite": True meta of the regular request to also apply to triggered provider requests.

Since regular requests are already filtered out by the middleware before they trigger provider requests, allowing all provider requests is one way to achieve that.

We could also figure out a way to get the "allow_offsite": True meta from regular requests injected into the triggered provider requests. But I don’t think there is a straightforward way to do that with the current API, and hence me wondering if we should make changes to allow for that (or for some other alternative approach cleaner than the one proposed here).

It also looks like "zyte_api" in request.meta check is true for regular requests which use Zyte API - does it mean that the middleware is now disabled for all requests, if transparent mode is configured? Or is there some trick with middleware order, etc.?

It’s kind of a trick of the middleware order. Unless you send a request specifically with zyte_api, requests will never have zyte_api set before they reach the download handler, which is the one that reads the params from the request, including zyte_api_automap, so zyte_api is never really set automatically in transparent mode (except by the scrapy-zyte-api provider when building its own requests).

The proposed approach would only become a problem is zyte_api is manually set on a request, either by us in a future iteration of a spider where for some reason we decide not to use automatic parameter mapping, or by users in custom spiders or subclasses. Which is indeed a problem, and I do think in the long term we should find a better solution.

@Gallaecio Gallaecio merged commit 67e644b into zytedata:main Dec 24, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants