-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AllowOffsiteMiddleware: allow all provider requests #108
Conversation
if "zyte_api" in request.meta: | ||
# The request looks like a dependency injection request, and any | ||
# domain-based filtering should have been handled in the original | ||
# request handling, before dependency injection. | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if statement is a bit hacky. I wonder if we should implement something in scrapy-poet or in scrapy-zyte-api to allow a better way to check if a request is coming from a provider, for downloader middlewares that are only supposed to work on original requests and should not affect provider requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! Sorry, could you please elaborate - why should AllowOffsiteMiddleware skip all provider requests, even if they are offsite?
It also looks like "zyte_api" in request.meta
check is true for regular requests which use Zyte API - does it mean that the middleware is now disabled for all requests, if transparent mode is configured? Or is there some trick with middleware order, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why should AllowOffsiteMiddleware skip all provider requests, even if they are offsite?
Provider requests are triggered by regular requests. Regular requests reach the middleware first (due to the middleware order compared to the provider middleware), and if a regular request is dropped by the middleware, the provider requests that it would trigger are never triggered. If the middleware does allow a regular request, currently the middleware then gets the provider requests, and drops them because they do not have the "allow_offsite": True
meta.
I believe, for this specific middleware, that the intention is for the "allow_offsite": True
meta of the regular request to also apply to triggered provider requests.
Since regular requests are already filtered out by the middleware before they trigger provider requests, allowing all provider requests is one way to achieve that.
We could also figure out a way to get the "allow_offsite": True
meta from regular requests injected into the triggered provider requests. But I don’t think there is a straightforward way to do that with the current API, and hence me wondering if we should make changes to allow for that (or for some other alternative approach cleaner than the one proposed here).
It also looks like
"zyte_api" in request.meta
check is true for regular requests which use Zyte API - does it mean that the middleware is now disabled for all requests, if transparent mode is configured? Or is there some trick with middleware order, etc.?
It’s kind of a trick of the middleware order. Unless you send a request specifically with zyte_api
, requests will never have zyte_api
set before they reach the download handler, which is the one that reads the params from the request, including zyte_api_automap
, so zyte_api
is never really set automatically in transparent mode (except by the scrapy-zyte-api provider when building its own requests).
The proposed approach would only become a problem is zyte_api
is manually set on a request, either by us in a future iteration of a spider where for some reason we decide not to use automatic parameter mapping, or by users in custom spiders or subclasses. Which is indeed a problem, and I do think in the long term we should find a better solution.
No description provided.