This package provides a Scrapy Downloader Middleware to transparently interact with the Crawlera Fetch API.
Requirements:

- Python 3.5+
- Scrapy 1.6+
Not yet available on PyPI. However, it can be installed directly from GitHub:

```shell
pip install git+ssh://[email protected]/scrapy-plugins/scrapy-crawlera-fetch.git
```

or

```shell
pip install git+https://github.com/scrapy-plugins/scrapy-crawlera-fetch.git
```
Enable the `CrawleraFetchMiddleware` via the `DOWNLOADER_MIDDLEWARES` setting:

```python
DOWNLOADER_MIDDLEWARES = {
    "crawlera_fetch.CrawleraFetchMiddleware": 585,
}
```
Please note that the middleware needs to be placed before the built-in `HttpCompressionMiddleware`
(which has a priority of 590); otherwise incoming responses will be compressed and the
Crawlera middleware won't be able to handle them.
The following settings are supported:

- `CRAWLERA_FETCH_ENABLED` (type `bool`, default `False`). Whether or not the middleware will be enabled, i.e. whether requests should be downloaded using the Crawlera Fetch API
- `CRAWLERA_FETCH_APIKEY` (type `str`). API key to be used to authenticate against the Crawlera endpoint (mandatory if enabled)
- `CRAWLERA_FETCH_URL` (type `str`, default `"http://fetch.crawlera.com:8010/fetch/v2/"`). The endpoint of a specific Crawlera instance
- `CRAWLERA_FETCH_RAISE_ON_ERROR` (type `bool`, default `True`). Whether or not the middleware will raise an exception if an error occurs while downloading or decoding a request. If `False`, a warning will be logged and the raw upstream response will be returned upon encountering an error.
- `CRAWLERA_FETCH_DOWNLOAD_SLOT_POLICY` (type `enum.Enum` - `crawlera_fetch.DownloadSlotPolicy`, default `DownloadSlotPolicy.Domain`). Possible values are `DownloadSlotPolicy.Domain`, `DownloadSlotPolicy.Single` and `DownloadSlotPolicy.Default` (Scrapy default). If set to `DownloadSlotPolicy.Domain`, please consider setting `SCHEDULER_PRIORITY_QUEUE="scrapy.pqueues.DownloaderAwarePriorityQueue"` to make better usage of concurrency options and avoid delays.
- `CRAWLERA_FETCH_DEFAULT_ARGS` (type `dict`, default `{}`). Default values to be sent to the Crawlera Fetch API. For instance, set to `{"device": "mobile"}` to render all requests with a mobile profile.
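Putting the settings above together, a minimal `settings.py` fragment might look like the following sketch (the API key value is a placeholder, and only the mandatory settings are shown):

```python
# Minimal example configuration for the middleware; all other settings
# keep their documented defaults.
CRAWLERA_FETCH_ENABLED = True
CRAWLERA_FETCH_APIKEY = "<your API key>"  # placeholder, replace with a real key

DOWNLOADER_MIDDLEWARES = {
    # Must run before HttpCompressionMiddleware (priority 590)
    "crawlera_fetch.CrawleraFetchMiddleware": 585,
}
```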
Since the URL for outgoing requests is modified by the middleware, by default the logs will show
the URL for the Crawlera endpoint. To revert this behaviour you can enable the provided
log formatter by overriding the `LOG_FORMATTER` setting:

```python
LOG_FORMATTER = "crawlera_fetch.CrawleraFetchLogFormatter"
```
Note that the ability to override the error messages for spider and download errors was added
in Scrapy 2.0. When using a previous version, the middleware will add the original request URL
to the `Request.flags` attribute, which is shown in the logs by default.
If the middleware is enabled, by default all requests will be redirected to the specified
Crawlera Fetch endpoint, and modified to comply with the format expected by the Crawlera Fetch API.
The three basic processed arguments are `method`, `url` and `body`.
For instance, the following request:

```python
Request(url="https://httpbin.org/post", method="POST", body="foo=bar")
```

will be converted to:

```python
Request(url="<Crawlera Fetch API endpoint>", method="POST",
        body='{"url": "https://httpbin.org/post", "method": "POST", "body": "foo=bar"}',
        headers={"Authorization": "Basic <derived from APIKEY>",
                 "Content-Type": "application/json",
                 "Accept": "application/json"})
```
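The JSON body shown above can be reproduced with the standard library. The sketch below is illustrative only (the real middleware handles more fields, such as headers and extra API arguments), and `build_fetch_payload` is a hypothetical helper, not part of the package:

```python
import json


def build_fetch_payload(url, method="GET", body=""):
    """Serialize a request into the JSON body expected by the Fetch API.

    Illustrative sketch: only the three basic arguments (url, method, body)
    are included here.
    """
    return json.dumps({"url": url, "method": method, "body": body})


payload = build_fetch_payload("https://httpbin.org/post", "POST", "foo=bar")
```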
Additional arguments could be specified under the `crawlera_fetch.args` `Request.meta` key. For instance:

```python
Request(
    url="https://example.org",
    meta={"crawlera_fetch": {"args": {"region": "us", "device": "mobile"}}},
)
```
is translated into the following body:

```python
'{"url": "https://example.org", "method": "GET", "body": "", "region": "us", "device": "mobile"}'
```

Arguments set for a specific request through the `crawlera_fetch.args` key override those
set with the `CRAWLERA_FETCH_DEFAULT_ARGS` setting.
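The override behaviour can be pictured as a plain dictionary merge in which per-request arguments take precedence. This is a sketch of that assumption, not the middleware's actual code:

```python
# Values from the CRAWLERA_FETCH_DEFAULT_ARGS setting (example values).
default_args = {"device": "desktop", "region": "us"}

# Values from the crawlera_fetch.args key in a specific Request.meta.
request_args = {"device": "mobile"}

# Per-request arguments win over the defaults.
effective_args = {**default_args, **request_args}
```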
The `url`, `method`, `headers` and `body` attributes of the original request are available under
the `crawlera_fetch.original_request` `Response.meta` key.

The `status`, `headers` and `body` attributes of the upstream Crawlera response are available under
the `crawlera_fetch.upstream_response` `Response.meta` key.
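A spider callback can read both of these keys from `response.meta`. The helper and the sample meta contents below are hypothetical, assuming only the meta layout documented above; in a real callback you would pass `response.meta`:

```python
def describe_fetch(meta):
    """Summarize the Crawlera Fetch metadata attached to a response.

    Hypothetical helper; assumes the documented meta layout.
    """
    original = meta["crawlera_fetch"]["original_request"]
    upstream = meta["crawlera_fetch"]["upstream_response"]
    return "{method} {url} -> upstream status {status}".format(
        method=original["method"], url=original["url"], status=upstream["status"])


# Illustrative meta contents, mirroring the attributes listed above.
sample_meta = {
    "crawlera_fetch": {
        "original_request": {"url": "https://example.org", "method": "GET",
                             "headers": {}, "body": ""},
        "upstream_response": {"status": 200, "headers": {}, "body": "<html></html>"},
    }
}
```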
You can instruct the middleware to skip a specific request by setting the `crawlera_fetch.skip` `Request.meta` key:

```python
Request(
    url="https://example.org",
    meta={"crawlera_fetch": {"skip": True}},
)
```