- When the :ref:`dynamic dependencies <dynamic-deps>` are annotated with :data:`typing.Annotated`, the keys in the resulting :class:`~.DynamicDeps` instance are now not annotated.
- Improved the error message when passing incorrect values in the
"inject"
meta key. - Fixed documentation builds with
sphinx-rtd-theme
3.0.0+.
- Added support for :ref:`specifying callback dependencies dynamically <dynamic-deps>`.
- Raising :class:`~web_poet.exceptions.core.Retry` now also works as expected when callbacks ask for an item instead of asking for its page object.
- When :class:`~web_poet.exceptions.core.Retry` is raised with a message, that
message now becomes the retry reason, replacing the default
(
page_object_retry
).
- :ref:`Additional requests <additional-requests>`, when mapped to
:class:`scrapy.Request <scrapy.http.Request>` objects, now get their
dont_filter
parameter set toTrue
, to ask downloader middlewares like :class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` not to drop those requests.
- :func:`scrapy_poet.utils.testing.make_crawler` now respects setting priorities when it receives a :class:`~scrapy.settings.Settings` object instead of a :class:`dict`.
- :class:`~scrapy_poet.page_input_providers.HttpRequestProvider`, added in 0.17.0, is now actually enabled by default.
- Fixed
scrapy savefixture
not finding page object modules when used outside a Scrapy project.
- Now requires
web-poet >= 0.17.0
andtime_machine >= 2.7.1
. - Removed
scrapy_poet.AnnotatedResult
, use :class:`web_poet.annotated.AnnotatedInstance` instead. - Added support for annotated dependencies to the
scrapy savefixture
command. - Test improvements.
- Added a
.weak_cache
to :class:`scrapy_poet.injection.Injector` which stores instances created by providers as long as the :class:`scrapy.Request <scrapy.http.Request>` exists. - Fixed the incorrect value of
downloader/response_count
in the stats due to additional counting of :class:`scrapy_poet.api.DummyResponse`. - Fixed the detection of :class:`scrapy_poet.api.DummyResponse` when some type hints are annotated using strings.
- :class:`~scrapy_poet.ScrapyPoetRequestFingerprinter` now supports item dependencies.
- Add :class:`~scrapy_poet.ScrapyPoetRequestFingerprinter`, a request fingerprinter that uses request dependencies in the fingerprint generation.
- Now requires
andi >= 0.6.0
. - Changed the implementation of resolving and building item dependencies from
page objects. Now
andi
custom builders are used to create a single plan that includes building page objects and items. This fixes problems such as providers being called multiple times.- :class:`~scrapy_poet.page_input_providers.ItemProvider` is now no-op. It's no longer enabled by default and users should also stop enabling it.
PageObjectInputProvider.allow_prev_instances
and code related to it were removed so custom providers may need updating.
- Fixed some tests.
- Now requires
andi >= 0.5.0
. - Add support for dependency metadata via
typing.Annotated
(requires Python 3.9+).
- Now requires
web-poet >= 0.15.1
. - :class:`~web_poet.page_inputs.http.HttpRequest` dependencies are now supported, via :class:`~scrapy_poet.page_input_providers.HttpRequestProvider` (enabled by default).
- Enable :class:`~scrapy_poet.page_input_providers.StatsProvider`, which provides :class:`~web_poet.page_inputs.stats.Stats` dependencies, by default.
- More robust disabling of
:class:`~scrapy_poet.downloadermiddlewares.InjectionMiddleware` in the
scrapy savefixture
command. - Official support for Python 3.12.
- Fix the bug that caused requests produced by
:class:`~scrapy_poet.page_input_providers.HttpClientProvider` to
be treated as if they need arguments of the
parse
callback as dependencies, which could cause returning an empty response and/or making extra provider calls.
- Now requires
time_machine >= 2.2.0
. ItemProvider
now supports page objects that declare a dependency on the same type of item that they return, as long as there is an earlier page object input provider that can provide such dependency.- Fix running tests with Scrapy 2.11.
- :ref:`scrapy-poet stats <stats>` now also include counters for injected
dependencies (
poet/injector/<dependency import path>
). - All scrapy-poet stats that used to be prefixed with
scrapy-poet/
are now prefixed withpoet/
instead.
- Now requires
web-poet >= 0.15.0
. - :external+web-poet:ref:`Web-poet stats <stats>` are now :ref:`supported <stats>`.
Python 3.7 support has been dropped.
Caching is now built on top of web-poet serialization, extending caching support to additional inputs, while making our code simpler, more reliable, and more future-proof.
This has resulted in a few backward-incompatible changes:
- The
scrapy_poet.page_input_providers.CacheDataProviderMixin
mixin class has been removed. Providers no longer need to use it or reimplement its methods. - The
SCRAPY_POET_CACHE_GZIP
setting has been removed.
- The
Added
scrapy_poet.utils.open_in_browser
, an alternative toscrapy.utils.response.open_in_browser
that supports scrapy-poet.Fixed some documentation links.
- Now requires
web-poet >= 0.12.0
. - The
scrapy savefixture
command now uses the adapter from theSCRAPY_POET_TESTS_ADAPTER
setting to save the fixture. - Fix a typo in the docs.
- Now requires
web-poet >= 0.11.0
. - The
scrapy savefixture
command can now generate tests that expect thatto_item()
raises a specific exception (only :class:`web_poet.exceptions.PageObjectAction` and its descendants are expected). - Fixed an error when using
scrapy shell
with :class:`scrapy_poet.InjectionMiddleware` enabled. - Add a
twine check
CI check.
- The
scrapy savefixture
command can now generate a fixture :ref:`using an existing spider <fixture-spiders>`.
- More robust time freezing in
scrapy savefixture
command.
- Now requires
web-poet >= 0.8.0
. - The
savefixture
command now also saves requests made via the :class:`web_poet.page_inputs.client.HttpClient` dependency and their responses.
Added support for item classes which are used as dependencies in page objects and spider callbacks. The following is now possible:
import attrs import scrapy from web_poet import WebPage, handle_urls, field from scrapy_poet import DummyResponse @attrs.define class Image: url: str @handle_urls("example.com") class ProductImagePage(WebPage[Image]): @field def url(self) -> str: return self.css("#product img ::attr(href)").get("") @attrs.define class Product: name: str image: Image @handle_urls("example.com") @attrs.define class ProductPage(WebPage[Product]): # ✨ NEW: The page object can ask for items as dependencies. An instance # of ``Image`` is injected behind the scenes by calling the ``.to_item()`` # method of ``ProductImagePage``. image_item: Image @field def name(self) -> str: return self.css("h1.name ::text").get("") @field def image(self) -> Image: return self.image_item class MySpider(scrapy.Spider): name = "myspider" def start_requests(self): yield scrapy.Request( "https://example.com/products/some-product", self.parse_product ) # ✨ NEW: We can directly use the item here instead of the page object. def parse_product(self, response: DummyResponse, item: Product) -> Product: return item
In line with this, the following new features were made:
- New :class:`scrapy_poet.page_input_providers.ItemProvider` which makes the usage above possible.
- An item class is now supported by :func:`scrapy_poet.callback_for` alongside the usual page objects. This means that it won't raise a :class:`TypeError` anymore when not passing a subclass of :class:`web_poet.pages.ItemPage`.
- New exception: :class:`scrapy_poet.injection_errors.ProviderDependencyDeadlockError`. This is raised when it's not possible to create the dependencies due to a deadlock in their sub-dependencies, e.g. due to a circular dependency between page objects.
New setting named
SCRAPY_POET_RULES
having a default value of :meth:`web_poet.default_registry.get_rules <web_poet.rules.RulesRegistry.get_rules>`. This deprecatesSCRAPY_POET_OVERRIDES
.New setting named
SCRAPY_POET_DISCOVER
to ensure thatSCRAPY_POET_RULES
have properly loaded all intended rules annotated with the@handle_urls
decorator.New utility functions in
scrapy_poet.utils.testing
.The
frozen_time
value inside the :ref:`test fixtures <testing>` won't contain microseconds anymore.Supports the new :func:`scrapy.http.request.NO_CALLBACK` introduced in Scrapy 2.8. This means that the :ref:`pitfalls` (introduced in
scrapy-poet==0.7.0
) doesn't apply when you're using Scrapy >= 2.8, unless you're using third-party middlewares which directly uses the downloader to add :class:`scrapy.Request <scrapy.http.Request>` instances with callback set toNone
. Otherwise, you need to set the callback value to :func:`scrapy.http.request.NO_CALLBACK`.Fix the :class:`TypeError` that's raised when using Twisted <= 21.7.0 since scrapy-poet was using
twisted.internet.defer.Deferred[object]
type annotation before which was not subscriptable in the early Twisted versions.Fix the
twisted.internet.error.ReactorAlreadyInstalledError
error raised when using thescrapy savefixture
command and Twisted < 21.2.0 is installed.Fix test configuration that doesn't follow the intended commands and dependencies in these tox environments:
min
,asyncio-min
, andasyncio
. This ensures that page objects usingasyncio
should work properly, alongside the minimum specified Twisted version.Various improvements to tests and documentation.
Backward incompatible changes:
For the :class:`scrapy_poet.page_input_providers.PageObjectInputProvider` base class:
- It now accepts an instance of :class:`scrapy_poet.injection.Injector`
in its constructor instead of :class:`scrapy.crawler.Crawler`. Although
you can still access the :class:`scrapy.crawler.Crawler` via the
Injector.crawler
attribute. - :meth:`scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided` is now an instance method instead of a class method.
- It now accepts an instance of :class:`scrapy_poet.injection.Injector`
in its constructor instead of :class:`scrapy.crawler.Crawler`. Although
you can still access the :class:`scrapy.crawler.Crawler` via the
The :class:`scrapy_poet.injection.Injector`'s attribute and constructor parameter called
overrides_registry
is now simply calledregistry
.Removed the
SCRAPY_POET_OVERRIDES_REGISTRY
setting which overrides the default registry.The
scrapy_poet.overrides
module which containedOverridesRegistryBase
andOverridesRegistry
has now been removed. Instead, scrapy-poet directly uses :class:`web_poet.rules.RulesRegistry`.Everything should pretty much the same except for :meth:`web_poet.rules.RulesRegistry.overrides_for` now accepts :class:`str`, :class:`web_poet.page_inputs.http.RequestUrl`, or :class:`web_poet.page_inputs.http.ResponseUrl` instead of :class:`scrapy.http.Request`.
This also means that the registry doesn't accept tuples as rules anymore. Only :class:`web_poet.rules.ApplyRule` instances are allowed. The same goes for
SCRAPY_POET_RULES
(and the deprecatedSCRAPY_POET_OVERRIDES
).The following type aliases have been removed:
scrapy_poet.overrides.RuleAsTuple
scrapy_poet.overrides.RuleFromUser
- Now requires
web-poet >= 0.7.0
andtime_machine
. - Added a
savefixture
command that creates a test for a page object. See :ref:`testing` for more information.
Fixed the issue where a new page object containing a new response data is not properly created when :class:`web_poet.exceptions.core.Retry` is raised.
In order for the above fix to be possible, overriding the callback dependencies created by scrapy-poet via :attr:`scrapy.http.Request.cb_kwargs` is now unsupported. This is a backward incompatible change.
Fixed the broken :meth:`scrapy_poet.page_input_providers.HttpResponseProvider.fingerprint` which errors out when running a Scrapy job using the
SCRAPY_POET_CACHE
enabled.Improved behavior when
spider.parse()
method arguments are supposed to be provided by scrapy-poet. Previously, it was causing unnecessary work in unexpected places like :class:`scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware`, :class:`scrapy.pipelines.images.ImagesPipeline` or :class:`scrapy.pipelines.files.FilesPipeline`. It is also a reason :class:`web_poet.page_inputs.client.HttpClient` might not be working in page objects. Now these cases are detected, and a warning is issued.As of Scrapy 2.7, it is not possible to fix the issue completely in scrapy-poet. Fixing it would require Scrapy changes; some 3rd party libraries may also need to be updated.
Note
The root of the issue is that when request.callback is
None
,parse()
callback is assumed normally. But sometimes callback=None is used when :class:`scrapy.http.Request` is added to the Scrapy's downloader directly, in which case no callback is used. Middlewares, including scrapy-poet's, can't distinguish between these two cases, which causes all kinds of issues.We recommend all scrapy-poet users to modify their code to avoid the issue. Please don't define
parse()
method with arguments which are supposed to be filled by scrapy-poet, and rename the existingparse()
methods if they have such arguments. Any other name is fine. It avoids all possible issues, including incompatibility with 3rd party middlewares or pipelines.See the new :ref:`pitfalls` documentation for more information.
There are backwards-incompatible changes related to this issue. They only affect you if you don't follow the advice of not using
parse()
method with scrapy-poet.When the
parse()
method has its response argument annotated with :class:`scrapy_poet.api.DummyResponse`, for instance:def parse(self, response: DummyResponse)
, the response is downloaded instead of being skipped.When the
parse()
method has dependencies that are provided by scrapy-poet, the :class:`scrapy_poet.downloadermiddlewares.InjectionMiddleware` won't attempt to build any dependencies anymore.This causes the following code to have this error
TypeError: parse() missing 1 required positional argument: 'page'.
:class MySpider(scrapy.Spider): name = "my_spider" start_urls = ["https://books.toscrape.com"] def parse(self, response: scrapy.http.Response, page: MyPage): ...
:func:`scrapy_poet.injection.is_callback_requiring_scrapy_response` now accepts an optional
raw_callback
parameter meant to represent the actual callback attribute value of :class:`scrapy.http.Request` since the originalcallback
parameter could be normalized to the spider'sparse()
method when the :class:`scrapy.http.Request` hascallback
set toNone
.Official support for Python 3.11
Various updates and improvements on docs and examples.
Now requires
web-poet >= 0.6.0
.- All examples in the docs and tests now use
web_poet.WebPage
instead ofweb_poet.ItemWebPage
. - The new
instead_of
parameter of the@handle_urls
decorator is now preferred instead of the deprecatedoverrides
parameter. scrapy_poet.callback_for
doesn't require an implementedto_item
method anymore.- The new
web_poet.rules.RulesRegistry
is used instead of the oldweb_poet.overrides.PageObjectRegistry
. - The Registry now uses
web_poet.ApplyRule
instead ofweb_poet.OverrideRule
.
- All examples in the docs and tests now use
Provider for
web_poet.ResponseUrl
is added, which allows to access the response URL in the page object. This triggers a download unlike the provider forweb_poet.RequestUrl
.Fixes the error when using
scrapy shell
while thescrapy_poet.InjectionMiddleware
is enabled.Fixes and improvements on code and docs.
Fixes the minimum web-poet version being 0.5.0 instead of 0.4.0.
This release implements support for page object retries, introduced in web-poet 0.4.0.
To enable retry support, you need to configure a new spider middleware in your Scrapy settings:
SPIDER_MIDDLEWARES = { "scrapy_poet.RetryMiddleware": 275, }
web-poet 0.4.0 is now the minimum required version of web-poet.
This release is backwards incompatible, following backwards-incompatible changes in web-poet 0.2.0.
The main new feature is support for web-poet >= 0.2.0
, including
support for async def to_item
methods, making additional requests
in the to_item
method, new Page Object dependencies, and the new way
to configure overrides.
Changes in line with web-poet >= 0.2.0
:
web_poet.HttpResponse
replacesweb_poet.ResponseData
as a dependency to use.Additional requests inside Page Objects: a provider for
web_poet.HttpClient
, as well asweb_poet.HttpClient
backend implementation, which uses Scrapy downloader.callback_for
now supports Page Objects which defineasync def to_item
method.Provider for
web_poet.PageParams
is added, which usesrequest.meta["page_params"]
value.Provider for
web_poet.RequestUrl
is added, which allows to access the request URL in the page object without triggering the download.We have these backward incompatible changes since the
web_poet.OverrideRule
follow a different structure:- Deprecated
PerDomainOverridesRegistry
in lieu of the newerOverridesRegistry
which provides a wide variety of features for better URL matching. - This resuls in a newer format in the
SCRAPY_POET_OVERRIDES
setting.
- Deprecated
Other changes:
- New
scrapy_poet/dummy_response_count
value appears in Scrapy stats; it is the number of timesDummyResponse
is used instead of downloading the response as usual. scrapy.utils.reqser
deprecated module is no longer used by scrapy-poet.
Dependency updates:
- The minimum supported Scrapy version is now
2.6.0
. - The minimum supported web-poet version is now
0.2.0
.
- Cache mechanism using
SCRAPY_POET_CACHE
- Fixed and improved docs
- removed support for Python 3.6
- added support for Python 3.10
- Improved logging message for DummyResponse
- various internal cleanups
- Overrides support
- New providers interface
- One provider can provide many types at once
- Single instance during the whole spider lifespan
- Registration is now explicit and done in the spider settings
- CI is migrated from Travis to Github Actions
- Python 3.9 support
- Documentation improvements
- providers can now access various Scrapy objects: Crawler, Settings, Spider, Request, Response, StatsCollector
The repository is renamed to scrapy-poet
, and split into two:
web-poet
(https://github.com/scrapinghub/web-poet) contains definitions and code useful for writing Page Objects for web data extraction - it is not tied to Scrapy;scrapy-poet
(this package) provides Scrapy integration for such Page Objects.
API of the library changed in a backwards incompatible way; see README and examples.
New features:
DummyResponse
annotation allows to skip downloading of scrapy Response.callback_for
works for Scrapy disk queues if it is used to create a spider method (but not in its inline form)- Page objects may require page objects as dependencies; dependencies are resolved recursively and built as needed.
- InjectionMiddleware supports
async def
and asyncio providers.
Initial release.