diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 54b14cdc..7b886681 100644
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -61,61 +61,67 @@
 in page objects and spider callbacks. The following is now possible:

 In line with this, the following changes were made:

-* Added a new ``scrapy_poet.page_input_providers.ItemProvider`` which makes
-  the usage above possible.
-* Multiple changes to the ``scrapy_poet.PageObjectInputProvider`` base class
-  which are backward incompatible:
-
-    * It now accepts an instance of ``scrapy_poet.injection.Injector`` in its
-      constructor instead of ``scrapy.crawler.Crawler``. Although you can
-      still access the ``scrapy.crawler.Crawler`` via the ``Injector.crawler``
-      attribute.
-    * ``is_provided()`` is now an instance method instead of a class
-      method.
-
-* The ``scrapy_poet.injection.Injector``'s attribute and constructor parameter
-  called ``overrides_registry`` is now simply called ``registry``.
+* Added a new :class:`scrapy_poet.page_input_providers.ItemProvider` which
+  makes the usage above possible.
+* Multiple changes to the
+  :class:`scrapy_poet.page_input_providers.PageObjectInputProvider` base
+  class which are backward incompatible:
+
+    * It now accepts an instance of :class:`scrapy_poet.injection.Injector`
+      in its constructor instead of :class:`scrapy.crawler.Crawler`. You can,
+      however, still access the :class:`scrapy.crawler.Crawler` via the
+      ``Injector.crawler`` attribute.
+    * :meth:`scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided`
+      is now an instance method instead of a class method.
+
+* The :class:`scrapy_poet.injection.Injector`'s attribute and constructor
+  parameter called ``overrides_registry`` is now simply called ``registry``.
   This is backward incompatible.
-* An item class is now supported by ``scrapy_poet.callback_for`` alongside
-  the usual page objects. This means that it won't raise a ``TypeError``
-  anymore when not passing a subclass of ``web_poet.ItemPage``.
-* ``scrapy_poet.overrides.OverridesRegistry`` has been deprecated and
-  overhauled into ``scrapy_poet.registry.OverridesAndItemRegistry``:
-
-    * It is now subclassed from ``web_poet.RulesRegistry`` which allows
-      outright access to its registry methods.
-    * It now allows retrieval of rules based on the returned item class.
-    * The registry doesn't accept tuples as rules anymore. Only
-      ``web_poet.ApplyRule`` instances are allowed. The same goes for
-      ``SCRAPY_POET_RULES`` (and the deprecated ``SCRAPY_POET_OVERRIDES``).
-
-        * As a result, the following type aliases have been removed:
-          ``scrapy_poet.overrides.RuleAsTuple`` and
-          ``scrapy_poet.overrides.RuleFromUser``
-        * These changes are backward incompatible.
-
-* New exception: ``scrapy_poet.injector_error.ProviderDependencyDeadlockError``.
+* An item class is now supported by :func:`scrapy_poet.callback_for`
+  alongside the usual page objects. This means that it won't raise a
+  :class:`TypeError` anymore when not passing a subclass of
+  :class:`web_poet.pages.ItemPage`.
+* New exception: :class:`scrapy_poet.injection_errors.ProviderDependencyDeadlockError`.
   This is raised when it's not possible to create the dependencies due to
   a deadlock in their sub-dependencies, e.g. due to a circular dependency
   between page objects.

 Other changes:

+* Now requires ``web-poet >= 0.7.0``.
+* In line with web-poet's new features, the ``scrapy_poet.overrides`` module
+  which contained ``OverridesRegistryBase`` and ``OverridesRegistry`` has now
+  been removed. Instead, scrapy-poet directly uses
+  :class:`web_poet.rules.RulesRegistry`.
+
+  Everything should work pretty much the same, except that
+  :meth:`web_poet.rules.RulesRegistry.overrides_for` now accepts :class:`str`,
+  :class:`web_poet.page_inputs.http.RequestUrl`, or
+  :class:`web_poet.page_inputs.http.ResponseUrl` instead of
+  :class:`scrapy.http.Request`.
+
+* This also means that the registry doesn't accept tuples as rules anymore.
+  Only :class:`web_poet.rules.ApplyRule` instances are allowed. The same goes
+  for ``SCRAPY_POET_RULES`` (and the deprecated ``SCRAPY_POET_OVERRIDES``).
+  As a result, the following type aliases have been removed:
+
+    * ``scrapy_poet.overrides.RuleAsTuple``
+    * ``scrapy_poet.overrides.RuleFromUser``
+
+  These changes are backward incompatible.
+
 * Moved some of the utility functions from the test module into
   ``scrapy_poet.utils.testing``.
 * Documentation improvements.
+* Official support for Python 3.11.

 Deprecations:

-* The ``scrapy_poet.overrides`` module has been replaced by
-  ``scrapy_poet.registry``.
-* The ``scrapy_poet.overrides.OverridesRegistry`` class is now replaced by
-  ``scrapy_poet.registry.OverridesAndItemRegistry``.
 * The ``SCRAPY_POET_OVERRIDES_REGISTRY`` setting has been replaced by
   ``SCRAPY_POET_REGISTRY``.
 * The ``SCRAPY_POET_OVERRIDES`` setting has been replaced by
   ``SCRAPY_POET_RULES``.
-* Official support for Python 3.11
+

 0.6.0 (2022-11-24)
 ------------------
diff --git a/docs/api_reference.rst b/docs/api_reference.rst
index 012dea47..ce58b44e 100644
--- a/docs/api_reference.rst
+++ b/docs/api_reference.rst
@@ -43,10 +43,3 @@ Injection errors

 .. automodule:: scrapy_poet.injection_errors
    :members:
-
-Registry
-========
-
-.. automodule:: scrapy_poet.registry
-   :members:
-   :show-inheritance:
diff --git a/docs/rules-from-web-poet.rst b/docs/rules-from-web-poet.rst
index c8a814bc..2dbbb3c4 100644
--- a/docs/rules-from-web-poet.rst
+++ b/docs/rules-from-web-poet.rst
@@ -5,7 +5,7 @@ Rules from web-poet
 ===================

 scrapy-poet fully supports the functionalities of :class:`web_poet.rules.ApplyRule`.
-It has its own registry called :class:`scrapy_poet.registry.OverridesAndItemRegistry`
+It uses the registry from web-poet called :class:`web_poet.rules.RulesRegistry`
 which provides functionalities for:

    * Returning the page object override if it exists for a given URL.
@@ -296,9 +296,10 @@ regarding :ref:`rules-item-class-example`.
 Registry
 ========

-As mentioned above, scrapy-poet has its own registry called
-:class:`scrapy_poet.registry.OverridesAndItemRegistry`.
+As mentioned above, scrapy-poet uses the registry from web-poet called
+:class:`web_poet.rules.RulesRegistry`.
+
 This registry implementation can be changed if needed. A different registry
 can be configured by passing its class path to the ``SCRAPY_POET_REGISTRY`` setting.
-Such registries must be a subclass of :class:`scrapy_poet.registry.OverridesRegistryBase`
-and must implement the :meth:`scrapy_poet.registry.OverridesRegistryBase.overrides_for` method.
+Such registries must be a subclass of :class:`web_poet.rules.RulesRegistry`
+to ensure the expected methods and their types are available.
diff --git a/docs/settings.rst b/docs/settings.rst
index e20f1619..46cf2b66 100644
--- a/docs/settings.rst
+++ b/docs/settings.rst
@@ -29,9 +29,9 @@ SCRAPY_POET_RULES
 Default: ``None``

 Mapping of overrides for each domain. The format of such a ``dict`` mapping
-depends on the currently set Registry. The default is currently
-:class:`~.OverridesAndItemRegistry`. This can be overriden by the setting below:
-``SCRAPY_POET_OVERRIDES_REGISTRY``.
+depends on the currently set registry. The default is currently
+:class:`web_poet.rules.RulesRegistry`. This can be overridden by the setting below:
+``SCRAPY_POET_REGISTRY``.

 There are sections dedicated to this at :ref:`intro-tutorial` and
 :ref:`rules-from-web-poet`.
@@ -46,9 +46,9 @@ SCRAPY_POET_REGISTRY

 Default: ``None``

-Sets an alternative Registry to replace the default :class:`~.OverridesAndItemRegistry`.
-To use this, set a ``str`` which denotes the absolute object path of the new
-Registry.
+Sets an alternative registry to replace the default
+:class:`web_poet.rules.RulesRegistry`. To use this, set a ``str`` which denotes
+the absolute object path of the new registry.

 More info at :ref:`rules-from-web-poet`.
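The ``SCRAPY_POET_RULES`` setting described in the docs above is typically populated from web-poet's default registry, as the removed ``OverridesAndItemRegistry`` docstring illustrated. A rough sketch of such a settings fragment follows; ``my_project.pages`` is a hypothetical module name, and ``consume_modules()`` is only needed when the rule-declaring modules are not imported elsewhere:

```python
# settings.py (sketch): populate SCRAPY_POET_RULES from web-poet's
# default registry. "my_project.pages" is a hypothetical module whose
# page objects register rules via web-poet's @handle_urls decorator.
from web_poet import consume_modules, default_registry

# Import rule-declaring modules first so their rules land in default_registry.
consume_modules("my_project.pages")
SCRAPY_POET_RULES = default_registry.get_rules()
```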
diff --git a/scrapy_poet/downloadermiddlewares.py b/scrapy_poet/downloadermiddlewares.py index fed6ecb4..cd0105e3 100644 --- a/scrapy_poet/downloadermiddlewares.py +++ b/scrapy_poet/downloadermiddlewares.py @@ -9,8 +9,9 @@ from scrapy import Spider, signals from scrapy.crawler import Crawler from scrapy.http import Request, Response -from scrapy.utils.misc import create_instance, load_object +from scrapy.utils.misc import load_object from twisted.internet.defer import Deferred, inlineCallbacks +from web_poet import RulesRegistry from .api import DummyResponse from .injection import Injector @@ -22,7 +23,7 @@ RequestUrlProvider, ResponseUrlProvider, ) -from .registry import OverridesAndItemRegistry +from .utils import create_registry_instance logger = logging.getLogger(__name__) @@ -60,12 +61,10 @@ def __init__(self, crawler: Crawler) -> None: registry_cls = load_object( settings.get( "SCRAPY_POET_REGISTRY", - settings.get( - "SCRAPY_POET_OVERRIDES_REGISTRY", OverridesAndItemRegistry - ), + settings.get("SCRAPY_POET_OVERRIDES_REGISTRY", RulesRegistry), ) ) - self.registry = create_instance(registry_cls, settings, crawler) + self.registry = create_registry_instance(registry_cls, crawler) self.injector = Injector( crawler, default_providers=DEFAULT_PROVIDERS, diff --git a/scrapy_poet/injection.py b/scrapy_poet/injection.py index d4a89b0b..fb124b4f 100644 --- a/scrapy_poet/injection.py +++ b/scrapy_poet/injection.py @@ -12,8 +12,9 @@ from scrapy.statscollectors import StatsCollector from scrapy.utils.conf import build_component_list from scrapy.utils.defer import maybeDeferred_coro -from scrapy.utils.misc import create_instance, load_object +from scrapy.utils.misc import load_object from twisted.internet.defer import inlineCallbacks +from web_poet import RulesRegistry from web_poet.pages import is_injectable from scrapy_poet.api import _CALLBACK_FOR_MARKER, DummyResponse @@ -24,9 +25,8 @@ UndeclaredProvidedTypeError, ) from scrapy_poet.page_input_providers import 
PageObjectInputProvider -from scrapy_poet.registry import OverridesAndItemRegistry, OverridesRegistryBase -from .utils import get_scrapy_data_path +from .utils import create_registry_instance, get_scrapy_data_path logger = logging.getLogger(__name__) @@ -42,11 +42,11 @@ def __init__( crawler: Crawler, *, default_providers: Optional[Mapping] = None, - registry: Optional[OverridesRegistryBase] = None, + registry: Optional[RulesRegistry] = None, ): self.crawler = crawler self.spider = crawler.spider - self.registry = registry or OverridesAndItemRegistry() + self.registry = registry or RulesRegistry() self.load_providers(default_providers) self.init_cache() @@ -138,7 +138,11 @@ def build_plan(self, request: Request) -> andi.Plan: callback, is_injectable=is_injectable, externally_provided=self.is_class_provided_by_any_provider, - overrides=self.registry.overrides_for(request).get, + # Ignore the type since andi.plan expects overrides to be + # Callable[[Callable], Optional[Callable]] but the registry + # returns a more accurate typing for this scenario: + # Mapping[Type[ItemPage], Type[ItemPage]] + overrides=self.registry.overrides_for(request.url).get, # type: ignore[arg-type] ) @inlineCallbacks @@ -360,7 +364,7 @@ def is_provider_requiring_scrapy_response(provider): def get_injector_for_testing( providers: Mapping, additional_settings: Optional[Dict] = None, - registry: Optional[OverridesRegistryBase] = None, + registry: Optional[RulesRegistry] = None, ) -> Injector: """ Return an :class:`Injector` using a fake crawler. 
@@ -379,7 +383,7 @@ class MySpider(Spider): spider.settings = settings crawler.spider = spider if not registry: - registry = create_instance(OverridesAndItemRegistry, settings, crawler) + registry = create_registry_instance(RulesRegistry, crawler) return Injector(crawler, registry=registry) diff --git a/scrapy_poet/overrides.py b/scrapy_poet/overrides.py deleted file mode 100644 index e1304fc9..00000000 --- a/scrapy_poet/overrides.py +++ /dev/null @@ -1,6 +0,0 @@ -import warnings - -from .registry import OverridesRegistry # noqa: F401 - -msg = "The 'scrapy_poet.overrides' module has been moved into 'scrapy_poet.registry'." -warnings.warn(msg, DeprecationWarning, stacklevel=2) diff --git a/scrapy_poet/registry.py b/scrapy_poet/registry.py deleted file mode 100644 index 96e46d40..00000000 --- a/scrapy_poet/registry.py +++ /dev/null @@ -1,148 +0,0 @@ -import logging -from abc import ABC, abstractmethod -from collections import defaultdict -from typing import Any, Callable, Dict, Iterable, Mapping, Optional, Type -from warnings import warn - -from scrapy import Request -from scrapy.crawler import Crawler -from url_matcher import URLMatcher -from web_poet import ItemPage, RulesRegistry -from web_poet.rules import ApplyRule -from web_poet.utils import _create_deprecated_class - -logger = logging.getLogger(__name__) - - -class OverridesRegistryBase(ABC): - @abstractmethod - def overrides_for(self, request: Request) -> Mapping[Callable, Callable]: - """ - Return a ``Mapping`` (e.g. a ``dict``) with type translation rules. - The key is the source type that is wanted to be replaced by - value, which is also a type. - """ - pass - - -class OverridesAndItemRegistry(OverridesRegistryBase, RulesRegistry): - """This registry implements the functionalities for returning the override - for a given request as well as returning the page object capable of returning - an item class for a given URL. - - Read the :ref:`rules-from-web-poet` tutorial for more information. 
- - It reads the rules from the ``SCRAPY_POET_RULES`` setting which is a list of - :py:class:`web_poet.rules.ApplyRule` instances: - - .. code-block:: python - - from web_poet import default_registry - - SCRAPY_POET_RULES = default_registry.get_rules() - - .. _web-poet: https://web-poet.readthedocs.io - - Calling ``default_registry.get_rules()`` finds all the rules annotated using - web-poet_'s :py:func:`web_poet.handle_urls` as a decorator that were registered - into ``web_poet.default_registry`` (an instance of - :py:class:`web_poet.rules.RulesRegistry`). - - .. warning:: - - The :py:func:`web_poet.rules.consume_modules` must be called first - if you need to properly import rules from other packages. For example: - - .. code-block:: python - - from web_poet import default_registry, consume_modules - - consume_modules("external_package_A.po", "another_ext_package.lib") - SCRAPY_POET_RULES = default_registry.get_rules() - - """ - - @classmethod - def from_crawler(cls, crawler: Crawler) -> Crawler: - if "SCRAPY_POET_OVERRIDES" in crawler.settings: - msg = ( - "The SCRAPY_POET_OVERRIDES setting is deprecated. " - "Use SCRAPY_POET_RULES instead." - ) - warn(msg, DeprecationWarning, stacklevel=2) - rules = crawler.settings.getlist( - "SCRAPY_POET_RULES", - crawler.settings.getlist("SCRAPY_POET_OVERRIDES", []), - ) - return cls(rules=rules) - - def __init__(self, rules: Optional[Iterable[ApplyRule]] = None) -> None: - super().__init__(rules=rules) - self.overrides_matcher: Dict[Type[ItemPage], URLMatcher] = defaultdict( - URLMatcher - ) - self.item_matcher: Dict[Optional[Type], URLMatcher] = defaultdict(URLMatcher) - for rule_id, rule in enumerate(self._rules): - self.add_rule(rule_id, rule) - logger.debug(f"List of parsed ApplyRules:\n{self._rules}") - - def add_rule(self, rule_id: int, rule: ApplyRule) -> None: - """Registers an :class:`web_poet.rules.ApplyRule` instance against the - given rule ID. 
- """ - # A common case when a page object subclasses another one with the same - # URL pattern. See the ``test_item_return_subclass()`` test case in - # ``tests/test_web_poet_rules.py``. - matched = self.item_matcher[rule.to_return] - pattern_dupes = [ - pattern - for pattern in matched.patterns.values() - if pattern == rule.for_patterns - ] - if pattern_dupes: - rules = [ - r - for p in pattern_dupes - for r in self.search(for_patterns=p, to_return=rule.to_return) - ] - warn( - f"Similar URL patterns {pattern_dupes} were declared earlier " - f"that use to_return={rule.to_return}. The first, highest-priority " - f"rule added to SCRAPY_POET_REGISTRY will be used when matching " - f"against URLs. Consider updating the priority of these rules: " - f"{rules}." - ) - - if rule.instead_of: - self.overrides_matcher[rule.instead_of].add_or_update( - rule_id, rule.for_patterns - ) - if rule.to_return: - self.item_matcher[rule.to_return].add_or_update(rule_id, rule.for_patterns) - - # TODO: These URL matching functionalities could be moved to web-poet. - # ``overrides_for`` in web-poet could be ``str`` or ``_Url`` subclass. 
-
-    def _rules_for_url(
-        self, url: str, url_matcher: Dict[Any, URLMatcher]
-    ) -> Mapping[Callable, Callable]:
-        result: Dict[Callable, Callable] = {}
-        for target, matcher in url_matcher.items():
-            rule_id = matcher.match(url)
-            if rule_id is not None:
-                result[target] = self._rules[rule_id].use
-        return result
-
-    def overrides_for(self, request: Request) -> Mapping[Callable, Callable]:
-        return self._rules_for_url(request.url, self.overrides_matcher)
-
-    def page_object_for_item(
-        self, url: str, po_cls: Type[Any]
-    ) -> Optional[Callable[..., Any]]:
-        """Returns the page object class for a given URL and item class."""
-        return self._rules_for_url(url, self.item_matcher).get(po_cls)
-
-
-OverridesRegistry = _create_deprecated_class(
-    "OverridesRegistry", OverridesAndItemRegistry, warn_once=False
-)
diff --git a/scrapy_poet/utils/__init__.py b/scrapy_poet/utils/__init__.py
index 5b38c534..e1017e07 100644
--- a/scrapy_poet/utils/__init__.py
+++ b/scrapy_poet/utils/__init__.py
@@ -1,7 +1,10 @@
 import os
+from typing import Type

+from scrapy.crawler import Crawler
 from scrapy.http import Request, Response
 from scrapy.utils.project import inside_project, project_data_dir
+from warnings import warn
 from web_poet import HttpRequest, HttpResponse, HttpResponseHeaders

@@ -43,3 +46,17 @@ def scrapy_response_to_http_response(response: Response) -> HttpResponse:
         headers=HttpResponseHeaders.from_bytes_dict(response.headers),
         **kwargs,
     )
+
+
+def create_registry_instance(cls: Type, crawler: Crawler):
+    if "SCRAPY_POET_OVERRIDES" in crawler.settings:
+        msg = (
+            "The SCRAPY_POET_OVERRIDES setting is deprecated. "
+            "Use SCRAPY_POET_RULES instead."
+ ) + warn(msg, DeprecationWarning, stacklevel=2) + rules = crawler.settings.getlist( + "SCRAPY_POET_RULES", + crawler.settings.getlist("SCRAPY_POET_OVERRIDES", []), + ) + return cls(rules=rules) diff --git a/setup.py b/setup.py index db04b889..5fe453c0 100755 --- a/setup.py +++ b/setup.py @@ -18,7 +18,7 @@ "sqlitedict >= 1.5.0", "twisted >= 18.9.0", "url-matcher >= 0.2.0", - "web-poet >= 0.6.0", + "web-poet @ git+https://git@github.com/scrapinghub/web-poet@registry-features#egg=web-poet", ], classifiers=[ "Development Status :: 3 - Alpha", diff --git a/tests/test_injection.py b/tests/test_injection.py index 6d375ba6..84fb4605 100644 --- a/tests/test_injection.py +++ b/tests/test_injection.py @@ -8,7 +8,7 @@ from scrapy.http import Response from url_matcher import Patterns from url_matcher.util import get_domain -from web_poet import Injectable, ItemPage +from web_poet import Injectable, ItemPage, RulesRegistry from web_poet.mixins import ResponseShortcutsMixin from web_poet.rules import ApplyRule @@ -29,7 +29,6 @@ NonCallableProviderError, UndeclaredProvidedTypeError, ) -from scrapy_poet.registry import OverridesAndItemRegistry def get_provider(classes, content=None): @@ -322,7 +321,7 @@ def test_overrides(self, providers, override_should_happen): Patterns([domain]), use=OtherEurDollarRate, instead_of=EurDollarRate ), ] - registry = OverridesAndItemRegistry(rules=rules) + registry = RulesRegistry(rules=rules) injector = get_injector_for_testing(providers, registry=registry) def callback( diff --git a/tests/test_middleware.py b/tests/test_middleware.py index 81a74a12..ee160c53 100644 --- a/tests/test_middleware.py +++ b/tests/test_middleware.py @@ -12,12 +12,19 @@ from scrapy.utils.test import get_crawler from twisted.internet.threads import deferToThread from url_matcher.util import get_domain -from web_poet import ApplyRule, HttpResponse, ItemPage, RequestUrl, ResponseUrl, WebPage +from web_poet import ( + ApplyRule, + HttpResponse, + ItemPage, + RequestUrl, + 
ResponseUrl, + RulesRegistry, + WebPage, +) from scrapy_poet import DummyResponse, InjectionMiddleware, callback_for from scrapy_poet.cache import SqlitedictCache from scrapy_poet.page_input_providers import PageObjectInputProvider -from scrapy_poet.registry import OverridesAndItemRegistry from scrapy_poet.utils.mockserver import get_ephemeral_port from scrapy_poet.utils.testing import ( HtmlResource, @@ -126,7 +133,7 @@ def test_overrides(settings): def test_deprecation_setting_SCRAPY_POET_OVERRIDES_REGISTRY(settings) -> None: - settings["SCRAPY_POET_OVERRIDES_REGISTRY"] = OverridesAndItemRegistry + settings["SCRAPY_POET_OVERRIDES_REGISTRY"] = RulesRegistry crawler = get_crawler(Spider, settings) msg = ( @@ -135,7 +142,7 @@ def test_deprecation_setting_SCRAPY_POET_OVERRIDES_REGISTRY(settings) -> None: ) with pytest.warns(DeprecationWarning, match=msg): middleware = InjectionMiddleware(crawler) - assert isinstance(middleware.registry, OverridesAndItemRegistry) + assert isinstance(middleware.registry, RulesRegistry) # If both settings are present, the newer SCRAPY_POET_REGISTRY setting is used. diff --git a/tests/test_registry.py b/tests/test_registry.py deleted file mode 100644 index 99b9f12b..00000000 --- a/tests/test_registry.py +++ /dev/null @@ -1,50 +0,0 @@ -import pytest -from scrapy import Spider -from scrapy.utils.test import get_crawler -from web_poet import ApplyRule, ItemPage - - -def test_OverridesRegistry() -> None: - from scrapy_poet.overrides import OverridesRegistry - - msg = ( - "scrapy_poet.registry.OverridesRegistry is deprecated, " - "instantiate scrapy_poet.registry.OverridesAndItemRegistry instead." 
- ) - with pytest.warns(DeprecationWarning, match=msg): - OverridesRegistry() - - -def test_deprecation_setting_SCRAPY_POET_OVERRIDES(settings) -> None: - from scrapy_poet.registry import OverridesAndItemRegistry - - class FakePageObjectA(ItemPage): - pass - - rule_a = ApplyRule("https://example.com", use=FakePageObjectA) - - settings["SCRAPY_POET_OVERRIDES"] = [rule_a] - crawler = get_crawler(Spider, settings) - - msg = ( - "The SCRAPY_POET_OVERRIDES setting is deprecated. " - "Use SCRAPY_POET_RULES instead." - ) - with pytest.warns(DeprecationWarning, match=msg): - registry = OverridesAndItemRegistry.from_crawler(crawler) - - assert registry.get_rules() == [rule_a] - - # If both settings are present, the newer SCRAPY_POET_RULES setting is used. - - class FakePageObjectB(ItemPage): - pass - - rule_b = ApplyRule("https://example.com", use=FakePageObjectB) - settings["SCRAPY_POET_RULES"] = [rule_b] - crawler = get_crawler(Spider, settings) - - with pytest.warns(DeprecationWarning, match=msg): - registry = OverridesAndItemRegistry.from_crawler(crawler) - - assert registry.get_rules() == [rule_b] diff --git a/tox.ini b/tox.ini index 43ed1278..a06e61bb 100644 --- a/tox.ini +++ b/tox.ini @@ -27,7 +27,7 @@ deps = scrapy==2.6.0 sqlitedict==1.5.0 url-matcher==0.2.0 - web-poet==0.6.0 + web-poet @ git+https://git@github.com/scrapinghub/web-poet@registry-features#egg=web-poet [testenv:asyncio-min] basepython = python3.7
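The setting fallback that the patch centralizes in ``create_registry_instance`` can be sketched in isolation. This is a dependency-free illustration only: ``resolve_rules`` is a hypothetical helper name, and a plain ``dict`` stands in for Scrapy's ``Settings`` object (whose ``getlist`` has the same prefer-new, fall-back-to-old shape):

```python
import warnings
from typing import List


def resolve_rules(settings: dict) -> List:
    """Prefer the new SCRAPY_POET_RULES setting; fall back to the
    deprecated SCRAPY_POET_OVERRIDES, warning when the old name is set."""
    if "SCRAPY_POET_OVERRIDES" in settings:
        warnings.warn(
            "The SCRAPY_POET_OVERRIDES setting is deprecated. "
            "Use SCRAPY_POET_RULES instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    # If both settings are present, the newer SCRAPY_POET_RULES wins.
    return settings.get(
        "SCRAPY_POET_RULES",
        settings.get("SCRAPY_POET_OVERRIDES", []),
    )
```

This mirrors why ``tests/test_middleware.py`` asserts that ``SCRAPY_POET_REGISTRY`` takes precedence over the deprecated ``SCRAPY_POET_OVERRIDES_REGISTRY`` when both are configured.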