Skip to content

Commit

Permalink
Merge pull request #67 from scrapinghub/responsedata-to-httpresponse
Browse files Browse the repository at this point in the history
Use the new HttpResponse that replaces ResponseData in web_poet
  • Loading branch information
BurnzZ authored May 7, 2022
2 parents c1a7cb8 + a254ba4 commit 67a788b
Show file tree
Hide file tree
Showing 14 changed files with 106 additions and 73 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,11 @@
Changelog
=========

TBR
---

* Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.


0.3.0 (2022-01-28)
------------------
Expand Down
2 changes: 1 addition & 1 deletion docs/overrides.rst
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ using the following Page Object:
class ISBNBookPage(ItemWebPage):
def __init__(self, response: ResponseData, book_page: BookPage):
def __init__(self, response: HttpResponse, book_page: BookPage):
super().__init__(response)
self.book_page = book_page
Expand Down
58 changes: 33 additions & 25 deletions docs/providers.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,34 +20,34 @@ Creating providers
==================

Providers are responsible for building dependencies needed by Injectable
objects. A good example would be the ``ResponseDataProvider``,
which builds and provides a ``ResponseData`` instance for Injectables
that need it, like the ``ItemWebPage``.
objects. A good example would be the ``HttpResponseProvider``,
which builds and provides a ``web_poet.HttpResponse`` instance for Injectables
that need it, like the ``web_poet.ItemWebPage``.

.. code-block:: python
import attr
from typing import Set, Callable
import web_poet
from scrapy_poet.page_input_providers import PageObjectInputProvider
from scrapy import Response
@attr.define
class ResponseData:
"""Represents a response containing its URL and HTML content."""
url: str
html: str
class ResponseDataProvider(PageObjectInputProvider):
"""This class provides ``web_poet.page_inputs.ResponseData`` instances."""
provided_classes = {ResponseData}
class HttpResponseProvider(PageObjectInputProvider):
"""This class provides ``web_poet.HttpResponse`` instances."""
provided_classes = {web_poet.HttpResponse}
def __call__(self, to_provide: Set[Callable], response: Response):
"""Build a ``ResponseData`` instance using a Scrapy ``Response``"""
return [ResponseData(url=response.url, html=response.text)]
"""Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``"""
return [
HttpResponse(
url=response.url,
body=response.body,
status=response.status,
headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
)
]
You can implement your own providers in order to extend or override current
``scrapy-poet`` behavior. All providers should inherit from this base class:
Expand All @@ -61,7 +61,7 @@ Cache Suppport in Providers
===========================

``scrapy-poet`` also supports caching of the provided dependencies from the
providers. For example, :class:`~.ResponseDataProvider` supports this right off
providers. For example, :class:`~.HttpResponseProvider` supports this right off
the bat. It's able to do this by inheriting the :class:`~.CacheDataProviderMixin`
and implementing all of its ``abstractmethods``.

Expand All @@ -70,18 +70,26 @@ would lead to the following code:

.. code-block:: python
import web_poet
from scrapy_poet.page_input_providers import (
CacheDataProviderMixin,
PageObjectInputProvider,
)
class ResponseDataProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.page_inputs.ResponseData`` instances."""
provided_classes = {ResponseData}
class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.HttpResponse`` instances."""
provided_classes = {web_poet.HttpResponse}
def __call__(self, to_provide: Set[Callable], response: Response):
"""Build a ``ResponseData`` instance using a Scrapy ``Response``"""
return [ResponseData(url=response.url, html=response.text)]
"""Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``"""
return [
web_poet.HttpResponse(
url=response.url,
body=response.body,
status=response.status,
headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
)
]
def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
"""Returns a fingerprint to identify the specific request."""
Expand Down Expand Up @@ -136,7 +144,7 @@ configuration dictionaries for more information.
.. note::

The providers in :const:`scrapy_poet.DEFAULT_PROVIDERS`,
which includes a provider for :class:`~ResponseData`, are always
which includes a provider for :class:`~HttpResponse`, are always
included by default. You can disable any of them by listing it
in the configuration with the priority `None`.

Expand Down Expand Up @@ -264,8 +272,8 @@ Page Object uses it, the request is not ignored, for example:

The code above is just for example purposes. If you need to use ``Response``
instances in your Page Objects, use built-in ``ItemWebPage`` - it has
``response`` attribute with ``ResponseData``; no additional configuration
is needed, as there is ``ResponseDataProvider`` enabled in ``scrapy-poet``
``response`` attribute with ``HttpResponse``; no additional configuration
is needed, as there is ``HttpResponseProvider`` enabled in ``scrapy-poet``
by default.

Requests concurrency
Expand Down
2 changes: 1 addition & 1 deletion scrapy_poet/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,5 +3,5 @@
from .page_input_providers import (
PageObjectInputProvider,
CacheDataProviderMixin,
ResponseDataProvider,
HttpResponseProvider,
)
4 changes: 2 additions & 2 deletions scrapy_poet/middleware.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,15 +13,15 @@
from scrapy.utils.misc import create_instance, load_object
from .api import DummyResponse
from .overrides import PerDomainOverridesRegistry
from .page_input_providers import ResponseDataProvider
from .page_input_providers import HttpResponseProvider
from .injection import Injector


logger = logging.getLogger(__name__)


DEFAULT_PROVIDERS = {
ResponseDataProvider: 500
HttpResponseProvider: 500
}

InjectionMiddlewareTV = TypeVar("InjectionMiddlewareTV", bound="InjectionMiddleware")
Expand Down
32 changes: 24 additions & 8 deletions scrapy_poet/page_input_providers.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
have created a repository of ``PageObjectInputProviders``.
The current module implements a ``PageObjectInputProviders`` for
:class:`web_poet.page_inputs.ResponseData`, which is in charge of providing the response
:class:`web_poet.page_inputs.HttpResponse`, which is in charge of providing the response
HTML from Scrapy. You could also implement different providers in order to
acquire data from multiple external sources, for example,
Splash or Auto Extract API.
Expand All @@ -20,7 +20,7 @@
from scrapy.utils.request import request_fingerprint

from scrapy_poet.injection_errors import MalformedProvidedClassesError
from web_poet import ResponseData
from web_poet import HttpResponse, HttpResponseHeaders


class PageObjectInputProvider:
Expand Down Expand Up @@ -154,15 +154,22 @@ def has_cache_support(self):
return True


class ResponseDataProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.page_inputs.ResponseData`` instances."""
class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.page_inputs.HttpResponse`` instances."""

provided_classes = {ResponseData}
provided_classes = {HttpResponse}
name = "response_data"

def __call__(self, to_provide: Set[Callable], response: Response):
"""Builds a ``ResponseData`` instance using a Scrapy ``Response``"""
return [ResponseData(url=response.url, html=response.text)]
"""Builds a ``HttpResponse`` instance using a Scrapy ``Response``"""
return [
HttpResponse(
url=response.url,
body=response.body,
status=response.status,
headers=HttpResponseHeaders.from_bytes_dict(response.headers),
)
]

def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
request_keys = {"url", "method", "body"}
Expand All @@ -181,4 +188,13 @@ def serialize(self, result: Sequence[Any]) -> Any:
return [attr.asdict(response_data) for response_data in result]

def deserialize(self, data: Any) -> Sequence[Any]:
return [ResponseData(**response_data) for response_data in data]
return [
HttpResponse(
response_data["url"],
response_data["body"],
status=response_data["status"],
headers=HttpResponseHeaders.from_bytes_dict(response_data["headers"]),
encoding=response_data["_encoding"],
)
for response_data in data
]
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@
'andi >= 0.4.1',
'attrs',
'parsel',
'web-poet',
'web-poet @ git+https://[email protected]/scrapinghub/web-poet@master#egg=web-poet',
'tldextract',
'sqlitedict',
],
Expand Down
2 changes: 1 addition & 1 deletion tests/conftest.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
import pytest
from scrapy.settings import Settings

from scrapy_poet.page_input_providers import ResponseDataProvider
from scrapy_poet.page_input_providers import HttpResponseProvider


@pytest.fixture()
Expand Down
17 changes: 11 additions & 6 deletions tests/test_injection.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,12 @@
from pytest_twisted import inlineCallbacks
import weakref

import parsel
from scrapy import Request
from scrapy.http import Response
from scrapy_poet.utils import get_domain

from scrapy_poet import CacheDataProviderMixin, ResponseDataProvider, PageObjectInputProvider, \
from scrapy_poet import CacheDataProviderMixin, HttpResponseProvider, PageObjectInputProvider, \
DummyResponse
from scrapy_poet.injection import check_all_providers_are_callable, is_class_provided_by_any_provider_fn, \
get_injector_for_testing, get_response_for_testing
Expand Down Expand Up @@ -264,6 +265,10 @@ class Html(Injectable):
url = "http://example.com"
html = """<html><body>Price: <span class="price">22</span>€</body></html>"""

@property
def selector(self):
return parsel.Selector(self.html)


class EurDollarRate(Injectable):
rate = 1.1
Expand Down Expand Up @@ -332,17 +337,17 @@ def callback(response: DummyResponse, price_po: PricePO, rate_po: EurDollarRate)


def test_load_provider_classes():
provider_as_string = f"{ResponseDataProvider.__module__}.{ResponseDataProvider.__name__}"
injector = get_injector_for_testing({provider_as_string: 2, ResponseDataProvider: 1})
assert all(type(prov) == ResponseDataProvider for prov in injector.providers)
provider_as_string = f"{HttpResponseProvider.__module__}.{HttpResponseProvider.__name__}"
injector = get_injector_for_testing({provider_as_string: 2, HttpResponseProvider: 1})
assert all(type(prov) == HttpResponseProvider for prov in injector.providers)
assert len(injector.providers) == 2


def test_check_all_providers_are_callable():
check_all_providers_are_callable([ResponseDataProvider(None)])
check_all_providers_are_callable([HttpResponseProvider(None)])
with pytest.raises(NonCallableProviderError) as exinf:
check_all_providers_are_callable([PageObjectInputProvider(None),
ResponseDataProvider(None)])
HttpResponseProvider(None)])

assert "PageObjectInputProvider" in str(exinf.value)
assert "not callable" in str(exinf.value)
Expand Down
14 changes: 7 additions & 7 deletions tests/test_middleware.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
from scrapy_poet.page_input_providers import (
PageObjectInputProvider
)
from web_poet.page_inputs import ResponseData
from web_poet.page_inputs import HttpResponse
from scrapy_poet import DummyResponse
from tests.utils import (HtmlResource,
crawl_items,
Expand Down Expand Up @@ -125,10 +125,10 @@ class OptionalAndUnionPage(ItemWebPage):
breadcrumbs: BreadcrumbsExtraction
opt_check_1: Optional[BreadcrumbsExtraction]
opt_check_2: Optional[str] # str is not Injectable, so None expected here
union_check_1: Union[BreadcrumbsExtraction, ResponseData] # Breadcrumbs is injected
union_check_2: Union[str, ResponseData] # ResponseData is injected
union_check_3: Union[Optional[str], ResponseData] # None is injected
union_check_4: Union[None, str, ResponseData] # None is injected
union_check_1: Union[BreadcrumbsExtraction, HttpResponse] # Breadcrumbs is injected
union_check_2: Union[str, HttpResponse] # HttpResponse is injected
union_check_3: Union[Optional[str], HttpResponse] # None is injected
union_check_4: Union[None, str, HttpResponse] # None is injected
union_check_5: Union[BreadcrumbsExtraction, None, str] # Breadcrumbs is injected

def to_item(self):
Expand All @@ -151,7 +151,7 @@ def test_optional_and_unions(settings):
@attr.s(auto_attribs=True)
class ProvidedWithDeferred:
msg: str
response: ResponseData # it should be None because this class is provided
response: HttpResponse # it should be None because this class is provided


@attr.s(auto_attribs=True)
Expand Down Expand Up @@ -198,7 +198,7 @@ def __call__(self, to_provide):
# we're returning a class that's not listed in self.provided_classes
return {
ExtraClassData: ExtraClassData("this should be returned"),
ResponseData: ResponseData("example.com", "this shouldn't"),
HttpResponse: HttpResponse("example.com", b"this shouldn't"),
}


Expand Down
Loading

0 comments on commit 67a788b

Please sign in to comment.