Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the new HttpResponse that replaces ResponseData in web_poet #67

Merged
merged 5 commits into from
May 7, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,11 @@
Changelog
=========

TBR
---

* Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.


0.3.0 (2022-01-28)
------------------
Expand Down
2 changes: 1 addition & 1 deletion docs/overrides.rst
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ using the following Page Object:

class ISBNBookPage(ItemWebPage):

def __init__(self, response: ResponseData, book_page: BookPage):
def __init__(self, response: HttpResponse, book_page: BookPage):
super().__init__(response)
self.book_page = book_page

Expand Down
58 changes: 33 additions & 25 deletions docs/providers.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,34 +20,34 @@ Creating providers
==================

Providers are responsible for building dependencies needed by Injectable
objects. A good example would be the ``ResponseDataProvider``,
which builds and provides a ``ResponseData`` instance for Injectables
that need it, like the ``ItemWebPage``.
objects. A good example would be the ``HttpResponseProvider``,
which builds and provides a ``web_poet.HttpResponse`` instance for Injectables
that need it, like the ``web_poet.ItemWebPage``.

.. code-block:: python

import attr
from typing import Set, Callable

import web_poet
from scrapy_poet.page_input_providers import PageObjectInputProvider
from scrapy import Response


@attr.define
class ResponseData:
"""Represents a response containing its URL and HTML content."""
url: str
html: str


class ResponseDataProvider(PageObjectInputProvider):
"""This class provides ``web_poet.page_inputs.ResponseData`` instances."""
provided_classes = {ResponseData}
class HttpResponseProvider(PageObjectInputProvider):
"""This class provides ``web_poet.HttpResponse`` instances."""
kmike marked this conversation as resolved.
Show resolved Hide resolved
provided_classes = {web_poet.HttpResponse}

def __call__(self, to_provide: Set[Callable], response: Response):
"""Build a ``ResponseData`` instance using a Scrapy ``Response``"""
return [ResponseData(url=response.url, html=response.text)]

"""Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``"""
return [
HttpResponse(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
HttpResponse(
web_poet.HttpResponse(

url=response.url,
body=response.body,
status=response.status,
headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
)
]

You can implement your own providers in order to extend or override current
``scrapy-poet`` behavior. All providers should inherit from this base class:
Expand All @@ -61,7 +61,7 @@ Cache Suppport in Providers
===========================

``scrapy-poet`` also supports caching of the provided dependencies from the
providers. For example, :class:`~.ResponseDataProvider` supports this right off
providers. For example, :class:`~.HttpResponseProvider` supports this right off
the bat. It's able to do this by inheriting the :class:`~.CacheDataProviderMixin`
and implementing all of its ``abstractmethods``.

Expand All @@ -70,18 +70,26 @@ would lead to the following code:

.. code-block:: python

import web_poet
from scrapy_poet.page_input_providers import (
CacheDataProviderMixin,
PageObjectInputProvider,
)

class ResponseDataProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.page_inputs.ResponseData`` instances."""
provided_classes = {ResponseData}
class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.HttpResponse`` instances."""
provided_classes = {web_poet.HttpResponse}

def __call__(self, to_provide: Set[Callable], response: Response):
"""Build a ``ResponseData`` instance using a Scrapy ``Response``"""
return [ResponseData(url=response.url, html=response.text)]
"""Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``"""
return [
HttpResponse(
BurnzZ marked this conversation as resolved.
Show resolved Hide resolved
url=response.url,
body=response.body,
status=response.status,
headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
)
]

def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
"""Returns a fingerprint to identify the specific request."""
Expand Down Expand Up @@ -136,7 +144,7 @@ configuration dictionaries for more information.
.. note::

The providers in :const:`scrapy_poet.DEFAULT_PROVIDERS`,
which includes a provider for :class:`~ResponseData`, are always
which includes a provider for :class:`~HttpResponse`, are always
included by default. You can disable any of them by listing it
in the configuration with the priority `None`.

Expand Down Expand Up @@ -264,8 +272,8 @@ Page Object uses it, the request is not ignored, for example:

The code above is just for example purposes. If you need to use ``Response``
instances in your Page Objects, use built-in ``ItemWebPage`` - it has
``response`` attribute with ``ResponseData``; no additional configuration
is needed, as there is ``ResponseDataProvider`` enabled in ``scrapy-poet``
``response`` attribute with ``HttpResponse``; no additional configuration
is needed, as there is ``HttpResponseProvider`` enabled in ``scrapy-poet``
by default.

Requests concurrency
Expand Down
2 changes: 1 addition & 1 deletion scrapy_poet/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,5 +3,5 @@
from .page_input_providers import (
PageObjectInputProvider,
CacheDataProviderMixin,
ResponseDataProvider,
HttpResponseProvider,
)
4 changes: 2 additions & 2 deletions scrapy_poet/middleware.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,15 +13,15 @@
from scrapy.utils.misc import create_instance, load_object
from .api import DummyResponse
from .overrides import PerDomainOverridesRegistry
from .page_input_providers import ResponseDataProvider
from .page_input_providers import HttpResponseProvider
from .injection import Injector


logger = logging.getLogger(__name__)


DEFAULT_PROVIDERS = {
ResponseDataProvider: 500
HttpResponseProvider: 500
}

InjectionMiddlewareTV = TypeVar("InjectionMiddlewareTV", bound="InjectionMiddleware")
Expand Down
32 changes: 24 additions & 8 deletions scrapy_poet/page_input_providers.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
have created a repository of ``PageObjectInputProviders``.

The current module implements a ``PageObjectInputProviders`` for
:class:`web_poet.page_inputs.ResponseData`, which is in charge of providing the response
:class:`web_poet.page_inputs.HttpResponse`, which is in charge of providing the response
HTML from Scrapy. You could also implement different providers in order to
acquire data from multiple external sources, for example,
Splash or Auto Extract API.
Expand All @@ -20,7 +20,7 @@
from scrapy.utils.request import request_fingerprint

from scrapy_poet.injection_errors import MalformedProvidedClassesError
from web_poet import ResponseData
from web_poet import HttpResponse, HttpResponseHeaders


class PageObjectInputProvider:
Expand Down Expand Up @@ -154,15 +154,22 @@ def has_cache_support(self):
return True


class ResponseDataProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.page_inputs.ResponseData`` instances."""
class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
"""This class provides ``web_poet.page_inputs.HttpResponse`` instances."""

provided_classes = {ResponseData}
provided_classes = {HttpResponse}
name = "response_data"

def __call__(self, to_provide: Set[Callable], response: Response):
"""Builds a ``ResponseData`` instance using a Scrapy ``Response``"""
return [ResponseData(url=response.url, html=response.text)]
"""Builds a ``HttpResponse`` instance using a Scrapy ``Response``"""
return [
HttpResponse(
url=response.url,
body=response.body,
status=response.status,
headers=HttpResponseHeaders.from_bytes_dict(response.headers),
)
]

def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
request_keys = {"url", "method", "body"}
Expand All @@ -181,4 +188,13 @@ def serialize(self, result: Sequence[Any]) -> Any:
return [attr.asdict(response_data) for response_data in result]

def deserialize(self, data: Any) -> Sequence[Any]:
return [ResponseData(**response_data) for response_data in data]
return [
HttpResponse(
response_data["url"],
response_data["body"],
status=response_data["status"],
headers=HttpResponseHeaders.from_bytes_dict(response_data["headers"]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is HttpResponseHeaders.from_bytes_dict needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike It's because Scrapy headers look like this:

scrapy_headers = {
    b"Content-Encoding": [b"gzip", b"br"],
    b"Content-Type": [b"text/html"],
    b"content-length": b"648",
}

Reference for the motivation: scrapinghub/web-poet#33 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not wrong, here the result is not coming from Scrapy, it's a part of serialization / deserialization for cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh I see what you mean. Nice catch @kmike ! 🙌 Fixed this in #72

encoding=response_data["_encoding"],
)
for response_data in data
]
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@
'andi >= 0.4.1',
'attrs',
'parsel',
'web-poet',
'web-poet @ git+https://[email protected]/scrapinghub/web-poet@master#egg=web-poet',
'tldextract',
'sqlitedict',
],
Expand Down
2 changes: 1 addition & 1 deletion tests/conftest.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
import pytest
from scrapy.settings import Settings

from scrapy_poet.page_input_providers import ResponseDataProvider
from scrapy_poet.page_input_providers import HttpResponseProvider


@pytest.fixture()
Expand Down
17 changes: 11 additions & 6 deletions tests/test_injection.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,12 @@
from pytest_twisted import inlineCallbacks
import weakref

import parsel
from scrapy import Request
from scrapy.http import Response
from scrapy_poet.utils import get_domain

from scrapy_poet import CacheDataProviderMixin, ResponseDataProvider, PageObjectInputProvider, \
from scrapy_poet import CacheDataProviderMixin, HttpResponseProvider, PageObjectInputProvider, \
DummyResponse
from scrapy_poet.injection import check_all_providers_are_callable, is_class_provided_by_any_provider_fn, \
get_injector_for_testing, get_response_for_testing
Expand Down Expand Up @@ -264,6 +265,10 @@ class Html(Injectable):
url = "http://example.com"
html = """<html><body>Price: <span class="price">22</span>€</body></html>"""

@property
def selector(self):
return parsel.Selector(self.html)


class EurDollarRate(Injectable):
rate = 1.1
Expand Down Expand Up @@ -332,17 +337,17 @@ def callback(response: DummyResponse, price_po: PricePO, rate_po: EurDollarRate)


def test_load_provider_classes():
provider_as_string = f"{ResponseDataProvider.__module__}.{ResponseDataProvider.__name__}"
injector = get_injector_for_testing({provider_as_string: 2, ResponseDataProvider: 1})
assert all(type(prov) == ResponseDataProvider for prov in injector.providers)
provider_as_string = f"{HttpResponseProvider.__module__}.{HttpResponseProvider.__name__}"
injector = get_injector_for_testing({provider_as_string: 2, HttpResponseProvider: 1})
assert all(type(prov) == HttpResponseProvider for prov in injector.providers)
assert len(injector.providers) == 2


def test_check_all_providers_are_callable():
check_all_providers_are_callable([ResponseDataProvider(None)])
check_all_providers_are_callable([HttpResponseProvider(None)])
with pytest.raises(NonCallableProviderError) as exinf:
check_all_providers_are_callable([PageObjectInputProvider(None),
ResponseDataProvider(None)])
HttpResponseProvider(None)])

assert "PageObjectInputProvider" in str(exinf.value)
assert "not callable" in str(exinf.value)
Expand Down
14 changes: 7 additions & 7 deletions tests/test_middleware.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
from scrapy_poet.page_input_providers import (
PageObjectInputProvider
)
from web_poet.page_inputs import ResponseData
from web_poet.page_inputs import HttpResponse
from scrapy_poet import DummyResponse
from tests.utils import (HtmlResource,
crawl_items,
Expand Down Expand Up @@ -125,10 +125,10 @@ class OptionalAndUnionPage(ItemWebPage):
breadcrumbs: BreadcrumbsExtraction
opt_check_1: Optional[BreadcrumbsExtraction]
opt_check_2: Optional[str] # str is not Injectable, so None expected here
union_check_1: Union[BreadcrumbsExtraction, ResponseData] # Breadcrumbs is injected
union_check_2: Union[str, ResponseData] # ResponseData is injected
union_check_3: Union[Optional[str], ResponseData] # None is injected
union_check_4: Union[None, str, ResponseData] # None is injected
union_check_1: Union[BreadcrumbsExtraction, HttpResponse] # Breadcrumbs is injected
union_check_2: Union[str, HttpResponse] # HttpResponse is injected
union_check_3: Union[Optional[str], HttpResponse] # None is injected
union_check_4: Union[None, str, HttpResponse] # None is injected
union_check_5: Union[BreadcrumbsExtraction, None, str] # Breadcrumbs is injected

def to_item(self):
Expand All @@ -151,7 +151,7 @@ def test_optional_and_unions(settings):
@attr.s(auto_attribs=True)
class ProvidedWithDeferred:
msg: str
response: ResponseData # it should be None because this class is provided
response: HttpResponse # it should be None because this class is provided


@attr.s(auto_attribs=True)
Expand Down Expand Up @@ -198,7 +198,7 @@ def __call__(self, to_provide):
# we're returning a class that's not listed in self.provided_classes
return {
ExtraClassData: ExtraClassData("this should be returned"),
ResponseData: ResponseData("example.com", "this shouldn't"),
HttpResponse: HttpResponse("example.com", b"this shouldn't"),
}


Expand Down
Loading