Merge pull request #62 from scrapinghub/po-additional-requests
integration for web-poet's support on additional requests and Meta
kmike authored Jun 16, 2022
2 parents 0aaa262 + 98ce454 commit e4589e6
Showing 22 changed files with 1,165 additions and 41 deletions.
18 changes: 17 additions & 1 deletion .github/workflows/test.yml
@@ -16,7 +16,21 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ['3.7', '3.8', '3.9', '3.10']
include:
- python-version: "3.7"
toxenv: "min"
- python-version: "3.7"
toxenv: "asyncio-min"

- python-version: "3.8"
toxenv: "py"
- python-version: "3.9"
toxenv: "py"

- python-version: "3.10"
toxenv: "py"
- python-version: "3.10"
toxenv: "asyncio"

steps:
- uses: actions/checkout@v2
@@ -29,6 +43,8 @@ jobs:
python -m pip install --upgrade pip
python -m pip install tox
- name: tox
env:
TOXENV: ${{ matrix.toxenv }}
run: |
tox -e py
- name: coverage
8 changes: 8 additions & 0 deletions CHANGELOG.rst
@@ -6,6 +6,12 @@ TBR
---

* Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.
* Support for the new features in ``web_poet>=0.2.0`` for additional
  requests inside Page Objects:

    * Created new providers for ``web_poet.PageParams`` and
      ``web_poet.HttpClient``.

* The minimum Scrapy version is now ``2.6.0``.
* The following are **backward incompatible** changes, since
  ``web_poet.OverrideRule`` now follows a different structure:

@@ -15,6 +21,8 @@ TBR
* This results in a new format for the ``SCRAPY_POET_OVERRIDES`` setting.
* Removal of this deprecated module: ``scrapy.utils.reqser``

* Added ``async`` support for ``callback_for``.


0.3.0 (2022-01-28)
------------------
24 changes: 24 additions & 0 deletions README.rst
@@ -36,3 +36,27 @@ License is BSD 3-clause.
* Issue tracker: https://github.com/scrapinghub/scrapy-poet/issues

.. _`web-poet`: https://github.com/scrapinghub/web-poet


Quick Start
***********

Installation
============

.. code-block::

    pip install scrapy-poet

Requires **Python 3.7+** and **Scrapy >= 2.6.0**.

Usage in a Scrapy Project
=========================

Add the following inside Scrapy's ``settings.py`` file:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_poet.InjectionMiddleware": 543,
    }
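
Once the middleware is enabled, Page Objects declared in callback signatures
are built and injected automatically. A minimal sketch of the idea (the
``BookPage`` class, selector, and URL below are illustrative, not part of this
change):

.. code-block:: python

    import scrapy
    import web_poet


    class BookPage(web_poet.ItemWebPage):
        def to_item(self):
            # Extract the data of interest from the wrapped response.
            return {"title": self.css("h1::text").get()}


    class BookSpider(scrapy.Spider):
        name = "book"
        start_urls = ["http://books.toscrape.com/catalogue/sharp-objects_997/"]

        def parse(self, response, page: BookPage):
            # scrapy-poet builds BookPage and injects it here.
            yield page.to_item()
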
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -61,7 +61,7 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
language = "en"

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -35,7 +35,8 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
:maxdepth: 1

intro/install
intro/tutorial
intro/basic-tutorial
intro/advanced-tutorial

.. toctree::
:caption: Advanced
168 changes: 168 additions & 0 deletions docs/intro/advanced-tutorial.rst
@@ -0,0 +1,168 @@
.. _`intro-advanced-tutorial`:

=================
Advanced Tutorial
=================

This section goes over the **web-poet** features supported by
**scrapy-poet**:

* ``web_poet.HttpClient``
* ``web_poet.PageParams``

These are mainly achieved by **scrapy-poet** implementing **providers** for them:

* :class:`scrapy_poet.page_input_providers.HttpClientProvider`
* :class:`scrapy_poet.page_input_providers.PageParamsProvider`
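
A provider is a small class that tells scrapy-poet's injector how to build a
given input for a Page Object. As a rough illustration only (not the exact
code of the providers above), a ``PageParams`` provider could look like this:

.. code-block:: python

    from scrapy import Request
    from web_poet import PageParams

    from scrapy_poet.page_input_providers import PageObjectInputProvider


    class MyPageParamsProvider(PageObjectInputProvider):
        # The injector uses this provider whenever a Page Object declares
        # a PageParams dependency.
        provided_classes = {PageParams}

        def __call__(self, to_provide, request: Request):
            # Build PageParams from the "page_params" key of Request.meta.
            return [PageParams(request.meta.get("page_params", {}))]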

.. _`intro-additional-requests`:

Additional Requests
===================

Page Objects that use additional requests don't need anything special from
the spider. They work as-is thanks to the readily available
:class:`scrapy_poet.page_input_providers.HttpClientProvider`, which is enabled
out of the box.

This supplies the Page Object with the necessary ``web_poet.HttpClient`` instance.

The HTTP client implementation that **scrapy-poet** provides to
``web_poet.HttpClient`` handles requests as follows:

- Requests go through downloader middlewares, but they do not go through
spider middlewares or through the scheduler.

- Duplicate requests are not filtered out.

- In line with the web-poet specification for additional requests,
``Request.meta['dont_redirect']`` is set to ``True`` for requests with the
``HEAD`` HTTP method.
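
For instance, per the last point above, a ``HEAD`` request issued through the
client does not follow redirects. A minimal sketch, assuming
``HttpClient.request()`` accepts a ``method`` argument (the page class and URL
are made up for illustration):

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class StockCheckPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            # scrapy-poet sets Request.meta['dont_redirect'] = True for this
            # additional request because its HTTP method is HEAD.
            response = await self.http_client.request(
                "https://example.com/stock-check", method="HEAD"
            )
            return {"url": self.url, "available": response.status == 200}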

Suppose we have the following Page Object:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            response: web_poet.HttpResponse = await self.http_client.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
            item["images"] = response.css(".product-images img::attr(src)").getall()
            return item

It can be directly used inside the spider as:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        async def parse(self, response, page: ProductPage):
            return await page.to_item()

Note that we needed to update the ``parse()`` method to be an ``async`` method,
since the ``to_item()`` method of the Page Object we're using is an ``async``
method as well.
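
Alternatively, since this release also adds ``async`` support to
:func:`~.callback_for`, the callback could be generated rather than written by
hand; a sketch using the same ``ProductPage`` as above:

.. code-block:: python

    import scrapy

    from scrapy_poet import callback_for


    class ProductSpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        # Produces an async generator callback, since ProductPage.to_item()
        # is a coroutine.
        parse = callback_for(ProductPage)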


Page params
===========

Using ``web_poet.PageParams`` allows the Scrapy spider to pass any arbitrary
information into the Page Object.

Suppose we update the earlier Page Object so that a page parameter acts as a
switch controlling whether the additional request is made:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        page_params: web_poet.PageParams

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            if self.page_params.get("enable_extracting_all_images"):
                response: web_poet.HttpResponse = await self.http_client.get(
                    f"https://api.example.com/v2/images?id={item['product_id']}"
                )
                item["images"] = response.css(".product-images img::attr(src)").getall()

            return item

Passing the ``enable_extracting_all_images`` page parameter from the spider
into the Page Object can be achieved by using **Scrapy's** ``Request.meta``
attribute. Specifically, any ``dict`` value stored under the ``page_params``
key of **Scrapy's** ``Request.meta`` is passed into ``web_poet.PageParams``.

Let's see it in action:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    callback=self.parse,
                    meta={"page_params": {"enable_extracting_all_images": True}},
                )

        async def parse(self, response, page: ProductPage):
            return await page.to_item()

44 changes: 40 additions & 4 deletions docs/intro/tutorial.rst → docs/intro/basic-tutorial.rst
@@ -1,8 +1,8 @@
.. _`intro-tutorial`:
.. _`intro-basic-tutorial`:

========
Tutorial
========
==============
Basic Tutorial
==============

In this tutorial, we’ll assume that ``scrapy-poet`` is already installed on your
system. If that’s not the case, see :ref:`intro-install`.
@@ -198,6 +198,42 @@ returning the result of the ``to_item`` method call. We could use
``response.follow_all(links, callback_for(BookPage))``, without creating
an attribute, but currently it won't work with Scrapy disk queues.

.. tip::

    :func:`~.callback_for` also supports `async generators`. So if we have the
    following:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            async def parse_book(self, response: DummyResponse, page: BookPage):
                yield await page.to_item()

    It could be turned into:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            parse_book = callback_for(BookPage)

    This is useful when the Page Object uses additional requests, which rely
    heavily on ``async/await`` syntax. More info on this is in the
    :ref:`intro-additional-requests` tutorial section.

Final result
============

2 changes: 1 addition & 1 deletion docs/intro/install.rst
@@ -16,7 +16,7 @@ If you’re already familiar with installation of Python packages, you can insta

pip install scrapy-poet

Scrapy 2.1.0 or above is required and it has to be installed separately.
Scrapy 2.6.0 or above is required and it has to be installed separately.
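
For example, any standard installation method works; the exact command below
is only an illustration:

.. code-block::

    pip install "Scrapy>=2.6.0"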

Things that are good to know
============================
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,3 +1,3 @@
Scrapy >= 2.1.0
Scrapy >= 2.6.0
Sphinx >= 3.0.3
sphinx-rtd-theme >= 0.4
40 changes: 40 additions & 0 deletions scrapy_poet/api.py
@@ -1,4 +1,5 @@
from typing import Callable, Optional, Type
from inspect import iscoroutinefunction

from scrapy.http import Request, Response

@@ -55,6 +56,38 @@ def parse_book(self, response: DummyResponse, page: BookPage):
    It allows writing this:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            parse_book = callback_for(BookPage)

    It also supports producing an async generator callable if the Page Object's
    ``to_item()`` method is a coroutine which uses the ``async/await`` syntax.
    So if we have the following:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            async def parse_book(self, response: DummyResponse, page: BookPage):
                yield await page.to_item()

    It could be turned into:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
@@ -90,5 +123,12 @@ def parse(self, response):
    def parse(*args, page: page_cls, **kwargs):  # type: ignore
        yield page.to_item()  # type: ignore

    async def async_parse(*args, page: page_cls, **kwargs):  # type: ignore
        yield await page.to_item()  # type: ignore

    if iscoroutinefunction(page_cls.to_item):
        setattr(async_parse, _CALLBACK_FOR_MARKER, True)
        return async_parse

    setattr(parse, _CALLBACK_FOR_MARKER, True)
    return parse