Merge pull request #62 from scrapinghub/po-additional-requests
integration for web-poet's support on additional requests and Meta
kmike authored Jun 16, 2022
2 parents 0aaa262 + 98ce454 commit e4589e6
Showing 22 changed files with 1,165 additions and 41 deletions.
18 changes: 17 additions & 1 deletion .github/workflows/test.yml
@@ -16,7 +16,21 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ['3.7', '3.8', '3.9', '3.10']
include:
- python-version: "3.7"
toxenv: "min"
- python-version: "3.7"
toxenv: "asyncio-min"

- python-version: "3.8"
toxenv: "py"
- python-version: "3.9"
toxenv: "py"

- python-version: "3.10"
toxenv: "py"
- python-version: "3.10"
toxenv: "asyncio"

steps:
- uses: actions/checkout@v2
@@ -29,6 +43,8 @@ jobs:
python -m pip install --upgrade pip
python -m pip install tox
- name: tox
env:
TOXENV: ${{ matrix.toxenv }}
run: |
tox -e py
- name: coverage
8 changes: 8 additions & 0 deletions CHANGELOG.rst
@@ -6,6 +6,12 @@ TBR
---

* Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.
* Support for the new features in ``web_poet>=0.2.0`` for additional
  requests inside Page Objects:

    * Created new providers for ``web_poet.PageParams`` and
      ``web_poet.HttpClient``.

* The minimum Scrapy version is now ``2.6.0``.
* The following are **backward incompatible** changes, since
  ``web_poet.OverrideRule`` now follows a different structure:

@@ -15,6 +21,8 @@ TBR
* This results in a new format for the ``SCRAPY_POET_OVERRIDES`` setting.
* Removal of this deprecated module: ``scrapy.utils.reqser``

* Added ``async`` support for ``callback_for``.


0.3.0 (2022-01-28)
------------------
24 changes: 24 additions & 0 deletions README.rst
@@ -36,3 +36,27 @@ License is BSD 3-clause.
* Issue tracker: https://github.com/scrapinghub/scrapy-poet/issues

.. _`web-poet`: https://github.com/scrapinghub/web-poet


Quick Start
***********

Installation
============

.. code-block::

    pip install scrapy-poet

Requires **Python 3.7+** and **Scrapy >= 2.6.0**.

Usage in a Scrapy Project
=========================

Add the following inside Scrapy's ``settings.py`` file:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_poet.InjectionMiddleware": 543,
    }
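
Once the middleware is enabled, Page Objects declared in callback signatures
are built and injected automatically. A minimal sketch of the idea (the
``BookPage`` class, selector, and URL below are illustrative, not part of this
change):

.. code-block:: python

    import scrapy
    import web_poet


    class BookPage(web_poet.ItemWebPage):
        def to_item(self):
            # Extract the data of interest from the wrapped response.
            return {"title": self.css("h1::text").get()}


    class BookSpider(scrapy.Spider):
        name = "book"
        start_urls = ["http://books.toscrape.com/catalogue/sharp-objects_997/"]

        def parse(self, response, page: BookPage):
            # scrapy-poet builds BookPage and injects it here.
            yield page.to_item()
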
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -61,7 +61,7 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
language = "en"

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -35,7 +35,8 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
:maxdepth: 1

intro/install
intro/tutorial
intro/basic-tutorial
intro/advanced-tutorial

.. toctree::
:caption: Advanced
168 changes: 168 additions & 0 deletions docs/intro/advanced-tutorial.rst
@@ -0,0 +1,168 @@
.. _`intro-advanced-tutorial`:

=================
Advanced Tutorial
=================

This section goes over the **web-poet** features supported by
**scrapy-poet**:

* ``web_poet.HttpClient``
* ``web_poet.PageParams``

These are mainly achieved by **scrapy-poet** implementing **providers** for them:

* :class:`scrapy_poet.page_input_providers.HttpClientProvider`
* :class:`scrapy_poet.page_input_providers.PageParamsProvider`
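
A provider is a small class that tells scrapy-poet's injector how to build a
given input for a Page Object. As a rough illustration only (not the exact
code of the providers above), a ``PageParams`` provider could look like this:

.. code-block:: python

    from scrapy import Request
    from web_poet import PageParams

    from scrapy_poet.page_input_providers import PageObjectInputProvider


    class MyPageParamsProvider(PageObjectInputProvider):
        # The injector uses this provider whenever a Page Object declares
        # a PageParams dependency.
        provided_classes = {PageParams}

        def __call__(self, to_provide, request: Request):
            # Build PageParams from the "page_params" key of Request.meta.
            return [PageParams(request.meta.get("page_params", {}))]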

.. _`intro-additional-requests`:

Additional Requests
===================

Page Objects that use additional requests don't need anything special from
the spider. They work as-is thanks to the readily available
:class:`scrapy_poet.page_input_providers.HttpClientProvider`, which is enabled
out of the box.

This supplies the Page Object with the necessary ``web_poet.HttpClient`` instance.

The HTTP client implementation that **scrapy-poet** provides to
``web_poet.HttpClient`` handles requests as follows:

- Requests go through downloader middlewares, but they do not go through
spider middlewares or through the scheduler.

- Duplicate requests are not filtered out.

- In line with the web-poet specification for additional requests,
``Request.meta['dont_redirect']`` is set to ``True`` for requests with the
``HEAD`` HTTP method.
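
For instance, per the last point above, a ``HEAD`` request issued through the
client does not follow redirects. A minimal sketch, assuming
``HttpClient.request()`` accepts a ``method`` argument (the page class and URL
are made up for illustration):

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class StockCheckPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            # scrapy-poet sets Request.meta['dont_redirect'] = True for this
            # additional request because its HTTP method is HEAD.
            response = await self.http_client.request(
                "https://example.com/stock-check", method="HEAD"
            )
            return {"url": self.url, "available": response.status == 200}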

Suppose we have the following Page Object:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            response: web_poet.HttpResponse = await self.http_client.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
            item["images"] = response.css(".product-images img::attr(src)").getall()
            return item

It can be directly used inside the spider as:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        async def parse(self, response, page: ProductPage):
            return await page.to_item()

Note that we needed to update the ``parse()`` method to be an ``async`` method,
since the ``to_item()`` method of the Page Object we're using is an ``async``
method as well.
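
Alternatively, since this release also adds ``async`` support to
:func:`~.callback_for`, the callback could be generated rather than written by
hand; a sketch using the same ``ProductPage`` as above:

.. code-block:: python

    import scrapy

    from scrapy_poet import callback_for


    class ProductSpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        # Produces an async generator callback, since ProductPage.to_item()
        # is a coroutine.
        parse = callback_for(ProductPage)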


Page params
===========

Using ``web_poet.PageParams`` allows the Scrapy spider to pass any arbitrary
information into the Page Object.

Suppose we update the earlier Page Object so that a page parameter acts as a
switch controlling whether the additional request is made:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        page_params: web_poet.PageParams

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            if self.page_params.get("enable_extracting_all_images"):
                response: web_poet.HttpResponse = await self.http_client.get(
                    f"https://api.example.com/v2/images?id={item['product_id']}"
                )
                item["images"] = response.css(".product-images img::attr(src)").getall()

            return item

Passing the ``enable_extracting_all_images`` page parameter from the spider
into the Page Object can be achieved by using **Scrapy's** ``Request.meta``
attribute. Specifically, any ``dict`` value stored under the ``page_params``
key of **Scrapy's** ``Request.meta`` is passed into ``web_poet.PageParams``.

Let's see it in action:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    callback=self.parse,
                    meta={"page_params": {"enable_extracting_all_images": True}},
                )

        async def parse(self, response, page: ProductPage):
            return await page.to_item()

44 changes: 40 additions & 4 deletions docs/intro/tutorial.rst → docs/intro/basic-tutorial.rst
@@ -1,8 +1,8 @@
.. _`intro-tutorial`:
.. _`intro-basic-tutorial`:

========
Tutorial
========
==============
Basic Tutorial
==============

In this tutorial, we’ll assume that ``scrapy-poet`` is already installed on your
system. If that’s not the case, see :ref:`intro-install`.
@@ -198,6 +198,42 @@ returning the result of the ``to_item`` method call. We could use
``response.follow_all(links, callback_for(BookPage))``, without creating
an attribute, but currently it won't work with Scrapy disk queues.

.. tip::

    :func:`~.callback_for` also supports `async generators`. So if we have the
    following:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            async def parse_book(self, response: DummyResponse, page: BookPage):
                yield await page.to_item()

    It could be turned into:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            parse_book = callback_for(BookPage)

    This is useful when the Page Object uses additional requests, which rely
    heavily on ``async/await`` syntax. More info on this is in the
    :ref:`intro-additional-requests` tutorial section.

Final result
============

2 changes: 1 addition & 1 deletion docs/intro/install.rst
@@ -16,7 +16,7 @@ If you’re already familiar with installation of Python packages, you can insta

pip install scrapy-poet

Scrapy 2.1.0 or above is required and it has to be installed separately.
Scrapy 2.6.0 or above is required and it has to be installed separately.
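
For example, any standard installation method works; the exact command below
is only an illustration:

.. code-block::

    pip install "Scrapy>=2.6.0"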

Things that are good to know
============================
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,3 +1,3 @@
Scrapy >= 2.1.0
Scrapy >= 2.6.0
Sphinx >= 3.0.3
sphinx-rtd-theme >= 0.4
40 changes: 40 additions & 0 deletions scrapy_poet/api.py
@@ -1,4 +1,5 @@
from typing import Callable, Optional, Type
from inspect import iscoroutinefunction

from scrapy.http import Request, Response

@@ -55,6 +56,38 @@ def parse_book(self, response: DummyResponse, page: BookPage):
    It allows writing this:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            parse_book = callback_for(BookPage)

    It also supports producing an async generator callable if the Page Object's
    ``to_item()`` method is a coroutine which uses the ``async/await`` syntax.
    So if we have the following:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            async def parse_book(self, response: DummyResponse, page: BookPage):
                yield await page.to_item()

    It could be turned into:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
@@ -90,5 +123,12 @@ def parse(self, response):
    def parse(*args, page: page_cls, **kwargs):  # type: ignore
        yield page.to_item()  # type: ignore

    async def async_parse(*args, page: page_cls, **kwargs):  # type: ignore
        yield await page.to_item()  # type: ignore

    if iscoroutinefunction(page_cls.to_item):
        setattr(async_parse, _CALLBACK_FOR_MARKER, True)
        return async_parse

    setattr(parse, _CALLBACK_FOR_MARKER, True)
    return parse