add docs for supporting web-poet's HttpClient and Meta
BurnzZ committed Mar 15, 2022
1 parent e72d5ca commit 5eca70e
Showing 10 changed files with 204 additions and 12 deletions.
24 changes: 24 additions & 0 deletions README.rst
@@ -36,3 +36,27 @@ License is BSD 3-clause.
* Issue tracker: https://github.com/scrapinghub/scrapy-poet/issues

.. _`web-poet`: https://github.com/scrapinghub/web-poet


Quick Start
***********

Installation
============

.. code-block::

    pip install scrapy-poet

Requires **Python 3.7+** and **Scrapy >= 2.6.0**.

Usage in a Scrapy Project
=========================

Add the following inside Scrapy's ``settings.py`` file:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_poet.InjectionMiddleware": 543,
    }
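
With the middleware enabled, a page object declared as a callback argument is
built and injected automatically. A minimal sketch, assuming a hypothetical
page object and spider (all names here are illustrative):

.. code-block:: python

    import scrapy
    import web_poet


    class TitlePage(web_poet.ItemWebPage):
        def to_item(self):
            # Extract a single field from the downloaded response.
            return {"title": self.css("title::text").get()}


    class TitleSpider(scrapy.Spider):
        name = "title_spider"
        start_urls = ["https://example.com"]

        def parse(self, response, page: TitlePage):
            # ``page`` is built and injected by scrapy_poet.InjectionMiddleware.
            return page.to_item()
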
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -35,7 +35,8 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
:maxdepth: 1

intro/install
intro/tutorial
intro/basic-tutorial
intro/advanced-tutorial

.. toctree::
:caption: Advanced
167 changes: 167 additions & 0 deletions docs/intro/advanced-tutorial.rst
@@ -0,0 +1,167 @@
.. _`intro-advanced-tutorial`:

=================
Advanced Tutorial
=================

This section goes over the **web-poet** features that **scrapy-poet** supports:

* ``web_poet.HttpClient``
* ``web_poet.Meta``

**scrapy-poet** makes these available by implementing **providers** for them:

* :class:`scrapy_poet.page_input_providers.HttpClientProvider`
* :class:`scrapy_poet.page_input_providers.MetaProvider`
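
For a sense of what a provider looks like, the ``MetaProvider`` added in this
commit is essentially a small class that declares which input it provides and
builds it from the incoming :class:`scrapy.Request` (paraphrased from
``scrapy_poet/page_input_providers.py`` as changed in this commit; the exact
import paths are a best guess):

.. code-block:: python

    from typing import Callable, Set

    from scrapy import Request
    from scrapy_poet.page_input_providers import PageObjectInputProvider
    from web_poet import Meta


    class MetaProvider(PageObjectInputProvider):
        provided_classes = {Meta}

        def __call__(self, to_provide: Set[Callable], request: Request):
            # Build a Meta instance from the "po_args" dict in Request.meta.
            return [Meta(**request.meta.get("po_args", {}))]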


Additional Requests
===================

Page Objects that issue additional requests don't need anything special from
the spider. They work as-is thanks to
:class:`scrapy_poet.page_input_providers.HttpClientProvider`, which is enabled
out of the box.

This provider supplies the Page Object with the ``web_poet.HttpClient``
instance it needs. Note that the HTTP downloader implementation that
**scrapy-poet** plugs into ``web_poet.HttpClient`` is the **Scrapy Downloader**.

.. tip::

This means that the additional requests inside a Page Object will have access
to the **Downloader Middlewares** that the Spider is using.


Suppose we have the following Page Object:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            response: web_poet.ResponseData = await self.http_client.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
            page = web_poet.WebPage(response)
            item["images"] = page.css(".product-images img::attr(src)").getall()
            return item

It can then be used directly inside the spider:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):
        name = "product_spider"

        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        async def parse(self, response, page: ProductPage):
            return await page.to_item()

Note that we needed to update the ``parse()`` method to be an ``async`` method,
since the ``to_item()`` method of the Page Object we're using is an ``async``
method as well.

This is also the primary reason why **scrapy-poet** requires ``scrapy>=2.6.0``:
it's the minimum version with full :mod:`asyncio` support.
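
Since ``to_item()`` is a coroutine, a Page Object can also issue several
additional requests concurrently using standard :mod:`asyncio` tools. A
hypothetical sketch, assuming a second ``/v2/reviews`` endpoint exists
alongside the ``/v2/images`` one used above:

.. code-block:: python

    import asyncio

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            product_id = self.css("#product ::attr(product-id)").get()

            # Fire both additional requests concurrently instead of
            # awaiting them one after the other.
            images, reviews = await asyncio.gather(
                self.http_client.get(
                    f"https://api.example.com/v2/images?id={product_id}"
                ),
                self.http_client.get(
                    f"https://api.example.com/v2/reviews?id={product_id}"
                ),
            )
            return {
                "url": self.url,
                "images": web_poet.WebPage(images).css(
                    ".product-images img::attr(src)").getall(),
                "review_count": len(web_poet.WebPage(reviews).css(".review")),
            }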


Meta
====

Using ``web_poet.Meta`` allows the Scrapy spider to pass arbitrary information
into the Page Object.

Suppose we update the earlier Page Object so that the additional request
becomes optional. A meta value acts as a switch that changes the Page Object's
behavior:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        meta: web_poet.Meta

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            if self.meta.get("enable_extracting_all_images"):
                response: web_poet.ResponseData = await self.http_client.get(
                    f"https://api.example.com/v2/images?id={item['product_id']}"
                )
                page = web_poet.WebPage(response)
                item["images"] = page.css(".product-images img::attr(src)").getall()

            return item

Passing the ``enable_extracting_all_images`` meta value from the spider into
the Page Object is done through **Scrapy's** ``Request.meta`` attribute.
Specifically, the ``dict`` value under the ``po_args`` key of ``Request.meta``
is passed into ``web_poet.Meta``.

Let's see it in action:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):
        name = "product_spider"

        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    callback=self.parse,
                    meta={"po_args": {"enable_extracting_all_images": True}},
                )

        async def parse(self, response, page: ProductPage):
            return await page.to_item()
8 changes: 4 additions & 4 deletions docs/intro/tutorial.rst → docs/intro/basic-tutorial.rst
@@ -1,8 +1,8 @@
.. _`intro-tutorial`:
.. _`intro-basic-tutorial`:

========
Tutorial
========
==============
Basic Tutorial
==============

In this tutorial, we’ll assume that ``scrapy-poet`` is already installed on your
system. If that’s not the case, see :ref:`intro-install`.
2 changes: 1 addition & 1 deletion docs/intro/install.rst
@@ -16,7 +16,7 @@ If you’re already familiar with installation of Python packages, you can insta

pip install scrapy-poet

Scrapy 2.1.0 or above is required and it has to be installed separately.
Scrapy 2.6.0 or above is required and it has to be installed separately.

Things that are good to know
============================
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,3 +1,3 @@
Scrapy >= 2.1.0
Scrapy >= 2.6.0
Sphinx >= 3.0.3
sphinx-rtd-theme >= 0.4
1 change: 1 addition & 0 deletions scrapy_poet/debug.log
@@ -0,0 +1 @@
/home/k/.pyenv/versions/3.7.9/bin/python3: can't open file 'multiple_spider_in_one_process.py': [Errno 2] No such file or directory
4 changes: 2 additions & 2 deletions scrapy_poet/page_input_providers.py
@@ -198,7 +198,7 @@ class HttpClientProvider(PageObjectInputProvider):
provided_classes = {HttpClient}

def __call__(self, to_provide: Set[Callable], crawler: Crawler):
"""Creates an ``web_poet.requests.HttpClient``` instance using Scrapy's
"""Creates an ``web_poet.requests.HttpClient`` instance using Scrapy's
downloader.
"""
backend = create_scrapy_backend(crawler.engine.download)
@@ -211,6 +211,6 @@ class MetaProvider(PageObjectInputProvider):

def __call__(self, to_provide: Set[Callable], request: Request):
"""Creates a ``web_poet.requests.Meta`` instance based on the data found
from the ``meta["po_args"]`` field of a ``scrapy.http.Response``instance.
from the ``meta["po_args"]`` field of a ``scrapy.http.Response`` instance.
"""
return [Meta(**request.meta.get("po_args", {}))]
2 changes: 1 addition & 1 deletion setup.py
@@ -14,7 +14,7 @@
'andi >= 0.4.1',
'attrs',
'parsel',
'web-poet',
'web-poet @ git+https://[email protected]/scrapinghub/web-poet@meta#egg=web-poet',
'tldextract',
'sqlitedict',
],
3 changes: 1 addition & 2 deletions tox.ini
@@ -9,10 +9,9 @@ deps =
pytest
pytest-cov
pytest-asyncio
scrapy >= 2.1.0
scrapy >= 2.6.0
pytest-twisted
web-poet @ git+https://[email protected]/scrapinghub/web-poet@meta#egg=web-poet
scrapy @ git+https://github.com/scrapy/scrapy.git@30d5779#egg=scrapy

commands =
py.test \
