Implement search support #77

Merged
merged 22 commits into from
Nov 22, 2024
Changes from 7 commits
Commits
70bb360
Implement search support
Gallaecio Oct 31, 2024
ee995f9
Search keywords → search queries
Gallaecio Oct 31, 2024
4360e30
Proofreading and dependency review
Gallaecio Oct 31, 2024
188e84f
Fix typing issues
Gallaecio Oct 31, 2024
38ffe75
Mention workarounds in the search_queries tooltip
Gallaecio Oct 31, 2024
4c91132
Do not allow combining crawl_strategy=direct_item with search_queries
Gallaecio Nov 5, 2024
c61e848
Solve an issue reported by mypy
Gallaecio Nov 5, 2024
49c1f3c
Remove docs about reusing the default search request PO builders
Gallaecio Nov 6, 2024
a21741f
UseFallback → no_item_found()
Gallaecio Nov 8, 2024
a7a5cef
Solve typing and dependency conflict issues
Gallaecio Nov 8, 2024
dd18ec4
Require extruct 0.18.0
Gallaecio Nov 8, 2024
4b771fb
zyte-common-items ≥ 0.25.0
Gallaecio Nov 11, 2024
61d435e
DefaultSearchRequestTemplatePage: Do not require RequestUrl
Gallaecio Nov 11, 2024
74d140d
e-commerce: set browserHtml to True for SearchRequestTemplate if extr…
Gallaecio Nov 12, 2024
3f3e8c9
Add missing type hint
Gallaecio Nov 12, 2024
9dd4f38
Fixes
Gallaecio Nov 12, 2024
cdaea64
Use DynamicDeps
Gallaecio Nov 12, 2024
7300db6
Remove the error suggesting browser rendering as a solution
Gallaecio Nov 12, 2024
5763840
Improve the Search Queries description
Gallaecio Nov 13, 2024
8389d7b
Do not allow search queries with full crawl
Gallaecio Nov 13, 2024
a5f3be4
Disable subcategory crawling with search queries
Gallaecio Nov 13, 2024
a298fc1
Merge remote-tracking branch 'zytedata/main' into search
Gallaecio Nov 22, 2024
8 changes: 8 additions & 0 deletions docs/conf.py
Original file line number Diff line number Diff line change
@@ -22,6 +22,14 @@
html_theme = "sphinx_rtd_theme"

intersphinx_mapping = {
    "form2request": (
        "https://form2request.readthedocs.io/en/latest",
        None,
    ),
    "formasaurus": (
        "https://formasaurus.readthedocs.io/en/latest",
        None,
    ),
    "python": (
        "https://docs.python.org/3",
        None,
101 changes: 100 additions & 1 deletion docs/customization/pages.rst
@@ -6,7 +6,8 @@ Customizing page objects

All parsing is implemented using :ref:`web-poet page objects <page-objects>`
that use `Zyte API automatic extraction`_ to extract :ref:`standard items
-<item-api>`, both for navigation and for item details.
+<item-api>`: for navigation, for item details, and even for :ref:`search
+request generation <search-queries>`.

.. _Zyte API automatic extraction: https://docs.zyte.com/zyte-api/usage/extract.html

@@ -141,3 +142,101 @@ To extract a new field for one or more websites:

        def parse_product(self, response: DummyResponse, product: CustomProduct):
            yield from super().parse_product(response, product)

.. _fix-search:

Fixing search support
=====================

If the default implementation to build a request out of :ref:`search queries
<search-queries>` does not work on a given website, you can implement your
own search request page object to fix that. See
:ref:`custom-request-template-page`.

For example:

.. code-block:: python

    from web_poet import field, handle_urls
    from zyte_common_items import SearchRequestTemplatePage


    @handle_urls("example.com")
    class ExampleComSearchRequestTemplatePage(SearchRequestTemplatePage):
        @field
        def url(self):
            return "https://example.com/search?q={{ query|quote_plus }}"
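
What a URL template of this kind produces for a given search query can be approximated with the standard library alone. The following is a minimal sketch, assuming the ``quote_plus`` template filter behaves like ``urllib.parse.quote_plus``; ``build_search_url`` is a hypothetical helper for illustration, not part of zyte-common-items, which renders the template itself.

```python
from urllib.parse import quote_plus


def build_search_url(query: str) -> str:
    """Hypothetical helper: fill in a search URL the way a URL template would."""
    # quote_plus() encodes spaces as "+" and percent-escapes reserved characters.
    return f"https://example.com/search?q={quote_plus(query)}"


print(build_search_url("foo bar"))
print(build_search_url("shoes & socks"))
```

Escaping matters here: without it, a query containing ``&`` or ``#`` would change the structure of the resulting URL.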


Reusing the default implementation
----------------------------------

The default implementation of search request building combines the following
*builders*, pieces of code each of which can determine how to build a search
request from a given web page using a different approach:

-   ``extruct``: Uses the extruct_ library to build a request based on
    SearchAction_ metadata.

    .. _extruct: https://github.com/scrapinghub/extruct
    .. _SearchAction: https://schema.org/SearchAction

-   ``formasaurus``: Uses the AI-powered :doc:`Formasaurus <formasaurus:index>`
    library to find a search form, and builds a request out of it with the
    :doc:`form2request <form2request:index>` library.

-   ``link_heuristics``: Uses heuristics to find a link that looks like a
    search link, and builds a GET request with a URL based on that search link.

-   ``form_heuristics``: Uses heuristics to find a form that looks like a
    search form, and builds a request out of it with the :doc:`form2request
    <form2request:index>` library.

By default, every builder runs and the search request that the most builders
agree on is used. If no single search request is the most common, the one
yielded by the builder that comes first in the list above is used.

This is implemented in the
:class:`zyte_spider_templates.pages.search_request_template.DefaultSearchRequestTemplatePage`
page object class, which supports :ref:`page params <page-params>` to modify
which builders are used, in which order, and with which strategy:

-   ``search_request_builders`` determines which builders to use and their
    order of precedence.

    Default: ``["extruct", "formasaurus", "link_heuristics", "form_heuristics"]``

-   ``search_request_builder_strategy`` determines the strategy to use among
    these:

    -   ``"first"``: Builders are executed in order of precedence, and the
        first search request yielded is used. Builders that do not yield a
        search request at all are ignored.

    -   ``"popular"`` (default): Runs every builder and picks the most common
        search request. If there is not a single most common search request,
        then the order of precedence of builders is taken into account.
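
The difference between the two strategies can be sketched in plain Python. This is an illustration of the selection rules described above, not the code of ``DefaultSearchRequestTemplatePage``: ``pick_search_request`` is a hypothetical function, search requests are represented as plain strings, and every builder result is assumed to be already computed, whereas the real ``"first"`` strategy can stop at the first builder that succeeds.

```python
from collections import Counter
from typing import Optional, Sequence


def pick_search_request(
    candidates: Sequence[Optional[str]], strategy: str = "popular"
) -> Optional[str]:
    """Pick a search request from builder results listed in order of precedence.

    Each candidate is the request a builder produced, or None if it found none.
    """
    found = [candidate for candidate in candidates if candidate is not None]
    if not found:
        return None
    if strategy == "first":
        # The highest-precedence builder that produced something wins.
        return found[0]
    if strategy == "popular":
        counts = Counter(found)
        best = max(counts.values())
        # Among the most common requests, precedence order breaks the tie.
        return next(candidate for candidate in found if counts[candidate] == best)
    raise ValueError(f"Unknown strategy: {strategy!r}")


results = ["/search?q=X", "/s/X", "/s/X", None]
print(pick_search_request(results, "first"))
print(pick_search_request(results, "popular"))
```

With these inputs, ``"first"`` trusts the highest-precedence builder, while ``"popular"`` prefers the request that two builders agree on.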

If the default implementation does not work for a given website, but a specific
builder does work, you could implement a search request template page object
class that subclasses this one and changes the strategy and builders.

For example, if a website defines valid SearchAction_ metadata, you can force
that metadata to be used for that website with the following page object class:

.. code-block:: python

    import attrs
    from web_poet import handle_urls
    from zyte_spider_templates.pages.search_request_template import (
        DefaultSearchRequestTemplatePage,
    )


    @handle_urls("example.com")
    @attrs.define
    class ExampleComSearchRequestTemplatePage(DefaultSearchRequestTemplatePage):
        def __attrs_post_init__(self):
            self.page_params.setdefault("search_request_builder_strategy", "first")
            self.page_params.setdefault("search_request_builders", ["extruct"])
43 changes: 43 additions & 0 deletions docs/features/search.rst
@@ -0,0 +1,43 @@
.. _search-queries:

==============
Search queries
==============

The :ref:`e-commerce spider template <e-commerce>` supports a spider argument,
:data:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.search_queries`,
that takes a list of search queries, one per line, and turns the input URLs
into search requests for those queries.

For example, given the following input URLs:

.. code-block:: none

    https://a.example
    https://b.example

And the following list of search queries:

.. code-block:: none

    foo bar
    baz

By default, the spider sends 2 initial requests, 1 per input URL, to find out
how to build a search request for each website. For every website where that
succeeds, it then sends 1 search request per search query, that is, up to 4
search requests in total. For example:

.. code-block:: none

    https://a.example/search?q=foo+bar
    https://a.example/search?q=baz
    https://b.example/s/foo%20bar
    https://b.example/s/baz
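
The combination of input URLs and search queries shown above can be reproduced with a short sketch. The two URL builders below are assumptions made up to match the example output; in the spider, how each website builds search URLs is discovered from the website itself.

```python
from itertools import product
from urllib.parse import quote, quote_plus

# Hypothetical per-website search URL builders, chosen to match the example.
search_url_builders = {
    "https://a.example": lambda query: f"https://a.example/search?q={quote_plus(query)}",
    "https://b.example": lambda query: f"https://b.example/s/{quote(query)}",
}
search_queries = ["foo bar", "baz"]

# One search request per combination of input URL and search query.
search_urls = [
    search_url_builders[url](query)
    for url, query in product(search_url_builders, search_queries)
]
for search_url in search_urls:
    print(search_url)
```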

The default implementation uses a combination of HTML metadata, AI-based HTML
form inspection and heuristics to find the most likely way to build a search
request for a given website.

If this default implementation does not work as expected on a given website,
you can :ref:`write a page object to fix that <fix-search>`.
6 changes: 6 additions & 0 deletions docs/index.rst
@@ -20,6 +20,12 @@ zyte-spider-templates documentation
   E-commerce <templates/e-commerce>
   Google search <templates/google-search>

.. toctree::
   :caption: Features
   :hidden:

   Search queries <features/search>

.. toctree::
   :caption: Customization
   :hidden:
3 changes: 3 additions & 0 deletions pytest.ini
@@ -0,0 +1,3 @@
[pytest]
filterwarnings =
    ignore:deprecated string literal syntax::jmespath.lexer
5 changes: 5 additions & 0 deletions setup.py
@@ -12,12 +12,17 @@
    packages=find_packages(),
    include_package_data=True,
    install_requires=[
        "extruct @ git+https://github.com/Gallaecio/extruct.git@query-input",
        "form2request>=0.2.0",
        "formasaurus @ git+https://github.com/Gallaecio/Formasaurus.git@form2request",
        "jmespath>=0.9.5",
        "pydantic>=2.1",
        "requests>=0.10.1",
        "scrapy>=2.11.0",
        "scrapy-poet>=0.24.0",
        "scrapy-spider-metadata>=0.2.0",
        "scrapy-zyte-api[provider]>=0.23.0",
        "web-poet>=0.17.1",
        "zyte-common-items>=0.23.0",
    ],
    classifiers=[
56 changes: 56 additions & 0 deletions tests/test_ecommerce.py
@@ -37,6 +37,19 @@ def test_parameters():
    with pytest.raises(ValidationError):
        EcommerceSpider(url="https://example.com", crawl_strategy="unknown")

    EcommerceSpider(
        url="https://example.com", crawl_strategy="direct_item", search_queries=""
    )
    EcommerceSpider(
        url="https://example.com", crawl_strategy="automatic", search_queries="foo"
    )
    with pytest.raises(ValidationError):
        EcommerceSpider(
            url="https://example.com",
            crawl_strategy="direct_item",
            search_queries="foo",
        )
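
The rule these assertions exercise can be summarized in a few lines. This is a sketch of the behavior the test expects, not the spider's actual validation code, which relies on pydantic and raises ``ValidationError``; ``check_search_params`` is a hypothetical function and uses ``ValueError`` to stay dependency-free.

```python
from typing import List


def check_search_params(crawl_strategy: str, search_queries: List[str]) -> None:
    """Reject search queries when only the input URLs themselves are crawled."""
    # An empty list of search queries is always accepted.
    if crawl_strategy == "direct_item" and search_queries:
        raise ValueError(
            "crawl_strategy=direct_item cannot be combined with search queries"
        )


check_search_params("direct_item", [])  # accepted: no search queries
check_search_params("automatic", ["foo"])  # accepted: strategy follows links
```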


def test_start_requests():
    url = "https://example.com"
@@ -420,6 +433,21 @@ def test_metadata():
        "title": "URLs file",
        "type": "string",
    },
    "search_queries": {
        "default": [],
        "description": (
            "Turn the input URLs into search requests for these "
            "queries. You may specify a separate search query per "
            "line. If search request building fails, you can "
            "instead pass search URLs as input start URLs, or "
            "customize the AI spider project with a search "
            "request template page object (check the docs)."
        ),
        "items": {"type": "string"},
        "title": "Search Queries",
        "type": "array",
        "widget": "textarea",
    },
    "crawl_strategy": {
        "default": "automatic",
        "description": "Determines how the start URL and follow-up URLs are crawled.",
@@ -820,6 +848,34 @@ def test_urls_file():
    assert start_requests[2].url == "https://c.example"


def test_search_queries():
    crawler = get_crawler()
    url = "https://example.com"

    spider = EcommerceSpider.from_crawler(crawler, url=url, search_queries="foo bar")
    start_requests = list(spider.start_requests())
    assert len(start_requests) == 1
    assert start_requests[0].url == url
    assert start_requests[0].callback == spider.parse_search_request_template
    assert spider.args.search_queries == ["foo bar"]

    spider = EcommerceSpider.from_crawler(crawler, url=url, search_queries="foo\nbar")
    start_requests = list(spider.start_requests())
    assert len(start_requests) == 1
    assert start_requests[0].url == url
    assert start_requests[0].callback == spider.parse_search_request_template
    assert spider.args.search_queries == ["foo", "bar"]

    spider = EcommerceSpider.from_crawler(
        crawler, url=url, search_queries=["foo", "bar"]
    )
    start_requests = list(spider.start_requests())
    assert len(start_requests) == 1
    assert start_requests[0].url == url
    assert start_requests[0].callback == spider.parse_search_request_template
    assert spider.args.search_queries == ["foo", "bar"]
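
The input handling that this test relies on, where a string is split into one query per line and a list is kept as is, can be sketched as follows. ``normalize_search_queries`` is a hypothetical name, and stripping whitespace and dropping empty lines are assumptions; the spider's own parameter validation may differ in such details.

```python
from typing import List, Union


def normalize_search_queries(value: Union[str, List[str]]) -> List[str]:
    """Turn a multi-line string or a list into a list of search queries."""
    if isinstance(value, str):
        # One query per line; blank lines are skipped (an assumption).
        return [line.strip() for line in value.splitlines() if line.strip()]
    return list(value)


print(normalize_search_queries("foo bar"))
print(normalize_search_queries("foo\nbar"))
print(normalize_search_queries(["foo", "bar"]))
```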


@pytest.mark.parametrize(
    "url,has_full_domain",
    (