Article spider template #91

Merged
merged 26 commits into from
Dec 16, 2024
17d9f95
remove old file
PyExplorer Dec 6, 2024
c3fa843
merge from article repo
PyExplorer Dec 6, 2024
d5b8adc
update api.rst
PyExplorer Dec 6, 2024
98b4e9c
add DefaultSearchRequestTemplatePage to pages api.rst
PyExplorer Dec 6, 2024
e04c7b3
add tests for coverage utils.py
PyExplorer Dec 8, 2024
2623342
add tests for coverage PageParamsMiddlewareBase
PyExplorer Dec 8, 2024
1045617
add tests for OnlyFeedsMiddleware and PageParamsMiddlewareBase
PyExplorer Dec 9, 2024
124e123
add tests for DummyDupeFilter
PyExplorer Dec 9, 2024
80c7022
add from_crawler for TrackSeedsSpiderMiddleware test
PyExplorer Dec 9, 2024
492a7be
add from_crawler for PageParamsMiddlewareBase test
PyExplorer Dec 9, 2024
758ec6f
add tests for DupeFilterSpiderMiddleware + tune others
PyExplorer Dec 9, 2024
61fab4e
fix async test for DupeFilterSpiderMiddleware
PyExplorer Dec 9, 2024
b24de1e
add test for from_crawler for TrackSeedsSpiderMiddleware, MaxRequests…
PyExplorer Dec 9, 2024
8ad76ce
add test for TrackNavigationDepthSpiderMiddleware
PyExplorer Dec 9, 2024
728f84e
formatting
PyExplorer Dec 9, 2024
970743b
fix auth issue on staging
PyExplorer Dec 12, 2024
951bcd7
test
PyExplorer Dec 12, 2024
68371f2
clean todo
PyExplorer Dec 12, 2024
769bcd0
Only enable DropLowProbabilityItemPipeline for the articles spider.
wRAR Dec 13, 2024
8927379
Merge pull request #98 from zytedata/articles-probability-pipeline
wRAR Dec 13, 2024
ad6a58c
add validation for incremental_collection_name
kmike Dec 13, 2024
0a6cad3
fix formatting
kmike Dec 13, 2024
80aa641
Merge pull request #100 from zytedata/fix-collection-name-validation
kmike Dec 13, 2024
9f2ef4b
Mark Articles spider as experimental
kmike Dec 13, 2024
96376a5
Merge pull request #101 from zytedata/mark-articles-as-experimental
kmike Dec 13, 2024
bacdae4
Merge branch 'main' into articles_to_main
kmike Dec 16, 2024
1 change: 1 addition & 0 deletions docs/conf.py
@@ -70,6 +70,7 @@
autodoc_pydantic_model_show_validator_members = False
autodoc_pydantic_model_show_validator_summary = False
autodoc_pydantic_field_list_validators = False
autodoc_pydantic_field_show_constraints = False

# sphinx-reredirects
redirects = {
10 changes: 9 additions & 1 deletion docs/index.rst
@@ -18,6 +18,7 @@ zyte-spider-templates documentation

templates/index
E-commerce <templates/e-commerce>
Article <templates/article>
Google search <templates/google-search>

.. toctree::
@@ -34,9 +35,16 @@ zyte-spider-templates documentation
customization/spiders
customization/pages

.. toctree::
:caption: Reference
:hidden:

reference/settings
reference/reqmeta
reference/api

.. toctree::
:caption: All the rest
:hidden:

reference/index
changes
31 changes: 28 additions & 3 deletions docs/reference/index.rst → docs/reference/api.rst
@@ -1,10 +1,12 @@
=========
Reference
=========
===
API
===

Spiders
=======

.. autoclass:: zyte_spider_templates.ArticleSpider

.. autoclass:: zyte_spider_templates.BaseSpider

.. autoclass:: zyte_spider_templates.EcommerceSpider
@@ -15,6 +17,10 @@ Spiders
Pages
=====

.. autoclass:: zyte_spider_templates.pages.DefaultSearchRequestTemplatePage

.. autoclass:: zyte_spider_templates.pages.HeuristicsArticleNavigationPage

.. autoclass:: zyte_spider_templates.pages.HeuristicsProductNavigationPage


@@ -59,3 +65,22 @@ Parameter mixins

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpMaxPagesParam
:exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.spiders.article.ArticleCrawlStrategyParam
:exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.article.ArticleCrawlStrategy


.. _middlewares:

Middlewares
===========

.. autoclass:: zyte_spider_templates.CrawlingLogsMiddleware
.. autoclass:: zyte_spider_templates.TrackNavigationDepthSpiderMiddleware
.. autoclass:: zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware
.. autoclass:: zyte_spider_templates.OffsiteRequestsPerSeedMiddleware
.. autoclass:: zyte_spider_templates.OnlyFeedsMiddleware
.. autoclass:: zyte_spider_templates.TrackSeedsSpiderMiddleware
.. autoclass:: zyte_spider_templates.IncrementalCrawlMiddleware
112 changes: 112 additions & 0 deletions docs/reference/reqmeta.rst
@@ -0,0 +1,112 @@
.. _meta:

=================
Request.meta keys
=================

Keys that can be defined in :attr:`Request.meta <scrapy.http.Request.meta>` for
zyte-spider-templates.

.. reqmeta:: seed

seed
====

Default: ``The seed URL (or value) from which the request originated.``

The key is used by :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware` and
:class:`~zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware`.

The ``seed`` meta key tracks and identifies the origin of a request. It is
initially set on each request that originates from a start request, and it can
be used to manage domain constraints for subsequent requests. You can also set
this key to an arbitrary value to identify the seed source.

Here's an example:

.. code-block:: python

meta = {
"seed": "http://example.com",
}

.. reqmeta:: is_seed_request

is_seed_request
===============

Default: ``False``

The key is used by :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware`.

The ``is_seed_request`` meta key is a boolean flag that identifies whether the
request is a start request (i.e., one originating from an initial seed URL).
When set to ``True``, the middleware extracts seed domains from the response.

Example:
::

meta = {
'is_seed_request': True,
}

.. reqmeta:: seed_domains

seed_domains
============

Default: ``Initial URL and redirected URLs``

The key is used by :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware`.

The ``seed_domains`` meta key is a list of domains that the middleware uses to
check whether a request belongs to these domains. By default, the list includes
the domain of the initial URL and, if the initial URL was redirected, the
domains of the redirected URLs. You can also set this list in the spider to
specify additional domains for which the middleware should allow requests.

Here's an example:

.. code-block:: python

meta = {"seed_domains": ["example.com", "another-example.com"]}

.. reqmeta:: increase_navigation_depth

increase_navigation_depth
=========================

Default: ``True``

The key is used by :class:`~zyte_spider_templates.TrackNavigationDepthSpiderMiddleware`.

The ``increase_navigation_depth`` meta key is a boolean flag that determines whether
``navigation_depth`` should be increased for a request. By default, the middleware
increases ``navigation_depth`` for all requests. Spiders can override this behavior for
certain types of requests, such as pagination or RSS feeds, by setting the meta key explicitly.

Example:
::

meta = {
'increase_navigation_depth': False,
}
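The effect of the flag on depth bookkeeping can be sketched as follows. This is
only an illustration of the rule described above, not the middleware's actual
code; ``next_navigation_depth`` is a hypothetical helper name.

```python
def next_navigation_depth(current_depth: int, increase_navigation_depth: bool = True) -> int:
    """Return the navigation depth of a follow-up request (hypothetical helper).

    The depth grows by one unless the request opts out through the
    ``increase_navigation_depth`` meta key.
    """
    return current_depth + 1 if increase_navigation_depth else current_depth


# A subcategory link deepens the crawl; a pagination link stays at the same depth.
subcategory_depth = next_navigation_depth(1)
pagination_depth = next_navigation_depth(1, increase_navigation_depth=False)
```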

.. reqmeta:: only_feeds

only_feeds
==========

Default: ``False``

The key is used by :class:`~zyte_spider_templates.OnlyFeedsMiddleware`.

The ``only_feeds`` meta key is a boolean flag that determines whether the
spider discovers all links on the website (``False``) or extracts links from
RSS/Atom feeds only (``True``).

Example:
::

meta = {
'page_params': {'only_feeds': True}
}

196 changes: 196 additions & 0 deletions docs/reference/settings.rst
@@ -0,0 +1,196 @@
.. _settings:

========
Settings
========

.. setting:: NAVIGATION_DEPTH_LIMIT

NAVIGATION_DEPTH_LIMIT
======================

Default: ``0``

The maximum navigation depth to crawl. If ``0``, no limit is imposed.

*navigation_depth* increases for requests that navigate from a parent category to a
subcategory, including requests that go from the website home page to a category.
It does not increase for requests to item details (e.g., an article) or to
additional pages of an already visited page. For example, if you set
``NAVIGATION_DEPTH_LIMIT`` to ``1``, only item details and pagination links from your
start URLs are followed.

.. note::
Currently, only the :ref:`Article spider template <article>` implements proper
navigation_depth support. Other spider templates treat all follow-up requests as
increasing navigation_depth.

Setting a navigation_depth limit can prevent a spider from delving too deeply into
subcategories. This is especially useful if you only need data from the
top-level categories or specific subcategories.

When :ref:`customizing a spider template <custom-spiders>`, set the
:reqmeta:`increase_navigation_depth` request metadata key to override whether a request is
considered as increasing navigation depth (``True``) or not (``False``):

.. code-block:: python

Request("https://example.com", meta={"increase_navigation_depth": False})

If you want to limit all link following, including pagination and item details,
consider using the :setting:`DEPTH_LIMIT <scrapy:DEPTH_LIMIT>` setting instead.

Implemented by :class:`~zyte_spider_templates.TrackNavigationDepthSpiderMiddleware`.

.. setting:: MAX_REQUESTS_PER_SEED

MAX_REQUESTS_PER_SEED
=====================

.. tip:: When using the :ref:`article spider template <article>`, you may use
the
:attr:`~zyte_spider_templates.spiders.article.ArticleSpiderParams.max_requests_per_seed`
command-line parameter instead of this setting.

Default: ``0``

Limit the number of follow-up requests per initial URL to the specified amount.
Non-positive integers (i.e. 0 and below) impose no limit and disable this middleware.

The limit is the total limit for all direct and indirect follow-up requests
of each initial URL.

Implemented by
:class:`~zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware`.
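The per-seed budget described above can be pictured with a small counter. This
is a minimal sketch under stated assumptions, not the middleware's
implementation: ``SeedRequestBudget`` and its ``allow`` method are hypothetical
names, and the seed is assumed to be the value of the ``seed`` meta key.

```python
from collections import defaultdict


class SeedRequestBudget:
    """Count follow-up requests per seed (hypothetical helper)."""

    def __init__(self, max_requests_per_seed: int = 0):
        self.limit = max_requests_per_seed
        self.counts = defaultdict(int)

    def allow(self, seed: str) -> bool:
        # A non-positive limit means no limit at all.
        if self.limit <= 0:
            return True
        # Direct and indirect follow-ups share the same per-seed counter.
        if self.counts[seed] >= self.limit:
            return False
        self.counts[seed] += 1
        return True


budget = SeedRequestBudget(max_requests_per_seed=2)
decisions = [budget.allow("https://example.com") for _ in range(3)]
```

With a limit of 2, the third follow-up request for the same seed is rejected,
while requests for a different seed still have their own budget.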

.. setting:: OFFSITE_REQUESTS_PER_SEED_ENABLED

OFFSITE_REQUESTS_PER_SEED_ENABLED
=================================

Default: ``True``

Setting this value to ``True`` enables the
:class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware` while ``False``
completely disables it.

The middleware ensures that *most* requests belong to the domains of the seed
URLs. It allows an offsite request only if that request was obtained from a
response that belongs to the domain of a seed URL. Any further requests
obtained from a response in a domain outside of the seed URLs are not
allowed.

This prevents the spider from completely crawling other domains while ensuring
that aggregator websites *(e.g. a news website with articles from other domains)*
are supported, as it can access pages from other domains.

With the middleware disabled, offsite requests are no longer filtered, which
can lead to other domains being crawled completely unless
``allowed_domains`` is set in the spider.

.. note::

If a seed URL gets redirected to a different domain, both the domain from
the original request and the domain from the redirected response will be
used as references.

If the seed URL is ``https://books.toscrape.com``, all subsequent requests to
``books.toscrape.com`` and its subdomains are allowed, but requests to
``toscrape.com`` are not. Conversely, if the seed URL is ``https://toscrape.com``,
requests to both ``toscrape.com`` and ``books.toscrape.com`` are allowed.
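The domain rule in this example can be sketched with the standard library. This
is an illustration only, assuming a plain suffix match on the host name;
``is_within_seed_domains`` is a hypothetical helper, not a function of
zyte-spider-templates.

```python
from urllib.parse import urlparse


def is_within_seed_domains(url: str, seed_domains: list) -> bool:
    """Return True if the URL host is a seed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in seed_domains)


# Seed https://books.toscrape.com: the parent domain is off-site.
books_only = ["books.toscrape.com"]
# Seed https://toscrape.com: subdomains are on-site as well.
whole_site = ["toscrape.com"]
```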

.. setting:: ONLY_FEEDS_ENABLED

ONLY_FEEDS_ENABLED
==================

.. note::

Only works for the :ref:`article spider template <article>`.

Default: ``False``

Whether to extract links from Atom and RSS news feeds only (``True``) or
to also use extracted links from ``ArticleNavigation.subCategories`` (``False``).

Implemented by :class:`~zyte_spider_templates.OnlyFeedsMiddleware`.
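The choice between the two modes can be pictured as below. This is a simplified
sketch, assuming feed links and other subcategory links are already available
as two separate lists; ``links_to_follow`` is a hypothetical helper and does
not reflect how the middleware is actually structured.

```python
def links_to_follow(feed_links, subcategory_links, only_feeds_enabled=False):
    """Pick the links to crawl depending on the ONLY_FEEDS_ENABLED value."""
    if only_feeds_enabled:
        # Feeds only: ignore links found through regular navigation.
        return list(feed_links)
    # Default: use feed links and regular subcategory links together.
    return list(feed_links) + list(subcategory_links)


feeds = ["https://example.com/rss.xml"]
subcategories = ["https://example.com/world", "https://example.com/sports"]
```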

.. setting:: INCREMENTAL_CRAWL_BATCH_SIZE

INCREMENTAL_CRAWL_BATCH_SIZE
============================

Default: ``50``

The maximum number of seen URLs to read from or write to the corresponding
:ref:`Zyte Scrapy Cloud collection <api-collections>` per request during an incremental
crawl (see :setting:`INCREMENTAL_CRAWL_ENABLED`).

This setting determines the batch size for interactions with the collection.
If a response contains more URLs than the batch size (50 by default), they are
split into batches of at most that size. If it contains fewer, all URLs are
handled in a single request to the collection.
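The batching itself amounts to slicing a list. A minimal sketch, assuming the
URLs are held in a list; ``split_into_batches`` is a hypothetical helper name
and the URLs are placeholders.

```python
def split_into_batches(urls, batch_size=50):
    """Split a list of URLs into consecutive batches of at most batch_size."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# 120 URLs with the default batch size give batches of 50, 50 and 20.
urls = [f"https://example.com/article/{n}" for n in range(120)]
batch_sizes = [len(batch) for batch in split_into_batches(urls)]
```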

Adjusting this value can optimize the performance of a crawl by balancing the number
of requests sent to the Collection with processing efficiency.

.. note::

Setting it too high (e.g. above 100) causes issues due to the resulting query length.
Setting it too low (e.g. below 10) removes the benefit of batching.

Implemented by :class:`~zyte_spider_templates.IncrementalCrawlMiddleware`.


.. setting:: INCREMENTAL_CRAWL_COLLECTION_NAME

INCREMENTAL_CRAWL_COLLECTION_NAME
=================================

.. note::

:ref:`Virtual spiders <virtual-spiders>` are spiders based on :ref:`spider templates <spider-templates>`.
The explanation of ``INCREMENTAL_CRAWL_COLLECTION_NAME`` below applies to both types of spiders.

.. tip:: When using the :ref:`article spider template <article>`, you may use
the
:attr:`~zyte_spider_templates.spiders.article.ArticleSpiderParams.incremental_collection_name`
command-line parameter instead of this setting.

.. note::
Only ASCII alphanumeric characters and underscores are allowed.

Default: ``<The current spider's name>_incremental``.
The current spider's name is the virtual spider's name if the spider is a
virtual spider; otherwise, it is :data:`Spider.name <scrapy.Spider.name>`.

Name of the :ref:`Zyte Scrapy Cloud collection <api-collections>` used during
an incremental crawl (see :setting:`INCREMENTAL_CRAWL_ENABLED`).

By default, a collection named after the spider is used, meaning that matching URLs from
previous runs of the same spider are skipped, provided those previous runs had
the :setting:`INCREMENTAL_CRAWL_ENABLED` setting set to ``True`` or the spider
argument ``incremental`` set to ``true``.

Using a different collection name makes sense, for example, in the following cases:

- Different spiders share a collection.
- The same spider uses different collections (e.g., for development runs vs. production runs).

Implemented by :class:`~zyte_spider_templates.IncrementalCrawlMiddleware`.
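The default name and the character restriction can be checked with a few lines
of Python. This is a sketch based only on the rules stated above; both
functions are hypothetical helpers, not part of the library.

```python
import re

# Only ASCII letters, digits and underscores are allowed.
_VALID_COLLECTION_NAME = re.compile(r"[A-Za-z0-9_]+")


def default_collection_name(spider_name: str) -> str:
    """Build the default collection name from the spider name."""
    return f"{spider_name}_incremental"


def is_valid_collection_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return _VALID_COLLECTION_NAME.fullmatch(name) is not None
```

For example, a name containing a hyphen is rejected, while a name made of
letters, digits and underscores is accepted.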


.. setting:: INCREMENTAL_CRAWL_ENABLED

INCREMENTAL_CRAWL_ENABLED
=========================

.. tip:: When using the :ref:`article spider template <article>`, you may use
the
:attr:`~zyte_spider_templates.spiders.article.ArticleSpiderParams.incremental`
command-line parameter instead of this setting.

Default: ``False``

If set to ``True``, items seen in previous crawls with the same
:setting:`INCREMENTAL_CRAWL_COLLECTION_NAME` value are skipped.

Implemented by :class:`~zyte_spider_templates.IncrementalCrawlMiddleware`.
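As an illustration, the incremental-crawl settings could be combined in a
project's ``settings.py`` as shown below. The collection name is a made-up
example value, and the batch size simply repeats the documented default.

```python
# settings.py (illustrative values only)

# Skip items already seen in previous crawls that used the same collection.
INCREMENTAL_CRAWL_ENABLED = True

# Hypothetical collection name for development runs.
INCREMENTAL_CRAWL_COLLECTION_NAME = "news_dev_incremental"

# Documented default; shown here only to make the batch size explicit.
INCREMENTAL_CRAWL_BATCH_SIZE = 50
```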