Merge pull request #91 from zytedata/articles_to_main
Article spider template
kmike authored Dec 16, 2024
2 parents 552223e + bacdae4 commit 262e603
Showing 36 changed files with 5,986 additions and 63 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
Original file line number Diff line number Diff line change
@@ -70,6 +70,7 @@
autodoc_pydantic_model_show_validator_members = False
autodoc_pydantic_model_show_validator_summary = False
autodoc_pydantic_field_list_validators = False
autodoc_pydantic_field_show_constraints = False

# sphinx-reredirects
redirects = {
10 changes: 9 additions & 1 deletion docs/index.rst
@@ -18,6 +18,7 @@ zyte-spider-templates documentation

templates/index
E-commerce <templates/e-commerce>
Article <templates/article>
Google search <templates/google-search>

.. toctree::
@@ -34,9 +35,16 @@ zyte-spider-templates documentation
customization/spiders
customization/pages

.. toctree::
:caption: Reference
:hidden:

reference/settings
reference/reqmeta
reference/api

.. toctree::
:caption: All the rest
:hidden:

reference/index
changes
31 changes: 28 additions & 3 deletions docs/reference/index.rst → docs/reference/api.rst
@@ -1,10 +1,12 @@
=========
Reference
=========
===
API
===

Spiders
=======

.. autoclass:: zyte_spider_templates.ArticleSpider

.. autoclass:: zyte_spider_templates.BaseSpider

.. autoclass:: zyte_spider_templates.EcommerceSpider
@@ -15,6 +17,10 @@ Spiders
Pages
=====

.. autoclass:: zyte_spider_templates.pages.DefaultSearchRequestTemplatePage

.. autoclass:: zyte_spider_templates.pages.HeuristicsArticleNavigationPage

.. autoclass:: zyte_spider_templates.pages.HeuristicsProductNavigationPage


@@ -59,3 +65,22 @@ Parameter mixins

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpMaxPagesParam
:exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.spiders.article.ArticleCrawlStrategyParam
:exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.article.ArticleCrawlStrategy


.. _middlewares:

Middlewares
===========

.. autoclass:: zyte_spider_templates.CrawlingLogsMiddleware
.. autoclass:: zyte_spider_templates.TrackNavigationDepthSpiderMiddleware
.. autoclass:: zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware
.. autoclass:: zyte_spider_templates.OffsiteRequestsPerSeedMiddleware
.. autoclass:: zyte_spider_templates.OnlyFeedsMiddleware
.. autoclass:: zyte_spider_templates.TrackSeedsSpiderMiddleware
.. autoclass:: zyte_spider_templates.IncrementalCrawlMiddleware
112 changes: 112 additions & 0 deletions docs/reference/reqmeta.rst
@@ -0,0 +1,112 @@
.. _meta:

=================
Request.meta keys
=================

Keys that can be defined in :attr:`Request.meta <scrapy.http.Request.meta>` for
zyte-spider-templates.

.. reqmeta:: seed

seed
====

Default: ``The seed URL (or value) from which the request originated.``

The key is used for :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware` and
:class:`~zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware`.

The ``seed`` meta key tracks and identifies the origin of a request. It is
initially set for each request that originates from the start request, and
can be used to manage domain constraints for subsequent requests. You can
also set this key to an arbitrary value to identify the seed source.

Here's an example:

.. code-block:: python

    meta = {
        "seed": "http://example.com",
    }

.. reqmeta:: is_seed_request

is_seed_request
===============

Default: ``False``

The key is used for :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware`.

The ``is_seed_request`` meta key is a boolean flag that identifies whether the
request is a start request (i.e., one that originates from an initial seed
URL). When set to ``True``, the middleware extracts seed domains from the
response.

Example:
::

    meta = {
        'is_seed_request': True,
    }

.. reqmeta:: seed_domains

seed_domains
============

Default: ``Initial URL and redirected URLs``

The key is used for :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware`.

The ``seed_domains`` meta key is a list of domains that the middleware uses to
check whether a request belongs to these domains. By default, this list
includes the domain of the initial URL and the domains of any redirected URLs
(if there was a redirection). You can also set this list in the spider to
specify additional domains for which the middleware should allow requests.

Here's an example:

.. code-block:: python

    meta = {"seed_domains": ["example.com", "another-example.com"]}

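As a rough illustration of the default described above, the following sketch
derives a seed-domain list from the initial URL and the URL it was redirected
to. The function name ``default_seed_domains`` is hypothetical, and using the
hostname unchanged is an assumption; the middleware may normalize domains
differently.

```python
from urllib.parse import urlparse


def default_seed_domains(request_url, response_url):
    """Collect the hostnames of the initial URL and of the final (redirected) URL."""
    domains = []
    for url in (request_url, response_url):
        host = urlparse(url).hostname  # None for URLs without a host
        if host and host not in domains:
            domains.append(host)
    return domains


# Hypothetical redirect from example.com to another-example.com.
seed_domains = default_seed_domains(
    "http://example.com", "https://another-example.com/home"
)
meta = {"seed_domains": seed_domains}
```

When there is no redirection, both URLs share a hostname and the list holds a
single domain.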
.. reqmeta:: increase_navigation_depth

increase_navigation_depth
=========================

Default: ``True``

The key is used for :class:`~zyte_spider_templates.TrackNavigationDepthSpiderMiddleware`.

The ``increase_navigation_depth`` meta key is a boolean flag that determines whether the
``navigation_depth`` of a request should be increased. By default, the middleware increases
``navigation_depth`` for all requests. Spiders can override this behavior for certain
types of requests, such as pagination or RSS feeds, by explicitly setting the meta key.

Example:
::

    meta = {
        'increase_navigation_depth': False,
    }

.. reqmeta:: only_feeds

only_feeds
==========

Default: ``False``

The key is used for :class:`~zyte_spider_templates.OnlyFeedsMiddleware`.

The ``only_feeds`` meta key is a boolean flag that determines whether the
spider discovers all links on the website (``False``) or extracts links from
RSS/Atom feeds only (``True``).

Example:
::

    meta = {
        'page_params': {'only_feeds': True}
    }

196 changes: 196 additions & 0 deletions docs/reference/settings.rst
@@ -0,0 +1,196 @@
.. _settings:

========
Settings
========

.. setting:: NAVIGATION_DEPTH_LIMIT

NAVIGATION_DEPTH_LIMIT
======================

Default: ``0``

The maximum navigation depth to crawl. If ``0``, no limit is imposed.

We increase *navigation_depth* for requests navigating to a subcategory originating from
its parent category, including a request targeting a category starting at the website home page.
We don't increase *navigation_depth* for requests accessing item details (e.g., an article) or for
additional pages of a visited webpage. For example, if you set ``NAVIGATION_DEPTH_LIMIT`` to ``1``,
only item details and pagination links from your start URLs are followed.

.. note::
    Currently, only the :ref:`Article spider template <article>` implements proper
    navigation_depth support. Other spider templates treat all follow-up requests as
    increasing navigation_depth.

Setting a navigation_depth limit can prevent a spider from delving too deeply into
subcategories. This is especially useful if you only need data from the
top-level categories or specific subcategories.

When :ref:`customizing a spider template <custom-spiders>`, set the
:reqmeta:`increase_navigation_depth` request metadata key to override whether a request is
considered as increasing navigation depth (``True``) or not (``False``):

.. code-block:: python

    Request("https://example.com", meta={"increase_navigation_depth": False})

If you want to limit all link following, including pagination and item details,
consider using the :setting:`DEPTH_LIMIT <scrapy:DEPTH_LIMIT>` setting instead.

Implemented by :class:`~zyte_spider_templates.TrackNavigationDepthSpiderMiddleware`.
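The depth bookkeeping described in this section can be sketched as follows.
This is a simplified illustration, not the middleware's code: the helper name
``next_navigation_depth`` is hypothetical, and starting from a parent depth of
``0`` is an assumption.

```python
def next_navigation_depth(parent_depth, meta):
    """Return the depth of a follow-up request given its parent's depth."""
    # The key defaults to True: a request increases depth unless told otherwise.
    if meta.get("increase_navigation_depth", True):
        return parent_depth + 1
    return parent_depth


# Assumed starting point: the parent response is at depth 0.
subcategory_depth = next_navigation_depth(0, {})
# Pagination of that subcategory opts out of the increase.
pagination_depth = next_navigation_depth(
    subcategory_depth, {"increase_navigation_depth": False}
)
# A subcategory found on the paginated page increases depth again.
nested_depth = next_navigation_depth(pagination_depth, {})
```

Only requests that move into a new subcategory add to the depth, so
pagination and item requests do not use up the configured limit.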

.. setting:: MAX_REQUESTS_PER_SEED

MAX_REQUESTS_PER_SEED
=====================

.. tip:: When using the :ref:`article spider template <article>`, you may use
    the
    :attr:`~zyte_spider_templates.spiders.article.ArticleSpiderParams.max_requests_per_seed`
    command-line parameter instead of this setting.

Default: ``0``

Limit the number of follow-up requests per initial URL to the specified amount.
Non-positive integers (i.e. 0 and below) impose no limit and disable this middleware.

The limit is the total limit for all direct and indirect follow-up requests
of each initial URL.

Implemented by
:class:`~zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware`.
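To make the per-seed limit concrete, here is a minimal sketch of a counter
keyed by seed. The class ``SeedRequestBudget`` is hypothetical and only
mirrors the behavior described above (a shared budget for all direct and
indirect follow-up requests of a seed, with non-positive values meaning no
limit); the real middleware may count requests differently.

```python
from collections import defaultdict


class SeedRequestBudget:
    """Track how many follow-up requests each seed has used."""

    def __init__(self, max_requests_per_seed):
        self.limit = max_requests_per_seed
        self.counts = defaultdict(int)

    def allow(self, seed):
        """Return True and consume one unit if the seed still has budget."""
        if self.limit <= 0:
            return True  # non-positive values impose no limit
        if self.counts[seed] >= self.limit:
            return False
        self.counts[seed] += 1
        return True


budget = SeedRequestBudget(max_requests_per_seed=2)
decisions = [budget.allow("http://example.com") for _ in range(3)]
```

Each seed has its own counter, so exhausting the budget of one seed does not
affect requests that belong to another seed.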

.. setting:: OFFSITE_REQUESTS_PER_SEED_ENABLED

OFFSITE_REQUESTS_PER_SEED_ENABLED
=================================

Default: ``True``

Setting this value to ``True`` enables the
:class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware` while ``False``
completely disables it.

The middleware ensures that *most* requests belong to the domains of the
seed URLs. Offsite requests are allowed only if they were obtained from a
response that belongs to the domain of a seed URL. Requests obtained from a
response in any other domain are not allowed.

This prevents the spider from completely crawling other domains while ensuring
that aggregator websites *(e.g. a news website with articles from other domains)*
are supported, as it can access pages from other domains.

Disabling the middleware stops offsite requests from being filtered, which
may lead to other domains being crawled completely, unless
``allowed_domains`` is set in the spider.

.. note::

    If a seed URL gets redirected to a different domain, both the domain from
    the original request and the domain from the redirected response will be
    used as references.

If the seed URL is ``https://books.toscrape.com``, all subsequent requests to
``books.toscrape.com`` and its subdomains are allowed, but requests to
``toscrape.com`` are not. Conversely, if the seed URL is ``https://toscrape.com``,
requests to both ``toscrape.com`` and ``books.toscrape.com`` are allowed.
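The rules above can be approximated with the following sketch. The function
names ``in_seed_domains`` and ``should_follow`` are hypothetical, and the
suffix-based subdomain match is an assumption inferred from the
``toscrape.com`` example, not the middleware's actual implementation.

```python
from urllib.parse import urlparse


def in_seed_domains(url, seed_domains):
    """True if the URL's host is a seed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in seed_domains)


def should_follow(request_url, response_url, seed_domains):
    """Allow offsite requests only when they come from an onsite response."""
    if in_seed_domains(request_url, seed_domains):
        return True  # onsite requests are always allowed
    # Offsite requests are allowed only when found on an onsite page.
    return in_seed_domains(response_url, seed_domains)


seeds = ["books.toscrape.com"]
# An offsite link found on a seed-domain page is followed...
first_hop = should_follow("https://other.example/a", "https://books.toscrape.com/", seeds)
# ...but links found on that offsite page are not.
second_hop = should_follow("https://other.example/b", "https://other.example/a", seeds)
```

This is how an aggregator page can reach an article on another domain without
the crawl continuing across that domain.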

.. setting:: ONLY_FEEDS_ENABLED

ONLY_FEEDS_ENABLED
==================

.. note::

    Only works for the :ref:`article spider template <article>`.

Default: ``False``

Whether to extract links from Atom and RSS news feeds only (``True``) or
to also use extracted links from ``ArticleNavigation.subCategories`` (``False``).

Implemented by :class:`~zyte_spider_templates.OnlyFeedsMiddleware`.
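The effect of this setting on link selection can be pictured with a small
sketch. The function ``links_to_follow`` and its arguments are hypothetical;
they only illustrate the two modes described above and do not reflect how the
middleware is written.

```python
def links_to_follow(feed_links, subcategory_links, only_feeds=False):
    """Pick the navigation links to follow for one page."""
    if only_feeds:
        # Feed-only mode: ignore links extracted as subcategories.
        return list(feed_links)
    return list(feed_links) + list(subcategory_links)


feeds = ["https://example.com/rss.xml"]
subcategories = ["https://example.com/world", "https://example.com/sports"]
feed_only = links_to_follow(feeds, subcategories, only_feeds=True)
everything = links_to_follow(feeds, subcategories)
```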

.. setting:: INCREMENTAL_CRAWL_BATCH_SIZE

INCREMENTAL_CRAWL_BATCH_SIZE
============================

Default: ``50``

The maximum number of seen URLs to read from or write to the corresponding
:ref:`Zyte Scrapy Cloud collection <api-collections>` per request during an incremental
crawl (see :setting:`INCREMENTAL_CRAWL_ENABLED`).

This setting determines the batch size for interactions with the Collection.
If a response contains more URLs than the batch size (50 by default), they are
split into batches of at most that size. If it contains fewer URLs, all of
them are handled in a single request to the Collection.

Adjusting this value can optimize the performance of a crawl by balancing the number
of requests sent to the Collection with processing efficiency.

.. note::

    Setting it too large (e.g. > 100) will cause issues due to the large query length.
    Setting it too small (less than 10) will remove the benefit of using a batch.

Implemented by :class:`~zyte_spider_templates.IncrementalCrawlMiddleware`.
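The batching behavior described above amounts to slicing the URL list into
chunks. The helper below is a generic sketch with a hypothetical name
(``split_into_batches``); it shows the arithmetic only and does not talk to a
collection.

```python
def split_into_batches(urls, batch_size=50):
    """Yield consecutive slices of at most batch_size URLs."""
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


# 120 hypothetical URLs extracted from a single response.
urls = [f"https://example.com/article/{n}" for n in range(120)]
batch_sizes = [len(batch) for batch in split_into_batches(urls)]
```

With the default of ``50``, these 120 URLs need three requests to the
collection, while 30 URLs would need only one.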


.. setting:: INCREMENTAL_CRAWL_COLLECTION_NAME

INCREMENTAL_CRAWL_COLLECTION_NAME
=================================

.. note::

    :ref:`Virtual spiders <virtual-spiders>` are spiders based on :ref:`spider templates <spider-templates>`.
    The explanation of ``INCREMENTAL_CRAWL_COLLECTION_NAME`` below applies to both types of spiders.

.. tip:: When using the :ref:`article spider template <article>`, you may use
    the
    :attr:`~zyte_spider_templates.spiders.article.ArticleSpiderParams.incremental_collection_name`
    command-line parameter instead of this setting.

.. note::
    Only ASCII alphanumeric characters and underscores are allowed.

Default: ``<the current spider's name>_incremental``.
For a virtual spider, the current spider's name is the name of the virtual spider;
otherwise, it is :data:`Spider.name <scrapy.Spider.name>`.

Name of the :ref:`Zyte Scrapy Cloud collection <api-collections>` used during
an incremental crawl (see :setting:`INCREMENTAL_CRAWL_ENABLED`).

By default, a collection named after the spider is used, meaning that matching URLs from
previous runs of the same spider are skipped, provided those previous runs had
the :setting:`INCREMENTAL_CRAWL_ENABLED` setting set to ``True`` or the spider
argument `incremental` set to `true`.

Using a different collection name makes sense, for example, in the following cases:

- Different spiders share a collection.
- The same spider uses different collections (e.g., for development runs vs. production runs).

Implemented by :class:`~zyte_spider_templates.IncrementalCrawlMiddleware`.
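The default name and the character restriction can be sketched as below. The
function ``resolve_collection_name`` is hypothetical, and raising
``ValueError`` for invalid names is an assumption made for illustration; how
the middleware treats invalid names is not stated here.

```python
import re

# Only ASCII letters, digits and underscores are allowed.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def resolve_collection_name(spider_name, configured_name=None):
    """Return the configured name, or the default derived from the spider name."""
    name = configured_name or f"{spider_name}_incremental"
    if not VALID_NAME.fullmatch(name):
        raise ValueError(f"Invalid collection name: {name!r}")
    return name


default_name = resolve_collection_name("article")
dev_name = resolve_collection_name("article", "article_dev")
```

Passing a separate name such as ``article_dev`` keeps development runs from
marking URLs as seen for production runs.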


.. setting:: INCREMENTAL_CRAWL_ENABLED

INCREMENTAL_CRAWL_ENABLED
=========================

.. tip:: When using the :ref:`article spider template <article>`, you may use
    the
    :attr:`~zyte_spider_templates.spiders.article.ArticleSpiderParams.incremental`
    command-line parameter instead of this setting.

Default: ``False``

If set to ``True``, items seen in previous crawls with the same
:setting:`INCREMENTAL_CRAWL_COLLECTION_NAME` value are skipped.

Implemented by :class:`~zyte_spider_templates.IncrementalCrawlMiddleware`.
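Conceptually, an incremental crawl filters out URLs recorded by earlier runs.
In the sketch below, a plain Python set stands in for the Zyte Scrapy Cloud
collection, and the function ``filter_unseen`` is hypothetical; the real
middleware reads from and writes to the collection instead.

```python
def filter_unseen(urls, seen):
    """Return the URLs not seen before and record them as seen."""
    fresh = [url for url in urls if url not in seen]
    seen.update(fresh)
    return fresh


seen_urls = set()  # stand-in for the shared collection
first_run = filter_unseen(
    ["https://example.com/a", "https://example.com/b"], seen_urls
)
second_run = filter_unseen(
    ["https://example.com/b", "https://example.com/c"], seen_urls
)
```

The second run only returns the URL that the first run did not record.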