Article spider template #91

Merged
merged 26 commits into from
Dec 16, 2024
17d9f95
remove old file
PyExplorer Dec 6, 2024
c3fa843
merge from article repo
PyExplorer Dec 6, 2024
d5b8adc
update api.rst
PyExplorer Dec 6, 2024
98b4e9c
add DefaultSearchRequestTemplatePage to pages api.rst
PyExplorer Dec 6, 2024
e04c7b3
add tests for coverage utils.py
PyExplorer Dec 8, 2024
2623342
add tests for coverage PageParamsMiddlewareBase
PyExplorer Dec 8, 2024
1045617
add tests for OnlyFeedsMiddleware and PageParamsMiddlewareBase
PyExplorer Dec 9, 2024
124e123
add tests for DummyDupeFilter
PyExplorer Dec 9, 2024
80c7022
add from_crawler for TrackSeedsSpiderMiddleware test
PyExplorer Dec 9, 2024
492a7be
add from_crawler for PageParamsMiddlewareBase test
PyExplorer Dec 9, 2024
758ec6f
add tests for DupeFilterSpiderMiddleware + tune others
PyExplorer Dec 9, 2024
61fab4e
fix async test for DupeFilterSpiderMiddleware
PyExplorer Dec 9, 2024
b24de1e
add test for from_crawler for TrackSeedsSpiderMiddleware, MaxRequests…
PyExplorer Dec 9, 2024
8ad76ce
add test for TrackNavigationDepthSpiderMiddleware
PyExplorer Dec 9, 2024
728f84e
formatting
PyExplorer Dec 9, 2024
970743b
fix auth issue on staging
PyExplorer Dec 12, 2024
951bcd7
test
PyExplorer Dec 12, 2024
68371f2
clean todo
PyExplorer Dec 12, 2024
769bcd0
Only enable DropLowProbabilityItemPipeline for the articles spider.
wRAR Dec 13, 2024
8927379
Merge pull request #98 from zytedata/articles-probability-pipeline
wRAR Dec 13, 2024
ad6a58c
add validation for incremental_collection_name
kmike Dec 13, 2024
0a6cad3
fix formatting
kmike Dec 13, 2024
80aa641
Merge pull request #100 from zytedata/fix-collection-name-validation
kmike Dec 13, 2024
9f2ef4b
Mark Articles spider as experimental
kmike Dec 13, 2024
96376a5
Merge pull request #101 from zytedata/mark-articles-as-experimental
kmike Dec 13, 2024
bacdae4
Merge branch 'main' into articles_to_main
kmike Dec 16, 2024
1 change: 1 addition & 0 deletions docs/conf.py
@@ -70,6 +70,7 @@
autodoc_pydantic_model_show_validator_members = False
autodoc_pydantic_model_show_validator_summary = False
autodoc_pydantic_field_list_validators = False
autodoc_pydantic_field_show_constraints = False

# sphinx-reredirects
redirects = {
10 changes: 9 additions & 1 deletion docs/index.rst
@@ -18,6 +18,7 @@ zyte-spider-templates documentation

   templates/index
   E-commerce <templates/e-commerce>
   Article <templates/article>
   Google search <templates/google-search>

.. toctree::
@@ -34,9 +35,16 @@ zyte-spider-templates documentation
   customization/spiders
   customization/pages

.. toctree::
   :caption: Reference
   :hidden:

   reference/settings
   reference/reqmeta
   reference/api

.. toctree::
   :caption: All the rest
   :hidden:

   reference/index
   changes
80 changes: 80 additions & 0 deletions docs/reference/api.rst
@@ -0,0 +1,80 @@
===
API
===

Spiders
=======

.. autoclass:: zyte_spider_templates.BaseSpider

.. autoclass:: zyte_spider_templates.EcommerceSpider

.. autoclass:: zyte_spider_templates.GoogleSearchSpider


Pages
=====

.. autoclass:: zyte_spider_templates.pages.HeuristicsProductNavigationPage


.. _parameter-mixins:

Parameter mixins
================

.. autopydantic_model:: zyte_spider_templates.params.CustomAttrsInputParam
    :exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.params.CustomAttrsMethodParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.params.CustomAttrsMethod

.. autopydantic_model:: zyte_spider_templates.params.ExtractFromParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.params.ExtractFrom

.. autopydantic_model:: zyte_spider_templates.params.GeolocationParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.params.Geolocation

.. autopydantic_model:: zyte_spider_templates.params.MaxRequestsParam
    :exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.params.UrlParam
    :exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategyParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpItemTypeParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.serp.SerpItemType

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpMaxPagesParam
    :exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.spiders.article.ArticleCrawlStrategyParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.article.ArticleCrawlStrategy


.. _middlewares:

Middlewares
===========

.. autoclass:: zyte_spider_templates.CrawlingLogsMiddleware
.. autoclass:: zyte_spider_templates.TrackNavigationDepthSpiderMiddleware
.. autoclass:: zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware
.. autoclass:: zyte_spider_templates.OffsiteRequestsPerSeedMiddleware
.. autoclass:: zyte_spider_templates.OnlyFeedsMiddleware
.. autoclass:: zyte_spider_templates.TrackSeedsSpiderMiddleware
.. autoclass:: zyte_spider_templates.IncrementalCrawlMiddleware
61 changes: 0 additions & 61 deletions docs/reference/index.rst
@@ -1,61 +0,0 @@
=========
Reference
=========

Spiders
=======

.. autoclass:: zyte_spider_templates.BaseSpider

.. autoclass:: zyte_spider_templates.EcommerceSpider

.. autoclass:: zyte_spider_templates.GoogleSearchSpider


Pages
=====

.. autoclass:: zyte_spider_templates.pages.HeuristicsProductNavigationPage


.. _parameter-mixins:

Parameter mixins
================

.. autopydantic_model:: zyte_spider_templates.params.CustomAttrsInputParam
    :exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.params.CustomAttrsMethodParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.params.CustomAttrsMethod

.. autopydantic_model:: zyte_spider_templates.params.ExtractFromParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.params.ExtractFrom

.. autopydantic_model:: zyte_spider_templates.params.GeolocationParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.params.Geolocation

.. autopydantic_model:: zyte_spider_templates.params.MaxRequestsParam
    :exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.params.UrlParam
    :exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategyParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpItemTypeParam
    :exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.serp.SerpItemType

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpMaxPagesParam
    :exclude-members: model_computed_fields
112 changes: 112 additions & 0 deletions docs/reference/reqmeta.rst
@@ -0,0 +1,112 @@
.. _meta:

=================
Request.meta keys
=================

Keys that can be defined in :attr:`Request.meta <scrapy.http.Request.meta>` for
zyte-spider-templates.

.. reqmeta:: seed

seed
====

Default: ``The seed URL (or value) from which the request originated.``

This key is used by :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware` and
:class:`~zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware`.

The ``seed`` meta key tracks and identifies the origin of a request. It is
initially set on each request that originates from a start request, and can be
used to manage domain constraints for subsequent requests. Users can also set
this key to an arbitrary value to identify the seed source.

Here's an example:

.. code-block:: python

    meta = {
        "seed": "http://example.com",
    }
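
As a rough illustration of how a ``seed`` value can follow a crawl, the sketch
below copies it from a parent request's meta into the meta of a follow-up
request. ``inherit_seed`` is a hypothetical helper written for this example and
works on plain dictionaries; it is not part of zyte-spider-templates and does
not claim to mirror how the middlewares propagate the key.

```python
from typing import Optional


def inherit_seed(parent_meta: dict, child_meta: Optional[dict] = None) -> dict:
    """Return meta for a follow-up request, keeping the parent's seed.

    Hypothetical helper: a seed already present in ``child_meta`` wins,
    so a user-defined value is never overwritten.
    """
    meta = dict(child_meta or {})
    if "seed" in parent_meta:
        meta.setdefault("seed", parent_meta["seed"])
    return meta


parent = {"seed": "http://example.com"}
print(inherit_seed(parent))
print(inherit_seed(parent, {"seed": "custom-source"}))
```

The second call shows the case described above, where the user sets the key to
an arbitrary value of their own.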

.. reqmeta:: is_seed_request

is_seed_request
===============

Default: ``False``

This key is used by :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware`.

The ``is_seed_request`` meta key is a boolean flag that identifies whether the
request is a start request (i.e., one originating from an initial seed URL). When
set to ``True``, the middleware extracts seed domains from the response.

Example:
::

    meta = {
        'is_seed_request': True,
    }
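
To make the idea of "extracting seed domains from the response" more concrete,
here is a minimal sketch that collects host names from a final URL and any
redirect URLs when the flag is set. ``domains_from_urls`` and the URL list are
assumptions made for this example; the actual middleware may derive domains
differently.

```python
from urllib.parse import urlparse


def domains_from_urls(urls):
    """Collect unique host names from URLs, in first-seen order."""
    domains = []
    for url in urls:
        host = urlparse(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


meta = {"is_seed_request": True}
# Hypothetical URLs: the requested URL and the URL it redirected to.
visited = ["http://example.com/", "https://www.example.com/home"]

if meta.get("is_seed_request", False):
    meta["seed_domains"] = domains_from_urls(visited)

print(meta["seed_domains"])
```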

.. reqmeta:: seed_domains

seed_domains
============

Default: ``Initial URL and redirected URLs``

This key is used by :class:`~zyte_spider_templates.OffsiteRequestsPerSeedMiddleware`.

The ``seed_domains`` meta key is a list of domains that the middleware uses to
check whether a request belongs to one of them. By default, this list
includes the initial URL's domain and the domains of any redirected URLs (if there
was a redirection). Users can also set this list in the spider to
specify additional domains for which the middleware should allow requests.

Here's an example:

.. code-block:: python

    meta = {"seed_domains": ["example.com", "another-example.com"]}
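
The kind of check described above can be sketched as follows.
``is_within_seed_domains`` is a hypothetical function, and treating subdomains
as belonging to a listed domain is an assumption of this sketch rather than
documented middleware behavior.

```python
from urllib.parse import urlparse


def is_within_seed_domains(url, seed_domains):
    """Return True if the URL's host equals, or is a subdomain of, a seed domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in seed_domains
    )


meta = {"seed_domains": ["example.com", "another-example.com"]}

for url in (
    "https://blog.example.com/post",
    "https://another-example.com/about",
    "https://notexample.com/",
):
    print(url, is_within_seed_domains(url, meta["seed_domains"]))
```

Matching on ``"." + domain`` rather than a bare suffix keeps a host such as
``notexample.com`` from being mistaken for ``example.com``.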

.. reqmeta:: increase_navigation_depth

increase_navigation_depth
=========================

Default: ``True``

This key is used by :class:`~zyte_spider_templates.TrackNavigationDepthSpiderMiddleware`.

The ``increase_navigation_depth`` meta key is a boolean flag that determines whether
``navigation_depth`` should be increased for a request. By default, the middleware increases
``navigation_depth`` for all requests. Specific spiders can override this behavior for certain
types of requests, such as pagination or RSS feeds, by explicitly setting the meta key.

Example:
::

    meta = {
        'increase_navigation_depth': False,
    }
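
The effect of the flag on depth tracking can be pictured with a small sketch.
``next_navigation_depth`` is a hypothetical function that only models the
default-``True`` behavior described above; it is not the middleware's code.

```python
def next_navigation_depth(parent_depth, meta):
    """Return the depth for a follow-up request given its meta.

    The depth grows by one unless the request opts out by setting
    ``increase_navigation_depth`` to False.
    """
    if meta.get("increase_navigation_depth", True):
        return parent_depth + 1
    return parent_depth


# A regular link found on a page at depth 2.
print(next_navigation_depth(2, {}))
# A pagination link that should stay at the same depth.
print(next_navigation_depth(2, {"increase_navigation_depth": False}))
```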

.. reqmeta:: only_feeds

only_feeds
==========

Default: ``False``

This key is used by :class:`~zyte_spider_templates.OnlyFeedsMiddleware`.

The ``only_feeds`` meta key is a boolean flag that determines whether the
spider should discover all links on the website or extract links from RSS/Atom feeds only.
It is set inside the ``page_params`` meta key, as the example shows.

Example:
::

    meta = {
        'page_params': {'only_feeds': True}
    }
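
Because the example nests ``only_feeds`` under ``page_params``, code that reads
the flag has to look one level down and fall back to the default of ``False``.
A minimal sketch, with ``only_feeds_enabled`` as a hypothetical helper name:

```python
def only_feeds_enabled(meta):
    """Read ``only_feeds`` from ``meta['page_params']``, defaulting to False."""
    page_params = meta.get("page_params") or {}
    return bool(page_params.get("only_feeds", False))


print(only_feeds_enabled({"page_params": {"only_feeds": True}}))
print(only_feeds_enabled({}))
```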
