Commit

Merge pull request #100 from zytedata/fix-collection-name-validation
Improve collection name validation
kmike authored Dec 13, 2024
2 parents 8927379 + 0a6cad3 commit 80aa641
Showing 3 changed files with 11 additions and 3 deletions.
2 changes: 2 additions & 0 deletions docs/reference/settings.rst
@@ -156,6 +156,8 @@ INCREMENTAL_CRAWL_COLLECTION_NAME
 :attr:`~zyte_spider_templates.spiders.article.ArticleSpiderParams.incremental_collection_name`
 command-line parameter instead of this setting.

+.. note::
+Only ASCII alphanumeric characters and underscores are allowed.

 Default: `<The current spider's name>_incremental`.
 The current spider's name here will be virtual spider's name, if it's a virtual spider;
8 changes: 6 additions & 2 deletions tests/test_article.py
@@ -313,15 +313,19 @@ def test_metadata():
 "type": "boolean",
 },
 "incremental_collection_name": {
-"anyOf": [{"type": "string"}, {"type": "null"}],
+"anyOf": [
+{"type": "string", "pattern": "^[a-zA-Z0-9_]+$"},
+{"type": "null"},
+],
 "default": None,
 "description": "Name of the Zyte Scrapy Cloud Collection used during an incremental crawl."
 "By default, a Collection named after the spider (or virtual spider) is used, "
 "meaning that matching URLs from previous runs of the same spider are skipped, "
 "provided those previous runs had `incremental` argument set to `true`."
 "Using a different collection name makes sense, for example, in the following cases:"
 "- different spiders share a collection."
-"- the same spider uses different collections (e.g., for development runs vs. production runs).",
+"- the same spider uses different collections (e.g., for development runs vs. production runs). "
+"Only ASCII alphanumeric characters and underscores are allowed in the collection name.",
 "title": "Incremental Collection Name",
 },
 "crawl_strategy": {
4 changes: 3 additions & 1 deletion zyte_spider_templates/spiders/article.py
@@ -95,9 +95,11 @@ class IncrementalParam(BaseModel):
 "provided those previous runs had `incremental` argument set to `true`."
 "Using a different collection name makes sense, for example, in the following cases:"
 "- different spiders share a collection."
-"- the same spider uses different collections (e.g., for development runs vs. production runs)."
+"- the same spider uses different collections (e.g., for development runs vs. production runs). "
+"Only ASCII alphanumeric characters and underscores are allowed in the collection name."
 ),
 default=None,
+pattern="^[a-zA-Z0-9_]+$",
 )



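For readers who want to check a collection name before passing it as `incremental_collection_name`, the rule introduced by this commit can be approximated with the standard library alone. This is a minimal sketch, not code from the repository: `is_valid_collection_name` and `COLLECTION_NAME_RE` are hypothetical names, and the sketch assumes that `None` (the parameter's default) should be accepted because the schema in the test above allows either a matching string or null.

```python
import re
from typing import Optional

# Same character class as the pattern added in this commit
# ("^[a-zA-Z0-9_]+$"); the anchors are dropped because fullmatch()
# already requires the whole string to match.
COLLECTION_NAME_RE = re.compile(r"[a-zA-Z0-9_]+")


def is_valid_collection_name(name: Optional[str]) -> bool:
    """Hypothetical helper mirroring the anyOf [pattern string, null] schema."""
    if name is None:
        # None is the default and means "use the default collection name".
        return True
    # fullmatch() is used instead of match() with "$" because in Python's
    # re module "$" also matches just before a trailing newline.
    return COLLECTION_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["articles_dev", "articles-dev", "", "статьи", None]:
        print(repr(candidate), is_valid_collection_name(candidate))
```

The explicit ranges `a-zA-Z0-9_` are what keep the check ASCII-only; a shorthand such as `\w` would also accept non-ASCII letters when applied to Python `str` values.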