zyte-spider-templates-project

This is a starting template for a Scrapy project, with built-in integration with Zyte technologies (scrapy-zyte-api, zyte-spider-templates).

Requirements

Requires Python 3.9+.

For Zyte API features, including AI-powered spiders, a Zyte API subscription is also required.

First steps

After you clone this repository, follow these steps to make it yours:

  1. Rename the zyte_spider_templates_project folder to a valid Python module name that you would like to use as your project ID, and update scrapy.cfg and <project ID>/settings.py (BOT_NAME, SPIDER_MODULES, NEWSPIDER_MODULE and SCRAPY_POET_DISCOVER settings) accordingly.

  2. For local development, assign your Zyte API key to the ZYTE_API_KEY environment variable, for example, using direnv.

    Note

    Scrapy Cloud automatically provides a Zyte API key to its jobs if you have a subscription.

  3. Remove or replace the LICENSE and README.rst files.

  4. Delete .git, and start a fresh Git repository:

    git init
    git add -A
    git commit -m "Initial commit"
    
  5. Create a Python virtual environment and install requirements.txt into it:

    python3 -m venv venv
    . venv/bin/activate
    pip install -r requirements.txt
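
As a rough illustration of step 1, the settings it names could look like the sketch below after the rename. The project ID `my_project` and the helper `is_valid_module_name` are hypothetical, and the module paths assume the `<project ID>/spiders/` and `<project ID>/pages/` layout described under Development; check the values already present in your `settings.py` rather than copying these blindly.

```python
import keyword

# Hypothetical project ID; replace with the name you gave the renamed folder.
PROJECT_ID = "my_project"


def is_valid_module_name(name: str) -> bool:
    """Return True if name is an identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


if not is_valid_module_name(PROJECT_ID):
    raise ValueError(f"{PROJECT_ID!r} is not a valid Python module name")

# The four settings named in step 1, derived from the project ID.
# The ".spiders" and ".pages" suffixes are an assumption based on the
# folder layout described in the Development section.
BOT_NAME = PROJECT_ID
SPIDER_MODULES = [f"{PROJECT_ID}.spiders"]
NEWSPIDER_MODULE = f"{PROJECT_ID}.spiders"
SCRAPY_POET_DISCOVER = [f"{PROJECT_ID}.pages"]
```

A name such as `my-project` fails the check because of the hyphen, which is why the folder name has to be a valid Python module name.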
    

Usage

This is an already created and configured Scrapy project, so when you follow guides like the Scrapy Cloud tutorial, you can skip most of the parts about creating and configuring a project. You still need some configuration specific to your account. Here is a short guide to using this project on Scrapy Cloud.

  1. Create a Scrapy Cloud project on the Zyte dashboard if you don't have one yet.
  2. Make sure you have a Zyte API subscription. For Scrapy Cloud runs, the API key is used automatically; for local runs, you need to set a setting or an environment variable, as described in the first steps above.
  3. Run shub login and enter your Scrapy Cloud API key.
  4. Deploy your project with shub deploy 000000, replacing 000000 with your Scrapy Cloud project ID (found in the project dashboard URL). Alternatively, put the project ID into the scrapinghub.yml file so that you can run just shub deploy.
  5. Now you should be able to create smart spiders on your Scrapy Cloud project using the templates from this project.
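
For local runs, a quick check like the following sketch can confirm that the ZYTE_API_KEY environment variable mentioned in step 2 is visible to Python before you start a crawl. The helper name `zyte_api_key_from_env` is hypothetical, and the check only covers the environment variable route; if you provide the key through a setting instead, it does not apply.

```python
import os


def zyte_api_key_from_env(environ=os.environ):
    """Return the ZYTE_API_KEY value, or None if it is unset or blank."""
    key = environ.get("ZYTE_API_KEY", "").strip()
    return key or None


if __name__ == "__main__":
    # Report only whether the key is present; never print the key itself.
    if zyte_api_key_from_env() is None:
        print("ZYTE_API_KEY is not set in this environment.")
    else:
        print("ZYTE_API_KEY is set.")
```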

For more information and more verbose descriptions of specific steps you can check:

You can also run the spiders locally, for example:

scrapy crawl ecommerce -a url="https://books.toscrape.com/" -o output.jsonl
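
The command above writes its results to output.jsonl, with one JSON object per line. A minimal sketch for loading that file back into Python follows; the helper name `load_items` is hypothetical, and no particular item fields are assumed.

```python
import json
from pathlib import Path


def load_items(path):
    """Parse a JSON Lines file into a list, one item per non-blank line."""
    items = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

After a local crawl, `load_items("output.jsonl")` returns the scraped items as a list of dictionaries.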

Development

By default, all spiders and page objects defined in zyte-spider-templates are available in this project. You can also:

  • Subclass spiders from zyte-spider-templates or write spiders from scratch.

    Define your spiders in Python files and modules within <project ID>/spiders/.

  • Use web-poet and scrapy-poet to modify the parsing behavior of spiders, for all, some, or specific websites.

    Define your page objects in Python files and modules within <project ID>/pages/.