Implements a Scrapy crawler for the Google News page and returns news information.
pip install -r requirements.txt
tutorial/ # project name
+--scrapy.cfg # deploy configuration file
+--config.py # config for crawler_dev.py
+--crawler_dev.py # lib for parsing news websites
+--requirements.txt # packages to install
+--tutorial/ # project's Python module; you'll import your code from here
/ +--__init__.py
/ +--items.py # project items definition file
/ +--middlewares.py # project middlewares file
/ +--pipelines.py # project pipelines file
/ +--settings.py # project settings file
/ +--spiders/ # a directory where you put your spiders
/ / +--__init__.py
/ / +--crawler.py # defines your spider crawler
Note that `config.py` and `crawler_dev.py` are libraries we define ourselves; they are not Scrapy modules. All we need to do is define what the spider does.
`cd` into `tutorial/` and run:
scrapy crawl GoogleNews
If you need to output a JSON file, run:
scrapy crawl GoogleNews -o google.json
The class `GoogleNewsCrawler` is where we define what the spider does. `GoogleNews` is the spider's name, and `start_urls` holds the site we want to crawl; here it points to the Google News business page. Scrapy requests each URL in `start_urls` and sends the response to the callback function `parse`. You can change the URL to any site you want.
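As a rough sketch, `spiders/crawler.py` has this shape (the URL below is a placeholder, not the project's actual start URL, and the parsing logic is filled in later):

```python
import scrapy


class GoogleNewsCrawler(scrapy.Spider):
    # Spider name used on the command line: scrapy crawl GoogleNews
    name = "GoogleNews"
    # Placeholder; the project points this at the Google News business page
    start_urls = ["https://news.google.com/"]

    def parse(self, response):
        # Scrapy downloads each URL in start_urls and calls this
        # callback with the response
        pass
```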
`parse` reads the response body with BeautifulSoup and filters out the title and URL of every news item. Scrapy also provides its own selector methods for parsing responses, or you can import `lxml` or any other library you prefer.
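For example, the extraction could look like the following; the CSS selector here is hypothetical, since the real one depends on the Google News markup at crawl time:

```python
from bs4 import BeautifulSoup

def parse(self, response):
    soup = BeautifulSoup(response.text, "html.parser")
    # Hypothetical selector: inspect the page to find the tag/class
    # that actually wraps each headline
    for link in soup.select("h3 a"):
        title = link.get_text(strip=True)
        # Hrefs on Google News are usually relative, so resolve them
        url = response.urljoin(link.get("href"))
        ...
```

The Scrapy-native equivalent of the selector above would be `response.css("h3 a::attr(href)").getall()`.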
Each news URL we collect is then yielded as a new request whose callback is `parse_detail`. That function calls `parse` from `crawler_dev` to parse the website at that URL, then yields the result as an item so the information we scraped gets stored.
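A sketch of how the two callbacks could fit together; `extract_news`, `NewsItem`, and the return value of `crawler_dev.parse` are all assumptions for illustration:

```python
import scrapy
import crawler_dev
from tutorial.items import NewsItem  # hypothetical item class from items.py

def parse(self, response):
    # extract_news is a hypothetical helper wrapping the BeautifulSoup
    # extraction shown earlier
    for title, url in self.extract_news(response):
        yield scrapy.Request(url, callback=self.parse_detail,
                             cb_kwargs={"title": title})

def parse_detail(self, response, title):
    # crawler_dev.parse is our own library function; assume it returns
    # the article text for a given URL
    content = crawler_dev.parse(response.url)
    item = NewsItem()
    item["title"] = title
    item["url"] = response.url
    item["content"] = content
    yield item
```

Both functions are methods of the `GoogleNewsCrawler` class shown earlier.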
You can also redefine the pipeline module to store the information as JSON or in a database.
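For instance, a minimal custom pipeline in `pipelines.py` could append each item to a JSON-lines file (the filename is arbitrary):

```python
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("news.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; swap this body for a database insert
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Remember to enable it in `settings.py`, e.g. `ITEM_PIPELINES = {"tutorial.pipelines.JsonWriterPipeline": 300}`.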