This is a Scrapy-based web spider that crawls useful information from public websites. The package currently includes one case: a spider that crawls the tags of public Bitcoin addresses. More cases will be added in the future.
The web spider depends on Scrapy. On Ubuntu, it can be installed with pip:
$ [sudo] pip install Scrapy
The web spider is easy to use:
First, go to the directory tag_spider/spiders:
$ cd tag_spider/spiders
Then, run the spider:
$ ./run.sh
Note: run.sh simply wraps a single Scrapy command, scrapy crawl tagspider, for convenience.
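Given the note above, run.sh presumably contains little more than the following (the shebang line is an assumption):

    #!/bin/sh
    scrapy crawl tagspider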
Remarks:
- The item pipeline that processes the crawled pages is enabled in the file tag_spider/settings.py:

      ITEM_PIPELINES = { 'tag_spider.pipelines.BlockSpiderPipeline': 16, }

  The number 16 is the pipeline's order within Scrapy's pipeline chain (an integer from 0 to 1000; pipelines with lower values run first), not a thread count. The actual crawl concurrency is controlled by separate settings, which can be tuned to the capability of your machine, as sketched below.
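  A minimal excerpt of tag_spider/settings.py along these lines might look as follows; the concurrency values are illustrative, not project-specific:

      # tag_spider/settings.py (excerpt)

      ITEM_PIPELINES = {
          'tag_spider.pipelines.BlockSpiderPipeline': 16,  # pipeline order (0-1000), not threads
      }

      # Concurrency knobs (values illustrative; tune to your machine and the target site):
      CONCURRENT_REQUESTS = 16            # maximum concurrent requests overall
      CONCURRENT_REQUESTS_PER_DOMAIN = 8  # maximum concurrent requests per domain
      DOWNLOAD_DELAY = 0.5                # seconds to wait between requests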
- The file used to save the data crawled from the website is set in the file tag_spider/pipelines.py:

      class BlockSpiderPipeline(object):
          def open_spider(self, spider):
              self.fp = open("data.list", "w")

  The default file is "data.list". If necessary, the data can also be saved in a database, such as MySQL, by changing the underlying storage from a file to a database; a sketch of such a variant follows.
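  For illustration, here is a minimal sketch of a database-backed pipeline. It uses SQLite from the Python standard library to stay self-contained; a MySQL variant would follow the same open/insert/close pattern. The database file, table, and column names are assumptions, not part of the project:

      # Illustrative database-backed variant of the pipeline (not the project's code).
      import sqlite3

      class DatabasePipeline(object):
          def open_spider(self, spider):
              # Open the database once when the spider starts.
              self.conn = sqlite3.connect("data.db")  # file name is an assumption
              self.conn.execute(
                  "CREATE TABLE IF NOT EXISTS tags (address TEXT, tag TEXT)"
              )

          def process_item(self, item, spider):
              # 'address' and 'tag' are assumed field names; adapt to the real item.
              self.conn.execute(
                  "INSERT INTO tags (address, tag) VALUES (?, ?)",
                  (item.get("address"), item.get("tag")),
              )
              self.conn.commit()
              return item

          def close_spider(self, spider):
              self.conn.close()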
- The rules for crawling are defined in the file tag_spider/spiders/tagspider.py; a sketch of what such a spider can look like is shown below.
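  The following sketch illustrates the general shape of a rule-based Scrapy spider; the domain, start URL, URL pattern, and CSS selectors are assumptions for illustration only, and the real rules live in tagspider.py:

      # Illustrative sketch of a rule-based spider (not the actual tagspider.py).
      from scrapy.spiders import CrawlSpider, Rule
      from scrapy.linkextractors import LinkExtractor

      class TagSpider(CrawlSpider):
          name = "tagspider"  # matches the 'scrapy crawl tagspider' command in run.sh
          allowed_domains = ["example.com"]               # assumption
          start_urls = ["https://example.com/addresses"]  # assumption

          rules = (
              # Follow links that look like address pages and parse each one.
              Rule(LinkExtractor(allow=r"/address/"), callback="parse_address", follow=True),
          )

          def parse_address(self, response):
              # Selectors below are placeholders for the real page structure.
              yield {
                  "address": response.css("span.address::text").get(),
                  "tag": response.css("span.tag::text").get(),
              }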
- More information about Scrapy can be found at the official website https://scrapy.org/