This project was used as a practical exercise in coroutines and asynchronous programming in Python. It primarily uses the async/await syntax introduced in Python 3.5.
It is a modular web-crawling and web-scraping framework inspired by the Scrapy project. If you are here because you want to use a mature web scraping framework, I urge you to seek out Scrapy.
A simple example.
Save the file as
import arachnid


class ItemPrinter:
    # Result middleware: receives each scraped item, prints it, and passes it on.
    def process_item(self, item, spider):
        print(item)
        return item


class MyExample(arachnid.Spider):
    start_urls = ['']
    name = 'HackerNews'

    def parse(self, response):
        # Each article spans two rows: the title row and the metadata row after it.
        articles = response.css('tr.athing, tr.athing+tr')
        for idx in range(0, len(articles), 2):
            title_elm = articles[idx]
            title = title_elm.css('.title a::text').extract_first()
            meta_elm = articles[idx+1]  # metadata row, not used in this example
            yield {'title': title}
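The parse method above steps through the selected rows two at a time, treating each even-indexed row as a title and the row after it as its metadata. As a minimal sketch of that pairing on plain Python lists (the helper name `pair_rows` and the sample strings are hypothetical and not part of arachnid):

```python
def pair_rows(rows):
    """Group a flat list into (title_row, meta_row) pairs."""
    # Even-indexed rows are titles, odd-indexed rows are their metadata.
    # zip() stops at the shorter slice, so a trailing unpaired row is dropped
    # instead of raising an IndexError.
    return list(zip(rows[0::2], rows[1::2]))


rows = ["title-1", "meta-1", "title-2", "meta-2"]
for title_row, meta_row in pair_rows(rows):
    print(title_row, meta_row)
```

Unlike indexing with `idx+1`, this variant tolerates an odd number of rows, which may or may not be the behavior you want for a given page.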
To run the scraper above, you need to define a configuration for your crawl job.
spiders = [
    {'spider': 'myexample.MyExample',
     'spider_middleware': [],
     'downloader_middleware': [],
     'result_middleware': ['myexample.ItemPrinter']}
]
Save as
Run arachnid settings
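The configuration refers to spiders and middleware by dotted path strings such as 'myexample.MyExample'. How arachnid resolves these internally is not shown here; the following is only a sketch of one common way to turn such a string into a Python object using the standard library. The helper name `load_object` is hypothetical.

```python
import importlib


def load_object(dotted_path):
    """Import 'package.module.Name' and return the object called Name."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Name', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated with a standard-library class so the sketch runs anywhere.
ordered_dict_cls = load_object("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

With a layout like the one in the example, the module named in the path would have to be importable from the directory where the crawl is started.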
- Built-in support for CSS/XPath data extraction using the Parsel library.
- Extensibility through a well-defined API (pipelines, middlewares) that lets you plug in your own functionality.
- Ability to load multiple spiders, each with its own middleware.
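To illustrate the middleware idea, here is a minimal sketch of how a list of result middleware objects could be applied to an item in order, each exposing the `process_item(item, spider)` method used by `ItemPrinter` in the example. The classes `AddSource` and `DropUntitled`, the function `run_pipeline`, and the convention that returning `None` drops an item are all assumptions made for illustration, not arachnid's documented behavior.

```python
class AddSource:
    """Hypothetical middleware: tag each item with the spider's name."""

    def process_item(self, item, spider):
        return {**item, "source": spider}


class DropUntitled:
    """Hypothetical middleware: discard items that have no title."""

    def process_item(self, item, spider):
        return item if item.get("title") else None


def run_pipeline(item, spider, middlewares):
    """Pass an item through each middleware in order."""
    for middleware in middlewares:
        item = middleware.process_item(item, spider)
        if item is None:
            # Assumed convention: None means the item was dropped.
            return None
    return item


# The spider is represented by a plain string here to keep the sketch small.
chain = [DropUntitled(), AddSource()]
print(run_pipeline({"title": "Example"}, "HackerNews", chain))
print(run_pipeline({"title": ""}, "HackerNews", chain))
```

Keeping each middleware to a single small method makes the order of the list in the configuration the only thing that decides how items are transformed.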