This project was used as a practical exercise in coroutines and asynchronous programming in Python. It primarily uses the async/await syntax introduced in Python 3.5.
It is a modular web-crawling and web-scraping framework inspired by the Scrapy project. If you are here because you want to use a mature web scraping framework, I urge you to seek out Scrapy.
A simple example. Save the following file as `myexample.py`:
```python
from arachnid import spider


class ItemPrinter:
    """Result middleware that prints every item a spider yields."""

    def process_item(self, item, spider):
        print(item)
        return item


class MyExample(spider.Spider):
    start_urls = ['http://news.ycombinator.com/']
    name = 'HackerNews'

    def parse(self, response):
        # Hacker News renders each story as a pair of rows: the 'athing' row
        # holding the title, followed by a sibling row holding the metadata.
        articles = response.css('tr.athing, tr.athing+tr')
        for idx in range(0, len(articles), 2):
            title_elm = articles[idx]
            title = title_elm.css('.title a::text').extract_first()
            meta_elm = articles[idx + 1]  # metadata row (see sketch below)
            yield {'title': title}
```
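The metadata row (`meta_elm`) is not used above. As a purely illustrative sketch, assuming Hacker News still marks the points with a `.score` element and the submitter with a `.hnuser` element (these selectors are my assumption, not something arachnid provides), `parse` could be extended to yield those fields as well; this is a drop-in replacement for `MyExample.parse`:

```python
    def parse(self, response):
        articles = response.css('tr.athing, tr.athing+tr')
        for idx in range(0, len(articles), 2):
            title_elm = articles[idx]
            meta_elm = articles[idx + 1]
            yield {
                'title': title_elm.css('.title a::text').extract_first(),
                # '.score' and '.hnuser' are assumptions about the Hacker News
                # markup, not part of arachnid itself.
                'points': meta_elm.css('.score::text').extract_first(),
                'author': meta_elm.css('.hnuser::text').extract_first(),
            }
```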
To execute the spider above, you need to define a configuration for your crawl job:
```python
spiders = [
    {'spider': 'myexample.MyExample',
     'spider_middleware': [],
     'downloader_middleware': [],
     'result_middleware': ['myexample.ItemPrinter']},
]
```
Save this configuration as `myexamplesettings.py`, then run `arachnid settings myexamplesettings.py`.
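Result middleware is also a natural place for persistence. Below is a minimal sketch that reuses the `process_item(self, item, spider)` hook shown in `ItemPrinter`; the class name `JsonWriter` and the `items.jl` output path are my own choices, not part of arachnid. Adding `'myexample.JsonWriter'` to the `result_middleware` list in the settings would enable it.

```python
import json


class JsonWriter:
    """Result middleware sketch: append each item to items.jl as JSON Lines."""

    def process_item(self, item, spider):
        with open('items.jl', 'a', encoding='utf-8') as fh:
            fh.write(json.dumps(item) + '\n')
        return item
```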
Features:
- Built-in support for CSS/XPath data extraction via the Parsel library (see the sketch after this list).
- Extensibility: plug in your own functionality through a well-defined API (pipelines, middlewares).
- Ability to load multiple spiders, each with its own middleware.
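The `response.css(...)` calls in `parse` above use Parsel's selector API, which also works on its own. A quick standalone illustration (independent of arachnid), using a made-up HTML snippet:

```python
from parsel import Selector

html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
sel = Selector(text=html)

# Equivalent CSS and XPath queries over the same document.
print(sel.css('li.item::text').extract())                       # ['first', 'second']
print(sel.xpath('//li[@class="item"]/text()').extract_first())  # 'first'
```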