Memorious

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

Funes the Memorious, Jorge Luis Borges

memorious is a lightweight web scraping toolkit. It supports scrapers that collect structured or unstructured data. This includes the following use cases:

  • Make crawlers modular and simple tasks re-usable
  • Provide utility functions for common tasks such as data storage and HTTP session management
  • Integrate crawlers with the Aleph and FollowTheMoney ecosystem
  • Get out of your way as much as possible

Design

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be re-used across different crawlers.
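As a concrete illustration, a stage can be sketched as a plain Python function that receives a crawler context and a data dict, and emits records to the next stage. This is a minimal sketch: the function name and the shape of the data dict are illustrative, and the exact helpers available on the context object are described in the Memorious documentation.

```python
def parse_listing(context, data):
    """Example stage: forward each item URL found on an index page.

    `data` carries whatever the previous stage emitted; here it is
    assumed to contain a list of URLs under the "urls" key.
    """
    for url in data.get("urls", []):
        # Hand one record per URL to whichever stage the crawler's
        # YAML configuration wires up next.
        context.emit(data={"url": url})
```

Because stages are ordinary functions, the same function can be reused across crawlers and tested in isolation by passing in a stub context.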

The basic steps of writing a Memorious crawler are:

  1. Make a YAML crawler configuration file
  2. Add different stages
  3. Write code for stage operations (optional)
  4. Test, rinse, repeat
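The steps above start from a YAML file that names the crawler and wires its stages together. The following is a sketch only: the crawler name, URLs, and module path are hypothetical, and the stage names, built-in methods, and `handle`/`pass` wiring shown here follow the layout described in the Memorious documentation.

```yaml
name: example_crawler
description: Collect article pages from an example site
pipeline:
  init:
    # Emit the starting URL(s) into the pipeline
    method: seed
    params:
      urls:
        - https://example.com/articles
    handle:
      pass: fetch
  fetch:
    # Download each URL received from the previous stage
    method: fetch
    handle:
      pass: parse
  parse:
    # Custom stage code: a function in your own Python module
    method: example.scraper:parse_listing
    handle:
      pass: store
  store:
    # Write results to a local directory
    method: directory
    params:
      path: data/example
```

Each key under `pipeline` is a stage; `method` points either at a built-in operation or at your own function, and `handle: pass:` names the stage that receives whatever the current stage emits.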

Documentation

The documentation for Memorious is available at alephdata.github.io/memorious. Feel free to edit the source files in the docs folder and send pull requests for improvements.

To build the documentation, run make html inside the docs folder.

You'll find the resulting HTML files in /docs/_build/html.