Memorious

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

Funes the Memorious, Jorge Luis Borges

memorious is a lightweight web scraping toolkit. It supports scrapers that collect structured or unstructured data. This includes the following use cases:

  • Make crawlers modular and simple tasks re-usable
  • Provide utility functions for common tasks such as data storage and HTTP session management
  • Integrate crawlers with the Aleph and FollowTheMoney ecosystem
  • Get out of your way as much as possible

Design

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be re-used across different crawlers.
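As a concrete illustration, a stage can be sketched as a plain Python function that receives a crawler context and a data dict, and emits records to the next stage. This is a minimal sketch: the function name and the shape of the data dict are illustrative, and the exact helpers available on the context object are described in the Memorious documentation.

```python
def parse_listing(context, data):
    """Example stage: forward each item URL found on an index page.

    `data` carries whatever the previous stage emitted; here it is
    assumed to contain a list of URLs under the "urls" key.
    """
    for url in data.get("urls", []):
        # Hand one record per URL to whichever stage the crawler's
        # YAML configuration wires up next.
        context.emit(data={"url": url})
```

Because stages are ordinary functions, the same function can be reused across crawlers and tested in isolation by passing in a stub context.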

The basic steps of writing a Memorious crawler are:

  1. Make a YAML crawler configuration file
  2. Add different stages
  3. Write code for stage operations (optional)
  4. Test, rinse, repeat
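The steps above start from a YAML file that names the crawler and wires its stages together. The following is a sketch only: the crawler name, URLs, and module path are hypothetical, and the stage names, built-in methods, and `handle`/`pass` wiring shown here follow the layout described in the Memorious documentation.

```yaml
name: example_crawler
description: Collect article pages from an example site
pipeline:
  init:
    # Emit the starting URL(s) into the pipeline
    method: seed
    params:
      urls:
        - https://example.com/articles
    handle:
      pass: fetch
  fetch:
    # Download each URL received from the previous stage
    method: fetch
    handle:
      pass: parse
  parse:
    # Custom stage code: a function in your own Python module
    method: example.scraper:parse_listing
    handle:
      pass: store
  store:
    # Write results to a local directory
    method: directory
    params:
      path: data/example
```

Each key under `pipeline` is a stage; `method` points either at a built-in operation or at your own function, and `handle: pass:` names the stage that receives whatever the current stage emits.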

Documentation

The documentation for Memorious is available at alephdata.github.io/memorious. Feel free to edit the source files in the docs folder and send pull requests for improvements.

To build the documentation, run make html inside the docs folder.

You'll find the resulting HTML files in /docs/_build/html.