# A Corpus for Simulating Search on Mastodon
- Install Python 3.11 or higher.
- Create and activate a virtual environment:

  ```shell
  python3.11 -m venv venv/
  source venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -e .
  ```
Use this repository to crawl, analyze, and search Mastodon posts.
Hint: You can always list all available commands of our crawler by running:

```shell
mastodon-search -h
```

The central command used to crawl an instance is `stream-to-es`. It opens a connection to the specified Mastodon instance, receives new posts, and stores them in an Elasticsearch index:

```shell
mastodon-search stream-to-es --host https://es.example.com --username es_username --password es_password mastodon.example.com
```
Behind the scenes, this will fetch posts using Mastodon's streaming API.
Because the streaming API is unavailable on many instances, our crawler gracefully falls back to using regular HTTP GET
requests with the public timeline API.
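The polling fallback can be sketched in Python. This is an illustration of the general pattern, not the crawler's actual internals: the function names and the injected `http_get` callable are hypothetical, but the `/api/v1/timelines/public` endpoint and its `since_id`/`limit` parameters are part of Mastodon's documented REST API.

```python
import time
from urllib.parse import urlencode


def timeline_url(instance, since_id=None, limit=40):
    """Build the public timeline URL polled with regular HTTP GET requests."""
    params = {"limit": limit}
    if since_id is not None:
        params["since_id"] = since_id
    return f"https://{instance}/api/v1/timelines/public?{urlencode(params)}"


def poll_public_timeline(instance, http_get, max_rounds=3, delay=0.0):
    """Fallback crawl loop: repeatedly GET the public timeline and track the
    newest status id so that each round only yields previously unseen posts.
    `http_get` is injected (e.g. a thin wrapper around an HTTP client) so the
    strategy can be exercised without a network connection."""
    since_id = None
    seen = []
    for _ in range(max_rounds):
        statuses = http_get(timeline_url(instance, since_id))
        if statuses:
            since_id = statuses[0]["id"]  # Mastodon returns newest posts first
            seen.extend(statuses)
        time.sleep(delay)
    return seen
```

In the real crawler, the streaming connection is attempted first and this loop only takes over when the streaming API returns an error.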
An initial list of nodes can be obtained from https://nodes.fediverse.party/:
```shell
wget https://nodes.fediverse.party/nodes.json
```
Now, enrich the list of instances with global and weekly activity stats. Be aware that the command below can take a few hours to complete:

```shell
mastodon-search obtain-instance-data nodes.json mastodon_instance_data/
```
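The weekly activity stats come from Mastodon's documented `GET /api/v1/instance/activity` endpoint, which reports one entry per week with counts encoded as strings. As a hedged sketch of the kind of enrichment this step performs (the function name is illustrative, not the crawler's actual code):

```python
def weekly_status_average(activity):
    """Average number of statuses per week, given the JSON response of
    Mastodon's /api/v1/instance/activity endpoint (counts arrive as strings)."""
    if not activity:
        return 0.0
    return sum(int(week["statuses"]) for week in activity) / len(activity)


# Example with the response shape the endpoint documents:
activity = [
    {"week": "1700000000", "statuses": "120", "logins": "40", "registrations": "2"},
    {"week": "1700604800", "statuses": "80", "logins": "35", "registrations": "1"},
]
average = weekly_status_average(activity)  # 100.0 statuses per week
```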
With the activity stats obtained, we can draw a representative sample out of all the instances:

```shell
mastodon-search choose-instances mastodon_instance_data/ out.csv
```
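One common way to draw such a sample is stratified sampling over activity levels, so that small, medium, and large instances are all represented. The sketch below illustrates that idea under stated assumptions; it is not necessarily the scheme `choose-instances` implements, and the `weekly_statuses` key is a hypothetical field name.

```python
import random


def stratified_sample(instances, k_per_stratum=2, n_strata=3, seed=0):
    """Illustrative sampling scheme: rank instances by weekly activity,
    cut the ranking into equally sized strata, and draw uniformly at random
    within each stratum. The seed makes the draw reproducible."""
    rng = random.Random(seed)
    ranked = sorted(instances, key=lambda i: i["weekly_statuses"])
    size = max(1, len(ranked) // n_strata)
    sample = []
    for s in range(n_strata):
        stratum = ranked[s * size:(s + 1) * size]
        if stratum:
            sample.extend(rng.sample(stratum, min(k_per_stratum, len(stratum))))
    return sample
```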
We provide Jupyter notebooks for easily analyzing the instances and crawled posts.
To open a notebook, just run, e.g.:

```shell
jupyter notebook notebooks/mastodon-instance-data-vis.ipynb
```
The correlation between all available instance statistics can be calculated by running:

```shell
mastodon-search calculate-correlation mastodon_instance_data/
```
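For reference, a pairwise correlation between two instance statistics (e.g. weekly statuses vs. weekly logins across all crawled instances) is typically the Pearson coefficient, which can be computed with nothing but the standard library. This is a minimal sketch for orientation, not the command's actual implementation:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.
    Returns a value in [-1, 1]; 1 means a perfect positive linear relation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```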
Our code can also run in a container. First, build the image with this command:

```shell
docker build -t mastodon_search .
```
To run commands using the Docker image just created, replace the `mastodon-search` command from the previous sections with `docker run mastodon_search`.
If you want to save statuses to an Elasticsearch instance running on your localhost, the command should look like the following code snippet. (You can leave out `--network=host` if Elasticsearch is not on your local machine.)

```shell
docker run --network host mastodon_search stream-to-es --host http://localhost --username es_username --password es_password mastodon.example.com
```
Crawling can be parallelized on a Kubernetes cluster.
To do so, install Helm and configure `kubectl` for your cluster.
You are then ready to deploy the Helm chart on the cluster and start the crawling:

```shell
helm install --dry-run --set esUsername="<REDACTED>" --set esPassword="<REDACTED>" --set-file instances="./data/instances.txt" mastodon-crawler ./helm
```
If the above command worked and the Kubernetes resources to be deployed look good to you, just remove the `--dry-run` flag to actually deploy the crawlers.
To stop the crawling, just uninstall the Helm chart:

```shell
helm uninstall mastodon-crawler
```
To re-start the crawling, first uninstall and then re-install the Helm chart.
First, install Python 3.11 or higher and then clone this repository. From inside the repository directory, create a virtual environment and activate it:

```shell
python3.11 -m venv venv/
source venv/bin/activate
```

Then, install the test dependencies:

```shell
pip install -e .[tests]
```
After having implemented a new feature, please check the code format, inspect common linting errors, and run all unit tests with the following commands:

```shell
ruff .                        # Code format and linting
# mypy .                      # Static typing
bandit -c pyproject.toml -r . # Security
pytest .                      # Unit tests
```
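For a pull request that adds a feature, the accompanying unit test can be as small as the pytest-style sketch below. `normalize_host` is a hypothetical helper used only for illustration; real tests target the crawler's actual functions and live alongside them in the repository.

```python
def normalize_host(host):
    """Strip a scheme prefix and trailing slash from an instance name.
    Hypothetical helper, shown only to illustrate the test style."""
    for scheme in ("https://", "http://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    return host.rstrip("/")


def test_normalize_host():
    # pytest discovers functions prefixed with `test_` and reports each
    # failing assertion with the compared values.
    assert normalize_host("https://mastodon.example.com/") == "mastodon.example.com"
    assert normalize_host("mastodon.example.com") == "mastodon.example.com"
```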
If you have found a bug in this crawler or feel some feature is missing, please create an issue. We also gratefully accept pull requests!
If you are unsure about anything, post an issue or contact us. We are happy to help!
- Standards:
- ActivityPub
- Activity Streams 2.0 (syntax for activity data)
- Activity Vocabulary
- WebFinger
- APIs:
- List of Fediverse nodes (source code)
  - List of Fediverse software
- Blogs:
This repository is released under the MIT license.