# A Corpus for Simulating Search on Mastodon
- Install Python 3.11 or higher.
- Create and activate a virtual environment:

  ```shell
  python3.11 -m venv venv/
  source venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -e .
  ```
Use this repository to crawl, analyze, and search Mastodon posts.
Hint: You can always list all available commands of our crawler by running:

```shell
mastodon-search -h
```

The central command used to crawl an instance is `stream-to-es`. It opens a connection to the specified Mastodon instance, receives new posts, and stores them in an Elasticsearch index:

```shell
mastodon-search stream-to-es --host https://es.example.com --username es_username --password es_password mastodon.example.com
```
Behind the scenes, this will fetch posts using Mastodon's streaming API.
Because the streaming API is unavailable on many instances, our crawler gracefully falls back to using regular HTTP GET
requests with the public timeline API.
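The polling fallback can be sketched in Python. This is an illustration of the general pattern, not the crawler's actual internals: the function names and the injected `http_get` callable are hypothetical, but the `/api/v1/timelines/public` endpoint and its `since_id`/`limit` parameters are part of Mastodon's documented REST API.

```python
import time
from urllib.parse import urlencode


def timeline_url(instance, since_id=None, limit=40):
    """Build the public timeline URL polled with regular HTTP GET requests."""
    params = {"limit": limit}
    if since_id is not None:
        params["since_id"] = since_id
    return f"https://{instance}/api/v1/timelines/public?{urlencode(params)}"


def poll_public_timeline(instance, http_get, max_rounds=3, delay=0.0):
    """Fallback crawl loop: repeatedly GET the public timeline and track the
    newest status id so that each round only yields previously unseen posts.
    `http_get` is injected (e.g. a thin wrapper around an HTTP client) so the
    strategy can be exercised without a network connection."""
    since_id = None
    seen = []
    for _ in range(max_rounds):
        statuses = http_get(timeline_url(instance, since_id))
        if statuses:
            since_id = statuses[0]["id"]  # Mastodon returns newest posts first
            seen.extend(statuses)
        time.sleep(delay)
    return seen
```

In the real crawler, the streaming connection is attempted first and this loop only takes over when the streaming API returns an error.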
An initial list of nodes can be obtained from https://nodes.fediverse.party/:
```shell
wget https://nodes.fediverse.party/nodes.json
```
Now, enrich the list of instances with global and weekly activity stats. Be aware that the command below can take a few hours to complete:

```shell
mastodon-search obtain-instance-data nodes.json mastodon_instance_data/
```
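The weekly activity stats come from Mastodon's documented `GET /api/v1/instance/activity` endpoint, which reports one entry per week with counts encoded as strings. As a hedged sketch of the kind of enrichment this step performs (the function name is illustrative, not the crawler's actual code):

```python
def weekly_status_average(activity):
    """Average number of statuses per week, given the JSON response of
    Mastodon's /api/v1/instance/activity endpoint (counts arrive as strings)."""
    if not activity:
        return 0.0
    return sum(int(week["statuses"]) for week in activity) / len(activity)


# Example with the response shape the endpoint documents:
activity = [
    {"week": "1700000000", "statuses": "120", "logins": "40", "registrations": "2"},
    {"week": "1700604800", "statuses": "80", "logins": "35", "registrations": "1"},
]
average = weekly_status_average(activity)  # 100.0 statuses per week
```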
With the activity stats obtained, we can draw a representative sample out of all the instances:

```shell
mastodon-search choose-instances mastodon_instance_data/ out.csv
```
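One common way to draw such a sample is stratified sampling over activity levels, so that small, medium, and large instances are all represented. The sketch below illustrates that idea under stated assumptions; it is not necessarily the scheme `choose-instances` implements, and the `weekly_statuses` key is a hypothetical field name.

```python
import random


def stratified_sample(instances, k_per_stratum=2, n_strata=3, seed=0):
    """Illustrative sampling scheme: rank instances by weekly activity,
    cut the ranking into equally sized strata, and draw uniformly at random
    within each stratum. The seed makes the draw reproducible."""
    rng = random.Random(seed)
    ranked = sorted(instances, key=lambda i: i["weekly_statuses"])
    size = max(1, len(ranked) // n_strata)
    sample = []
    for s in range(n_strata):
        stratum = ranked[s * size:(s + 1) * size]
        if stratum:
            sample.extend(rng.sample(stratum, min(k_per_stratum, len(stratum))))
    return sample
```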
We provide Jupyter notebooks for easily analyzing the instances and crawled posts.
To open a notebook, just run, e.g.:

```shell
jupyter notebook notebooks/mastodon-instance-data-vis.ipynb
```
The correlation between all available instance statistics can be calculated by running:

```shell
mastodon-search calculate-correlation mastodon_instance_data/
```
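For reference, a pairwise correlation between two instance statistics (e.g. weekly statuses vs. weekly logins across all crawled instances) is typically the Pearson coefficient, which can be computed with nothing but the standard library. This is a minimal sketch for orientation, not the command's actual implementation:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.
    Returns a value in [-1, 1]; 1 means a perfect positive linear relation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```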
Our code can also run in a container. First, build the image with this command:

```shell
docker build -t mastodon_search .
```
To run commands using the Docker image just created, replace the `mastodon-search` command from the previous sections with `docker run mastodon_search`.
If you want to save statuses to an Elasticsearch instance running on your localhost, the command should look like the following code snippet. (You can leave out `--network=host` if Elasticsearch is not on your local machine.)

```shell
docker run --network host mastodon_search stream-to-es --host http://localhost --username es_username --password es_password mastodon.example.com
```
Crawling can be parallelized on a Kubernetes cluster.
To do so, install Helm and configure `kubectl` for your cluster.
You are then ready to deploy the Helm chart on the cluster and start the crawling:

```shell
helm install --dry-run --set esUsername="<REDACTED>" --set esPassword="<REDACTED>" --set-file instances="./data/instances.txt" mastodon-crawler ./helm
```
If the above command worked and the Kubernetes resources to be deployed look good to you, just remove the `--dry-run` flag to actually deploy the crawlers.
To stop the crawling, just uninstall the Helm chart:

```shell
helm uninstall mastodon-crawler
```
To re-start the crawling, first uninstall and then re-install the Helm chart.
First, install Python 3.11 or higher and then clone this repository. From inside the repository directory, create a virtual environment and activate it:

```shell
python3.11 -m venv venv/
source venv/bin/activate
```

Then, install the test dependencies:

```shell
pip install -e .[tests]
```
After having implemented a new feature, please check the code format, inspect common linting errors, and run all unit tests with the following commands:

```shell
ruff .                        # Code format and linting
# mypy .                      # Static typing
bandit -c pyproject.toml -r . # Security
pytest .                      # Unit tests
```
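For a pull request that adds a feature, the accompanying unit test can be as small as the pytest-style sketch below. `normalize_host` is a hypothetical helper used only for illustration; real tests target the crawler's actual functions and live alongside them in the repository.

```python
def normalize_host(host):
    """Strip a scheme prefix and trailing slash from an instance name.
    Hypothetical helper, shown only to illustrate the test style."""
    for scheme in ("https://", "http://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    return host.rstrip("/")


def test_normalize_host():
    # pytest discovers functions prefixed with `test_` and reports each
    # failing assertion with the compared values.
    assert normalize_host("https://mastodon.example.com/") == "mastodon.example.com"
    assert normalize_host("mastodon.example.com") == "mastodon.example.com"
```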
If you have found a bug in this crawler or feel some feature is missing, please create an issue. We also gratefully accept pull requests!
If you are unsure about anything, post an issue or contact us. We are happy to help!
- Standards:
- ActivityPub
- Activity Streams 2.0 (syntax for activity data)
- Activity Vocabulary
- WebFinger
- APIs:
- List of Fediverse nodes (source code)
  - List of Fediverse software
- Blogs:
This repository is released under the MIT license.