Lightweight Archiver for Mastodon Activities - a simple Python script to periodically save your Mastodon activities in a local database

s427/LAMA

LAMA - Lightweight Archiver for Mastodon Activities

A simple script that uses the Mastodon API to fetch and archive all the posts you wrote with one (or several) account(s).

Optionally, it can also fetch:

  • your boosts
  • your favourites
  • your bookmarks
  • your mentions
  • the polls you took part in
  • posts that are linked to from any of the above posts
  • posts you reply to (i.e. the parent(s) in a conversation)

The data is stored in a simple SQLite database, and can also be saved as JSON files (each post is a file).

Finally, it can also (and does by default) download attachments to posts. You can choose to either download:

  • only attachments from your own posts
  • attachments from any post

All files (JSON and attachments) are neatly organized in a clean folder structure, based on the account, type of activity (post, reblog, favorite, bookmark), year, month, and author.

After an initial run where the script will download everything it can from the configured account(s), the script will only fetch the posts that haven't been fetched yet. So the idea is to run it periodically (on a daily or weekly basis, for instance), using a cronjob or a similar mechanism.

⚠️ As of now, all the script does is fetch and store the data. There is no interface to view it or manipulate it. This might come at a later date, but for now, at least the data is saved. :) You can use any tool compatible with SQLite to explore it and run any query you want.

📢 The announcement posts, should you want to help spread the word:
In English: https://lou.lt/@s427/115033828181344046
In French: https://lou.lt/@s427/115033826235638224

🙋‍♂️ "But you can just download your archive as a ZIP file!"

It's true, you can get an archive of your account by visiting the "Preferences" page on your Mastodon account (on the web), and then looking in the "Import and export" section.

I should know: I actually wrote a small app (MARL) to help explore such an archive. :)

However, this archive currently has a few shortcomings. While it does contain (almost) all the data you could hope for about your own posts, the same is not true for all the other types of posts:

  1. It contains almost no data about the posts you shared (aka reblogs or boosts), favorited or bookmarked. All you get for those items is basically a URL to the relevant post. Not very helpful if you want to, say, find a post you liked based on a keyword.
  2. More importantly, most of the posts that originated from another instance (and that you boosted, bookmarked or favorited) will likely be absent from this archive. This is because the content of the archive is based on what your instance has in its cache when it builds the archive, and posts in this cache expire after a certain amount of time (how long depends on the instance configuration: it can be a week, it can be a month, or more). When this happens, they disappear from your account, archive included.

In other words, it's mainly about the content you created.

I personally tend to share (reblog/boost) a lot of posts that I see on my timeline: interesting links, geeky threads, funny comments, silly memes, nice pictures, etc. And I'd like to be able to search and find those posts even after they have expired from my instance cache.

Hence, this script.

How to use

You can use LAMA as a Python script, or as a Docker image.

In both cases, you will need to:

  • decide where the archived data will be saved
  • configure your preferences (at least one account)
  • run LAMA in initialization mode (to set up things and authorize the app)

You will then be able to run LAMA in regular mode.

1. Decide where to save your data

LAMA saves all its data in a single folder. By default, this is the user subfolder of the LAMA project, which will be created if it doesn't exist. You can change this location if you want:

  • If you run LAMA as a Python script, you do so by setting the user_dir option in the prefs.json file. See the next step for more details.
  • If you run LAMA as a Docker image, you will have to choose a folder outside the container, and map it as a volume to the user subfolder when running the image. More details further below ("Run LAMA as a Docker image").

What will be stored in this folder:

  • creds subfolder: the .secret files needed to authorize the app for your account(s);
  • data subfolder: the SQLite database, the JSON files (in data/json), and the downloaded attachments (in data/media);
  • logs subfolder: log files generated by the app.

If you want, you can also store the prefs.json file in this folder.

2. Configure the preferences file

Copy prefs.example.json, rename the copy to prefs.json, and edit its content. At least one account must be configured. For each account, you just need to fill in the instance name (including "https://") and your username on that instance (you can drop the "@" or keep it).

Everything else in this file is optional and can be deleted.

This file can either be placed at the root of the LAMA project (same level as main.py), or in the user subfolder (that you must create). If a prefs.json file is found in both locations, the one in the user subfolder will be used (and the other one ignored).

🐋 Running LAMA as a Docker image: you have to copy this file to a location of your choosing in your local file system. The easiest approach is to put it in the "user" folder you chose in the previous step.

A minimal prefs.json file looks like this:

{
  "accounts": [
    {
      "username": "my_username",
      "instance": "https://domain.tld"
    }
  ]
}

If you want to change some of the default settings, you can do so in a prefs object in the same file:

{
  "accounts": [
    {
      "username": "my_username",
      "instance": "https://domain.tld"
    }
  ],
  "prefs": {
    "download_others_attachments": 0
  }
}

See below for a list of all available preferences and their default values.
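Hand-edited JSON breaks easily (a stray trailing comma is enough to make the file invalid). If you prefer, you can generate the file with a few lines of Python instead. This is a minimal sketch: build_prefs is a hypothetical helper, not part of LAMA, and the only check it performs is the "https://" prefix mentioned above.

```python
import json


def build_prefs(username: str, instance: str, **prefs) -> dict:
    """Build the structure shown above: one account plus optional preferences."""
    if not instance.startswith("https://"):
        raise ValueError('instance must include "https://"')
    config = {"accounts": [{"username": username, "instance": instance}]}
    if prefs:
        config["prefs"] = prefs
    return config


config = build_prefs("my_username", "https://domain.tld", download_others_attachments=0)
text = json.dumps(config, indent=2)  # json.dumps never emits trailing commas
print(text)
```

Writing the resulting text to prefs.json gives the same content as the second example above.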

Change the user folder location

⚠️🐋 This section does not apply if you use LAMA as a Docker image. The user_dir setting in your prefs.json will be ignored. See the Docker section below for more information.

If you don't want your data to be stored in the user subfolder, you can specify another location with the user_dir preference in prefs.json:

{
  "accounts": [
    {
      "username": "myUsername",
      "instance": "https://domain.tld"
    }
  ],
  "prefs": {
    "user_dir": "my/alternative/user/dir"
  }
}

You can use an absolute path, or a path relative to the root of the LAMA project.

Windows users: use a forward slash ("/") and not a backslash in your path:
"user_dir": "C:/my/custom/path"

3.a. Run LAMA as Python script

Requirements: Python 3.10 or above

Initialization

This only needs to be done once.

You must first activate the virtual environment, then run main.py with the init parameter:

source .venv/bin/activate
python main.py init

Windows users (from a Bash shell such as Git Bash):
source .venv/Scripts/activate
python main.py init

➡️ Follow the instructions displayed in the terminal to authorize the app for your account.

⚠️ If you add or change an account in prefs.json, you will need to run this command again.

Normal execution

Once LAMA has been initialized, you can run it in regular mode, simply by dropping the init parameter:

source .venv/bin/activate
python main.py

Windows users (from a Bash shell such as Git Bash):
source .venv/Scripts/activate
python main.py

This will fetch all new posts since the previous run. If this is the first run, then all posts will be fetched.

3.b. Run LAMA as a Docker image

Build the image

First you have to build the image from the Dockerfile:

docker build -t lama:latest .
(You may need to run this command with sudo.)

♻️ Use the same command to rebuild the image every time the project is updated.

Choose the user folder location

With a Docker image, we do not want to save our data in the user subfolder (because that would be in the Docker container and therefore non-persistent), and for the same reason, we cannot configure this folder using the user_dir preference in prefs.json.

Therefore we have to map the user folder (in the project folder, which will be run in the container) to the folder we want to use in our local file system (outside of the Docker container).

This can be done with the -v parameter when running the image. For instance:

-v /path/to/my/user_dir:/usr/src/app/user
For Windows users, it would look like:
-v C:/path/to/my/user_dir:/usr/src/app/user

Keep :/usr/src/app/user untouched, just change the part that comes before the colon.

If you store your prefs.json file in this same folder (in this example /path/to/my/user_dir), then this is all you need to specify.

If you want to store your prefs.json file in a separate location, then you need to add a second -v parameter to map it separately:

-v /path/to/my/prefs.json:/usr/src/app/prefs.json -v /path/to/my/user_dir:/usr/src/app/user

Configure your account

Put your prefs.json file in the chosen location. Read above ("2. Configure the preferences file") for more information about its content.

Initialize the app

Run the Docker image in interactive mode (-it) and with the init parameter to initialize the app. Don't forget to also include your volume(s) mapping:

docker run -it --rm -v /path/to/my/user_dir:/usr/src/app/user lama init

➡️ Follow the instructions displayed in the terminal to authorize the app for your account.

⚠️ If you add or change an account in prefs.json, you will need to run this command again.

Run the app (regular mode)

Run the same command, this time without the -it and init parameters:

docker run --rm -v /path/to/my/user_dir:/usr/src/app/user lama

This will fetch all new posts since the previous run. If this is the first run, then all posts will be fetched.

🕑 Note that the image is set to the UTC timezone. This affects: the date/time shown in the logs, and the date/time for the fetched_at and archived_at columns in the database. If you want to use a specific timezone, you can set it with the -e TZ parameter:
docker run --rm -e TZ=Europe/Zurich -v /path/to/my/user_dir:/usr/src/app/user lama
For a complete list of the possible values, see the "List of tz database time zones" page on Wikipedia.

Preferences

The following preferences can be set in the prefs object in your prefs.json file:

{
  "accounts": [],
  "prefs": {
    "preference_name": "value"
  }
}

"user_dir"

default value: "{pwd}/user"

⚠️ Do not use this preference if you're using the Docker image. ⚠️
Where LAMA should save its data: database, JSON files and attachments, credentials, logs. The default is a "user" subfolder in the LAMA project folder. You can specify an absolute path or a path relative to the root of the project (where main.py is).

"save_json"

Boolean (0 or 1); default value: 1

Whether or not to also save all the archived posts as JSON files, in addition to saving them to the database. Each post will be a separate JSON file. They will be saved in [user_dir]/data/json.

"fetch_reblogs"

Boolean (0 or 1); default value: 1

Whether or not to save the reblogs (boosts).
Note: if a post has both reblog data and its own content, it will be saved even if this is set to 0. (Not sure if this is actually possible, so this is mainly a safeguard.)

"fetch_favourites"

Boolean (0 or 1); default value: 1

Whether or not to fetch and save your favourites.

"fetch_bookmarks"

Boolean (0 or 1); default value: 1

Whether or not to fetch and save your bookmarks.

"fetch_mentions"

Boolean (0 or 1); default value: 1

Whether or not to fetch mentions (from your notifications) and save the corresponding posts.
Note: after a while, mentions don't have any content any more (probably due to the corresponding post being removed from the instance cache) and therefore LAMA will not save them. A message will be written in the log if this happens.

"fetch_polls"

Boolean (0 or 1); default value: 1

Whether or not to fetch the posts containing polls the configured account has taken part in.
Note: this information only becomes available once the poll has ended, so don't expect ongoing polls to appear in the database.

"fetch_linked_posts"

Boolean (0 or 1); default value: 1

If true, and a post contains a link that looks like a Mastodon post (based on its URL scheme), then LAMA will attempt to fetch and save the corresponding post (as a "A.link" activity, where "A" is the activity which triggered this post being fetched; for instance "bookmark.link" in the case of a link found in a bookmark).
Note that this can be triggered recursively (fetch a post linked in a post linked in a post linked in a post, etc). See the recursion_limit preference below.
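The README does not spell out which URL schemes LAMA recognizes, so the following is only a guess at what such a check could look like. It matches the two URL shapes that appear elsewhere in this document (https://host/@user/id and https://host/users/user/statuses/id); the pattern and the function name are assumptions, not LAMA's actual rule.

```python
import re

# Two common shapes for a Mastodon post URL (assumed, not taken from LAMA):
#   https://host/@user/123456
#   https://host/users/user/statuses/123456
MASTODON_POST_URL = re.compile(
    r"^https://[^/\s]+/(?:@[^/\s]+/\d+|users/[^/\s]+/statuses/\d+)/?$"
)


def looks_like_mastodon_post(url: str) -> bool:
    """Return True if the URL has the shape of a Mastodon post link."""
    return MASTODON_POST_URL.match(url) is not None
```

A check like this can only say that a link looks like a Mastodon post; whether it really is one is only known once the fetch is attempted.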

"fetch_reply_parents"

Boolean (0 or 1); default value: 1

If true, and a post is a reply to another post (in_reply_to_id field), then LAMA will attempt to also fetch this referenced post and save it as a "A.parent" activity (where "A" is the activity which triggered this post being fetched; for instance "bookmark.parent" in the case of the parent post to a bookmark).
Note that this can be triggered recursively (fetch a parent of a parent, etc). See the recursion_limit preference below.

"download_own_attachments"

Boolean (0 or 1); default value: 1

Whether or not to download your own attachments. They will be saved in [user_dir]/data/media.

"download_others_attachments"

Boolean (0 or 1); default value: 1

Whether or not to download the attachments posted by other users. They will be saved in [user_dir]/data/media.

"fetch_limit"

Integer; default value: 25

How many posts to fetch from the instance with each single API request. Most instances restrict this number to 40, so it's probably pointless to go above that value. I set it to 25 so as to have more "round" numbers (multiples of 100). :)
This is unrelated to the total number of posts that LAMA will fetch during one run, so there's not much reason to change it.
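To illustrate why the page size does not change the total, here is a sketch of a paginated fetch loop. It is not LAMA's code: fetch_page stands in for a real API call, fake_fetch_page is a made-up data source, and integer ids are used for simplicity (real Mastodon ids are strings). The limit and max_id parameter names mirror the ones used by the Mastodon API.

```python
def fetch_all(fetch_page, fetch_limit=25):
    """Collect every item by requesting pages of at most `fetch_limit` items."""
    items, max_id = [], None
    while True:
        page = fetch_page(limit=fetch_limit, max_id=max_id)
        if not page:
            break  # an empty page means there is nothing older left
        items.extend(page)
        max_id = page[-1]["id"]  # next page: items strictly older than this one
    return items


def fake_fetch_page(limit, max_id):
    """Stand-in for the API: 60 posts with ids 60..1, newest first."""
    ids = [i for i in range(60, 0, -1) if max_id is None or i < max_id]
    return [{"id": i} for i in ids[:limit]]


all_posts = fetch_all(fake_fetch_page, fetch_limit=25)
```

With a different fetch_limit the loop simply makes more or fewer requests; the list it returns is the same.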

"recursion_limit"

Integer; default value: 100

This is related to fetch_linked_posts and fetch_reply_parents, in situations where a large number of chained posts may be fetched. LAMA will stop when this number of recursions is reached.
If you set it to 0, then no parent or Mastodon link will be fetched, which would be equivalent to disabling both fetch_linked_posts and fetch_reply_parents.
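The effect of the limit can be sketched with a depth-limited walk up a reply chain. This is an illustration only: fetch_parent_chain and get_post are hypothetical names, the walk is written as a loop rather than a recursive call, and it covers parents only (linked posts would follow the same idea).

```python
def fetch_parent_chain(post, get_post, recursion_limit=100):
    """Follow in_reply_to_id upward, fetching at most `recursion_limit` parents."""
    parents = []
    current = post
    while current is not None and len(parents) < recursion_limit:
        parent_id = current.get("in_reply_to_id")
        if parent_id is None:
            break  # reached the top of the conversation
        current = get_post(parent_id)
        if current is not None:
            parents.append(current)
    return parents


# A made-up conversation: post 5 replies to 4, which replies to 3, and so on.
posts = {i: {"id": i, "in_reply_to_id": i - 1 if i > 1 else None} for i in range(1, 6)}
chain = fetch_parent_chain(posts[5], posts.get, recursion_limit=2)
```

With a limit of 2 only the two nearest parents are fetched, and with a limit of 0 the loop body never runs, which matches the "equivalent to disabling" behaviour described above.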

"log_level"

String; possible values: "debug", "info", "warning", "critical"
Default value: "info"

The level of details that will be written to the log file ("logs" folder):

  • critical: errors that prevent the script from running
  • warning: unusual cases
  • info: most normal operations
  • debug: detailed information, including the full JSON data of fetched posts; this can quickly lead to a big log file.

(Each level includes the previous ones.)
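The "each level includes the previous ones" rule works the same way as the numeric levels of Python's standard logging module. The sketch below assumes the four names map directly onto those standard levels, which the README does not confirm; is_logged is a hypothetical helper.

```python
import logging

# Assumed mapping from the preference values to standard logging levels.
LOG_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "critical": logging.CRITICAL,
}


def is_logged(message_level: str, configured_level: str) -> bool:
    """A message is written if it is at least as severe as the configured level."""
    return LOG_LEVELS[message_level] >= LOG_LEVELS[configured_level]
```

For example, with log_level set to "info", warnings and critical errors are written as well, but debug messages are not.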

"log_history_limit"

Integer; default value: 1000

The maximum number of log files that should be kept in the "logs" folder. If this number is exceeded, the oldest files will be deleted. Only files starting with "LAMA-run" are counted and deleted.
Set to 0 to never auto-delete any log.
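A pruning rule like this one could be implemented along these lines. This is a sketch, not LAMA's code: prune_logs is a hypothetical name, and "oldest" is assumed here to mean the oldest modification time.

```python
from pathlib import Path


def prune_logs(logs_dir: Path, log_history_limit: int = 1000) -> list[Path]:
    """Delete the oldest "LAMA-run" log files beyond the limit; return them."""
    if log_history_limit == 0:
        return []  # 0 means: never auto-delete
    runs = sorted(
        (p for p in logs_dir.iterdir() if p.is_file() and p.name.startswith("LAMA-run")),
        key=lambda p: (p.stat().st_mtime, p.name),  # oldest first
    )
    excess = runs[:-log_history_limit] if len(runs) > log_history_limit else []
    for path in excess:
        path.unlink()
    return excess
```

Files whose name does not start with "LAMA-run" are never touched, so anything else you keep in the logs folder is safe.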

Database structure

LAMA has three tables:

  • posts
  • activities
  • states

The states table is only used to help LAMA keep track of things internally. You can ignore it.

'posts' table

posts contains all the data of the posts that have been fetched. For each post, the complete, raw data is saved in the json column. Most other columns are therefore redundant to this column: they have been added to make consulting and searching the table easier:

  • id: primary key (auto-increment);
  • post_uri: used as a unique identifier for each post;
    example: https://lou.lt/users/s427/statuses/113567190207673533
  • post_id: the id for this post, unique to the instance the post was fetched from;
    example: 113567190207673533
  • author: the author of the post;
    example: s427@lou.lt
  • visibility: public, unlisted, private, or direct;
  • content: text-only version of the content of the post (HTML stripped);
  • hashtags: a JSON array containing all the hashtags found in the post;
  • mentions: a JSON array containing all the mentions found in the post;
  • links: a JSON array containing all the links found in the post, excluding hashtags and mentions;
    Each link is an object with three keys:
    • url;
    • text: the text of the link, if different from the url; otherwise an empty string;
    • mastodon: boolean, whether the URL looks like it might be a link to a Mastodon post, based on its URL scheme.
  • attachments: a JSON array of all the attachments that have been downloaded for this post (if LAMA is configured to do so); each attachment in the array is itself an array with two entries:
    • first, the local path where the file has been saved; if the file could not be saved (for instance because of an HTTP error), then the corresponding error message is saved here instead of the path;
    • second, the description text (alt text) for the attachment (or an empty string)
  • poll_options: if the post contains a poll, this column will contain a JSON array of all the options for this poll;
  • reblog: if this post is a reblog, this contains the URL for the post that was boosted;
  • created_at: date of the post;
  • edited_at: the last time the post was edited (or null);
  • fetched_at: date at which the post was fetched and saved to the database by LAMA;
  • json: full, raw data, as received from the Mastodon API;
  • note: this column is not used for now.
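Since there is no viewer yet, here is a small sketch of a keyword search using Python's built-in sqlite3 module. The column names come from the list above, but everything else is an assumption: the demo runs on an in-memory database with a reduced schema and made-up data, and search_posts is a hypothetical helper. To query your real archive you would connect to the database file in [user_dir]/data instead.

```python
import json
import sqlite3


def search_posts(conn: sqlite3.Connection, keyword: str) -> list[dict]:
    """Return posts whose text content contains `keyword`, newest first."""
    rows = conn.execute(
        "SELECT post_uri, author, content, hashtags, links "
        "FROM posts WHERE content LIKE ? ORDER BY created_at DESC",
        (f"%{keyword}%",),
    ).fetchall()
    return [
        {
            "post_uri": post_uri,
            "author": author,
            "content": content,
            # These columns hold JSON arrays stored as text, so decode them.
            "hashtags": json.loads(hashtags or "[]"),
            "links": json.loads(links or "[]"),
        }
        for post_uri, author, content, hashtags, links in rows
    ]


# Demo on an in-memory database with a reduced, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_uri TEXT, author TEXT, content TEXT, "
    "hashtags TEXT, links TEXT, created_at TEXT)"
)
conn.execute(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)",
    (
        "https://domain.tld/users/someone/statuses/1",
        "someone@domain.tld",
        "A silly meme about llamas",
        json.dumps(["llama"]),
        json.dumps([{"url": "https://example.org", "text": "", "mastodon": False}]),
        "2025-01-01T00:00:00Z",
    ),
)
results = search_posts(conn, "meme")
```

The same query works in any SQLite tool; the Python wrapper only adds the decoding of the JSON columns.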

The same post can be fetched multiple times (by different accounts, or the same account but different activities: once as a boost, another time as a bookmark, etc). In such a case, LAMA compares the "edited_at" value of both posts (the one already present in the database, and the newer one), and if the value is the same, it doesn't save it again, nor does it download the attachments again.
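The comparison described above might look like the following. This is a simplified sketch on an assumed two-column table, and is_already_archived is a hypothetical name; it only covers the "same edited_at, skip it" case that this README describes.

```python
import sqlite3


def is_already_archived(conn, post_uri, edited_at) -> bool:
    """True if this post is stored with the same edited_at value (None included)."""
    row = conn.execute(
        "SELECT edited_at FROM posts WHERE post_uri = ?", (post_uri,)
    ).fetchone()
    return row is not None and row[0] == edited_at


# Demo: one stored post that has never been edited (edited_at is NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_uri TEXT, edited_at TEXT)")
uri = "https://domain.tld/users/someone/statuses/1"
conn.execute("INSERT INTO posts VALUES (?, ?)", (uri, None))
```

Here is_already_archived(conn, uri, None) is true, so neither the post nor its attachments would be processed a second time.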

'activities' table

The activities table contains information about why each post in the posts table was fetched in the first place (e.g. was it a favourite, a boost, etc). One single post can be referenced by multiple lines in activities. For instance one post may have been bookmarked at a certain date, then favourited and/or boosted later on: it would then have two or three entries in activities.

  • id: primary key (auto-increment);
  • account: which account (as configured in prefs.json) performed the activity;
  • post_uri: which post in the posts table are we referencing;
  • activity_type: the type of activity (post, reblog, favourite, bookmark, etc);
  • activity_id: it's the post id in most cases, but for mentions it's a different kind of id; you can ignore it;
  • archived_at: the date at which the activity was recorded in the database by LAMA.
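Because the two tables are linked by post_uri, a simple join answers questions such as "show me everything I bookmarked". The sketch below runs on an in-memory database with a reduced, assumed schema and made-up values (including the format of the account column, which is not documented here); posts_for_activity is a hypothetical helper.

```python
import sqlite3


def posts_for_activity(conn, account, activity_type):
    """List (archived_at, author, content) for one account and activity type."""
    return conn.execute(
        "SELECT a.archived_at, p.author, p.content "
        "FROM activities AS a JOIN posts AS p ON p.post_uri = a.post_uri "
        "WHERE a.account = ? AND a.activity_type = ? "
        "ORDER BY a.archived_at",
        (account, activity_type),
    ).fetchall()


# Demo: one post that was bookmarked, then favourited a day later.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (post_uri TEXT, author TEXT, content TEXT);
    CREATE TABLE activities (account TEXT, post_uri TEXT, activity_type TEXT, archived_at TEXT);
    INSERT INTO posts VALUES
        ('https://domain.tld/users/someone/statuses/1', 'someone@domain.tld', 'Nice picture');
    INSERT INTO activities VALUES
        ('me@domain.tld', 'https://domain.tld/users/someone/statuses/1', 'bookmark', '2025-01-01 10:00:00'),
        ('me@domain.tld', 'https://domain.tld/users/someone/statuses/1', 'favourite', '2025-01-02 10:00:00');
""")
bookmarks = posts_for_activity(conn, "me@domain.tld", "bookmark")
```

The single post appears once in posts and twice in activities, so it shows up in both the bookmark and the favourite listings.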

A note on reblogs

When you share a post on Mastodon (aka "reblog" or "boost"), what happens is that a new post is created, whose author is "you". This post doesn't have any content of its own, but it still has the same structure as any regular post; it also contains a "reblog" object, which in turn contains all the data from the original post. So in effect, you get two posts in one.

(Normal posts (non-boost) also have a "reblog" key, which is simply left empty (null).)

Greatly simplified example:

{
    "url": "your reblog URL",
    "account": {
        "username": "you"
    },
    "content": "", // <- empty!
    "created_at": "the date you boosted the post",
    "visibility": "the visibility you chose for your boost",
    "reblog": {
        "url": "URL to the original post",
        "account": {
            "username": "original author"
        },
        "content": "<p>The content of the post you boosted</p>",
        "created_at": "the date of the original post",
        "visibility": "visibility of the original post",
        "reblog": null
    }
}

So, how does LAMA handle that?

If you choose to fetch and archive reblogs (which is activated by default), LAMA will:

  1. extract the data from the "reblog" object (i.e. the original post, independent from your boost), and save it as a post on its own in the "posts" table; the "reblog" column is left empty (null).

    • in the activities table, an entry of type "reblog" is created, referencing this post via post_uri.
  2. also save the whole post as it was fetched (i.e. "your" post, containing the reblog) as a post on its own; the "reblog" column contains the URI of the original post, which means you can easily find it in the table (via the post_uri column).

    • in the activities table, an entry of type "post" is created, referencing this post via post_uri.
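The two steps above can be summarized in a few lines of Python. This is a simplified sketch of the rule, not LAMA's code: records_for_status is a hypothetical name, and it only tracks the three fields discussed here (it relies on the "uri" field that the Mastodon API includes in each status).

```python
def records_for_status(status: dict) -> list[dict]:
    """Describe the rows to archive for one fetched status."""
    original = status.get("reblog")
    if original is None:
        # A normal post: one row, one "post" activity.
        return [{"post_uri": status["uri"], "reblog": None, "activity_type": "post"}]
    return [
        # 1. the original post, saved on its own, referenced by a "reblog" activity
        {"post_uri": original["uri"], "reblog": None, "activity_type": "reblog"},
        # 2. "your" wrapper post, whose reblog column points at the original
        {"post_uri": status["uri"], "reblog": original["uri"], "activity_type": "post"},
    ]


boost = {
    "uri": "https://domain.tld/users/you/statuses/2/activity",
    "reblog": {"uri": "https://other.tld/users/author/statuses/1", "reblog": None},
}
records = records_for_status(boost)
```

For a boost this yields two records, and the second one's reblog value equals the first one's post_uri, which is what makes the original easy to find in the table.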

Unauthorize LAMA

If you want to stop archiving an account, simply delete the corresponding information from your prefs.json file.

You can also delete the corresponding credential files in the user/creds folder. Each configured account has two such files. The account is indicated in the file name.

Finally, you can log in to your Mastodon account on the web, and visit "Preferences > Account > Authorized apps" to revoke the permissions you granted to LAMA.

Similar apps

(In no particular order)

Feel free to let me know of other entries I could add to this list!

Disclaimer

This is a personal, non-official project. I am not associated with the Mastodon project in any way.

You can reach me via GitHub or on Mastodon:
GitHub: https://github.com/s427
Mastodon: https://lou.lt/@s427

Version history

  • v1.1.0
    • new option: fetch_polls (0 or 1) if you want LAMA to fetch the posts containing polls you have taken part in. Note that this only becomes available once the poll has ended, so don't expect ongoing polls to appear in the database.
    • misc minor fixes and improvements
  • v1.0.1
    • fix: author name for posts coming from Bluesky Bridge (bsky.brid)
    • safeguard: if fetch_reblogs is disabled but the post has attachments or poll options, we save the post anyway (but not the reblog)
  • v1.0
    • initial release
