split files cache #5658

Open
ThomasWaldmann opened this issue Jan 27, 2021 · 2 comments
ThomasWaldmann commented Jan 27, 2021

Problem

The files cache can be rather large and consumes a lot of memory while it is loaded.

Currently, borg by default uses one files cache for all backup runs, containing information about all files seen in the last BORG_FILES_CACHE_TTL backup runs.

There is already one mechanism to optimize that: using BORG_FILES_CACHE_SUFFIX to split this overall files cache into multiple specific files caches, so that each backup run only loads what it really needs.

This is a bit of a hack, as the user has to take care of using files cache suffixes that correspond to the backup input data sets.

From the docs:

BORG_FILES_CACHE_SUFFIX

When set to a value at least one character long, instructs borg to use a specifically named (based on the suffix) alternative files cache. This can be used to avoid loading and saving cache entries for backup sources other than the current sources.

BORG_FILES_CACHE_TTL

When set to a numeric value, this determines the maximum “time to live” for the files cache entries (default: 20). The files cache is used to quickly determine whether a file is unchanged. The FAQ explains this in more detail in: It always chunks all my files, even unchanged ones!
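
To make the "time to live" idea concrete, here is a minimal sketch of how aging entries out of a files cache could work. It is not borg's actual implementation: the `Entry` class and the `start_run`, `mark_seen` and `finish_run` helpers are hypothetical names, and real entries carry file metadata and chunk references that are omitted here.

```python
from dataclasses import dataclass

TTL = 20  # the documented default of BORG_FILES_CACHE_TTL


@dataclass
class Entry:
    # Hypothetical, simplified cache entry: only the age is modelled.
    age: int = 0


def start_run(cache: dict[str, Entry]) -> None:
    """When a backup run starts, every known entry gets one run older."""
    for entry in cache.values():
        entry.age += 1


def mark_seen(cache: dict[str, Entry], path: str) -> None:
    """A file encountered during this run is fresh again (age 0)."""
    cache.setdefault(path, Entry()).age = 0


def finish_run(cache: dict[str, Entry], ttl: int = TTL) -> dict[str, Entry]:
    """Keep only the entries that were seen within the last `ttl` runs."""
    return {path: entry for path, entry in cache.items() if entry.age < ttl}
```

With a single shared cache, a file that belongs to a different backup job is "not seen" in every run of the other jobs, which is why the TTL has to be fairly generous in that setup.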

Ideas

auto-compute the suffix

Simply compute the cache suffix from all recursion roots, like suffix = H(root1, root2, ...)

This would use 1 files cache per borg run.
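
A minimal sketch of this idea, under several assumptions: `auto_suffix` is a hypothetical helper name, H is taken to be a truncated SHA-256, and the roots are normalized and sorted so that their order on the command line does not matter (the issue does not say whether it should).

```python
import hashlib
import os


def auto_suffix(roots: list[str]) -> str:
    """Derive a files cache suffix from the recursion roots of one run."""
    # Assumption: the same set of roots should give the same suffix,
    # no matter in which order the roots were given.
    normalized = sorted(os.path.abspath(root) for root in roots)
    h = hashlib.sha256()
    for root in normalized:
        h.update(os.fsencode(root))
        h.update(b"\0")  # separator; a NUL byte cannot occur inside a path
    # Assumption: 16 hex digits are enough to tell the caches apart.
    return h.hexdigest()[:16]
```

The separator keeps root lists such as `/a`, `/b` and `/a/b` from hashing to the same value.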

split files cache per recursion root

Maybe borg could default to using a split files cache and take care of storing the parts separately on its own, without the user having to think about suffixes and manually map them to the backup runs.

E.g. it could store (and load) a files cache per recursion root.

This could potentially use multiple files caches per borg run.

Note:

  • Cache transactions still need to work, so that we always have a correct/consistent cache.
  • In some modes of operation there is no recursion root, e.g. if filenames or content data are coming in via stdin.
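
The per-root variant could be sketched as follows. This only covers the naming side: `cache_name_for_root` and `cache_names_for_run` are hypothetical helpers, the truncated SHA-256 is an assumption, and falling back to a single unsuffixed `files` cache when there is no recursion root is just one possible answer to the second note. Cache transactions are not addressed here.

```python
import hashlib
import os


def cache_name_for_root(root: str) -> str:
    """Name of the files cache that holds the entries below one recursion root."""
    digest = hashlib.sha256(os.fsencode(os.path.abspath(root))).hexdigest()
    return f"files.{digest[:16]}"


def cache_names_for_run(roots: list[str]) -> list[str]:
    """All files caches one borg run would have to load and save."""
    if not roots:
        # Assumption: without a recursion root (e.g. input via stdin),
        # use one default cache.
        return ["files"]
    return [cache_name_for_root(root) for root in roots]
```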
@ThomasWaldmann

Maybe we won't need this, see the idea in #8385.

ThomasWaldmann self-assigned this Sep 19, 2024
ThomasWaldmann added this to the 2.0.0b11 milestone Sep 19, 2024
@ThomasWaldmann

Split by archive series (still allow manual control via env var).
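
One way to read this: the files cache name is derived from the archive (series) name, unless the user overrides it via BORG_FILES_CACHE_SUFFIX. The sketch below assumes a truncated SHA-256 for the hash and uses a hypothetical helper name, `files_cache_name`; it is not taken from the borg code base.

```python
import hashlib
import os


def files_cache_name(archive_name: str, environ=os.environ) -> str:
    """Pick the local files cache name for one archive series."""
    # Manual control via the env var takes precedence.
    suffix = environ.get("BORG_FILES_CACHE_SUFFIX", "")
    if not suffix:
        # Assumption: H is SHA-256, shortened to 16 hex digits.
        suffix = hashlib.sha256(archive_name.encode("utf-8")).hexdigest()[:16]
    return f"files.{suffix}"
```

Passing `environ` as a parameter keeps the function easy to try out without touching the real process environment.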

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Sep 19, 2024
- changes to locally stored files cache:

  - store as files.<H(archive_name)>
  - user can manually control suffix via env var
  - if local files cache is not found, build from previous archive.
- enable rebuilding the files cache via loading the previous
  archive's metadata from the repo (better than starting with
  empty files cache and needing to read/chunk/hash all files).
  previous archive == same archive name, latest timestamp in repo.
- remove AdHocCache (not needed any more, slow)
- remove BORG_CACHE_IMPL, we only have one
- remove cache lock (this was blocking parallel backups to same
  repo from same machine/user).

Cache entries now have ctime AND mtime.

Note: TTL and age still needed for discarding removed files.
      But due to the separate files caches per series, the TTL
      was lowered to 2 (from 20).
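
The "previous archive" rule from the commit message (same archive name, latest timestamp in the repo) can be sketched like this. `ArchiveInfo` and `previous_archive` are hypothetical names for illustration; how borg actually lists archives and loads their metadata is not shown.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ArchiveInfo:
    # Hypothetical minimal record of an archive in the repo.
    name: str
    ts: datetime


def previous_archive(archives: list[ArchiveInfo], name: str) -> Optional[ArchiveInfo]:
    """Return the newest archive of the same series, or None if there is none."""
    same_series = [a for a in archives if a.name == name]
    return max(same_series, key=lambda a: a.ts, default=None)
```

If this returns None, there is nothing to rebuild from and the run would have to start with an empty files cache.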