files cache improvements, fixes borgbackup#8385, fixes borgbackup#5658

- changes to locally stored files cache:
  - store as files.<H(archive_name)>
  - user can manually control suffix via env var
  - if local files cache is not found, build from previous archive.
- enable rebuilding the files cache via loading the previous archive's metadata from the repo (better than starting with empty files cache and needing to read/chunk/hash all files).
  previous archive == same archive name, latest timestamp in repo.
- remove AdHocCache (not needed any more, slow)
- remove BORG_CACHE_IMPL, we only have one
- remove cache lock (this was blocking parallel backups to same repo from same machine/user).

Cache entries now have ctime AND mtime.

Note: TTL and age still needed for discarding removed files. But due to the separate files caches per series, the TTL was lowered to 2 (from 20).
ThomasWaldmann · Sep 19, 2024 · a891559
1 parent 385eeeb
Showing 14 changed files with 181 additions and 271 deletions.
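
The per-series naming described in the commit message (files.<H(archive_name)>, with an optional suffix override via env var) could look roughly like the sketch below. This is an illustration only, not the code from this commit: the function name files_cache_filename, the environ parameter, the use of SHA-256 for H and the truncation to 16 hex digits are all assumptions. Only the BORG_FILES_CACHE_SUFFIX variable name and the "files." prefix come from the commit.

```python
import hashlib
import os


def files_cache_filename(archive_name, environ=None):
    """Return a per-series files cache filename (hypothetical helper).

    Assumption: H is SHA-256, truncated to 16 hex digits. The commit message
    only says "H(archive_name)" and does not name the hash function.
    """
    environ = os.environ if environ is None else environ
    # A manually chosen suffix takes precedence over the derived one.
    suffix = environ.get("BORG_FILES_CACHE_SUFFIX")
    if not suffix:
        digest = hashlib.sha256(archive_name.encode("utf-8")).hexdigest()
        suffix = digest[:16]
    return "files." + suffix


# Same archive name -> same cache file, so a series keeps reusing its cache.
print(files_cache_filename("home-docs", environ={}))
print(files_cache_filename("home-docs", environ={"BORG_FILES_CACHE_SUFFIX": "docs"}))
```

Deriving the suffix from the archive name is what makes "always use the same archive name" matter: a different name yields a different cache file, and the backup then starts without a matching local files cache.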
diff --git a/docs/faq.rst b/docs/faq.rst
@@ -837,50 +837,29 @@ already used.
 By default, ctime (change time) is used for the timestamps to have a rather
 safe change detection (see also the --files-cache option).
 
-Furthermore, pathnames recorded in files cache are always absolute, even if you
-specify source directories with relative pathname. If relative pathnames are
-stable, but absolute are not (for example if you mount a filesystem without
-stable mount points for each backup or if you are running the backup from a
-filesystem snapshot whose name is not stable), borg will assume that files are
-different and will report them as 'added', even though no new chunks will be
-actually recorded for them. To avoid this, you could bind mount your source
-directory in a directory with the stable path.
+Furthermore, pathnames used as key into the files cache are **as archived**,
+so make sure these are always the same (see ``borg list``).
 
 .. _always_chunking:
 
 It always chunks all my files, even unchanged ones!
 ---------------------------------------------------
 
-Borg maintains a files cache where it remembers the timestamp, size and
+Borg maintains a files cache where it remembers the timestamps, size and
 inode of files. When Borg does a new backup and starts processing a
 file, it first looks whether the file has changed (compared to the values
 stored in the files cache). If the values are the same, the file is assumed
 unchanged and thus its contents won't get chunked (again).
 
-Borg can't keep an infinite history of files of course, thus entries
-in the files cache have a "maximum time to live" which is set via the
-environment variable BORG_FILES_CACHE_TTL (and defaults to 20).
-Every time you do a backup (on the same machine, using the same user), the
-cache entries' ttl values of files that were not "seen" are incremented by 1
-and if they reach BORG_FILES_CACHE_TTL, the entry is removed from the cache.
-
-So, for example, if you do daily backups of 26 different data sets A, B,
-C, ..., Z on one machine (using the default TTL), the files from A will be
-already forgotten when you repeat the same backups on the next day and it
-will be slow because it would chunk all the files each time. If you set
-BORG_FILES_CACHE_TTL to at least 26 (or maybe even a small multiple of that),
-it would be much faster.
-
-Besides using a higher BORG_FILES_CACHE_TTL (which also increases memory usage),
-there is also BORG_FILES_CACHE_SUFFIX which can be used to have separate (smaller)
-files caches for each backup set instead of the default one (big) unified files cache.
-
-Another possible reason is that files don't always have the same path, for
-example if you mount a filesystem without stable mount points for each backup
-or if you are running the backup from a filesystem snapshot whose name is not
-stable. If the directory where you mount a filesystem is different every time,
-Borg assumes they are different files. This is true even if you back up these
-files with relative pathnames - borg uses full pathnames in files cache regardless.
+The files cache is stored separately (using a different filename suffix) per
+archive series, thus using always the same name for the archive is strongly
+recommended. The "rebuild files cache from previous archive in repo" feature
+also depends on that.
+Alternatively, there is also BORG_FILES_CACHE_SUFFIX which can be used to
+manually set a custom suffix (if you can't just use the same archive name).
+
+Another possible reason is that files don't always have the same path -
+borg uses the paths as seen in the archive when using ``borg list``.
 
 It is possible for some filesystems, such as ``mergerfs`` or network filesystems,
 to return inconsistent inode numbers across runs, causing borg to consider them changed.

diff --git a/docs/internals/data-structures.rst b/docs/internals/data-structures.rst
@@ -474,18 +474,20 @@ guess what files you have based on a specific set of chunk sizes).
 The cache
 ---------
 
-The **files cache** is stored in ``cache/files`` and is used at backup time to
-quickly determine whether a given file is unchanged and we have all its chunks.
+The **files cache** is stored in ``cache/files.<SUFFIX>`` and is used at backup
+time to quickly determine whether a given file is unchanged and we have all its
+chunks.
 
 In memory, the files cache is a key -> value mapping (a Python *dict*) and contains:
 
-* key: id_hash of the encoded, absolute file path
+* key: id_hash of the encoded path (same path as seen in archive)
 * value:
 
+  - age (0 [newest], ..., BORG_FILES_CACHE_TTL - 1)
   - file inode number
   - file size
-  - file ctime_ns (or mtime_ns)
-  - age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1)
+  - file ctime_ns
+  - file mtime_ns
   - list of chunk (id, size) tuples representing the file's contents
 
 To determine whether a file has not changed, cached values are looked up via
@@ -514,7 +516,7 @@ be told to ignore the inode number in the check via --files-cache.
 The age value is used for cache management. If a file is "seen" in a backup
 run, its age is reset to 0, otherwise its age is incremented by one.
 If a file was not seen in BORG_FILES_CACHE_TTL backups, its cache entry is
-removed. See also: :ref:`always_chunking` and :ref:`a_status_oddity`
+removed.
 
 The files cache is a python dictionary, storing python objects, which
 generates a lot of overhead.
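
The entry layout and the age rule documented in the data-structures.rst hunk above can be sketched as follows. This is a simplified model for illustration, not borg's implementation: the names FilesCacheEntry and age_entries are hypothetical, and the real cache stores packed values rather than named tuples. The field order and the default TTL of 2 follow the documentation changed in this commit.

```python
from typing import NamedTuple


class FilesCacheEntry(NamedTuple):
    """Hypothetical model of one files cache value, fields as documented."""
    age: int        # 0 (newest) ... TTL - 1
    inode: int
    size: int
    ctime_ns: int
    mtime_ns: int
    chunks: list    # list of (chunk id, size) tuples


def age_entries(cache, seen_keys, ttl=2):
    """Return a new cache after one backup run (hypothetical helper).

    Entries seen in this run get age 0, all others get age + 1.
    Entries whose age reaches ttl are dropped.
    """
    result = {}
    for key, entry in cache.items():
        new_age = 0 if key in seen_keys else entry.age + 1
        if new_age < ttl:
            result[key] = entry._replace(age=new_age)
    return result


cache = {
    b"path-hash-a": FilesCacheEntry(0, 11, 100, 1, 1, [(b"id1", 100)]),
    b"path-hash-b": FilesCacheEntry(0, 12, 200, 2, 2, [(b"id2", 200)]),
}
# File "b" was deleted from the source, so only "a" is seen from now on.
cache = age_entries(cache, seen_keys={b"path-hash-a"})  # b is kept with age 1
cache = age_entries(cache, seen_keys={b"path-hash-a"})  # b reaches the TTL and is dropped
```

With one files cache per archive series, an entry that is not seen in a run usually belongs to a file that was removed, which is why a small TTL is sufficient here.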

diff --git a/docs/usage/general/environment.rst.inc b/docs/usage/general/environment.rst.inc
@@ -66,8 +66,7 @@ General:
         cache entries for backup sources other than the current sources.
     BORG_FILES_CACHE_TTL
         When set to a numeric value, this determines the maximum "time to live" for the files cache
-        entries (default: 20). The files cache is used to determine quickly whether a file is unchanged.
-        The FAQ explains this more detailed in: :ref:`always_chunking`
+        entries (default: 2). The files cache is used to determine quickly whether a file is unchanged.
     BORG_USE_CHUNKS_ARCHIVE
         When set to no (default: yes), the ``chunks.archive.d`` folder will not be used. This reduces
         disk space usage but slows down cache resyncs.
@@ -85,15 +84,6 @@ General:
         - ``pyfuse3``: only try to load pyfuse3
         - ``llfuse``: only try to load llfuse
         - ``none``: do not try to load an implementation
-    BORG_CACHE_IMPL
-        Choose the implementation for the clientside cache, choose one of:
-
-        - ``adhoc``: builds a non-persistent chunks cache by querying the repo. Chunks cache contents
-          are somewhat sloppy for already existing chunks, concerning their refcount ("infinite") and
-          size (0). No files cache (slow, will chunk all input files). DEPRECATED.
-        - ``adhocwithfiles``: Like ``adhoc``, but with a persistent files cache. Default implementation.
-        - ``cli``: Determine the cache implementation from cli options. Without special options, will
-          usually end up with the ``local`` implementation.
     BORG_SELFTEST
         This can be used to influence borg's builtin self-tests. The default is to execute the tests
         at the beginning of each borg command invocation.

diff --git a/src/borg/archive.py b/src/borg/archive.py
@@ -1345,7 +1345,7 @@ def process_file(self, *, path, parent_fd, name, st, cache, flags=flags_normal,
                         item.chunks.append(chunk_entry)
                 else:  # normal case, no "2nd+" hardlink
                     if not is_special_file:
-                        hashed_path = safe_encode(os.path.join(self.cwd, path))
+                        hashed_path = safe_encode(item.path)  # path as in archive item!
                         started_hashing = time.monotonic()
                         path_hash = self.key.id_hash(hashed_path)
                         self.stats.hashing_time += time.monotonic() - started_hashing

diff --git a/src/borg/archiver/_common.py b/src/borg/archiver/_common.py
@@ -161,13 +161,12 @@ def wrapper(self, args, **kwargs):
                     if "compression" in args:
                         manifest_.repo_objs.compressor = args.compression.compressor
                     if secure:
-                        assert_secure(repository, manifest_, self.lock_wait)
+                        assert_secure(repository, manifest_)
                 if cache:
                     with Cache(
                         repository,
                         manifest_,
                         progress=getattr(args, "progress", False),
-                        lock_wait=self.lock_wait,
                         cache_mode=getattr(args, "files_cache_mode", FILES_CACHE_MODE_DISABLED),
                         iec=getattr(args, "iec", False),
                     ) as cache_:
@@ -230,15 +229,14 @@ def wrapper(self, args, **kwargs):
                     manifest_ = Manifest.load(
                         repository, compatibility, ro_cls=RepoObj if repository.version > 1 else RepoObj1
                     )
-                    assert_secure(repository, manifest_, self.lock_wait)
+                    assert_secure(repository, manifest_)
                     if manifest:
                         kwargs["other_manifest"] = manifest_
                 if cache:
                     with Cache(
                         repository,
                         manifest_,
                         progress=False,
-                        lock_wait=self.lock_wait,
                         cache_mode=getattr(args, "files_cache_mode", FILES_CACHE_MODE_DISABLED),
                         iec=getattr(args, "iec", False),
                     ) as cache_:

diff --git a/src/borg/archiver/create_cmd.py b/src/borg/archiver/create_cmd.py
@@ -222,10 +222,9 @@ def create_inner(archive, cache, fso):
                 repository,
                 manifest,
                 progress=args.progress,
-                lock_wait=self.lock_wait,
-                prefer_adhoc_cache=args.prefer_adhoc_cache,
                 cache_mode=args.files_cache_mode,
                 iec=args.iec,
+                archive_name=args.name,
             ) as cache:
                 archive = Archive(
                     manifest,
@@ -787,12 +786,6 @@ def build_parser_create(self, subparsers, common_parser, mid_common_parser):
             help="only display items with the given status characters (see description)",
         )
         subparser.add_argument("--json", action="store_true", help="output stats as JSON. Implies ``--stats``.")
-        subparser.add_argument(
-            "--prefer-adhoc-cache",
-            dest="prefer_adhoc_cache",
-            action="store_true",
-            help="experimental: prefer AdHocCache (w/o files cache) over AdHocWithFilesCache (with files cache).",
-        )
         subparser.add_argument(
             "--stdin-name",
             metavar="NAME",

diff --git a/src/borg/archiver/list_cmd.py b/src/borg/archiver/list_cmd.py
@@ -37,7 +37,7 @@ def _list_inner(cache):
 
         # Only load the cache if it will be used
         if ItemFormatter.format_needs_cache(format):
-            with Cache(repository, manifest, lock_wait=self.lock_wait) as cache:
+            with Cache(repository, manifest) as cache:
                 _list_inner(cache)
         else:
             _list_inner(cache=None)

diff --git a/src/borg/archiver/prune_cmd.py b/src/borg/archiver/prune_cmd.py
@@ -111,7 +111,7 @@ def do_prune(self, args, repository, manifest):
                 keep += prune_split(archives, rule, num, kept_because)
 
         to_delete = set(archives) - set(keep)
-        with Cache(repository, manifest, lock_wait=self.lock_wait, iec=args.iec) as cache:
+        with Cache(repository, manifest, iec=args.iec) as cache:
             list_logger = logging.getLogger("borg.output.list")
             # set up counters for the progress display
             to_delete_len = len(to_delete)