file_known_and_unchanged is wasteful and should be converted to inodes #909

verygreen · 2016-04-15T01:38:01Z

Moved from #908
I guess another change is to turn cache.file_known_and_unchanged into just inode_known_and_unchanged.
Who cares what's the actual file name, since it's always stored in the archive anyway I believe?
Additional benefit, the "files_cache" huge file in .cache might become smaller as it no longer needs to hold path names.

This poses some difficulties for filesystems without inode numbers like vfat and on Windows, as mentioned by @ThomasWaldmann

Windows (assuming they don't have inodes - I don't know much about ntfs) - some other unique identifier might be used, I guess. For filesystems that have no symlinks (like vfat) - file path might be unique enough. Still some races with content changing underneath, but at least nothing security critical (also hopefully nobody resurrects umsdos fs) (it's tempting to use filesize + create date (+modification date?) as unique id for vfat to handle renames more optimally there, but that might not be safe).
On nfs there's "file handle" that's unique (no idea if it's easily visible from userspace).
fuse and friends are harder, I guess.
ntfs has symlinks and hardlinks, so they must have have unique file ids similar to inode numbers of some sort.
These strategies could be determined by filesystem type (available from statfs(2) and could be called once we cross a mountpoint boundary assuming this is enabled, also separately checked for every path).
It should be noted that since inodes are not unique outside of a normal filesystem, really inode in the table should be some combination of actual inode number and filesystem id (similar to how nfs creates file handles). Filesystem id might be anything from mount path (unsafe) to underlying block device id (unsafe) to fs uuid (not bullet proof either) to some other options.

Using file path as generic fallback for inode number for cases when inode numbers are ignored for other reasons (commandline switch) is also possible, though it should be marked as "potentially suboptimal".

ThomasWaldmann · 2016-11-08T04:52:42Z

Currently, a mapping is used from key path_hash -> value FileCacheEntry (size, mtime, inode, age, chunk_ids).

key: path_hash = self.key.id_hash(safe_encode(os.path.join(self.cwd, path)))

So, the key hash could be just computed from different input data or get replaced by a tuple.

Algorithm:

if has_stable_inodes(path, st):
    key = (st.st_dev, st.st_ino)
else:
    key = id_hash(path)  # fallback, see orig. code

enkore · 2016-11-08T21:55:45Z

A hash isn't even necessary - we can just pack the device number and the inode number in 16 bytes(!) each. But since we in fact don't even have the requirement of len(key) == 32 here, like we do for objects, we could just as well use (dev, ino) as the key directly.

Maybe that even uses less memory (but certainly less storage when msgpacked) than the 32 byte string. Edit: Uh, well maybe not :D

>>> sys.getsizeof((2065, 557311989))
64
>>> sys.getsizeof(2065)
28

= 120 bytes per key. While bytes(32).

>>> sys.getsizeof(bytes(32))
65

Storage, however:

>>> len(msgpack.packb((2065, 557311989)))
9

But we can combine both: Store tuples in the file = smaller storage. When reading the file, keys that are tuples are converted to a 32 byte string (= same memory footprint). When writing the file, the reverse is performed. This works, because when writing we know we want to operate in use_inodes_for_cache mode, and when reading, the file just tells us.

This is is almost compatible: Old cold doesn't touch it. It would just insert all the tuples as keys. The only problem is that msgpack deserializes tuples as lists, which are mutable, which are therefore not hashable, and therefore don't qualify as keys. Hm.

ThomasWaldmann · 2016-11-09T00:23:49Z

Trying to list filesystems with known-stable inode numbers (white list):

all native linux, local filesystems(?): ext2/3/4, btrfs, xfs, zfs, ...
cifs with unix extensions and "serverino" option

Trying to list filesystems with known-unstable inode numbers (black list):

non-linux/unix filesystems: fat, ntfs
FUSE filesystems: ntfs, sshfs
network filesystems: smbfs / cifs (w/o serverino), nfs

ThomasWaldmann · 2016-11-13T01:22:49Z

Crap, os.statvfs() does not return f_type.

Calling stat(v)fs ourselves seem to require some platform-specific code as it is slightly different on different platforms (data type of fs type, include files). Also, we would have to classify a lot of filesystems.

Maybe just let the user specify whether inodes shall be considered stable? We already have a commandline option to ignore inodes (used when they are unstable or unsupported) - add another one like --use-inodes to trigger that borg uses (st_dev, st_ino) instead of the fs item path as key in the files cache?

enkore · 2016-11-13T10:39:04Z

Yes I think the latter would be a better approach to this. Only some 'special' cases profit from this (but if they do, then quite a lot I suspect); hardlinked files in the same archive are already indexed by (dev, ino) anyway.

ThomasWaldmann · 2016-11-13T12:58:24Z

#1842 (comment) see there.

enkore · 2016-11-13T13:01:53Z

Re. compatibility with older Borg:

in ino-mode, truncate/leave files (this is valid). Store ino files cache as files.ino or so. No problems. This also needs next to no code changes in the loading/saving logic, only the filename has to be changed depending on mode.

With some added code the cache could be automatically online, as files are seen during backups, as well.

ThomasWaldmann · 2016-11-13T13:15:24Z

The code will dynamically switch between modes, so we need both in parallel.
Guess nothing bad will happen with old borg. The keys are bytes in both cases and they are both only computed in a one-way manner.

ThomasWaldmann · 2016-11-25T00:58:03Z

#1842 currently blocked by some C/Cython problem (access fsid) - can anybody help with that?

ThomasWaldmann · 2017-01-22T02:57:34Z

anyone?

ThomasWaldmann · 2024-09-17T17:12:10Z

Small related update:

borg2 will archive the inode number of files now.
And there is the idea in improve files cache #8385.

verygreen mentioned this issue Apr 15, 2016

Races between readdir/getdents64 and name access. #908

Closed

enkore added the filesystem label Oct 23, 2016

enkore mentioned this issue Oct 23, 2016

rsnapshot / rsync+hardlinks importer #1754

Closed

enkore mentioned this issue Nov 13, 2016

WIP: fs type detection for linux, dev/inode cache keys #1842

Closed

ThomasWaldmann mentioned this issue Nov 19, 2016

mtime cache misbehaving on nested backups #1860

Closed

ThomasWaldmann added the help wanted label Nov 25, 2016

enkore added the c: cache label Jul 23, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

file_known_and_unchanged is wasteful and should be converted to inodes #909

file_known_and_unchanged is wasteful and should be converted to inodes #909

verygreen commented Apr 15, 2016

ThomasWaldmann commented Nov 8, 2016 •

edited

Loading

enkore commented Nov 8, 2016 •

edited

Loading

ThomasWaldmann commented Nov 9, 2016 •

edited

Loading

ThomasWaldmann commented Nov 13, 2016 •

edited

Loading

enkore commented Nov 13, 2016 •

edited

Loading

ThomasWaldmann commented Nov 13, 2016

enkore commented Nov 13, 2016 •

edited

Loading

ThomasWaldmann commented Nov 13, 2016

ThomasWaldmann commented Nov 25, 2016

ThomasWaldmann commented Jan 22, 2017

ThomasWaldmann commented Sep 17, 2024

file_known_and_unchanged is wasteful and should be converted to inodes #909

file_known_and_unchanged is wasteful and should be converted to inodes #909

Comments

verygreen commented Apr 15, 2016

ThomasWaldmann commented Nov 8, 2016 • edited Loading

enkore commented Nov 8, 2016 • edited Loading

ThomasWaldmann commented Nov 9, 2016 • edited Loading

ThomasWaldmann commented Nov 13, 2016 • edited Loading

enkore commented Nov 13, 2016 • edited Loading

ThomasWaldmann commented Nov 13, 2016

enkore commented Nov 13, 2016 • edited Loading

ThomasWaldmann commented Nov 13, 2016

ThomasWaldmann commented Nov 25, 2016

ThomasWaldmann commented Jan 22, 2017

ThomasWaldmann commented Sep 17, 2024

ThomasWaldmann commented Nov 8, 2016 •

edited

Loading

enkore commented Nov 8, 2016 •

edited

Loading

ThomasWaldmann commented Nov 9, 2016 •

edited

Loading

ThomasWaldmann commented Nov 13, 2016 •

edited

Loading

enkore commented Nov 13, 2016 •

edited

Loading

enkore commented Nov 13, 2016 •

edited

Loading