This tutorial explains how to find and remove duplicate files using plain GNU tools. It only relies on bash and common GNU utilities: parallel, sha256sum, cat, find, sed, uniq, sponge, grep, tree and sort should be installed.
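Before installing anything, you can check which of these tools are already present (a minimal sketch, any missing name is printed):
$> for tool in cat find sed uniq sponge grep tree sort parallel sha256sum; do command -v "$tool" >/dev/null || echo "missing: $tool"; done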
Install sha256sum (part of coreutils), tree and moreutils (which provides sponge):
$> sudo apt-get install -y coreutils tree moreutils
Install parallel:
see parallel.adoc
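To confirm the installation, you can print the versions; both GNU parallel and coreutils' sha256sum support --version:
$> parallel --version | head -n 1
$> sha256sum --version | head -n 1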
To search for duplicates, I rely on the sha256sum hash function. So the first thing to do is to compute the hash of all files. It may take a while.
# compute hash for files
$> find ./ -type f ! -wholename "*/@eaDir/*" ! -name "Thumbs.db" | parallel "sha256sum {} >> fingerprints.txt"
# note: @eaDir and Thumbs.db files are not processed
# you can repeat the hash step with multiple paths for find, as long as the root path is the same (or use absolute paths).
# optionally exclude some files a posteriori
$> grep -v "selection" fingerprints.txt | sponge fingerprints.txt
# At the end, make sure you have no duplicate entries that could pollute the following steps.
# duplicate entries might happen if you run the same command twice.
$> sort --unique fingerprints.txt -o fingerprints.txt # removes duplicated lines left by multiple runs
$> sort -k2 fingerprints.txt -o fingerprints.txt # sort by filename, just for convenience
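As a quick sanity check, the number of lines in fingerprints.txt should roughly match the number of files found (the counts may differ by the entries you excluded a posteriori and by fingerprints.txt itself):
$> wc -l fingerprints.txt
$> find ./ -type f ! -wholename "*/@eaDir/*" ! -name "Thumbs.db" | wc -l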
fingerprints.txt
should look like:
ee5954355d52d5fdde425ab37adb2247cad3c8492966005631ee04fda2d5f188  ./mydir/tmp.47gcuzDvp2
0191a0f48a8a0d4f1db3e906c54d4f0b815da7dd7eea3abb9682265b8aef5e8c  ./mydir/tmp.8UTz5kGOiZ
abf61dc38ed225c2c2d71afc6f093e120289c1ca5bdb10b70b195b6891560496  ./mydir/tmp.BlgDxx3yAj
87c67bbc3c3b16f7ad8f8db6d912df5739efea9be76218f5d6a9d38524b34724  ./mydir/tmp.dzCsVWATqC
9cd05f25f126936ae1e6ac487021033fd8c02760c48eac35af07a66755e4d684  ./mydir/tmp.E1dIRUDPLf
78fb97104bfc022e2d3f8bd40193ff0ede37fd3a398a0e2ac0947fbfc6627aaf  ./mydir/tmp.FBINu8gnGL
8f04793f9be7bb502d33cec17a9aad557d84f0a194c48bc7530c845a73c541eb  ./mydir/tmp.foNntc262d
8f04793f9be7bb502d33cec17a9aad557d84f0a194c48bc7530c845a73c541eb  ./mydir/tmp.H24yD9WWzP
e458588262911df9a6bf9be66bc93955db4a396b42c9d180d7387eafa29cf987  ./mydir/tmp.kpg3eqAzUi
6a8eaa536cc65fc6490e6b743e722647037b2e57e6461c875a76539685e167a5  ./mydir/tmp.MjEENkA9rM
98bd06465977efdb4026b198c0f81dd4565754c5705cb57e82c2733b452464bd  ./mydir/tmp.nYU8iWL1MY
1149788a2cc8097bf17bbcbd9a5aa56d08e4b4ea669279259fe40a5a08d37970  ./mydir/tmp.Q6bW1mVC4J
8af7a327a438d22a8158eac434c37e16fd70ffa326a2d520f52b819dc07a1b48  ./mydir/tmp.sgprKjAw1B
d38407d55361bc28decb6e8d5f42587aa04651772342ceb906fd08d687ff5653  ./mydir/tmp.uMXMd32oLk
5b07fc1686009a66c650982c0a6363f69301c5369f2dd87ec930eee583aa2ebc  ./mydir/tmp.WI4RxZeymC
8f04793f9be7bb502d33cec17a9aad557d84f0a194c48bc7530c845a73c541eb  ./mydir/tmp.Xdr3qwyAu9
f2d922da9e89fa59ee4561cadf2b0b809e18ac01360e8ea53d67d643e4e645bb  ./mydir/tmp.ziiDYzWAG2
e458588262911df9a6bf9be66bc93955db4a396b42c9d180d7387eafa29cf987  ./mydir/tmp.zx5eYf4cgK
ee5954355d52d5fdde425ab37adb2247cad3c8492966005631ee04fda2d5f188  ./mydir/tmp.zXo79fQtSm
Once the fingerprints.txt index file is built, you can look for duplication. In the end, duplicates.txt is just the subset of fingerprints.txt containing all occurrences of duplicated files.
$> cat fingerprints.txt | sort -k1,1 | uniq -D --check-chars=65 > duplicates.txt
8f04793f9be7bb502d33cec17a9aad557d84f0a194c48bc7530c845a73c541eb  ./mydir/tmp.foNntc262d
8f04793f9be7bb502d33cec17a9aad557d84f0a194c48bc7530c845a73c541eb  ./mydir/tmp.H24yD9WWzP
8f04793f9be7bb502d33cec17a9aad557d84f0a194c48bc7530c845a73c541eb  ./mydir/tmp.Xdr3qwyAu9
e458588262911df9a6bf9be66bc93955db4a396b42c9d180d7387eafa29cf987  ./mydir/tmp.kpg3eqAzUi
e458588262911df9a6bf9be66bc93955db4a396b42c9d180d7387eafa29cf987  ./mydir/tmp.zx5eYf4cgK
ee5954355d52d5fdde425ab37adb2247cad3c8492966005631ee04fda2d5f188  ./mydir/tmp.47gcuzDvp2
ee5954355d52d5fdde425ab37adb2247cad3c8492966005631ee04fda2d5f188  ./mydir/tmp.zXo79fQtSm
You can inspect the number of duplicate files using the following:
$> cat duplicates.txt | sort -k1 | uniq --repeated --check-chars=65 --count | sort -rn -k1 | head -n 20
      3 8f04793f9be7bb502d33cec17a9aad557d84f0a194c48bc7530c845a73c541eb  ./mydir/tmp.foNntc262d
      2 ee5954355d52d5fdde425ab37adb2247cad3c8492966005631ee04fda2d5f188  ./mydir/tmp.47gcuzDvp2
      2 e458588262911df9a6bf9be66bc93955db4a396b42c9d180d7387eafa29cf987  ./mydir/tmp.kpg3eqAzUi
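To eyeball a single group before deleting anything, you can grep duplicates.txt for its hash; the prefix below is taken from the sample output above:
$> grep "^8f04793f" duplicates.txt | cut -c67-
./mydir/tmp.foNntc262d
./mydir/tmp.H24yD9WWzP
./mydir/tmp.Xdr3qwyAu9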
Now, we will extract one occurrence of each duplicated file from duplicates.txt. A single file path per hash will then be stored in suppression.txt. If more than two copies of a file are present, redo the suppression.txt and clean-up steps.
# optionally, reverse-sort the entries within each duplicate group (handy if names are dates).
$> cat duplicates.txt | sort -k1,1 -k2r | sponge duplicates.txt
# this is the one: select the first entry of each duplicate group for deletion
$> cat duplicates.txt | uniq --repeated --check-chars=65 | cut -c67- > suppression.txt
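Before moving anything, it can be worth checking how many files are about to be suppressed and roughly how much space they use (a sketch; du -b reports the apparent size of each file):
$> wc -l suppression.txt
$> parallel -a suppression.txt du -b | awk '{sum+=$1} END {printf "%.1f MiB\n", sum/1024/1024}'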
Move the files listed in suppression.txt to a dedicated trash directory with:
$> mkdir trash
$> parallel mv {} trash/{/} :::: suppression.txt
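Note that {/} is the basename, so two files with the same name coming from different directories would collide in trash/. If that can happen in your tree, mv's no-clobber option leaves the second copy in place instead of overwriting it (a skipped file simply stays where it is and can be handled in a later pass):
$> parallel mv -n {} trash/{/} :::: suppression.txt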
You can inspect what's in trash before deleting:
$> du -csh trash # see how much space is gained
$> find ./trash -type f ! -wholename "*/@eaDir/*" ! -name "Thumbs.db" | parallel "sha256sum {} >> trash/fingerprints_trash.txt"
$> find ./ -type f ! -wholename "./trash/*" | parallel "sha256sum {} >> trash/fingerprints_keep.txt"
$> cut -c1-64 trash/fingerprints_trash.txt > trash/sha256_trash.txt
$> cut -c1-64 trash/fingerprints_keep.txt > trash/sha256_keep.txt
$> fgrep -x -f trash/sha256_trash.txt trash/sha256_keep.txt > trash/sha256_safe.txt
$> fgrep -x -v -f trash/sha256_safe.txt trash/sha256_trash.txt > trash/sha256_lost.txt
# trash/sha256_lost.txt lists the sha256 of lost files (files with no remaining copy outside trash)
$> fgrep -f trash/sha256_lost.txt trash/fingerprints_trash.txt | cut -c67- # display file paths of lost files
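# optional guard (a sketch): only run the deletion below when trash/sha256_lost.txt is empty
$> [[ -s trash/sha256_lost.txt ]] && echo "WARNING: some files in trash have no copy left outside, do not delete"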
$> rm -rf trash suppression.txt duplicates.txt
# filter out files that do not exist anymore
$> cat fingerprints.txt | parallel --plus "[[ -f {= s/^.{66}// =} ]] && echo {}" | sponge fingerprints.txt
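Finally, you can re-run the duplicate search on the refreshed index to see whether another pass is needed (groups that originally had more than two copies will still show up):
$> sort -k1,1 fingerprints.txt | uniq -D --check-chars=65 | wc -l # 0 means no duplicates remain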