File: finding_duplicate_files.mdwn

Maybe you had a lot of files scattered around on different drives, and you
added them all into a single git-annex repository. Some of the files are
surely duplicates of others.

While git-annex stores the file contents efficiently, it would still
help in cleaning up this mess if you could find, and perhaps remove,
the duplicate files.

Here's a command line that will show duplicate sets of files grouped together.
It splits fields on whitespace, so duplicates whose file names contain spaces
may be missed:

	git annex find --include '*' --format='${file} ${escaped_key}\n' | \
		sort -k2 | uniq --all-repeated=separate -f1 | \
		sed 's/ [^ ]*$//'
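
To see what each stage of that pipeline does without touching a repository, here is a minimal sketch that feeds the same `sort | uniq | sed` stages with made-up input. The function name `group_duplicates`, the file names, and the keys are all hypothetical, and GNU `uniq` is assumed for `--all-repeated`.

```shell
# Sketch only: the grouping stages of the pipeline, wrapped in a
# hypothetical helper so they can be tried on sample data.
group_duplicates() {
    # 1. sort on field 2 (the key) so identical keys become adjacent
    # 2. keep only lines whose key repeats, blank line between sets
    # 3. strip the trailing key, leaving just the file names
    LC_ALL=C sort -k2 | uniq --all-repeated=separate -f1 | sed 's/ [^ ]*$//'
}

# Made-up lines in the same "${file} ${escaped_key}" layout.
printf '%s\n' \
    'a.jpg SHA256E-s10--aaa.jpg' \
    'b.txt SHA256E-s20--bbb.txt' \
    'copy/a.jpg SHA256E-s10--aaa.jpg' \
    'c.txt SHA256E-s30--ccc.txt' \
    'old/b.txt SHA256E-s20--bbb.txt' | group_duplicates
```

On this sample it prints `a.jpg` and `copy/a.jpg`, a blank line, then `b.txt` and `old/b.txt`. `c.txt` has a unique key and is dropped.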

Here's a command line that will remove one of each duplicate set of files:

	git annex find --include '*' --format='${file} ${escaped_key}\n' | \
		sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
		xargs -d '\n' git rm
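
Since that command deletes files, it may be worth previewing the selection first. The sketch below isolates the selection stages in a hypothetical helper, `pick_one_per_set`, and runs them on made-up file names and keys; GNU `uniq` is assumed. The commented-out line shows one way to preview against a real repository using `git rm --dry-run`, which lists what would be removed without removing it.

```shell
# Sketch only: the selection stages of the removal pipeline, wrapped in a
# hypothetical helper so the choice can be inspected before anything is deleted.
pick_one_per_set() {
    # uniq --repeated prints a single line for each key that occurs
    # more than once, so one file per duplicate set is selected.
    LC_ALL=C sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//'
}

# Made-up input: three copies of one file, two of another, one unique file.
printf '%s\n' \
    'report.pdf SHA256E-s10--aaa.pdf' \
    'notes.txt SHA256E-s20--bbb.txt' \
    'backup/report.pdf SHA256E-s10--aaa.pdf' \
    '2019/report.pdf SHA256E-s10--aaa.pdf' \
    'tmp/notes.txt SHA256E-s20--bbb.txt' \
    'unique.dat SHA256E-s30--ccc.dat' | pick_one_per_set

# Preview against a real repository (nothing is deleted):
#   git annex find --include '*' --format='${file} ${escaped_key}\n' | \
#       pick_one_per_set | xargs -d '\n' git rm --dry-run
```

The sample selects `2019/report.pdf` and `notes.txt`. Only one of the three `report.pdf` copies is picked, so a set with more than two copies still contains duplicates after a single pass.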

--[[Joey]]