findimagedupes 0.1.3 Copyright 2001 Rob Kudla <email@example.com>
This program is distributed under the GNU General Public License;
see the file COPYING for details. (Note especially the part about
NO WARRANTY. You run findimagedupes at your own risk.)
findimagedupes is a crude command-line utility for finding visually
similar images. With it you can compare two images and report the
percentage of similarity, or compare an entire tree, reporting all
likely duplicates based on a percentage of similarity you select. It
optionally exports GQView collection files.
It can handle all image types understood by ImageMagick, but currently
is limited to those types recognized as images/bitmaps by whatever
version of "file" is installed on your system. This is due to
PerlMagick's desire to treat every imaginable file type as an image,
with tragic results.
I have nothing to do with GQView development myself, but the
collections that findimagedupes creates work properly in GQView 0.91.
This functionality is provided so that there's some easy way to
visually compare duplicates; using GQView you can quickly delete the
poorer quality duplicate. I plan to write a GUI front-end as I did
with kcdfind, which in this case will allow you to view and manage
duplicates as they're found. Ultimately, this functionality should be
integrated into image management programs like GQView and Pixie. (I
think it's on GQView's todo list already.)
The algorithm is monkeys-with-typewriters simple: reduce images to a
uniform size, shape and palette; expand the histogram as much as
possible; reduce further to a 16x16x1 bitmap; compare each pair of
pictures and count the number of bits in an identical state in both.
The algorithm is (hopefully) commented usefully in the code.
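The steps above can be sketched in a few lines. This is not the code from
findimagedupes itself (which uses PerlMagick); it is a minimal Python
illustration that assumes the image has already been loaded as a grid of
grayscale values from 0 to 255. The names fingerprint and similarity are
made up for the example, and the midpoint threshold is an assumption about
how the final 1-bit reduction might be done.

```python
def fingerprint(gray, size=16):
    """Reduce a grayscale image (list of rows of 0-255 ints) to size*size bits."""
    h, w = len(gray), len(gray[0])
    cells = []
    # Block-average down to a size x size grid (uniform size and shape).
    for cy in range(size):
        y0 = cy * h // size
        y1 = max((cy + 1) * h // size, y0 + 1)
        for cx in range(size):
            x0 = cx * w // size
            x1 = max((cx + 1) * w // size, x0 + 1)
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    # Expand the histogram: stretch the darkest cell to 0 and the brightest to 1.
    lo, hi = min(cells), max(cells)
    span = (hi - lo) or 1.0
    # Reduce to one bit per cell (assumed: threshold at the midpoint).
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if (v - lo) / span >= 0.5 else 0)
    return bits


def similarity(fp_a, fp_b, nbits=256):
    """Percentage of bits that are in an identical state in both fingerprints."""
    differing = bin(fp_a ^ fp_b).count("1")
    return 100.0 * (nbits - differing) / nbits
```

With a 16x16 grid each fingerprint is 256 bits (32 bytes), so two images
whose fingerprints differ in 25 bits come out at a little over 90% similar.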
Despite its crudeness, it seems to be something like 98% accurate on
most common images (people, animals, cartoons, pr0n, etc.). It does
spectacularly poorly on images with minuscule differences, like 10
different shots of a sunset over the ocean, or the two presidential
candidates in the last US election.
I've recently added a "GUI mode" which should allow other programs
to interface with findimagedupes. In theory it prints nothing except
status information and a notice whenever it finds dupes. Status lines
are formatted as:
Status::<number of current image>::<total number of images>::<percent>
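A front-end reading this output might pick the status lines apart as in the
sketch below. The function name parse_status is hypothetical, and the code
assumes the three fields are plain numbers (with an optional trailing % on
the last one), which this README does not actually promise. Lines in any
other shape, including the dupe notices whose format is not described here,
are simply passed over.

```python
def parse_status(line):
    """Return (current, total, percent) for a Status line, or None otherwise."""
    parts = line.strip().split("::")
    if len(parts) != 4 or parts[0] != "Status":
        return None  # not a status line (e.g. a dupe notice)
    try:
        current = int(parts[1])
        total = int(parts[2])
        # Assumption: percent is numeric, possibly followed by a % sign.
        percent = float(parts[3].rstrip("%"))
    except ValueError:
        return None
    return current, total, percent
```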
This program seems stable, but should be considered a development
release. In particular, because every pair of images is compared, the
processing time rises roughly with the square of the number of images
in a tree. I am not enough of a math or data
structures guy to know how to speed up the binary comparison and bit
counting of, say, 20,000 elements of 32 bytes of data, all resident in
memory. (PerlMagick also seems to not let go of its memory when you
undef images out of a set.) Suggestions would be welcome.
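One cheap way to cut down the number of comparisons, along the lines of the
patch credited in the 0.1.3 changelog entry, is to count the 1 bits of each
fingerprint once and skip any pair whose counts are too far apart to reach
the threshold. The sketch below is an illustration in Python, not the
patch itself; find_dupes and its arguments are invented for the example,
and fingerprints are assumed to be held as integers.

```python
def find_dupes(fps, threshold=90.0, nbits=256):
    """fps maps name -> integer fingerprint; returns (name_a, name_b, percent)."""
    # Largest number of differing bits that still meets the threshold.
    max_diff = int(nbits * (100.0 - threshold) / 100.0)
    # Count the 1 bits once per image, then sort by that count.
    items = sorted((bin(fp).count("1"), name, fp) for name, fp in fps.items())
    found = []
    for i in range(len(items)):
        ones_a, name_a, fp_a = items[i]
        for j in range(i + 1, len(items)):
            ones_b, name_b, fp_b = items[j]
            # Two fingerprints differ in at least |ones_a - ones_b| bits, so
            # once the gap is too wide no later entry can match either.
            if ones_b - ones_a > max_diff:
                break
            diff = bin(fp_a ^ fp_b).count("1")
            if diff <= max_diff:
                found.append((name_a, name_b, 100.0 * (nbits - diff) / nbits))
    return found
```

The worst case is still every pair against every other (for example, when
all images have about the same number of 1 bits), but on a mixed collection
the inner loop usually stops early.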
perl - language this is written in
ImageMagick - library for manipulating images
PerlMagick (Image::Magick) - Perl interface to above
pwd, find, sort, tput (curses), file
(i.e. if this works right under NT I'd be surprised)
A bunch of pictures of which you've totally lost control
Non-standard packages required for this to run should be
available via my perl page at:
If you'd like, get GQview at:
Usage: findimagedupes [options] [<file1> <file2>]
-rescan = rescan fingerprints of all files in directory
-f <file> = use <file> as image fingerprint database
-d <dir> = scan <dir> instead of current directory
-t <num> = use <num> as threshold% of similarity (default 90)
-v <program> = launch <program> (in bg) to view each set of dupes
-c <file> = create GQView collection <file>.gqv of duplicates
<file1> <file2> = diff just those two files, using -v if present
(other options ignored if files are specified)
-p = only valid when files specified; prints the
hex of the actual fingerprint of each file.
-g = GUI mode: produce only machine-friendly output.
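For anyone driving findimagedupes from another program, the options above
could be assembled into an argument list as in this sketch. The helper
build_command and the example paths are hypothetical; only the flags
themselves come from the usage text. The order in which flags are emitted
simply follows the listing, and whether the program cares about order is
not something this README states.

```python
def build_command(directory=None, threshold=None, collection=None,
                  database=None, rescan=False, gui=False):
    """Build an argv list for a tree scan using the documented options."""
    cmd = ["findimagedupes"]
    if rescan:
        cmd.append("-rescan")              # rescan all fingerprints
    if database:
        cmd += ["-f", database]            # fingerprint database file
    if directory:
        cmd += ["-d", directory]           # scan this directory
    if threshold is not None:
        cmd += ["-t", str(threshold)]      # similarity threshold in percent
    if collection:
        cmd += ["-c", collection]          # writes <collection>.gqv
    if gui:
        cmd.append("-g")                   # machine-friendly output
    return cmd
```

For example, build_command(directory="/photos", threshold=95,
collection="dupes") gives a list that can be handed to subprocess.run to
scan /photos at a 95% threshold and write dupes.gqv.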
0.1.3 11 February 2001
Due to a problem with ImageMagick 5.2.3 on my machine,
now uses raw 'mono' format for thumbprints instead
of 'pbm'. This returns identical data minus the
header stuff we were discarding anyway.
Applied patch from Max Stekelenburg:
Program would sometimes try to scan text files due
to a misplaced if.
Stomped on some uninitialized value warnings.
Applied patch from Paul Cassella:
Improved performance by only comparing images whose
number of 1 bits would allow them to fall within
the threshold.
Sped up bit counting routine using pack once per
image rather than a loop per comparison.
Eliminated non-regular-files from the scan.
Stomped on some uninitialized value warnings.
Bugfix: Will again rescan automatically if db file not
present or is zero-length.
Added -g option, "GUI mode", to format output for use
by GUIs and other programs.
0.1.2 30 September 2000
Changed algorithm to use an 8x8x8 thumbprint instead of
16x16x1. It failed disastrously, so changed back.
Cleaned up the code a little for public release.
Minor bug fixes I didn't bother to document.
First public release.
0.1.1 24 September 2000
Added "compare 2 files at a time".
Added GQview collection generation.
Fixed bug in binary comparison; huge jump in accuracy.
Generalized (sorta) the thumbprint generation routine.
0.1.0 17 September 2000
First working version, perhaps 75% accurate.