File: README-findimagedupes_0.1.3

findimagedupes 0.1.3 Copyright 2001 Rob Kudla <webmaster@kudla.org>

This program is distributed under the GNU General Public License;
see the file COPYING for details.  (Note especially the part about
NO WARRANTY.  You run findimagedupes at your own risk.)

findimagedupes is a crude command-line utility for finding visually
similar images.  With it you can compare two images and report the
percentage of similarity, or compare an entire tree, reporting all
likely duplicates based on a percentage of similarity you select.  It
optionally exports GQView collection files.  

It can handle all image types understood by ImageMagick, but currently
is limited to those types recognized as images/bitmaps by whatever
version of "file" is installed on your system.  This is due to
PerlMagick's desire to treat every imaginable file type as an image,
with tragic results.

I have nothing to do with GQView development myself, but the
collections that findimagedupes creates work properly in GQView 0.91.
This functionality is provided so that there's some easy way to
visually compare duplicates; using GQView you can quickly delete the
poorer quality duplicate.  I plan to write a GUI front-end as I did
with kcdfind, which in this case will allow you to view and manage
duplicates as they're found.  Ultimately, this functionality should be
integrated into image management programs like GQView and Pixie.  (I
think it's on GQView's todo list already.)

The algorithm is monkeys-with-typewriters simple: reduce images to a
uniform size, shape and palette; expand the histogram as much as
possible; reduce further to a 16x16x1 bitmap; compare each pair of
pictures and count the number of bits in an identical state in both.
The algorithm is (hopefully) commented usefully in the code.
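
As an illustration of the final comparison step only, here is a
minimal Python sketch (findimagedupes itself is written in Perl, and
this is not its code).  It assumes each fingerprint is already
available as 32 raw bytes, i.e. a 16x16x1 bitmap; the function name
and the sample fingerprints are made up for the example.

```python
def similarity_percent(fp_a: bytes, fp_b: bytes) -> float:
    """Percentage of bit positions that hold the same value in both fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    total_bits = len(fp_a) * 8
    # XOR leaves a 1 bit wherever the two fingerprints differ.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return 100.0 * (total_bits - differing) / total_bits


# Two hypothetical 32-byte fingerprints that differ only in their last byte.
fp1 = bytes([0b11110000] * 32)
fp2 = bytes([0b11110000] * 31 + [0b00001111])
print(similarity_percent(fp1, fp2))
```

Counting the differing bits and subtracting from the total gives the
same number as counting the identical bits directly.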

Despite its crudeness, it seems to be something like 98% accurate on
most common images (people, animals, cartoons, pr0n, etc.)  It does
spectacularly poorly on images with minuscule differences, like 10
different shots of a sunset over the ocean, or the two presidential
candidates in the last US election.

I've recently added a "GUI mode" which should allow other programs
to interface with findimagedupes.  In this mode it should print
nothing except status lines and a line for each pair of dupes it
finds, formatted like this:

Status::<number of current image>::<total number of images>::<percent>
Dupe::<filename1>::<filename2>::<percent similarity>
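
A front-end could read these lines with something like the following
Python sketch.  It is only a sketch under stated assumptions: that
the numeric fields are plain numbers (a trailing "%" is tolerated),
and that filenames never contain "::".  The type and function names
are invented for the example.

```python
from typing import NamedTuple, Optional, Union


class Status(NamedTuple):
    current: int
    total: int
    percent: float


class Dupe(NamedTuple):
    file1: str
    file2: str
    similarity: float


def parse_gui_line(line: str) -> Optional[Union[Status, Dupe]]:
    """Parse one line of GUI-mode output; return None if it is not recognized."""
    fields = line.rstrip("\n").split("::")
    try:
        if fields[0] == "Status" and len(fields) == 4:
            return Status(int(fields[1]), int(fields[2]),
                          float(fields[3].rstrip("%")))
        if fields[0] == "Dupe" and len(fields) == 4:
            # Assumption: neither filename contains "::".
            return Dupe(fields[1], fields[2], float(fields[3].rstrip("%")))
    except ValueError:
        return None
    return None


# Made-up sample lines in the format described above.
print(parse_gui_line("Status::3::120::2"))
print(parse_gui_line("Dupe::a.jpg::b.jpg::97.3"))
```

Returning None for unrecognized lines lets the caller skip any stray
output instead of failing on it.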

This program seems stable, but should be considered a development
release.  In particular, the processing time rises roughly with the
square of the number of images in a tree, because images are compared
pairwise.  I am not enough of a math or data structures guy to know
how to speed up the binary comparison and bit counting of, say, 20,000
elements of 32 bytes of data, all resident in memory.  (PerlMagick
also seems not to let go of its memory when you undef images out of a
set.)  Suggestions would be welcome.
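
One way to cut down the number of comparisons, sketched here in
Python rather than Perl, is to count the 1 bits of each fingerprint
once and skip any pair whose counts differ by more than the threshold
allows: two fingerprints with p and q one-bits must differ in at
least |p - q| positions.  This is the same idea as the Paul Cassella
patch noted in the History, but the code is an illustration, not the
program's implementation; the function name and sample data are
hypothetical.

```python
def find_dupes(fingerprints, threshold=90.0):
    """Yield (name_a, name_b, percent) for pairs at or above the threshold.

    fingerprints maps a name to a bytes fingerprint; all must be the same length.
    """
    if not fingerprints:
        return
    total_bits = 8 * len(next(iter(fingerprints.values())))
    # Largest number of differing bits that still meets the threshold.
    max_diff = int(total_bits * (100.0 - threshold) / 100.0)

    # Convert each fingerprint once: (one-bit count, bits as int, name),
    # sorted by one-bit count.
    items = sorted(
        (bin(n).count("1"), n, name)
        for name, n in ((k, int.from_bytes(v, "big"))
                        for k, v in fingerprints.items())
    )
    for i, (count_a, bits_a, name_a) in enumerate(items):
        for count_b, bits_b, name_b in items[i + 1:]:
            # The list is sorted, so once the counts are too far apart
            # every later entry is too far apart as well.
            if count_b - count_a > max_diff:
                break
            diff = bin(bits_a ^ bits_b).count("1")
            if diff <= max_diff:
                yield name_a, name_b, 100.0 * (total_bits - diff) / total_bits


# Hypothetical 32-byte fingerprints: two near-identical, one all zeros.
fps = {
    "a.jpg": bytes([0xF0] * 32),
    "b.jpg": bytes([0xF0] * 31 + [0x0F]),
    "c.jpg": bytes(32),
}
for pair in find_dupes(fps):
    print(pair)
```

The worst case is still quadratic (for example, when every image has
the same number of 1 bits), so this only helps when the counts are
spread out.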

Requirements:
	perl - language this is written in
	ImageMagick - library for manipulating images
	PerlMagick (Image::Magick) - Perl interface to above
	pwd, find, sort, tput (curses), file
	   (i.e. if this works right under NT I'd be surprised)
	A bunch of pictures of which you've totally lost control

Non-standard packages required for this to run should be
available via my perl page at:
        http://www.kudla.org/raindog/perl

If you'd like, get GQview at:
	http://gqview.sourceforge.net

Usage: findimagedupes [options] [<file1> <file2>]
Options:
       -rescan         = rescan fingerprints of all files in directory
       -f <file>       = use <file> as image fingerprint database
       -d <dir>        = scan <dir> instead of current directory
       -t <num>        = use <num> as threshold% of similarity (default 90)
       -v <program>    = launch <program> (in bg) to view each set of dupes
       -c <file>       = create GQView collection <file>.gqv of duplicates
       <file1> <file2> = diff just those two files, using -v if present
                         (other options ignored if files are specified)
       -p              = only valid when files specified; prints the
                         hex of the actual fingerprint of each file.
       -g              = GUI mode: produce only machine-friendly output.

History:

0.1.3 11 February 2001
	Due to a problem with ImageMagick 5.2.3 on my machine,
           now uses raw 'mono' format for thumbprints instead 
           of 'pbm'.  This returns identical data minus the 
           header stuff we were discarding anyway.
        Applied patch from Max Stekelenburg:
           Program would sometimes try to scan text files due
              to a misplaced if.
           Stomped on some uninitialized value warnings.
	Applied patch from Paul Cassella:
           Improved performance by only comparing images whose
              number of 1 bits would allow them to fall within
              the threshold.
           Sped up bit counting routine using pack once per
              image rather than a loop per comparison.
           Eliminated non-regular-files from the scan.
           Stomped on some uninitialized value warnings.
        Bugfix: Will again rescan automatically if db file not 
           present or is zero-length.
        Added -g option, "GUI mode", to format output for use
           by GUIs and other programs.

0.1.2 30 September 2000
	Changed algorithm to use an 8x8x8 thumbprint instead of
	   16x16x1.  It failed disastrously, so changed back.
	Cleaned up the code a little for public release.
	Minor bug fixes I didn't bother to document.
	First public release.

0.1.1 24 September 2000
	Added "compare 2 files at a time".
	Added GQview collection generation.
	Fixed bug in binary comparison; huge jump in accuracy.
	Generalized (sorta) the thumbprint generation routine.

0.1.0 17 September 2000
	First working version, perhaps 75% accurate.