File: dupemap.1

.\" Automatically generated by Pod::Man 2.22 (Pod::Simple 3.07)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
.    de IX
.    tm Index:\\$1\t\\n%\t"\\$2"
..
.    nr % 0
.    rr F
.\}
.el \{\
.    de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "DUPEMAP 1"
.TH DUPEMAP 1 "2008-06-26" "1.1.9" "Magic Rescue"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
dupemap \- Creates a database of file checksums and uses it to eliminate
duplicates
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
\&\fBdupemap\fR [ \fIoptions\fR ] [ \fB\-d\fR \fIdatabase\fR ] \fIoperation\fR \fIpath...\fR
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\fBdupemap\fR recursively scans each \fIpath\fR to find checksums of file contents.
Directories are searched through in no particular order.  Its actions depend on
whether the \fB\-d\fR option is given, and on the \fIoperation\fR parameter, which
must be a comma-separated list of \fBscan\fR, \fBreport\fR, \fBdelete\fR:
.SS "Without \fB\-d\fP"
.IX Subsection "Without -d"
\&\fBdupemap\fR will take action when it sees the same checksum repeated more than
once, i.e. it simply finds duplicates recursively.  The action depends on
\&\fIoperation\fR:
.IP "\fBreport\fR" 7
.IX Item "report"
Report what files are encountered more than once, printing their names to
standard output.
.IP "\fBdelete\fR[\fB,report\fR]" 7
.IX Item "delete[,report]"
Delete files that are encountered more than once.  Print their names if
\&\fBreport\fR is also given.
.Sp
\&\fI\s-1WARNING:\s0\fR use the \fBreport\fR operation first to see what will be deleted.
.Sp
\&\fI\s-1WARNING:\s0\fR You are advised to make a backup of the target first, e.g. with
\&\f(CW\*(C`cp \-al\*(C'\fR (for \s-1GNU\s0 cp) to create hard links recursively.
.SS "With \fB\-d\fP"
.IX Subsection "With -d"
The \fIdatabase\fR argument to \fB\-d\fR will denote a database file (see the
\&\*(L"\s-1DATABASE\s0\*(R" section in this manual for details) to read from or write to.  In
this mode, the \fBscan\fR operation should be run on one \fIpath\fR, followed by the
\&\fBreport\fR or \fBdelete\fR operation on another (\fInot the same!\fR) \fIpath\fR.
.IP "\fBscan\fR" 7
.IX Item "scan"
Add the checksum of each file to \fIdatabase\fR.  This operation must be run
initially to create the database.  To start over, you must manually delete the
database file(s) (see the \*(L"\s-1DATABASE\s0\*(R" section).
.IP "\fBreport\fR" 7
.IX Item "report"
Print each file name if its checksum is found in \fIdatabase\fR.
.IP "\fBdelete\fR[\fB,report\fR]" 7
.IX Item "delete[,report]"
Delete each file if its checksum is found in \fIdatabase\fR.  If \fBreport\fR is also
present, print the name of each deleted file.
.Sp
\&\fI\s-1WARNING:\s0\fR if you run \fBdupemap delete\fR on the same \fIpath\fR you just ran
\&\fBdupemap scan\fR on, it will \fIdelete every file!\fR The idea of these options is
to scan one \fIpath\fR and delete files in a second \fIpath\fR.
.Sp
\&\fI\s-1WARNING:\s0\fR use the \fBreport\fR operation first to see what will be deleted.
.Sp
\&\fI\s-1WARNING:\s0\fR You are advised to make a backup of the target first, e.g. with
\&\f(CW\*(C`cp \-al\*(C'\fR (for \s-1GNU\s0 cp) to create hard links recursively.
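.PP
Such a hard-link backup can be sketched as follows (\s-1GNU\s0 cp assumed; the
directory names below are placeholders for your own target):
.PP
```shell
# Create a throwaway tree to demonstrate on (placeholder paths).
dir=$(mktemp -d)
mkdir "$dir/target"
echo data > "$dir/target/file.doc"

# -a recurses and preserves attributes; -l makes hard links instead
# of copying contents, so the backup costs almost no disk space.
cp -al "$dir/target" "$dir/target.bak"

# Both names now refer to the same inode (link count 2).
stat -c %h "$dir/target.bak/file.doc"
```
.PP
If \fBdupemap delete\fR later removes a file from the target, the backup's
hard link still holds the data.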
.SH "OPTIONS"
.IX Header "OPTIONS"
.IP "\fB\-d\fR \fIdatabase\fR" 7
.IX Item "-d database"
Use \fIdatabase\fR as an on-disk database to read from or write to.  See the
\&\*(L"\s-1DESCRIPTION\s0\*(R" section above about how this influences the operation of
\&\fBdupemap\fR.
.IP "\fB\-I\fR \fIfile\fR" 7
.IX Item "-I file"
Reads input files from \fIfile\fR in addition to those listed on the command line.
If \fIfile\fR is \f(CW\*(C`\-\*(C'\fR, read from standard input.  Each line will be interpreted as
a file name.
.Sp
The paths given here will \s-1NOT\s0 be scanned recursively.  Directories will be
ignored and symlinks will be followed.
.IP "\fB\-m\fR \fIminsize\fR" 7
.IX Item "-m minsize"
Ignore files below this size.
.IP "\fB\-M\fR \fImaxsize\fR" 7
.IX Item "-M maxsize"
Ignore files above this size.
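.PP
The list fed to \fB\-I\fR is just newline-separated file names, for example as
produced by \fBfind\fR.  The dupemap invocation is shown commented out because
it depends on your installation; the paths are placeholders:
.PP
```shell
# Build a file list suitable for dupemap -I (placeholder paths).
dir=$(mktemp -d)
list=$(mktemp)
touch "$dir/a.doc" "$dir/b.doc"
find "$dir" -type f > "$list"

cat "$list"                # one file name per line
# dupemap -I "$list" report    # would examine only the listed files
```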
.SH "USAGE"
.IX Header "USAGE"
.SS "General usage"
.IX Subsection "General usage"
The easiest operation to understand is the one where \fB\-d\fR is not given.  To
delete all duplicate files in \fI/tmp/recovered\-files\fR, do:
.PP
.Vb 1
\&    $ dupemap delete /tmp/recovered\-files
.Ve
.PP
Often, \fBdupemap scan\fR is run to produce a checksum database of all files in a
directory tree.  Then \fBdupemap delete\fR is run on another directory, often
after a trial run with \fBdupemap report\fR.  For example, to delete all files in
\&\fI/tmp/recovered\-files\fR that already exist in \fI\f(CI$HOME\fI\fR, do this:
.PP
.Vb 2
\&    $ dupemap \-d homedir.map scan $HOME
\&    $ dupemap \-d homedir.map delete,report /tmp/recovered\-files
.Ve
.SS "Usage with magicrescue"
.IX Subsection "Usage with magicrescue"
The main application for \fBdupemap\fR is to take some pain out of performing
undelete operations with \fBmagicrescue\fR(1).  The reason is that \fBmagicrescue\fR
will extract every single file of the specified type on the block device, so
undeleting files requires you to find a few files out of hundreds, which can
take a long time if done manually.  What we want to do is to only extract the
documents that don't exist on the file system already.
.PP
In the following scenario, you have accidentally deleted some important Word
documents in Windows.  If this were a real-world scenario, then by all means use
The Sleuth Kit.  However, \fBmagicrescue\fR will work even when the directory
entries were overwritten, i.e. more files were stored in the same folder later.
.PP
You boot into Linux and change to a directory with lots of space.  Mount the
Windows partition, preferably read-only (especially with \s-1NTFS\s0), and create the
directories we will use.
.PP
.Vb 2
\&    $ mount \-o ro /dev/hda1 /mnt/windows
\&    $ mkdir healthy_docs rescued_docs
.Ve
.PP
Extract all the healthy Word documents with \fBmagicrescue\fR and build a database
of their checksums.  It may seem a little redundant to send all the documents
through \fBmagicrescue\fR first, but the reason is that this process may modify
them (e.g. stripping trailing garbage), and therefore their checksum will not
be the same as the original documents.  Also, it will find documents embedded
inside other files, such as uncompressed zip archives or files with the wrong
extension.
.PP
.Vb 4
\&    $ find /mnt/windows \-type f \e
\&      |magicrescue \-I\- \-r msoffice \-d healthy_docs
\&    $ dupemap \-d healthy_docs.map scan healthy_docs
\&    $ rm \-rf healthy_docs
.Ve
.PP
Now rescue all \f(CW\*(C`msoffice\*(C'\fR documents from the block device and get rid of
everything that's not a *.doc.
.PP
.Vb 2
\&    $ magicrescue \-Mo \-r msoffice \-d rescued_docs /dev/hda1 \e
\&      |grep \-v \*(Aq\e.doc$\*(Aq|xargs rm \-f
.Ve
.PP
Remove all the rescued documents that also appear on the file system, and
remove duplicates.
.PP
.Vb 2
\&    $ dupemap \-d healthy_docs.map delete,report rescued_docs
\&    $ dupemap delete,report rescued_docs
.Ve
.PP
The \fIrescued_docs\fR folder should now contain only a few files.  These will be
the undeleted files, plus some documents that were not stored in contiguous
blocks (use that defragger ;\-)).
.SS "Usage with fsck"
.IX Subsection "Usage with fsck"
In this scenario (based on a true story), you have a hard disk that's gone bad.
You have managed to \fIdd\fR about 80% of the contents into the file \fIdiskimage\fR,
and you have an old backup from a few months ago.  The disk is using reiserfs
on Linux.
.PP
First, use fsck to make the file system usable again.  It will find many
nameless files and put them in \fIlost+found\fR.  You need to make sure there is
some free space on the disk image, so fsck has something to work with.
.PP
.Vb 6
\&    $ cp diskimage diskimage.bak
\&    $ dd if=/dev/zero bs=1M count=2048 >> diskimage
\&    $ reiserfsck \-\-rebuild\-tree diskimage
\&    $ mount \-o loop diskimage /mnt
\&    $ ls /mnt/lost+found
\&    (tons of files)
.Ve
.PP
Our strategy will be to restore the system with the old backup as a base and
merge the two other sets of files (\fI/mnt/lost+found\fR and \fI/mnt\fR) into the
backup after eliminating duplicates.  Therefore we create a checksum database
of the directory we have unpacked the backup in.
.PP
.Vb 1
\&    $ dupemap \-d backup.map scan ~/backup
.Ve
.PP
Next, we eliminate all the files from the rescued image that are also present
in the backup.
.PP
.Vb 1
\&    $ dupemap \-d backup.map delete,report /mnt
.Ve
.PP
We also want to remove duplicates from \fIlost+found\fR, and we want to get rid of
any files that are also present in the other directories in \fI/mnt\fR.
.PP
.Vb 3
\&    $ dupemap delete,report /mnt/lost+found
\&    $ ls \-d /mnt/*|grep \-v lost+found|xargs dupemap \-d mnt.map scan
\&    $ dupemap \-d mnt.map delete,report /mnt/lost+found
.Ve
.PP
This should leave only the files in \fI/mnt\fR that have changed since the last
backup or got corrupted.  Particularly, the contents of \fI/mnt/lost+found\fR
should now be reduced enough to manually sort through them (or perhaps use
\&\fBmagicsort\fR(1)).
.SS "Primitive intrusion detection"
.IX Subsection "Primitive intrusion detection"
You can use \fBdupemap\fR to see what files change on your system.  This is one of
the more exotic uses, and it's only included for inspiration.
.PP
First, you map the whole file system.
.PP
.Vb 1
\&    $ dupemap \-d old.map scan /
.Ve
.PP
Then you come back a few days/weeks later and run \fBdupemap report\fR.  This will
give you a view of what \fIhas not\fR changed.  To see what \fIhas\fR changed, you
need a list of the whole file system.  You can easily produce this list while
preparing a new map.  Both lists must be sorted before they can be compared.
.PP
.Vb 2
\&    $ dupemap \-d old.map report /|sort > unchanged_files
\&    $ dupemap \-d current.map scan /|sort > current_files
.Ve
.PP
All that's left is to compare these lists and prepare for next week.
This assumes that the dbm library appends the \f(CW\*(C`.db\*(C'\fR extension to database files.
.PP
.Vb 2
\&    $ diff unchanged_files current_files > changed_files
\&    $ mv current.map.db old.map.db
.Ve
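.PP
Because both lists are sorted, \fBcomm\fR(1) can isolate the changed files more
directly than \fBdiff\fR.  The two list files below are stand-ins for the ones
generated above:
.PP
```shell
# comm -13 prints lines unique to the second file, i.e. entries
# missing from the unchanged list (stand-in sorted lists).
dir=$(mktemp -d)
printf '/etc/passwd\n/usr/bin/ls\n' > "$dir/unchanged_files"
printf '/etc/passwd\n/etc/shadow\n/usr/bin/ls\n' | sort > "$dir/current_files"

comm -13 "$dir/unchanged_files" "$dir/current_files"
# → /etc/shadow
```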
.SH "DATABASE"
.IX Header "DATABASE"
The actual database file(s) written by \fBdupemap\fR will have some relation to
the \fIdatabase\fR argument, but most implementations append an extension.  For
example, Berkeley \s-1DB\s0 names the file \fIdatabase\fR\fB.db\fR, while Solaris and \s-1GDBM\s0
create both a \fIdatabase\fR\fB.dir\fR and a \fIdatabase\fR\fB.pag\fR file.
.PP
\&\fBdupemap\fR depends on a database library for storing the checksums.  It
currently requires the POSIX-standardized \fBndbm\fR library, which must be
present on XSI-compliant UNIX systems.  Implementations are not required to
handle hash key collisions, and a failure to do so could make \fBdupemap\fR
delete too many files.  No such implementation is known, though.
.PP
The current checksum algorithm is the file's \s-1CRC32\s0 combined with its size.
Both values are stored in native byte order, and because of varying type sizes
the database is \fInot\fR portable across architectures, compilers and operating
systems.
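.PP
The checksum-plus-size pairing can be approximated with \s-1POSIX\s0 \fBcksum\fR,
which prints a \s-1CRC\s0 and a byte count for every file (note that cksum's
polynomial is not the \s-1CRC32\s0 dupemap uses; this only illustrates the idea):
.PP
```shell
# Identical contents yield identical (CRC, size) pairs; different
# contents do not (temporary files stand in for real data).
dir=$(mktemp -d)
printf 'hello\n' > "$dir/a"
printf 'hello\n' > "$dir/b"
printf 'world\n' > "$dir/c"

cksum "$dir/a" "$dir/b" "$dir/c"
```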
.SH "SEE ALSO"
.IX Header "SEE ALSO"
\&\fBmagicrescue\fR(1), \fBweeder\fR(1)
.PP
This tool does the same thing \fBweeder\fR does, except that \fBweeder\fR cannot seem
to handle many files without crashing, and it has no largefile support.
.SH "BUGS"
.IX Header "BUGS"
There is a tiny chance that two different files can have the same checksum and
size.  The probability of this happening is around 1 in 10^14, and since
\&\fBdupemap\fR is part of the Magic Rescue package, which deals with disaster
recovery, that risk is negligible in that context.  You should
consider this if you apply \fBdupemap\fR to other applications, especially if they
are security-related (see the next paragraph).
.PP
It is possible to craft a file to have a known \s-1CRC32\s0.  You need to keep this in
mind if you use \fBdupemap\fR on untrusted data.  A solution to this could be to
implement an option for using \s-1MD5\s0 checksums instead.
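.PP
Until such an option exists, a cryptographic-hash pass can be approximated
with standard tools.  The pipeline below (\s-1GNU\s0 \fBmd5sum\fR and \fBuniq\fR
assumed) lists files whose \s-1MD5\s0 digest occurs more than once:
.PP
```shell
# Demo tree with two identical files and one different one.
dir=$(mktemp -d)
printf 'same\n' > "$dir/one"
printf 'same\n' > "$dir/two"
printf 'diff\n' > "$dir/three"

# Hash every file, sort by digest, and keep groups of repeated
# digests (-w32 compares only the 32 hex characters of the hash).
find "$dir" -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```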
.SH "AUTHOR"
.IX Header "AUTHOR"
Jonas Jensen <jbj@knef.dk>
.SH "LATEST VERSION"
.IX Header "LATEST VERSION"
This tool is part of Magic Rescue.  You can find the latest version at
<http://jbj.rapanden.dk/magicrescue/>