Lazarus
--------
Lazarus is a program that attempts to resurrect deleted files or data
from raw data - most often the unallocated portions of a Unix file
system, but it can be used on any data, such as system memory, swap,
etc. It has two basic logical pieces - one that grabs input from a source
and another that dissects, analyzes, and reports on its findings.
It can be used for recovering lost data and files (whether removed
accidentally by yourself or maliciously by someone else), as a tool for
better understanding how a Unix system works, for investigating or
spying on system and user activity, etc.
It is not for the faint of heart. Unix systems are not like PC
operating systems, and in addition, since this is a first-generation
tool, it is nowhere near as polished or professional as PC-based
data recovery tools. (At least, I'm pretty sure; I haven't really
used many PC recovery tools.) No special privileges are required
to run the tool, although on most systems the most interesting data
(raw disk devices, memory, swap, etc.) is readable only by root.
While something like dd(1) may be used to supply input, perhaps
the most interesting thing to do is to use the unrm(1) program
(supplied with TCT) to capture the currently unused space in a file
system, so you can analyze information that was previously allocated
and then released.
Lazarus (not unrm) has been used with data from UFS, EXT2, NTFS,
and FAT32 file systems, but it can be used on just about any type
of file system - your success will vary with the way the data
resides on the disk, but it always seems to find something.

Analysis & Internal Workings
-----------------------------
The dissection and analysis side of Lazarus is a Perl program that
takes the following steps:
1) Read in a chunk of data (typically 1K, but modifiable by changing
the variable $BLOCK_SIZE in lazarus.cf).
2) Roughly determine what sort of data it is - text or binary - so that
further analysis can be done. This is currently done by examining
the first 10% of the block: if those bytes are all printable
characters it's flagged as a text block, else it is flagged as binary
data. (A sketch of this logic appears after this list.)
3-t) If text, it checks the data against a set of regular expressions
to attempt to determine what it is.
3-b) If binary, the file(1) command is run over the chunk of data. If
that doesn't succeed, the first few bytes are examined to see if the
chunk appears to be in ELF format. Because of bugs and problems with
various versions of file(1), we've ported the FreeBSD version to a
few flavors of Unix and included it in this distribution.
4) If it is recognized, the block is marked as a block of that data type.
If it is a different type than the previous block, it is saved to a
new file; if it is the same type as the previous data block, it is
concatenated onto it. If a data block is not recognized beyond the
initial text/binary determination but follows a recognized chunk of
text/binary data (respectively), lazarus assumes that it is a
continuation of the previous data and will also concatenate it onto
the previous data block.
5) The output comes in two forms: the data blocks and a map that
corresponds to the blocks. The blocks are saved in a separate
directory (by default, amazingly, "blocks") in a file whose name
starts with the logical block # that was being processed when the
run began. Each run of similar data, having all been saved to a
single file, also has a name that corresponds to its data type
("c" == C source code, "h" == HTML, etc.) Since data files might be
viewed in a browser (more on that later) they all end in ".txt" so
that they will not be interpreted as potentially harmful code. The
naming convention is blocknumber.type.txt, or
blocknumber.type.{jpg,gif,etc} if it is an image file, so, for
instance, a run of mail data might have the name 100.m.txt.
So searching all the blocks identified as C code to see if they
contain a header file, for instance, is as easy as something like:
% grep -l header.h blocks/*.c.txt
The data map generated is an ASCII list of characters that
corresponds to the various data types found. The first character
that represents a "run" of blocks is capitalized, so that a
run of mail would show up as:
Mmmmm
To try to keep the output within a semi-manageable size, the map is
compressed logarithmically (base 2) - the first character of a run
represents one block or less of data, the 2nd up to 2 blocks, the
3rd up to 4, the fourth up to 8, etc., so the above run of mail data
(5 characters) would be 1+2+4+8 blocks plus the last block, which
could cover anywhere from 1-16K, totaling 16-31K of data.
This is all most colorful (readable?) if viewed with a browser; lazarus
outputs colored characters (and a map legend), and clicking on the
first character of a data block will bring up a window with both a
simple navigation bar and the data contained in the run.
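
To make steps 1 through 4 a little more concrete, here is a rough
sketch in Perl of the kind of loop involved. It is NOT the actual
lazarus code; the image file name, the type letters, and the regular
expressions are simplified or made up purely for illustration.

#!/usr/bin/perl
# Sketch only - see the real lazarus script for the full logic.
$BLOCK_SIZE = 1024;
$blockno = 0;

open(IMAGE, "unrm-output") || die "can't open image: $!\n";
while (read(IMAGE, $block, $BLOCK_SIZE)) {
    # Step 2: peek at the first 10% of the block; if it is all
    # printable (or whitespace), call the block text.
    $peek = substr($block, 0, $BLOCK_SIZE / 10);
    if ($peek =~ /^[\040-\176\s]*$/) {
        # Step 3-t: try a few regexps to guess the kind of text.
        if    ($block =~ /^From .*\n\S+:\s/m)  { $type = "m"; }  # mail
        elsif ($block =~ /#include\s+[<"]/)    { $type = "c"; }  # C code
        elsif ($block =~ /<html|<body/i)       { $type = "h"; }  # HTML
        else                                   { $type = "t"; }  # plain text
    } else {
        # Step 3-b: the real program hands the block to file(1) here,
        # or checks for an ELF magic number.
        $type = "?";
    }
    # Step 4: the real program starts a new output file whenever the
    # type changes and appends unrecognized blocks to the previous
    # run; only the classification is shown here.
    print "block $blockno: $type\n";
    $blockno++;
}
close(IMAGE);
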
There is an alternate method of analysis which does byte-by-byte examination,
rather than a block at a time; it is significantly slower (and lazarus is
already very slow) but might be better suited for non-block based data
(such as memory or unknown data fragments.) It is basically the same -
it still looks at chunks of $BLOCK_SIZE bytes - but instead of examining
only the first 10% of each chunk it examines the whole thing, looking
for fragments.
After the initial analysis you essentially have one of two choices -
use the output/data map to examine the data or go straight to the
data blocks for further processing/searching.

Using Lazarus
--------------
Rare is the user who hasn't blown away data inadvertently; most break-ins
involve at least small amounts of data destruction (if nothing else,
intruders will often carry around or install the tools that they use
to compromise systems); and legitimate use of a system - mail, WWW
browsing, compiling programs, etc. - leaves considerable amounts of
deleted activity on the disk. There are many reasons, some legitimate,
some not, why a user might want to examine such data.
For better or for worse, there are several significant obstacles to doing
data recovery or analysis on a Unix file system (which will probably be
the most common use of Lazarus.) First and foremost, unlike PCs, Macs,
and other systems, most Unix file systems were designed for high
performance with little (or no) thought given to data recovery. When a
file is removed, essentially all useful information about it - the
filename, inode, etc. - is either deleted or rendered mostly useless
for data recovery.
So in order to do any sort of recovery you generally have to examine *ALL*
the unused portions of the disk, which can take an enormous amount of time
depending on how much free space you have (this is one case where you
don't want to have a big disk!)
In addition, while (especially for smaller files) Unix generally attempts
to write data in contiguous blocks, the larger the file the better the
chance it has been broken up into pieces. While it is possible to
manually reconstruct data in such a situation, it will probably be very
painful unless the format of the data is very regular and easy to
recognize, like mail, system log files, etc.
Also, unless the disk is frozen (that is, immediately taken off-line),
you run the risk of overwriting the data in question (and even if it is
taken off-line immediately, kernel data buffers and the like could still
be waiting to be sync'd). On the good side of things, we've found that
data is very localized; reasonably sized files in a directory will
usually have all their data in the immediate neighborhood of that
directory, so unless you're writing to that same area you're probably
OK. Indeed,
unless you really hammer the disk with writing, most stuff is probably
going to be untouched, even long after the data has been deleted.
All said, however, there are probably two main uses for such a tool -
data recovery and (often post-mortem) analysis.

Data Recovery
--------------
Hopefully you've only blown away a small, easily recognized text file.
This is probably not the case. Regardless of what happened, you'll
want the following items:
1) A second system that can recognize the disk is optional, but desirable.
2) Another disk, or at least another file system on the same disk, so
that you're taking data from one file system and writing it to another.
This may sound foolish, but UNDER NO CIRCUMSTANCES SHOULD YOU RECOVER
DATA ONTO THE SAME FILE SYSTEM IT WAS LOST ON!!!!
If this isn't clear to you, consider: your data is somewhere in that
free space out there. You'd be filling up that free space with itself,
at essentially random locations. There'd be a more than significant
chance you'd blow away your potentially interesting data before you
got to it.
3) At least as much free disk space in a *TARGET* location as there is
free space on the afflicted disk. Ideally you want a bit more than twice
as much space, both to write out the unallocated space and to save
the lazarus results. This is because lazarus (by default, at least)
rewrites all the data in another location as well. So if you have a
2GB disk with 750MB free, you'd want a second disk with 750MB x 2,
or 1.5 GB, free. At least that much; if you try to reassemble the data
in various ways you'll want a bit more space to play with.
4) Ample amounts of free time. On a test system, a SPARC 5 with a
reasonably fast SCSI drive, it took a bit less than an hour to unrm
1.8 GB of data. I then ftp'd it over to another system - even at 30-50K
a second (my T-1 tops out at around 150K/sec, but the target
system had a much slower line) it took several hours. Then my
Ultra 5 (with an even faster drive) took 10 hours to analyze the
data. Some 15+ hours and nothing had even been *looked* at yet.
Then take the following steps:
1) Remove the disk from operational hazards. If it's a system disk
that sees lots of action, halt the system and mount the disk
on another system. Mounting it read-only is a fine idea, so
that no additional data is lost.
2) Run unrm, or optionally simply use dd(1); for instance
(on SunOS/Solaris):
# dd bs=1000k if=/dev/rsd0b of=whatever
# dd bs=1000k if=/dev/rdsk/c0t3d0s1 of=whatever
# ./unrm /dev/rsd0a > output
# ./unrm /dev/rdsk/c0t3d0s0 > output
Of course dd(1) will make no distinction between free and used
blocks; if you're interested in analyzing the free space, use
unrm if at all possible.
If you're feeling brave you can try it on kmem/mem/swap/whatever,
although don't blame us if your system doesn't like having its
memory dumped out and crashes (it worked fine on our systems, but
this is potentially dangerous stuff):
# dd bs=1000k if=/dev/kmem of=whatever
3) Run Lazarus (see lazarus.damn-the-torpedos or lazarus.README).
4) Now begins the... "interesting" part. First, what sort of data was
lost? If it was mail or other text files, you might be in luck.
You can try to run the "rip-mail" program (see rip-mail.README
for more information) and see if it recovers the info.
If it was a piece of text - writing, or mail that the rip-mail
program didn't recover - then grep(1) is probably your next best
friend. Remember, all the blocks (or runs of blocks) are saved
in individual files in the blocks subdirectory. So, for instance,
if you're looking for your resume, think of keywords that might
help you out (like your name, employers, etc.) You could then
do something like:
% egrep -l 'keyword1|keyword2|...' blocks/*.txt > allfiles
If there are any files listed, start examining them with a good pager
(like less).
Images are likewise easy; simply do something like ("xv" is a fine
Unix image viewer/editor):
% xv blocks/*.gif blocks/*.jpg
(Many images are very easy to view with a browser as well.)
Text-based log files (syslog, messages, etc.), even though they will
often be spread out all over the disk because of the way they are
written (a few records at a time), are actually (potentially) trivial
to recover - and in the correct order - because of the wonderful
time stamp on each line. The simplest way (until a better log
analyzer is written) is something like this (the tr(1) is there to
remove nulls; shell commands don't like nulls in the files they work
with!):
% cat blocks/*.l.txt | tr -d '\000' | sort -u > all-logs
And then browse through them. A few bits and pieces will probably
be lost (due to the fragmentation at block and fragment boundaries),
but it's a good way to start.
Some data, like C source code, is very easy to confuse with other
types of program files and text, so a combined-arms approach that
uses grep & the browser is sometimes useful. For instance, if you
have a section of data that looks like this on the disk map:
....CccPpC..Hh....
The first three recognized text types - C, P, and C - might all be
the same type (C). You can find the block # simply by putting your
mouse over the link in the browser, and then concatenate the files
or examine them manually.
Another good way to find source code is if you know of a specific
#include file that the code uses or a specific output line
that the program emits - a simple:
% grep -l rpcsvc/sm_inter.h blocks/*.[cpt].txt
will find any files that have rpcsvc/sm_inter.h in them (not a lot,
probably! ;-)) This sort of brute-force approach can be quite
useful.
Again, beware of concatenating lots of recovered blocks/files
together and performing text based searches or operations on
them (sort, grep, uniq, etc.)
The browser-based approach can be especially useful if you have
lots of files on a disk. Because Unix file systems will often have a
strong sense of locality that is tied to the directories that the
files were once in, you can use the graphical map to locate an
interesting-looking section of the disk; once you've found an area
that looks promising, you can hunt further either in the browser or
with standard Unix tools on the saved blocks.
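For instance, since each saved file name begins with its block number,
you can restrict a search to one neighborhood of the disk once the map
has pointed you at it (the block range and keyword here are made up):
% grep -l resume blocks/51[0-9][0-9].*.txt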

Random Notes
-------------
Lazarus should process all the data fed to it as input and emit it into
the "blocks" (by default) subdirectory. If you concatenate all the
output blocks and compare them to the original image they should be
identical. It doesn't change the input at all; it simply breaks it
up into more readable pieces.
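If you want to check this, something like the following should work;
it's just a sketch, and it assumes that the numeric prefix of each file
name puts the runs back into their original order and that the input
image was called "unrm-output":
% (cd blocks && cat `ls | sort -n`) | cmp - unrm-output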
While text output files created by lazarus can be as short as 1K (or
whatever the minimum block size is - for FFS and its derivatives it'll
be 1K), the minimum size for most binary files, no matter how little
real data they contain (a single-pixel gif, say), will be the minimum
block size that the disk writes. This is because, after the first 1K is
read in, a text file is ended as soon as it hits binary information
(e.g. the garbage at the end of a disk block). A binary file, however,
simply has the (binary) garbage concatenated onto the end of the
recognized data, unless by some miracle that garbage is recognized as
another binary type (doubtful unless real data is in there).
This could be fixed if lazarus actually looked at the binary format it
was trying to read and found the size of the file (often contained in
the header of the file) - it could stop reading after that many bytes,
and do (perhaps significantly) better overall recognition.
If that's not clear, don't worry about it ;-) If it is, change the
variable $BLOCK_SIZE in the lazarus configuration file ("conf/lazarus.cf")
to be a more reasonable value for the system you're investigating. 1024
is a very safe but conservative number (and don't forget to change it
when examining other systems!), but if you'd like to cheat smartly and
increase performance set the block size to 8192, which is the FFS
logical block size. You'll then miss any small files that don't start
on an 8KB boundary, however.
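For example, the line in conf/lazarus.cf would look something like this
(the exact surrounding syntax in your copy of the file may differ):
$BLOCK_SIZE = 8192;    # FFS logical block size; 1024 is the safe default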

Unrm
-----
The unrm program typically requires root privileges to use (at a minimum
it must have sufficient privileges to read the device), for it examines
a raw disk device for free data blocks. See the documentation (the unrm
man page) for more information.