Lazarus
--------

Lazarus is a program that attempts to resurrect deleted files or data
from raw data - most often the unallocated portions of a Unix file
system, but it can be used on any data, such as system memory, swap,
etc.  It has two basic logical pieces - one that grabs input from a source
and another that dissects, analyzes, and reports on its findings.

It can be used for recovering lost data and files (whether removed
accidentally by yourself or maliciously by someone else), as a tool for
better understanding how a Unix system works, for investigating or spying
on system and user activity, etc.

It is not for the faint of heart.  Unix systems are not like PC
operating systems, and in addition, since this is a first-generation
tool, it is nowhere near as polished or professional as PC-based
data recovery tools.  (At least, I'm pretty sure; I haven't really
used many PC recovery tools.)  No special privileges are required
to run the tool, although on most systems the most interesting data
(raw disk devices, memory, swap, etc.) is readable only by root.

While something like dd(1) may be used for supplying input, perhaps
the most interesting thing to do is to use the unrm(1) program (supplied
with TCT) to capture the currently unused space in a file system, so that
you can analyze data that was previously allocated and then released.

Lazarus (not unrm) has been used with data from UFS, EXT2, NTFS,
and FAT32 file systems, but it can be used on just about any type
of file system - your success will vary with the way the data
resides on the disk, but it always seems to find something.


Analysis & Internal Workings
-----------------------------

The dissection and analysis side of Lazarus is a perl program which 
takes the following steps:

1)  Read in a chunk of data (typically 1K, but modifiable by changing
the variable $BLOCK_SIZE in lazarus.cf).

2)  Roughly determine what sort of data it is - text or binary - so that
	further analysis can be done.  This is currently done by examining
	the first 10% of $BLOCK_SIZE - if those bytes are all printable
	characters the chunk is flagged as a text block, otherwise it is
	flagged as binary data.  (A small sketch of this test appears just
	after this list of steps.)

3-t)  If text, it checks the data against a set of regular expressions
	to attempt to determine what it is.

3-b)  If binary, the file(1) command is run over the chunk of data.  If
	that doesn't succeed, the first few bytes are examined to see if
	the chunk appears to be in ELF format.  Because of the bugs and
	problems with various versions of file(1), we've ported the FreeBSD
	version to a few Unix variants and included it in this distribution.

4)  If it is recognized, the block is marked as a block of that data type.
	If it is a different type of block than the previous block, it is
	saved to a new file.  If the data is the same type as the previous
	data block, it is concatenated to it.  If the data block is not
	recognized after the initial text/binary recognition but follows a
	recognized chunk of text/binary data (respectively), lazarus assumes
	that it is a continuation of the previous data and will also
	concatenate it to the previous data block.

5)  The output is in two forms, the data blocks and a map that corresponds
	to the blocks.  The blocks are saved in a separate directory (by
	default, amazingly, "blocks") in a file that starts with the
	logical block # that is currently being processed.  Each run 
	of similar data, having all been saved to a single file, also has a 
	name that corresponds to its data type ("c" == C source code, 
	"h" == HTML, etc.)  Since data files might be viewed in a browser 
	(more on that later) they all have an ending of ".txt" so that 
	they will not be interpreted as potentially harmful code.  The 
	naming convention is blocknumber.type.txt, or 
	blocknumber.type.{jpg,gif,etc} if it is an image file, so, for
	instance, a run of mail data might have the name 100.m.txt.

	So searching all the blocks identified as C code to see if they 
	contain a header file, for instance, is as easy as something like:

		% grep -l header.h blocks/*.c.txt

	The data map generated is an ASCII list of characters that 
	corresponds to the various data types found.  The first character
	that represents a "run" of blocks is capitalized, so that a
	run of mail would show up as:

		Mmmmm

	To try to make the output stay within a semi-manageable size it does
	a logarithmic (base 2) compression of the output - the first
	character represents one block or less of data, the 2nd from 0-2
	blocks, the 3rd 0-4, the 4th 0-8, etc., so that the above run of
	mail data (of 5 characters) would be 1+2+4+8 blocks plus the last
	character, which could stand for 1-16 blocks, totaling 16-31K of
	data (with 1K blocks).

	This is all most colorful (readable?) if viewed with a browser; it
	outputs colored characters (and a map legend), and clicking on the
	first character of a data run will bring up a window with both a
	simple navigation bar and the data contained in the run.
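
Here is a minimal perl sketch of the text/binary test described in step 2.
It is an illustration only, not the actual lazarus code; the exact
printable-character test (plain ASCII plus whitespace), the command-line
handling, and the "t"/"." map characters are assumptions made for the
example.

	#!/usr/bin/perl
	#
	# Rough illustration only -- not the real lazarus code.  Reads an
	# image (e.g. unrm output) in $BLOCK_SIZE chunks and flags each
	# chunk as text or binary by checking whether the first 10% of
	# the chunk is entirely printable, as described in step 2 above.
	#
	use strict;

	my $BLOCK_SIZE = 1024;		# same default as conf/lazarus.cf

	sub classify_block {
	    my ($block) = @_;
	    my $sample = substr($block, 0, int($BLOCK_SIZE / 10));
	    # printable ASCII plus tab/newline/CR counts as "text"
	    return ($sample =~ /^[\040-\176\t\n\r]*$/) ? "text" : "binary";
	}

	my $image = shift(@ARGV) || die "usage: $0 image-file\n";
	open(IMAGE, "< $image") || die "$image: $!\n";
	my ($block, $map) = ("", "");
	while (read(IMAGE, $block, $BLOCK_SIZE)) {
	    $map .= (classify_block($block) eq "text") ? "t" : ".";
	}
	close(IMAGE);
	print "$map\n";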

There is an alternate method of analysis which does byte-by-byte examination
rather than a block at a time; it is significantly slower (and lazarus is
already very slow) but might be better suited for non-block-based data
(such as memory or unknown data fragments).  It is basically the same -
it still works on $BLOCK_SIZE-sized chunks - but instead of looking at only
the first 10% of the block it examines the whole thing looking for fragments.

After the initial analysis you essentially have one of two choices -
use the output/data map to examine the data or go straight to the 
data blocks for further processing/searching.  


Using Lazarus
--------------

Rare is the user who hasn't blown away data inadvertently; most break-ins
involve at least small amounts of data destruction (if nothing else,
intruders will often carry around or install the tools that they use
to compromise systems); and legitimate usage of a system - mail, WWW
browsing, compiling programs, etc. - leaves considerable amounts of
deleted activity on the disk.  There are many reasons, some legitimate,
some not, why a user might want to examine the system.

For better or for worse, there are several significant obstacles to doing
data recovery or analysis on a Unix file system (which will probably be
the most significant use of Lazarus).  First and foremost, unlike PCs,
Macs, and other systems, most Unix file systems were designed for high
performance with no (or not much) thought to data recovery.  When a file
is removed, essentially all the useful information about it - the filename,
inode, etc. - is either deleted or mostly rendered useless for data
recovery.

So in order to do any sort of recovery you generally have to examine *ALL*
the unused portions of the disk, which can take an enormous amount of time
depending on how much free space you have (this is one case where you 
don't want to have a big disk!)

In addition, while (especially for smaller files) Unix generally attempts
to write data in contiguous blocks, the larger the file the better the
chance it has been broken up into pieces.  While it is possible to
manually reconstruct data in such a situation, it will probably be very
painful unless the format of the data is very regular and easy to 
recognize, like mail, system log files, etc.

Also, unless the disk is frozen (that is, immediately taken off-line),
you run the risk of overwriting the data in question (and even if it is
taken off-line immediately, kernel data buffers and the like could still
be waiting to be sync'd).  On the good side of things, we've found that
data is very localized; reasonably sized files in a directory will usually
have all their data in the immediate neighborhood of that directory, so
unless you're writing to that same area you're probably OK.  Indeed,
unless you really hammer the disk with writing, most stuff is probably
going to be untouched, even long after the data has been deleted.


All said, however, there are probably two main uses for such a tool -
data recovery and (often post-mortem) analysis.


Data Recovery
--------------

Hopefully you've only blown away a small, easily recognized text file.
This is probably not the case.  Regardless of what happened, you'll 
want the following items:

1)  A second system that can recognize the disk is optional, but desirable.

2)  Another disk, or at least another file system on the same disk, so that
	you're taking data from one file system and writing it to another.
	This may sound foolish, but UNDER NO CIRCUMSTANCES SHOULD YOU RECOVER
	DATA ON THE SAME FILE SYSTEM IT WAS LOST ON!!!!

	If this isn't clear to you, consider: your data is in that
	free area out there somewhere.  You'd be filling up that free
	space with itself, essentially at random locations.  There'd be
	a more than significant chance you'd blow away your potentially
	interesting data before you got it.

3)  At least as much free disk space in a *TARGET* location as the free space
	on the afflicted disk.  Ideally you want a bit more than twice
	as much space, to both write out the unallocated space and to save
	the lazarus results.  This is because lazarus (by default, at least)
	rewrites all the data in another location as well.  So if you have a
	2GB disk with 750MB free, you'd want a second disk with 750MB x 2,
	or 1.5GB, free.  At least that much; if you try to reassemble the
	data in various ways you'll want a bit more space to play with.

4)  Ample amounts of free time.  On a test system, a SPARC 5 with a
	reasonably fast SCSI drive, it took a bit less than an hour to unrm
	1.8 GB of data.  I then ftp'd it over to another system - even at
	30-50K a second (my T-1 tops out at around 150K/sec, but the target
	system had a much slower line) it took several hours.  Then my
	Ultra 5 (with an even faster drive) took 10 hours to analyze the
	data.  Some 15+ hours, and nothing had even been *looked* at yet.

Then take the following steps:

1)  Remove the disk from operational hazards.  If it's a system disk
	that sees lots of action, halt the system and mount the disk
	on another system.  Mounting it read-only is a fine idea, so
	that no additional data is lost.

2)  Run unrm, or optionally simply use dd(1); for instance
	(on SunOS and Solaris, respectively):

	# dd bs=1000k if=/dev/rsd0b of=whatever
	# dd bs=1000k if=/dev/rdsk/c0t3d0s1 of=whatever

	# ./unrm /dev/rsd0a > output
	# ./unrm /dev/rdsk/c0t3d0s0 > output

	Of course dd(1) will make no distinction between free and used
	blocks; if you're interested in analyzing the free space, use
	unrm if at all possible.

	If you're feeling brave you can try it on kmem/mem/swap/whatever,
	although don't blame us if your system doesn't like having its
	memory dumped out (it worked fine on our systems, but this is
	potentially dangerous stuff) and crashes:

	# dd bs=1000k if=/dev/kmem of=whatever

3)  Run Lazarus (see lazarus.damn-the-torpedos or lazarus.README).

4)  Now begins the... "interesting" part.  First, what sort of data was
	lost?  If it was mail or other text files, you might be in luck.
	You can try to run the "rip-mail" program (see rip-mail.README 
	for more information) and see if it recovers the info.

	If it was a piece of text - writing, or mail that the rip-mail
	program didn't recover, then grep(1) is probably your next best
	friend.  Remember, all the blocks (or runs of blocks) are saved 
	in individual files in the blocks subdirectory.  So, for instance,
	if you're looking for your resume, think of keywords that might
	help you out (like your name, employers, etc.)  You could then
	do something like:

		% egrep -l 'keyword1|keyword2|...' blocks/*.txt > allfiles

	If there are any files listed, start examining them with a good pager
	(like less).

	Images are likewise easy; simply do something like ("xv" is a fine
	Unix image viewer/editor):

		% xv blocks/*.gif blocks/*.jpg

	(Many images are very easy to view with a browser as well.)

	Text-based log files (syslog, messages, etc.), even though they will
	often be spread out all over the disk because of the way they are
	written (a few records at a time), are actually (potentially) trivial
	to recover - and in the correct order - because of the wonderful
	time stamp on each line.  The simplest way (until a better log
	analyzer is written; a rough sketch of one appears after these steps)
	is something like this (the tr(1) is in there to remove nulls; shell
	commands don't like nulls in the files they work with!):
	
		% cat blocks/*.l.txt | tr -d '\000' | sort -u > all-logs

	And then browse through them.  A few bits and pieces will probably
	be lost (due to the fragmentation at block and fragment boundaries),
	but it's a good way to start.

	Some data, like C source code, is very easy to confuse with other
	types of program files and text, so a combined-arms approach that
	uses grep & the browser is sometimes useful.  For instance, if you
	have a section of data that looks like this on the disk map:

		....CccPpC..Hh....

	The first three recognized text types - C, P, and C - might all be
	the same type (C).  You can find the block # by simply putting your
	mouse over the link in the browser, and then concatenate the files
	or examine them manually.

	Another good way to find source code is if you know of a specific
	#include file that the code uses or a specific output line
	that the program emits - a simple:

		% grep -l rpcsvc/sm_inter.h blocks/*.[cpt].txt

	will find any files that have rpcsvc/sm_inter.h in them (not a lot,
	probably! ;-))  This sort of brute-force approach can be quite
	useful.
	
	Again, beware of concatenating lots of recovered blocks/files
	together and performing text-based searches or operations on
	them (sort, grep, uniq, etc.)

	The browser-based approach can be especially useful if you have
	lots of files on a disk.  Because Unix file systems will often have
	a strong sense of locality that is tied to the directories that the
	files were once in, you can use the graphical browser to locate an
	interesting-looking section of the disk; once you've found an area
	that looks promising, you can hunt through it either with the browser
	or with standard Unix tools on the saved blocks.
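
As promised above, here is a rough perl sketch of a slightly smarter log
reassembler than the sort(1) one-liner: it strips nulls, drops exact
duplicate lines, and orders what's left by its syslog-style time stamp.
The blocks/*.l.txt glob, the time stamp format, and the assumption that
everything comes from a single year are simplifications made for the
example - treat it as a starting point, not as part of TCT.

	#!/usr/bin/perl
	#
	# Rough sketch, not part of TCT: reassemble lazarus log blocks
	# (blocks/*.l.txt) into one stream ordered by their syslog-style
	# time stamps ("Mmm dd hh:mm:ss ...").  Assumes one year of data.
	#
	use strict;

	my %month = (Jan => 1, Feb => 2, Mar => 3, Apr => 4,
		     May => 5, Jun => 6, Jul => 7, Aug => 8,
		     Sep => 9, Oct => 10, Nov => 11, Dec => 12);
	my (%seen, @lines);

	foreach my $file (glob("blocks/*.l.txt")) {
	    open(BLOCK, "< $file") || next;
	    while (<BLOCK>) {
		tr/\000//d;		# strip the nulls that confuse sort/grep
		next if $seen{$_}++;	# drop exact duplicate lines
		if (/^(\w{3})\s+(\d+)\s+(\d+):(\d+):(\d+)/
			&& defined($month{$1})) {
		    push(@lines, [sprintf("%02d%02d%02d%02d%02d",
					  $month{$1}, $2, $3, $4, $5), $_]);
		} else {
		    push(@lines, ["9999999999", $_]);  # unparsable lines last
		}
	    }
	    close(BLOCK);
	}
	print map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines;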


Random Notes
-------------

Lazarus should process all data fed into it and emit it into the
"blocks" subdirectory (by default).  If you concatenate all the
output blocks and compare them to the original image they should be
identical.  It doesn't change the input at all; it simply breaks it
up into more readable pieces.
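
If you want to check that, a quick perl sketch like the following will
write the saved blocks back out in logical-block order so the result can
be compared against the original image with cmp(1).  It assumes the
blocknumber.type naming convention described earlier and that the blocks
directory contains nothing else:

	#!/usr/bin/perl
	#
	# Sketch: re-emit the saved blocks in logical-block order.
	# usage:  perl this-script > rebuilt ; cmp rebuilt original-image
	# Assumes the blocknumber.type[.txt] naming convention and that
	# the blocks directory holds nothing else.
	#
	use strict;

	sub blocknum { my ($f) = @_; return ($f =~ m,/(\d+)\.,) ? $1 : 0; }

	foreach my $file (sort { blocknum($a) <=> blocknum($b) }
			  glob("blocks/*")) {
	    open(BLOCK, "< $file") || die "$file: $!\n";
	    print while <BLOCK>;
	    close(BLOCK);
	}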

While text output files created by lazarus can be as short as 1K (or
whatever the minimum block size is - for FFS and its derivatives it'll
be 1K), most binary output files, no matter how small their real content
(like a single-pixel gif), will be at least the minimum block size that
the disk can write.  This is because, after the first 1K is read in, a
text file will end as soon as it hits binary information (e.g. the garbage
at the end of a disk block).  Binary files, however, will simply have the
(binary) garbage concatenated on the end of the recognized binary data
unless by some miracle it's recognized as another binary type (doubtful
unless real data is in there).  This could be fixed if lazarus actually
looked at the binary format it was trying to read and found the size of
the file (often contained in the file's header) - it could stop reading
after that many bytes and do (perhaps significantly) better overall
recognition.

If that's not clear, don't worry about it ;-)  If it is, change the
variable $BLOCK_SIZE in the lazarus configuration file ("conf/lazarus.cf")
to a more reasonable value for the system you're investigating.  1024
is a very safe but conservative number (and don't forget to change it
when examining other systems!), but if you'd like to cheat smartly and
increase performance, set the block size to 8192, which is the FFS logical
block size.  You'll then miss all the small files that may not start on
an 8KB boundary, however.
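
For instance (assuming your copy of conf/lazarus.cf follows the usual TCT
convention of plain perl variable assignments - check your copy first),
the change is just one line:

	# in conf/lazarus.cf - pick the value that matches the file system
	# being examined: 1024 is safe; 8192 matches the FFS logical block
	# size but will miss small files not starting on an 8KB boundary
	$BLOCK_SIZE = 8192;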


Unrm
-----

The unrm program typically requires root privileges to use (at a minimum
it must have sufficient privileges to read the device), for it examines
a raw disk device for free data blocks.  See the documentation (the unrm 
man page) for more information.