File: reclimit.html

package info (click to toggle)
evolution-data-server 1.0.4-1
links: PTS
area: main
in suites: sarge
size: 39,504 kB
ctags: 26,423
sloc: ansic: 175,347; tcl: 30,499; sh: 20,699; perl: 11,320; xml: 9,039; java: 7,653; cpp: 6,029; makefile: 4,866; awk: 1,338; yacc: 1,103; sed: 772; cs: 505; lex: 134; asm: 14
file content (148 lines) | stat: -rw-r--r-- 9,728 bytes
parent folder | download | duplicates (3)
<!--$Id: reclimit.html,v 1.1.1.1 2003/11/20 22:14:21 toshok Exp $-->
<!--Copyright 1997-2002 by Sleepycat Software, Inc.-->
<!--All rights reserved.-->
<!--See the file LICENSE for redistribution information.-->
<html>
<head>
<title>Berkeley DB Reference Guide: Berkeley DB recoverability</title>
<meta name="description" content="Berkeley DB: An embedded database programmatic toolkit.">
<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++">
</head>
<body bgcolor=white>
<a name="2"><!--meow--></a>
<table width="100%"><tr valign=top>
<td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Transactional Data Store Applications</dl></h3></td>
<td align=right><a href="../../ref/transapp/filesys.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/transapp/tune.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p>
<h1 align=center>Berkeley DB recoverability</h1>
<p>Berkeley DB recovery is based on write-ahead logging.  This means that
when a change is made to a database page, a description of the change is
written into a log file.  This description in the log file is guaranteed
to be written to stable storage before the database pages that were
changed are written to stable storage.  This is the fundamental feature
of the logging system that makes durability and rollback work.
<p>If the application or system crashes, the log is reviewed during
recovery.  Any database changes described in the log that were part of
committed transactions and that were never written to the actual
database itself are written to the database as part of recovery.  Any
database changes described in the log that were never committed and that
were written to the actual database itself are backed-out of the
database as part of recovery.  This design allows the database to be
written lazily, and only blocks from the log file have to be forced to
disk as part of transaction commit.
<p>There are two interfaces that are a concern when considering Berkeley DB
recoverability:
<p><ol>
<p><li>The interface between Berkeley DB and the operating system/filesystem.
<li>The interface between the operating system/filesystem and the
underlying stable storage hardware.
</ol>
<p>Berkeley DB uses the operating system interfaces and its underlying filesystem
when writing its files.  This means that Berkeley DB can fail if the underlying
filesystem fails in some unrecoverable way.  Otherwise, the interface
requirements here are simple: The system call that Berkeley DB uses to flush
data to disk (normally <b>fsync</b>(2)), must guarantee that all the
information necessary for a file's recoverability has been written to
stable storage before it returns to Berkeley DB, and that no possible
application or system crash can cause that file to be unrecoverable.
<p>In addition, Berkeley DB implicitly uses the interface between the operating
system and the underlying hardware.  The interface requirements here are
not as simple.
<p>First, it is necessary to consider the underlying page size of the Berkeley DB
databases.  The Berkeley DB library performs all database writes using the
page size specified by the application, and Berkeley DB assumes pages are written
atomically.  This means that if the operating system performs filesystem
I/O in blocks of different sizes than the database page size, it may
increase the possibility for database corruption.  For example, assume
that Berkeley DB is writing 32KB pages for a database, and the operating
system does filesystem I/O in 16KB blocks.  If the operating system
writes the first 16KB of the database page successfully, but crashes
before being able to write the second 16KB of the database, the database
has been corrupted and this corruption may or may not be detected during
recovery.  For this reason, it may be important to select database page
sizes that will be written as single block transfers by the underlying
operating system.  If you do not select a page size that the underlying
operating system will write as a single block, you may want to configure
the database to use checksums (see the <a href="../../api_c/db_set_flags.html#DB_CHKSUM_SHA1">DB_CHKSUM_SHA1</a> flag for
more information).  By configuring checksums, you guarantee this kind
of corruption will be detected at the expense of the CPU required to
generate the checksums.  When such an error is detected, the only
course of recovery is to perform catastrophic recovery to restore the
database.
<p>Second, if you are copying database files (either as part of doing a
hot backup or creation of a hot failover area), there is an additional
question related to the page size of the Berkeley DB databases.  You must copy
databases atomically, in units of the database page size.  In other
words, the reads made by the copy program must not be interleaved with
writes by other threads of control, and the copy program must read the
databases in chunks that are a multiple of the underlying database page
size.  Generally, this is not a problem, as operating systems already
make this guarantee and system utilities normally read in power-of-2
sized chunks, which are larger than the largest possible Berkeley DB database
page size.
<p>However, we have seen one problem in this area because some releases of
Solaris implemented the cp utility using the mmap system call rather
than the read system call.  Because the Solaris' mmap system call did
not make the same guarantee of read atomicity as the read system call,
using the cp utility could create corrupted copies of the databases.
Using the dd utility instead of the cp utility (and specifying an
appropriate block size), fixed the problem.  If you plan to use a system
utility to copy database files, you may want to use a system call trace
utility (for example, ktrace or truss) to check for an I/O size smaller
or not a multiple of the database page size and system calls other than
read.
<p>Third, it is necessary to consider the behavior of the system's
underlying stable storage hardware.  For example, consider a SCSI
controller that has been configured to cache data and return to the
operating system that the data has been written to stable storage, when,
in fact, it has only been written into the controller RAM cache.  If
power is lost before the controller is able to flush its cache to disk,
and the controller cache is not stable (that is, the writes will not be
flushed to disk when power returns), the writes will be lost.  If the
writes include database blocks, there is no loss because recovery will
correctly update the database.  If the writes include log file blocks,
it is possible that transactions that were already committed may not
appear in the recovered database, although the recovered database will
be coherent after a crash.
<p>If the underlying hardware can fail in any way so that only part of the
block was written, the failure conditions are the same as those
described previously for an operating system failure that writes only
part of a logical database block.  In such cases, configuring the
database for checksums will ensure the corruption is detected.
<p>For these reasons, it may be important to select hardware that does not
do partial writes and does not cache data writes (or does not return
that the data has been written to stable storage until it has either
been written to stable storage or the actual writing of all of the data
is guaranteed, barring catastrophic hardware failure -- that is, your
disk drive exploding).
<p>If the disk drive on which you are storing your databases explodes, you
can perform normal Berkeley DB catastrophic recovery, because it requires only
a snapshot of your databases plus the log files you have archived since
those snapshots were taken.  In this case, you should lose no database
changes at all.
<p>If the disk drive on which you are storing your log files explodes, you
can also perform catastrophic recovery, but you will lose any database
changes made as part of  transactions committed since your last archival
of the log files.   Alternatively, if your database environment and
databases are still available after you lose the log file disk, you
should be able to dump your databases.  However, you may see an
inconsistent snapshot of your data after doing the dump, because
changes that were part of transactions that were not yet committed
may appear in the database dump.  Depending on the value of the data,
a reasonable alternative may be to perform both the database dump and
the catastrophic recovery and then compare the databases created by
the two methods.
<p>Regardless, for these reasons, storing your databases and log files on
different disks should be considered a safety measure as well as a
performance enhancement.
<p>Finally, you should be aware that Berkeley DB does not protect against all
cases of stable storage hardware failure, nor does it protect against
simple hardware misbehavior (for example, a disk controller writing
incorrect data to the disk).  However, configuring the database for
checksums will ensure that any such corruption is detected.
<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/transapp/filesys.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/transapp/tune.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font>
</body>
</html>