<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <link href="style.css" rel="stylesheet">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
    <title>DAR's Limitations</title>
  </head>
  <body>


    <div class=top>
      <img alt="DAR's Documentation" src="dar_s_doc.jpg" style="float:left;">
      <h1>DAR's Limitations</h1>
    </div>


    <p>Here follows a description of
      the known limitations you should consult before creating a bug report
      for dar:
    </p>

    <h2>Fixed Limits</h2>
    <ul>
      <li>
	the size of SLICES may be limited by the file system or
	the kernel (for example, the maximum file size is 2 GB with Linux
	kernel 2.2.x); other limits may exist depending on the file system used.
      </li>
      <li>
	the number of SLICES is only limited by the size of the
	filenames: with a basename of 10 characters, and considering a file
	system that supports at most 256 characters per filename, you could
	already get up to 10^241 SLICES (a 1 followed by 241 zeros). As soon
	as your file system supports bigger files or longer filenames, dar
	will follow without change (see the naming sketch just after this
	list).
      </li>
      <li>
	dar_manager can gather up to 65534 different backups, no
	more. This limit should be high enough not to be a problem.
      </li>
      <li>
	when using a listing file to define which files to operate
	on (-[ and -] options of dar), each line of the listing file must not
	be longer than 20479 bytes, otherwise a new line is assumed at the
	20480th byte.
      </li>
    </ul>
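    <p>
      To illustrate the slice-naming limit above: dar inserts the slice
      number between the basename and the ".dar" extension, so only the
      maximum filename length bounds how many slices can exist. A minimal
      sketch (the basename "mybase" and the 1 GiB slice size are arbitrary
      choices):
    </p>
    <pre>
dar -c mybase -s 1G
# produces: mybase.1.dar, mybase.2.dar, mybase.3.dar, ...
# with a 10-character basename and 256-character filenames,
# the slice number itself may grow up to 241 digits
    </pre>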

    <h2>System variable limits</h2>

    <h3>Memory</h3>
    <p>
      Dar uses virtual memory (= RAM + swap) to be able to add the list of
      files saved at the end of each archive. Dar uses its own integer type
      (called "infinint") that has no limit (unlike 32-bit or 64-bit
      integers). This already makes dar able to manage Zettabyte volumes and
      above, even if systems cannot yet manage such file sizes.
      Nevertheless, this brings an overhead in memory and CPU usage, added to
      the C++ overhead for the data structures. All together, dar needs an
      average of 650 bytes of virtual memory per saved file with dar-2.1.0 and
      around 850 with dar-2.4.x
      (that's the price to pay for new features). Thus, for example,
      if you have 110,000 files to save, whatever the total amount of data
      to save, dar will require around 67 MB of virtual memory.
    </p>
    <p>
      Now, when doing catalogue
      extraction or differential backup, dar holds two catalogues in
      memory, thus the amount of memory needed is
      double (134 MB in the example). Why? Because for a differential backup,
      dar starts with the catalogue of the archive of reference, which is
      needed to know which files to save and which not to save, and on the
      other hand builds the catalogue of the new archive all along the
      process. As for catalogue extraction, the process is equivalent to
      making a differential backup just after a full backup.
    </p>
    <p>
      As you can guess, merging two archives into a third one requires even
      more memory (memory to store the first archive to merge, the second
      archive to merge and the resulting archive to produce).
    </p>
    <p>
      This memory issue is not a limit by itself, but you need enough virtual
      memory to be able to save your data (if necessary you can still
      add swap space, as a partition or as a plain file, as sketched below).
    </p>
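    <p>
      For reference, here is a minimal sketch of adding swap space as a
      plain file on a Linux system (the path "/swapfile" and the 2 GiB size
      are arbitrary choices; run as root):
    </p>
    <pre>
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile      # later, "swapoff /swapfile" removes it again
    </pre>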

    <h3><a name="Integers"></a>Integers</h3>
    <p>
      To overcome the memory issue explained
      above, dar can be built in another mode. In this other mode,
      "infinint" is replaced by 32-bit or 64-bit integers, as selected by
      the --enable-mode=32 or --enable-mode=64 option given to the
      configure script. The executables built this way (dar, dar_xform,
      dar_slave and dar_manager) run faster and use much less memory than the
      "full" versions using "infinint". But yes, there are drawbacks:
      slice size, file size, dates, number of files to backup, total archive
      size (sum of all slices), etc. are bounded by the maximum value of the
      used integer, which is 4,294,967,295 for 32-bit and
      18,446,744,073,709,551,615 for 64-bit integers. Concretely, the 32-bit
      version cannot handle dates after year 2106 nor file sizes over 4 GB,
      while the 64-bit version cannot handle dates after around 500 billion
      years (which is longer than the estimated age of the Universe: about 14
      billion years) nor files larger than around 18 EB
      (18 <a href="usage_notes.html#bytes_bits_kilo">exabytes</a>).
    </p>
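    <p>
      For example, building the 64-bit flavour from source might look like
      this (a sketch of the usual autoconf sequence, run from the unpacked
      source tree; adapt options and paths to your system):
    </p>
    <pre>
./configure --enable-mode=64     # or --enable-mode=32
make
make install
    </pre>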
    <p>
      Since version 2.5.4 another parameter depends on the maximum supported
      integer: the number of entries in a given archive. In other
      words, you will not be able to have more than 4 giga files (4
      billion files) in a single archive when using libdar32, nor more than
      18 exa files with libdar64, while there is no limitation with libdar
      based on infinint.
    </p>
    <p>
      What is the behaviour when such a
      limit is reached? For compatibility with the rest of the code, limited-length
      integers (32-bit or 64-bit for now) cannot be used as-is: they are
      wrapped in a C++ class, which reports overflow in arithmetic
      operations. Archives generated with all the different versions of dar
      stay compatible with each other, but the 32-bit and 64-bit builds will
      not be able to read or produce all possible archives. In that case, the
      dar suite programs abort with an error message asking you to use the
      "full" version of dar.
    </p>

    <h3>Command line</h3>
    <p>
      On several systems, long command-line
      options are not available. This is due to the fact that dar relies on
      GNU getopt. Systems like FreeBSD do not provide GNU getopt by default;
      instead, the getopt function of their standard library supports
      neither long options nor optional arguments. On such systems you will
      have to use short options only, and to overcome the lack of optional
      arguments you need to give the argument explicitly. For example, in
      place of "-z" use "-z 9", and so on (see
      <a href="man/index.html">dar's man page</a>,
      section "EXPLICIT OPTIONAL ARGUMENTS"). All options for dar's
      features are available with FreeBSD's getopt, just using short options
      and explicit arguments.
    </p>
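    <p>
      As an illustration (the backup name "mybackup" and the directory
      "/home/me" are placeholders):
    </p>
    <pre>
dar -c mybackup -R /home/me -z      # GNU getopt: optional argument omitted
dar -c mybackup -R /home/me -z 9    # works everywhere: explicit argument
    </pre>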

    <p>
      Alternatively, you can install GNU getopt as a
      separate library called libgnugetopt. If the include file
      &lt;getopt.h&gt; is also available, the configure script will detect it
      and use this library. This way you can have long options on FreeBSD,
      for example.
    </p>
    <p>
      Another point concerns the command-line length limitation. All systems
      (correct me if I am wrong) limit the size of the command line. If
      you want to pass more options to dar than your system can afford, you
      can use the -B option instead and put all of dar's arguments (or just
      some of them) in the file pointed to by this -B option. -B can be used
      several times on the command line and is recursive (you can use -B from
      a file read by the -B option), as sketched below.
    </p>
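    <p>
      A minimal sketch (the file names "common.dcf" and "verbose.dcf" are
      hypothetical):
    </p>
    <pre>
# contents of common.dcf:
-z 9
-s 1G
-B verbose.dcf        # recursion: pulls options from another file

# the command line then shrinks to:
dar -c mybackup -R /home/me -B common.dcf
    </pre>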

    <h3>Dates</h3>

    <p>
      Unix files have up to four dates (see the stat(1) sketch after this
      list):
    </p>

    <ul style="text-align: justify; margin-left: 40px;">
      <li>last modification date (mtime)</li>
      <li>last access date (atime)</li>
      <li>last inode change (ctime)</li>
      <li>
	creation date (birthtime) [not all Unix systems
	support this date; Linux does not, for example]
      </li>
    </ul>
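    <p>
      These dates can be displayed with the stat(1) command. Below is an
      illustrative excerpt of its output on a GNU/Linux system (the file
      name and timestamps are placeholders; Birth shows "-" where the
      system does not report it):
    </p>
    <pre>
$ stat somefile
  ...
Access: 2024-01-15 10:32:01.123456789 +0100
Modify: 2024-01-10 09:00:00.000000000 +0100
Change: 2024-01-12 18:45:30.000000000 +0100
 Birth: -
    </pre>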

    <p>
      In dar, these dates are stored as
      integers (the number of seconds elapsed
      since Jan 1st, 1970), as Unix systems do; since release 2.5.0 dar can
      also save, store and restore the microsecond part of these dates on
      systems that support it. As seen above, the limitation is not due to
      dar but to the integer type used, so if you use infinint you should be
      able to store any date as far in the future as you want. Of course dar
      cannot store dates before Jan 1st, 1970, but that should not be a very
      big problem, as there should not be any surviving files older than that
      epoch ;-)
    </p>

    <p>
      There is no standard way under Unix
      to change the ctime, so dar is not able to restore the ctime of
      files.
    </p>

    <h3>Symlinks</h3>
    <p>
      On systems that provide the lutimes() system call, the dates of a
      symlink can be restored. On systems
      that do not provide that system call, modifying the mtime of an
      existing symlink actually modifies the mtime of the file targeted by
      that symlink, leaving the mtime of the symlink itself untouched! For
      that reason, without lutimes() support, dar avoids restoring the mtime
      of symlinks.
    </p>

    <h3>Multi-threading performance and memory requirement</h3>
    <p>
      Do not expect the execution time to be exactly divided by the number of
      threads you use, compared with the equivalent single-threaded
      execution. First, there is an overhead to having several
      threads working in the same process (protocols, synchronization
      on shared memory). Second, by its nature, some of the work cannot
      always be processed by several threads.
    </p>
    <p> With release 2.7.0, the two most
      CPU-intensive tasks involved in building and using a dar
      backup (compression and encryption) have been enhanced to support
      multiple threads. There still remain some lighter tasks (CRC
      calculation, escape sequences/tape marks, sparse file
      detection/reconstruction) that are difficult to parallelize; as they
      require much less CPU than the first two tasks, the multi-threading
      investment (processing, but also software development) is not
      worthwhile for them.
    </p>
    <p>
      Third, the ciphering/deciphering process may be interrupted when
      seeking in the archive for a particular file, which slightly reduces
      the performance of multi-threading. For compression, even when
      multi-threading is used,
      compression stays per file (to bring robustness) and is thus reset
      after each new file. The gain is very interesting for large files, and
      you had better not select too small a block size for compression:
      several tens of kilobytes or even a megabyte. Large compression
      blocks (which are scattered between different threads) are
      interesting because small files will be handled
      by a single thread with little overhead while large files will
      leverage multiple threads.
      Last, your disk access, or the network when using a remote SFTP or FTP
      repository within dar, may become the bottleneck that drives the
      total execution time of the operation. In such a context, adding more
      threads will most probably not bring any improvement.
    </p>
    <p>
      The drawback of the multi-threading used within libdar is the
      memory requirement. Each worker thread works on a block
      in memory (in fact two: one for the compressed/ciphered data, the other
      for its decompressed/deciphered counterpart). In addition
      to the worker threads you specify, two helper threads are needed:
      one to dispatch the data blocks to the worker threads, the other
      to gather the resulting blocks. These two are not CPU-intensive but may
      hold memory blocks for treatment while the worker threads do too.
      To summarize, the memory footprint of multi-threading grows linearly
      with two factors: the number of workers <em>times</em> the size
      of the blocks used.
    </p>
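    <p>
      A rough back-of-the-envelope estimate based on the description above
      (the worker count and block size are arbitrary examples, and the exact
      bookkeeping inside libdar may differ):
    </p>
    <pre>
# each worker holds ~2 blocks (compressed/ciphered + clear counterpart),
# plus a few blocks held in transit by the two helper threads
workers=4
block=$(( 1024 * 1024 ))                  # 1 MiB blocks
echo "$(( workers * 2 * block / 1048576 )) MiB held by workers alone"   # -> 8 MiB
    </pre>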
  </body>
</html>