File: README

package info (click to toggle)
seqan2 2.5.2-1
links: PTS, VCS
area: main
in suites: forky, sid
size: 228,748 kB
sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (135 lines) | stat: -rw-r--r-- 5,421 bytes
parent folder | download | duplicates (2)
                 Fast String Mining of Multiple Databases
                        under Frequency Constraints

                   https://www.seqan.de/apps/dfi.html

---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
  1.   Overview
  2.   Installation
  3.   Usage
  4.   Output Format
  5.   Example
  6.   Contact
  7.   References

---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------

The Deferred Frequency Index (DFI) is a tool for string mining under
frequency constraints, i.e., predicates that evaluate solely the frequency
of a pattern occurrence in the data. The frequency of a pattern is defined
as the number of distinct sequences in a database that contain the pattern
at least once. Currently the implementation contains 3 different predicates
and can easily be extended by user-defined frequency predicates.
The frequencies are calculated during the construction of a suffix tree
over all databases, which enables to limit the index construction to a
problem-specific minimum referred to as the optimal monotonic hull.

---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------

There are precompiled executables for various platforms:

  dfi.exe  dfi for Windows
  dfi32    dfi for GNU Linux x86
  dfi      dfi for GNU Linux x86-64
  dfiOSX   dfi for Mac OS X on Intel

DFI is distributed with SeqAn - The C++ Sequence Analysis Library (see
https://www.seqan.de). To compile DFI on your system do the following:

  1)  Download the latest snapshot of SeqAn
  2)  Unzip it to a directory of your choice (e.g. snapshot)
  3)  cd snapshot/apps
  4)  make dfi
  5)  cd dfi
  6)  ./dfi --help

Alternatively you can check out the latest Git version of DFI and SeqAn
with:

  1)  git clone https://github.com/seqan/seqan.git
  2)  mkdir seqan/buld; cd seqan/build
  3)  cmake .. -DCMAKE_BUILD_TYPE=Release
  4)  make dfi
  5)  ./bin/dfi --help

On success, an executable file dfi was build and a brief usage description
was dumped.

---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------

To get a short usage description of DFI, you can execute dfi -h or
dfi --help.

Usage: dfi [OPTION]... --minmax  <min_1> <max_1> <database 1> ...
                       --minmax  <min_m> <max_m> <database m>
       dfi [OPTION]... --growth  <rho_s> <rho_g> <database 1> <database 2>
       dfi [OPTION]... --entropy <rho_s> <alpha> <database 1> <database 2>
                       ... <database m>

DFI implements 3 different frequency string mining problems:

  1) Frequent Pattern Mining Problem (--minmax)
  2) Emerging Substring Mining Problem (--growth)
  3) Entropy Substring Mining Problem (--entropy)

To choose between these problems the corresponding option must be given
with associated parameters. For problem 1 the --minmax option must be given
multiple times with the minimum and maximum frequency for each database.
Problem 2 expects the minimum support in database 1 (rho_s) and the minimum
growth rate from database 2 to database 1 (rho_g). Problem 3 expects the
minimum support in at least one database (rho_s) and the maximum entropy
(alpha).
As arguments the names of the databases in Fasta format must be given.
To speed up the suffix tree construction additional options can be used to
specify the alphabet, e.g. DNA, AminoAcid or text (default).
By default, DFI outputs every substring that satisfies the frequency
predicate. If the -m option is given only maximal substrings are output,
i.e. substrings that satisfy the predicate and are not part of a longer
substring with the same frequencies.

---------------------------------------------------------------------------
4. Output Format
---------------------------------------------------------------------------

The solution set is printed to standard out, one string per line. By
defining the DEBUG_ENTROPY symbol during compilation, the frequencies and
entropy can also be printed.

---------------------------------------------------------------------------
5. Example
---------------------------------------------------------------------------

As an example run under Linux or Mac OS X:
  ./dfi32  -g 1 2 example/fasta1.fa example/fasta2.fa
  ./dfiOSX -g 1 2 example/fasta1.fa example/fasta2.fa

or under Windows:
  dfi.exe -g 1 2 example\fasta1.fa example\fasta2.fa

The solution set of this example is:
ba
bab

---------------------------------------------------------------------------
6. Contact
---------------------------------------------------------------------------

For questions or comments, contact:
  David Weese <david.weese@fu-berlin.de>
  Marcel H. Schulz <marcel.schulz@molgen.mpg.de>

---------------------------------------------------------------------------
7. References
---------------------------------------------------------------------------

Weese, D., Schulz, M. H. (2008). Efficient String Mining under Constraints
via the Deferred Frequency Index. In: Proceedings of the 8th Industrial
Conference on Data Mining (ICDM’08), LNAI 5077, Springer, pp 374-388.