1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
|
Fast String Mining of Multiple Databases
under Frequency Constraints
https://www.seqan.de/apps/dfi.html
---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
1. Overview
2. Installation
3. Usage
4. Output Format
5. Example
6. Contact
7. References
---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------
The Deferred Frequency Index (DFI) is a tool for string mining under
frequency constraints, i.e., predicates that evaluate solely the frequency
of a pattern occurrence in the data. The frequency of a pattern is defined
as the number of distinct sequences in a database that contain the pattern
at least once. Currently the implementation contains 3 different predicates
and can easily be extended by user-defined frequency predicates.
The frequencies are calculated during the construction of a suffix tree
over all databases, which enables to limit the index construction to a
problem-specific minimum referred to as the optimal monotonic hull.
---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------
There are precompiled executables for various platforms:
dfi.exe dfi for Windows
dfi32 dfi for GNU Linux x86
dfi dfi for GNU Linux x86-64
dfiOSX dfi for Mac OS X on Intel
DFI is distributed with SeqAn - The C++ Sequence Analysis Library (see
https://www.seqan.de). To compile DFI on your system do the following:
1) Download the latest snapshot of SeqAn
2) Unzip it to a directory of your choice (e.g. snapshot)
3) cd snapshot/apps
4) make dfi
5) cd dfi
6) ./dfi --help
Alternatively you can check out the latest Git version of DFI and SeqAn
with:
1) git clone https://github.com/seqan/seqan.git
2) mkdir seqan/buld; cd seqan/build
3) cmake .. -DCMAKE_BUILD_TYPE=Release
4) make dfi
5) ./bin/dfi --help
On success, an executable file dfi was build and a brief usage description
was dumped.
---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------
To get a short usage description of DFI, you can execute dfi -h or
dfi --help.
Usage: dfi [OPTION]... --minmax <min_1> <max_1> <database 1> ...
--minmax <min_m> <max_m> <database m>
dfi [OPTION]... --growth <rho_s> <rho_g> <database 1> <database 2>
dfi [OPTION]... --entropy <rho_s> <alpha> <database 1> <database 2>
... <database m>
DFI implements 3 different frequency string mining problems:
1) Frequent Pattern Mining Problem (--minmax)
2) Emerging Substring Mining Problem (--growth)
3) Entropy Substring Mining Problem (--entropy)
To choose between these problems the corresponding option must be given
with associated parameters. For problem 1 the --minmax option must be given
multiple times with the minimum and maximum frequency for each database.
Problem 2 expects the minimum support in database 1 (rho_s) and the minimum
growth rate from database 2 to database 1 (rho_g). Problem 3 expects the
minimum support in at least one database (rho_s) and the maximum entropy
(alpha).
As arguments the names of the databases in Fasta format must be given.
To speed up the suffix tree construction additional options can be used to
specify the alphabet, e.g. DNA, AminoAcid or text (default).
By default, DFI outputs every substring that satisfies the frequency
predicate. If the -m option is given only maximal substrings are output,
i.e. substrings that satisfy the predicate and are not part of a longer
substring with the same frequencies.
---------------------------------------------------------------------------
4. Output Format
---------------------------------------------------------------------------
The solution set is printed to standard out, one string per line. By
defining the DEBUG_ENTROPY symbol during compilation, the frequencies and
entropy can also be printed.
---------------------------------------------------------------------------
5. Example
---------------------------------------------------------------------------
As an example run under Linux or Mac OS X:
./dfi32 -g 1 2 example/fasta1.fa example/fasta2.fa
./dfiOSX -g 1 2 example/fasta1.fa example/fasta2.fa
or under Windows:
dfi.exe -g 1 2 example\fasta1.fa example\fasta2.fa
The solution set of this example is:
ba
bab
---------------------------------------------------------------------------
6. Contact
---------------------------------------------------------------------------
For questions or comments, contact:
David Weese <david.weese@fu-berlin.de>
Marcel H. Schulz <marcel.schulz@molgen.mpg.de>
---------------------------------------------------------------------------
7. References
---------------------------------------------------------------------------
Weese, D., Schulz, M. H. (2008). Efficient String Mining under Constraints
via the Deferred Frequency Index. In: Proceedings of the 8th Industrial
Conference on Data Mining (ICDM’08), LNAI 5077, Springer, pp 374-388.
|