Frequency-based String Mining (lite)
===

A single-core implementation of frequency-based substring mining. This
implementation requires the https://github.com/simongog/sdsl-lite
library (tested with the release `sdsl-lite-2.0.3`).

1. Download and extract https://github.com/simongog/sdsl-lite/archive/v2.0.3.tar.gz,
2. install SDSL by running `./install.sh /install/path/sdsl-lite-2.0.3`, where `/install/path` needs to be specified,
3. set the correct SDSL installation path in `fsm-lite/Makefile`,
4. turn on the preferred compiler optimization in `fsm-lite/Makefile`, and
5. run `make depend && make` in the `fsm-lite` directory.

For command-line options, see `./fsm-lite --help`.

Usage example
---

Input files are given as a list of `<data-identifier>` `<data-filename>` pairs, one pair per line. Each `<data-identifier>` is assumed to be unique. Here is an example of how to construct such a list from all `/input/dir/*.fasta` files:

  `for f in /input/dir/*.fasta; do id=$(basename "$f" .fasta); echo "$id" "$f"; done > input.list`

The files can then be processed by 

  `./fsm-lite -l input.list -t tmp | gzip - > output.txt.gz`

where `tmp` is the filename prefix used for storing temporary index files.
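
Because the identifiers are assumed to be unique, it may be worth checking the list before running `fsm-lite` on it. The following is a minimal sketch for a POSIX shell, not part of fsm-lite: the function names `make_input_list` and `check_unique_ids` are hypothetical helpers introduced here for illustration, and the sketch assumes that identifiers and paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not shipped with fsm-lite.

# Print one "<data-identifier> <data-filename>" pair per *.fasta file in directory $1.
make_input_list() {
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        printf '%s %s\n' "$(basename "$f" .fasta)" "$f"
    done
}

# Report identifiers (first column of list file $1) that occur more than once.
# Returns non-zero if any duplicate is found.
check_unique_ids() {
    dups=$(cut -d ' ' -f 1 "$1" | sort | uniq -d)
    if [ -n "$dups" ]; then
        # $dups is intentionally unquoted: one message is printed per duplicate
        printf 'duplicate identifier: %s\n' $dups >&2
        return 1
    fi
}
```

With these defined, `make_input_list /input/dir > input.list && check_unique_ids input.list` writes the list and stops with a non-zero status if two files would map to the same identifier.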

TODO
---
1. Optimize the time and space usage.
2. Multi-threading.
3. Support for gzip-compressed input.