% PARACLU Paraclu User Manual
% Martin Frith

NAME
====

paraclu - finds clusters in data attached to sequences

paraclu-cut - subset the output of paraclu

SYNOPSIS
========

paraclu [*minValue*] [*my_input*] > [*my_output*]

paraclu-cut.sh [*my_output*] > [*my_cut*]

DESCRIPTION
===========

Paraclu finds clusters in data attached to sequences. It was first
applied to transcription start counts in genome sequences, but it
could be applied to other things too.

Paraclu is intended to explore the data, imposing minimal prior
assumptions, and letting the data speak for itself. One consequence
of this is that paraclu can find clusters within clusters. Real data
sometimes exhibits clustering at multiple scales: there may be large,
rarefied clusters; and within each large cluster there may be several
small, dense clusters.

SETUP
=====

Using the command line, go into the paraclu directory and type
"make". This assumes you have a C++ compiler.

INPUT
=====

The input to paraclu should have four columns, like this:

    chr1    +    17689    3

The first column is the sequence name, then the strand, then the
coordinate, then the data value. For example, this might mean that we
observed 3 transcripts starting at position 17689 on the + strand of
chromosome 1.

All the data for one strand of one sequence should appear
consecutively (else it will treat the data as coming from different
sequences). Furthermore, the data for one strand of one sequence
should be in ascending order of coordinate (else it will complain).

USAGE
=====

If the data is in a file called "my_input", run paraclu like this:

    paraclu 30 my_input > my_output

This will write the output to a file called "my_output". The "30"
tells it to omit clusters whose total data value is less than 30. (In
other words, it omits clusters where the sum of the data values in
the cluster is less than minValue.)

If you wish to read standard input (e.g. from a pipe), use the
special file name "-".

OUTPUT
======

The output has one cluster per line.
It has eight columns, like this:

    chr1    +    787298    787382    64    317    0.5    2.56

- Column 1: the sequence name.
- Column 2: the strand.
- Column 3: the first position in the cluster.
- Column 4: the last position in the cluster.
- Column 5: the number of positions with data in the cluster.
- Column 6: the sum of the data values in the cluster.
- Column 7: the cluster's "minimum density".
- Column 8: the cluster's "maximum density".

For an explanation of "density", please consult the paraclu
publication (see below). Briefly, the greater the fold-change between
min and max density, the more prominent the cluster, and the less
likely that it is due to chance fluctuations in the data.

paraclu-cut.sh
==============

This script simplifies the output of paraclu by selecting a subset of
the clusters. The usage is like this:

`paraclu-cut.sh my_output > my_cut`

This performs the following steps:

1. Remove single-position clusters.
2. Remove clusters longer than 200. (Length = column_4 - column_3.)
3. Remove clusters with (maximum density / baseline density) < 2.
4. Remove any cluster that is contained in a larger cluster.

The "baseline density" of a cluster X is the "minimum density" of the
outermost cluster that contains X (or is X) and passed step 2.

Options:

    -h  show a help message and exit
    -l  maximum cluster length (default 200)
    -d  minimum density increase (default 2)
    -s  use an alternative version of step 3: remove clusters with
        (maximum density / minimum density) < 2

MISCELLANEOUS
=============

The original paraclu is a perl script, which is available here:
http://people.binf.ku.dk/albin/supplementary_data/tss_code/

The new version works identically to the original, but is much faster
and copes with much bigger data.

LICENSE
=======

Paraclu is distributed under the GNU General Public License, either
version 3 of the License, or (at your option) any later version. For
details, see COPYING.txt.
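
EXAMPLE: FILTERING STEPS
========================

The four steps listed under paraclu-cut.sh can be sketched in Python
as below. This is an illustrative sketch written from the description
in this manual, not the script itself. The names `Cluster`,
`parse_cluster`, `contains` and `cut` are hypothetical. It assumes
that a "single-position" cluster is one where column 3 equals column
4, that input lines are whitespace-separated, and that step 4 compares
only clusters that survived steps 1 to 3. It compares every pair of
clusters, so it is suitable for small files only.

```python
from collections import namedtuple

# One paraclu output line: the eight columns described under OUTPUT.
Cluster = namedtuple(
    "Cluster",
    "seq strand start end sites total min_density max_density",
)


def parse_cluster(line):
    """Parse one whitespace-separated output line (assumed format)."""
    f = line.split()
    return Cluster(f[0], f[1], int(f[2]), int(f[3]), int(f[4]),
                   float(f[5]), float(f[6]), float(f[7]))


def contains(outer, inner):
    """True if inner lies within outer on the same sequence and strand."""
    return (outer.seq == inner.seq and outer.strand == inner.strand
            and outer.start <= inner.start and inner.end <= outer.end)


def cut(clusters, max_length=200, min_density_rise=2.0):
    """Apply steps 1 to 4 as described in the manual (illustrative only)."""
    # Step 1: drop single-position clusters (assumption: start == end).
    # Step 2: drop clusters longer than max_length (length = end - start).
    kept = [c for c in clusters
            if c.start != c.end and c.end - c.start <= max_length]

    # Step 3: compare each cluster's maximum density with its baseline,
    # the minimum density of the outermost step-2 survivor containing it.
    survivors = []
    for c in kept:
        outermost = max((o for o in kept if contains(o, c)),
                        key=lambda o: o.end - o.start)
        # Written as a product so that no division is needed.
        if c.max_density >= min_density_rise * outermost.min_density:
            survivors.append(c)

    # Step 4: drop any cluster contained in another surviving cluster.
    return [c for c in survivors
            if not any(o != c and contains(o, c) for o in survivors)]
```

The `max_length` and `min_density_rise` arguments mirror the -l and -d
options. The -s variant of step 3 is not covered by this sketch.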

REFERENCE
=========

If you use paraclu in your research, please cite:

"A code for transcription initiation in mammalian genomes"
MC Frith, E Valen, A Krogh, Y Hayashizaki, P Carninci, A Sandelin
Genome Research 2008 18(1):1-12.

CONTACT
=======

Website: http://www.cbrc.jp/paraclu/

E-mail: paraclu (ATmark) cbrc (dot) jp
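
EXAMPLE: PREPARING INPUT
========================

The INPUT section requires that all data for one strand of one
sequence appear consecutively and in ascending order of coordinate.
The Python sketch below shows one way to reorder rows to meet both
requirements. The function name `sort_for_paraclu` is hypothetical,
and the sketch assumes whitespace-separated columns and writes
tab-separated rows, neither of which is stated in this manual. It
does not merge rows that share a coordinate.

```python
def sort_for_paraclu(lines):
    """Return rows grouped by (sequence, strand), ascending by coordinate."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError("expected four columns: %r" % line)
        seq, strand, coordinate, value = fields
        # Keep the data value as text so its formatting is unchanged.
        rows.append((seq, strand, int(coordinate), value))

    # Sorting by sequence, then strand, then numeric coordinate makes each
    # (sequence, strand) group contiguous and ascending.
    rows.sort(key=lambda r: (r[0], r[1], r[2]))

    # Assumption: tab-separated output is acceptable as input.
    return ["\t".join((s, st, str(c), v)) for s, st, c, v in rows]
```

The returned rows could then be written to a file such as "my_input",
or piped to paraclu using the special file name "-".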