1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
<title></title>
</head>
<body>
<pre>
BLASTCLUST - BLAST score-based single-linkage clustering.
1. Clustering procedure.
BLASTCLUST automatically and systematically clusters protein or DNA sequences
based on pairwise matches found using the BLAST algorithm in case of proteins or
Mega BLAST algorithm for DNA. In the latter case a single Mega BLAST search is
performed for all the sequences combined against a database created from the
same sequences. BLASTCLUST finds pairs of sequences that have statistically
significant matches and clusters them using single-linkage clustering.
BLASTCLUST uses the default values for the BLAST and Mega BLAST parameters.
For protein sequences these are: matrix BLOSUM62; gap opening cost 11; gap
extension cost 1; no low-complexity filtering.
For DNA sequences: match reward 1, mismatch penalty -3, non-affine gapping costs
(see README.mbl document for explanation), wordsize 28.
In both cases e-value threshold is set to 1e-6.
For each pair of sequences the top-scoring alignment is evaluated according to
the following criteria:
x1 x2 HSP length on seqX: Hx = x2-x1+1
| | gaps in seqX: Gx
seqX ---======================----- seqX length: Lx
\\|||||||||||||||||// BLAST score: S
seqY ----====================------ number of identical residues: N
| | seqY length: Ly
y1 y2 gaps in seqY: Gy
HSP length on seqY: Hy = y2-y1+1
coverage of seqX: Cx = Hx/Lx
coverage of seqY: Cy = Hy/Ly
coverage: max(Cx,Cy) or min(Cx,Cy), depending on the value of -b option
alignment length Al = Hx+Gx = Hy+Gy
score density: S/min(Hx,Hy) or N/Al*100%
If the coverage is above a certain threshold
AND
the score density is above a certain threshold,
these two sequences are considered to be neighbored.
Thus determined neighbor relationships is considered symmetric and provides
the base for clustering by a single-linkage method (which puts a sequence
to a cluster if the sequence is a neighbor to at least one sequence in the
cluster).
2. Input formats.
The primary input format for BLASTCLUST is a FASTA-format sequence file.
Each sequence should have a unique identifier (as defined by formatdb).
BLASTCLUST formats this sequence set into a BLASTable database
(in the directory pointed to by the environment variable TMPDIR or in
the current directory), then removes the database.
Instead of a FASTA file, a database prepared by formatdb with -o option
set to TRUE can be supplied as an input.
Another type of input is a sequence hit-list previously saved by
BLASTCLUST (in this case BLASTCLUST will use pre-computed HSP data
instead of making de novo comparisons).
You can restrict clustering to a subset of your data by supplying an ID
list file (IDs separated by spaces, tabs, newlines, commas or semicolons).
This is supposed to be used for re-clustering subsets of sequences using
the previously computed hit-list file.
3. Output format.
BLASTCLUST prints out clusters of sequence IDs, sorted from largest to
smallest cluster (alphabetically by ID of the first sequence if of the
same size), separating clusters by a newline character. Sequence
identifiers within a cluster are space-separated and sorted from
longest to shortest sequence (alphabetically by IDs if of the same length).
4. Crash recovery.
If the program crashed because of system error you can restart it
using crash recovery mode. This works only if you were saving
hit-list during the clustering. Start the job with the same command
line as before, specifying the hit-list saving to the same file but
also set the "continue unfinished clustering" option to TRUE. The
process will restart from the last saved point and will append the
hit-list file.
5. Environment.
BLASTCLUST is supposed to work in a normal NCBI environment, in
particular:
BLOSUM62 matrix is available via .ncbirc or BLASTMAT environment
variable.
6. Program options
Input:
-i <file> sequence file in the FASTA format (default = stdin)
-d <file> sequence database name
-r <file> name of a hit-list file saved by BLASTCLUST
These three options are mutually exclusive.
-l <file> a file with a list of IDs to restrict the clustering,
applicable only when reclustering from a saved hit-list.
Thresholds:
-S <threshold> similarity threshold
if <3 then the threshold is set as a BLAST score density
(0.0 to 3.0; default = 1.75)
if >=3 then the threshold is set as a percent of identical
residues (3 to 100)
-L <threshold> minimum length coverage (0.0 to 1.0; default = 0.9)
-b <T|F> require coverage as specified by -L and -S on both (T) or
only one (F) sequence of a pair (default = TRUE)
Output:
-o <file> file to save cluster list (default = stdout)
-s <file> file to save hit-list (this file may be not portable across
platforms)
-p <T|F> protein (T) or nucleotide (F) sequences in the input
(default = TRUE)
Misc:
-C <T|F> continue unfinished clustering (crash recovery mode).
(default = FALSE)
-a <number> Number of CPU's to use in a multi-thread mode
(default = 1).
-v <logfile> Progress report destination (printed every 1000 sequences).
Set to F to suppress report messages (default = stderr).
-e <T|F> Enable sequence id parsing in database formatting. Set to F if
multiple sequences have identical ids (default = TRUE).
-W Word size to use for initial matches (default = 0, translates to 3 for
proteins and 32 for nucleotides).
-c <config file> Configuration file with advanced options, containing any
of the following options with their values, separated by whitespace:
-r, -q, -G, -E - match, mismatch, gap open and gap extension scores
respectively,
-e - e-value cut off,
-y, -X - the dropoff values for the ungapped and gapped extension respectively,
-A - window size for two-hit version,
-I - hitlist size,
-Y, -z - effective search space and database length respectively, to be used for
e-value and bit score calculations,
-F - filter string,
-s - raw score cut off for nucleotide search,
-S - strand option.
7. Credits:
Ilya Dondoshansky (dondosha@ncbi.nlm.nih.gov)
Yuri Wolf (wolf@ncbi.nlm.nih.gov)
05 August, 2000
8. Questions, requests and/or bug reports:
blast-help@ncbi.nlm.nih.gov
APPENDIX A.
Format of the hit-list file.
The hit-list file consists of the following parts:
- header
- sequence ID list
- sequence length list
- hit list
The byte-by-byte layout is platform-dependent; field sizes given here
are true for most UNIX platforms.
A.1. Header.
4-byte integer IDtype 1 if numeric IDs; 0 if string IDs
4-byte integer ListSz size of the ID list; if IDs are numeric this
is the number of SeqID records, otherwise this
is the length of the ID list (in bytes)
A.2. Sequence ID list.
If IDtype is 1 (numeric IDs) then the list is ListSz records of
4-byte integer SeqID sequence ID (numeric)
If IDtype is 0 (string IDs) then the list is a list of records of
var-length char SeqID sequence ID (string)
space (' ') separator
(total length is ListSz bytes; the number of sequences is equal to the number
of spaces).
A.3. Sequence length list.
This is a list of
4-byte integer SeqLen sequence length
A.4. Hit list.
The list consists of the following records going to the end of file:
4-byte integer N1 ordinal number of the 1st sequence
4-byte integer N2 ordinal number of the 2nd sequence
4-byte integer HSPL1 HSP length on the 1st sequence
4-byte integer HSPL2 HSP length on the 2nd sequence
8-byte float Score BLAST score
8-byte float PercId Percent of identical residues
</pre>
</body>
</html>
|