TransTermHP Version 2.07
CONTENTS
0. LICENSE & CREDITS
1. INSTALLATION
2. TRANSTERM USAGE
3. FORMAT OF THE TRANSTERM OUTPUT
4. TRANSTERM COMMAND LINE OPTIONS
5. RECALIBRATING USING DIFFERENT PARAMETERS
6. FORMAT OF THE EXPTERMS.DAT FILE
7. PORTING NOTES
8. 2NDSCORE PROGRAM
9. FORMAT OF THE .BAG FILES
10. USING TRANSTERM WITHOUT GENOME ANNOTATIONS
0. LICENSE & CREDITS
TransTermHP v. 2.0 is a complete rewrite by Carl Kingsford of TransTerm v. 1.0,
originally written by Maria D. Ermolaeva. The original TransTerm was described in
the paper:
[1] Maria D. Ermolaeva, Hanif G. Khalak, Owen White, Hamilton O. Smith and
Steven L. Salzberg. Prediction of Transcription Terminators in Bacterial
Genomes. J Mol Biol 301, (1), 27-33 (2000)
TransTermHP v 2.0 is free software and is distributed under the GNU Public
License. See the file LICENSE.txt included with TransTermHP for complete
details.
1. INSTALLATION
At present, TransTermHP has only been tested on UNIX-like systems with the
GCC/G++ compiler. To compile TransTermHP on such a system, "cd" into the
TransTermHP src directory, and type:
make clean transterm
If there are no errors reported, there should be a "transterm" executable file
in the same directory. You can move this executable anyplace that is
convenient. To save space, you can type:
make no_obj
to remove all the .o files that were created during compilation.
If you want to use TransTermHP on a non-UNIX-like system, see 'PORTING NOTES'
below for some tips.
2. TRANSTERM USAGE
The standard usage of TransTermHP is:
transterm -p expterm.dat seq.fasta annotation.ptt > output.tt
Any number of fasta and annotation files can be listed but fasta files should
come before annotation files. The type of the file is determined by the
extension:
.ptt a GenBank ptt annotation file
.coords or .crd a simple annotation file
Each line of a .coords or .crd file has the format:
gene_name start end chrom_id
The chrom_id specifies which sequence the annotation should apply to. For a
.ptt file, the chrom_id is taken to be the filename with the path and
extension removed. A filename with any other extension is assumed to be a
fasta file.
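The extension rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not TransTermHP's own code: the function names classify_input and parse_coords_line are hypothetical, and the sketch assumes that the fields of a .coords line are separated by whitespace.

```python
import os

def classify_input(path):
    """Return (kind, chrom_id) for an input file name.

    kind is "ptt", "coords" or "fasta". For .ptt files the chrom_id is the
    file name with the path and extension removed; for other kinds it is None.
    """
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext == ".ptt":
        return "ptt", stem
    if ext in (".coords", ".crd"):
        return "coords", None
    # Any other extension is treated as a fasta file.
    return "fasta", None

def parse_coords_line(line):
    """Split one .coords/.crd line into (gene_name, start, end, chrom_id).

    Assumption: fields are whitespace-separated.
    """
    gene_name, start, end, chrom_id = line.split()
    return gene_name, int(start), int(end), chrom_id
```

For example, classify_input("/data/NC_000913.ptt") yields the kind "ptt" and the chrom_id "NC_000913".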
When processing an annotation for a chromosome with id = ID, the first word of
each '>' line of the input sequences is searched for ID. Because there is no
good standard for how the '>' line is formatted, several heuristics are tried
to find ID in the '>' line. In the order tried, they are:
>ID
>junk|cmr:ID|junk or >junk|ID|junk
>junk|gi|ID|junk or >junk|gi|ID.junk|junk
>junk:ID
The option '-p expterm.dat' uses the newest confidence scheme, where
expterm.dat is the path to the file of that name supplied with TransTermHP. If
'-p expterm.dat' is omitted, the version 1.0 confidence scheme is used. See
section 'TRANSTERM COMMAND LINE OPTIONS' for more detail.
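As a rough illustration of the '>' line heuristics listed earlier in this section, the following Python sketch checks whether the first word of a header matches a given ID. It is one possible reading of the four patterns, not the matching code used by TransTermHP, and the function name header_matches is hypothetical.

```python
def header_matches(header, chrom_id):
    """Return True if the first word of a '>' line matches chrom_id.

    Sketch of the four patterns, tried in order:
      >ID
      >junk|cmr:ID|junk  or  >junk|ID|junk
      >junk|gi|ID|junk   or  >junk|gi|ID.junk|junk
      >junk:ID
    """
    words = header.lstrip(">").split()
    word = words[0] if words else ""
    if word == chrom_id:
        return True
    fields = word.split("|")
    if len(fields) > 1:
        # A '|'-delimited field that is exactly ID or cmr:ID
        # (this also covers gi|ID with an exact match).
        if chrom_id in fields or ("cmr:" + chrom_id) in fields:
            return True
        # gi|ID.junk : the field after "gi" starts with ID followed by a dot.
        for prev, cur in zip(fields, fields[1:]):
            if prev == "gi" and cur.startswith(chrom_id + "."):
                return True
    # junk:ID
    return word.endswith(":" + chrom_id)
```

A header such as ">x|gi|12345.1|y" matches the ID "12345" through the third pattern, while ">other" does not match "seq1" at all.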
3. FORMAT OF THE TRANSTERM OUTPUT
The organism's genes are listed sorted by their end coordinate and terminators
are output between them. A terminator entry looks like this:
TERM 19   15310 - 15327  -       F     99     -12.7 -4.0   |bidir
(name)    (start - end)  (sense) (loc) (conf) (hp)  (tail) (notes)
where 'conf' is the overall confidence score, 'hp' is the hairpin score, and
'tail' is the tail score. 'Conf' (which ranges from 0 to 100) is what you
probably want to use to assess the quality of a terminator. Higher is better.
The confidence, hp score, and tail scores are described in the paper cited
above. 'Loc' gives the type of region the terminator is in:
'G' = in the interior of a gene (at least 50bp from an end),
'F' = between two +strand genes,
'R' = between two -strand genes,
'T' = between the ends of a +strand gene and a -strand gene,
'H' = between the starts of a +strand gene and a -strand gene,
'N' = none of the above (for the start and end of the DNA)
Because of how overlapping genes are handled, these designations are not
exclusive. 'G', 'F', or 'R' can also be given in lowercase, indicating that
the terminator is on the opposite strand from the region. Unless the
--all-context option is given, only candidate terminators that appear to be in
an appropriate genome context (e.g. T, F, R) are output.
Following the TERM line is the sequence of the hairpin and the 5' and 3'
tails, always written 5' to 3'.
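For downstream processing, a TERM line can be split into its fields. The sketch below assumes the fields are whitespace-separated and appear in the order shown in the example above; the function name parse_term_line and the dictionary keys are illustrative choices, not part of TransTermHP.

```python
def parse_term_line(line):
    """Parse one 'TERM ...' line into a dict of named fields.

    Assumes the layout shown in the example:
      TERM <n> <start> - <end> <sense> <loc> <conf> <hp> <tail> [notes]
    """
    tok = line.split()
    if len(tok) < 10 or tok[0] != "TERM":
        raise ValueError("not a TERM line: %r" % line)
    return {
        "name": "TERM " + tok[1],
        "start": int(tok[2]),
        "end": int(tok[4]),          # tok[3] is the '-' between start and end
        "sense": tok[5],
        "loc": tok[6],
        "conf": float(tok[7]),
        "hp": float(tok[8]),
        "tail": float(tok[9]),
        "notes": " ".join(tok[10:]),
    }
```

Parsed entries can then be filtered on the "conf" value, which is the field recommended above for assessing terminator quality.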
4. TRANSTERM COMMAND LINE OPTIONS
You can set how large a hairpin must be to be considered:
--min-stem=n Stem must be at least n nucleotides long
--min-loop=n Loop portion of the hairpin must be at least n long
You can also set the maximum size of the hairpin that will be found:
--max-len=n Total extent of hairpin <= n NT long
--max-loop=n The loop portion can be no longer than n
The maximum length is the total length for the hairpin portion (2 stems, 1
loop) and does not include the U-tail. It's measured in nucleotides in the
input sequence, so because of gaps, the actual structure may be longer than
max-len. Max-len must be less than the compiled-in constant REALLY_MAX_UP
(which by default is 1000). To increase the size of structures found, recompile
after increasing this constant.
TransTermHP assigns a score to the hairpin and tail portions of potential
terminators. Lower scores are considered better. Many of the constants used in
scoring hairpins can be set from the command line:
--gc=f Score of a G-C pair
--au=f Score of an A-U pair
--gu=f Score of a G-U pair
--mm=f Score of any other pair
--gap=f Score of a gap in the hairpin
The cost of loops of various lengths can be set using:
--loop-penalty=f1,f2,f3,f4,f5,...fn
where f1 is the cost of a loop of length --min-loop, f2 is the cost of a loop
of length --min-loop+1, and so on. If there are too few terms to cover up to
max-loop, the last term is repeated. Thus --loop-penalty=0,2 would assign cost
0 to any loop of length min-loop, and 2 to any longer loop (up to max-loop,
after which longer loops are given infinite scores). Extra terms are ignored.
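The way the --loop-penalty list is expanded can be sketched as follows. This is an illustration of the rule as described, not the program's code: the function name loop_cost is hypothetical, the min-loop and max-loop values in the usage note are example values, and returning an infinite cost for loops shorter than min-loop is an assumption (such hairpins are simply not considered).

```python
import math

def loop_cost(length, min_loop, max_loop, penalties):
    """Cost of a loop of the given length under a --loop-penalty list.

    penalties[0] applies to length == min_loop, penalties[1] to
    min_loop + 1, and so on; the last term is repeated up to max_loop.
    Loops longer than max_loop get an infinite score.
    """
    if length < min_loop or length > max_loop:
        # Assumption: loops shorter than min_loop are also disallowed.
        return math.inf
    idx = min(length - min_loop, len(penalties) - 1)
    return penalties[idx]
```

With --loop-penalty=0,2 and, say, min-loop 3 and max-loop 13, a loop of length 3 costs 0, loops of length 4 through 13 cost 2, and a loop of length 14 has infinite cost.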
Note that if you are using the --pval-conf confidence scheme (see below), you
must regenerate the expterm.dat file if you change any of the above constants.
To weed out any potential terminator with tail or hairpin scores that are too
large, you can use the following options:
--max-hp-score=f Maximum allowable hairpin score
--max-tail-score=f Maximum allowable tail score
Terminator hairpins must be adjacent to a "U-rich" region. You can adjust the
constants that define what constitutes a U-rich region. Using the options:
--uwin-size=s
--uwin-require=r
requires that there are at least r 'U' nucleotides in the s-nucleotide-long
window adjacent to the hairpin. Again, if you change these constants, you
should regenerate expterms.dat.
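The U-rich requirement amounts to a simple count over a window. In the sketch below, the function name is_u_rich is hypothetical, and it is assumed that the caller passes the sequence adjacent to the hairpin starting at the hairpin boundary, and that 'T' in a DNA input counts as 'U'.

```python
def is_u_rich(adjacent_seq, uwin_size, uwin_require):
    """Check the --uwin-size / --uwin-require condition.

    adjacent_seq: sequence next to the hairpin, starting at the hairpin
    boundary (assumption). 'T' is counted as 'U' for DNA input.
    """
    window = adjacent_seq[:uwin_size].upper()
    u_count = window.count("U") + window.count("T")
    return u_count >= uwin_require
```

For instance, with a window size of 6 and a requirement of 3 (example values only), "TTTTATTGC" passes because its first six bases contain five T's, while "GCGCGTTT" fails because its first six contain only one.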
Before the main output, TransTermHP will output the values of the above options
in a format suitable to be used on the command line.
In addition to the tail and hairpin scores, each possible terminator is
assigned a confidence --- a value between 0 and 100 that indicates how likely
it is that the sequence is a terminator. The scoring scheme needs a background
file (supplied with TransTermHP) that is specified using:
--pval-conf expterms.dat
This will use the distribution in the file expterms.dat as the background. (You can
abbreviate this as "-p expterms.dat".) Though the supplied expterms.dat file is
derived from random sequences, any background distribution can be used by
supplying your own expterms.dat file. See below for the format of
expterms.dat. The values in expterms.dat depend on the scoring constants,
definition of u-rich regions, and the maximum allowed tail and hp scores.
Thus, if you change any of these constants using the options above, you should
regenerate expterms.dat.
The main output of TransTermHP is a list of terminators interleaved between a
listing of the gene annotations that were provided as input. This output can
be customized in a few ways:
-S Don't output the terminator sequences
--min-conf=n Only output terminators with confidence >= n (can
abbreviate this as -c n; default is 76.)
Additional analysis output can be obtained with the following options:
--bag-output file.bag Output the Best terminator After Gene
--t2t-perf file.t2t Output a summary of which tail-to-tail regions
have good terminators
5. RECALIBRATING USING DIFFERENT PARAMETERS
As mentioned above, if you change any of the basic scoring function and search
parameters and are using the version 2.0 confidence scheme (recommended) then
you have to recompute the values in the expterm.dat file. If you have Python
installed this is easy (though perhaps time-consuming). You can issue the
command:
% calibrate.sh newexpterms.dat [OPTIONS TO TRANSTERM]
where "[OPTIONS TO TRANSTERM]" are TransTermHP options (discussed above) that
set the parameters to what you want them to be. After calibrate.sh finishes,
newexpterms.dat will be in the current directory and can serve as an argument
to -p when using the same parameters you passed to calibrate.sh.
Note that for the newexpterms.dat to be valid, you must supply the same basic
parameters to TransTermHP on subsequent runs. TransTerm (or newexpterms.dat)
will not remember these parameters for you. The best way to handle this is to
make a shell script wrapper around transterm that always passes in your new
parameters.
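A wrapper of the kind suggested above does not have to be a shell script. The Python sketch below shows the idea: the option values in CALIBRATED_OPTIONS are placeholders, not recommendations, and should be replaced with exactly the options that were passed to calibrate.sh. The names CALIBRATED_OPTIONS, EXPTERMS, build_command and run are hypothetical.

```python
import subprocess

# Placeholder values: use exactly the options given to calibrate.sh.
CALIBRATED_OPTIONS = ["--min-stem=5", "--max-loop=10"]
# The file produced by calibrate.sh with those options.
EXPTERMS = "newexpterms.dat"

def build_command(user_args):
    """Build the transterm command line with the calibrated options fixed."""
    return ["transterm", "-p", EXPTERMS] + CALIBRATED_OPTIONS + list(user_args)

def run(user_args):
    """Run transterm with the calibrated options; returns its exit status.

    Assumes a 'transterm' executable is on the PATH.
    """
    return subprocess.call(build_command(user_args))
```

Calling run(["seq.fasta", "annotation.ptt"]) then always pairs newexpterms.dat with the parameters it was computed for, so the two cannot drift apart between runs.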
Output formatting parameters do not require regeneration of expterms.dat ---
see discussion above for which parameters expterm.dat depends on.
6. FORMAT OF THE EXPTERMS.DAT FILE
The 'pval-conf' confidence scheme, selected with the option "--pval-conf
expterms.dat" (or '-p expterms.dat') computes the confidence of a terminator
with HP energy E and tail energy T as follows. First, the ranges of HP
energies and tail energies are evenly divided into bins, and the appropriate
bins e and t are found for E and T. Then the confidence is computed as
described in [2].
The first line of expterms.dat contains 6 numbers:
seqlen num_bins
The (low_hp, high_hp) and (low_tail, high_tail) ranges give the bounds on the
hairpin and tail scores. The integer num_bins gives the number of
equally-sized bins into which those ranges are divided. Seqlen gives the
length of the random sequence that was used to generate the data in the rest
of the file.
Following this line are any number of (at, R, M) triples, where 'at' is the AT
content, R is a 4-tuple (low_hp, high_hp, low_tail, high_tail) giving the
range of the HP and tail scores observed in random sequences of this AT
content, and M is the distribution matrix. These (at, R, M) triples are
formatted as follows:
at low_hp high_hp low_tail high_tail
n11 n12 n13 n14 ... n1,num_bins
n21 ...
...
n_num_bins,1 ...
The mu_r(e,t) term is computed by selecting the matrix with the at value
closest to the computed %AT of the region r. If the total length of region r
sequence is L_r, then
mu_r(e,t) = n_t_e * L_r/seqlen
where n_t_e is the entry in the t-th row and e-th column of the selected
matrix, and seqlen is the first number in the first line of the file.
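The mu_r(e,t) computation can be written out as a short Python sketch. It assumes the matrices have already been read into a dictionary keyed by their 'at' value, that bins are indexed from 0, and that the region's AT content is expressed in the same units as the 'at' values in the file; the function name expected_count is hypothetical.

```python
def expected_count(matrices, region_at, e_bin, t_bin, region_len, seqlen):
    """Compute mu_r(e,t) = n_t_e * L_r / seqlen.

    matrices: dict mapping an 'at' value to its matrix (list of rows).
    region_at: AT content of region r, in the same units as the keys.
    e_bin, t_bin: 0-based hairpin and tail bin indices (assumption).
    """
    # Select the matrix whose 'at' value is closest to the region's AT content.
    closest_at = min(matrices, key=lambda at: abs(at - region_at))
    # Entry in the t-th row and e-th column of the selected matrix.
    n_t_e = matrices[closest_at][t_bin][e_bin]
    return n_t_e * region_len / seqlen
```

As a toy example with made-up numbers: if the closest matrix has the entry 20 at row t and column e, the region is 1000 bases long, and seqlen is 10000, then mu_r(e,t) = 20 * 1000 / 10000 = 2.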
7. PORTING NOTES
If you want to run TransTermHP on a non-UNIX-like system, you should take note
of the following:
* gene-reader.cc assumes that the filename extension separator is "." and the
path separator is "/".
* getopt_long() is used to process the command line arguments.
8. 2NDSCORE PROGRAM
The package also comes with a program '2ndscore' which will find the best
hairpin anchored at each position. The basic usage is:
2ndscore in.fasta > out.hairpins
For every position in the sequence this will output a line:
-0.6 52 .. 62 TTCCTAAAGGTTCCA GCG CAAAA TGC CATAAGCACCACATT
(score) (start .. end) (left context) (hairpin) (right context)
For positions near the ends of the sequences, the context may be padded with
'x' characters. If no hairpin can be found, the score will be 'None'.
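A line of 2ndscore output can be parsed along these lines. The sketch assumes whitespace-separated fields laid out as in the example above, with the hairpin printed as several space-separated pieces between the two contexts. How a line with a score of 'None' is laid out is not specified here, so the sketch only maps the score itself to Python's None and otherwise expects the same layout. The function name parse_2ndscore_line is hypothetical.

```python
def parse_2ndscore_line(line):
    """Parse one 2ndscore output line into a dict.

    Expected layout (as in the example):
      <score> <start> .. <end> <left context> <hairpin pieces...> <right context>
    """
    tok = line.split()
    score = None if tok[0] == "None" else float(tok[0])
    return {
        "score": score,
        "start": int(tok[1]),
        "end": int(tok[3]),        # tok[2] is the '..' separator
        "left_context": tok[4],
        "hairpin": tok[5:-1],      # pieces of the hairpin, in 5' to 3' order
        "right_context": tok[-1],
    }
```

Applied to the example line, this gives a score of -0.6, the range 52 to 62, and the hairpin pieces GCG, CAAAA and TGC.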
Multiple fasta files can be given and multiple sequences can be in each fasta
file. The output for each sequence will be separated by a line starting with
'>' and containing the FASTA description of the sequence.
Because the hairpin scores of the plus-strand and minus-strand may differ (due
to GU binding in RNA), by default 2ndscore outputs two sets of hairpins for
every sequence: the FORWARD hairpins and the REVERSE hairpins. All the forward
hairpins are output first, and are identified by having the word 'FORWARD' at
the end of the '>' line preceding them. Similarly, the REVERSE hairpins are
listed after a '>' line ending with 'REVERSE'. If you want to search only one
or the other strand, you can use:
--no-fwd Don't print the FORWARD hairpins
--no-rvs Don't print the REVERSE hairpins
You can set the energy function used, just as with transterm, with the --gc,
--au, --gu, --mm, --gap options. The --min-loop, --max-loop, and --max-len
options are also supported.
9. FORMAT OF THE .BAG FILES
The columns for the .bag files are, in order:
1. gene_name
2. terminator_start
3. terminator_end
4. hairpin_score
5. tail_score
6. terminator_sequence
7. terminator_confidence: a combination of the hairpin and tail score that
takes into account how likely such scores are in a random sequence. This
is the main "score" for the terminator and is computed as described in
the paper.
8. APPROXIMATE_distance_from_end_of_gene: The *approximate* number of base
pairs between the end of the gene and the start of the terminator. This
is approximate in several ways: First, (and most important) TransTermHP
doesn't always use the real gene ends. Depending on the options you give
it may trim some off the ends of genes to handle terminators that
partially overlap with genes. Second, where the terminator "begins"
isn't that well defined. This field is intended only for a sanity check
(terminators reported to be the best near the ends of genes shouldn't be
_too far_ from the end of the gene).
10. USING TRANSTERM WITHOUT GENOME ANNOTATIONS
TransTermHP uses known gene information for only 3 things: (1) tagging the
putative terminators as either "inside genes" or "intergenic," (2) choosing the
background GC-content percentage to compute the scores, because genes often
have different GC content than the intergenic regions, and (3) producing
slightly more readable output. Items (1) and (3) are not really necessary, and
(2) has no effect if your genes have about the same GC-content as your
intergenic regions.
Unfortunately, TransTermHP doesn't yet have a simple option to run without an
annotation file (either .ptt or .coords), and requires at least 2 genes to be
present. The solution is to create fake, small genes that flank each
chromosome. To do this, make a fake.coords file that contains only these two
lines:
fakegene1 1 2 chrom_id
fakegene2 L-1 L chrom_id
where L is the length of the input sequence and L-1 is 1 less than the length
of the input sequence. "chrom_id" should be the word directly following the ">"
in the .fasta file containing your sequence. (If, for example, your .fasta file
began with ">seq1", then chrom_id = seq1).
This creates a "fake" annotation with two 1-base-long genes flanking the
sequence in a tail-to-tail arrangement: --> <--. TransTermHP can then be run
with:
transterm -p expterm.dat sequence.fasta fake.coords
If the G/C content of your intergenic regions is about the same as your genes,
then this won't have too much of an effect on the scores terminators receive.
On the other hand, this use of TransTermHP hasn't been tested much at all, so
it's hard to vouch for its accuracy.
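The fake.coords file described above can be generated from the fasta file itself. The following Python sketch takes the text of a fasta file and produces the two lines per sequence, using the first word after '>' as chrom_id and the summed length of the sequence lines as L. The function name fake_coords is hypothetical, and numbering the genes consecutively across multiple sequences is an arbitrary choice.

```python
def fake_coords(fasta_text):
    """Return the contents of a fake.coords file for the given fasta text.

    For each sequence, emits:
      fakegeneN   1    2   chrom_id
      fakegeneN+1 L-1  L   chrom_id
    where chrom_id is the first word after '>' and L is the sequence length.
    """
    records = []  # list of [chrom_id, length]
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            records.append([words[0] if words else "", 0])
        elif line and records:
            records[-1][1] += len(line)

    out = []
    n = 0
    for chrom_id, length in records:
        out.append("fakegene%d 1 2 %s" % (n + 1, chrom_id))
        out.append("fakegene%d %d %d %s" % (n + 2, length - 1, length, chrom_id))
        n += 2
    return "\n".join(out) + "\n"
```

Writing the returned string to fake.coords gives a file that can be passed to transterm after the fasta file, as in the command shown above.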