1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520
|
Shell Scripts for use with
fastDNAml and DNArates
SUMMARY
UNIX shell scripts have proven quite useful in running the fastDNAml and/or
DNArates programs. They have been used in two different contexts. First,
many of the program options can be invoked by simple editing of the input.
The second category are scripts that help run and maintain results of the
program.
bootstrap add B (bootstrap) option (and optional seed) to input
categories add C (rate categories) option and values to input
categories_file add Y (categories file) option to input (DNArates)
clean_checkpoints remove checkpoint files when there is a finished treefile
clean_jumbles remove all but one optimal jumble for a given result
fastDNAml_boot loop over bootstrap seeds, doing 1 or more jumbles each
fastDNAml_loop do jumbles, stopping when same best tree found n times
frequencies add F option and user-defined frequencies to input
global add G (global) option (and optional region size) to input
jumble add J (jumble) option (and optional seed) to input
min_info add M (minimum information) option and value to input
n_categories add C (categories) option (without rate values) to input
out.PID append process ID to output file name of a program
outgroup add O (outgroup) option and number to input
printdata add 1 (print data) option to input
quickadd add Q (quickadd) option to input
restart add R (restart) option and checkpoint tree to input
scores summarize and sort likelihoods from jumble output files
transition add T (transition/transversion) option and value to input
treefile add Y (treefile) option to input
trees2NEXUS combine trees and add a NEXUS wrapper for PAUP and MacClade
trees2prolog convert Newick format trees to prolog facts
userlengths add L (userlengths) option to input
usertree add U (usertree) option, tree count, and tree(s) to input
usertrees add U (usertree) option, tree count, and tree(s) to input
weights add W (userweight) option and values to the input
weights_categories add W and C options and values to the input
SCRIPTS THAT INVOKE DNAML OPTIONS
GENERAL COMMENTS:
The program fastDNAml takes data from standard input. Thus, to run the
program with data in the file called "infile", the command would be
fastDNAml <infile
In this case, the output goes to standard output (generally the user's
terminal). To put in into a file, one can use output redirection as in
fastDNAml <infile >outfile
Because of the use of standard input, the input to fastDNAml can by
preprocessed by a function, and then piped to the program. For example,
bootstrap <infile | fastDNAml >outfile
or
bootstrap 137 <infile | fastDNAml >outfile
can be used to add the bootstrap option and a random number seed to the input,
and then pass it on to fastDNAml for analysis.
Many of the fastDNAml options are amenable to this arrangement. In
each case, the preprocessing can simply add options (and auxiliary data
lines, as necessary) to the input. In addition to avoiding the need to play
with UNIX text editors, there are several advantages to this approach:
1. The files remain relatively compatible with PHYLIP DNAML.
2. It reduces the chance of introducing errors into the data.
3. It is easier to try alternative options on the same data.
4. If the data for each sequence are provided in one long line (so that
interleaved and non-interleaved formats are the same), then some text
editors will truncate the lines.
Shell scripts are available for each of the above program options. The
corresponding formats and effects are described below.
THE SCRIPTS:
BOOTSTRAP (B)
Format: bootstrap [random_seed]
Example: bootstrap <infile.phylip | fastDNAml >outfile
Example: bootstrap 137 <infile.phylip | fastDNAml >outfile
Adds a bootstrap option and a random number seed to the input. If the random
seed is not supplied, then the process ID of the bootstrap shell is used. Thus,
repeated executions of the first example will tend to generate different random
samples (note that many systems only use about 32000 process IDs, so once you
get above 100 repetitions, reuse of the same number may become a significant
concern).
CATEGORIES (C)
Format: categories categories_data_file
Example: categories archae.rates <archaea.phylip | fastDNAml >archaea.out
Adds the categories option and the corresponding data to the input. The data
must have the format specified for PHYLIP dnaml 3.3. The first line must be
the letter C, followed by the number of categories (a number in the range 1
through 35), and then a blank-separated list of the rates for each category.
(The list can take more than one line; the program reads until it finds the
specified number of rate values.) The next line should be the word Categories
followed by one rate category character per sequence position. The categories
1 - 35 are represented by the series 1, 2, 3, ..., 8, 9, A, B, C, ..., Y, Z.
These latter data can be on one or more lines. For example,
C 12 0.0625 0.125 0.25 0.5 1 2 4 8 16 32 64 128
Categories 5111136343678975AAA8949995566778888889AAAAAA9239898629AAAAA9
633792246624457364222574877188898132984963499AA9899975
or, with more categories,
C 35 0.16529 0.29525 0.34482 0.40272 0.47035 0.54933 0.64157
0.74930 0.87512 1.02207 1.19369 1.39413 1.62823 1.90164
2.22096 2.59389 3.02945 3.53815 4.13227 4.82615 5.63654
6.58301 7.68841 8.97943 10.48723 12.24822 14.30490 16.70694
19.51232 22.78878 26.61541 31.08459 36.30423 42.40033 256.00000
Categories 4HHZ282111 21ED48H1HD Z1CD171411 1118F111EI IHI8ELBZZZ ZZZZZZZZZZ
ZZZZZZZZZZ 1MJZZMJLKL ZKL1ZZZZZZ ZZZZZZZZZZ ZZZZZZZZGH HHIGG43FOZ
Z2B9111324 1ZZZ171Z11 1184GH11ZZ IB1BBZ111J IB1ILKF4L1 21AEDE8111
111111ED9K 2219L3HGJ1 1Z1ZZMONMH ZZOMSQLM8Z 11411
(Notice that spaces are permitted in the categories data, and that the values
can extend across multiple lines. However, this means that extra values are
not permitted.)
In order to generate output compatible with PHYLIP dnaml v3.3, this should be
the first option added (so that the categories data are inserted immediately
before the sequence data).
CATEGORIES_FILE (Y)
Format: categories_file
Adds the Y option to the input data for the DNArates program. Makes the program write a file of weights and categories that can be directly added to the input
for the fastDNAml program (see weights_categories script).
Example: categories_file <archaea.phylip | n_categories 17 | \
usertree archaea.tree | DNArates
This command line will find the site-specific rates for the sequence data in
archaea.phylip and the tree in archaea.tree, categorize them into 17 groups,
and write the resulting categories (and a weighting mask removing sites of
undetermined rates) into a file called weight_rate.PID, where PID is a number
(the ID of the process running DNArates).
FREQUENCIES (F)
Format: frequencies
Example: frequencies <archaea.phylip | fastDNAml
Adds empirical frequencies option (F) to the input stream.
GLOBAL (G)
Format: global [final_tree_rearrangements [partial_tree_rearrangements]]
Example: global <archaea.phylip | fastDNAml
Example: global 4 <archaea.phylip | fastDNAml
Example: global 0 0 <archaea.phylip | fastDNAml
Adds a global option to the input. If a rearrangement distance is
specified, then this value is added as part of the option auxiliary
information. In this latter case, it is essential that the input contain
(or the global command be preceded by) a B, J or T option.
Example: global <archaea.phylip | fastDNAml
Example: transition 2.0 <archaea.phylip | global 4 | fastDNAml
Example: transition 2.0 <archaea.phylip | global 0 0 | fastDNAml
The first example invokes the global rearrangement option for the completed
tree, exactly as with DNAML in the PHYLIP package. The second example performs
"regional" rearrangements of the completed tree, such that subtrees are moved
across as many as tree branch points before being reconnected. The final
example does not perform any tree rearrangements whatsoever, it just builds a
tree by sequentially adding each of the sequences. Notice that in the last
two examples, the T 2.0 option is added to the input BEFORE the global option.
JUMBLE (J)
Format: jumble [random_seed]
Example: jumble <archaea.phylip | fastDNAml
Example: jumble 137 <archaea.phylip | fastDNAml
Adds a jumble option and a random number seed to the input. If the random
seed is not supplied, then the process id of the jumble shell is used. Thus,
in the first example, the seed used is the process ID of the jumble shell
script. Repeated executions of the command line will tend to get different
random number seeds, and hence different addition orders. In the second
example, the seed is 137.
MIN_INFO (M)
Format: min_info minimum_unambiguous_residues
Example: categories_file <archaea.phylip | n_categories 17 | \
usertree archaea.tree | min_info 8 | DNArates
Adds the minimum information option and an auxiliary data line to the input
for the DNArates program. This changes the threshold value of unambiguous
residues that must be present in a column before the program will consider the
rate to be "defined". The default value is currently 4.
N_CATEGORIES
Format: n_categories number
Example: categories_file <archaea.phylip | n_categories 17 | \
usertree archaea.tree | min_info 8 | DNArates
Adds the C options and an auxiliary data line of the form "C number" to the
input. When the DNArates program categorizes the rates that it infers for
the sites in the alignment, this is the number of categories it will use.
OUTGROUP (O)
Format: outgroup outgroup_number
Example: outgroup 5 <archaea.phylip | treefile | fastDNAml > archaea.out
Adds the outgroup option and appropriate auxiliary data line to the input. The
example will infer a tree for the archaea data, root it on sequence 5, and
write a tree to treefile.PID, where PID is a number (the process ID of
fastDNAml). The textual output from fastDNAml (a description of the analysis)
is written to archaea.out.
PRINTDATA (1)
Format: printdata
Example: printdata <archaea.phylip | fastDNAml > archaea.out
Adds a printdata option to the input. In the example, the file archaea.out
will include an echoing of the data in addition to the usual output.
QUICKADD (Q)
Format: quickadd
Example: quickadd <archaea.phylip | fastDNAml > archaea.out
Adds a quickadd option to the input. This greatly decreases the time in
initially placing a new sequence in the growing tree (but does not change the
time required to subsequently test rearrangements). This will probably become
the default program behavior in the near future.
Any possible downside of the quickadd option would be a decreased frequency of
finding the globally optimal tree. Since you should NEVER depend on a single
order of addition yielding the best tree, multiple jumble runs will still be
the best way to check the reproducibility of any presumptively optimal tree.
Quickadd should let you do this more quickly!
RESTART (R)
Format: restart checkpoint_file_name
Example: quickadd <archaea.phylip | restart checkpoint.13728 | fastDNAml
Adds a restart option and appends the last tree in the checkpoint file to the
input. This conflicts with a usertree option. Note that this script makes a
valiant attempt to figure out the beginning location of the last tree in the
file. In particular, you should not modify lines in the checkpoint file (or
any other treefile that you supply) so that they end with a semicolon or a
period (other than the final line of a tree). The example will continue
building a tree for the archaea.phylip data from whereever the checkpoint
file left off.
TRANSITION-TRANSVERSION RATIO (T)
Format: transition ratio_value
Example: transition 2.0 <archaea.phylip | global 4 | fastDNAml
Adds the transition-transversion option and appropriate auxiliary data line
to the input. This option with a value of 2.0 (the program's default value)
can be used before a global or treefile option with auxiliary data (see the
example).
TREEFILE (Y)
Format: treefile [format_number]
Adds the treefile option, and, if requested, the treefile format auxiliary
data line to the input data. In this latter case, it is essential that the
input contain (or the treefile command be preceded by) a B, J or T option.
The only formats at present are 1 (default Newick format) and 2 (a prolog
phylip_tree fact).
Example: outgroup 5 <archaea.phylip | treefile | fastDNAml > archaea.out
Example: transition 2.0 <archaea.phylip | treefile 2 | fastDNAml
The first example will infer a tree for the archaea data, root it on sequence
5, and write a tree to treefile.PID, where PID is a number (the process ID of
fastDNAml). The second example writes a prolog fact for the tree inferred from
these same data. The transition/transversion ratio is supplied with its
default value, only so that the treefile format will be found in the input file.
USERLENGTHS (L)
Format: userlengths
Example: usertree archaea.tree <archaea.phylip | userlengths | fastDNAml
Adds the userlengths option to the input. This must be used in conjunction
with the usertree option, so it is usually easier to add the second parameter
to the usertree script (see below). This script is useful if the usertree is
already part of the input, and you only need to alter the lengths option.
USERTREE (U)
Format: usertree tree_file [ L ]
Adds the usertree option and appends the tree count and all of the trees in
the named file to the input. If the L parameter is included, it is added to
the options line to invoke the program userlengths option.
Example: usertree archaea.tree <archaea.phylip | userlengths | fastDNAml
Example usertree archaea.tree L <archaea.phylip | fastDNAml
These two examples are equivalent.
All of the trees in the file are added, so this is not an appropriate way to
evaluate the last tree in a checkpoint file. Restart is the script to use for
that latter task.
WEIGHTS (W)
Format: weights weight_data_file
Example: weights archaea.weight <archaea.phylip | fastDNAml
Adds the weights option (W) and the corresponding data to the input. The data
must have the format specified for PHYLIP dnaml 3.3. The first line must be
the word Weights and AT LEAST 3 BLANK SPACES, followed by one weight value per
sequence position. The weights 0 - 35 are represented by the series 0, 1, 2,
3, ..., 8, 9, A, B, C, ..., Y, Z. These latter data can be on one or more
lines. For example,
Weights 111111111111001100000100011111100000000000000110000110000000
111101111111111111111111011100000111001011100000000011
WEIGHTS and CATEGORIES (W C)
Format: weights_categories weight_and_category_data_file
Adds both the userweights and categories options to the input, along with the
weights and categories data in the named file. This is useful for importing
the data from the DNArates program into fastDNAml. The format of the data
file should conform to that of weights auxiliary data followed by categories
auxiliary data. As noted for the categories option above, in order to
generate output compatible with PHYLIP dnaml, this should be the first option
added (so that the categories data are inserted immediately before the
sequence data).
SCRIPTS THAT PERFORM OTHER USEFUL FUNCTIONS
clean_checkpoints
Usage: clean_checkpoints
Locates checkpoint files for which there is a corresponding treefile, and
deletes the former. This is useful in cleaning up the directory when the
treefile option is in use.
clean_jumbles
Usage: clean_jumbles [ -nosummary ] file_name_root
Finds all files with the form "file_name_root.PID", where PID is the process
ID appended by fastDNAml_boot and fastDNAml_loop to results from analyses with
different jumble seeds. Unless the nosummary flag is included, all likelihoods
are summarized in a file named "file_name_root.summary". The likelihoods are
screened for the best value. An output file with the optimal score is renamed
to "file_name_root.out", and the corresponding treefile (if any) is renamed
"file_name_root.tree". All other treefiles and output files matching the name
are deleted. Corresponding checkpoint files are also deleted.
fastDNAml_boot
Usage: fastDNAml_boot [-seed seed] [-boots nboot] \
[-max maxjumble] [-nice nicevalue] [-noclean] \
in_file n_best [ "dnaml_opt1 [ | dnaml_opt2 ... ]" ]
For the current bootstrap seed, the sequence input order is jumbled (up to
maxjumble times) until the same best tree is found n_best times. The output
files are then reduced to a summary of the scores produced by jumbling, and one
example of the best tree. The number process is then repeated with new
bootstrap seeds until nboot samples have been analyzed.
Boot, jumble, treefile and quickadd are included by the script and should not
be specified by the user or in the data file. AdditionalfastDNAml
program options are enclosed in quotes, and separated by vertical bars (|).
Flags and parameters:
in_file -- name of the input data file
n_best -- input order is jumbled (up to maxjumble times) until
same tree is found n_best times
-boots nboot -- number of different bootstrap samples (Default=1)
-seed seed -- seed for first bootstrap (Default is based on the process
ID and time of day)
-max maxjumble -- maximum attempts at replicating inferred tree
(Default=10)
-nice nicevalue -- run fastDNAml with specified nice value (Default=10)
-noclean -- inhibits cleanup of the output files for the individual
seeds
Example:
fastDNAml_boot -boots 100 -seed 137 -max 8 -nice 20 test_data.phylip 2
fastDNAml_loop
Usage: fastDNAml_loop [-max maxjumble] [-nice nicevalue] [-noclean] \
in_file n_best [ "dnaml_opt1 [ | dnaml_opt2 ... ]" ]
For the given input file, the sequence input order is jumbled (up to maxjumble
times) until the same best tree is found n_best times. The output files are
then reduced to a summary of the scores produced by jumbling, and one example
of the best tree.
Jumble, treefile and quickadd are included by the script and should not
be specified by the user or in the data file. AdditionalfastDNAml
program options are enclosed in quotes, and separated by vertical bars (|).
Flags and parameters:
in_file -- name of the input data file
n_best -- input order is jumbled (up to maxjumble times) until
same tree is found n_best times
-max maxjumble -- maximum attempts at replicating inferred tree
(Default=10)
-nice nicevalue -- run fastDNAml with specified nice value (Default=10)
-noclean -- inhibits cleanup of the output files for the individual
seeds
Example:
fastDNAml_loop -max 15 -nice 20 test_data.phylip 3
out.PID
Usage: out.PID program outfile
or out.PID -o outfile program program_args
When a process sends its output to standard out, this script puts it in a
file whose name is the outfile.PID, where outfile is the name given by the
user and PID is the process ID of the running program. This is useful when
a program is run a number of times, and output needs to be put in different
files. Since the fastDNAml program uses the PID in generating the checkpoint
file name and the tree file name, it is useful to be able to put the program
output itself in a file with the corresponding suffix.
scores
Usage: scores files
Extracts and sorts likelihood values from fastDNAml output files. The
corresponding file names are also provided.
trees2NEXUS
Usage: trees2NEXUS [ treefile ... ]
Concatenate Newick format trees from named file(s) or from standard input
into a single file suitable for the PAUP (tested) or MacClade (untested)
programs. For example, it is possible to combine trees from bootstrap
replicates in order to use the PAUP concensus tree functions.
trees2prolog
Usage: trees2prolog [ treefile ... ]
Accepts Newick format trees from named file(s) or from standard input and
as converts them to prolog facts of the form pseudoNewick(Comment, Tree).
Each tree must be entirely on one line, and they must be similar to the
form produced by the fastDNAml program. In particular, the only permitted
comment is one preceeding the whole tree. If there is no leading commment,
then the fact produced will have arity 1.
|