1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273
|
\input texinfo @c -*-texinfo-*-
@c %**start of header
@setfilename physamp.info
@settitle PhySamp Manual 1.1.0
@documentencoding UTF-8
@afourpaper
@dircategory Science Biology Genetics
@direntry
* bppalnoptim: (bppalnoptim) Alignment Optimizer.
* bppphysamp: (bppphysamp) Phylogenetic Sampling.
@end direntry
@c %**end of header
@copying
This is the manual of the PhySamp package, version 1.1.0.
Copyright @copyright{} 2018 Julien Y. Dutheil
@end copying
@titlepage
@title PhySamp Manual
@author Julien Dutheil
@author @email{dutheil@@evolbio.mpg.de}
@c The following two commands start the copyright page.
@page
@vskip 0pt plus 1fill1
@insertcopying
@end titlepage
@c Output the table of contents at the beginning.
@contents
@ifnottex
@node Top, Introduction, (dir), (dir)
@top The PhySamp Manual
@insertcopying
@menu
* Introduction::
* bppalnoptim::
* bppphysamp::
@detailmenu
--- The Detailed Node Listing ---
Introduction
* Description::
* Run::
Output files
* bppphysamp::
@end detailmenu
@end menu
@end ifnottex
@c ------------------------------------------------------------------------------------------------------------------
@node Introduction, bppalnoptim, Top, Top
@chapter Introduction
PhySamp is a software package for phylogenetic sampling. Its aim is to optimize a sequence alignment data set based on its corresponding phylogeny.
PhySamp is written on the Bio++ libraries, and uses the BppO syntax for options, introduced with the Bio++ Program Suite software (BppSuite). Part of this manual will therefore link to the corresponding manual of BppSuite where needed, and only describe options specific to the PhySamp package.
Note that several detailed examples are provided along with the source code of the program, and can serve as good training starts. This manual intends to provide an exhaustive description of the options used in these examples.
@c ------------------------------------------------------------------------------------------------------------------
@menu
* Description::
* Run::
@end menu
@node Description, Run, Introduction, Introduction
@section Description of the programs
The PhySamp package currently contains two programs:
@table @samp
@item bppalnoptim
optimize a sequence alignment by removing sequences in order to maximize the number of sites with a minimal coverage.
@item bppphysamp
optimize a sequence alignment by removing similar sequences.
@end table
@c ------------------------------------------------------------------------------------------------------------------
@node Run, , Description, Introduction
@section How to run the program
Programs in the PhySamp package follow the @samp{BppO} syntax. They are command line driven, and take as input options with the format @samp{name}=@samp{value}. These options can be gathered into a file, and loaded using @command{param=optionfile}. Please refer to the Bio++ Program Suite manual for more details, including the use of variables, priority of option values, etc.
@c ------------------------------------------------------------------------------------------------------------------
@node bppalnoptim, bppphysamp, Introduction, Top
@chapter Sampling alignment to maximize the number of sites with minimum coverage
The option files for @sc{bppalnoptim} fall in three sections: input of alignment and guide tree, filter parameters and output files.
@section Input files
@sc{bppalnoptim} reads a sequence alignment file using the same syntax as all bppSuite programs (@ref{Sequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Sequences.html#Sequences}}). All formats are supported, as well as sequence filtering options.
In order to perform its sampling procedure, @sc{bppalnoptim} requires a @emph{guide tree}. This guide tree can be provided as input or built using a sequence hierarchical clustering approach.
Providing a guide tree will be in most cases faster, as software like FastTree can produce accurate phylogenetic trees for large data sets in a very short time.
@itemize @bullet
@item
User-defined guide tree are provided using the option @command{input.tree.method=File}, and the standard option for reading tree files in the BppO syntax (@ref{Sequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Tree.html#Tree}}).
@item
Clustering trees are built using the option @command{input.tree.method = AutoCluster(linkage=@{linkage_mode@})}, where @command{@{linkage_mode@}} is one of @emph{single}, @emph{complete}, @emph{median}, @emph{average}, @emph{centroid} or @emph{ward}.
@end itemize
@section Sampling mode and parameters
A few option enables the configuration of the sampling procedure:
@table @command
@item filter_gaps = @{yes/no@}
Tell if gap characters ('-') should be considered as missing data.
@item filter_unresolved = @{yes/no@}
Tell if unresolved characters such as 'N' (nucleotides) and 'X' (proteins) should be considered as missing data.
@item threshold = @{float[0,1]@}
The minimum coverage for a site to be included in the analysis. A threshold of 0.7 means that only sites with no more than 30\% gaps (or unresolved characters, if @command{filter_unresolved=yes}) will be considered suitable.
@item method = @{Diagnostic|Input|Auto@}
One of three possible running modes, which are detailed below.
@item comparison = @{MaxSites|MaxChars@}
Criterion to compare sequences subsets. @command{MaxSites} will maximize the number of sites with minimal coverage (default), while @command{MaxChars} will maximize the amount of data (number of sequences * number of sites).
@end table
Here comes a description of the three running modes:
@subsection Running mode: Input
This is an interactive mode, where the user is prompt for which sequence to select at each iteration:
@example
@group
Number of sequences in alignment.......: 1652
Number of sites in alignment...........: 129
Number of chars in alignment...........: 182603
Choice Node #seq %seq #sites %sites #chars %chars
1) 2030 1647 -5 131 +2 184714 2111
2) 404 1647 -5 131 +2 184636 2033
3) 1709 1647 -5 131 +2 184604 2001
4) 1735 1647 -5 131 +2 184592 1989
5) 1836 1647 -5 131 +2 184590 1987
Your choice (0 to stop):
@end group
@end example
Entering 0 will stop the filtering procedure and output the current sequence selection. Possible choices are ordered according to the @command{comparison} option.
In the above example, the current selection has 1,652 sequences and 129 sites with the given minimal coverage.
Choice 1 proposes to remove 5 sequences, which would bring the total number of sequences to 1,647. This would gain 2 more sites with the required minimum coverage.
Choice 2 also proposes to remove 5 sequences in order to add 2 more sites, but the total number of characters gained would be a bit lower. This means that even if both selection have the same number of sites, the ones in choice 1 have a higher coverage than the ones in choice 2. Picking option 1 leads to another proposal:
@example
@group
Number of sequences in alignment.......: 1647
Number of sites in alignment...........: 131
Number of chars in alignment...........: 184714
Choice Node #seq %seq #sites %sites #chars %chars
1) 1930 1645 -2 132 +1 185794 1080
2) 2246 1645 -2 132 +1 185792 1078
3) 1950 1645 -2 132 +1 185788 1074
4) 2093 1645 -2 132 +1 185786 1072
5) 413 1645 -2 132 +1 185784 1070
Your choice (0 to stop):
@end group
@end example
and so on. The alogrithm ends either when the user enter '0' or when no more improving option exist.
The number of proposal at each iteration can be specified using the @command{show=@{int>0@}} argument:
@command{method=Input(show=5)}.
@subsection Running mode: Diagnostic
This mode will automatically pick the best choice at each iteration according to the @command{comparison} criterion until no more improving choice is proposed.
The goal of this mode is mainly to plot a trade-off curve (see the example directory provided with the source code distribution).
Only log files are produced, no sequence or tree output are written.
@subsection Running mode: Auto
This mode works in a similar way as the Diagnostic mode, by picking the best choice at each iteration, but will stop when a given condition is reached and produce output alignments and trees.
Currently two stopping conditions are implemented, which can be combined:
@table @command
@item min_nb_sequences=@{int>0@}
will stop when no more improvement can be found or when all improving steps would bring the number of sequences below the given number.
@item min_relative_nb_sequences=@{int[0,1]@}
will stop when no more improvement can be found or when all improving steps would bring the proportion of remaining sequences below the given threshold.
This option only has an effect if min_nb_sequences is not set or set to 0.
@item min_nb_sites=@{int>0@}
will stop when the number of sites matching the requested coverage is at least the given number.
@item min_relative_nb_sites=@{int[0,1]@}
will stop when the number of sites matching the requested coverage is at least the given proportion of total alignment sites.
This option only has an effect if min_nb_sites is not set or set to 0.
@end table
For instance, @command{Auto(min_nb_sequences=100, min_nb_sites=300)} will stop the filtering when no improvement can be found, at least 300 sites have the minimum requested coverage or when there is no more than 200 sequences left.
@command{Auto(min_nb_sequences=100)} will perform all possible improvements which do not bring the final number of sequences below 100. @command{Auto(min_nb_sites=300)} will perform all possible improvements until 300 sites have the requested coverage. This might result in having only one sequence left in the output file.
@section Output files
@sc{bppalnoptim} generates several ouput files:
@table @command
@item output.log = @{path@}
Logfile containing the characteristics of each iteration. This file can be used to plot the trade-off curve (see Diagnostic mode).
@end table
The sampled alignment is written using the standard options in the BppO syntax (@ref{WritingSequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingSequences.html#WritingSequences}}).
Optionally, the sampled tree can also be written using options from the BppO syntax (@ref{WritingTrees, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingTrees.html#WritingTrees}}).
@c ------------------------------------------------------------------------------------------------------------------
@menu
* bppphysamp::
@end menu
@node bppphysamp, , bppalnoptim, Top
@chapter Sampling alignment to remove redundant sequences
The Bio++ Phylogenetic Sampler samples sequences from a file according to phylogenetic information.
The goal is to clean a big data set by removing redundant sequences, bringing only few additional information for evolutionary analyses.
The BppPhySamp programs uses the common options for setting the alphabet, loading the sequences (@ref{Sequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Sequences.html#Sequences}}) and (@ref{Tree, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Tree.html#Tree}}) and writing the resulting data set (@ref{WritingSequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingSequences.html#WritingSequences}}, @ref{WritingTrees, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingTrees.html#WritingTrees}}).
@table @command
@item input.method = @{tree|matrix@}
The method to provide phylogenetic information, either by a tree or a matrix.
If the @option{tree} option is used, then the options for reading trees are used (@ref{Tree, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Tree.html#Tree}}).
@item input.matrix = @{path@} [[input.method = matrix]]
The input matrix file.
@item deletion_method = @{random|threshold|sample@}
Method to use to remove sequence.
@item threshold = @{float>0@} [[deletion_method = threshold ]]
The minimum distance separating two sequences in the sampled data set.
Any sequences closer than this threshold in the original data set will be confronted so that only one is kept.
@item sample_size = @{int>0@} [[deletion_method = sample|random ]]
The number of sequences to keep in the final data set.
@item choice_criterion = @{length|length.complete|random@}
How to chose between closely related sequences? @option{length} takes the longest (maximum number of non-gap positions), @option{length.complete} takes the sequence with the maximum number of fully resolved positions and @option{random} picks one sequence at random.
@end table
@c ------------------------------------------------------------------------------------------------------------------
@c end of document
@c @node Index, , Reference, Top
@c @unnumbered Index
@c
@c @printindex cp
@bye
|