\input texinfo   @c -*-texinfo-*-
@c %**start of header
@setfilename physamp.info
@settitle PhySamp Manual 1.1.0
@documentencoding UTF-8
@afourpaper
@dircategory Science Biology Genetics
@direntry
* bppalnoptim: (bppalnoptim)    Alignment Optimizer.
* bppphysamp: (bppphysamp)      Phylogenetic Sampling.
@end direntry
@c %**end of header


@copying
This is the manual of the PhySamp package, version 1.1.0.

Copyright @copyright{} 2018 Julien Y. Dutheil
@end copying

@titlepage
@title PhySamp Manual
@author Julien Dutheil
@author @email{dutheil@@evolbio.mpg.de}

@c The following two commands start the copyright page.
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@c Output the table of contents at the beginning.
@contents

@ifnottex
@node Top, Introduction, (dir), (dir)
@top The PhySamp Manual

@insertcopying

@menu
* Introduction::
* bppalnoptim::
* bppphysamp::

@detailmenu
 --- The Detailed Node Listing ---

Introduction

* Description::
* Run::

Output files

* bppphysamp::

@end detailmenu
@end menu

@end ifnottex

@c ------------------------------------------------------------------------------------------------------------------

@node Introduction, bppalnoptim, Top, Top
@chapter Introduction

PhySamp is a software package for phylogenetic sampling. Its aim is to optimize a sequence alignment data set based on its corresponding phylogeny.

PhySamp is built on the Bio++ libraries and uses the BppO syntax for options, introduced with the Bio++ Program Suite software (BppSuite). Parts of this manual therefore refer to the corresponding sections of the BppSuite manual where needed, and describe only the options specific to the PhySamp package.

Note that several detailed examples are provided along with the source code of the program, and are a good starting point. This manual intends to provide an exhaustive description of the options used in these examples.

@c ------------------------------------------------------------------------------------------------------------------

@menu
* Description::
* Run::
@end menu

@node Description, Run, Introduction, Introduction
@section Description of the programs

The PhySamp package currently contains two programs:

@table @samp

@item bppalnoptim
optimize a sequence alignment by removing sequences in order to maximize the number of sites with a minimal coverage.

@item bppphysamp
optimize a sequence alignment by removing similar sequences.

@end table

@c ------------------------------------------------------------------------------------------------------------------

@node Run,  , Description, Introduction
@section How to run the program

Programs in the PhySamp package follow the @samp{BppO} syntax. They are command line driven, and take as input options with the format @samp{name}=@samp{value}. These options can be gathered into a file, and loaded using @command{param=optionfile}. Please refer to the Bio++ Program Suite manual for more details, including the use of variables, priority of option values, etc.
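
As a minimal sketch, assuming a hypothetical option file named @file{myoptions.bpp} in the current directory, a run could be launched as shown in the first line below. The second line passes one additional option directly on the command line, using the same @samp{name}=@samp{value} format:

@example
@group
bppalnoptim param=myoptions.bpp
bppalnoptim param=myoptions.bpp threshold=0.8
@end group
@end example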

@c ------------------------------------------------------------------------------------------------------------------

@node bppalnoptim, bppphysamp, Introduction, Top
@chapter Sampling alignment to maximize the number of sites with minimum coverage

The option files for @sc{bppalnoptim} fall into three sections: input of the alignment and guide tree, filter parameters, and output files.

@section Input files

@sc{bppalnoptim} reads a sequence alignment file using the same syntax as all BppSuite programs (@ref{Sequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Sequences.html#Sequences}}). All formats are supported, as well as sequence filtering options.

In order to perform its sampling procedure, @sc{bppalnoptim} requires a @emph{guide tree}. This guide tree can be provided as input or built using a sequence hierarchical clustering approach.
Providing a guide tree will in most cases be faster, as software like FastTree can produce accurate phylogenetic trees for large data sets in a very short time.

@itemize @bullet
@item
A user-defined guide tree is provided using the option @command{input.tree.method=File}, together with the standard options for reading tree files in the BppO syntax (@ref{Tree, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Tree.html#Tree}}).
@item
Clustering trees are built using the option @command{input.tree.method = AutoCluster(linkage=@{linkage_mode@})}, where @command{@{linkage_mode@}} is one of @emph{single}, @emph{complete}, @emph{median}, @emph{average}, @emph{centroid} or @emph{ward}.
@end itemize
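
The two alternatives could be written as follows in an option file. This is only a sketch: the file name @file{guide_tree.dnd} is hypothetical, and @command{input.tree.file} and @command{input.tree.format} are assumed to be the standard BppO options for reading trees, as described in the BppSuite manual. Only one of the two alternatives should be kept in a real option file.

@example
@group
# Alternative 1: guide tree provided by the user
input.tree.method = File
input.tree.file = guide_tree.dnd
input.tree.format = Newick

# Alternative 2: guide tree built by hierarchical clustering
input.tree.method = AutoCluster(linkage=complete)
@end group
@end example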

@section Sampling mode and parameters 

A few options configure the sampling procedure:
@table @command
@item filter_gaps = @{yes/no@}
Tells whether gap characters ('-') should be considered as missing data.
@item filter_unresolved = @{yes/no@}
Tells whether unresolved characters such as 'N' (nucleotides) and 'X' (proteins) should be considered as missing data.
@item threshold = @{float[0,1]@}
The minimum coverage for a site to be included in the analysis. A threshold of 0.7 means that only sites with no more than 30% gaps (or unresolved characters, if @command{filter_unresolved=yes}) will be considered suitable.
@item method = @{Diagnostic|Input|Auto@}
One of three possible running modes, which are detailed below.
@item comparison = @{MaxSites|MaxChars@}
Criterion used to compare sequence subsets. @command{MaxSites} will maximize the number of sites with minimal coverage (default), while @command{MaxChars} will maximize the amount of data (number of sequences * number of sites).
@end table
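
For illustration, the options above could be combined as in the following sketch; the values are arbitrary examples, not recommendations. With these settings and a hypothetical alignment of 10 sequences, a site would be considered suitable only if at most 3 of its 10 characters are gaps or unresolved characters.

@example
@group
filter_gaps = yes
filter_unresolved = yes
threshold = 0.7
method = Diagnostic
comparison = MaxSites
@end group
@end example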

The three running modes are described below.

@subsection Running mode: Input

This is an interactive mode, where the user is prompted to make a choice at each iteration:
@example
@group
Number of sequences in alignment.......: 1652
Number of sites in alignment...........: 129
Number of chars in alignment...........: 182603
Choice	Node	#seq	%seq	#sites	%sites	#chars	%chars
1)	2030	1647	-5	131	+2	184714	2111
2)	404	1647	-5	131	+2	184636	2033
3)	1709	1647	-5	131	+2	184604	2001
4)	1735	1647	-5	131	+2	184592	1989
5)	1836	1647	-5	131	+2	184590	1987
Your choice (0 to stop):
@end group
@end example
Entering 0 will stop the filtering procedure and output the current sequence selection. Possible choices are ordered according to the @command{comparison} option.
In the above example, the current selection has 1,652 sequences and 129 sites with the given minimal coverage.
Choice 1 proposes to remove 5 sequences, which would bring the total number of sequences to 1,647. This would gain 2 more sites with the required minimum coverage.
Choice 2 also proposes to remove 5 sequences in order to add 2 more sites, but the total number of characters gained would be a bit lower. This means that even if both selections have the same number of sites, the ones in choice 1 have a higher coverage than the ones in choice 2. Picking option 1 leads to another proposal:
@example
@group
Number of sequences in alignment.......: 1647
Number of sites in alignment...........: 131
Number of chars in alignment...........: 184714
Choice	Node	#seq	%seq	#sites	%sites	#chars	%chars
1)	1930	1645	-2	132	+1	185794	1080
2)	2246	1645	-2	132	+1	185792	1078
3)	1950	1645	-2	132	+1	185788	1074
4)	2093	1645	-2	132	+1	185786	1072
5)	413	1645	-2	132	+1	185784	1070
Your choice (0 to stop):
@end group
@end example
and so on. The algorithm ends either when the user enters '0' or when no improving option remains.
The number of proposals shown at each iteration can be specified using the @command{show=@{int>0@}} argument:
@command{method=Input(show=5)}.

@subsection Running mode: Diagnostic

This mode will automatically pick the best choice at each iteration according to the @command{comparison} criterion until no more improving choice is proposed.
The goal of this mode is mainly to plot a trade-off curve (see the example directory provided with the source code distribution).
Only log files are produced; no sequence or tree output is written.

@subsection Running mode: Auto

This mode works in a similar way as the Diagnostic mode, by picking the best choice at each iteration, but will stop when a given condition is reached and produce output alignments and trees.
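Two kinds of stopping conditions are currently implemented, one on the number of sequences and one on the number of sites. Each can be given as an absolute number or as a proportion, and the two kinds can be combined: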

@table @command

@item min_nb_sequences=@{int>0@}
will stop when no more improvement can be found or when all improving steps would bring the number of sequences below the given number.

@item min_relative_nb_sequences=@{float[0,1]@}
will stop when no more improvement can be found or when all improving steps would bring the proportion of remaining sequences below the given threshold.
This option only has an effect if @command{min_nb_sequences} is not set or set to 0.

@item min_nb_sites=@{int>0@}
will stop when the number of sites matching the requested coverage is at least the given number.

@item min_relative_nb_sites=@{float[0,1]@}
will stop when the number of sites matching the requested coverage is at least the given proportion of total alignment sites.
This option only has an effect if @command{min_nb_sites} is not set or set to 0.

@end table

For instance, @command{Auto(min_nb_sequences=100, min_nb_sites=300)} will stop the filtering when no improvement can be found, when at least 300 sites have the minimum requested coverage, or when any further improving step would leave fewer than 100 sequences.
@command{Auto(min_nb_sequences=100)} will perform all possible improvements which do not bring the final number of sequences below 100. @command{Auto(min_nb_sites=300)} will perform all possible improvements until 300 sites have the requested coverage. This might result in having only one sequence left in the output file.
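
The relative variants can be used in the same way. The following sketch uses arbitrary example values and assumes that the proportion of sequences is taken relative to the number of sequences in the input alignment. Under that assumption, for an input alignment of 1,652 sequences, a value of 0.5 corresponds to 0.5 * 1652 = 826 sequences:

@example
method = Auto(min_relative_nb_sequences=0.5, min_relative_nb_sites=0.8)
@end example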

@section Output files

@sc{bppalnoptim} generates several output files:

@table @command

@item output.log = @{path@}
Logfile containing the characteristics of each iteration. This file can be used to plot the trade-off curve (see Diagnostic mode).

@end table

The sampled alignment is written using the standard options in the BppO syntax (@ref{WritingSequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingSequences.html#WritingSequences}}).
Optionally, the sampled tree can also be written using options from the BppO syntax (@ref{WritingTrees, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingTrees.html#WritingTrees}}).
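
A possible output section of an option file is sketched below. All file names are hypothetical, and the @command{output.sequence.*} and @command{output.tree.*} option names are assumed to be the standard BppO options documented in the BppSuite manual:

@example
@group
output.log = bppalnoptim.log
output.sequence.file = sampled.fasta
output.sequence.format = Fasta
output.tree.file = sampled.dnd
output.tree.format = Newick
@end group
@end example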

@c ------------------------------------------------------------------------------------------------------------------

@menu
* bppphysamp::
@end menu

@node bppphysamp,  , bppalnoptim, Top
@chapter Sampling alignment to remove redundant sequences
 
The Bio++ Phylogenetic Sampler samples sequences from a file according to phylogenetic information.
The goal is to clean a large data set by removing redundant sequences, which bring little additional information for evolutionary analyses.

The BppPhySamp program uses the common options for setting the alphabet, loading the sequences (@ref{Sequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Sequences.html#Sequences}}) and the tree (@ref{Tree, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Tree.html#Tree}}), and writing the resulting data set (@ref{WritingSequences, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingSequences.html#WritingSequences}}, @ref{WritingTrees, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/WritingTrees.html#WritingTrees}}).

@table @command

@item input.method = @{tree|matrix@}
The method to provide phylogenetic information, either by a tree or a matrix.
If the @option{tree} option is used, then the options for reading trees are used (@ref{Tree, , , bppsuite, @uref{http://biopp.univ-montp2.fr/manual/html/bppsuite/2.2.0/Tree.html#Tree}}).

@item input.matrix = @{path@} [[input.method = matrix]]
The input matrix file.

@item deletion_method = @{random|threshold|sample@}
Method used to remove sequences.

@item threshold = @{float>0@} [[deletion_method = threshold ]]
The minimum distance separating two sequences in the sampled data set.
Any sequences closer than this threshold in the original data set will be compared so that only one is kept.

@item sample_size = @{int>0@} [[deletion_method = sample|random ]]
The number of sequences to keep in the final data set.

@item choice_criterion = @{length|length.complete|random@}
How to choose between closely related sequences. @option{length} takes the longest (maximum number of non-gap positions), @option{length.complete} takes the sequence with the maximum number of fully resolved positions and @option{random} picks one sequence at random.
@end table 
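
As a minimal sketch, a complete option file for @sc{bppphysamp} could look as follows. The file names and the threshold value are hypothetical, and the @command{alphabet}, @command{input.sequence.*}, @command{input.tree.*} and @command{output.sequence.*} options are assumed to be the standard BppO options described in the BppSuite manual:

@example
@group
# Input alignment and tree
alphabet = DNA
input.sequence.file = sequences.fasta
input.sequence.format = Fasta
input.method = tree
input.tree.file = tree.dnd
input.tree.format = Newick

# Keep only one sequence among those closer than the threshold
deletion_method = threshold
threshold = 0.01
choice_criterion = length.complete

# Sampled alignment
output.sequence.file = sampled.fasta
output.sequence.format = Fasta
@end group
@end example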

@c ------------------------------------------------------------------------------------------------------------------

@c end of document

@c @node Index,  , Reference, Top
@c @unnumbered Index
@c
@c @printindex cp

@bye