1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
|
cons
Wiki
The master copies of EMBOSS documentation are available at
http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.
Please help by correcting and extending the Wiki pages.
Function
Create a consensus sequence from a multiple alignment
Description
cons calculates a consensus sequence from a multiple sequence
alignment. To obtain the consensus, the sequence weights and a scoring
matrix are used to calculate a score for each amino acid residue or
nucleotide at each position in the alignment. The highest scoring
residue goes into the consensus sequence if the score is higher than a
user-specified "plurality" value, otherwise, there is no consensus at
that position.
Algorithm
To obtain the consensus, the sequence weights and a scoring matrix are
used to calculate a score at each position in the alignment as follows.
The residue (or nucleotide) i in an alignment column, is compared to
all other residues (j) in the same column. The score for i is the sum
over all residues j (not i=j) of the score(ij)*weight(j), where
score(ij) is taken from a nucleotide or protein scoring matrix (see
-datafile qualifier) and the "weight(j)" is the weighting given to the
sequence j, which is given in the alignment file.
q
The highest scoring type of residue is then found in the column. If the
number of "positive matches" (see below) for this residue is greater
than the "plurality value" (see below), then this residue is the
consensus residue. Otherwise there is no consensus for that position
and an 'n' (nucleotide sequence alignment) or an 'x' (protein sequence
alignment) character is written to the consensus sequence.
The positive matches for a residue i are calculated as being the sum of
the corresponding sequence weights for all the residues that increase
the score of residue i (i.e. that have a positive score). The
"plurality" qualifier sets the cut-off for the number of positive
matches (weighted) below which there is no consensus.
Usage
Here is a sample session with cons
% cons
Create a consensus sequence from a multiple alignment
Input (aligned) sequence set: dna.msf
output sequence [dna.fasta]: aligned.cons
Go to the input files for this example
Go to the output files for this example
Command line arguments
Create a consensus sequence from a multiple alignment
Version: EMBOSS:6.6.0.0
Standard (Mandatory) qualifiers:
[-sequence] seqset File containing a sequence alignment.
[-outseq] seqout [.] Sequence filename and
optional format (output USA)
Additional (Optional) qualifiers:
-datafile matrix [EBLOSUM62 for protein, EDNAFULL for DNA]
This is the scoring matrix file used when
comparing sequences. By default it is the
file 'EBLOSUM62' (for proteins) or the file
'EDNAFULL' (for nucleic sequences). These
files are found in the 'data' directory of
the EMBOSS installation.
-plurality float [Half the total sequence weighting] Set a
cut-off for the number of positive matches
below which there is no consensus. The
default plurality is taken as half the total
weight of all the sequences in the
alignment. (Any numeric value)
-identity integer [0] Provides the facility of setting the
required number of identities at a site for
it to give a consensus at that position.
Therefore, if this is set to the number of
sequences in the alignment only columns of
identities contribute to the consensus.
(Integer 0 or more)
-setcase float [@( $(sequence.totweight) / 2)] Sets the
threshold for the positive matches above
which the consensus is is upper-case and
below which the consensus is in lower-case.
(Any numeric value)
-name string Name of the consensus sequence (Any string)
Advanced (Unprompted) qualifiers: (none)
Associated qualifiers:
"-sequence" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-scircular1 boolean Sequence is circular
-squick1 boolean Read id and sequence only
-sformat1 string Input sequence format
-iquery1 string Input query fields or ID list
-ioffset1 integer Input start position offset
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-outseq" associated qualifiers
-osformat2 string Output seq format
-osextension2 string File name extension
-osname2 string Base file name
-osdirectory2 string Output directory
-osdbname2 string Database name to add
-ossingle2 boolean Separate file for each entry
-oufo2 string UFO features
-offormat2 string Features format
-ofname2 string Features file name
-ofdirectory2 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write first file to standard output
-filter boolean Read first file from standard input, write
first file to standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options and exit. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
-version boolean Report version number and exit
Input file format
The USA of a set of aligned sequences.
Input files for usage example
File: dna.msf
!!NA_MULTIPLE_ALIGNMENT
dna.msf MSF: 120 Type: N January 01, 1776 12:00 Check: 3196 ..
Name: MSFM1 Len: 120 Check: 8587 Weight: 1.00
Name: MSFM2 Len: 120 Check: 6178 Weight: 1.00
Name: MSFM3 Len: 120 Check: 8431 Weight: 1.00
//
MSFM1 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC
MSFM2 ACGTACGTAC GTACGTACGT ....ACGTAC GTACGTACGT ACGTACGTAC
MSFM3 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT CGTACGTACG
MSFM1 GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
MSFM2 GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
MSFM3 TACGTACGTA CGTACGTACG TACGTACGTA ACGTACGTAC GTACGTACGT
MSFM1 ACGTACGTAC GTACGTACGT
MSFM2 ACGTACGTTG CAACGTACGT
MSFM3 ACGTACGTAC GTACGTACGT
Output file format
The output consists of a sequence file holding the consensus sequence.
Output files for usage example
File: aligned.cons
>EMBOSS_001
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
Data files
cons uses the standard set of scoring matrix data files in the EMBOSS
data directory.
EMBOSS data files are distributed with the application and stored in
the standard EMBOSS data directory, which is defined by the EMBOSS
environment variable EMBOSS_DATA.
To see the available EMBOSS data files, run:
% embossdata -showall
To fetch one of the data files (for example 'Exxx.dat') into your
current directory for you to inspect or modify, run:
% embossdata -fetch -file Exxx.dat
Users can provide their own data files in their own directories.
Project specific files can be put in the current directory, or for
tidier directory listings in a subdirectory called ".embossdata". Files
for all EMBOSS runs can be put in the user's home directory, or again
in a subdirectory called ".embossdata".
The directories are searched in the following order:
* . (your current directory)
* .embossdata (under your current directory)
* ~/ (your home directory)
* ~/.embossdata
Notes
The "identity" qualifier provides an additional constrain to
"plurality" when determining a consensus residue at an alignment site.
"identity" sets the required number of identities at a site for it to
be included in the consensus. If for example this is set to the number
of sequences in the alignment, then only a site with the same residue
in all sequences would be included in the consensus.
The "setcase" qualifier sets the threshold for the positive matches
above which the consensus residue is given is upper-case and below
which it is in lower-case.
References
None.
Warnings
None.
Diagnostic Error Messages
None.
Exit status
It always exits with status 0.
Known bugs
None.
See also
Program name Description
consambig Create an ambiguous consensus sequence from a multiple
alignment
megamerger Merge two large overlapping DNA sequences
merger Merge two overlapping sequences
Author(s)
Tim Carver formerly at:
MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
Please report all bugs to the EMBOSS bug team
(emboss-bug (c) emboss.open-bio.org) not to the original author.
History
Target users
This program is intended to be used by everyone and everything, from
naive users to embedded scripts.
Comments
None
|