trimseq
Function
Trim ambiguous bits off the ends of sequences
Description
This program is used to tidy up the ends of sequences, removing all
the bits that you would really rather were not published.
Specifically, it:
* removes all gap characters from the ends.
* removes X's and N's (in nucleic sequences) from the ends.
* optionally removes *'s from the ends
* optionally removes IUPAC ambiguity codes from the ends (B and Z in
proteins, M,R,W,S,Y,K,V,H,D and B in nucleic sequences)
It then optionally trims off poor quality regions from the ends, using
a threshold percentage of unwanted characters in a window which is
moved along the sequence from the ends. The unwanted characters which
are used are X's and N's (in nucleic sequences), optionally *'s, and
optionally IUPAC ambiguity codes.
The program stops trimming the ends when the percentage of unwanted
characters in the moving window drops below the threshold percentage.
Thus if the window size is set to 1 and the percentage threshold is
100, no further poor quality regions will be removed. If the window
size is set to 5 and the percentage threshold is 40 then the sequence
AAGCTNNNNATT will be trimmed to AAGCT, while AAGCTNATT or
AAGCTNNNNATTT will not be trimmed as less than 40% of the last 5
characters are N's.
After trimming these poor quality regions, it again trims off any
dangling gap characters from the ends.
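The steps described above can be sketched in Python. This is an illustrative reading of the description, not the EMBOSS source: the function names, the set of gap characters, and the choice to cut at the innermost unwanted character of a failing window are all assumptions. It does agree with the AAGCT examples given above.

```python
GAP_CHARS = ".-~ "  # assumption: the text only says "gap characters"


def unwanted(nucleic=True, strict=False, star=False):
    """Return the set of characters counted as unwanted."""
    bad = {"X", "N"} if nucleic else {"X"}
    if strict:
        # IUPAC ambiguity codes listed in the description
        bad |= set("MRWSYKVHDB") if nucleic else set("BZ")
    if star:
        bad.add("*")
    return bad


def trim_right(seq, bad, window, percent):
    """Trim the right-hand end of seq and return what is kept."""
    seq = seq.rstrip(GAP_CHARS)                      # gaps at the end
    end = len(seq)
    while end > 0 and seq[end - 1].upper() in bad:   # unwanted characters at the end
        end -= 1
    while window > 0 and end >= window:
        hits = [i for i in range(end - window, end) if seq[i].upper() in bad]
        if not hits or 100.0 * len(hits) / window < percent:
            break                                    # below the threshold: stop trimming
        end = hits[0]                                # assumption: cut at the innermost hit
    return seq[:end].rstrip(GAP_CHARS)               # dangling gaps again


def trimseq(seq, window=1, percent=100.0, nucleic=True,
            strict=False, star=False, left=True, right=True):
    """Hypothetical helper mirroring the qualifiers described in this document."""
    bad = unwanted(nucleic, strict, star)
    if right:
        seq = trim_right(seq, bad, window, percent)
    if left:
        # the start is handled by trimming the reversed sequence
        seq = trim_right(seq[::-1], bad, window, percent)[::-1]
    return seq


print(trimseq("AAGCTNNNNATT", window=5, percent=40))   # AAGCT
print(trimseq("AAGCTNATT", window=5, percent=40))      # AAGCTNATT
```

With window 1 and percent 100, the window loop never removes anything, because the character at the end is already known not to be unwanted. Only the simple end trimming applies.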
Usage
Here is a sample session with trimseq
% trimseq untrimmed.seq trim1.seq -window 1 -percent 100
Trim ambiguous bits off the ends of sequences
Example 2
% trimseq untrimmed.seq trim2.seq -window 5 -percent 40
Trim ambiguous bits off the ends of sequences
Example 3
% trimseq untrimmed.seq trim3.seq -window 5 -percent 50
Trim ambiguous bits off the ends of sequences
Example 4
% trimseq untrimmed.seq trim4.seq -window 5 -percent 50 -strict
Trim ambiguous bits off the ends of sequences
Example 5
% trimseq untrimmed.seq trim5.seq -window 5 -percent 50 -strict -noright
Trim ambiguous bits off the ends of sequences
Command line arguments
Standard (Mandatory) qualifiers:
[-sequence] seqall (Gapped) sequence(s) filename and optional
format, or reference (input USA)
[-outseq] seqoutall [.] Sequence set(s)
filename and optional format (output USA)
Additional (Optional) qualifiers:
-window integer [1] This determines the size of the region
that is considered when deciding whether the
percentage of ambiguity is greater than the
threshold. A value of 5 means that a region
of 5 letters in the sequence is shifted
along the sequence from the ends and
trimming is done only if there is a greater
or equal percentage of ambiguity than the
threshold percentage. (Any integer value)
-percent float [100.0] This is the threshold of the
percentage ambiguity in the window required
in order to trim a sequence. (Any numeric
value)
-strict boolean [N] In nucleic sequences, trim off not only
N's and X's, but also the nucleotide IUPAC
ambiguity codes M, R, W, S, Y, K, V, H, D
and B. In protein sequences, trim off not
only X's but also B and Z.
-star boolean [N] In protein sequences, trim off not only
X's, but also the *'s
Advanced (Unprompted) qualifiers:
-[no]left boolean [Y] Trim at the start
-[no]right boolean [Y] Trim at the end
Associated qualifiers:
"-sequence" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-outseq" associated qualifiers
-osformat2 string Output seq format
-osextension2 string File name extension
-osname2 string Base file name
-osdirectory2 string Output directory
-osdbname2 string Database name to add
-ossingle2 boolean Separate file for each entry
-oufo2 string UFO features
-offormat2 string Features format
-ofname2 string Features file name
-ofdirectory2 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
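For scripted use, the qualifiers listed above can be assembled into an argument list. The following is a minimal sketch: the helper name build_trimseq_command is hypothetical, it uses only qualifiers documented on this page, and it builds the command without running it.

```python
def build_trimseq_command(infile, outfile, window=1, percent=100.0,
                          strict=False, star=False, left=True, right=True):
    """Return an argument list for trimseq (hypothetical helper)."""
    cmd = ["trimseq", infile, outfile,
           "-window", str(window),
           "-percent", str(percent)]
    if strict:
        cmd.append("-strict")    # also trim IUPAC ambiguity codes
    if star:
        cmd.append("-star")      # also trim *'s
    if not left:
        cmd.append("-noleft")    # do not trim at the start
    if not right:
        cmd.append("-noright")   # do not trim at the end
    cmd.append("-auto")          # turn off prompts
    return cmd


# Corresponds to usage example 5, with -auto added for unattended runs.
print(build_trimseq_command("untrimmed.seq", "trim5.seq",
                            window=5, percent=50, strict=True, right=False))
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a system where trimseq is installed.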
Input file format
Normal sequence.
Input files for usage example
File: untrimmed.seq
>myseq
...ttyyyctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttca.gnntcynnnnnn
Output file format
Normal sequence file.
Output files for usage example
File: trim1.seq
>myseq
ttyyyctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttca-gnntcy
Output files for usage example 2
File: trim2.seq
>myseq
ttyyyctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttca-g
Output files for usage example 3
File: trim3.seq
>myseq
ttyyyctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttca-gnntcy
Output files for usage example 4
File: trim4.seq
>myseq
ctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcagctc
tttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcccag
atcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgctcctg
gcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccctgact
accctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggcccgtgct
ggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaagaagaca
ggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgcccaccttt
ggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttctctaa
taaaaaagccacttagttca-gnntc
Output files for usage example 5
File: trim5.seq
>myseq
ctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcagctc
tttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcccag
atcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgctcctg
gcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccctgact
accctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggcccgtgct
ggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaagaagaca
ggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgcccaccttt
ggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttctctaa
taaaaaagccacttagttca-gnntcynnnnnn
Data files
None.
Notes
If you use the '-star' qualifier and set the window size to greater
than 1, you may trim bits of sequence with internal *'s. This may not
be what you expected.
References
None.
Warnings
None.
Diagnostic Error Messages
None.
Exit status
It always exits with status 0.
Known bugs
None noted.
See also
Program name  Description
biosed        Replace or delete sequence sections
codcopy       Reads and writes a codon usage table
cutseq        Removes a specified section from a sequence
degapseq      Removes gap characters from sequences
descseq       Alter the name or description of a sequence
entret        Reads and writes (returns) flatfile entries
extractalign  Extract regions from a sequence alignment
extractfeat   Extract features from a sequence
extractseq    Extract regions from a sequence
listor        Write a list file of the logical OR of two sets of sequences
makenucseq    Creates random nucleotide sequences
makeprotseq   Creates random protein sequences
maskfeat      Mask off features of a sequence
maskseq       Mask off regions of a sequence
newseq        Type in a short new sequence
noreturn      Removes carriage return from ASCII files
notseq        Exclude a set of sequences and write out the remaining ones
nthseq        Writes one sequence from a multiple set of sequences
pasteseq      Insert one sequence into another
revseq        Reverse and complement a sequence
seqret        Reads and writes (returns) sequences
seqretsplit   Reads and writes (returns) sequences in individual files
skipseq       Reads and writes (returns) sequences, skipping first few
splitter      Split a sequence into (overlapping) smaller sequences
trimest       Trim poly-A tails off EST sequences
union         Reads sequence fragments and builds one sequence
vectorstrip   Strips out DNA between a pair of vector sequences
yank          Reads a sequence range, appends the full USA to a list file
Author(s)
Gary Williams (gwilliam rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust
Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
History
Target users
This program is intended to be used by everyone and everything, from
naive users to embedded scripts.
Comments
None