<HTML>
<HEAD>
  <TITLE>
  EMBOSS: notseq
  </TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" text="#000000">

<table align=center border=0 cellspacing=0 cellpadding=0>
<tr><td valign=top>
<A HREF="/" ONMOUSEOVER="self.status='Go to the EMBOSS home page';return true"><img border=0 src="emboss_icon.jpg" alt="" width=150 height=48></a>
</td>
<td align=left valign=middle>
<b><font size="+6">
notseq
</font></b>
</td></tr>
</table>
<br>&nbsp;
<p>


<H2>
    Function
</H2>
Exclude a set of sequences and write out the remaining ones


<H2>
    Description
</H2>

If you have a set of sequences (for example, a file containing multiple
sequences) and you wish to remove one or more of them from the set, use
<b>notseq</b>. 

<p>

This program was written for the case where a file containing several
sequences is being used as a small database, but some of the sequences
are no longer required and must be deleted from the file. 

<p>

<b>notseq</b> splits the input sequences into those that you wish to
keep and those you wish to exclude. 

<p>

<b>notseq</b> takes a set of sequences as input together with a list of
sequence names or accession numbers.  It also takes the name of a new
file to which the sequences that you want to keep are written, and
optionally the name of a file that will contain the sequences that you
want excluded from the set. 

<p>

<b>notseq</b> then reads in the input sequences.  Those that match one
of the sequence names or accession numbers are written to the file of
excluded sequences, and those that do not match are written to the file
of sequences to be kept. 

<p>

Note that the names of the sequences to be excluded are not standard
EMBOSS USAs.  Only the name or accession number should be specified, not
the database or file that these entries may occur in.  These excluded
sequence names are matched against the names of the input sequences.
Wildcarded names may be specified by using '*'s.  Any specified names of
sequences to be excluded that are not found are simply ignored. 
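<p>

The matching rules described above can be sketched in a few lines of
Python.  This is only an illustration of the documented behaviour, not
the EMBOSS implementation: the function name split_sequences and the
(name, accession) record layout are hypothetical, and Python's fnmatch
also treats '?' and '[...]' as wildcards, which <b>notseq</b> may not. 

```python
from fnmatch import fnmatchcase


def split_sequences(records, patterns):
    """Partition (name, accession) records into kept and excluded names.

    A record is excluded when its name or accession matches any pattern.
    Matching is case independent and '*' acts as a wildcard. Patterns
    that match nothing are silently ignored.
    """
    lowered = [p.lower() for p in patterns]
    kept, excluded = [], []
    for name, accession in records:
        # Accession may be missing (None); only compare what is present.
        candidates = [c.lower() for c in (name, accession) if c]
        hit = any(fnmatchcase(c, p) for c in candidates for p in lowered)
        (excluded if hit else kept).append(name)
    return kept, excluded


# Names taken from the globins.fasta example; accessions are left unset.
globins = [(n, None) for n in (
    "HBB_HUMAN", "HBB_HORSE", "HBA_HUMAN", "HBA_HORSE",
    "MYG_PHYCA", "GLB5_PETMA", "LGB2_LUPLU",
)]

kept, excluded = split_sequences(globins, ["hb*", "not_present"])
print(kept)      # ['MYG_PHYCA', 'GLB5_PETMA', 'LGB2_LUPLU']
print(excluded)  # ['HBB_HUMAN', 'HBB_HORSE', 'HBA_HUMAN', 'HBA_HORSE']
```

<p>

The pattern 'not_present' matches nothing and has no effect, mirroring
the statement that unmatched names are ignored. 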





<H2>
    Usage
</H2>
<b>Here is a sample session with notseq</b>
<p>
In this case the excluded sequences (myg_phyca and lgb2_luplu) are not saved to any file: 
<p>

<p>
<table width="90%"><tr><td bgcolor="#CCFFFF"><pre>

% <b>notseq </b>
Exclude a set of sequences and write out the remaining ones
Input (gapped) sequence(s): <b>globins.fasta</b>
Sequence names to exclude: <b>myg_phyca,lgb2_luplu</b>
output sequence(s) [hbb_human.fasta]: <b>mydata.seq</b>

</pre></td></tr></table><p>
<p>
<a href="#input.1">Go to the input files for this example</a><br><a href="#output.1">Go to the output files for this example</a><p><p>
<p>
<b>Example 2</b>
<p>
Here is an example where the sequences to be excluded are saved to another file: 
<p>

<p>
<table width="90%"><tr><td bgcolor="#CCFFFF"><pre>

% <b>notseq -junkout hb.seq </b>
Exclude a set of sequences and write out the remaining ones
Input (gapped) sequence(s): <b>globins.fasta</b>
Sequence names to exclude: <b>hb*</b>
output sequence(s) [hbb_human.fasta]: <b>mydata.seq</b>

</pre></td></tr></table><p>
<p>
<a href="#output.2">Go to the output files for this example</a><p><p>
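<p>

The interactive sessions above can also be expressed as a single
non-interactive command line.  The following Python sketch only builds
and prints such a command from the qualifiers listed on this page
(-sequence, -exclude, -outseq, -junkoutseq and -auto); the helper name
notseq_command is hypothetical, and nothing is executed, so it can be
tried without EMBOSS installed. 

```python
import shlex


def notseq_command(sequence, exclude, outseq, junkoutseq=None):
    """Return an argument list for a prompt-free notseq run."""
    argv = [
        "notseq",
        "-sequence", sequence,
        "-exclude", exclude,
        "-outseq", outseq,
    ]
    if junkoutseq is not None:
        # Optional: collect the excluded sequences instead of discarding them.
        argv += ["-junkoutseq", junkoutseq]
    argv.append("-auto")  # turn off prompts
    return argv


# Equivalent of Example 2: exclude every name starting with 'hb'.
argv = notseq_command("globins.fasta", "hb*", "mydata.seq", junkoutseq="hb.seq")

# shlex.join quotes the wildcard so a shell would not expand it.
print(shlex.join(argv))
```

<p>

Passing the list to a process launcher as separate arguments, rather
than as one shell string, avoids accidental expansion of the '*'
wildcard by the shell. 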



<H2>
    Command line arguments
</H2>

<table CELLSPACING=0 CELLPADDING=3 BGCOLOR="#f5f5ff" ><tr><td>
<pre>
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-exclude]           string     Enter a list of sequence names or accession
                                  numbers to exclude from the sequences read
                                  in. The excluded sequences will be written
                                  to the file specified in the 'junkout'
                                  parameter. The remainder will be written out
                                  to the file specified in the 'outseq'
                                  parameter.
                                  The list of sequence names can be separated
                                  by either spaces or commas.
                                  The sequence names can be wildcarded.
                                  The sequence names are case independent.
                                  An example of a list of sequences to be
                                  excluded is:
                                  myseq, hs*, one two three
                                  a file containing a list of sequence names
                                  can be specified by giving the file name
                                  preceeded by a '@', eg: '@names.dat' (Any
                                  string is accepted)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)

   Additional (Optional) qualifiers:
   -junkoutseq         seqoutall  [/dev/null] This file collects the sequences
                                  which you have excluded from the main
                                  output file of sequences.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat3          string     Output seq format
   -osextension3       string     File name extension
   -osname3            string     Base file name
   -osdirectory3       string     Output directory
   -osdbname3          string     Database name to add
   -ossingle3          boolean    Separate file for each entry
   -oufo3              string     UFO features
   -offormat3          string     Features format
   -ofname3            string     Features file name
   -ofdirectory3       string     Output directory

   "-junkoutseq" associated qualifiers
   -osformat           string     Output seq format
   -osextension        string     File name extension
   -osname             string     Base file name
   -osdirectory        string     Output directory
   -osdbname           string     Database name to add
   -ossingle           boolean    Separate file for each entry
   -oufo               string     UFO features
   -offormat           string     Features format
   -ofname             string     Features file name
   -ofdirectory        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

</pre>
</td></tr></table>
<P>
<table border cellspacing=0 cellpadding=3 bgcolor="#ccccff">
<tr bgcolor="#FFFFCC">
<th align="left" colspan=2>Standard (Mandatory) qualifiers</th>
<th align="left">Allowed values</th>
<th align="left">Default</th>
</tr>

<tr>
<td>[-sequence]<br>(Parameter 1)</td>
<td>(Gapped) sequence(s) filename and optional format, or reference (input USA)</td>
<td>Readable sequence(s)</td>
<td><b>Required</b></td>
</tr>

<tr>
<td>[-exclude]<br>(Parameter 2)</td>
<td>Enter a list of sequence names or accession numbers to exclude from the sequences read in. The excluded sequences will be written to the file specified in the 'junkout' parameter. The remainder will be written out to the file specified in the 'outseq' parameter.
The list of sequence names can be separated by either spaces or commas.
The sequence names can be wildcarded.
The sequence names are case independent.
An example of a list of sequences to be excluded is:
myseq, hs*, one two three
A file containing a list of sequence names can be specified by giving the file name preceded by a '@', eg: '@names.dat'</td>
<td>Any string is accepted</td>
<td><i>An empty string is accepted</i></td>
</tr>

<tr>
<td>[-outseq]<br>(Parameter 3)</td>
<td>Sequence set(s) filename and optional format (output USA)</td>
<td>Writeable sequence(s)</td>
<td><i>&lt;*&gt;</i>.<i>format</i></td>
</tr>

<tr bgcolor="#FFFFCC">
<th align="left" colspan=2>Additional (Optional) qualifiers</th>
<th align="left">Allowed values</th>
<th align="left">Default</th>
</tr>

<tr>
<td>-junkoutseq</td>
<td>This file collects the sequences which you have excluded from the main output file of sequences.</td>
<td>Writeable sequence(s)</td>
<td>/dev/null</td>
</tr>

<tr bgcolor="#FFFFCC">
<th align="left" colspan=2>Advanced (Unprompted) qualifiers</th>
<th align="left">Allowed values</th>
<th align="left">Default</th>
</tr>

<tr>
<td colspan=4>(none)</td>
</tr>

</table>


<H2>
    Input file format
</H2>
<b>notseq</b> reads normal sequence USAs.

<p>


<a name="input.1"></a>
<h3>Input files for usage example </h3>
<p><h3>File: globins.fasta</h3>
<table width="90%"><tr><td bgcolor="#FFCCFF">
<pre>
&gt;HBB_HUMAN Sw:Hbb_Human =&gt; HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
&gt;HBB_HORSE Sw:Hbb_Horse =&gt; HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
&gt;HBA_HUMAN Sw:Hba_Human =&gt; HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
&gt;HBA_HORSE Sw:Hba_Horse =&gt; HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
&gt;MYG_PHYCA Sw:Myg_Phyca =&gt; MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
&gt;GLB5_PETMA Sw:Glb5_Petma =&gt; GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
&gt;LGB2_LUPLU Sw:Lgb2_Luplu =&gt; LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA
</pre>
</td></tr></table><p>

<p>

The names (or accession numbers) of the sequences to be excluded can be
entered as a file of such names by specifying an '@' followed by the
name of the file containing the sequence names.  For example:
'@names.dat'. 

<p>

The names or accession numbers of the sequences to be
excluded are not standard EMBOSS USAs.  Only the ID name or accession
number can be specified; you cannot specify the sequences as
'database:ID', 'file:accession', 'format::file', etc. 
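<p>

As a rough illustration of how an exclusion list might be interpreted,
the Python sketch below splits a list on spaces or commas and follows an
'@' prefix to a file.  The helper name parse_exclude is hypothetical,
and the assumption that a list file holds names separated by
whitespace, commas or newlines is ours; the exact list file syntax
accepted by <b>notseq</b> is not specified here. 

```python
import re
from pathlib import Path


def parse_exclude(spec):
    """Turn an exclusion string into a list of names or patterns.

    '@file' reads the list from a file (assumed layout: names separated
    by whitespace, commas or newlines). Anything else is split directly.
    """
    spec = spec.strip()
    if spec.startswith("@"):
        spec = Path(spec[1:]).read_text()
    # Split on any run of commas and/or whitespace, dropping empty tokens.
    return [tok for tok in re.split(r"[,\s]+", spec) if tok]


print(parse_exclude("myseq, hs*, one two three"))
# ['myseq', 'hs*', 'one', 'two', 'three']
```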




<H2>
    Output file format
</H2>

<b>notseq</b> writes a normal sequence file.

<p>


<a name="output.1"></a>
<h3>Output files for usage example </h3>
<p><h3>File: mydata.seq</h3>
<table width="90%"><tr><td bgcolor="#CCFFCC">
<pre>
&gt;HBB_HUMAN Sw:Hbb_Human =&gt; HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
&gt;HBB_HORSE Sw:Hbb_Horse =&gt; HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
&gt;HBA_HUMAN Sw:Hba_Human =&gt; HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
&gt;HBA_HORSE Sw:Hba_Horse =&gt; HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
&gt;GLB5_PETMA Sw:Glb5_Petma =&gt; GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
</pre>
</td></tr></table><p>

<a name="output.2"></a>
<h3>Output files for usage example 2</h3>
<p><h3>File: mydata.seq</h3>
<table width="90%"><tr><td bgcolor="#CCFFCC">
<pre>
&gt;MYG_PHYCA Sw:Myg_Phyca =&gt; MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
&gt;GLB5_PETMA Sw:Glb5_Petma =&gt; GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
&gt;LGB2_LUPLU Sw:Lgb2_Luplu =&gt; LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA
</pre>
</td></tr></table><p>
<p><h3>File: hb.seq</h3>
<table width="90%"><tr><td bgcolor="#CCFFCC">
<pre>
&gt;HBB_HUMAN Sw:Hbb_Human =&gt; HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
&gt;HBB_HORSE Sw:Hbb_Horse =&gt; HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
&gt;HBA_HUMAN Sw:Hba_Human =&gt; HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
&gt;HBA_HORSE Sw:Hba_Horse =&gt; HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
</pre>
</td></tr></table><p>


<H2>
    Data files
</H2>

None.

<H2>
    Notes
</H2>

Note that the names or accession numbers of the sequences to be
excluded are not standard EMBOSS USAs.  Only the ID name or accession
number can be specified; you cannot specify the sequences as
'database:ID', 'file:accession', 'format::file', etc. 

<H2>
    References
</H2>


None.

<H2>
    Warnings
</H2>

None.

<H2>
    Diagnostic Error Messages
</H2>

If no matches are found to any of the specified sequence names, the message
"This is a warning: No matches found." is displayed.


<H2>
    Exit status
</H2>

It exits with a status of 0 unless none of the specified sequence names
matches any input sequence, in which case it exits with a status of -1. 
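<p>

A calling script may want to distinguish these two outcomes.  The Python
sketch below is one way to do so, under two assumptions that are not
stated on this page: that on POSIX systems a status of -1 is reported
to the parent process as 255 (only the low eight bits are kept), and
that any other value indicates some unrelated failure.  The helper name
describe_exit is hypothetical. 

```python
def describe_exit(returncode):
    """Map a notseq return code to a short description."""
    if returncode == 0:
        return "ok"
    # exit(-1) is usually seen by the parent as 255 on POSIX systems.
    # Other failures could in principle report the same value.
    if returncode in (-1, 255):
        return "no matches found"
    return "other failure"


for code in (0, 255, 1):
    print(code, describe_exit(code))
```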


<H2>
    Known bugs
</H2>

None.

<h2><a name="See also">See also</a></h2>
<table border cellpadding=4 bgcolor="#FFFFF0">
<tr><th>Program name</th><th>Description</th></tr>
<tr>
<td><a href="biosed.html">biosed</a></td>
<td>Replace or delete sequence sections</td>
</tr>

<tr>
<td><a href="codcopy.html">codcopy</a></td>
<td>Reads and writes a codon usage table</td>
</tr>

<tr>
<td><a href="cutseq.html">cutseq</a></td>
<td>Removes a specified section from a sequence</td>
</tr>

<tr>
<td><a href="degapseq.html">degapseq</a></td>
<td>Removes gap characters from sequences</td>
</tr>

<tr>
<td><a href="descseq.html">descseq</a></td>
<td>Alter the name or description of a sequence</td>
</tr>

<tr>
<td><a href="entret.html">entret</a></td>
<td>Reads and writes (returns) flatfile entries</td>
</tr>

<tr>
<td><a href="extractalign.html">extractalign</a></td>
<td>Extract regions from a sequence alignment</td>
</tr>

<tr>
<td><a href="extractfeat.html">extractfeat</a></td>
<td>Extract features from a sequence</td>
</tr>

<tr>
<td><a href="extractseq.html">extractseq</a></td>
<td>Extract regions from a sequence</td>
</tr>

<tr>
<td><a href="listor.html">listor</a></td>
<td>Write a list file of the logical OR of two sets of sequences</td>
</tr>

<tr>
<td><a href="makenucseq.html">makenucseq</a></td>
<td>Creates random nucleotide sequences</td>
</tr>

<tr>
<td><a href="makeprotseq.html">makeprotseq</a></td>
<td>Creates random protein sequences</td>
</tr>

<tr>
<td><a href="maskfeat.html">maskfeat</a></td>
<td>Mask off features of a sequence</td>
</tr>

<tr>
<td><a href="maskseq.html">maskseq</a></td>
<td>Mask off regions of a sequence</td>
</tr>

<tr>
<td><a href="newseq.html">newseq</a></td>
<td>Type in a short new sequence</td>
</tr>

<tr>
<td><a href="noreturn.html">noreturn</a></td>
<td>Removes carriage return from ASCII files</td>
</tr>

<tr>
<td><a href="nthseq.html">nthseq</a></td>
<td>Writes one sequence from a multiple set of sequences</td>
</tr>

<tr>
<td><a href="pasteseq.html">pasteseq</a></td>
<td>Insert one sequence into another</td>
</tr>

<tr>
<td><a href="revseq.html">revseq</a></td>
<td>Reverse and complement a sequence</td>
</tr>

<tr>
<td><a href="seqret.html">seqret</a></td>
<td>Reads and writes (returns) sequences</td>
</tr>

<tr>
<td><a href="seqretsplit.html">seqretsplit</a></td>
<td>Reads and writes (returns) sequences in individual files</td>
</tr>

<tr>
<td><a href="skipseq.html">skipseq</a></td>
<td>Reads and writes (returns) sequences, skipping first few</td>
</tr>

<tr>
<td><a href="splitter.html">splitter</a></td>
<td>Split a sequence into (overlapping) smaller sequences</td>
</tr>

<tr>
<td><a href="trimest.html">trimest</a></td>
<td>Trim poly-A tails off EST sequences</td>
</tr>

<tr>
<td><a href="trimseq.html">trimseq</a></td>
<td>Trim ambiguous bits off the ends of sequences</td>
</tr>

<tr>
<td><a href="union.html">union</a></td>
<td>Reads sequence fragments and builds one sequence</td>
</tr>

<tr>
<td><a href="vectorstrip.html">vectorstrip</a></td>
<td>Strips out DNA between a pair of vector sequences</td>
</tr>

<tr>
<td><a href="yank.html">yank</a></td>
<td>Reads a sequence range, appends the full USA to a list file</td>
</tr>

</table>


<H2>
    Author(s)
</H2>

Gary Williams (gwilliam&nbsp;&copy;&nbsp;rfcgr.mrc.ac.uk)
<br>
MRC Rosalind Franklin Centre for Genomics Research
<br>
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK


<H2>
    History
</H2>

Written (9 Jan 2001) - Gary Williams
<p>

Added ability to specify names to exclude as a list file (June 2002) -
Gary Williams


<H2>
    Target users
</H2>
This program is intended to be used by everyone and everything, from naive users to embedded scripts.

<H2>
    Comments
</H2>
None

</BODY>
</HTML>