File: tranalign.txt

                                 tranalign 



Function

   Align nucleic coding regions given the aligned proteins

Description

   tranalign is a re-implementation in EMBOSS of the program mrtrans by
   Bill Pearson.

   tranalign is a simple program that allows you to produce aligned cDNA
   sequences from aligned protein sequences. This can be very useful for
   phylogeny programs, e.g. the PHYLIP programs dnadist, dnapars and
   dnaml. In general, it is better to use protein sequences for multiple
   alignments, but to use DNA sequences for phylogeny. Converting a
   protein alignment back into the matching DNA alignment by hand can be
   time consuming when there are gaps in the aligned protein sequences.

   tranalign takes a set of (unaligned) nucleic sequences and a set of
   aligned protein sequences. It reads the first nucleic sequence and the
   first protein sequence, translates the nucleic sequence in each of the
   three forward frames, compares the protein sequence to the translated
   nucleic sequence to find the protein coding region, and then writes
   out the nucleic sequence that encoded the protein.
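
   The search described above can be sketched in Python. This is only an
   illustration of the idea, not the EMBOSS source: the names translate
   and find_coding_region are hypothetical, only the standard genetic
   code is built in, and an exact string match is assumed.

```python
from itertools import product

# Standard genetic code, with bases ordered t, c, a, g (assumed here;
# tranalign itself lets you pick other codes with -table).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc, frame):
    """Translate nuc from offset frame (0, 1 or 2); unknown codons become X."""
    nuc = nuc.lower()
    return "".join(
        CODON_TABLE.get(nuc[i:i + 3], "X") for i in range(frame, len(nuc) - 2, 3)
    )


def find_coding_region(nuc, aligned_protein):
    """Return the stretch of nuc that encodes the protein, or None."""
    # Gap characters are ignored during the comparison.
    target = aligned_protein.replace("-", "").upper()
    for frame in range(3):
        hit = translate(nuc, frame).find(target)
        if hit >= 0:
            start = frame + 3 * hit
            return nuc[start:start + 3 * len(target)]
    return None


# The start of HSFAU1 from the usage example: the protein is found in
# the third forward frame, so the first two bases are dropped.
region = find_coding_region("ttcctctttctcgactccatcttcgc", "PLSRLHL")
```

   Here region holds the 21 bases that encode PLSRLHL, starting at the
   third base of the input.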

   The sequences must be in the same order in both sets of sequences. A
   common problem you should be aware of is that some alignment programs
   (including clustalw/emma) will re-order the aligned sequences to group
   similar sequences together.

   The protein library may include '-' characters to specify alignments.
   Each '-' character in the protein library is ignored during the
   sequence comparison but replaced by '---' in the nucleic sequence
   output to form the aligned nucleic sequences.
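
   The gap handling can be illustrated with a short Python sketch. The
   function name thread_gaps is hypothetical, and the sketch assumes the
   coding region has already been located and contains exactly three
   bases per residue of the ungapped protein.

```python
def thread_gaps(coding_nuc, aligned_protein):
    """Copy the protein's gaps into the coding sequence, codon by codon."""
    pieces = []
    pos = 0  # index of the next unused codon in coding_nuc
    for residue in aligned_protein:
        if residue == "-":
            pieces.append("---")  # one protein gap becomes three bases of gap
        else:
            pieces.append(coding_nuc[pos:pos + 3])
            pos += 3
    return "".join(pieces)


aligned = thread_gaps("atgcagctc", "M-QL")
```

   With the inputs shown, aligned is "atg---cagctc": the gap after M in
   the protein becomes a three-base gap after the atg codon.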

   tranalign finds the coding regions for contiguous sequences only. It
   will not splice together different exons to produce a coding sequence.
   You should therefore use either mRNA sequences, or nucleic sequences
   which you have constructed to hold a contiguous coding region (for
   example, using extractseq, or yank followed by union).

Usage

   Here is a sample session with tranalign


% tranalign tranalign.seq ../data/tranalign.pep tranalign2.seq 
Align nucleic coding regions given the aligned proteins

Command line arguments

   Standard (Mandatory) qualifiers:
  [-asequence]         seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
  [-bsequence]         seqset     (Aligned) protein sequence set filename and
                                  optional format, or reference (input USA)
  [-outseq]            seqoutset  [<sequence>.<format>] (Aligned) nucleotide
                                  sequence set filename and optional format
                                  (output USA)

   Additional (Optional) qualifiers:
   -table              menu       [0] Code to use (Values: 0 (Standard); 1
                                  (Standard (with alternative initiation
                                  codons)); 2 (Vertebrate Mitochondrial); 3
                                  (Yeast Mitochondrial); 4 (Mold, Protozoan,
                                  Coelenterate Mitochondrial and
                                  Mycoplasma/Spiroplasma); 5 (Invertebrate
                                  Mitochondrial); 6 (Ciliate Macronuclear and
                                  Dasycladacean); 9 (Echinoderm
                                  Mitochondrial); 10 (Euplotid Nuclear); 11
                                  (Bacterial); 12 (Alternative Yeast Nuclear);
                                  13 (Ascidian Mitochondrial); 14 (Flatworm
                                  Mitochondrial); 15 (Blepharisma
                                  Macronuclear); 16 (Chlorophycean
                                  Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus);
                                  23 (Thraustochytrium Mitochondrial))

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2            integer    Start of each sequence to be used
   -send2              integer    End of each sequence to be used
   -sreverse2          boolean    Reverse (if DNA)
   -sask2              boolean    Ask for begin/end/reverse
   -snucleotide2       boolean    Sequence is nucleotide
   -sprotein2          boolean    Sequence is protein
   -slower2            boolean    Make lower case
   -supper2            boolean    Make upper case
   -sformat2           string     Input sequence format
   -sdbname2           string     Database name
   -sid2               string     Entryname
   -ufo2               string     UFO features
   -fformat2           string     Features format
   -fopenfile2         string     Features file name

   "-outseq" associated qualifiers
   -osformat3          string     Output seq format
   -osextension3       string     File name extension
   -osname3            string     Base file name
   -osdirectory3       string     Output directory
   -osdbname3          string     Database name to add
   -ossingle3          boolean    Separate file for each entry
   -oufo3              string     UFO features
   -offormat3          string     Features format
   -ofname3            string     Features file name
   -ofdirectory3       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   The input is a set of unaligned nucleic sequences and the set of
   aligned protein sequences to be used as a guide in the alignment of
   the output nucleic sequences.

   The ID names of the nucleic acid and protein sequences are NOT checked
   to see if they correspond to each other. They can have any names.

   There must be at least as many protein sequences as nucleic acid
   sequences; extra protein sequences are ignored.

   Each of the nucleic acid sequences must have a corresponding protein
   sequence which is derived from the coding region of that nucleic acid
   sequence. The two sets of sequences must be in the same order.
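
   Because the names are not checked, it can help to list the pairs that
   will be compared before running the program. The following Python
   sketch does this for FASTA text; read_fasta and pair_by_position are
   hypothetical helper names, and pairing is purely by position, as
   described above.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples, in file order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif line and records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def pair_by_position(nuc_records, pep_records):
    """Pair the first nucleic record with the first protein record, and so on."""
    if len(pep_records) < len(nuc_records):
        raise ValueError(
            f"{len(nuc_records)} nucleic sequences but only "
            f"{len(pep_records)} protein sequences"
        )
    # zip stops at the shorter list, so extra protein records are dropped.
    return [(n[0], p[0]) for n, p in zip(nuc_records, pep_records)]


nuc = read_fasta(">HSFAU1\nttcc\ntt\n>HSFAU2\nttcc\n")
pep = read_fasta(">HSFAU1_3\nPL-S\n>HSFAU2_3\nPLS\n>EXTRA\nPL\n")
pairs = pair_by_position(nuc, pep)
```

   Printing pairs makes a re-ordered alignment easy to spot by eye when
   the names in the two files are related.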

  Input files for usage example

  File: tranalign.seq

>HSFAU1
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggccccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU2
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU3
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU4
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU5
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtaggccgcatgctttttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

  File: tranalign.pep

>HSFAU1_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAG-PLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU2_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDALWASAGWRP
>HSFAU3_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKGAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU4_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHEIASLEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLARGKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU5_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVGRMLFG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS

Output file format

  Output files for usage example

  File: tranalign2.seq

>HSFAU1
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggc---cccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU2
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc---
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------
>HSFAU3
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU4
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU5
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtaggccgcatgctttttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct

   The output is the regions of the nucleic acid sequences which code for
   the corresponding protein sequence, with gap characters ('-')
   introduced so that they have the same alignment as the corresponding
   protein sequences.
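
   In the example above every output sequence has the same length (three
   times the length of the protein alignment) and gaps appear only as
   whole '---' codons. Assuming that holds for your data too, a result
   can be sanity-checked with a small Python sketch like the following;
   check_codon_alignment is a hypothetical name, not part of EMBOSS.

```python
def check_codon_alignment(aligned_nuc):
    """Return a list of problems found in a {name: aligned sequence} dict."""
    problems = []
    lengths = {len(seq) for seq in aligned_nuc.values()}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    for name, seq in aligned_nuc.items():
        if len(seq) % 3:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
        # Walk the sequence codon by codon and report the first broken codon.
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if "-" in codon and codon != "---":
                problems.append(f"{name}: gap splits codon at position {i + 1}")
                break
    return problems


good = check_codon_alignment({"a": "atg---cag", "b": "atgaaacag"})
bad = check_codon_alignment({"a": "atg--cag", "b": "atgaaacag"})
```

   An empty list means none of these three checks failed; it does not
   prove that the alignment is biologically sensible.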

Data files

   None.

Notes

   The sequences must be in the same order in both sets of sequences. A
   common problem you should be aware of is that some alignment programs
   (including clustalw/emma) will re-order the aligned sequences to group
   similar sequences together.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   "No guide protein sequence available for nucleic sequence xxx" - the
   corresponding protein sequence for this nucleic sequence has not been
   input. You have input more nucleic acid sequences than protein
   sequences.

   "Guide protein sequence xxx not found in nucleic sequence xxx" - the
   region of the nucleic sequence which codes for the protein was not
   found. The coding region in the nucleic acid sequence must be a single
   contiguous sequence. The protein sequence might not be the
   corresponding one for this nucleic acid sequence if they are out of
   order.

Exit status

   It always exits with status 0.
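
   Since the exit status does not distinguish success from failure, a
   calling script may prefer to look for the diagnostic messages
   themselves. The Python sketch below scans captured program output for
   the two messages quoted under Diagnostic Error Messages. It assumes
   the messages appear with that wording and that the caller has already
   captured the output as text; find_diagnostics is a hypothetical name.

```python
import re

# Patterns built from the two messages quoted in this document.
DIAGNOSTICS = [
    re.compile(r"No guide protein sequence available for nucleic sequence (\S+)"),
    re.compile(r"Guide protein sequence (\S+) not found in nucleic sequence (\S+)"),
]


def find_diagnostics(captured_text):
    """Return every line of captured_text that matches a known diagnostic."""
    return [
        line.strip()
        for line in captured_text.splitlines()
        if any(pattern.search(line) for pattern in DIAGNOSTICS)
    ]


# Hypothetical captured output; the exact prefix printed by the program
# is an assumption.
log = (
    "Align nucleic coding regions given the aligned proteins\n"
    "Guide protein sequence HSFAU2_3 not found in nucleic sequence HSFAU2\n"
)
hits = find_diagnostics(log)
```

   A non-empty hits list can then be treated as a failure by the script.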

Known bugs

   None.

See also

   Program name                        Description
   edialign     Local multiple alignment of sequences
   emma         Multiple alignment program - interface to ClustalW program
   infoalign    Information on a multiple sequence alignment
   plotcon      Plot quality of conservation of a sequence alignment
   prettyplot   Displays aligned sequences, with colouring and boxing
   showalign    Displays a multiple sequence alignment

Author(s)

   The original program mrtrans was written by Bill Pearson
   (wrp@virginia.edu).

   tranalign was written in EMBOSS code by Gary Williams
   (gwilliam@rfcgr.mrc.ac.uk), using the description of mrtrans as a
   guide.
   MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

   mrtrans written (Jan 1991, July 1987) - Bill Pearson

   tranalign written (March 2002) - Gary Williams

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None.