File: codonW.hlp

#main_menu#

CodonW is a package for codon usage analysis. It was designed to 
simplify Multivariate Analysis (MVA) of codon usage. The MVA method 
employed in CodonW is correspondence analysis (COA), the most widely 
used codon usage MVA method. COA can be performed on codon usage, 
relative synonymous codon usage or amino acid usage. Integrated into 
CodonW is the ability to work with genetic codes other than the 
universal code. Other indices of codon usage and codon bias, 
dinucleotide bias and mutation bias are also analysed by CodonW.

Modes of use:
a) An extensive set of command line options is available if 
your platform supports command line parameters. For more information 
type 

codonw -help

b) Maximum functionality is obtained by running CodonW using the 
interactive menus. Each menu has its own online help.

c) CodonW also emulates a large number of useful utility programs used 
in our labs to aid the analysis of codon usage. If the first argument 
to the CodonW program is one of a recognised list of programs (rscu, cu, 
aau, raau, tidy, reader, cutab, cutot, transl, bases, base3s, dinuc, 
cai, fop, gc3s or enc), CodonW assumes that you want to accomplish or 
calculate one of these simpler tasks/indices and bypasses the menu 
system. For a fuller description of what these pseudo programs 
calculate, see the README file.

To Run CodonW:

a) You must load a file containing all your sequences in fasta/Pearson 
format, either from the command line or using menu 1.

b) You may change many of the default values using menu 3.

c) Select which codon usage indices to measure (menu 4). Choose the type 
of correspondence analysis, if any (menu 5). Other data analysis options 
may also be selected using menu 8.

d) Return to the first (main) menu and type R to run an analysis.

Output files from the correspondence analysis have the extension .coa. 
See summary.coa for an overall explanation of what is being generated by 
the analysis. 
 
Other output will be stored in the files that you choose using menu 1 or 
as specified on the command line. Depending on the options chosen there 
will either be one or two result files; usually they will have the 
extensions .out and .blk.

//
#open_file_query#

Open file dialog. 

You have been requested to choose a file for the analysis. If the 
request is for an input filename, this file must contain all the 
sequences that you wish to analyse, in sequential fasta format. 
That is, all sequences should be in one file, with each sequence 
preceded by a single header line that starts with an angle bracket 
character ">".  
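
As a rough illustration of the input layout described above, the 
following Python sketch splits sequential fasta text into header and 
sequence pairs. It is not part of CodonW; the function name read_fasta 
and the two example records are invented for this illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical input: two genes in one file, one header line per gene.
example = """>gene1 first example
ATGGCTAAA
TGA
>gene2 second example
ATGTTTTAA
""".splitlines()

records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```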

If you use GCG, the output from the program tofasta is acceptable. 

If prompted for either the "bulk" or "output" file names, these 
filenames will be used to record the results of the analysis. These 
files will be opened for writing which may destroy the content of the 
files, should the files already exist. So if a file already exists with 
the name you have chosen, you will be asked whether you wish to 
overwrite the file, append the results to the file, or choose a new 
filename (that is, unless you have chosen the option to overwrite files 
silently).

//
#File_not_found#

File not found

The name of the input file that you have chosen does not exist in the 
current working directory. Either choose a new filename or give the 
fully qualified filename (e.g. e:\codon\cu\input.dat).

Depending on the system that you are using, the names of all files in 
the current working directory may or may not be displayed when a file 
cannot be located. 

//
#file_exists#

File exists

If the filename that you have chosen as the output file exists, it will 
be deleted if opened for writing. You now have the choice of whether or 
not to overwrite this file (thus deleting the original). If you choose 
not to overwrite, you have the further choice of either appending the 
results to the file you originally chose or selecting a new filename.

(Note: If you select overwrite silently from the defaults menu you will 
not be prompted if a file of the same name already exists; it will be 
overwritten.)

//
#file_append#

File Append 

You decided not to overwrite the file. You can either append the results 
to this file or choose a new filename. 

//
#menu_2#

Menu 2 Purifying sequences menu

This menu was originally used to eliminate sequences from data that had 
high sequence identity to other sequences in the dataset and thus might 
bias the output results. 

This functionality is not currently portable and is not being made 
available at present. Try using the NCBI program nrdb or the EGCG9 
program clean_up to remove identical or almost identical sequences. 

//
#menu_3#

Menu 3 Defaults menu 

To improve flexibility, many of the default values used internally by 
CodonW (defined in the header file codonW.h) can be altered at runtime 
using this menu. Ten options can be customised. 

Option (1) Change ASCII delimiter in output. The default ASCII delimiter 
used to separate information in machine readable output files is a 
comma. The delimiter can be changed via this option to either the tab or 
space character. 

Option (2) Run silently. This option can be used when running from a 
script file or as a batch job. If TRUE, it suppresses warnings about 
overwriting files, the prompting for a personal choice of Fop, CBI or 
CAI values (although these can still be given via command line 
arguments) and the pause after each page of error or warning messages 
has been displayed. 

Option (3) Log warnings/information to a file. The default value for 
this option is set as FALSE, in which case all warning or error messages 
generated by CodonW are written to the screen via the standard error 
stream. When TRUE, the errors are redirected to a log file; you will be 
prompted for the filename of this log file. This option is useful if 
there are a large number of sequences in the input file or there are 
many warning messages.

Option (4) Number of lines on screen. This is used to set the screen 
length, which is used during screen refreshing and the pagination of 
error messages. 

Option (5) Change the genetic code. By default, CodonW assumes the 
universal genetic code when translating and processing codons. This 
option allows alternative genetic codes to be selected.

Option (6) Change the Fop/CBI values. To calculate either the CBI or Fop 
indices, a set of optimal codons is required; by default the optimal 
codons of E. coli are assumed. This option displays a submenu which 
lists eight species where optimal codons have been identified. When 
calculating the Fop/CBI of genes from these species the appropriate set 
of codons should be selected. Personal selections of optimal codons can 
be input at runtime. 

Option (7) Change the CAI values. To calculate the codon adaptation 
index it is necessary to assign fitness values to each codon; by default 
the fitness values of E. coli codons are assumed. However, these values 
are very species-specific and so using E. coli fitness values to 
calculate CAI values for other species is nonsensical. Before assigning 
fitness values to a codon a set of genes which have been experimentally 
verified to be highly expressed must be identified. Such sets have been 
created for relatively few species. This menu lists the species where a 
reference set of highly expressed genes is known, and fitness values 
assigned. Personal selections of fitness values can be input at runtime 
if calculating CAI.  

Option (8) Toggle human or machine-readable output. The default format 
for most CodonW output files is human readable. Machine-readable output 
is fixed width numerical data separated by an ASCII delimiter. This 
format is readily imported into a wide range of statistical and 
graphical analysis programs but not easily read by eye. Human readable 
output is more verbose but easier to read. The output formats for codon 
usage, tabulation of codon usage, relative synonymous codon usage and 
base compositions are the most radically affected by this option. 

Option (9) Toggle output for each or all genes. By default, CodonW 
processes each gene individually. When the option "all genes" is 
selected, sequences are concatenated and processed as a single sequence. 
This option can be used to calculate total codon or amino acid usage, 
the average G+C content, Fop, etc.

Option (10) Correspondence analysis defaults. This option allows access 
to the "advanced correspondence analysis" menu. This menu is normally 
accessed as a submenu of "Correspondence analysis" (Menu 5), but is 
included here so that all runtime options are accessible via the "Change 
default values" menu. 

//
#menu_4#

Menu 4 Codon Usage Indices

This menu is used to choose the indices calculated by CodonW; by default 
only the G+C content of the sequence is selected. The calculation of 
these indices (except G+C content) is dependent on the genetic code 
selected under Menu 3. More than one index may be calculated at once.

Option (1) Codon Adaptation Index (CAI). CAI measures the relative 
adaptation of a gene to the codon usage of highly expressed genes. The 
relative adaptiveness (w) of a codon is the ratio of the usage of that 
codon to that of the most abundant codon for the same amino acid. The 
relative adaptiveness of codons (albeit for a limited choice of species) 
can be selected from Menu 3.
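
A minimal Python sketch of the calculation, assuming the usual 
definition of CAI as the geometric mean of w over the codons of a gene. 
The function names and the reference counts are hypothetical; the 
counts are made up and are not real E. coli data, and CodonW's own 
handling of special cases (for example codons absent from the reference 
set) is not reproduced here.

```python
import math


def relative_adaptiveness(counts_by_amino_acid):
    """Map each codon to w = count / count of the most used synonym."""
    w = {}
    for codon_counts in counts_by_amino_acid.values():
        top = max(codon_counts.values())
        for codon, n in codon_counts.items():
            w[codon] = n / top
    return w


def cai(codons, w):
    """Geometric mean of w over the codons that have a positive w."""
    logs = [math.log(w[c]) for c in codons if w.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


# Hypothetical codon counts from a made-up set of highly expressed genes.
reference = {
    "Lys": {"AAA": 80, "AAG": 20},
    "Phe": {"TTT": 30, "TTC": 60},
}
w = relative_adaptiveness(reference)
score = cai(["AAA", "AAG", "TTC", "TTT"], w)
print(round(score, 4))
```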

Option (2) Frequency of Optimal codons (Fop). This index is the ratio of 
optimal codons to synonymous codons (genetic code dependent). Optimal 
codons for several species are in-built and can be selected using Menu 
3. By default, the optimal codons of E. coli are assumed. The user may 
also enter a personal choice of optimal codons. If rare synonymous 
codons have been identified, there is a choice of calculating the 
original Fop index or a modified Fop index. Fop values for the original 
index are always between 0 (where no optimal codons are used) and 1 
(where only optimal codons are used). When calculating the modified Fop 
index, any negative values are adjusted to zero. 
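
The original Fop index can be sketched in Python as below. This is an 
illustration only: the set of optimal codons is hypothetical, the 
universal genetic code is assumed when deciding which codons have no 
synonyms, and the modified index that uses rare codons is not shown.

```python
# Codons without synonyms (Met, Trp) and stop codons, universal code only.
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def split_codons(seq):
    """Cut a sequence into complete triplets, dropping any trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def fop(seq, optimal):
    """Fraction of synonymous codons that belong to the optimal set."""
    codons = [c for c in split_codons(seq) if c not in NON_SYNONYMOUS]
    if not codons:
        return 0.0
    return sum(c in optimal for c in codons) / len(codons)


optimal = {"AAA", "TTC"}  # hypothetical choice of optimal codons
value = fop("ATGAAAAAGTTCTTTTAA", optimal)
print(value)
```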

Option (3) Codon Bias Index (CBI). The codon bias index is a measure of 
directional codon bias. It measures the extent to which a gene uses a 
subset of optimal codons. 

Option (4) The effective number of codons (NC). This index is a simple 
measure of overall codon bias and is analogous to the effective number 
of alleles measure used in population genetics. Knowledge of the optimal 
codons or a reference set of highly expressed genes is unnecessary when 
calculating this index. 
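
The help text does not give the formula. The Python sketch below 
assumes the commonly cited form of the index for the universal code: 
a homozygosity F = (n * sum(p^2) - 1) / (n - 1) is computed for each 
amino acid, averaged within each degeneracy class, and combined as 
NC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, with values above 61 set to 61. 
How CodonW treats amino acids that are rare or absent in a gene is not 
reproduced; this sketch simply returns None in that case.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def effective_number_of_codons(codons):
    counts = Counter(c for c in codons if c in CODE)
    families = defaultdict(list)
    for codon, aa in CODE.items():
        if aa != "*":
            families[aa].append(codon)
    # Collect F for each degeneracy class (2, 3, 4 or 6 synonyms).
    f_by_class = defaultdict(list)
    for members in families.values():
        k = len(members)
        n = sum(counts[c] for c in members)
        if k < 2 or n < 2:
            continue
        sum_p2 = sum((counts[c] / n) ** 2 for c in members)
        f = (n * sum_p2 - 1) / (n - 1)
        if f > 0:
            f_by_class[k].append(f)
    nc = 2.0  # Met and Trp contribute one codon each
    for k, n_families in ((2, 9), (3, 1), (4, 5), (6, 3)):
        if not f_by_class[k]:
            return None  # class not usable; this sketch gives up
        nc += n_families / (sum(f_by_class[k]) / len(f_by_class[k]))
    return min(nc, 61.0)


# One codon per amino acid, ten times each: the most extreme bias.
first_codon = {}
for codon, aa in CODE.items():
    if aa != "*":
        first_codon.setdefault(aa, codon)
biased = [c for c in first_codon.values() for _ in range(10)]
print(effective_number_of_codons(biased))
```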

Option (5) G+C content of the gene. This is calculated as the frequency 
of nucleotides that are guanine or cytosine.

Option (6) G+C content 3rd position of synonymous codons (GC3s). This is 
the fraction of codons, synonymous at the third codon position, which 
have either a guanine or cytosine at that third codon position. 
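
The two G+C measures (options 5 and 6) can be sketched in Python as 
follows. The sketch assumes the universal genetic code, under which 
Met, Trp and the stop codons have no synonymous alternative, and it 
ignores codons containing ambiguous bases; both choices are 
assumptions of this illustration rather than a description of 
CodonW's internals.

```python
# Codons without synonyms (Met, Trp) and stop codons, universal code only.
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of unambiguous bases that are G or C."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


def gc3s(seq):
    """Fraction of synonymous codons with G or C in the third position."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    synonymous = [
        c for c in codons
        if c not in NON_SYNONYMOUS and set(c) <= set("ACGT")
    ]
    if not synonymous:
        return 0.0
    return sum(c[2] in "GC" for c in synonymous) / len(synonymous)


gene = "ATGAAAAAGTTCTTTTAA"  # hypothetical six-codon gene
print(gc_content(gene), gc3s(gene))
```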

Option (7) Silent base composition. Selection of this option calculates 
four separate indices, i.e. G3s, C3s, A3s & T3s. Although correlated 
with GC3s, these indices are not directly comparable with it. They 
quantify the usage of each base at synonymous third codon positions. 

Option (8) Length silent sites (Lsil). This is the number of 
synonymous codons within each gene.

Option (9) Length amino acids (Laa). This is the number of translatable 
codons.

Option (10) Hydropathicity of protein. This is the general average 
hydropathicity (GRAVY) score for the hypothetical translated gene 
product. It is the arithmetic mean of the hydropathic indices of each 
amino acid.

Option (11) Aromaticity score of protein. This is the frequency of 
aromatic amino acids (Phe, Tyr, Trp) in the hypothetical translated gene 
product. 
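
Both protein indices (options 10 and 11) are simple averages over the 
translated sequence. The Python sketch below assumes the Kyte and 
Doolittle hydropathy scale for GRAVY; whether CodonW uses exactly this 
table is an assumption of the illustration, and the four-residue 
peptide is invented.

```python
# Kyte and Doolittle hydropathy values, one-letter amino acid codes.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Arithmetic mean of the hydropathic indices of the residues."""
    values = [KYTE_DOOLITTLE[a] for a in protein.upper() if a in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


def aromaticity(protein):
    """Fraction of residues that are Phe, Tyr or Trp."""
    residues = [a for a in protein.upper() if a in KYTE_DOOLITTLE]
    if not residues:
        return 0.0
    return sum(a in "FYW" for a in residues) / len(residues)


peptide = "MKFW"  # hypothetical translated product
print(round(gravy(peptide), 3), aromaticity(peptide))
```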

The hydropathicity and aromaticity protein scores are indices of amino 
acid usage. The strongest trend in the variation in the amino acid 
composition of E. coli genes is correlated with protein hydropathicity, 
the second strongest trend is correlated with gene expression, while the 
third is correlated with aromaticity. 
//
#menu_5_coa#

Menu 5 Correspondence analysis

In many unicellular organisms, protein coding genes have non-random 
usage of synonymous codons (see Andersson and Kurland (1990) and Sharp 
et al. (1993) for reviews). Correspondence analysis uses contingency 
tables (counts of the joint occurrences of rows and columns of a table). 
Therefore, the sequence data must be transformed into a contingency 
table. The frequency of each codon (or amino acid) is tabulated for each 
gene. This is then converted into a Euclidean measurement of the 
distance between the rows or columns. CodonW calculates a scaled 
distance measurement as recommended by Grantham and co-workers (Grantham 
et al. 1981).  

Analysis of a large number of distances would ordinarily be very time 
consuming. Correspondence analysis provides a simple visualisation of 
these distances by projecting the points from their original 
multidimensional space onto lower dimensions, with genes with similar 
distances plotted as neighbours. In addition to calculating the 
coordinates for the projection of these points, correspondence analysis 
(as implemented in CodonW) also calculates the total inertia of the 
data, together with the eigenvalue and relative variation explained by 
each axis. CodonW can also quantify the absolute and relative 
contribution of each gene, codon or amino acid to each identified trend. 
To limit variation due to stochastic noise, it is recommended that short 
genes (less than 50 codons) be excluded from a correspondence analysis.

The correspondence analysis menu (Menu 5) has four options, the default 
option being not to generate a correspondence analysis, i.e. Do not 
perform a COA. 

Option (1) Correspondence analysis of codon usage. This generates a 
correspondence analysis on the total codon usage. By default, this is on 
synonymous codons, although the advanced menu may be used to adjust 
which codons are included/excluded. If analysing synonymous codon usage, 
the analysis has 58 degrees of freedom. 

Option (2) Correspondence analysis of RSCU. This generates a 
correspondence analysis of relative synonymous codon usage (RSCU). RSCU 
is calculated as the ratio of the observed frequency of a codon to the 
frequency expected under unbiased codon usage within a synonymous codon 
group. Correspondence analysis of RSCU is useful because variation 
caused by unequal usage of amino acids is removed; however, the number 
of degrees of freedom is reduced to 40. 
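
The RSCU definition above can be sketched in Python as follows. The 
genetic code table is the universal code written in TCAG order; the 
function name and the input codons are hypothetical, and amino acids 
that do not occur in the input are simply left out of the result, 
which may differ from what CodonW reports.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(codons):
    """Observed count divided by the count expected under equal usage."""
    counts = Counter(c for c in codons if c in CODE)
    families = defaultdict(list)
    for codon, aa in CODE.items():
        families[aa].append(codon)
    result = {}
    for aa, members in families.items():
        total = sum(counts[c] for c in members)
        # Skip stop codons, amino acids with one codon, and absent ones.
        if aa == "*" or len(members) < 2 or total == 0:
            continue
        expected = total / len(members)
        for c in members:
            result[c] = counts[c] / expected
    return result


values = rscu(["AAA", "AAA", "AAA", "AAG", "TTT", "TTC"])
print(values["AAA"], values["AAG"])
```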

Option (3) Correspondence analysis of Amino Acid usage. This generates a 
correspondence analysis of amino acid composition, with 19 degrees of 
freedom.

Option (4) Do not perform a correspondence analysis. This is the default 
option.
//
#menu_6#

Menu 6 Basic Stats

This menu was originally designed to calculate some basic statistics on 
the output from the various codon usage indices. 

This functionality is not currently portable and is not being made 
available at present. 


//
#menu_7#

Menu 7 Relaxation (almost) 

This menu was designed to help teach the genetic code(s). It asks 
various random questions about codon translation and codon usage. The 
genetic code used as the basis for the correct answers can be changed 
under the default menu (Menu 3).

//

#fun#

Teach yourself the genetic codes and codon usage. 

To exit type "quit" or "exit" (without the quotation marks). 

If you don't know the answer to the question, you can type "?" (without 
the quotation marks). You will then be prompted with the correct answer. 
Beware: you will be penalised for incorrect answers :).

The questions are:
What is the three-letter name?    
(You must convert the one-letter code given to the three-letter code.) 

How synonymous is Amino Acid?    
(How many synonyms are there for this amino acid?)

Name the Amino Acid?              
(Which amino acid is coded by this codon?) 

//

#menu_8_blk#

Menu 8 Bulk output options in CodonW 

Non-correspondence analysis output from CodonW which cannot easily be 
summarised as a single index is bulk output. Under this menu there are 
10 options. Multiple options cannot be selected simultaneously. Each 
time this menu is selected you will be prompted for an alternative 
output filename.

Option (1) Fasta format output of DNA sequence. The input sequences are 
reformatted and written to a file in a Fasta/Pearson-like format.

Option (2) Reader format output of DNA sequence. This format is derived 
from the fasta format, except that the sequence is written as codons 
with three bases separated by a space, and the size of the sequence is 
recorded at column 70. 

Option (3) Translate input file to amino acids. This translates DNA to 
amino acids using the selected genetic code. The amino acids are written 
in a Fasta/Pearson compatible format.
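
A minimal Python sketch of the translation step, assuming the 
universal genetic code. The use of "X" for codons that cannot be 
translated and "*" for stop codons is a choice made for this 
illustration and need not match the symbols CodonW writes.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, code=CODE, unknown="X"):
    """Translate complete codons; trailing bases are ignored."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        code.get(seq[i:i + 3], unknown)
        for i in range(0, len(seq) - len(seq) % 3, 3)
    )


protein = translate("ATGAAATTTTAA")
print(">gene1")  # hypothetical header line
print(protein)
```

Passing a different dictionary as the code argument would model the 
selection of an alternative genetic code.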

Option (4) Codon Usage. This is the default option. The frequency of 
each codon is written to a file in four rows with 16 columns per row. 
The codons are written in sequential numerical order, left to right.

Option (5) Amino acid usage. The frequency of each amino acid, 
untranslatable codons and stop codons are recorded, one row per gene and 
23 columns per row. The first column contains a unique gene description, 
the second column records number of untranslatable codons, the third and 
subsequent columns summarize the amino acid and termination codon usage.

Option (6) Relative Synonymous Codon Usage (RSCU). Relative synonymous 
codon usage is calculated as the ratio of the observed frequency of a 
codon to the frequency expected if codon usage were random.

Option (7) Relative Amino acid usage (RAAU). Relative Amino acid usage 
is the frequency of the amino acid relative to the total amino acid 
usage.

Option (8) Dinucleotide frequencies. The frequency of the 16 
dinucleotides is calculated in each of the three possible codon 
positions. The data are recorded with one row per position and 16 
columns per row. 
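
One possible reading of this tabulation is sketched below in Python: 
each dinucleotide is assigned to a position according to where its 
first base falls within the codon (codon positions 1:2, 2:3 and 3:1). 
That interpretation, the function name and the example sequence are 
assumptions of this illustration.

```python
from collections import Counter


def dinucleotide_counts(seq):
    """Return three Counters, one per codon position of the first base."""
    seq = seq.upper()
    positions = [Counter(), Counter(), Counter()]
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Skip dinucleotides that contain ambiguous bases.
        if set(pair) <= set("ACGT"):
            positions[i % 3][pair] += 1
    return positions


tables = dinucleotide_counts("ATGAAA")
for number, table in enumerate(tables, start=1):
    print(number, dict(table))
```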

Option (9) Base composition analysis. This option records the frequency 
of nucleotides in each codon position. It also reports GC, GC3s and GCns 
(GC content excluding synonymous third position codons). 

Option (10) No output written to file. This option is useful when 
working with large datasets and disk storage or disk access is a 
limiting factor. This option suppresses all the output to the bulk 
output file.
//

#menu_coa#

Advanced Correspondence Analysis menu.

This menu allows much greater control over the correspondence analysis. 

Option (1) Unselect or select. This menu changes slightly depending on 
whether correspondence analysis is of amino acid or codon usage. It 
simplifies the selection of the codons/amino acids that are to be 
included in the COA. This allows the user to override the default 
selections, which, if the COA is of codon usage, is the exclusion of 
non-synonymous codons and termination codons. 

Option (2) Change the number of axes. The number of axes generated by a 
correspondence analysis is N-1, where N is either the number of genes or 
columns (whichever is the lesser in value). However, the default is to 
generate information about the first four axes (or trends). This option 
allows the user to record coordinates on any number of axes, up to the 
maximum generated by the analysis. 

Each axis generated by correspondence analysis is represented by a 
multidimensional vector. The position of a gene on any axis is the 
product of that gene's codon usage and the axis vector. As the vector is 
itself a product of the codon usage, the vectors can be affected by 
unusual codon usage. An analysis of nuclear and plasmid genes would be 
difficult, as the codon usage of each would perturb the other. Each 
dataset could be analysed individually but as the vectors for the axes 
would be different, it would be difficult to make direct comparisons 
between the analyses. To overcome this problem it is necessary to 
generate the COA vectors using one dataset and then to apply the same 
vectors to another. Thus direct comparison between the ordination of 
genes is possible. In CodonW, this is possible by using the following 
option (Option 3).

Option (3) Add additional genes after correspondence analysis. The user 
is prompted for the file containing the additional sequences, to which 
the vectors are to be applied. The vectors are calculated, as normal, 
using the genes contained in the standard input file (Menu 1). The 
coordinates and any additional information about these original genes are 
recorded as normal. Next the additional genes are read in and the 
original vectors applied to them. The ordinations of these additional 
genes are then appended to the COA output files (for an explanation 
about the COA output files see below).
 
Option (4) Toggle level of correspondence analysis output. By default 
this option is set to "normal" but can be toggled to "exhaustive". If 
the exhaustive output option is selected, then in addition to the 
standard information about gene and codon/amino acid ordination, 
additional information about inertia of the rows and columns is 
generated. This additional information includes the absolute 
contribution of the inertia of each row or column to each of the 
recorded axes, and the fraction of the variation within each row or 
column explained by each axis.

Option (5) Change number of genes used to identify optimal codons. 
Correspondence analysis of either RSCU or codon usage where the major 
trend correlates with gene expression can be used to identify optimal 
codons. This is achieved by comparing the codon usage of the genes that 
lie at the extremes of the principal trend (axis 1). By default this is 
the top and bottom 10% of genes (as defined by axis 1 ordination). With 
this option, it can be set to a percentage between 1% and 50%, or to an 
absolute number of genes.  
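
The selection of genes at the two extremes of axis 1 can be sketched 
in Python as below. The function name, its parameters and the gene 
coordinates are hypothetical, and the sketch does not decide which of 
the two groups corresponds to highly expressed genes.

```python
def extreme_genes(axis1, fraction=0.10, count=None):
    """Return the genes at the low and high ends of axis 1.

    axis1 maps gene name to its axis 1 coordinate. Either a fraction
    of the genes or an absolute count can be given.
    """
    ranked = sorted(axis1, key=axis1.get)
    n = count if count is not None else max(1, int(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


# Hypothetical ordination of twenty genes on the principal axis.
coords = {f"g{i}": float(i) for i in range(20)}
low, high = extreme_genes(coords)
print(low, high)
```

The codon usage of the two groups returned would then be compared to 
identify the optimal codons.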
//

#select#

Codon or Amino acid selection

The codons or amino acids that will NOT be analysed in this 
correspondence analysis are surrounded by curly brackets. The choice of 
which codons/amino acids are to be excluded can be changed. Simply 
give the number associated with each codon/amino acid for which you want 
to change the status. 

//