CodonW is a package for codon usage analysis. It was designed to
simplify Multivariate Analysis (MVA) of codon usage. The MVA method
employed in CodonW is correspondence analysis (COA), the most widely
used codon usage MVA method. COA can be performed on codon usage,
relative synonymous codon usage or amino acid usage. Integrated into
CodonW is the ability to work with genetic codes other than the
universal code. Other indices of codon usage and codon bias,
dinucleotide bias and mutation bias are also analysed by CodonW.
Modes of use:
a) An extensive number of command line options are available if
your platform supports command line parameters. For more information
b) Maximum functionality is obtained by running CodonW using the
interactive menus. Each menu has its own online help.
c) CodonW also emulates a large number of useful utility programs used
in our labs to aid the analysis of codon usage. If the first argument
to the CodonW program is one of a recognised list of programs (rscu, cu,
aau, raau, tidy, reader, cutab, cutot, transl, bases, base3s, dinuc,
cai, fop, gc3s or enc), CodonW assumes that you want to accomplish or
calculate one of these simpler tasks/indices and bypasses the menu
system. For a fuller description of what these pseudo programs
calculate, see the README file.
To Run CodonW:
a) You must load a file containing all your sequences in fasta/Pearson
format, either from the command line or using menu 1.
b) You may change many of the default values using menu 3.
c) Select which codon usage indices to measure (menu 4). Choose the type
of correspondence analysis, if any (menu 5). Other data analysis options
may also be selected using menu 8.
d) Return to the first (main) menu and type R to run an analysis.
Output files from the correspondence analysis have the extension .coa.
See summary.coa for an overall explanation of what is being generated by
Other output will be stored in the files that you choose using menu 1 or
as specified on the command line. Depending on the options chosen there
will either be one or two result files; usually they will have the
extensions .out and .blk.
Open file dialog.
You have been requested to choose a file for the analysis. If the
request is for an input filename, this file must contain all your
sequences that you wish to analyse in a sequential fasta formatted file.
That is, all sequences should be in one file and individual sequences
separated by a single header line that starts with an angle bracket
If you use GCG, the output from the program tofasta is acceptable.
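The input layout described above can be checked before running an analysis. The following is a minimal sketch in Python, not part of CodonW; the function name read_fasta is hypothetical, and it assumes each header line begins with ">" and that sequence lines may be wrapped.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header line starts a new sequence record.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines are joined until the next header.
            records[header] += line.upper()
    return records


example = """>geneA example
ATGGCT
GCATAA
>geneB
ATGTTTTAA
""".splitlines()

records = read_fasta(example)
```

Here records maps each header to its full sequence, so a multi-sequence file can be inspected gene by gene.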
If prompted for either the "bulk" or "output" file names, these
filenames will be used to record the results of the analysis. These
files will be opened for writing which may destroy the content of the
files, should the files already exist. So if a file already exists with
the name you have chosen, you will be asked whether you wish to
overwrite the file, append the results to the file, or choose a new
filename (that is, unless you have chosen the option to overwrite files
File not found
The name of the input file that you have chosen does not exist in the
current working directory. Either choose a new filename or give the
fully qualified filename (e.g. e:\codon\cu\input.dat).
Depending on the system that you are using, the names of all files in
the current working directory may or may not be displayed when a file
cannot be located.
If the filename that you have chosen as the output file exists, it will
be deleted if opened for writing. You now have the choice of whether or
not to overwrite this file (thus deleting the original). If you choose
not to overwrite you have the further choice of either appending the
results to the file you originally chose or selecting a new filename.
(Note: If you select overwrite silently from the defaults menu you will
not be prompted if a file of the same name already exists; it will be
You decided not to overwrite the file. You can either append the results
to this file or choose a new filename.
Menu 2 Purifying sequences menu
This menu was originally used to eliminate sequences from data that had
high sequence identity to other sequences in the dataset and thus might
bias the output results.
This functionality is not currently portable and is not being made
available at present. Try using the NCBI program nrdb or the EGCG9
program clean_up to remove identical or almost identical sequences.
Menu 3 Defaults menu
To improve flexibility, many of the default values used internally by
CodonW (defined in the header file codonW.h) can be altered at runtime
using this menu. Ten options can be customised.
Option (1) Change ASCII delimiter in output. The default ASCII delimiter
used to separate information in machine readable output files is a
comma. The delimiter can be changed via this option to either the tab or
Option (2) Run silently. This option can be used when running from a
script file or as a batch job. If TRUE, it suppresses warnings about
overwriting files, the prompting for a personal choice of Fop, CBI or
CAI values (although these can still be given via command line
arguments) and the pause after each page of error or warning messages
has been displayed.
Option (3) Log warnings/information to a file. The default value for
this option is set as FALSE, in which case all warning or error messages
generated by CodonW are written to the screen via the standard error
stream. When TRUE, the errors are redirected to a log file:- you will be
prompted for the filename for this log file. This option is useful if
there are a large number of sequences in the input file or there are
many warning messages.
Option (4) Number of lines on screen. This is used to set the screen
length, which is used during screen refreshing and the pagination of
Option (5) Change the genetic code. By default, CodonW assumes the
universal genetic code when translating and processing codons. This
option allows alternative genetic codes to be selected.
Option (6) Change the Fop/CBI values. To calculate either the CBI or Fop
indices, a set of optimal codons is required; by default the optimal
codons of E. coli are assumed. This option displays a submenu which
lists eight species where optimal codons have been identified. When
calculating the Fop/CBI of genes from these species the appropriate set
of codons should be selected. Personal selections of optimal codons can
be input at runtime.
Option (7) Change the CAI values. To calculate the codon adaptation
index it is necessary to assign fitness values to each codon; by default
the fitness values of E. coli codons are assumed. However, these values
are very species-specific and so using E. coli fitness values to
calculate CAI values for other species is nonsensical. Before assigning
fitness values to a codon a set of genes which have been experimentally
verified to be highly expressed must be identified. Such sets have been
created for relatively few species. This menu lists the species where a
reference set of highly expressed genes is known, and fitness values
assigned. Personal selections of fitness values can be input at runtime
if calculating CAI.
Option (8) Toggle human or machine-readable output. The default format
for most CodonW output files is human readable. Machine-readable output
is fixed width numerical data separated by an ASCII delimiter. This
format is readily imported into a wide range of statistical and
graphical analysis programs but not easily read by eye. Human readable
output is more verbose but easier to read. The output formats for codon
usage, tabulation of codon usage, relative synonymous codon usage and
base compositions are the most radically affected by this option.
Option (9) Toggle output for each or all genes. By default, CodonW
processes each gene individually. When the option "all genes" is
selected, sequences are concatenated and processed as a single sequence.
This option can be used to calculate total codon or amino acid usage,
the average G+C content, Fop, etc.
Option (10) Correspondence analysis defaults. This option allows access
to the "advanced correspondence analysis" menu. This menu is normally
accessed as a submenu of "Correspondence analysis" (Menu 5), but is
included here so that all runtime options are accessible via the "Change
default values" menu.
Menu 4 Codon Usage Indices
This menu is used to choose the indices calculated by CodonW; by default
only the G+C content of the sequence is selected. The calculation of
these indices (except G+C content) is dependent on the genetic code
selected under Menu 3. More than one index may be calculated at once.
Option (1) Codon Adaptation Index (CAI). CAI measures the relative
adaptation of a gene to the codon usage of highly expressed genes. The
relative adaptiveness (w) of a codon is the ratio of the usage of that
codon to that of the most abundant codon for the same amino acid. The
relative adaptiveness of codons (albeit for a limited choice of species)
can be selected from Menu 3.
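The calculation can be illustrated with a short Python sketch. It is not CodonW's own code: the function names are hypothetical, the universal genetic code is assumed, and the floor of 0.5 applied to codons absent from the reference set is an assumption made here only to avoid taking the logarithm of zero. CAI itself is the geometric mean of the w values of the codons in the gene.

```python
import math

# Universal genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def relative_adaptiveness(ref_counts, floor=0.5):
    """w = codon count / count of the most used synonym in the reference set."""
    families = {}
    for codon, aa in CODE.items():
        families.setdefault(aa, []).append(codon)
    w = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp offer no synonymous choice
        top = max(ref_counts.get(c, 0) for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # floor is an assumed pseudocount for unobserved codons
            w[c] = max(ref_counts.get(c, 0), floor) / top
    return w


def cai(seq, w):
    """Geometric mean of w over the scorable codons of seq."""
    logs = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in w:
            logs.append(math.log(w[codon]))
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference counts, for illustration only (not a real reference set).
ref = {"AAA": 8, "AAG": 2, "GAA": 6, "GAG": 3}
w = relative_adaptiveness(ref)
```

A gene built only from the most used codons of this toy reference set, such as "AAAGAA", scores 1.0; any use of less favoured synonyms lowers the value.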
Option (2) Frequency of Optimal codons (Fop). This index is the ratio of
optimal codons to synonymous codons (genetic code dependent). Optimal
codons for several species are in-built and can be selected using Menu
3. By default, the optimal codons of E. coli are assumed. The user may
also enter a personal choice of optimal codons. If rare synonymous
codons have been identified, there is a choice of calculating the
original Fop index or a modified Fop index. Fop values for the original
index are always between 0 (where no optimal codons are used) and 1
(where only optimal codons are used). When calculating the modified Fop
index, any negative values are adjusted to zero.
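The original Fop index can be sketched in Python as follows. This is an illustration rather than CodonW's implementation: the function name is hypothetical, the universal genetic code is assumed, and the set of optimal codons used in the example is a toy set, not the in-built E. coli set. The modified index is not covered.

```python
from collections import Counter

# Universal genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
# Number of codons that encode each amino acid.
DEGENERACY = Counter(CODE.values())


def fop(seq, optimal):
    """Optimal codons divided by all synonymous codons in seq."""
    n_opt = n_syn = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        aa = CODE.get(codon)
        if aa is None or aa == "*" or DEGENERACY[aa] == 1:
            continue  # untranslatable, stop, Met or Trp: not synonymous
        n_syn += 1
        if codon in optimal:
            n_opt += 1
    if n_syn == 0:
        raise ValueError("no synonymous codons in sequence")
    return n_opt / n_syn


toy_optimal = {"AAA", "GAA"}  # hypothetical optimal codons
value = fop("ATGAAAAAGGAAGAGTAA", toy_optimal)
```

In the example, four of the six codons are synonymous and two of those are in the toy optimal set, so value is 0.5.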
Option (3) Codon Bias Index (CBI). The codon bias index is a measure of
directional codon bias. It measures the extent to which a gene uses a
subset of optimal codons.
Option (4) The effective number of codons (NC). This index is a simple
measure of overall codon bias and is analogous to the effective number
of alleles measure used in population genetics. Knowledge of the optimal
codons or a reference set of highly expressed genes is unnecessary when
calculating this index.
Option (5) G+C content of the gene. This is calculated as the frequency
of nucleotides that are guanine or cytosine.
Option (6) G+C content 3rd position of synonymous codons (GC3s). This is
the fraction of codons, synonymous at the third codon position, which
have either a guanine or cytosine at that third codon position.
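Both G+C indices can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CodonW's code: the function names are hypothetical, the universal genetic code is assumed, and a codon is treated as synonymous when its amino acid is encoded by more than one codon.

```python
from collections import Counter

# Universal genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
DEGENERACY = Counter(CODE.values())


def gc_content(seq):
    """Fraction of A/C/G/T bases that are G or C."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    if total == 0:
        raise ValueError("no nucleotides in sequence")
    return (seq.count("G") + seq.count("C")) / total


def gc3s(seq):
    """Fraction of synonymous codons with G or C at the third position."""
    seq = seq.upper()
    n_syn = n_gc = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        aa = CODE.get(codon)
        if aa is None or aa == "*" or DEGENERACY[aa] == 1:
            continue  # skip untranslatable codons, stops, Met and Trp
        n_syn += 1
        if codon[2] in "GC":
            n_gc += 1
    if n_syn == 0:
        raise ValueError("no synonymous codons in sequence")
    return n_gc / n_syn


gene = "ATGGCTGCCAAATGGTAA"
gc = gc_content(gene)
gc3 = gc3s(gene)
```

For this toy gene, 8 of 18 bases are G or C, and one of the three synonymous codons (GCT, GCC, AAA) ends in G or C.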
Option (7) Silent base composition. Selection of this option calculates
four separate indices, i.e. G3s, C3s, A3s & T3s. Although correlated
with GC3s, this index is not directly comparable with it. It quantifies
the usage of each base at synonymous third codon positions.
Option (8) Length silent sites (Lsil). This is the frequency of
synonymous codons within each gene.
Option (9) Length amino acids (Laa). This is the number of translatable
Option (10) Hydropathicity of protein. This is the general average
hydropathicity (GRAVY) score for the hypothetical translated gene
product. It is the arithmetic mean of the hydropathic indices of its
amino acids.
Option (11) Aromaticity score of protein. This is the frequency of
aromatic amino acids (Phe, Tyr, Trp) in the hypothetical translated gene
The hydropathicity and aromaticity protein scores are indices of amino
acid usage. The strongest trend in the variation in the amino acid
composition of E. coli genes is correlated with protein hydropathicity,
the second strongest trend is correlated with gene expression, while the
third is correlated with aromaticity.
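Both protein scores can be sketched in Python from a translated sequence. This is an illustration, not CodonW's code: the function names are hypothetical, and the hydropathic indices are assumed to be those of the Kyte and Doolittle scale, on which the GRAVY score is conventionally based.

```python
# Kyte and Doolittle hydropathic indices (assumed scale), one-letter codes.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FYW")  # Phe, Tyr, Trp


def gravy(protein):
    """Arithmetic mean of the hydropathic indices of the residues."""
    scores = [HYDROPATHY[aa] for aa in protein.upper() if aa in HYDROPATHY]
    if not scores:
        raise ValueError("no scorable residues in protein")
    return sum(scores) / len(scores)


def aromaticity(protein):
    """Frequency of aromatic residues among the scorable residues."""
    residues = [aa for aa in protein.upper() if aa in HYDROPATHY]
    if not residues:
        raise ValueError("no scorable residues in protein")
    return sum(aa in AROMATIC for aa in residues) / len(residues)


g = gravy("AG")            # mean of 1.8 and -0.4
a = aromaticity("FWYA")    # three aromatic residues out of four
```

Characters outside the twenty standard residues, such as a stop symbol, are ignored in this sketch.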
Menu 5 Correspondence analysis
In many unicellular organisms, protein coding genes have non-random
usage of synonymous codons (see Andersson and Kurland (1990) and Sharp
et al. (1993) for reviews). Correspondence analysis uses contingency
tables (counts of the joint occurrences of rows and columns of a table).
Therefore, the sequence data must be transformed into a contingency
table. The frequency of each codon (or amino acid) is tabulated for each
gene. This is then converted into a Euclidean measurement of the
distance between the rows or columns. CodonW calculates a scaled
distance measurement as recommended by Grantham and co-workers (Grantham
et al. 1981).
Analysis of a large number of distances would ordinarily be very time
consuming. Correspondence analysis provides a simple visualisation of
these distances by projecting the points from their original
multidimensional space onto lower dimensions, with genes with similar
distances plotted as neighbours. In addition to calculating the
coordinates for the projection of these points, correspondence analysis
(as implemented in CodonW) also calculates the total inertia of the
data, together with the eigenvalue and relative variation explained by
each axis. CodonW can also quantify the absolute and relative
contribution of each gene, codon or amino acid to each identified trend.
To limit variation due to stochastic noise, it is recommended that short
genes (less than 50 codons) be excluded from a correspondence analysis.
The correspondence analysis menu (Menu 5) has four options, the default
option being not to generate a correspondence analysis, i.e. Do not
perform a COA.
Option (1) Correspondence analysis of codon usage. This generates a
correspondence analysis on the total codon usage. By default, this is on
synonymous codons, although the advanced menu may be used to adjust
which codons are included/excluded. If analysing synonymous codon usage,
the analysis has 58 degrees of freedom.
Option (2) Correspondence analysis of RSCU. This generates a
correspondence analysis of relative synonymous codon usage (RSCU). RSCU
is calculated as the ratio of the observed frequency of a codon to the
frequency expected under unbiased codon usage within a synonymous codon
group. Correspondence analysis of RSCU is useful because variation
caused by unequal usage of amino acids is removed; however the number
of degrees of freedom is reduced to 40.
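The RSCU calculation described above can be sketched in Python. This is an illustration, not CodonW's code: the function name is hypothetical and the universal genetic code is assumed. The expected frequency of a codon is taken as the total count of its synonymous group divided by the number of codons in that group.

```python
from collections import Counter

# Universal genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def rscu(seq):
    """Observed codon count divided by the count expected under equal usage."""
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    families = {}
    for codon, aa in CODE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stops, Met and Trp have no synonymous alternatives
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid not used in this gene
        expected = total / len(family)
        for c in family:
            values[c] = counts[c] / expected
    return values


# Lysine is encoded three times by AAA and once by AAG.
values = rscu("AAAAAAAAAAAG")
```

With equal usage both lysine codons would score 1.0; here AAA scores 1.5 and AAG scores 0.5, independent of how often lysine itself occurs.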
Option (3) Correspondence analysis of Amino Acid usage. This generates a
correspondence analysis of amino acid composition, with 19 degrees of
Option (4) Do not perform a correspondence analysis. This is the default
Menu 6 Basic Stats
This menu was originally designed to calculate some basic statistics on
the output from the various codon usage indices.
This functionality is not currently portable and is not being made
available at present.
Menu 7 Relaxation (almost)
This menu was designed to help teach the genetic code(s). It asks
various random questions about codon translation and codon usage. The
genetic code used as the basis for the correct answers can be changed
under the default menu (Menu 3).
Teach yourself the genetic codes and codon usage.
To exit type "quit" or "exit" (without the quotation marks).
If you don't know the answer to the question, you can type "?" (without
the quotation marks).
You will then be prompted with the correct answer. Beware:- you will be
penalised for incorrect answers :).
The questions are:
What is the three-letter name?
(You must convert the one-letter code given to the three-letter code.)
How synonymous is Amino Acid?
(How many synonyms are there for this amino acid?)
Name the Amino Acid?
(Which amino acid is coded by this codon?)
Menu 8 Bulk output options in CodonW
Non-correspondence analysis output from CodonW which cannot easily be
summarised as a single index is bulk output. Under this menu there are
10 options. Multiple options cannot be selected simultaneously. Each
time this menu is selected you will be prompted for an alternative
Option (1) Fasta format output of DNA sequence. The input sequences are
reformatted and written to a file in a Fasta/Pearson-like format.
Option (2) Reader format output of DNA sequence. This format is derived
from the fasta format, except that the sequence is written as codons
with three bases separated by a space, and the size of the sequence is
recorded at column 70.
Option (3) Translate input file to amino acids. This translates DNA to
amino acids using the selected genetic code. The amino acids are written
in a Fasta/Pearson compatible format.
Option (4) Codon Usage. This is the default option. The frequency of
each codon is written to a file in four rows with 16 columns per row.
The codons are written in sequential numerical order, left to right.
Option (5) Amino acid usage. The frequency of each amino acid,
untranslatable codons and stop codons are recorded, one row per gene and
23 columns per row. The first column contains a unique gene description,
the second column records number of untranslatable codons, the third and
subsequent columns summarize the amino acid and termination codon usage.
Option (6) Relative Synonymous Codon Usage (RSCU). Relative synonymous
codon usage is calculated as the ratio of the observed frequency of a
codon to the frequency expected if codon usage were random.
Option (7) Relative Amino acid usage (RAAU). Relative Amino acid usage
is the frequency of the amino acid relative to the total amino acid
Option (8) Dinucleotide frequencies. The frequency of the 16
dinucleotides is calculated in each of the three possible codon
positions. The data are recorded with one row per position and 16
columns per row.
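One way to tabulate dinucleotides by codon position is sketched below in Python. This is an assumption about the layout rather than a description of CodonW's code: a dinucleotide is assigned to the codon position of its first base, so the three rows hold positions 1:2, 2:3 and 3:1 (the third base of one codon followed by the first base of the next). The function name is hypothetical.

```python
from collections import Counter


def dinuc_by_position(seq):
    """Count dinucleotides, grouped by the codon position of the first base."""
    seq = seq.upper()
    counts = [Counter(), Counter(), Counter()]
    for i in range(len(seq) - 1):
        # i % 3 is 0, 1 or 2 for codon positions 1, 2 and 3.
        counts[i % 3][seq[i:i + 2]] += 1
    return counts


def to_frequencies(counter):
    """Convert one row of counts into relative frequencies."""
    total = sum(counter.values())
    return {dinuc: n / total for dinuc, n in counter.items()} if total else {}


rows = dinuc_by_position("ATGGCC")
freq_first = to_frequencies(rows[0])
```

For the sequence "ATGGCC", the first row holds AT and GC, the second TG and CC, and the third GG.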
Option (9) Base composition analysis. This option records the frequency
of nucleotides in each codon position. It also reports GC, GC3s and GCns
(GC content excluding synonymous third position codons).
Option (10) No output written to file. This option is useful when
working with large datasets and disk storage or disk access is a
limiting factor. This option suppresses all the output to the bulk
Advanced Correspondence Analysis menu.
This menu allows much greater control over the correspondence analysis.
Option (1) Unselect or select. This menu changes slightly depending on
whether correspondence analysis is of amino acid or codon usage. It
simplifies the selection of the codons/amino acids that are to be
included in the COA. This allows the user to override the default
selections, which, if the COA is of codon usage, exclude non-synonymous
codons and termination codons.
Option (2) Change the number of axes. The number of axes generated by a
correspondence analysis is N-1, where N is either the number of genes or
columns (whichever is the lesser in value). However, the default is to
generate information about the first four axes (or trends). This option
allows the user to record coordinates on any number of axes, up to the
maximum generated by the analysis.
Each axis generated by correspondence analysis is represented by a
multidimensional vector. The position of a gene on any axis is the
product of that gene's codon usage and the axis vector. As the vector is
itself a product of the codon usage, the vectors can be affected by
unusual codon usage. An analysis of nuclear and plasmid genes would be
difficult, as the codon usage of each would perturb the other. Each
dataset could be analysed individually but as the vectors for the axes
would be different, it would be difficult to make direct comparisons
between the analyses. To overcome this problem it is necessary to
generate the COA vectors using one dataset and then to apply the same
vectors to another. Thus direct comparison between the ordination of
genes is possible. In CodonW, this is possible by using the following
option (Option 3).
Option (3) Add additional genes after correspondence analysis. The user
is prompted for the file containing the additional sequences, to which
the vectors are to be applied. The vectors are calculated, as normal,
using the genes contained in the standard input file (Menu 1). The
coordinates and any additional information about these original genes are
recorded as normal. Next the additional genes are read in and the
original vectors applied to them. The ordinations of these additional
genes are then appended to the COA output files (for an explanation
about the COA output files see below).
Option (4) Toggle level of correspondence analysis output. By default
this option is set to "normal" but can be toggled to "exhaustive". If
the exhaustive output option is selected, then in addition to the
standard information about gene and codon/amino acid ordination,
additional information about inertia of the rows and columns is
generated. This additional information includes the absolute
contribution of the inertia of each row or column to each of the
recorded axes, and the fraction of the variation within each row or
column explained by each axis.
Option (5) Change number of genes used to identify optimal codons.
Correspondence analysis of either RSCU or codon usage where the major
trend correlates with gene expression can be used to identify optimal
codons. This is achieved by comparing the codon usage of the genes that
lie at the extremes of the principal trend (axis 1). By default this is
the top and bottom 10% of genes (as defined by axis 1 ordination). Using
this option this can be set to a percentage between 1% and 50%, or to an
absolute number of genes.
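The selection of genes at the two extremes of axis 1 can be sketched in Python. This is an illustration only: the function name and the gene labels are hypothetical, and the rounding rule (truncate, but keep at least one gene per group) is an assumption. Which extreme corresponds to highly expressed genes is not determined by the ordination itself and is not decided here.

```python
def extreme_genes(axis1, percent=10.0):
    """Return the genes in the bottom and top `percent` of axis 1 ordination."""
    if not 1 <= percent <= 50:
        raise ValueError("percent must be between 1 and 50")
    # Rank gene names by their coordinate on axis 1, lowest first.
    ranked = sorted(axis1, key=axis1.get)
    # Assumed rounding rule: truncate, but keep at least one gene per group.
    n = max(1, int(len(ranked) * percent / 100))
    return ranked[:n], ranked[-n:]


# Hypothetical axis 1 coordinates for ten genes.
coords = {f"g{i}": float(i) for i in range(10)}
low, high = extreme_genes(coords)
low20, high20 = extreme_genes(coords, percent=20)
```

The codon usage of the two returned groups would then be compared to look for codons that are used significantly more often in one group than in the other.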
Codon or Amino acid selection
The codons or amino acids that will NOT be analysed in this
correspondence analysis are surrounded by curly brackets. The choice of
which codons/amino acids are to be excluded can be changed. Simply
give the number associated with each codon/amino acid for which you want
to change the status.