1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350
|
Tutorial
Codon usage analysis
Included with this distribution of codonW should be a test dataset of
sequences (input.dat). We will use this set of sequence as a typical example
of a codon usage analysis. This test dataset is derived from the open
reading frames (ORFs) of Saccharomyces cerevisiae chromosome III as
annotated in the EMBL feature table for the sequence entry SCCHRIII
(accession number X59720). In the current EMBL (Release 51 June 1997) the
number of annotated ORFs was 172. The file input.dat contains 111 of these
ORFs. The rational and why some ORFs where removed is explained below.
The commandline syntax of codonW will be used in this tutorial, all options
selected from the commandline are also selectable using the menu system. For
more information please read the command line help (codonw -help) or just
type "codonw" and use the menu specific online.
Build your dataset of genes carefully.
Always remember that as in any analysis, but particularly with codon usage,
GIGO (garbage in, garbage out). Examine as many sources of information about
the data as possible, particularly the original publication and sequence
annotations. It is important that the sequences are a representative sample.
Five ORFs where removed from the dataset because they where annotated (and
had sequence identity) with genes within the previously identified
transposable elements Ty2 and Ty5. These ORFs where annotated at positions
1537-2127, 2118-2558, 2816-3742, 84714-86030, 84714-90384. The codon usage
of transposable element genes differs from that of chromosomal genes.
Further checks of sequence annotation was carried out, those sequences which
had not been assigned gene names or SwissProt accession numbers where
removed. The SwissProt annotation was also checked, genes described as
hypothetical but which did not have any sequence identity with other
proteins where removed.
Check basic sequence integrity
Sequences should be checked to confirm that they match some basic gene
characteristics. Each sequence might reasonably be expected to have an
initiation codon and a translation termination codon, and no internal stop
codons. Those sequences that do not match these characteristics, or
sequences that have partial codons or untranslatable codons are flagged by
codonw with warning messages.
To make a first pass of the input data to check for simple sequence
problems:
codonw input.dat -nomenu
By default codonw will report the codon usage of each gene to the file
input.blk. As there are no problems with this dataset there should be no
warning messages. However analysis of a previous version of this dataset
based on EMBL Release 50 where SCCHRIII had 230 annotated ORFs, generated
these typical warning messages.
Warning: Sequence 178 "SCCHRIII.PE178______" does not begin with a
recognised start codon
Warning: Sequence 178 "SCCHRIII.PE178______" is not terminated by a stop
codon
Warning: Sequence 202 "SCCHRIII.PE202______" does not begin with a
recognised start codon
Warning: Sequence 202 "SCCHRIII.PE202______" has 1 internal stop codon(s)
Warning: Sequence 202 "SCCHRIII.PE202______" is not terminated by a stop
codon
Each sequence is labelled by its numerical occurrence in the input file
(i.e. these are the 178th and 202nd sequences in the input file) and its
sequence header line.
Sequences that generate warning messages should be examined closely to
ascertain why. Some sequences may be annotated as partial sequences and
therefore the absence of a start or stop codon or the presence of a 3'
partial codon is to be expected. Note the presence of a 5' partial codon
would cause a frame shift, it is ESSENTIAL that 5' partial codons are
removed. Unless the frame shift that they produce, results in a (incorrect)
reading frame that contains internal stop codons, codonw cannot detect this
problem. The codon usage of a frame shifted gene sequence could adversely
affect the correspondence analysis (COA) (though such genes are often
recognisable as being outliers on the COA plots).
If a sequence warning is due to incorrect annotation this should be
corrected manually. Sequences that produce warnings that cannot be explained
or justified (e.g. a gene with internal stop codon) should be excluded.
These warning are informational only and do not exclude sequences from the
analysis.
Codon usage indices
Once the initial quality checks have been made for the data we can then
proceed with the codon usage analysis (strictly speaking we can generate COA
and codon usage indices tasks at the same time). Some of the indices of
codon usage bias that CodonW calculates (i.e. Fop, CAI and CBI) use
information about a preferred set of codons for highly expressed genes. This
information is species specific and does not apply to all species (most
eukaryotes and many prokaryotes appear to display no codon preference in
highly expressed genes). Therefore care must be taken that the appropriate
set of optimal codons are used. For most species the optimal codons are not
know and therefore the indices should not be calculated at this stage.
However this information is known for Saccharomyces cerevisiae, so we can
immediately calculate these indices of codon usage. Later we will see how
codonW identifies optimal codons and can generate this information for your
species.
The default optimal codons and codon adaptation values are those of E. coli.
To select an alternative choice we use the c_type (for CAI values ) and
f_type (for FOP/CBI) commandline arguments. These switches requires an
integer values, this value is the same as the option number if we where
using the menu system to change the codon information.
Example "-c_type 2" is equivalent to
Choose "Main Menu"
Choose "Changes Defaults Menu"
Choose "Change the CAI values"
Choose "(2) Saccharomyces cerevisiae"
Example "-f_type 4" is equivalent to
Choose "Main Menu"
Choose "Changes Defaults Menu"
Choose "Change the Fop/CBI values"
Choose "(4) Saccharomyces cerevisiae"
Therefore to select all the codon usage indices calculated by codonw and to
use the optimal codons of Saccharomyces cerevisiae type:
codonw input.dat -all_indices -c_type 2 -f_type 4 -nomenu
See below for the output of this command
The commandline flag -nomenu by passes the menu system, the -all_indices
indicates to codonw that you wish to calculate all the codon and amino acid
usage indices. These indices areT3s, C3s, A3s, G3s, CAI, CBI, Fop, Nc, GC3s,
GC, L_sym, L_aa, Gravy and Aromaticity. For a fuller explanation of what
these indices are see Readme.indices. These indices can also be used to
check whether there are any identical or almost identical sequences in the
input file. If we sort the result file "input.out" we it is much easier to
identify the sequences which are similar.
sort -k 2n input.out (unix for "sort using the second
numerical field")
The sorted output reveals the presence of two pairs of identical sequences
(Mating type proteins)
ALPHA2____________63 0.3636 0.2273 0.4939 0.2177 0.109
MATALPHA2_________63 0.3636 0.2273 0.4939 0.2177 0.109
and
ALPHA1____________52 0.4361 0.2180 0.4228 0.2589 0.112
MATALPHA1_________52 0.4361 0.2180 0.4228 0.2589 0.112
Sequences which appear to be multiple copies of the same gene are normally
removed from our codon usage datasets, even if the sequences are not
identical but where the differences c codon usage bias as observed, lower values indicate stronger bias. A
useful feature of ENc is that the affect of GC biases have on the index can
be estimated. This allows the comparison of GC3s and ENc against the
theoretical values if codon bias was simply caused due to GC mutational
bias. A plot of ENc vs. GC3s can be seen at
http://www.molbiol.ox.ac.uk/cu/EncVsGC3s.gif. Although the majority of genes
in this plot have a degree of codon bias that can be explained in terms of
GC mutation, the cluster of genes (six genes with ENc <40) which have much
stronger codon bias than be simply explained in terms of mutational biases.
These genes are good candidates as genes whose codon usage has been
determined by natural selection, probably selection for translational
efficiency.
Correspondence Analysis (COA)
We are now ready to generate a correspondence analysis of the codon usage of
SCCHRIII genes. We have a choice about how much information is generated. In
this example we will use the default values.
codonw input.dat -coa_cu -nomenu -silent (-silent stops all
prompting)
This generates a COA of codon usage. The summary file is "summary.coa" and
contains most of the data generated by the COA. One of the first sections is
the "Explanation of the variation by axis" also stored in eigen.coa.
The total inertia of the data was 0.263176
Num. Eigenval. R.Iner. R.Sum |Num. Eigenval. R.Iner. R.Sum |
01 +4.5755E-02 +0.1739 +0.1739 |02 +3.2372E-02 +0.1230 +0.2969 |
03 +1.8405E-02 +0.0699 +0.3668 |04 +1.2499E-02 +0.0475 +0.4143 |
The relative inertia explained by the first axis is 17.4%, the 2nd axis
explains 12.3%, the 3rd 7.0%, etc. (17.45% is not remarkably high for
relative inertia explained by the first axis, but as there are ORFs included
which are described as hypothetical there may be random noise present in the
data if they are not real).
The next two sections report position of each gene and codon on the trends.
label Axis1 Axis2 Axis3 Axis4
1_YCG9_Probable_____ 0.00904 0.13153 0.34028 -0.05372
2_YCG8________573_re 0.07429 -0.24652 -0.05502 -0.39837
3_ALPHA2________633_ 0.30675 0.04259 -0.22864 -0.03878
4_ALPHA1________528_ 0.16444 0.00399 -0.02000 0.00937
5_CHA1_________1083_ -0.00322 0.10387 0.07137 0.11896
this information is best viewed graphically, an example of the location of
the genes on the two principal axes can be seen here
http://www.molbiol.ox.ac.uk/cu/axes.gif.
Automatic Identification of Putative Optimal Codons
Codonw automatically tries to identify the optimal codons in your data, or
more precisely identify the codons which contribute to the major trend (if
the main trend is selection for translational optimality these should be the
optimal codons). It does this by comparing the codon usage of groups of
genes taken from each extreme of the principle trend (axis 1). It identifies
the set of genes with the highest bias (using the effective number of codons
index) and tests for significant differences in the codon usage of between
the higher bias set with a two way Chi-squared contingency test. The
putative optimal codons are listed in summary.coa and hilo.coa. It is the
responsibility of the user to confirm that the major codon usage trend is
selection for translational optimality, and not due to some other mutational
pressure (see GC variation). The number of genes included in the two groups
can be selected using the command line switch ( -coa_num ) as an absolute
number of genes, of a percentage of the total genes in the dataset (by
default 5%).
The analysis of this dataset identified 19 codons that appeared to be
optimal. 18 of these agree with optimal codon identified previously using a
larger dataset set of 575 genes [Sharp, 1991 #46]. The codon identified in
this analysis as being optimal but not in the previous analysis, was GCC;
this codon has been previously suggested as being an optimal codon in S.
cerevisiae [Bennetzen, 1982 #92]. The U ending codons, AUU, GUU and UGU,
which have been previously identified as optimal [Sharp, 1991 #46], where
not identified here at p<0.01; although UGU was identified as potentially
optimal with a p<0.02. The main reason that the U ending codons where not
identified from this dataset was their much higher usage in the lower biased
dataset.
Caveats
1) The codons identified by codonw, as being optimal will be dependent on
the strength of the trend and the size of the datasets.
2) The composition of the genes from chromosome III is quite different from
the 575-gene dataset used by Sharp and Cowe. Only one of the 30 genes they
considered to be highly expressed, and none of the genes they considered
lowly expressed are present in this dataset. The reader is reminded that
there are approximately 15,000 yeast genes, so just a little over 1% are
located on chromosome III.
Codonw generated personal choice of codons
On the assumption that the principle trend identified by codonw is selection
for translational optimality, and that the genes assigned to the highly bias
codon usage group are highly expressed, codonw outputs files with the
"optimal codons" and "CAI adaptation fitness values". These files are
fop.coa, cbi.coa and cai.coa, their filenames are related to the index they
have been formatted for. These files can be used to calculate the indices
in species where the preferred codon usage has not been hardwired into
codonW.
codonw input.dat -fop_file fop.coa
codonw input.day -cai_file cai.coa -cbi_file cbi.coa
Caveats
1) The original CAI paper calculated fitness values from experimentally
determined highly expressed genes. The fitness values that are internal
to codonW where derived from these criteria. CAI indices calculated using
fitness values derived from genes identified solely by COA, as being
highly expressed should not be regarded as true CAI values.
2) The optimal codons stored in the files cbi.coa and fop.coa where
identified by codonw using a statistical test of significance, this test
is dependent on sample size.
3) The size of the sample taken from the extremes of the axis will affect
the identified optimal codons.
4) The principle trend in the variation of codon usage may not be
translation optimality.
When we calculate the indexes CAI, CBI and Fop using the "codonw" generated
optimal codons and fitness values based on this small dataset, as we would
expect differ from when these indices are calculated using the codonw
internal codon usage information for S. cerevisiae. The internal values are
more accurate because the datasets used to generate them where larger, and
contained experimentally verified gene sequences.
Although the two sets of indices differ, they remain highly correlated, all
three indices have correlation coefficients greater than 0.96. Therefore if
comparisons between the index values are internally consistent (i.e. they
where both calculated using the same optimal codon information) relative
comparisons of codon usage and bias can be made. Based on a dataset of 111
genes we have been able to identify optimal codons, which give us some
insight into the codon usage of S. cerevisiae.
Axis2 is highly correlated with GC3s content
Alternative datasets could have been chosen that would present a much
simpler analyses of codon usage (i.e. where the optimal codons identified
better matched those previously published). This dataset was specifically
chosen as the codon usage variation for genes from this chromosome is know
to have a second trend, GC3s varies with chromosomal location in a
systematic fashion [Sharp, 1993 #39]. When we examine correlation
coefficients between the first 4 axes the correlation coefficient between
axis2 and GC3s is highly significant (r=0.89). Interestingly the bias is most
strong among the U ending codons it is possible that the presence of this
trend contributed to why the three U ending codons where not identified here
as optimal codons. This trend is quite strong accounting for 12.3% of the
relative inertia of the data, the principle trend (apparently selection for
translation optimality) accounted for 17.4%. We therefore see how it is
possible that the strongest influence on the choice of codon usage might not
be translation optimality but mutation biases.
Typical output from codonw -all_indices -nomenu
======================= Output ======================================
Genetic code is currently set to Universal Genetic code TGA=* TAA=* TAG=*
Welcome to CodonW 1.3 for Help type h
Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678)
w values to calculate CAI
Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678)
optimal codons to calculate CBI
Using Saccharomyces cerevisiae (Sharp and Cowe (1991) Yeast 7:657-678)
optimal codons to calculate Fop
..................................................................
Number of sequences: 111
Files used:
Input file was input.dat
Output file was input.out (codon usage indices, e.g. gc3s)
Output file was input.blk (bulk output e.g. raw codon usage)
CodonW has finished
======================================================
Tabulation of total codon usage
Phe UUU 1483 1.14 Ser UCU 1094 1.47 Tyr UAU 1000 1.12 Cys UGU 434 1.18
UUC 1117 0.86 UCC 773 1.04 UAC 789 0.88 UGC 303 0.82
Leu UUA 1349 1.55 UCA 882 1.19 TER UAA 47 1.27 TER UGA 36 0.97
UUG 1549 1.78 UCG 487 0.66 UAG 28 0.76 Trp UGG 665 1.00
CUU 698 0.80 Pro CCU 747 1.27 His CAU 677 1.15 Arg CGU 328 0.86
CUC 364 0.42 CCC 415 0.71 CAC 499 0.85 CGC 171 0.45
CUA 671 0.77 CCA 911 1.55 Gln CAA 1388 1.35 CGA 151 0.39
CUG 604 0.69 CCG 281 0.48 CAG 668 0.65 CGG 103 0.27
Ile AUU 1612 1.35 Thr ACU 1052 1.38 Asn AAU 1778 1.17 Ser AGU 717 0.97
AUC 1018 0.85 ACC 660 0.87 AAC 1262 0.83 AGC 500 0.67
AUA 943 0.79 ACA 883 1.16 Lys AAA 2118 1.13 Arg AGA 1038 2.71
Met AUG 1156 1.00 ACG 444 0.58 AAG 1645 0.87 AGG 504 1.32
Val GUU 1184 1.49 Ala GCU 1055 1.40 Asp GAU 1905 1.25 Gly GGU 1284 1.87
GUC 674 0.85 GCC 765 1.01 GAC 1145 0.75 GGC 552 0.80
GUA 622 0.78 GCA 836 1.11 Glu GAA 2371 1.41 GGA 557 0.81
GUG 690 0.87 GCG 368 0.49 GAG 995 0.59 GGG 355 0.52
53400 codons (used Universal Genetic code)
======================================================
|