1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
|
/* This is an example HY-PHY Batch File.
It reads in a '#' nucleotide dataset data/hiv.nuc and estimates
maximum ln-likelihood based on the tree contained in the data file,
using the general non-reversible 12 parameter (with 3 constraints)
model. The tree from the data file is unrooted.
Output is printed out as a Newick Style tree with branch lengths
representing the number of expected substitutions per branch (which
is the default setting for nucleotide models w/o rate variation).
Sergei L. Kosakovsky Pond and Spencer V. Muse
November 2002.
*/
/* 1. Read in the data and store the result in a DataSet variable.*/
DataSet nucleotideSequences = ReadDataFile ("data/hiv.nuc");
/* 2. Filter the data, specifying that all of the data is to be used
and that it is to be treated as nucleotides.*/
DataSetFilter filteredData = CreateFilter (nucleotideSequences,1);
/* 3. Collect observed nucleotide frequencies from the filtered data. observedFreqs will
store the vector of frequencies. */
HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);
/* 4. Define the 12 parameter non-reversible substitution matrix.
The constraints on some of the substitution rates are necessary to
ensure that observed nucleotide frequencies are equilibrium
frequencies for the model.
'*' is defined to be -(sum of off-diag row elements) */
/* frequency ratios */
r0 = observedFreqs[0]/observedFreqs[3];
r1 = observedFreqs[1]/observedFreqs[3];
r2 = observedFreqs[2]/observedFreqs[3];
/* All the global rate parameters are defined relative to
the rate for A->C. For instance, CG represents the ratio
of the rates C->G/A->C. */
global AG;
global AT;
global CA;
global CG;
global CT;
global GA;
global GC;
global GT;
/* note that these constraints are
satisfied if we restrict the model to the
general reversible case */
global TA:=AT+(1-CA)*r1__+(AG-GA)*r2__;
global TC:=CT+(CA-1)*r0__+(CG-GC)*r2__;
global TG:=GT+(GA-AG)*r0__+(GC-CG)*r1__;
NRRateMatrix =
{{*,t,t*AG,t*AT}
{t*CA,*,t*CG,t*CT}
{t*GA,t*GC,*,t*GT}
{t*TA,t*TC,t*TG,*}};
/*5. Define the Non-Rev model, by combining the substitution matrix with the vector of observed (equilibrium)
frequencies. */
Model NRM = (NRRateMatrix, observedFreqs);
/*6. Now we can define the tree variable, using the tree string read from the data file,
and, by default, assigning the last defined model (NRM) to all tree branches. */
Tree givenTree = DATAFILE_TREE;
/*7. Since all the likelihood function ingredients (data, tree, equilibrium frequencies)
have been defined we are ready to construct the likelihood function. */
LikelihoodFunction theLnLik = (filteredData, givenTree);
/*8. Maximize the likelihood function, storing parameter values in the matrix paramValues */
Optimize (paramValues, theLnLik);
/*9. Print the tree with optimal branch lengths to the console. */
fprintf (stdout, theLnLik);
|