.. _multivariate-analysis:
**************************
Multivariate data analysis
**************************
.. sectionauthor:: Justin Kuczynski, Catherine Lozupone, Andreas Wilm
Principal Coordinates Analysis
==============================
Principal Coordinates Analysis works on a matrix of pairwise distances. In this example we start by calculating the pairwise distances for a set of aligned sequences, although any distance matrix, relating any kind of objects (not just sequences), can be used with PCoA.
.. doctest::
>>> from cogent import LoadSeqs
>>> from cogent.phylo import distance
>>> from cogent.cluster.metric_scaling import PCoA
Import a substitution model (or create your own).
.. doctest::
>>> from cogent.evolve.models import HKY85
Load the alignment.
.. doctest::
>>> al = LoadSeqs("data/test.paml")
Create a pairwise distance calculator for the alignment, providing a substitution model instance.
.. doctest::
>>> d = distance.EstimateDistances(al, submodel=HKY85())
>>> d.run(show_progress=False)
Now use this matrix to perform principal coordinates analysis.
.. doctest::
>>> PCoA_result = PCoA(d.getPairwiseDistances())
>>> print PCoA_result # doctest: +SKIP
======================================================================================
Type Label vec_num-0 vec_num-1 vec_num-2 vec_num-3 vec_num-4
--------------------------------------------------------------------------------------
Eigenvectors NineBande -0.02 0.01 0.04 0.01 0.00
Eigenvectors DogFaced -0.04 -0.06 -0.01 0.00 0.00
Eigenvectors HowlerMon -0.07 0.01 0.01 -0.02 0.00
Eigenvectors Mouse 0.20 0.01 -0.01 -0.00 0.00
Eigenvectors Human -0.07 0.04 -0.03 0.01 0.00
Eigenvalues eigenvalues 0.05 0.01 0.00 0.00 -0.00
Eigenvalues var explained (%) 85.71 9.60 3.73 0.95 -0.00
--------------------------------------------------------------------------------------
We can save these results to a file in a delimited format (we'll use tab here) that can be opened in any data analysis program, such as R or Excel. There the principal coordinates can be plotted against each other for visualization.
.. doctest::
>>> PCoA_result.writeToFile('PCoA_results.txt',sep='\t')
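If you would rather stay in Python, the exported file can be read back with the standard ``csv`` module. This is only a sketch: inspect the header row to see which columns hold the coordinates, since the exact layout follows the table shown above.
.. doctest::
>>> import csv # doctest: +SKIP
>>> rows = list(csv.reader(open('PCoA_results.txt'), delimiter='\t')) # doctest: +SKIP
>>> header, data = rows[0], rows[1:] # doctest: +SKIP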
Fast-MDS
========
The eigendecomposition step in Principal Coordinates Analysis (PCoA)
does not scale well. For thousands of objects, even computing all
pairwise distances becomes slow, because their number grows
quadratically with the number of objects, and for very large data sets
the full distance matrix may no longer fit in memory. Fast-MDS methods
approximate an MDS/PCoA solution and do not suffer from these problems.
First, let's simulate a large data set by creating 1500 objects living
in 10 dimensions, then compute their pairwise distances and perform a
principal coordinates analysis on them. Note that the last two steps may
already take a couple of minutes.
.. doctest::
>>> from cogent.maths.distance_transform import dist_euclidean
>>> from cogent.cluster.metric_scaling import principal_coordinates_analysis
>>> from numpy import random
>>> objs = random.random((1500, 10))
>>> distmtx = dist_euclidean(objs)
>>> full_pcoa = principal_coordinates_analysis(distmtx)
PyCogent implements two fast MDS approximations called
Split-and-Combine MDS (SCMDS, still in development) and Nystrom (also known as
Landmark-MDS). Both can easily handle many thousands of objects. One
reason is that they do not require all pairwise distances to be computed:
you pass in the distance function and only the required distances are
calculated.
Nystrom works with a so-called seed matrix, which contains only k by
n distances, where n is the total number of objects and k << n. The
bigger k, the more exact the approximation and the longer the
computation will take. A further difference from normal Principal
Coordinates Analysis is that no eigenvalues are returned, only
approximate eigenvectors of length dim.
.. doctest::
>>> from cogent.cluster.approximate_mds import nystrom
>>> from random import sample
>>> from numpy import array
>>> n_seeds = 100
>>> seeds = array(sample(distmtx,n_seeds))
>>> dims = 3
>>> nystrom_3d = nystrom(seeds, dims)
A good rule of thumb for picking n_seeds is log(n), log(n)**2 or
sqrt(n).
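For example, for the 1500 objects simulated above, these rules of thumb (taking log as the natural logarithm and rounding down) suggest the following seed counts; the example above uses a more generous 100.
.. doctest::
>>> from numpy import log, sqrt
>>> n = 1500
>>> int(log(n)), int(log(n) ** 2), int(sqrt(n))
(7, 53, 38)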
SCMDS works by dividing the pairwise distance matrix into tiles of a
certain size and overlap. MDS is performed on each tile individually
and the resulting solutions are progressively joined. As with Nystrom,
not all distances are computed, only those within the overlapping
tiles. The size and overlap of the tiles determine the quality of the
approximation as well as the run time.
.. doctest::
>>> from cogent.cluster.approximate_mds import CombineMds, cmds_tzeng
>>> combine_mds = CombineMds()
>>> tile_overlap = 100
>>> dims = 3
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[0:500,0:500], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[400:900,400:900], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[800:1300,800:1300], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[1200:1500,1200:1500], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> combine_mds_3d = combine_mds.getFinalMDS()
To assess how good the returned approximations are, perform
principal_coordinates_analysis() on a smallish submatrix and compare
the two solutions with a goodness_of_fit analysis.
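As a minimal, hand-rolled sketch of such a check (not a PyCogent function; ``approx_coords`` is a placeholder for whichever approximate configuration you want to assess, with rows in the same order as ``distmtx``), one can recompute Euclidean distances from the approximate coordinates and summarise the discrepancy with a Kruskal-style stress value:
.. doctest::
>>> from numpy import sqrt # doctest: +SKIP
>>> approx_dists = dist_euclidean(approx_coords) # doctest: +SKIP
>>> stress = sqrt(((distmtx - approx_dists) ** 2).sum() / (distmtx ** 2).sum()) # doctest: +SKIP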
NMDS
====
NMDS (Non-metric MultiDimensional Scaling) works on a matrix of pairwise distances. In this example, we generate a matrix based on the euclidean distances of an abundance matrix.
.. doctest::
>>> from cogent.cluster.nmds import NMDS
>>> from cogent.maths.distance_transform import dist_euclidean
>>> from numpy import array
We start with an abundance matrix, samples (rows) by sequences/species (cols)
.. doctest::
>>> abundance = array(
... [[7,1,0,0,0,0,0,0,0],
... [4,2,0,0,0,1,0,0,0],
... [2,4,0,0,0,1,0,0,0],
... [1,7,0,0,0,0,0,0,0],
... [0,8,0,0,0,0,0,0,0],
... [0,7,1,0,0,0,0,0,0],#idx 5
... [0,4,2,0,0,0,2,0,0],
... [0,2,4,0,0,0,1,0,0],
... [0,1,7,0,0,0,0,0,0],
... [0,0,8,0,0,0,0,0,0],
... [0,0,7,1,0,0,0,0,0],#idx 10
... [0,0,4,2,0,0,0,3,0],
... [0,0,2,4,0,0,0,1,0],
... [0,0,1,7,0,0,0,0,0],
... [0,0,0,8,0,0,0,0,0],
... [0,0,0,7,1,0,0,0,0],#idx 15
... [0,0,0,4,2,0,0,0,4],
... [0,0,0,2,4,0,0,0,1],
... [0,0,0,1,7,0,0,0,0]], 'float')
Then compute a distance matrix using Euclidean distance, and perform NMDS on that matrix.
.. doctest::
>>> euc_distmtx = dist_euclidean(abundance)
>>> nm = NMDS(euc_distmtx, verbosity=0)
The NMDS object provides the fitted point coordinates, which can be plotted if desired, along with the final stress value.
.. doctest::
>>> pts = nm.getPoints()
>>> stress = nm.getStress()
With matplotlib installed, the first two dimensions can then be plotted against each other, as sketched below.
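A minimal plotting sketch, assuming matplotlib is available (skipped in the doctests):
.. doctest::
>>> import matplotlib.pyplot as plt # doctest: +SKIP
>>> plt.scatter(pts[:, 0], pts[:, 1]) # doctest: +SKIP
>>> plt.xlabel('NMDS dimension 1') # doctest: +SKIP
>>> plt.ylabel('NMDS dimension 2') # doctest: +SKIP
>>> plt.show() # doctest: +SKIP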
Hierarchical clustering (UPGMA, NJ)
===================================
Hierarchical clustering techniques work on a matrix of pairwise distances. In this case, we reuse the distance matrix from the NMDS example, relating the abundance samples to one another, first with UPGMA (NJ is shown below).
.. note:: UPGMA should not be used for phylogenetic reconstruction.
.. doctest::
>>> from cogent.cluster.UPGMA import upgma
We start with the distance matrix and a list of sample names:
.. doctest::
>>> sample_names = ['sample'+str(i) for i in range(len(euc_distmtx))]
Make a 2D dictionary keyed by pairs of sample names:
.. doctest::
>>> euc_distdict = {}
>>> for i in range(len(sample_names)):
... for j in range(len(sample_names)):
... euc_distdict[(sample_names[i],sample_names[j])]=euc_distmtx[i,j]
e.g.: ``euc_distdict[('sample6', 'sample5')] == 3.7416573867739413``
Now use this matrix to build a UPGMA cluster.
.. doctest::
>>> mycluster = upgma(euc_distdict)
>>> print mycluster.asciiArt()
/-sample10
/edge.3--|
/edge.2--| \-sample8
| |
| \-sample9
/edge.1--|
| | /-sample12
| | /edge.5--|
| | | \-sample11
| \edge.4--|
| | /-sample6
| \edge.6--|
/edge.0--| \-sample7
| |
| | /-sample15
| | /edge.10-|
| | /edge.9--| \-sample14
| | | |
| | /edge.8--| \-sample13
| | | |
| \edge.7--| \-sample16
-root----| |
| | /-sample17
| \edge.11-|
| \-sample18
|
| /-sample5
| /edge.14-|
| /edge.13-| \-sample4
| | |
| | \-sample3
\edge.12-|
| /-sample2
| /edge.16-|
\edge.15-| \-sample1
|
\-sample0
We demonstrate saving this UPGMA cluster to a file.
.. doctest::
>>> mycluster.writeToFile('test_upgma.tree')
..
We don't actually want to keep that file, so we import the ``os`` module to delete it.
.. doctest::
:hide:
>>> import os
>>> os.remove('test_upgma.tree')
We can use neighbor joining (NJ) instead of UPGMA:
.. doctest::
>>> from cogent.phylo.nj import nj
>>> njtree = nj(euc_distdict)
>>> print njtree.asciiArt()
/-sample16
|
| /-sample12
| /edge.2--|
| | | /-sample13
| | \edge.1--|
| | | /-sample14
| | \edge.0--|
| | \-sample15
| |
| | /-sample7
|-edge.14-| /edge.5--|
| | | | /-sample8
| | | \edge.4--|
| | /edge.6--| | /-sample10
| | | | \edge.3--|
| | | | \-sample9
-root----| | | |
| | | \-sample11
| | |
| \edge.13-| /-sample6
| | |
| | | /-sample4
| | /edge.10-| /edge.7--|
| | | | /edge.8--| \-sample3
| | | | | |
| | | \edge.9--| \-sample5
| \edge.12-| |
| | \-sample2
| |
| | /-sample0
| \edge.11-|
| \-sample1
|
| /-sample18
\edge.15-|
\-sample17
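The NJ tree supports the same ``writeToFile`` call demonstrated for the UPGMA cluster above.
.. doctest::
>>> njtree.writeToFile('test_nj.tree')
..
Again, we don't want to keep that file, so we delete it.
.. doctest::
:hide:
>>> import os
>>> os.remove('test_nj.tree')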