File: multivariate_data_analysis.rst

package info (click to toggle)
python-cogent 1.5.1-2
  • links: PTS, VCS
  • area: main
  • in suites: wheezy
  • size: 15,348 kB
  • sloc: python: 127,116; makefile: 85; ansic: 17; sh: 9
file content (328 lines) | stat: -rw-r--r-- 12,991 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
.. _multivariate-analysis:

**************************
Multivariate data analysis
**************************

.. sectionauthor Justin Kuczynski, Catherine Lozupone, Andreas Wilm

Principal Coordinates Analysis
==============================

Principal Coordinates Analysis works on a matrix of pairwise distances. In this example we start by calculating the pairwise distances for a set of aligned sequences, though any distance matrix can be used with PCoA, relating any objects, not only sequences.

.. doctest::

    >>> from cogent import LoadSeqs
    >>> from cogent.phylo import distance
    >>> from cogent.cluster.metric_scaling import PCoA

Import a substitution model (or create your own).

.. doctest::

    >>> from cogent.evolve.models import HKY85

Load the alignment.

.. doctest::

    >>> al = LoadSeqs("data/test.paml")

Create a pairwise distances object calculator for the alignment, providing a substitution model instance.

.. doctest::

    >>> d = distance.EstimateDistances(al, submodel= HKY85())
    >>> d.run(show_progress=False)

Now use this matrix to perform principal coordinates analysis.

.. doctest::

    >>> PCoA_result = PCoA(d.getPairwiseDistances())
    >>> print PCoA_result # doctest: +SKIP
    ======================================================================================
            Type              Label  vec_num-0  vec_num-1  vec_num-2  vec_num-3  vec_num-4
    --------------------------------------------------------------------------------------
    Eigenvectors          NineBande      -0.02       0.01       0.04       0.01       0.00
    Eigenvectors           DogFaced      -0.04      -0.06      -0.01       0.00       0.00
    Eigenvectors          HowlerMon      -0.07       0.01       0.01      -0.02       0.00
    Eigenvectors              Mouse       0.20       0.01      -0.01      -0.00       0.00
    Eigenvectors              Human      -0.07       0.04      -0.03       0.01       0.00
     Eigenvalues        eigenvalues       0.05       0.01       0.00       0.00      -0.00
     Eigenvalues  var explained (%)      85.71       9.60       3.73       0.95      -0.00
    --------------------------------------------------------------------------------------

We can save these results to a file in a delimited format (we'll use tab here) that can be opened up in any data analysis program, like R or Excel. Here the principal coordinates can be plotted against each other for visualization.

.. doctest::

    >>> PCoA_result.writeToFile('PCoA_results.txt',sep='\t')


Fast-MDS
========

The eigendecomposition step in Principal Coordinates Analysis (PCoA)
doesn't scale very well. And for thousands of objects the computation
of all pairwise distance alone can get very slow, because it scales
quadratically. For a huge number of objects this might even pose a
memory problem. Fast-MDS methods approximate an MDS/PCoA solution and
do not suffer from these problems.

First, let's simulate a big data sample by creating 1500 objects living
in 10 dimension. Then compute their pairwise distances and perform a
principal coordinates analysis on it. Note that the last two steps might take
already a couple of minutes.

.. doctest::

    >>> from cogent.maths.distance_transform import dist_euclidean
    >>> from cogent.cluster.metric_scaling import principal_coordinates_analysis
    >>> from numpy import random
    >>> objs = random.random((1500, 10))
    >>> distmtx = dist_euclidean(objs)
    >>> full_pcoa = principal_coordinates_analysis(distmtx)


PyCogent implements two fast MDS approximations called
Split-and-Combine MDS (SCMDS, still in development) and Nystrom (also known as
Landmark-MDS). Both can easily handle many thousands objects. One
reason is that they don't require all distances to be computed.
Instead you pass down the distance function and only required
distances are calculated.

Nystrom works by using a so called seed-matrix, which contains (only) k by
n distances, where n is the total number of objects and k<<n. The
bigger k, the more exact the approximation will be and the longer the
computation will take. One further difference to normal Principal
Coordinates Analysis is, that no eigenvalues, but only approximate
eigenvectors of length dim will be returned.

.. doctest::

   >>> from cogent.cluster.approximate_mds import nystrom
   >>> from random import sample
   >>> from numpy import array
   >>> n_seeds = 100
   >>> seeds = array(sample(distmtx,n_seeds))
   >>> dims = 3
   >>> nystrom_3d = nystrom(seeds, dims)
   
A good rule of thumb for picking n_seeds is log(n), log(n)**2 or
sqrt(n).


SCMDS works by dividing the pairwise distance matrix into chunks of
certain size and overlap. MDS is performed on each chunk individually
and the resulting solutions are progressively joined. As in the case
of Nystrom not all distances will be computed, but only those of the
overlapping tiles. The size and overlap of the tiles determine the
quality of the approximation as well as the run-time.

.. doctest::

   >>> from cogent.cluster.approximate_mds import CombineMds, cmds_tzeng
   >>> combine_mds = CombineMds()
   >>> tile_overlap = 100
   >>> dims = 3
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[0:500,0:500], dims)
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[400:900,400:900], dims)
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[800:1300,800:1300], dims)
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[1200:1500,1200:1500], dims)
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
   >>> combien_mds_3d = combine_mds.getFinalMDS()

If you want to know how good the returned approximations are, you will
have to perform principal_coordinates_analysis() on a smallish
submatrix and perform a goodness_of_fit analysis.



NMDS
====

NMDS (Non-metric MultiDimensional Scaling) works on a matrix of pairwise distances. In this example, we generate a matrix based on the euclidean distances of an abundance matrix.

.. doctest::

    >>> from cogent.cluster.nmds import NMDS
    >>> from cogent.maths.distance_transform import dist_euclidean
    >>> from numpy import array

We start with an abundance matrix, samples (rows) by sequences/species (cols)

.. doctest::

    >>> abundance = array(
    ...        [[7,1,0,0,0,0,0,0,0],
    ...        [4,2,0,0,0,1,0,0,0],
    ...        [2,4,0,0,0,1,0,0,0],
    ...        [1,7,0,0,0,0,0,0,0],
    ...        [0,8,0,0,0,0,0,0,0],
    ...        [0,7,1,0,0,0,0,0,0],#idx 5
    ...        [0,4,2,0,0,0,2,0,0],
    ...        [0,2,4,0,0,0,1,0,0],
    ...        [0,1,7,0,0,0,0,0,0],
    ...        [0,0,8,0,0,0,0,0,0],
    ...        [0,0,7,1,0,0,0,0,0],#idx 10
    ...        [0,0,4,2,0,0,0,3,0],
    ...        [0,0,2,4,0,0,0,1,0],
    ...        [0,0,1,7,0,0,0,0,0],
    ...        [0,0,0,8,0,0,0,0,0],
    ...        [0,0,0,7,1,0,0,0,0],#idx 15
    ...        [0,0,0,4,2,0,0,0,4],
    ...        [0,0,0,2,4,0,0,0,1],
    ...        [0,0,0,1,7,0,0,0,0]], 'float')

Then compute a distance matrix using euclidean distance, and perform nmds on that matrix

.. doctest::

    >>> euc_distmtx = dist_euclidean(abundance)
    >>> nm = NMDS(euc_distmtx, verbosity=0)

The NMDS object provides a list of points, which can be plotted if desired

.. doctest::

    >>> pts = nm.getPoints()
    >>> stress = nm.getStress()

With matplotlib installed, we could then do ``plt.plot(pts[:,0], pts[:,1])``

Hierarchical clustering (UPGMA, NJ)
===================================

Hierarchical clustering techniques work on a matrix of pairwise distances. In this case, we use the distance matrix from the NMDS example, relating samples of species to one another using UPGMA (NJ below).

.. note:: UPGMA should not be used for phylogenetic reconstruction.

.. doctest::

    >>> from cogent.cluster.UPGMA import upgma

we start with the distance matrix and list of sample names:

.. doctest::

    >>> sample_names = ['sample'+str(i) for i in range(len(euc_distmtx))]

make 2d dict:

.. doctest::

    >>> euc_distdict = {}
    >>> for i in range(len(sample_names)):
    ...    for j in range(len(sample_names)):
    ...        euc_distdict[(sample_names[i],sample_names[j])]=euc_distmtx[i,j]

e.g.: ``euc_distdict[('sample6', 'sample5')] == 3.7416573867739413``

Now use this matrix to build a UPGMA cluster.

.. doctest::

    >>> mycluster = upgma(euc_distdict)
    >>> print mycluster.asciiArt()
                                                      /-sample10
                                            /edge.3--|
                                  /edge.2--|          \-sample8
                                 |         |
                                 |          \-sample9
                        /edge.1--|
                       |         |                    /-sample12
                       |         |          /edge.5--|
                       |         |         |          \-sample11
                       |          \edge.4--|
                       |                   |          /-sample6
                       |                    \edge.6--|
              /edge.0--|                              \-sample7
             |         |
             |         |                                        /-sample15
             |         |                              /edge.10-|
             |         |                    /edge.9--|          \-sample14
             |         |                   |         |
             |         |          /edge.8--|          \-sample13
             |         |         |         |
             |          \edge.7--|          \-sample16
    -root----|                   |
             |                   |          /-sample17
             |                    \edge.11-|
             |                              \-sample18
             |
             |                              /-sample5
             |                    /edge.14-|
             |          /edge.13-|          \-sample4
             |         |         |
             |         |          \-sample3
              \edge.12-|
                       |                    /-sample2
                       |          /edge.16-|
                        \edge.15-|          \-sample1
                                 |
                                  \-sample0

We demonstrate saving this UPGMA cluster to a file.

.. doctest::

    >>> mycluster.writeToFile('test_upgma.tree')

..
    We don't actually want to keep that file now, so I'm importing the ``os`` module to delete it.

.. doctest::
    :hide:

    >>> import os
    >>> os.remove('test_upgma.tree')

We can use neighbor joining (NJ) instead of UPGMA:

.. doctest::

    >>> from cogent.phylo.nj import nj
    >>> njtree = nj(euc_distdict)
    >>> print njtree.asciiArt()
              /-sample16
             |
             |                    /-sample12
             |          /edge.2--|
             |         |         |          /-sample13
             |         |          \edge.1--|
             |         |                   |          /-sample14
             |         |                    \edge.0--|
             |         |                              \-sample15
             |         |
             |         |                              /-sample7
             |-edge.14-|                    /edge.5--|
             |         |                   |         |          /-sample8
             |         |                   |          \edge.4--|
             |         |          /edge.6--|                   |          /-sample10
             |         |         |         |                    \edge.3--|
             |         |         |         |                              \-sample9
    -root----|         |         |         |
             |         |         |          \-sample11
             |         |         |
             |          \edge.13-|                    /-sample6
             |                   |                   |
             |                   |                   |                              /-sample4
             |                   |          /edge.10-|                    /edge.7--|
             |                   |         |         |          /edge.8--|          \-sample3
             |                   |         |         |         |         |
             |                   |         |          \edge.9--|          \-sample5
             |                    \edge.12-|                   |
             |                             |                    \-sample2
             |                             |
             |                             |          /-sample0
             |                              \edge.11-|
             |                                        \-sample1
             |
             |          /-sample18
              \edge.15-|
                        \-sample17