1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
|
******************
Community analysis
******************
alpha diversity
===============
Phylogenetic Diversity (PD)
---------------------------
For each environment (i.e. sample), calculates the amount of branch length in a phylogenetic tree that lead to its sequences.
First we will load in a Newick formatted tree.
.. doctest::
>>> from cogent.parse.tree import DndParser
>>> from cogent.maths.unifrac.fast_tree import UniFracTreeNode
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> tree = DndParser(tree_in, UniFracTreeNode)
Next we will load information on which sequences in the tree come from which environment.
.. doctest::
>>> from cogent.maths.unifrac.fast_unifrac import count_envs
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> envs = count_envs(envs_in)
Finally, we can calculate the PD values for each environment in the tree
.. doctest::
>>> from cogent.maths.unifrac.fast_unifrac import PD_whole_tree
>>> envs, PD_values = PD_whole_tree(tree, envs)
>>> print envs
['E_FL', 'E_PA', 'O_FL', 'O_UN', 'R_FL', 'R_PA']
>>> print PD_values#doctest: +SKIP
[ 5.85389 7.60352 2.19215 2.81821 3.93728 3.7534 ]
``PD_vals`` is a ``numpy`` ``array`` with the values representing each environment in envs.
Rarefaction
-------------
*To be written.*
Parametric methods
------------------
*To be written.*
Nonparametric methods
---------------------
*To be written.*
beta diversity
==============
Unifrac
-------
The Fast UniFrac implementation in PyCogent is the source code for the `Fast UniFrac web tool <http://bmf2.colorado.edu/fastunifrac>`_ and the `QIIME pipeline <http://qiime.sourceforge.net>`_ for Microbial community analysis.
Calculate a UniFrac Distance Matrix and apply PCoA and UPGMA
------------------------------------------------------------
The UniFrac analysis is run on open tree and environment file objects. The resulting dictionary has a distance matrix of pairwise UniFrac values ('distance_matrix'), a Newick string representing the results of performing UPGMA clustering on this distance matrix ('cluster_envs') and the results of running Principal Coordinates Analysis on the distance matrix ('pcoa'). One can specify weighted UniFrac with ``weighted=True``. Here we run an unweighted analysis.
.. doctest::
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_file(tree_in, envs_in, weighted=False)
>>> print result['cluster_envs']#doctest: +SKIP
((((('E_FL':0.339607103063,'R_FL':0.339607103063):0.0279991540511,'R_PA':0.367606257114):0.0103026524101,'E_PA':0.377908909524):0.0223322024492,'O_UN':0.400241111973):0.00976759866402,'O_FL':0.410008710637);
>>> print result['pcoa']#doctest: +SKIP
=================================================================================================
Type Label vec_num-0 vec_num-1 vec_num-2 vec_num-3 vec_num-4 vec_num-5
-------------------------------------------------------------------------------------------------
Eigenvectors E_FL 0.05 0.22 -0.09 -0.26 -0.29 -0.00
Eigenvectors E_PA -0.36 0.24 0.21 -0.08 0.18 -0.00
Eigenvectors O_FL 0.32 -0.26 0.30 -0.13 0.05 -0.00
Eigenvectors O_UN -0.28 -0.40 -0.24 -0.04 0.01 -0.00
Eigenvectors R_FL 0.29 0.18 -0.28 0.09 0.22 -0.00
Eigenvectors R_PA -0.02 0.02 0.11 0.42 -0.17 -0.00
Eigenvalues eigenvalues 0.40 0.36 0.29 0.27 0.19 -0.00
Eigenvalues var explained (%) 26.34 23.84 19.06 18.02 12.74 -0.00
-------------------------------------------------------------------------------------------------
Perform pairwise tests of whether samples are significantly different with UniFrac
----------------------------------------------------------------------------------
The analysis is run on open tree and environment file objects. In this example, we use unweighted unifrac (``weighted=False``), we permute the environment assignments on the tree 50 times (``num_iters=50``) and we perform UniFrac on all pairs of environments (``test_on="Pairwise"``). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.
.. doctest::
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=False, num_iters=50, test_on="Pairwise")
>>> print result[0]#doctest: +SKIP
('E_FL', 'E_PA', 0.17999999999999999, 1.0)
Perform a single UniFrac significance test on the whole tree
------------------------------------------------------------
The analysis is run on open tree and environment file objects. In this example, we use weighted unifrac (``weighted=True``), we permute the environment assignments on the tree 50 times (``num_iters=50``) and we perform a unifrac significance test on the whole tree (``test_on="Tree"``). The resulting list has only one item since a single test was performed. It is a 3 item tuple where the second and third values are the p-value.
.. doctest::
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=True, num_iters=50, test_on="Tree")
>>> print result#doctest: +SKIP
[('whole tree', 0.56000000000000005, 0.56000000000000005)]
P-test
-------
Perform pairwise tests of whether samples are significantly different with the P-test (Martin, 2002)
----------------------------------------------------------------------------------------------------
The analysis is run on open tree and environment file objects. In this example, we permute the environment assignments on the tree 50 times (``num_iters=50``) and perform the p test for all pairs of environments (``test_on="Pairwise"``). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.
.. doctest::
>>> from cogent.maths.unifrac.fast_unifrac import fast_p_test_file
>>> tree_in = open("data/Crump_example_tree_newick.txt")
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
>>> result = fast_p_test_file(tree_in, envs_in, num_iters=50, test_on="Pairwise")
>>> print result[0]#doctest: +SKIP
('E_FL', 'E_PA', 0.040000000000000001, 0.59999999999999998)
Taxon-based
-----------
Computing a distance matrix between samples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PyCogent provides many different ways to compute pairwise distances between objects. ``cogent/maths/distance_transform.py`` provides a set of functions to calculate dissimilarities/distances between samples, given an abundance matrix. Here is one example:
.. doctest::
>>> from cogent.maths.distance_transform import dist_euclidean
>>> from numpy import array
>>> abundance_data = array([[1, 3],
... [5, 2],
... [0.1, 22]],'float')
.. note:: see ``distance_transform.py`` for other metrics than euclidean
We now have 3 samples, and the abundance of each column (e.g.: species) in that sample. The first sample has 1 individual of species 1, 3 individuals of species 2. We now compute the relatedness between these samples, using euclidean distance between the rows:
.. doctest::
>>> dists = dist_euclidean(abundance_data)
>>> print str(dists.round(2)) # doctest: +SKIP
[[ 0. , 4.12, 19.02]
[ 4.12, 0. , 20.59 ]
[ 19.02, 20.59 , 0. ]]
this distance matrix can be visualized via multivariate reduction techniques such as :ref:`multivariate-analysis`.
Taxonomy
========
*To be written.*
.. need to decide on methods here
|