Concatenation
=============
With :func:`~anndata.concat`, :class:`~anndata.AnnData` objects can be combined via a composition of two operations: concatenation and merging.
* Concatenation keeps all sub-elements of each object, stacking these elements along the axis of concatenation in an ordered way.
* Merging combines a set of collections into one resulting collection containing elements from each of the objects.
.. note::
   This function borrows from similar functions in pandas_ and xarray_. Arguments used to control concatenation are modeled after :func:`pandas.concat`, while strategies for merging are inspired by the `compat` argument of :func:`xarray.merge`.
.. _pandas: https://pandas.pydata.org
.. _xarray: http://xarray.pydata.org
Concatenation
-------------
Let's start off with an example:
>>> import scanpy as sc, anndata as ad, numpy as np, pandas as pd
>>> from scipy import sparse
>>> from anndata import AnnData
>>> pbmc = sc.datasets.pbmc68k_reduced()
>>> pbmc
AnnData object with n_obs × n_vars = 700 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
If we split this object up by clusters of observations, then stack those subsets, we obtain the same values – just ordered differently.
>>> groups = pbmc.obs.groupby("louvain", observed=True).indices
>>> pbmc_concat = ad.concat([pbmc[inds] for inds in groups.values()], merge="same")
>>> assert np.array_equal(pbmc.X, pbmc_concat[pbmc.obs_names].X)
>>> pbmc_concat
AnnData object with n_obs × n_vars = 700 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
Note that we concatenated along the observations by default, and that most elements aligned to the observations were concatenated as well.
A notable exception is :attr:`~anndata.AnnData.obsp`, which can be re-enabled with the `pairwise` keyword argument.
This is because it's not obvious that combining graphs or distance matrices padded with 0s is particularly useful, and may be unintuitive.
Inner and outer joins
~~~~~~~~~~~~~~~~~~~~~
When the variables present in the objects to be concatenated aren't exactly the same, you can choose to take either the intersection or union of these variables.
This is otherwise called taking the `"inner"` (intersection) or `"outer"` (union) join.
For example, given two anndata objects with differing variables:
>>> a = AnnData(sparse.eye(3, format="csr"), var=pd.DataFrame(index=list("abc")))
>>> b = AnnData(sparse.eye(2, format="csr"), var=pd.DataFrame(index=list("ba")))
>>> ad.concat([a, b], join="inner").X.toarray()
array([[1., 0.],
[0., 1.],
[0., 0.],
[0., 1.],
[1., 0.]])
>>> ad.concat([a, b], join="outer").X.toarray()
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.]])
The `join` argument applies to any element which has both (1) an axis being concatenated and (2) an axis not being concatenated.
When concatenating along the `obs` dimension, this means elements of `.X`, `.obs`, `.layers`, and `.obsm` will be affected by the choice of `join`.
To demonstrate this, let's say we're trying to combine a droplet based experiment with a spatial one.
When building a joint anndata object, we would still like to store the coordinates for the spatial samples.
>>> spatial = AnnData(
... sparse.random(5000, 10000, format="csr"),
... obsm={"coords": np.random.randn(5000, 2)}
... )
>>> droplet = AnnData(sparse.random(5000, 10000, format="csr"))
>>> combined = ad.concat([spatial, droplet], join="outer")
>>> sc.pl.embedding(combined, "coords") # doctest: +SKIP
.. TODO: Get the above plot to show up
Annotating data source (`label`, `keys`, and `index_unique`)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often, you'd like to be able to tell which values came from which object.
This can be accomplished with the `label`, `keys`, and `index_unique` keyword arguments.
For an example, we'll show how you can keep track of the original dataset by passing a `Mapping` of dataset names to `AnnData` objects to `concat`:
>>> adatas = {
... "a": ad.AnnData(
... sparse.random(3, 50, format="csr", density=0.1),
... obs=pd.DataFrame(index=[f"a-{i}" for i in range(3)])
... ),
... "b": ad.AnnData(
... sparse.random(5, 50, format="csr", density=0.1),
... obs=pd.DataFrame(index=[f"b-{i}" for i in range(5)])
... ),
... }
>>> ad.concat(adatas, label="dataset").obs
dataset
a-0 a
a-1 a
a-2 a
b-0 b
b-1 b
b-2 b
b-3 b
b-4 b
Here, a categorical column (with the name specified by `label`) was added to the result.
As an alternative to passing a `Mapping`, you can also specify dataset names with the `keys` argument.
In some cases, your objects may share names along the axes being concatenated.
These values can be made unique by appending the relevant key using the `index_unique` argument:
.. TODO: skipping example since doctest does not capture stderr, but it's relevant to show the unique message
>>> adatas = {
... "a": ad.AnnData(
... sparse.random(3, 10, format="csr", density=0.1),
... obs=pd.DataFrame(index=[f"cell-{i}" for i in range(3)])
... ),
... "b": ad.AnnData(
... sparse.random(5, 10, format="csr", density=0.1),
... obs=pd.DataFrame(index=[f"cell-{i}" for i in range(5)])
... ),
... }
>>> ad.concat(adatas).obs # doctest: +SKIP
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Empty DataFrame
Columns: []
Index: [cell-0, cell-1, cell-2, cell-0, cell-1, cell-2, cell-3, cell-4]
>>> ad.concat(adatas, index_unique="_").obs
Empty DataFrame
Columns: []
Index: [cell-0_a, cell-1_a, cell-2_a, cell-0_b, cell-1_b, cell-2_b, cell-3_b, cell-4_b]
Merging
-------
Combining elements not aligned to the axis of concatenation is controlled through the `merge` argument.
We provide a few strategies for merging elements aligned to the alternative axes:
* `None`: No elements aligned to alternative axes are present in the result object.
* `"same"`: Elements that are the same in each of the objects.
* `"unique"`: Elements for which there is only one possible value.
* `"first"`: The first element seen at each position.
* `"only"`: Elements that show up in only one of the objects.
We'll show how this works with elements aligned to the alternative axis, and then how merging works with `.uns`.
First, our example case:
>>> import scanpy as sc
>>> blobs = sc.datasets.blobs(n_variables=30, n_centers=5)
>>> sc.pp.pca(blobs)
>>> blobs
AnnData object with n_obs × n_vars = 640 × 30
obs: 'blobs'
uns: 'pca'
obsm: 'X_pca'
varm: 'PCs'
Now we will split this object by the categorical `"blobs"` and recombine it to illustrate different merge strategies.
>>> adatas = []
>>> for group, idx in blobs.obs.groupby("blobs").indices.items():
... sub_adata = blobs[idx].copy()
... sub_adata.obsm["qc"], sub_adata.varm[f"{group}_qc"] = sc.pp.calculate_qc_metrics(
... sub_adata, percent_top=(), inplace=False, log1p=False
... )
... adatas.append(sub_adata)
>>> adatas[0]
AnnData object with n_obs × n_vars = 128 × 30
obs: 'blobs'
uns: 'pca'
obsm: 'X_pca', 'qc'
varm: 'PCs', '0_qc'
`adatas` is now a list of datasets with disjoint sets of observations and a common set of variables.
Each object has had QC metrics computed, with observation-wise metrics stored under `"qc"` in `.obsm`, and variable-wise metrics stored with a unique key for each subset.
Taking a look at how this affects concatenation:
>>> ad.concat(adatas)
AnnData object with n_obs × n_vars = 640 × 30
obs: 'blobs'
obsm: 'X_pca', 'qc'
>>> ad.concat(adatas, merge="same")
AnnData object with n_obs × n_vars = 640 × 30
obs: 'blobs'
obsm: 'X_pca', 'qc'
varm: 'PCs'
>>> ad.concat(adatas, merge="unique")
AnnData object with n_obs × n_vars = 640 × 30
obs: 'blobs'
obsm: 'X_pca', 'qc'
varm: 'PCs', '0_qc', '1_qc', '2_qc', '3_qc', '4_qc'
Note that comparisons are made after indices are aligned.
That is, if the objects only share a subset of indices on the alternative axis, it's only required that values for those indices match when using a strategy like `"same"`.
>>> a = AnnData(
... sparse.eye(3, format="csr"),
... var=pd.DataFrame({"nums": [1, 2, 3]}, index=list("abc"))
... )
>>> b = AnnData(
... sparse.eye(2, format="csr"),
... var=pd.DataFrame({"nums": [2, 1]}, index=list("ba"))
... )
>>> ad.concat([a, b], merge="same").var
nums
a 1
b 2
Merging `.uns`
~~~~~~~~~~~~~~
We use the same set of strategies for merging `uns` as we do for entries aligned to an axis, but these strategies are applied recursively.
This is a little abstract, so we'll look at some examples of this. Here's our setup:
>>> from anndata import AnnData
>>> import numpy as np
>>> a = AnnData(np.zeros((10, 10)), uns={"a": 1, "b": 2, "c": {"c.a": 3, "c.b": 4}})
>>> b = AnnData(np.zeros((10, 10)), uns={"a": 1, "b": 3, "c": {"c.b": 4}})
>>> c = AnnData(np.zeros((10, 10)), uns={"a": 1, "b": 4, "c": {"c.a": 3, "c.b": 4, "c.c": 5}})
For quick reference, these are the results from each of the merge strategies.
These are discussed in more depth below:
=========== =======================================================
`uns_merge` Result
=========== =======================================================
`None` `{}`
`"same"` `{"a": 1, "c": {"c.b": 4}}`
`"unique"` `{"a": 1, "c": {"c.a": 3, "c.b": 4, "c.c": 5}}`
`"only"` `{"c": {"c.c": 5}}`
`"first"` `{"a": 1, "b": 2, "c": {"c.a": 3, "c.b": 4, "c.c": 5}}`
=========== =======================================================
The default returns a fairly obvious result:
>>> ad.concat([a, b, c]).uns == {}
True
But let's take a look at the others in a bit more depth. Here, we'll wrap the resulting `uns` in a `dict` to simplify the displayed output.
>>> dict(ad.concat([a, b, c], uns_merge="same").uns)
{'a': 1, 'c': {'c.b': 4}}
Here, only the values for `uns["a"]` and `uns["c"]["c.b"]` were exactly the same in every object, so only they were kept.
`uns["b"]` takes a different value in each object, and neither `uns["c"]["c.a"]` nor `uns["c"]["c.c"]` appears in every `uns`.
A key feature to note is that comparisons are aware of the nested structure of `uns` and will be applied at any depth.
This is why `uns["c"]["c.b"]` was kept.
Merging `uns` in this way can be useful when there is some shared data between the objects being concatenated.
For example, if each was put through the same pipeline with the same parameters, those parameters used would still be present in the resulting object.
Now let's look at the behaviour of `unique`:
>>> dict(ad.concat([a, b, c], uns_merge="unique").uns)
{'a': 1, 'c': {'c.a': 3, 'c.b': 4, 'c.c': 5}}
The results here are a superset of those from `"same"`. Note that there was only one possible value at each position in the resulting mapping.
That is, there were no alternative values present for `uns["c"]["c.c"]`, even though it appeared only once.
This can be useful when the objects were run through the same pipeline but each contains object-specific metadata.
An example of this would be a spatial dataset, where the images are stored in `uns`.
>>> dict(ad.concat([a, b, c], uns_merge="only").uns)
{'c': {'c.c': 5}}
`uns["c"]["c.c"]` is the only value kept, since it is the only one that appears in exactly one `uns`.
>>> dict(ad.concat([a, b, c], uns_merge="first").uns)
{'a': 1, 'b': 2, 'c': {'c.a': 3, 'c.b': 4, 'c.c': 5}}
In this case, the result has the union of the keys from all the starting dictionaries.
The value is taken from the first object to have a value at this key.