1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
|
# Sparse
This implements sparse arrays of arbitrary dimension on top of
[numpy][] and
[`scipy.sparse`][]. It generalizes the
[`scipy.sparse.coo_matrix`][] and
[`scipy.sparse.dok_matrix`][] layouts, but
extends beyond just rows and columns to an arbitrary number of
dimensions.
Additionally, this project maintains compatibility with the
[`numpy.ndarray`][] interface rather than the
[`numpy.matrix`][] interface used in
[`scipy.sparse`][]
These differences make this project useful in certain situations where
scipy.sparse matrices are not well suited, but it should not be
considered a full replacement. The data structures in pydata/sparse
complement and can be used in conjunction with the fast linear algebra
routines inside [`scipy.sparse`][]. A format conversion or copy may be
required.
## Motivation
Sparse arrays, or arrays that are mostly empty or filled with zeros, are
common in many scientific applications. To save space we often avoid
storing these arrays in traditional dense formats, and instead choose
different data structures. Our choice of data structure can
significantly affect our storage and computational costs when working
with these arrays.
## Design
The main data structure in this library follows the [Coordinate List
(COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO))
layout for sparse matrices, but extends it to multiple dimensions.
The COO layout, which stores the row index, column index, and value of
every element:
| row | col | data |
|-----|-----|------|
| 0 | 0 | 10 |
| 0 | 2 | 13 |
| 1 | 3 | 9 |
| 3 | 8 | 21 |
It is straightforward to extend the COO layout to an arbitrary number of
dimensions:
| dim1 | dim2 | dim3 | \... | data |
|------|------|------|------|------|
| 0 | 0 | 0 | . | 10 |
| 0 | 0 | 3 | . | 13 |
| 0 | 2 | 2 | . | 9 |
| 3 | 1 | 4 | . | 21 |
This makes it easy to *store* a multidimensional sparse array, but we
still need to reimplement all of the array operations like transpose,
reshape, slicing, tensordot, reductions, etc., which can be challenging
in general.
This library also includes several other data structures. Similar to
COO, the [Dictionary of Keys
(DOK)](https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_(DOK))
format for sparse matrices generalizes well to an arbitrary number of
dimensions. DOK is well-suited for writing and mutating. Most other
operations are not supported for DOK. A common workflow may involve
writing an array with DOK and then converting to another format for
other operations.
The [Compressed Sparse Row/Column
(CSR/CSC)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS))
formats are widely used in scientific computing are now supported by
pydata/sparse. The CSR/CSC formats excel at compression and mathematical
operations. While these formats are restricted to two dimensions,
pydata/sparse supports the GCXS sparse array format, based on [GCRS/GCCS
from](https://ieeexplore.ieee.org/abstract/document/7237032/similar#similar)
which generalizes CSR/CSC to n-dimensional arrays. Like their
two-dimensional CSR/CSC counterparts, GCXS arrays compress well. Whereas
the storage cost of COO depends heavily on the number of dimensions of
the array, the number of dimensions only minimally affects the storage
cost of GCXS arrays, which results in favorable compression ratios
across many use cases.
Together these formats cover a wide array of applications of sparsity.
Additionally, with each format complying with the
[`numpy.ndarray`][] interface and following
the appropriate dispatching protocols, pydata/sparse arrays can interact
with other array libraries and seamlessly take part in
pydata-ecosystem-based workflows.
## LICENSE
This library is licensed under BSD-3.
|