1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
|
Roadmap
=======
For a brochure version of this roadmap, see
`this link <https://docs.wixstatic.com/ugd/095d2c_ac81d19db47047c79a55da7a6c31cf66.pdf>`_.
Background
----------
The aim of PyData/Sparse is to create sparse containers that implement the ndarray
interface. Traditionally in the PyData ecosystem, sparse arrays have been provided
by the ``scipy.sparse`` submodule. All containers there depend on and emulate the
``numpy.matrix`` interface. This means that they are limited to two dimensions and also
don’t work well in places where ``numpy.ndarray`` would work.
PyData/Sparse is well on its way to replacing ``scipy.sparse`` as the de-facto sparse array
implementation in the PyData ecosystem.
Topics
------
* More storage formats
* Better performance/algorithms
* Covering more of the NumPy API
* SciPy Integration
* Dask integration for high scalability
* CuPy integration for GPU-acceleration
* Maintenance and General Improvements
More Storage Formats
--------------------
In the sparse domain, you have to make a choice of format when representing your array in
memory, and different formats have different trade-offs. For example:
* CSR/CSC are usually expected by external libraries, and have good space characteristics
for most arrays
* DOK allows in-place modification and writes
* LIL has faster writes if written to in-order.
* BSR allows block-writes and reads
The most important formats are, of course, CSR and CSC, because they allow zero-copy interaction
with a number of libraries including MKL, LAPACK and others. This will allow PyData/Sparse to
quickly reach the functionality of ``scipy.sparse``, accelerating the path to its replacement.
Better Performance/Algorithms
-----------------------------
There are a few places in scipy.sparse where algorithms are sub-optimal, sometimes due to reliance
on NumPy which doesn’t have these algorithms. We intend to both improve the algorithms in NumPy,
giving the broader community a chance to use them; as well as in PyData/Sparse, to reach optimal
efficiency in the broadest use-cases.
Covering More of the NumPy API
------------------------------
Our eventual aim is to cover all areas of NumPy where algorithms exist that give sparse arrays an edge
over dense arrays. Currently, PyData/Sparse supports reductions, element-wise functions and other common
functions such as stacking, concatenating and tensor products. Common uses of sparse arrays include
linear algebra and graph theoretic subroutines, so we plan on covering those first.
SciPy Integration
-----------------
PyData/Sparse aims to build containers and elementary operations on them, such as element-wise operations,
reductions and so on. We plan on modifying the current graph theoretic subroutines in ``scipy.sparse.csgraph``
to support PyData/Sparse arrays. The same applies for linear algebra and ``scipy.sparse.linalg``.
CuPy integration for GPU-acceleration
-------------------------------------
CuPy is a project that implements a large portion of NumPy’s ndarray interface on GPUs. We plan to integrate
with CuPy so that it’s possible to accelerate sparse arrays on GPUs.
Completed Tasks
===============
Dask Integration for High Scalability
-------------------------------------
Dask is a project that takes ndarray style containers and then allows them to scale across multiple cores or
clusters. We plan on tighter integration and cooperation with the Dask team to ensure the highest amount of
Dask functionality works with sparse arrays.
Currently, integration with Dask is supported via array protocols. When more of the NumPy API (e.g. array
creation functions) becomes available through array protocols, it will be automatically be supported by Dask.
(Partial) SciPy Integration
---------------------------
Support for ``scipy.sparse.linalg`` has been completed. We hope to add support for ``scipy.sparse.csgraph``
in the future.
More Storage Formats
--------------------
GCXS, a compressed n-dimensional array format based on the GCRS/GCCS formats of
`Shaikh and Hasan 2015 <https://ieeexplore.ieee.org/document/7237032>`_, has been added.
In conjunction with this work, the CSR/CSC matrix formats have been are now a part of pydata/sparse.
We plan to add better-performing algorithms for many of the operations currently supported.
|