1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
|
Simulated Workload Performance Comparison
=========================================
While the :doc:`competitive performance comparison<performance>` is useful for
identifying strengths and weaknesses in the API, it doesn't reflect the
variable nature of real-life workloads. In August 2014, a survey of use cases
for sorted list data types was made through repeated searches of Github. The
analysis identified several common use patterns.
These benchmarks attempt to mimic the usage patterns observed in other
projects. Those patterns are summarized in the following names:
* *Priority Queue* -- Represents a priority queue data structure. In addition
to the normal `add` and `pop` routines that can be efficiently managed with
the `heapq` module in the standard library, projects required the ability to
test for list ownership of values and occasional removal. A common use case
also involved iterating the entire priority queue. The recipe described in
the Python docs is insufficient as a priority queue for these projects in two
ways: not every inserted value corresponds to a unique key and sorted
iteration is not supported in linear time.
* *Multiset* -- Represents a multiset data structure. While the
`collections.Counter` data type is commonly suggested for implementing a
multiset in Python, some projects required the ability to efficiently lookup
the greatest or least item in the set.
* *Ranking* -- Represents those projects that repeatedly looked up the index of
items in the sorted list. Sometimes this occurred as part of prioritizing
elements and reporting their rank. Imagine trying to identify an element's
position in a work queue.
* *Neighbor* -- This pattern was observed in several implementations of
machine-learning algorithms (e.g. K-nearest-neighbor search). An iterative
process was applied to a set of values in which they were repeatedly bisected
and occasionally iterated. The bisect process attempts to find the nearest
values to a given one.
* *Intervals* -- Perhaps the most complex usage pattern, these projects
maintained a list of intervals and were often querying a sorted list to
determine those which overlapped. In addition to bisecting the list to
identify nearest intervals, range queries were used to determine
overlap. Frequent indexing was also apparent.
The performance of each benchmark is displayed below along with each of the
other data types implementing the necessary operations. Before the benchmark
begins, the list is populated with values and then corresponding to the list
size, a number of operations are performed. Each workload has been reduced to a
set of operations with associated frequencies. For example, the priority queue
simulates `add` and `pop` each 40% of the time.
Though these workloads strive to be realistic, they are still quite
synthetic. No attempt is made to exercise memory allocation or cache
interference while sorted list operations are performed. The frequency of each
operation is also estimated because no projects had performance benchmarks that
were easily evaluated.
The legends of the graphs below correlate the underlying data structure used to
the Python project. The correlation is as follows:
.. currentmodule:: sortedcontainers
====================== ==================================
Data Structure Project
====================== ==================================
:class:`SortedList` :doc:`Sorted Containers<index>`
:class:`SortedKeyList` :doc:`Sorted Containers<index>`
B-Tree `blist on PyPI`_
List `sortedcollection recipe`_
====================== ==================================
.. _`blist on PyPI`: https://pypi.org/project/blist/
.. _`sortedcollection recipe`: http://code.activestate.com/recipes/577197-sortedcollection/
Sorted List
-----------
Graphs comparing :doc:`sortedlist` performance.
Priority Queue
..............
Simulates a *Priority Queue* workload as described above. The mix of operations
and their frequencies:
* 40% :func:`SortedList.add`
* 40% :func:`SortedList.pop`
* 10% :func:`SortedList.discard`
* 9% :func:`SortedList.__contains__`
* 1% :func:`SortedList.__iter__` (limited to first 100 elements)
.. image:: _static/SortedList-priorityqueue.png
.. image:: _static/SortedList_runtime-priorityqueue.png
.. image:: _static/SortedList_load-priorityqueue.png
Multiset
........
Simulates a *Multiset* workload as described above. The mix of operations and
their frequencies:
* 75% :func:`SortedList.__contains__`
* 10% :func:`SortedList.add`
* 10% :func:`SortedList.remove`
* 5% :func:`SortedList.__getitem__`
.. image:: _static/SortedList-multiset.png
.. image:: _static/SortedList_runtime-multiset.png
.. image:: _static/SortedList_load-multiset.png
Ranking
.......
Simulates a *Ranking* workload as described above. The mix of operations and
their frequencies:
* 40% :func:`SortedList.__getitem__`
* 40% :func:`SortedList.index`
* 10% :func:`SortedList.add`
* 10% :func:`SortedList.remove`
.. image:: _static/SortedList-ranking.png
.. image:: _static/SortedList_runtime-ranking.png
.. image:: _static/SortedList_load-ranking.png
Neighbor
........
Simulates a *Neighbor* workload as described above. The mix of operations and
their frequencies:
* 75% :func:`SortedList.bisect`
* 10% :func:`SortedList.add`
* 10% :func:`SortedList.remove`
* 5% :func:`SortedList.__iter__` (limited to first 100 elements)
.. image:: _static/SortedList-neighbor.png
.. image:: _static/SortedList_runtime-neighbor.png
.. image:: _static/SortedList_load-neighbor.png
Intervals
.........
Simulates an *Intervals* workload as described above. The mix of operations and
their frequencies:
* 30% :func:`SortedList.bisect`
* 20% :func:`SortedList.__getitem__`
* 20% :func:`SortedList.__delitem__`
* 10% :func:`SortedList.__getitem__` (range query)
* 10% :func:`SortedList.add`
* 10% :func:`SortedList.discard`
.. image:: _static/SortedList-intervals.png
.. image:: _static/SortedList_runtime-intervals.png
.. image:: _static/SortedList_load-intervals.png
Other Performance Comparisons
-----------------------------
:doc:`Sorted Containers<index>` uses a segmented-list data structure similar to
a B-tree limited to two levels of nodes. As part of the implementation, a load
factor is used to determine how many values should be stored in each node. This
can have a significant impact on performance and a :doc:`load factor
performance comparison<performance-load>` is also provided.
Because :doc:`Sorted Containers<index>` is pure-Python, its performance also
depends directly on the Python runtime. A :doc:`runtime performance
comparison<performance-runtime>` is also included with data from popular Python
runtimes.
|