File: benchmark_splits_fill.py

package info (click to toggle)
sortedcontainers 2.4.0-2
  • links: PTS, VCS
  • area: main
  • in suites: bookworm, forky, sid, trixie
  • size: 11,136 kB
  • sloc: python: 6,613; sh: 60; makefile: 18
file content (146 lines) | stat: -rw-r--r-- 4,569 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
"""Count splits that occur in SortedList._expand with varying loads.

NOTE: Benchmarking performance at scale eventually used sampling rather than
what is described below. Do not use this method for benchmarking. This file is
kept for SortedListWithSplits which presents an example for recording the
number of sublist splits that occur.

Compare three load styles:

1. Low - Simply initializing a SortedList creates each sublist with length
equal to `load`.

2. Normal Random - See `init_sorted_list` below. Each sublist is given a length
chosen at random from a normal distribution with mean 1500 and standard
deviation 100. See `plot_lengths_histogram.py` for a visualization of the
sublist length and motivation for the normal distribution parameters.

3. Added Random - Each sublist is initialized by simply adding random values to
the list. This one is the most realistic but expensive to compute. It's also
biased by the number of values added. Because SortedList amortizes some costs,
this metric is prone to under or over-estimating impact.

Sample output: LIMIT = int(1e6), RATIO = 1

Low [504, 490, 487, 503, 488, 494, 496, 502, 493, 486]
Normal Random [666, 667, 662, 666, 670, 668, 668, 666, 668, 666]
Added Random [685, 675, 689, 687, 699, 706, 691, 679, 672, 702]

Conclusion:

Normal Random is about 1-2% less than Added Random on average but remains
practical for use as an efficient approximation.

The above values were calculated with RATIO = 1. Increasing the ratio to 10
creates a significant divergence between Normal Random and Added Random. Sample
output with RATIO = 10:

Low [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Normal Random [1, 2, 0, 1, 0, 1, 0, 1, 1, 2]
Added Random [307, 286, 290, 253, 279, 319, 283, 264, 278, 280]

The difference is due to the load of the sublists. Randomly adding a million
values positions the sublist lengths in a bimodal distribution split at 1000
and 2000. This is nearly the worst-case scenario for measurements.

After trying a number of settings, RATIO = 3 appears reliable and useful. This
approximates the available capacity:

    >>> random.seed(0)
    >>> sl = init_sorted_list(sc.SortedList(), int(1e6))
    >>> len(sl._lists) * sl._load * 2 - len(sl)
    334000

So after constructing a SortedList with sublist length given by a normal
distribution with mean 1,500 and standard deviation 100, we fill the remaining
empty capacity.

"""

from __future__ import print_function

import sortedcontainers as sc
import random

REPEAT = 5
LIMIT = int(1e6)
RATIO = 3


class SortedListWithSplits(sc.SortedList):
    "SortedList that counts splits that occur in _expand."
    
    def __init__(self, *args, **kwargs):
        self.splits = 0
        super(SortedListWithSplits, self).__init__(*args, **kwargs)

    def _expand(self, pos):
        if len(self._lists[pos]) > self._twice:
            self.splits += 1
        super(SortedListWithSplits, self)._expand(pos)


def init_sorted_list(sl, size):
    "Initialize a SortedList with normally distributed sublist lengths."
    sl.clear()

    total = 0

    while total < size:
        # count = random.randrange(sl._load, sl._twice)
        count = int(random.normalvariate(int(sl._load * 1.5), 100))
        count = min(count, sl._load * 2)
        count = max(count, sl._load)
        sl._lists.append(list(range(total, total + count)))
        total += count

    sl._len = sum(len(sublist) for sublist in sl._lists)
    sl._maxes[:] = [sublist[-1] for sublist in sl._lists]

    sl._check()

    while len(sl) > size:
        del sl[random.randrange(len(sl))]

    return sl


def fill(obj, count, limit):
    "Repeatedly add random values to the SortedList."
    for each in range(count):
        obj.add(random.randrange(limit))


if __name__ == '__main__':
    low_splits = []

    for each in range(REPEAT):
        sl = SortedListWithSplits(range(LIMIT))
        fill(sl, LIMIT / RATIO, LIMIT)
        low_splits.append(sl.splits)

    uniform_splits = []

    for each in range(REPEAT):
        sl = init_sorted_list(SortedListWithSplits(), LIMIT)
        fill(sl, LIMIT / RATIO, LIMIT)
        uniform_splits.append(sl.splits)

    rand_splits = []

    for each in range(REPEAT):
        sl = SortedListWithSplits()

        values = list(range(LIMIT))
        random.shuffle(values)
        for value in values:
            sl.add(value)

        sl.splits = 0

        fill(sl, LIMIT / RATIO, LIMIT)
        rand_splits.append(sl.splits)

    print('Low', low_splits)
    print('Normal Random', uniform_splits)
    print('Added Random', rand_splits)