1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
|
"""!
@brief Module provides various random sampling algorithms.
@authors Andrei Novikov (pyclustering@yandex.ru)
@date 2014-2020
@copyright BSD-3-Clause
"""
import random
def reservoir_r(data, n):
"""!
@brief Performs data sampling using Reservoir Algorithm R.
@details Algorithm complexity O(n). Implementation is based on paper @cite article::utils::sampling::1. Average
number of uniform random variates: \f$N - n\f$.
@param[in] data (list): Input data for sampling.
@param[in] n (uint): Size of sample that should be extracted from 'data'.
@return (list) Sample with size 'n' from 'data'.
Generate random samples with 5 elements and with 3 elements using Reservoir Algorithm R:
@code
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
sample = reservoir_r(data, 5) # generate sample with 5 elements for 'data'.
print(sample)
sample = reservoir_r(data, 3) # generate sample with 3 elements for 'data'.
print(sample)
@endcode
Output example for the code above:
@code
[20, 7, 17, 12, 11]
[12, 2, 10]
@endcode
"""
if n > len(data):
raise ValueError("Incorrect sampling value 'n' (%d) that should be bigger then data size (%d).")
random.seed()
reservoir = data[0:n]
for i in range(n, len(data)):
m = random.randrange(0, i + 1)
if m < n:
reservoir[m] = data[i]
return reservoir
def reservoir_x(data, n):
"""!
@brief Performs data sampling using Reservoir Algorithm X.
@details Algorithm complexity O(n). Implementation is based on paper @cite article::utils::sampling::1. Average
number of uniform random variates:
\f[\approx 2n\ln \left (\frac{N}{n} \right)\f]
@param[in] data (list): Input data for sampling.
@param[in] n (uint): Size of sample that should be extracted from 'data'.
@return (list) Sample with size 'n' from 'data'.
Generate random sample with 5 elements using Reservoir Algorithm X:
@code
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
sample = reservoir_x(data, 10) # generate sample with 5 elements for 'data'.
print(sample)
@endcode
Output example for the code above:
@code
[0, 20, 2, 16, 13, 15, 19, 18, 10, 9]
@endcode
"""
def calculate_skip_value(t, size, skip):
return pow(t + 1 - size, skip + 1) / pow(t + 1, skip + 1)
def generate_skip_value(t, size):
threshold, skip = random.random(), 0
while calculate_skip_value(t, size, skip) > threshold:
skip += 1
return skip
if n > len(data):
raise ValueError("Incorrect sampling value 'n' (%d) that should be bigger then data size (%d).")
random.seed()
reservoir = data[0:n]
i = n
while i < len(data):
i += generate_skip_value(i, n)
if i < len(data):
m = random.randrange(0, n)
reservoir[m] = data[i]
i += 1
return reservoir
|