# %%
"""
===================================
Random state within joblib.Parallel
===================================
Parallel execution affects randomness differently depending on the backend.
In particular, when using multiple processes, the random sequence can be the
same in all worker processes. This example illustrates the problem and shows
how to work around it.
"""
import numpy as np
from joblib import Parallel, delayed
# %%
# A utility function for the example
def print_vector(vector, backend):
    """Helper function to print the generated vector with a given backend."""
    print(
        "\nThe different generated vectors using the {} backend are:\n {}".format(
            backend, np.array(vector)
        )
    )
# %%
# Sequential behavior
#####################
#
# ``stochastic_function`` generates five random integers. When calling the
# function several times, we expect to obtain different vectors. For
# instance, calling the function five times in a sequential manner, we can
# check that the generated vectors are all different.
def stochastic_function(max_value):
    """Randomly generate integers up to a maximum value."""
    return np.random.randint(max_value, size=5)

n_vectors = 5
random_vector = [stochastic_function(10) for _ in range(n_vectors)]
print(
    "\nThe different generated vectors in a sequential manner are:\n {}".format(
        np.array(random_vector)
    )
)
# %%
# Parallel behavior
###################
#
# Joblib provides three different backends: loky (default), threading, and
# multiprocessing.
backend = "loky"
random_vector = Parallel(n_jobs=2, backend=backend)(
    delayed(stochastic_function)(10) for _ in range(n_vectors)
)
print_vector(random_vector, backend)
###############################################################################
backend = "threading"
random_vector = Parallel(n_jobs=2, backend=backend)(
    delayed(stochastic_function)(10) for _ in range(n_vectors)
)
print_vector(random_vector, backend)
# %%
# The loky and threading backends behave exactly as in the sequential case and
# do not require any additional care. However, this is not the case for the
# multiprocessing backend with the "fork" or "forkserver" start method,
# because the state of the global numpy random number generator is exactly
# duplicated in all the workers.
#
# Note: on platforms where the default start method is "spawn", we do not have
# this problem, but we cannot use the multiprocessing backend in a plain
# Python script without using the ``if __name__ == "__main__"`` construct.
# So let's end this example early if that's the case:
import multiprocessing as mp

if mp.get_start_method() != "spawn":
    backend = "multiprocessing"
    random_vector = Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function)(10) for _ in range(n_vectors)
    )
    print_vector(random_vector, backend)
# %%
# Some of the generated vectors are exactly the same, which can be a
# problem for the application.
#
# Technically, the reason is that all forked Python processes share the
# exact same random seed. As a result, we obtain the same randomly generated
# vectors twice because we are using ``n_jobs=2``. A solution is to set the
# random state within the function which is passed to
# :class:`joblib.Parallel`.
def stochastic_function_seeded(max_value, random_state):
    """Randomly generate integers up to a maximum value, from a given seed."""
    rng = np.random.RandomState(random_state)
    return rng.randint(max_value, size=5)

# %%
# ``stochastic_function_seeded`` accepts a random seed as an argument. Passing
# ``None`` re-seeds the generator at every function call. In this case, we
# see that the generated vectors are all different.
if mp.get_start_method() != "spawn":
    random_vector = Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function_seeded)(10, None) for _ in range(n_vectors)
    )
    print_vector(random_vector, backend)
# %%
# Fixing the random state to obtain deterministic results
#########################################################
#
# The pattern of ``stochastic_function_seeded`` has another advantage: it
# makes it possible to control the ``random_state`` by passing a known seed.
# For best results [1]_, the random state is initialized with a sequence based
# on a root seed and a job identifier. For instance, we can replicate the same
# generation of vectors by passing a fixed state as follows.
#
# .. [1] https://numpy.org/doc/stable/reference/random/parallel.html
if mp.get_start_method() != "spawn":
    seed = 42
    random_vector = Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function_seeded)(10, [i, seed]) for i in range(n_vectors)
    )
    print_vector(random_vector, backend)

    random_vector = Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function_seeded)(10, [i, seed]) for i in range(n_vectors)
    )
    print_vector(random_vector, backend)
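
# %%
# As a complementary sketch (not part of the original pattern above), the
# ``numpy.random.SeedSequence`` API recommended in [1]_ can derive
# statistically independent child seeds from a single root seed, one per job.
# The helper ``stochastic_function_rng`` below is a hypothetical name used
# only for this illustration.
def stochastic_function_rng(max_value, child_seed):
    """Hypothetical helper: draw integers from a per-job seeded Generator."""
    rng = np.random.default_rng(child_seed)
    return rng.integers(max_value, size=5)


if mp.get_start_method() != "spawn":
    # Derive one independent child seed per job from the root seed.
    child_seeds = np.random.SeedSequence(seed).spawn(n_vectors)
    random_vector = Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function_rng)(10, child_seeds[i]) for i in range(n_vectors)
    )
    print_vector(random_vector, backend)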