1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141
|
Workflow implementation
=======================
Workflows in `ewoks` are based on *networkx* graphs, both in terms of runtime representation and
persistent representation. At runtime, links between nodes are hash links which provide a unique
identifier for each task output. This identifier is used to save and load task outputs from external
storage (e.g. HDF5).
Hash links
----------
Hash links between tasks have these requirements:
* changing task input values or links will change all output hashes of the downstream tasks and hence
also their URI's to external storage
* pass the actual data from one task to another or the associated URI's should result in the
same outcome
* task results should never be hashed because they can be large
Hash implementation
-------------------
The *universal hashing* in *ewoks* is currently based on SHA-256. The `UniversalHash` class
representation a *universal hash* at runtime. Several builtin python types are *universally hasheable*:
strings, numbers, mappings, sets and iterables. Custom types that are *universally hasheable* should
derive from `UniversalHashable`.
Tasks and task inputs and outputs are *universally hasheable* and are implementation as described
in this class diagram:
.. mermaid::
classDiagram
UniversalHashable <|-- Variable
Variable <|-- VariableContainer
Variable --o VariableContainer
UniversalHashable <|-- Task
Task o-- VariableContainer
class UniversalHashable{
-version
-class_nonce
#pre_uhash
#class_uhash
#instance_nonce
#data_uhash()
uhash() UniversalHash
}
class Variable{
value
data_proxy: DataProxy
}
class VariableContainer{
value: Dictionary<string|int, Variable>
}
class Task{
input_variables: VariableContainer
output_variables: VariableContainer
}
UniversalHashable
+++++++++++++++++
The return value of `UniversalHashable.uhash()` can be either
* the *universal hash* of `pre_uhash` and `instance_nonce` when `instance_nonce`
is provided on instantiation
* equal to `pre_uhash` when `instance_nonce` is NOT provided on instantiation
The value of `UniversalHashable.pre_uhash` can be either
* provided on instantiation
* the *universal hash* of `UniversalHashable.class_nonce` and the return value of `UniversalHashable.data_uhash()`
The value of `UniversalHashable.class_nonce` is the *universal hash* of
* the class full qualifier name
* `UniversalHashable.version`
* `UniversalHashable.class_nonce` of the base class
Variable
++++++++
The return value of `Variable.data_uhash()` is `Variable.value` or `None` when hashing is disabled.
The `Variable.data_proxy` provides read-write access to the `Variable` data in external storage.
A `DataProxy` generates a `DataUri` for a root URI and a `UniversalHashable` (in this case a `Variable`).
For example when the root URI is `"/tmp/dataset_name.nx?path=scan_name/task_results/var1"` then the
`DataUri` will look like this
* `.json:///tmp/dataset_name/scan_name/task_results/var1/6872c154c80bfcda0a9a769e3c1b4c85b8a56ad8d022d5c5da3ef9c036bc1e01.json`
* `.nexus:///tmp/dataset_name.nx?path=scan_name/task_results/var1/6872c154c80bfcda0a9a769e3c1b4c85b8a56ad8d022d5c5da3ef9c036bc1e01`
Example:
++++++++
A task which takes a single integer as input and an array as output
.. code:: python
class MyTask(Task, input_names=["N"], output_names=["array"]):
def run(self):
self.outputs.array = random(self.inputs.N)
When instantiating `MyTask`, the following happens
.. code:: python
self.input_variables = VariableContainer(value={"N": N})
self.output_variables = VariableContainer(value={"array": self.MISSING_DATA},
pre_uhash=input_variables,
instance_nonce=self.class_nonce())
self.pre_uhash = self.output_variables
The *universal hash* of the task is equal to the *universal hash* of the output container.
The input variable container instantiates this variable
.. code:: python
input_variables["N"] = Variable(value=100000)
# N is a `Variable` (task input in this case)
# It’s value is 100000
# It’s uhash is calculated from the value
The output variable container instantiates this variable
.. code:: python
output_variables["array"] = Variable(value=output_variables.MISSING_DATA,
pre_uhash=output_variables.pre_uhash,
instance_nonce=(output_variables.instance_nonce, "array"))
# array is a `Variable` (task output in this case)
# It’s value is not yet defined (set in the `run` method)
# It’s uhash is not calculated from the value but from the uhash of the task input container
This scheme ensures that the hash of a single output variable depends on all upstream inputs
and does not depend on its value . The output variables take the `MyTask.class_nonce()` as an
instance nonce to ensure that different tasks with identical upstream inputs produce.
|