Workflow implementation
=======================

Workflows in *ewoks* are based on *networkx* graphs, for both the runtime representation and the
persistent representation. At runtime, links between nodes are hash links, which provide a unique
identifier for each task output. This identifier is used to save and load task outputs from external
storage (e.g. HDF5).

Hash links
----------

Hash links between tasks have these requirements:

* changing task input values or links changes the output hashes of all downstream tasks and hence
  also their URIs in external storage
* passing the actual data from one task to another or passing the associated URIs should result
  in the same outcome
* task results should never be hashed because they can be large
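
The first and last requirements can be illustrated with a minimal sketch. It is not the *ewoks*
implementation: `sketch_hash` and `output_hash` are hypothetical helpers, and a downstream task
is assumed to receive the hash of the upstream output rather than the output data itself.

.. code:: python

  import hashlib
  import json


  def sketch_hash(*parts) -> str:
      # Hypothetical helper: SHA-256 over a JSON encoding of the parts
      encoded = json.dumps(parts, sort_keys=True).encode("utf-8")
      return hashlib.sha256(encoded).hexdigest()


  def output_hash(task_name, inputs) -> str:
      # `inputs` maps input names to plain values or to upstream output hashes;
      # task results are never part of the hash
      return sketch_hash(task_name, inputs)


  h1 = output_hash("task1", {"N": 10})
  h2 = output_hash("task2", {"array": h1})  # link to the output of task1

  # Changing an upstream input changes every downstream output hash
  h1_changed = output_hash("task1", {"N": 11})
  h2_changed = output_hash("task2", {"array": h1_changed})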

Hash implementation
-------------------

The *universal hashing* in *ewoks* is currently based on SHA-256. The `UniversalHash` class
represents a *universal hash* at runtime. Several built-in Python types are *universally hashable*:
strings, numbers, mappings, sets and iterables. Custom types that are *universally hashable* should
derive from `UniversalHashable`.
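
As a rough illustration of hashing nested built-in values with SHA-256, consider the sketch below.
The function `sketch_uhash` and its byte encoding are assumptions made for this example. They do
not reproduce the encoding used by `UniversalHash`.

.. code:: python

  import hashlib
  from collections.abc import Iterable, Mapping, Set


  def sketch_uhash(value) -> str:
      """Return a SHA-256 hex digest for nested built-in values (illustrative only)."""
      h = hashlib.sha256()
      if value is None:
          h.update(b"none")
      elif isinstance(value, str):
          h.update(b"str:" + value.encode("utf-8"))
      elif isinstance(value, (bool, int, float)):
          h.update(b"num:" + repr(value).encode("utf-8"))
      elif isinstance(value, Mapping):
          # Key order must not matter: sort the (key hash, value hash) pairs
          h.update(b"map:")
          pairs = sorted((sketch_uhash(k), sketch_uhash(v)) for k, v in value.items())
          for key_hash, value_hash in pairs:
              h.update((key_hash + value_hash).encode("ascii"))
      elif isinstance(value, Set):
          # Sets are unordered: sort the item hashes
          h.update(b"set:")
          for item_hash in sorted(sketch_uhash(item) for item in value):
              h.update(item_hash.encode("ascii"))
      elif isinstance(value, Iterable):
          # Other iterables are hashed in iteration order
          h.update(b"seq:")
          for item in value:
              h.update(sketch_uhash(item).encode("ascii"))
      else:
          raise TypeError(f"{type(value).__name__} is not hashable in this sketch")
      return h.hexdigest()

Each value is prefixed with a type tag so that, for example, the string `"1"` and the number `1`
do not share a hash.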

Tasks, task inputs and task outputs are *universally hashable* and are implemented as described
in this class diagram:

.. mermaid::

   classDiagram
      UniversalHashable <|-- Variable
      Variable <|-- VariableContainer
      Variable --o VariableContainer
      UniversalHashable <|-- Task
      Task o-- VariableContainer
      class UniversalHashable{
          -version
          -class_nonce
          #pre_uhash
          #class_uhash
          #instance_nonce
          #data_uhash()
          uhash() UniversalHash
      }
      class Variable{
          value
          data_proxy: DataProxy
      }
      class VariableContainer{
          value: Dictionary<string|int, Variable>
      }
      class Task{
          input_variables: VariableContainer
          output_variables: VariableContainer
      }

UniversalHashable
+++++++++++++++++

The return value of `UniversalHashable.uhash()` can be either

* the *universal hash* of `pre_uhash` and `instance_nonce` when `instance_nonce`
  is provided on instantiation
* equal to `pre_uhash` when `instance_nonce` is NOT provided on instantiation

The value of `UniversalHashable.pre_uhash` can be either

* provided on instantiation
* the *universal hash* of `UniversalHashable.class_nonce` and the return value of `UniversalHashable.data_uhash()`

The value of `UniversalHashable.class_nonce` is the *universal hash* of

* the fully qualified name of the class
* `UniversalHashable.version`
* `UniversalHashable.class_nonce` of the base class
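
The three rules above can be summarized in a minimal sketch. The class `SketchHashable` and the
helper `_hash` are hypothetical and only mirror the rules as described here. They are not the
`UniversalHashable` implementation, and `class_nonce` is assumed to be a class method.

.. code:: python

  import hashlib


  def _hash(*parts) -> str:
      # Hypothetical helper: SHA-256 over the repr of the parts
      return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()


  class SketchHashable:
      version = None

      def __init__(self, pre_uhash=None, instance_nonce=None):
          self._pre_uhash = pre_uhash
          self._instance_nonce = instance_nonce

      @classmethod
      def class_nonce(cls) -> str:
          # Fully qualified class name, version and class nonce of the base classes
          base_nonces = [
              base.class_nonce()
              for base in cls.__bases__
              if hasattr(base, "class_nonce")
          ]
          name = f"{cls.__module__}.{cls.__qualname__}"
          return _hash(name, cls.version, base_nonces)

      def data_uhash(self):
          # Subclasses return the data that identifies the instance
          return None

      @property
      def pre_uhash(self) -> str:
          # Either provided on instantiation or derived from class nonce and data
          if self._pre_uhash is not None:
              return self._pre_uhash
          return _hash(self.class_nonce(), self.data_uhash())

      def uhash(self) -> str:
          # Without an instance nonce the hash equals pre_uhash
          if self._instance_nonce is None:
              return self.pre_uhash
          return _hash(self.pre_uhash, self._instance_nonce)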

Variable
++++++++

The return value of `Variable.data_uhash()` is `Variable.value`, or `None` when hashing is disabled.

The `Variable.data_proxy` provides read-write access to the `Variable` data in external storage.

A `DataProxy` generates a `DataUri` for a root URI and a `UniversalHashable` (in this case a `Variable`).

For example when the root URI is `"/tmp/dataset_name.nx?path=scan_name/task_results/var1"` then the
`DataUri` will look like this

* `.json:///tmp/dataset_name/scan_name/task_results/var1/6872c154c80bfcda0a9a769e3c1b4c85b8a56ad8d022d5c5da3ef9c036bc1e01.json`
* `.nexus:///tmp/dataset_name.nx?path=scan_name/task_results/var1/6872c154c80bfcda0a9a769e3c1b4c85b8a56ad8d022d5c5da3ef9c036bc1e01`

Example:
++++++++

A task that takes a single integer as input and produces an array as output:

.. code:: python

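  import numpy
  from ewokscore import Task


  class MyTask(Task, input_names=["N"], output_names=["array"]):
      def run(self):
          # Assumes `random` refers to numpy.random.random
          self.outputs.array = numpy.random.random(self.inputs.N)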

When instantiating `MyTask`, the following happens

.. code:: python

   self.input_variables = VariableContainer(value={"N": N})
   self.output_variables = VariableContainer(value={"array": self.MISSING_DATA},
                                             pre_uhash=self.input_variables,
                                             instance_nonce=self.class_nonce())
   self.pre_uhash = self.output_variables

The *universal hash* of the task is equal to the *universal hash* of the output container.

The input variable container instantiates this variable

.. code:: python

  input_variables["N"] = Variable(value=100000)
  # N is a `Variable` (task input in this case)
  # Its value is 100000
  # Its uhash is calculated from the value

The output variable container instantiates this variable

.. code:: python

  output_variables["array"] = Variable(value=output_variables.MISSING_DATA,
                                       pre_uhash=output_variables.pre_uhash,
                                       instance_nonce=(output_variables.instance_nonce, "array"))
  # array is a `Variable` (task output in this case)
  # Its value is not yet defined (set in the `run` method)
  # Its uhash is not calculated from the value but from the uhash of the task input container

This scheme ensures that the hash of a single output variable depends on all upstream inputs
and does not depend on its value. The output variables take the `MyTask.class_nonce()` as an
instance nonce to ensure that different tasks with identical upstream inputs produce different
output hashes.
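
The properties of this scheme can be checked with a small stand-alone sketch. The helpers
`sketch_hash` and `sketch_output_uhash` are hypothetical and simplify the steps above: the class
nonce is assumed to be derived from the task class name only, and the input container hash from
the sorted input items.

.. code:: python

  import hashlib


  def sketch_hash(*parts) -> str:
      # Hypothetical helper: SHA-256 over the repr of the parts
      return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()


  def sketch_output_uhash(task_class_name, inputs, output_name) -> str:
      # pre_uhash of the output container: hash of the input container
      pre_uhash = sketch_hash(sorted(inputs.items()))
      # Instance nonce of the output variable: (task class nonce, output name)
      class_nonce = sketch_hash(task_class_name)
      return sketch_hash(pre_uhash, (class_nonce, output_name))


  same_inputs = {"N": 100000}
  hash_a = sketch_output_uhash("MyTask", same_inputs, "array")
  hash_b = sketch_output_uhash("OtherTask", same_inputs, "array")
  hash_c = sketch_output_uhash("MyTask", {"N": 5}, "array")

In this sketch `hash_a`, `hash_b` and `hash_c` all differ, and none of them requires the output
value, so the hashes are known before the task runs.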