|
.. currentmodule:: cf
.. default-role:: obj
.. _LAMA:
Large Amounts of Massive Arrays (LAMA)
======================================
Data arrays contained within fields are stored and manipulated in a
very memory-efficient manner, so that large numbers of fields may
co-exist and be manipulated regardless of the size of their data
arrays. The limiting factor on array size is not the amount of
available physical memory but the amount of available disk space,
which is generally far greater.
The basic functionality is:
* **Arrays larger than the available physical memory may be created.**
Arrays larger than a preset number of bytes are partitioned into
smaller sub-arrays which are not kept in memory but stored on disk,
either as part of the original file from which they were read or as
newly created temporary files, whichever is appropriate. Therefore a
data array larger than the preset size need never wholly exist in
memory.
* **Large numbers of arrays which are in total larger than the
available physical memory may co-exist.**
Large numbers of data arrays which are collectively larger than the
available physical memory may co-exist, regardless of whether any
individual array is in memory or stored on disk.
* **Arrays larger than the available physical memory may be operated
on.**
Array operations (such as subspacing, assignment, arithmetic,
comparison, collapses, etc.) are carried out on a
partition-by-partition basis. When an operation traverses the
partitions, a partition that is stored on disk is brought into memory
for processing and returned to disk before the next partition is
processed.
* **The memory management does not need to be known at the API
level.**
As the array's metadata (such as size, shape, data type, number of
dimensions, etc.) always reflect the data array in its entirety,
regardless of how the array is partitioned and whether or not its
partitions are in memory, fields and other variables containing data
arrays may be used in the API as if they were normal, in-memory
objects (like numpy arrays); see the sketch after this list.
Partitioning does carry a performance overhead, but this may be
minimised for particular applications or hardware configurations.
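For instance, a field's data may be inspected and operated on exactly
as if it were held in memory. The following hypothetical session
assumes a numeric field ``f``; the values shown are illustrative only:

>>> f.shape
(12, 73, 96)
>>> f.ndim
3
>>> g = f + 273.15   # arithmetic, carried out partition by partition
>>> h = f >= 273.15  # comparison, carried out partition by partition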
.. _LAMA_reading:
Reading from files
------------------
When a field is read from a file, the data array is not realized in
memory, however large or small it may be. Instead each partition
refers to part of the original file on disk. Therefore reading even
very large fields is initially very fast and uses up only a very small
amount of memory.
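For example, in the following hypothetical session the file name and
sizes are illustrative; opening the field returns almost immediately,
and inspecting its metadata reads no data from disk:

>>> f = cf.read('data.nc')[0]
>>> f.shape
(1200, 1024, 1024)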
.. _LAMA_copying:
Copying
-------
When a field is deep copied with its `~Field.copy` method or the
:py:func:`copy.deepcopy` function, the partitions of its data array
are transferred to the new field as object identities and are *not*
deep copied. Therefore copying even very large fields is initially
very fast and uses up only a very small amount of memory.
The independence of the copied field is preserved, however, since each
partition that is stored in memory (as opposed to on disk) is deep
copied if and when the data in the new field is actually accessed, and
then only if the partition's data still exists in the original (or any
other) field.
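For example, the following hypothetical session sketches this
behaviour; ``f.array`` (which returns an independent numpy array of
the field's data) is what finally triggers the deep copy of any
in-memory partitions:

>>> import copy
>>> g = f.copy()           # fast: partitions are shared, not copied
>>> h = copy.deepcopy(f)   # equivalent to f.copy()
>>> a = g.array            # only now are in-memory partitions deep copied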
Aggregation
-----------
When two fields are aggregated to form one larger field, there is no
need for either field's data array partitions to be accessed, and
therefore brought into memory if they were stored on disk. The
resulting field recombines the two fields' array partitions, as object
identities, into the new larger array, thereby creating an aggregated
field that uses up only a very small amount of extra memory.
The independence of the new field is preserved, however, since each
partition that is stored in memory (as opposed to on disk) is deep
copied when the data in the new field is actually accessed, and then
only if the partition's data still exists in the original (or any
other) field.
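For example, in the following hypothetical session the file names are
illustrative; `cf.aggregate` combines the fields, and no partition
data is read from disk in doing so:

>>> fl = cf.read('temperature_*.nc')
>>> g = cf.aggregate(fl)[0]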
.. _LAMA_subspacing:
Subspacing
----------
When a new field is created by :ref:`subspacing <Subspacing>` a field,
the new field is actually a :ref:`LAMA deep copy <LAMA_copying>` of
the original but with additional instructions on each of its data
partitions to only use the part specified by the subspace indices.
As with copying, creating subspaced fields is initially very fast and
uses up only a very small amount of memory, with the added advantage
that a deep copy of only the requested parts of the data array needs
to be carried out at the time of data access, and then only if the
partition's data still exists in the original field.
When subspacing a field that has previously been subspaced but has
not yet had its data accessed, the new subspace merely updates the
instructions on which parts of the array partitions to use. For
example:
>>> f.shape
(12, 73, 96)
>>> g = f.subspace[::-2, ...]
>>> h = g.subspace[2:5, ...]
is equivalent to
>>> h = f.subspace[7:2:-2, ...]
and if all of the partitions of field ``f`` are stored on disk then in
both cases so are all of the partitions of field ``h`` and no data has
been read from disk.
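Inspecting only the metadata confirms the equivalence without reading
any data (continuing the example above):

>>> g.shape
(6, 73, 96)
>>> h.shape
(3, 73, 96)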
Speed and memory management
---------------------------
The creation of temporary files for array partitions of large arrays
and the reading of data from files on disk can create significant
speed overheads (for example, recent tests show that writing a 100
megabyte array to disk can take of the order of one second), so it may
be desirable to configure the maximum size of array which is kept in
memory, and therefore has fast access.
The data array memory management is configurable as follows:
* Data arrays larger than a given size (the *chunk size*) are
partitioned into sub-arrays, each of which is no larger than the
chunk size. By default the chunk size is 1% of the total physical
memory and may be found and set with the `cf.CHUNKSIZE` function.
* In-memory sub-arrays may be written to temporary files on disk when
the amount of available physical memory falls below a specified
amount. By default this amount is 10% of the total physical memory
and may be found and set with the `cf.MINNCFM` function (see the
example below).
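In the following hypothetical session the returned values are
illustrative (here a machine with about 32 GiB of physical memory),
and `cf.MINNCFM` is assumed to express the threshold as a number of
chunk-sized amounts of free memory:

>>> cf.CHUNKSIZE()           # in bytes, by default 1% of physical memory
343597383.68
>>> cf.CHUNKSIZE(2**30)      # use sub-arrays of at most 1 GiB instead
>>> cf.MINNCFM()             # threshold, as a number of chunks
10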
.. _LAMA_temporary_files:
Temporary files
^^^^^^^^^^^^^^^
The directory in which temporary files are stored may be found and
set with the `cf.TEMPDIR` function:
>>> cf.TEMPDIR()
'/tmp'
>>> cf.TEMPDIR('/home/me/tmp')
>>> cf.TEMPDIR()
'/home/me/tmp'
The removal of temporary files which are no longer required works in
the same way as Python's automatic garbage collector. When a
partition's data is stored in a temporary file, that file will only
exist for as long as there are partitions referring to it. When no
partitions require the file it is deleted automatically.
When Python exits normally, all temporary files are deleted.
Changing the temporary file directory does not prevent temporary files
in the original directory from being garbage collected.
Partitioning
^^^^^^^^^^^^
To maximise looping efficiency, array partitioning preserves, as far
as possible, the sizes of the faster-varying (inner) dimensions in
each of the sub-arrays.
**Examples**
If an array with shape (2, 4, 6) is partitioned into 2 partitions then both
sub-arrays will have shape (1, 4, 6).
If the same array is partitioned into 4 partitions then all four sub-arrays
will have shape (1, 2, 6).
If the same array is partitioned into 8 partitions then all eight
sub-arrays will have shape (1, 2, 3).
If the same array is partitioned into 48 partitions then all
forty-eight sub-arrays will have shape (1, 1, 1).
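The following toy function is not cf's actual partitioning algorithm,
merely a sketch of a strategy that is consistent with the examples
above: it cycles over the dimensions from the outermost (slowest
varying) inwards, cutting each splittable dimension by its smallest
prime factor, until enough partitions exist::

    from math import prod

    def smallest_prime_factor(n):
        for p in range(2, n + 1):
            if n % p == 0:
                return p

    def partition_shape(shape, n_partitions):
        '''Sub-array shape when an array of the given shape is split
        into n_partitions partitions (toy illustration only).'''
        # Number of pieces each dimension is cut into, outermost first
        divisors = [1] * len(shape)
        while prod(divisors) < n_partitions:
            split_any = False
            for i, size in enumerate(shape):  # outermost dimension first
                piece = size // divisors[i]
                if piece > 1 and prod(divisors) < n_partitions:
                    divisors[i] *= smallest_prime_factor(piece)
                    split_any = True
            if not split_any:
                break  # nothing left to split
        return tuple(s // d for s, d in zip(shape, divisors))

It reproduces the examples above:

>>> partition_shape((2, 4, 6), 2)
(1, 4, 6)
>>> partition_shape((2, 4, 6), 8)
(1, 2, 3)
>>> partition_shape((2, 4, 6), 48)
(1, 1, 1)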
|