.. _grabbing_and_sinking:
==================================
DataGrabber and DataSink explained
==================================
In this chapter we will try to explain the concepts behind DataGrabber and
DataSink.
Why do we need these interfaces?
================================
A typical workflow takes data as input and produces data as the result of one or
more operations. One can set the data required by a workflow directly as
illustrated below.
::

    import os
    from fsl_tutorial2 import preproc

    preproc.base_dir = os.path.abspath('.')
    preproc.inputs.inputspec.func = os.path.abspath('data/s1/f3.nii')
    preproc.inputs.inputspec.struct = os.path.abspath('data/s1/struct.nii')
    preproc.run()
Typical neuroimaging studies require running workflows on multiple subjects or
with different parameterizations of algorithms. One simple approach would be to
iterate over subjects.
::

    import os
    from fsl_tutorial2 import preproc

    for name in subjects:
        preproc.base_dir = os.path.abspath('.')
        preproc.inputs.inputspec.func = os.path.abspath('data/%s/f3.nii' % name)
        preproc.inputs.inputspec.struct = os.path.abspath('data/%s/struct.nii' % name)
        preproc.run()
However, in the context of complex workflows, and given that users typically
arrange their imaging and other data in a semantically hierarchical data store,
an alternate mechanism for reading and writing the data generated by a workflow
is often necessary. As the names suggest, DataGrabber is used to retrieve data
stored in a shared file system, while DataSink is used to store the data
generated by a workflow in a hierarchical structure on disk.
DataGrabber
===========
DataGrabber is an interface for collecting files from the hard drive. It is very
flexible and supports almost any file organisation of your data that you can
imagine. In the most trivial case, you can use it to grab a single fixed file. By
default, DataGrabber stores its outputs in a field called `outfiles`.
::

    import os
    import nipype.interfaces.io as nio

    datasource1 = nio.DataGrabber()
    datasource1.inputs.base_directory = os.getcwd()
    datasource1.inputs.template = 'data/s1/f3.nii'
    results = datasource1.run()
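Since the matched files land in the default `outfiles` field, they can be read
off the result object after running the interface; a minimal sketch:
::

    # The grabbed file(s) are available on the result object under the
    # default ``outfiles`` output field.
    print(results.outputs.outfiles)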
Or you can get all uncompressed NIfTI files starting with the letter 'f' in
all directories starting with the letter 's'.
::
datasource2.inputs.base_directory = '/mass'
datasource2.inputs.template = 'data/s*/f*.nii'
Two special inputs were used in these previous cases. The input `base_directory`
indicates in which directory to search, while the input `template` indicates the
string template to match. So in the previous case DataGrabber is looking for
path matches of the form `/mass/data/s*/f*.nii`.
.. note::

    When used with wildcards (e.g., s* and f* above), DataGrabber does not
    return data in sorted order. In order to force it to return data in sorted
    order, one needs to set the input `sorted = True`. However, when explicitly
    specifying an order as we will see below, `sorted` should be set to `False`.
More useful cases arise when the template can be filled by other inputs. In the
example below, we define an input field for the DataGrabber called `run`. This is
then used to fill the template (see the %d in the template).
::

    datasource3 = nio.DataGrabber(infields=['run'])
    datasource3.inputs.base_directory = os.getcwd()
    datasource3.inputs.template = 'data/s1/f%d.nii'
    datasource3.inputs.run = [3, 7]
This will return files `basedir/data/s1/f3.nii` and `basedir/data/s1/f7.nii`. We
can take this a step further and pair subjects with runs.
::

    datasource4 = nio.DataGrabber(infields=['subject_id', 'run'])
    datasource4.inputs.template = 'data/%s/f%d.nii'
    datasource4.inputs.run = [3, 7]
    datasource4.inputs.subject_id = ['s1', 's3']
This will return files `basedir/data/s1/f3.nii` and `basedir/data/s3/f7.nii`.
A more realistic use-case
-------------------------
In a typical study one often wants to grab different files for a given subject
and store them in semantically meaningful outputs. In the following example, we
wish to retrieve all the functional runs and the structural image for subject 's1'.
::
datasource = nio.DataGrabber(infields=['subject_id'], outfields=['func', 'struct'])
datasource.inputs.base_directory = 'data'
datasource.inputs.template = '*'
datasource.inputs.field_template = dict(func='%s/f%d.nii',
struct='%s/struct.nii')
datasource.inputs.template_args = dict(func=[['subject_id', [3,5,7,10]]],
struct=[['subject_id']])
datasource.inputs.subject_id = 's1'
Two more fields are introduced: `field_template` and `template_args`. These
fields are both dictionaries whose keys correspond to the `outfields`
keyword. The `field_template` reflects the search path for each output field,
while the `template_args` reflect the inputs that satisfy the template. The
inputs can either be named inputs specified by the `infields` keyword arg, or
they can be raw strings or integers corresponding to the template. For the
`func` output, the **%s** in the `field_template` is satisfied by `subject_id`
and the **%d** is filled in by the list of numbers.
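To make the expansion concrete, the pattern DataGrabber builds for the `func`
output is roughly equivalent to filling the template by hand, as in the sketch
below (the paths assume the directory layout used throughout this chapter):
::

    # Rough illustration of how the ``func`` template is expanded: the %s is
    # filled by subject_id and the %d by each run number in turn.
    for run in [3, 5, 7, 10]:
        print('%s/f%d.nii' % ('s1', run))   # s1/f3.nii, s1/f5.nii, ...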
.. note::

    We have not set `sorted` to `True` as we want the DataGrabber to return the
    functional files in the order they were specified rather than in
    alphabetically sorted order.
DataSink
========
A workflow working directory is like a **cache**. It contains not only the
outputs of various processing stages but also various extraneous information,
such as execution reports and hashfiles that determine the input state of
processes. All of this is embedded in a hierarchical structure that reflects the
iterables that have been used in the workflow. This makes navigating the working
directory a not so pleasant experience, and typically the user is interested in
preserving only a small percentage of these outputs. The DataSink interface can
be used to extract components from this `cache` and store them at a different
location. For XNAT-based storage, see XNATSink.
.. note::

    Unlike other interfaces, a DataSink's inputs are defined and created by
    using the workflow connect statement. Currently disconnecting an input from
    the datasink does not remove that connection port.
Let's assume we have the following workflow.
.. digraph:: simple_workflow

    "InputNode" -> "Realign" -> "DataSink";
    "InputNode" -> "DataSink";
The following code segment defines the DataSink node and sets the `base_directory`
in which all outputs will be stored. The `container` input creates a
subdirectory within the `base_directory`. If you are iterating a workflow over
subjects, it may be useful to save the outputs in a folder named after the
subject id.
::

    import nipype.pipeline.engine as pe
    import nipype.interfaces.io as nio

    datasink = pe.Node(nio.DataSink(), name='sinker')
    datasink.inputs.base_directory = '/path/to/output'
    workflow.connect(inputnode, 'subject_id', datasink, 'container')
If we wanted to save the realigned files and the realignment parameters to the
same place, the most intuitive option would be:
::

    workflow.connect(realigner, 'realigned_files', datasink, 'motion')
    workflow.connect(realigner, 'realignment_parameters', datasink, 'motion')
However, this will not work as only one connection is allowed per input port. So
we need to create a second port. We can store the files in a separate folder.
::

    workflow.connect(realigner, 'realigned_files', datasink, 'motion')
    workflow.connect(realigner, 'realignment_parameters', datasink, 'motion.par')
The period (.) indicates that a subfolder called `par` should be created. But if
we wanted to store the parameters in the same folder as the realigned files, we
would use the `.@` syntax. The @ tells the DataSink interface not to create the
subfolder. This allows us to create different named input ports for the DataSink
while storing the files in the same folder.
::

    workflow.connect(realigner, 'realigned_files', datasink, 'motion')
    workflow.connect(realigner, 'realignment_parameters', datasink, 'motion.@par')
The syntax for the input port of DataSink takes the following form:
::

    string[[.[@]]string[[.[@]]string] ...]

where parts between paired [] are optional.
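As an illustration, the hypothetical connections below show how the string
chosen for the port maps onto folders under the container (the folder names
are only examples):
::

    # 'preproc'              -> container/preproc/
    # 'preproc.motion'       -> container/preproc/motion/
    # 'preproc.motion.@par'  -> container/preproc/motion/  (no extra 'par' folder)
    workflow.connect(realigner, 'realigned_files', datasink, 'preproc.motion')
    workflow.connect(realigner, 'realignment_parameters', datasink, 'preproc.motion.@par')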
MapNode
-------
In order to use :ref:`nipype.interfaces.io.DataSink` inside a MapNode, its
inputs have to be defined inside the constructor using the `infields` keyword
arg.
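A minimal sketch of what this looks like; the field name `raw_file` and the
node name are only placeholders:
::

    import nipype.pipeline.engine as pe
    import nipype.interfaces.io as nio

    # Inputs of a DataSink used inside a MapNode must be declared up front
    # via ``infields``; they are not created implicitly by ``connect``.
    sinker = pe.MapNode(nio.DataSink(infields=['raw_file']),
                        iterfield=['raw_file'],
                        name='sinker')
    sinker.inputs.base_directory = '/path/to/output'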
Parameterization
----------------
As discussed in :doc:`mapnode_and_iterables`, one can run a workflow iterating
over various inputs using the iterables attribute of nodes. This means that a
given workflow can have multiple outputs depending on how many iterables there
are. Iterables create working directory subfolders such as
`_iterable_name_value`. The `parameterization` input parameter controls whether
the data stored using DataSink is in a folder structure that contains this
iterable information or not. It is generally recommended to set this to `True`
when using multiple nested iterables.
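A one-line sketch of turning this on for the sink defined earlier:
::

    # Keep the ``_iterable_name_value`` folder structure in the sink output.
    datasink.inputs.parameterization = True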
Substitutions
-------------
The `substitutions` and `regexp_substitutions` inputs allow users to modify the
output destination path and name of a file. Substitutions are a list of 2-tuples
and are carried out in the order in which they were entered. Assuming that the
output path of a file is:
::

    /root/container/_variable_1/file_subject_realigned.nii
we can use substitutions to clean up the output path.
::

    datasink.inputs.substitutions = [('_variable', 'variable'),
                                     ('file_subject_', '')]
This will rewrite the file as:
::

    /root/container/variable_1/realigned.nii
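The regular-expression variant works the same way but matches patterns rather
than fixed strings; the patterns below are only illustrative:
::

    # Each tuple is (pattern, replacement); patterns are Python regular
    # expressions, applied after the plain substitutions.
    datasink.inputs.regexp_substitutions = [(r'_run_\d+', ''),
                                            (r'//+', '/')]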
.. note::

    In order to figure out which substitutions are needed it is often useful to
    run the workflow on a limited set of iterables and then determine the
    substitutions.
.. include:: ../links_names.txt