.. include:: ../../global.inc
.. include:: manual_chapter_numbers.inc

.. index::
    pair: inputs; Tutorial
    pair: add_inputs; Tutorial
    pair: string substitution for inputs; Tutorial

.. _new_manual.inputs:

###########################################################################################################################################################################################################################################################################################
|new_manual.inputs.chapter_num|: Manipulating task inputs via string substitution using :ref:`inputs() <decorators.inputs>` and  :ref:`add_inputs() <decorators.add_inputs>`
###########################################################################################################################################################################################################################################################################################


.. seealso::

    * :ref:`Manual Table of Contents <new_manual.table_of_contents>`
    * :ref:`inputs() <decorators.inputs>` syntax
    * :ref:`add_inputs() <decorators.add_inputs>` syntax

.. note::

    Remember to look at the example code:

        * :ref:`new_manual.inputs.code`

***********************
Overview
***********************

    The previous chapters have described how *Ruffus* allows the **Output** names for each job
    to be generated from the **Input** names via string substitution. This is how *Ruffus* can
    automatically chain multiple tasks in a pipeline together seamlessly.

    Sometimes it is useful to be able to modify the **Input** by string substitution
    as well. There are two situations where this additional flexibility is needed:

        #. You need to add additional prerequisites or filenames to the **Input** of every single job.
        #. You need to add additional **Input** file names which are some variant of the existing ones.

    Both will be much clearer with some examples.


*******************************************************************************************************************
Adding additional *input* prerequisites per job with :ref:`add_inputs() <decorators.add_inputs>`
*******************************************************************************************************************


===================================================================
1. Example: compiling c++ code
===================================================================

    Let us first compile some c++ (``"*.cpp"``) files using plain :ref:`@transform <decorators.transform>` syntax:

    .. code-block:: python

            # source files exist before our pipeline
            source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]
            for source_file in source_files:
                open(source_file, "w")

            from ruffus import *

            @transform(source_files, suffix(".cpp"), ".o")
            def compile(input_filename, output_file):
                open(output_file, "w")

            pipeline_run()


======================================================================================================================================
2. Example: Adding a common header file with :ref:`add_inputs() <decorators.add_inputs>`
======================================================================================================================================

    .. code-block:: python
        :emphasize-lines: 11,17,19

        # source files exist before our pipeline
        source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]
        for source_file in source_files:
            open(source_file, "w")

        # common (universal) header exists before our pipeline
        open("universal.h", "w")

        from ruffus import *

        # make header files
        @transform(source_files, suffix(".cpp"), ".h")
        def create_matching_headers(input_file, output_file):
            open(output_file, "w")

        @transform(source_files, suffix(".cpp"),
                                # add header to the input of every job
                    add_inputs("universal.h",
                                # add result of task create_matching_headers to the input of every job
                               create_matching_headers),
                    ".o")
        def compile(input_filenames, output_file):
            open(output_file, "w")

        pipeline_run()

    This script gives the following output:

        .. code-block:: pycon

            >>> pipeline_run()
                Job  = [hasty.cpp -> hasty.h] completed
                Job  = [messy.cpp -> messy.h] completed
                Job  = [tasty.cpp -> tasty.h] completed
            Completed Task = create_matching_headers
                Job  = [[hasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> hasty.o] completed
                Job  = [[messy.cpp, universal.h, hasty.h, messy.h, tasty.h] -> messy.o] completed
                Job  = [[tasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> tasty.o] completed
            Completed Task = compile


=====================================================================
3. Example: Additional *Input* can be tasks
=====================================================================

    We can also add a task name to :ref:`add_inputs() <decorators.add_inputs>`.
    This chains the **Output**, i.e. run time results, of any previous task as
    an additional **Input** to every single job in the task.

        .. code-block:: python
            :emphasize-lines: 1,7,9

            # make header files
            @transform(source_files, suffix(".cpp"), ".h")
            def create_matching_headers(input_file, output_file):
                open(output_file, "w")

            @transform(source_files, suffix(".cpp"),
                                    # add header to the input of every job
                        add_inputs("universal.h",
                                    # add result of task create_matching_headers to the input of every job
                                   create_matching_headers),
                        ".o")
            def compile(input_filenames, output_file):
                open(output_file, "w")

            pipeline_run()


        >>> pipeline_run()
            Job  = [[hasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> hasty.o] completed
            Job  = [[messy.cpp, universal.h, hasty.h, messy.h, tasty.h] -> messy.o] completed
            Job  = [[tasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> tasty.o] completed
        Completed Task = compile
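
    Inside the job function, the **Input** arrives as a single list. A minimal sketch of how it
    might be taken apart, assuming the flat list shape shown in the job output above
    (``split_job_inputs`` is a hypothetical helper, not part of *Ruffus*):

    .. code-block:: python

        def split_job_inputs(input_filenames):
            """Separate the original input from the names added by add_inputs()."""
            # the original input comes first; everything after it was added
            source_file, *extra_inputs = input_filenames
            return source_file, extra_inputs

        source_file, extra_inputs = split_job_inputs(
            ["hasty.cpp", "universal.h", "hasty.h", "messy.h", "tasty.h"])
        print(source_file)      # hasty.cpp
        print(extra_inputs)     # ['universal.h', 'hasty.h', 'messy.h', 'tasty.h']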


================================================================================================================================================================================================================================================
4. Example: Add corresponding files using :ref:`add_inputs() <decorators.add_inputs>` with :ref:`formatter <decorators.formatter>` or :ref:`regex <decorators.regex>`
================================================================================================================================================================================================================================================

    The previous example created headers corresponding to our source files and added all of them
    as the **Input** to every compilation job. That is generally not what you want. Instead,
    what is generally needed is a way to

        1) Look up the exact corresponding header for the *specific* job, and not add all
           possible files to all jobs in a task. When compiling ``hasty.cpp``, we just need
           to add ``hasty.h`` (and ``universal.h``).
        2) Add a pre-existing file name (``hasty.h`` already exists. Don't create it via
           another task.)

    This is a surprisingly common requirement: In bioinformatics sometimes DNA or RNA
    sequence files come singly in  `*.fastq  <http://en.wikipedia.org/wiki/FASTQ_format>`__
    and sometimes in `matching pairs  <http://en.wikipedia.org/wiki/DNA_sequencing_theory#Pairwise_end-sequencing>`__:
    ``*1.fastq, *2.fastq`` etc. In the latter case, we often need to make sure that both
    sequence files are being processed in tandem. One way is to take one file name (``*1.fastq``)
    and look up the other.

    :ref:`add_inputs() <decorators.add_inputs>` uses standard *Ruffus* string substitution
    via :ref:`formatter <decorators.formatter>` and :ref:`regex <decorators.regex>` to look up (generate) **Input** file names.
    (As a rule, :ref:`suffix <decorators.suffix>` only substitutes **Output** file names.)

    .. code-block:: python
        :emphasize-lines: 3,5

        @transform( source_files,
                    formatter(".cpp$"),
                                # corresponding header for each source file
                    add_inputs("{basename[0]}.h",
                               # add header to the input of every job
                               "universal.h"),
                    "{basename[0]}.o")
        def compile(input_filenames, output_file):
            open(output_file, "w")

    This script gives the following output:

        .. code-block:: pycon

            >>> pipeline_run()
                Job  = [[hasty.cpp, hasty.h, universal.h] -> hasty.o] completed
                Job  = [[messy.cpp, messy.h, universal.h] -> messy.o] completed
                Job  = [[tasty.cpp, tasty.h, universal.h] -> tasty.o] completed
            Completed Task = compile
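
    The lookup above can be approximated in plain Python. This is only a sketch of the
    substitution, not how *Ruffus* implements it: ``expand_formatter`` and ``expand_regex``
    are hypothetical helpers, and the regular expression ``(.+)\.cpp$`` is an assumed pattern
    chosen for illustration.

    .. code-block:: python

        import os
        import re

        def expand_formatter(pattern, input_filename):
            """Fill in ``{basename[0]}`` the way the example above uses it."""
            # file name without directory or extension, e.g. "hasty" for "src/hasty.cpp"
            basename = os.path.splitext(os.path.basename(input_filename))[0]
            # a one-item list, so that "{basename[0]}" picks out this input
            return pattern.format(basename=[basename])

        def expand_regex(regex_str, replacement, input_filename):
            """Generate a name by regular expression substitution."""
            return re.sub(regex_str, replacement, input_filename)

        for source_file in ["hasty.cpp", "messy.cpp", "tasty.cpp"]:
            job_inputs = [source_file,
                          expand_formatter("{basename[0]}.h", source_file),
                          "universal.h"]
            print(job_inputs, "->", expand_formatter("{basename[0]}.o", source_file))

        # the same header name, generated with a regular expression instead
        print(expand_regex(r"(.+)\.cpp$", r"\1.h", "hasty.cpp"))    # hasty.h

    Either way, only the name is generated: the header itself must already exist, or be
    produced by an earlier task, before the job runs.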


********************************************************************************
Replacing all input parameters with :ref:`inputs() <decorators.inputs>`
********************************************************************************

    The previous examples all *added* to the set of **Input** file names.
    Sometimes it is necessary to replace all the **Input** parameters altogether.
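
    The difference between adding and replacing can be modelled in a few lines of plain Python.
    This is a sketch of the resulting job parameters only, with hypothetical helper names;
    it does not call *Ruffus*.

    .. code-block:: python

        def model_add_inputs(original_input, *extra_inputs):
            """add_inputs(): keep the original input and append the extras."""
            return [original_input, *extra_inputs]

        def model_inputs(original_input, replacement):
            """inputs(): discard the original input and use the replacement."""
            return replacement

        print(model_add_inputs("hasty.cpp", "hasty.h", "universal.h"))
        # ['hasty.cpp', 'hasty.h', 'universal.h']
        print(model_inputs("hasty.cpp", "hasty.py"))
        # hasty.py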

================================================================================================================================================================================================================================================
5. Example: Running matching python scripts using :ref:`inputs() <decorators.inputs>`
================================================================================================================================================================================================================================================

    Here is a contrived example: we wish to find all cython/python files which have been
    compiled into corresponding c++ source files.
    Instead of compiling the c++, we shall invoke the corresponding python scripts.

    Given three c++ files and their corresponding python scripts:

        .. code-block:: python
            :emphasize-lines: 4

            @transform( source_files,
                        formatter(".cpp$"),

                        # corresponding python file for each source file
                        inputs("{basename[0]}.py"),

                        "{basename[0]}.results")
            def run_corresponding_python(input_filenames, output_file):
                open(output_file, "w")


    *Ruffus* will run one job per c++ source file, each taking the corresponding python script as its only **Input**:

        .. code-block:: pycon

            >>> pipeline_run()
                Job  = [hasty.py -> hasty.results] completed
                Job  = [messy.py -> messy.results] completed
                Job  = [tasty.py -> tasty.results] completed
            Completed Task = run_corresponding_python
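
    The job function above only creates an empty results file. A minimal sketch of a body that
    actually invokes the script, assuming each script writes its results to standard output
    (``run_script`` is a hypothetical helper, not part of *Ruffus*):

    .. code-block:: python

        import subprocess
        import sys

        def run_script(script_file, output_file):
            """Run ``script_file`` and save whatever it prints in ``output_file``."""
            with open(output_file, "w") as results:
                # sys.executable is the interpreter running this pipeline;
                # check=True raises CalledProcessError if the script fails
                subprocess.run([sys.executable, script_file],
                               stdout=results, check=True)

    Calling ``run_script(input_filenames, output_file)`` from ``run_corresponding_python``
    would then fill ``hasty.results`` with the output of ``hasty.py``, and so on. Raising on a
    non-zero exit code means a failing script is not silently treated as a finished job.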