.. include:: ../../global.inc
.. include:: manual_chapter_numbers.inc

.. index::
    pair: on_the_fly; Tutorial

.. _new_manual.on_the_fly:

####################################################################################################################################################
|new_manual.on_the_fly.chapter_num|: Esoteric: Generating parameters on the fly with :ref:`@files<decorators.files_on_the_fly>`
####################################################################################################################################################


.. seealso::

   * :ref:`Manual Table of Contents <new_manual.table_of_contents>`
   * :ref:`@files on-the-fly syntax in detail <decorators.files_on_the_fly>`

.. note::

    Remember to look at the example code:

    * :ref:`new_manual.on_the_fly.code`


***********************
Overview
***********************

    The different *Ruffus* :ref:`decorators <decorators>` connect tasks together and
    generate *Output* (file names) from your *Input* in many different ways.

    Sometimes, however, none of them does *quite* what you need, and it becomes
    necessary to generate your own *Input* and *Output* parameters on the fly.

    This additional flexibility comes at the cost of extra code, but you can
    continue to use the rest of *Ruffus* functionality, such as
    checking whether files are up to date.

.. index::
    pair: @files; Tutorial on-the-fly parameter generation


*********************************************************************
:ref:`@files <decorators.files_on_the_fly>` syntax
*********************************************************************
    To generate parameters on the fly, use the :ref:`@files <decorators.files_on_the_fly>`
    decorator with a :term:`generator` function that yields one list or tuple of parameters per job.

    For example:

    .. code-block:: python
        :emphasize-lines: 4,17

        import sys
        from ruffus import *

        # generator function
        def generate_parameters_on_the_fly():
            """
            yields one list of parameters per job
            """
            parameters = [
                                ['A.input', 'A.output', (1, 2)], # 1st job
                                ['B.input', 'B.output', (3, 4)], # 2nd job
                                ['C.input', 'C.output', (5, 6)], # 3rd job
                            ]
            for job_parameters in parameters:
                yield job_parameters

        # tell ruffus that parameters should be generated on the fly
        @files(generate_parameters_on_the_fly)
        def pipeline_task(input, output, extra):
            # copy the input file to the output file
            with open(input) as infile, open(output, "w") as outfile:
                outfile.write(infile.read())
            # report the sum of the two extra parameters
            sys.stderr.write("%d + %d => %d\n" % (extra[0] , extra[1], extra[0] + extra[1]))

        pipeline_run()


    Produces:

        .. code-block:: pycon

            Task = pipeline_task
                1 + 2 => 3
                Job = ["A", 1, 2] completed
                3 + 4 => 7
                Job = ["B", 3, 4] completed
                5 + 6 => 11
                Job = ["C", 5, 6] completed


    .. note::

        Be aware that the parameter generating function may be invoked
        :ref:`more than once<new_manual.dependencies.checking_multiple_times>`:

        * The first time to check if this part of the pipeline is up-to-date.
        * The second time when the pipeline task function is run.

    The resulting custom *inputs* and *outputs* parameters for each job are
    treated normally when checking whether jobs are up-to-date or
    need to be re-run.
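
    Because the generator may be called more than once, it is safest if every call
    yields the same parameters and has no side effects such as creating files.
    The following *Ruffus*-free sketch illustrates the idea. The names ``PARAMETERS``
    and ``call_count`` are hypothetical and exist only for this illustration:

    .. code-block:: python

        # Hypothetical job list; in a real pipeline this might be built
        # from a directory listing or a configuration file.
        PARAMETERS = [
            ['A.input', 'A.output', (1, 2)],
            ['B.input', 'B.output', (3, 4)],
            ['C.input', 'C.output', (5, 6)],
        ]

        call_count = 0  # illustration only: counts how often the generator runs

        def generate_parameters_on_the_fly():
            """Yield one list of parameters per job, identically on every call."""
            global call_count
            call_count += 1
            for job_parameters in PARAMETERS:
                yield job_parameters

        # Simulate two separate invocations: one for the up-to-date check,
        # one for the actual run.
        first_pass = list(generate_parameters_on_the_fly())
        second_pass = list(generate_parameters_on_the_fly())

        # Both passes see exactly the same jobs.
        assert first_pass == second_pass

    A generator that consumed a shared iterator, or that depended on the current
    time, would not pass the final check.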


**********************************************
 A Cartesian Product, all vs all example
**********************************************

    The :ref:`accompanying example<new_manual.on_the_fly.code>` provides a more realistic reason why
    you would want to generate parameters on the fly. It is a fun piece of code, which generates
    N x M combinations from two sets of files as the *inputs* of a pipeline stage.

    The *inputs* / *outputs* filenames are generated as a pair of nested for-loops to produce
    the N (outside loop) x M (inside loop) combinations, with the appropriate parameters
    for each job ``yield``\ed per iteration of the inner loop. The gist of this is:

        .. code-block:: python
            :emphasize-lines: 3

            #_________________________________________________________________________________________
            #
            #   Generator function
            #
            #        N x M jobs
            #_________________________________________________________________________________________
            def generate_simulation_params ():
                """
                Custom function to generate
                file names for gene/gwas simulation study
                """
                for sim_file in get_simulation_files():
                    for (gene, gwas) in get_gene_gwas_file_pairs():
                        result_file = "%s.%s.results" % (gene, sim_file)
                        yield (gene, gwas, sim_file), result_file



            @files(generate_simulation_params)
            def gwas_simulation(input_files, output_file):
                "..."

        If ``get_simulation_files()`` produces:
            ::

                ['a.sim', 'b.sim', 'c.sim']

        and ``get_gene_gwas_file_pairs()`` produces:
            ::

                [('1.gene', '1.gwas'), ('2.gene', '2.gwas')]

        then we would end up with ``3`` x ``2`` = ``6`` jobs and the following equivalent function calls:

            ::

                gwas_simulation(('1.gene', '1.gwas', 'a.sim'), "1.gene.a.sim.results")
                gwas_simulation(('2.gene', '2.gwas', 'a.sim'), "2.gene.a.sim.results")
                gwas_simulation(('1.gene', '1.gwas', 'b.sim'), "1.gene.b.sim.results")
                gwas_simulation(('2.gene', '2.gwas', 'b.sim'), "2.gene.b.sim.results")
                gwas_simulation(('1.gene', '1.gwas', 'c.sim'), "1.gene.c.sim.results")
                gwas_simulation(('2.gene', '2.gwas', 'c.sim'), "2.gene.c.sim.results")


    The :ref:`accompanying code<new_manual.on_the_fly.code>` looks slightly more complicated because
    of some extra bookkeeping.
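
    The expansion into six jobs listed above can be reproduced without *Ruffus*.
    In this minimal sketch, the two helper functions are hypothetical stand-ins that
    return fixed lists; the real helpers would presumably inspect the file system:

    .. code-block:: python

        # Hypothetical stand-ins for the helper functions named above.
        def get_simulation_files():
            return ['a.sim', 'b.sim', 'c.sim']

        def get_gene_gwas_file_pairs():
            return [('1.gene', '1.gwas'), ('2.gene', '2.gwas')]

        def generate_simulation_params():
            """Yield (inputs, output) for each of the N x M combinations."""
            for sim_file in get_simulation_files():              # N = 3
                for (gene, gwas) in get_gene_gwas_file_pairs():  # M = 2
                    result_file = "%s.%s.results" % (gene, sim_file)
                    yield (gene, gwas, sim_file), result_file

        # Collect the parameters that would be passed to each job.
        jobs = list(generate_simulation_params())
        for input_files, output_file in jobs:
            print(input_files, "->", output_file)

    Each element of ``jobs`` is one ``(input_files, output_file)`` pair, matching
    the two arguments of ``gwas_simulation``.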



    You can compare this approach with the alternative of using :ref:`@product <decorators.product>`:

        .. code-block:: python
            :emphasize-lines: 3

            #_________________________________________________________________________________________
            #
            #        N x M jobs
            #_________________________________________________________________________________________
            @product(   os.path.join(simulation_data_dir, "*.simulation"),
                        formatter(),

                        os.path.join(gene_data_dir, "*.gene"),
                        formatter(),

                        # add gwas as an input: looks like *.gene but with a different extension
                        add_inputs("{path[1][0]}/{basename[1][0]}.gwas"),

                        "{basename[0][0]}.{basename[1][0]}.results")    # output file
            def gwas_simulation(input_files, output_file):
                "..."
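
    To get a feel for the file names the format strings in the ``@product`` version
    would expand to, the sketch below imitates the substitution in plain Python.
    It assumes that ``{basename}`` means the file name without directory or extension
    and that ``{path}`` means the containing directory. The file lists are hypothetical
    stand-ins for the two glob patterns, and the code is not how *Ruffus* itself
    implements ``@product``:

    .. code-block:: python

        import os
        from itertools import product

        # Hypothetical file lists standing in for the two glob patterns.
        simulation_files = ["sims/a.simulation", "sims/b.simulation", "sims/c.simulation"]
        gene_files = ["genes/1.gene", "genes/2.gene"]

        def stem(path):
            """File name without directory or extension (assumed meaning of {basename})."""
            return os.path.splitext(os.path.basename(path))[0]

        jobs = []
        for sim_file, gene_file in product(simulation_files, gene_files):
            # assumed equivalent of "{path[1][0]}/{basename[1][0]}.gwas"
            gwas_file = os.path.join(os.path.dirname(gene_file), stem(gene_file) + ".gwas")
            # assumed equivalent of "{basename[0][0]}.{basename[1][0]}.results"
            result_file = "%s.%s.results" % (stem(sim_file), stem(gene_file))
            jobs.append(((sim_file, gene_file, gwas_file), result_file))

    Under these assumptions the result names are built from the bare stems
    (for example ``a.1.results``), whereas the generator version above keeps the
    original extensions in the name.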