File: faq.rst

.. include:: global.inc
#############
FAQ
#############

**********************************************************
Citations
**********************************************************

===============================================================
Q. How should *Ruffus* be cited in academic publications?
===============================================================

    The official publication describing the original version of *Ruffus* is:

        `Leo Goodstadt (2010) <http://bioinformatics.oxfordjournals.org/content/early/2010/09/16/bioinformatics.btq524>`_ : **Ruffus: a lightweight Python library for computational pipelines.** *Bioinformatics* 26(21): 2778-2779


**********************************************************
Good practices
**********************************************************

==================================================================================================================
Q. What is the best way of keeping my data and workings separate?
==================================================================================================================

    It is good practice to run your pipeline in a temporary, "working" directory away from your original data.

    The first step of your pipeline might be to make softlinks to your original data in your working directory.
    Here is some example (deliberately paranoid) code to do just this:

    .. code-block:: python
        :emphasize-lines: 3,5

            def re_symlink (input_file, soft_link_name, logger, logging_mutex):
                """
                Helper function: relinks soft symbolic link if necessary
                """
                # Guard against soft-linking a file to itself: deleting the original data would have disastrous consequences!!
                if input_file == soft_link_name:
                    logger.debug("Warning: No symbolic link made. You are using the original data directory as the working directory.")
                    return
                # Soft link already exists: delete for relink?
                if os.path.lexists(soft_link_name):
                    # do not delete or overwrite real (non-soft link) file
                    if not os.path.islink(soft_link_name):
                        raise Exception("%s exists and is not a link" % soft_link_name)
                    try:
                        os.unlink(soft_link_name)
                    except:
                        with logging_mutex:
                            logger.debug("Can't unlink %s" % (soft_link_name))
                with logging_mutex:
                    logger.debug("os.symlink(%s, %s)" % (input_file, soft_link_name))
                #
                #   symbolic link relative to original directory so that the entire path
                #       can be moved around without breaking everything
                #
                os.symlink( os.path.relpath(os.path.abspath(input_file),
                            os.path.abspath(os.path.dirname(soft_link_name))), soft_link_name)

            #
            #   First task should soft link data to working directory
            #
            @jobs_limit(1)
            @mkdir(options.working_dir)
            @transform( input_files,
                        formatter(),
                        # move to working directory
                        os.path.join(options.working_dir, "{basename[0]}{ext[0]}"),
                        logger, logging_mutex
                       )
            def soft_link_inputs_to_working_directory (input_file, soft_link_name, logger, logging_mutex):
                """
                Make soft link in working directory
                """
                with logging_mutex:
                    logger.info("Linking files %(input_file)s -> %(soft_link_name)s\n" % locals())
                re_symlink(input_file, soft_link_name, logger, logging_mutex)


.. _faq.paired_files:

==================================================================================================================
Q. What is the best way of handling data in file pairs (or triplets etc.)?
==================================================================================================================


    In Bioinformatics, DNA data often consists of only the nucleotide sequence at the two ends of larger fragments.
    The `paired_end  <http://www.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.ilmn>`__ or
    `mate pair  <http://en.wikipedia.org/wiki/Shotgun_sequencing#Whole_genome_shotgun_sequencing>`__ data frequently
    consists of file pairs with conveniently related names such as "*.R1.fastq" and "*.R2.fastq".

    At some point in the data pipeline, these file pairs or triplets must find each other and be analysed in the same job.

    Provided these file pairs or triplets are named consistently, the easiest way to regroup them is to use the
    Ruffus :ref:`@collate <new_manual.collate>` decorator. For example:


    .. code-block:: python

        @collate(original_data_files,

                    # match file name up to the "R1.fastq.gz"
                    formatter("([^/]+)R[12].fastq.gz$"),

                    # Create output parameter supplied to next task
                    "{path[0]}/{1[0]}.sam",
                    logger, logger_mutex)
        def handle_paired_end(input_files, output_paired_files, logger, logger_mutex):
            # check that we really have a pair of two files not an orphaned singleton
            if len(input_files) != 2:
                raise Exception("One of read pairs %s missing" % (input_files,))

            # do stuff here



    This (incomplete, untested) :ref:`example code <faq.paired_files.code>` shows what this would look like *in vivo*.



**********************************************************
General
**********************************************************

=========================================================
Q. *Ruffus* won't create dependency graphs
=========================================================

    A. You need to have installed ``dot`` from `Graphviz <http://www.graphviz.org/>`_ to produce
    pretty flowcharts like this:

            .. image:: images/pretty_flowchart.png
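
    Once ``dot`` is installed and on your path, you can write a flowchart directly from your script. A minimal sketch (``final_task`` stands in for the last task of your own pipeline):

    .. code-block:: python

        from ruffus import *

        # draw everything needed to bring final_task up to date
        pipeline_printout_graph("flowchart.svg",    # output file name
                                "svg",              # output format: svg / png / jpg / dot ...
                                [final_task])       # target task(s)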




=========================================================
Q. *Ruffus* seems to be hanging in the same place
=========================================================

    A. If *ruffus* is interrupted, for example, by a Ctrl-C,
    you will often find the following lines of code highlighted::

        File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 1904, in pipeline_run
        File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 1380, in run_all_jobs_in_task
        File "/xxxx/python2.6/multiprocessing/pool.py", line 507, in next
          self._cond.wait(timeout)
        File "/xxxxx/python2.6/threading.py", line 237, in wait
          waiter.acquire()

    This is *not* where *ruffus* is hanging, but rather the boundary between the main programme process
    and the sub-processes which run *ruffus* jobs in parallel.

    This is naturally where broken execution threads wash up.




=========================================================
Q. Regular expression substitutions don't work
=========================================================

    A. If you are using the special regular expression forms ``"\1"``, ``"\2"`` etc.
    to refer to matching groups, remember to 'escape' the substitution pattern string.
    The best option is to use `'raw' python strings <http://docs.python.org/library/re.html>`_.
    For example:

        ::

            r"\1_substitutes\2correctly\3four\4times"

    Ruffus will throw an exception if it sees an unescaped ``"\1"`` or ``"\2"`` in a file name.
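
    For example, a minimal sketch of a raw string substitution in a :ref:`@transform <decorators.transform>` (the file names are invented for illustration):

    .. code-block:: python

        from ruffus import *

        # r"\1" refers back to the first group matched by regex()
        @transform(["sample1.txt", "sample2.txt"],
                   regex(r"(.+)\.txt$"),
                   r"\1.processed")
        def process_samples(input_file, output_file):
            pass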

========================================================================================
Q. How to force a pipeline to appear up to date?
========================================================================================

    *I have made a trivial modification to one of my data files and now Ruffus wants to rerun my month long pipeline. How can I convince Ruffus that everything is fine and to leave things as they are?*

    The standard way to do what you are trying to do is to touch all the files downstream...
    That way the modification times of your analysis files would postdate your existing files.
    You can do this manually but Ruffus also provides direct support:

    .. code-block:: python

        pipeline_run (touch_files_only = True)

    pipeline_run will traverse your pipeline normally, stepping over up-to-date tasks and starting
    with jobs which look out of date. However, none of your pipeline task functions
    will actually be called. Instead, each out-of-date file is `touch  <https://en.wikipedia.org/wiki/Touch_(Unix)>`__-ed in
    turn so that the file modification dates follow on successively.

    See the documentation for :ref:`pipeline_run() <pipeline_functions.pipeline_run>`

    It is even simpler if you are using the new Ruffus.cmdline support from version 2.4. You can just type

        .. code-block:: bash

            your_script.py --touch_files_only [--other_options_of_your_own_etc]

    See :ref:`command line <new_manual.cmdline>` documentation.
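
    A minimal sketch of the ``Ruffus.cmdline`` boilerplate which makes ``--touch_files_only`` (and friends) available to your script:

    .. code-block:: python

        from ruffus import *
        from ruffus import cmdline

        parser = cmdline.get_argparse(description = "My pipeline")
        options = parser.parse_args()

        #  optional logger which can be passed to ruffus tasks
        logger, logger_mutex = cmdline.setup_logging(__name__, options.log_file, options.verbose)

        #   ... pipelined task functions go here ...

        # obeys --touch_files_only, --just_print, --verbose etc.
        cmdline.run(options)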

========================================================================================
Q. How can I use my own decorators with Ruffus?
========================================================================================

(Thanks to Radhouane Aniba for contributing to this answer.)

A. With care! If the following two points are observed:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1. Use `@wraps <https://docs.python.org/2/library/functools.html#functools.wraps>`__  from ``functools`` or Michele Simionato's `decorator <https://pypi.python.org/pypi/decorator>`__ module
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    These will automatically forward attributes from the task function correctly:

    * ``__name__`` and ``__module__`` are used to identify functions uniquely in a Ruffus pipeline, and
    * ``pipeline_task`` is used to hold per task data

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2. Always call Ruffus decorators first before your own decorators.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Otherwise, your decorator will be ignored.

    So this works:

    .. code-block:: python

      @follows(prev_task)
      @custom_decorator(something)
      def test():
          pass

    This is a bit futile:

    .. code-block:: python

        # ignore @custom_decorator
        @custom_decorator(something)
        @follows(prev_task)
        def test():
            pass


    This order dependency is an unfortunate quirk of how python decorators work. The last (rather futile)
    piece of code is equivalent to:

    .. code-block:: python

        test = custom_decorator(something)(ruffus.follows(prev_task)(test))

    Unfortunately, Ruffus has no idea that someone else (``custom_decorator``) is also modifying the ``test()`` function
    after it (``ruffus.follows``) has had its go.



------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Example decorator:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Let us look at a decorator to time jobs:

    .. code-block:: python

        import sys, time
        def time_func_call(func, stream, *args, **kwargs):
            """prints elapsed time to standard out, or any other file-like object with a .write() method.
            """
            start = time.time()
            # Run the decorated function.
            ret = func(*args, **kwargs)
            # Stop the timer.
            end = time.time()
            elapsed = end - start
            stream.write("{} took {} seconds\n".format(func.__name__, elapsed))
            return ret


        from ruffus import *
        import sys
        import time

        @time_job(sys.stderr)
        def first_task():
            print "First task"


        @follows(first_task)
        @time_job(sys.stderr)
        def second_task():
            print "Second task"


        @follows(second_task)
        @time_job(sys.stderr)
        def final_task():
            print "Final task"

        pipeline_run()


    What would ``@time_job`` look like?

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1. Using functools `@wraps <https://docs.python.org/2/library/functools.html#functools.wraps>`__
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


    .. code-block:: python

        import functools
        def time_job(stream=sys.stdout):
            def actual_time_job(func):
                @functools.wraps(func)
                def wrapper(*args, **kwargs):
                    return time_func_call(func, stream, *args, **kwargs)
                return wrapper
            return actual_time_job

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2. Using Michele Simionato's `decorator <https://pypi.python.org/pypi/decorator>`__ module
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


    .. code-block:: python

        import decorator
        def time_job(stream=sys.stdout):
            def time_job(func, *args, **kwargs):
                return time_func_call(func, stream, *args, **kwargs)
            return decorator.decorator(time_job)


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3. By hand, using a `callable object <https://docs.python.org/2/reference/datamodel.html#emulating-callable-objects>`__
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


    .. code-block:: python

        class time_job(object):
            def __init__(self, stream=sys.stdout):
                self.stream = stream
            def __call__(self, func):
                def inner(*args, **kwargs):
                    return time_func_call(func, self.stream, *args, **kwargs)
                # remember to forward __name__
                inner.__name__ = func.__name__
                inner.__module__ = func.__module__
                inner.__doc__ = func.__doc__
                if hasattr(func, "pipeline_task"):
                    inner.pipeline_task = func.pipeline_task
                return inner





========================================================================================
Q. Can a task function in a *Ruffus* pipeline be called normally outside of Ruffus?
========================================================================================
    A. Yes. Most python decorators wrap themselves around a function. However, *Ruffus* leaves the
    original function untouched and unwrapped. Instead, *Ruffus* adds a ``pipeline_task`` attribute
    to the task function to signal that this is a pipelined function.

    This means the original task function can be called just like any other python function.
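
    For example, a minimal sketch (the file names are invented for illustration):

    .. code-block:: python

        from ruffus import *

        @transform("data.txt", suffix(".txt"), ".counted")
        def count_lines(input_file, output_file):
            with open(output_file, "w") as oo:
                oo.write("%d\n" % sum(1 for line in open(input_file)))

        # called directly, outside of pipeline_run(), with whatever
        # parameters you like, just as if it were undecorated
        count_lines("other_data.txt", "other_data.counted")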

=====================================================================================================================
Q. My *Ruffus* tasks create two files at a time. Why is the second one ignored in successive stages of my pipeline?
=====================================================================================================================
    *This is my code:*

        ::

            from ruffus import *
            import sys
            @transform("start.input", regex(".+"), ("first_output.txt", "second_output.txt"))
            def task1(i,o):
                pass

            @transform(task1, suffix(".txt"), ".result")
            def task2(i, o):
                pass

            pipeline_printout(sys.stdout, [task2], verbose=3)

        ::

            ________________________________________
            Tasks which will be run:

            Task = task1
                   Job = [start.input
                         ->[first_output.txt, second_output.txt]]

            Task = task2
                   Job = [[first_output.txt, second_output.txt]
                         ->first_output.result]

            ________________________________________

    A. This code produces a single output consisting of a tuple of 2 files. In fact, you want two
    outputs, each consisting of 1 file.

    You want a single job (single input) to produce multiple outputs (multiple jobs
    in downstream tasks). This is a one-to-many operation which calls for
    :ref:`@split <decorators.split>`:

        ::

            from ruffus import *
            import sys
            @split("start.input", ("first_output.txt", "second_output.txt"))
            def task1(i,o):
                pass

            @transform(task1, suffix(".txt"), ".result")
            def task2(i, o):
                pass

            pipeline_printout(sys.stdout, [task2], verbose=3)

        ::

            ________________________________________
            Tasks which will be run:

            Task = task1
                   Job = [start.input
                         ->[first_output.txt, second_output.txt]]

            Task = task2
                   Job = [first_output.txt
                         ->first_output.result]
                   Job = [second_output.txt
                         ->second_output.result]

            ________________________________________


=======================================================================================
Q. How can a *Ruffus* task produce output which goes off in different directions?
=======================================================================================

    A. As above, anytime there is a situation which requires a one-to-many operation, you should reach
    for :ref:`@subdivide <decorators.subdivide>`. The advanced form takes a regular expression, making
    it easier to produce multiple derivatives of the input file. The following example subdivides
    *2* jobs each into *3*, so that the subsequent task will run *2* x *3* = *6* jobs.

        ::

            from ruffus import *
            import sys
            @subdivide(["1.input_file",
                        "2.input_file"],
                        regex(r"(.+).input_file"),      # match file prefix
                       [r"\1.file_type1",
                        r"\1.file_type2",
                        r"\1.file_type3"])
            def split_task(input, output):
               pass


            @transform(split_task, regex("(.+)"), r"\1.test")
            def test_split_output(i, o):
               pass

            pipeline_printout(sys.stdout, [test_split_output], verbose = 3)

        Each of the original 2 files has been split into three so that ``test_split_output`` will run
        6 jobs simultaneously.

            ::

                ________________________________________
                Tasks which will be run:

                Task = split_task
                       Job = [1.input_file ->[1.file_type1, 1.file_type2, 1.file_type3]]
                       Job = [2.input_file ->[2.file_type1, 2.file_type2, 2.file_type3]]

                Task = test_split_output
                       Job = [1.file_type1 ->1.file_type1.test]
                       Job = [1.file_type2 ->1.file_type2.test]
                       Job = [1.file_type3 ->1.file_type3.test]
                       Job = [2.file_type1 ->2.file_type1.test]
                       Job = [2.file_type2 ->2.file_type2.test]
                       Job = [2.file_type3 ->2.file_type3.test]
                ________________________________________



=======================================================================================
Q. Can I call extra code before each job?
=======================================================================================
    A. This is easily accomplished by hijacking the process
    for checking if jobs are up to date or not (:ref:`@check_if_uptodate <decorators.check_if_uptodate>`):

        ::

            from ruffus import *
            import sys

            def run_this_before_each_job (*args):
                print "Calling function before each job using these args", args
                # Remember to delegate to the default *Ruffus* code for checking if
                #   jobs need to run.
                return needs_update_check_modify_time(*args)

            @check_if_uptodate(run_this_before_each_job)
            @files([[None, "a.1"], [None, "b.1"]])
            def task_func(input, output):
                pass

            pipeline_printout(sys.stdout, [task_func])

        This results in:
        ::

            ________________________________________
            >>> pipeline_run([task_func])
            Calling function before each job using these args (None, 'a.1')
            Calling function before each job using these args (None, 'a.1')
            Calling function before each job using these args (None, 'b.1')
                Job = [None -> a.1] completed
                Job = [None -> b.1] completed
            Completed Task = task_func

        .. note::

            Because ``run_this_before_each_job(...)`` is called whenever *Ruffus* checks to see if
            a job is up to date or not, the function may be called twice for some jobs
            (e.g. ``(None, 'a.1')`` above).


=========================================================================================================
Q. Does *Ruffus* allow checkpointing: to distinguish interrupted and completed results?
=========================================================================================================

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
A. Use the builtin sqlite checkpointing
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


    By default, ``pipeline_run(...)`` will save the timestamps for output files from successfully run jobs to an sqlite database file (``.ruffus_history.sqlite``) in the current directory.

    * If you are using ``Ruffus.cmdline``, you can change the checksum / timestamp database file name on the command line using  ``--checksum_file_name NNNN``


    The level of timestamping / checksumming can be set via the ``checksum_level`` parameter:

    .. code-block:: python

        pipeline_run(..., checksum_level = N, ...)

    where the default is 1::

       level 0 : Use only file timestamps
       level 1 : above, plus timestamp of successful job completion
       level 2 : above, plus a checksum of the pipeline function body
       level 3 : above, plus a checksum of the pipeline function default arguments and the additional arguments passed in by task decorators
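
    For example, a minimal sketch which trusts file modification times only and (as an assumption about recent Ruffus versions) names the history database explicitly via the ``history_file`` parameter:

    .. code-block:: python

        pipeline_run(checksum_level = 0,    # behave like classical make: file timestamps only
                     history_file   = ".my_pipeline.ruffus_history.sqlite")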

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
A. Use a flag file
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    When gmake is interrupted, it will delete the target file it is updating so that the target is
    remade from scratch when make is next run. Ruffus, by design, does not do this because, more often than
    not, the partial / incomplete file may be useful, if only to reveal, for example, what caused the interrupting error
    or exception. It also seems a bit too clever and underhand to go behind the programmer's back to delete files...

    A common *Ruffus* convention is to create an empty checkpoint or "flag" file whose sole purpose
    is to record a modification time and the successful completion of a job.

    This is what a task with a completion flag would look like:

    ::

        #
        #   Assuming a pipelined task function named "stage1"
        #
        @transform(stage1, suffix(".stage1"), [".stage2", ".stage2_finished"] )
        def stage2 (input_files, output_files):
            task_output_file, flag_file = output_files
            cmd = ("do_something2 %(input_file)s >| %(task_output_file)s ")
            cmd = cmd % {
                            "input_file":               input_files[0],
                            "task_output_file":         task_output_file
                        }
            if not os.system( cmd ):
                #88888888888888888888888888888888888888888888888888888888888888888888888888888
                #
                #   It worked: Create completion flag_file
                #
                open(flag_file, "w")
                #
                #88888888888888888888888888888888888888888888888888888888888888888888888888888


    The flag files ``xxx.stage2_finished`` indicate that each job has finished. If a flag file is missing,
    ``xxx.stage2`` is only a partial, interrupted result.


    The only thing to be aware of is that the flag file will appear in the list of inputs of the
    downstream task, which should accordingly look like this:


    ::

        @transform(stage2, suffix(".stage2"), [".stage3", ".stage3_finished"] )
        def stage3 (input_files, output_files):

            #888888888888888888888888888888888888888888888888888888888888888888888888888888888
            #
            #   Note that the first parameter is a LIST of input files, the last of which
            #       is the flag file from the previous task which we can ignore
            #
            input_file, previous_flag_file  = input_files
            #
            #888888888888888888888888888888888888888888888888888888888888888888888888888888888
            task_output_file, flag_file     = output_files
            cmd = ("do_something3 %(input_file)s >| %(task_output_file)s ")
            cmd = cmd % {
                            "input_file":               input_file,
                            "task_output_file":         task_output_file
                        }
            # completion flag file for this task
            if not os.system( cmd ):
                open(flag_file, "w")


    The :ref:`Bioinformatics example<examples_bioinformatics_part2.step2>` contains :ref:`code <examples_bioinformatics_part2_code>` for checkpointing.


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
A. Use a temp file
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Thanks to Martin Goodson for suggesting this and providing an example. In his words:

    "I normally use a decorator to create a temporary file which is only renamed after the task has completed without any problems. This seems a more elegant solution to the problem:"


        .. code-block:: python

            import os
            from functools import wraps

            def usetemp(task_func):
                """ Decorate a function to write to a tmp file and then rename it. So half finished tasks cannot create up to date targets.
                """
                @wraps(task_func)
                def wrapper_function(*args, **kwargs):
                    args = list(args)
                    outnames = args[1]
                    if not isinstance(outnames, basestring) and hasattr(outnames, '__getitem__'):
                        # several output files: write each to a ".tmp" name first
                        tmpnames = [str(x) + ".tmp" for x in outnames]
                        args[1] = tmpnames
                        task_func(*args, **kwargs)
                        try:
                            for tmp, name in zip(tmpnames, outnames):
                                if os.path.exists(tmp):
                                    os.rename(tmp, str(name))
                        except BaseException as e:
                            # clean up any renamed outputs before re-raising
                            for name in outnames:
                                if os.path.exists(name):
                                    os.remove(name)
                            raise e
                    else:
                        # single output file
                        tmp = str(outnames) + '.tmp'
                        args[1] = tmp
                        task_func(*args, **kwargs)
                        os.rename(tmp, str(outnames))
                return wrapper_function


    Use like this:

        .. code-block:: python

            @files(None, 'client1.price')
            @usetemp
            def getusers(inputfile, outputname):
                #**************************************************
                # code goes here
                # outputname now refers to temporary file
                pass





**********************************************************
Windows
**********************************************************

=========================================================
Q. Windows seems to spawn *ruffus* processes recursively
=========================================================

    A. It is necessary to protect the "entry point" of the program under Windows.
    Otherwise, a new process will be started each time the main module is imported
    by a new Python interpreter as an unintended side effect, causing a cascade
    of new processes.

    See: http://docs.python.org/library/multiprocessing.html#multiprocessing-programming

    This code works::

        if __name__ == '__main__':
            try:
                pipeline_run([parallel_task], multiprocess = 5)
            except Exception as e:
                print e.args



**********************************************************
Sun Grid Engine / PBS / SLURM etc
**********************************************************

==========================================================================================================================================
Q. Can Ruffus be used to manage a cluster or grid based pipeline?
==========================================================================================================================================
    A. Only minimal modifications have to be made to your *Ruffus* script to allow it to submit jobs to a cluster.

    See :ref:`ruffus.drmaa_wrapper <new_manual.ruffus.drmaa_wrapper.run_job>`

    Thanks to Andreas Heger and others at CGAT and Bernie Pope for contributing ideas and code.
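
    A minimal sketch of what a cluster-submitting task can look like, assuming the python ``drmaa`` bindings are installed (the command string and file names are invented for illustration):

    .. code-block:: python

        from ruffus import *
        from ruffus.drmaa_wrapper import run_job, error_drmaa_job
        import drmaa

        # a single drmaa session shared by all jobs
        drmaa_session = drmaa.Session()
        drmaa_session.initialize()

        @transform(["1.input", "2.input"], suffix(".input"), ".output")
        def run_on_cluster(input_file, output_file):
            stdout_res, stderr_res = run_job(cmd_str       = "do_something %s > %s" % (input_file, output_file),
                                             job_name      = "do_something",
                                             drmaa_session = drmaa_session,
                                             run_locally   = False)

        # use multithread, not multiprocess, when submitting jobs via drmaa
        pipeline_run(multithread = 5)
        drmaa_session.exit()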


==========================================================================================================================================
Q. When I submit lots of jobs via Sun Grid Engine (SGE), the head node occasionally freezes and dies
==========================================================================================================================================

    A. You need to use multithreading rather than multiprocessing. See :ref:`ruffus.drmaa_wrapper <new_manual.ruffus.drmaa_wrapper.run_job>`


=====================================================================
Q. Keeping large intermediate files
=====================================================================

    Sometimes pipelines create a large number of intermediate files which might not be needed later.

    Unfortunately, the current design of *Ruffus* requires these files to hang around otherwise the pipeline
    will not know that it ran successfully.

    We have some tentative plans to get around this but in the meantime, Bernie Pope suggests
    truncating intermediate files in place, preserving timestamps::


        import os

        # truncate a file to zero bytes, and preserve its original modification time
        def zeroFile(file):
            if os.path.exists(file):
                # save the current time of the file
                timeInfo = os.stat(file)
                try:
                    f = open(file,'w')
                except IOError:
                    pass
                else:
                    f.truncate(0)
                    f.close()
                    # change the time of the file back to what it was
                    os.utime(file,(timeInfo.st_atime, timeInfo.st_mtime))
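
    A minimal sketch of how this might be wired into a pipeline, using :ref:`@posttask <decorators.posttask>` to truncate the intermediate files once the downstream task has finished (the file names are invented for illustration):

    .. code-block:: python

        def zero_intermediate_files():
            for intermediate_file in ["1.stage1", "2.stage1"]:
                zeroFile(intermediate_file)

        @posttask(zero_intermediate_files)
        @transform(stage1, suffix(".stage1"), ".stage2")
        def stage2(input_file, output_file):
            pass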

**********************************************************************************
Sharing python objects between Ruffus processes running concurrently
**********************************************************************************

    The design of Ruffus envisages that much of the data in a pipeline flows through files, but it is also possible to pass python objects in memory.

    Ruffus uses the `multiprocessing  <http://docs.python.org/2/library/multiprocessing.html>`_ module and much of the following is a summary of what is covered
    in depth in the Python Standard Library `Documentation  <http://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes>`_.

    Running Ruffus using ``pipeline_run(..., multiprocess = NNN)`` where ``NNN`` > 1 runs each job concurrently on up to ``NNN`` separate local processes.
    Each task function runs independently in a different python interpreter, possibly on a different CPU, in the most efficient way.
    However, this does mean we have to pay some attention to how data is sent across process boundaries (unlike the situation with ``pipeline_run(..., multithread = NNN)`` ).

    The python code and data which comprises your multitasking Ruffus job is sent to a separate process in three ways:

    #. The python function code and data objects are `pickled  <http://docs.python.org/2/library/pickle.html>`__, i.e. converted into a byte stream by the master process and sent to the remote process,
       before being converted back into normal python objects (unpickling).
    #. The parameters for your jobs, i.e. what Ruffus calls your task functions with, are separately `pickled  <http://docs.python.org/2/library/pickle.html>`__ and sent to the remote process via
       `multiprocessing.Queue  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Queue>`_
    #. You can share and synchronise other data yourselves. The canonical example is the logger provided by ``Ruffus.cmdline.setup_logging``

    .. note::

        Check that your function code and data can be `pickled  <http://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled>`_.

        Only functions, built-in functions and classes defined at the top level of a module are picklable.


    The following answers are a short "how-to" for sharing and synchronising data yourselves.


==============================================================================
Can ordinary python objects be shared between processes?
==============================================================================

    A. Objects which can be `pickled  <http://docs.python.org/2/library/pickle.html>`__ can be shared as is. These include

        * numbers
        * strings
        * tuples, lists, sets, and dictionaries containing only objects which can be `pickled  <http://docs.python.org/2/library/pickle.html>`__.

    #. If these do not change during your pipeline, you can just use them without any further effort in your task.
    #. If you need to use the value at the point when the task function is *called*, then you need to pass the python object as parameters to your task.
       For example:

       .. code-block:: python
           :emphasize-lines: 1

            # changing_list changes...
            @transform(previous_task, suffix(".foo"), ".bar", changing_list)
            def next_task(input_file, output_file, changing_list):
                pass

    #. If you need to use the value when the task function is *run* then see :ref:`the following answer. <how-about-synchronising-python-objects-in-real-time>`.


================================================================================================
Why am I getting ``PicklingError``?
================================================================================================

    What is happening? Didn't `Joan of Arc  <https://en.wikipedia.org/wiki/Battle_of_the_Herrings>`_ solve this once and for all?

    A. Some of the data or code in your function cannot be `pickled  <http://docs.python.org/2/library/pickle.html>`__ and is being sent by python ``multiprocessing`` across process boundaries.


        When you run your pipeline using multiprocess:

            .. code-block:: python

                pipeline_run([], verbose = 5, multiprocess = 5, logger = ruffusLoggerProxy)

        You will get the following errors:

            .. code-block:: python

                Exception in thread Thread-2:
                Traceback (most recent call last):
                  File "/path/to/python/python2.7/threading.py", line 808, in __bootstrap_inner
                    self.run()
                  File "/path/to/python/python2.7/threading.py", line 761, in run
                    self.__target(*self.__args, **self.__kwargs)
                  File "/path/to/python/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
                    put(task)
                PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed


        which go away when you set ``pipeline_run([], multiprocess = 1, ...)``




    Unfortunately, pickling errors are particularly ill-served by standard python error messages. The only really good advice is to take the offending
    code and try to `pickle  <http://docs.python.org/2/library/pickle.html>`__ it yourself to narrow down the errors. Check your objects against the list
    in the `pickle  <http://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled>`_ module.
    Watch out especially for nested functions. These will have to be moved to file scope.
    Other objects may have to be passed by proxy (see below).
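
    For example, a minimal sketch of checking picklability by hand (``my_task_parameter`` stands in for whichever object you suspect):

    .. code-block:: python

        import pickle

        try:
            # if this raises, the object cannot cross process boundaries as-is
            pickle.dumps(my_task_parameter)
        except (pickle.PicklingError, TypeError) as e:
            print "Cannot be pickled:", e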


.. _how-about-synchronising-python-objects-in-real-time:

================================================================================================
How about synchronising python objects in real time?
================================================================================================

    A. You can use managers and proxy objects from the `multiprocessing  <http://docs.python.org/library/multiprocessing.html>`__ module.

    The underlying python object would be owned and managed by a (hidden) server process. Other processes can access the shared objects transparently by using proxies. This is how the logger provided by
    ``Ruffus.cmdline.setup_logging`` works:

    .. code-block:: python

        #  optional logger which can be passed to ruffus tasks
        logger, logger_mutex = cmdline.setup_logging (__name__, options.log_file, options.verbose)

    ``logger`` is a proxy for the underlying python `logger  <http://docs.python.org/2/library/logging.html>`__ object, and it can be shared freely between processes.

    The best course is to pass ``logger`` as a parameter to a *Ruffus* task.

    The only caveat is that we should make sure multiple jobs are not writing to the log at the same time. To synchronise logging, we use a proxy to a non-reentrant `multiprocessing.Lock <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Lock>`_.

    .. code-block:: python

        logger, logger_mutex = cmdline.setup_logging (__name__, options.log_file, options.verbose)


        @transform(previous_task, suffix(".foo"), ".bar", logger, logger_mutex)
        def next_task(input_file, output_file, logger, logger_mutex):
            with logger_mutex:
                logger.info("We are in the middle of next_task: %s -> %s" % (input_file, output_file))


==============================================================================
Can I  share and synchronise my own python classes via proxies?
==============================================================================

    A. `multiprocessing.managers.SyncManager  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.managers.SyncManager>`__ provides out of the box support for lists, arrays and dicts etc.

        Most of the time, we can use a "vanilla" manager provided by `multiprocessing.Manager()  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.sharedctypes.multiprocessing.Manager>`_:

        .. code-block:: python


            import multiprocessing
            manager = multiprocessing.Manager()

            list_proxy = manager.list()
            dict_proxy = manager.dict()
            lock_proxy          = manager.Lock()
            namespace_proxy     = manager.Namespace()
            queue_proxy          = manager.Queue()          # optionally manager.Queue(maxsize)
            reentrant_lock_proxy = manager.RLock()
            semaphore_proxy      = manager.Semaphore()      # optionally manager.Semaphore(value)
            char_array_proxy     = manager.Array('c', "hello")
            integer_proxy        = manager.Value('i', 6)

            @transform(previous_task, suffix(".foo"), ".bar", lock_proxy, dict_proxy, list_proxy)
            def next_task(input_file, output_file, lock_proxy, dict_proxy, list_proxy):
                with lock_proxy:
                    list_proxy.append(3)
                    dict_proxy['a'] = 5


    However, you can also create custom proxy classes for your own objects.

    In this case you may need to derive from  `multiprocessing.managers.SyncManager  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.managers.SyncManager>`_
    and register proxy functions. See ``Ruffus.proxy_logger`` for an example of how to do this.
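
    A minimal sketch of what this can look like (the ``Counter`` class and the manager name are invented for illustration):

    .. code-block:: python

        from multiprocessing.managers import SyncManager

        class Counter(object):
            """A trivial class we would like to share between processes"""
            def __init__(self):
                self.count = 0
            def increment(self):
                self.count += 1
                return self.count

        class MyPipelineManager(SyncManager):
            pass

        # register the class so that the manager can create and proxy instances of it
        MyPipelineManager.register("Counter", Counter)

        manager = MyPipelineManager()
        manager.start()

        # counter_proxy can be passed as a parameter to Ruffus tasks
        counter_proxy = manager.Counter()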

============================================================================================================================================================
How do I send python objects back and forth without tangling myself in horrible synchronisation code?
============================================================================================================================================================

    A. Sharing python objects by passing messages is a much more modern and safer way to coordinate multitasking than using synchronization primitives like locks.

    The python `multiprocessing  <http://docs.python.org/2/library/multiprocessing.html#pipes-and-queues>`__ module provides support for passing python objects as messages between processes.
    You can either use `pipes  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Pipe>`__
    or `queues  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Queue>`__.
    The idea is that one process pushes an object onto a `pipe  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Pipe>`__ or `queue  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Queue>`__
    and another process pops it off at the other end. `Pipes  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Pipe>`__ are
    only two ended so `queues  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Queue>`__ are usually a better fit for sending data to multiple Ruffus jobs.

    Proxies for `queues  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.managers.SyncManager.Queue>`__ can be passed between processes as in the previous section. For example:
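
    A minimal sketch (the queue proxy and task names are invented for illustration):

    .. code-block:: python

        import multiprocessing
        manager = multiprocessing.Manager()

        # a queue proxy which can safely be shared between Ruffus jobs
        result_queue = manager.Queue()

        @transform(previous_task, suffix(".foo"), ".bar", result_queue)
        def produce_results(input_file, output_file, result_queue):
            # push a python object onto the queue
            result_queue.put((input_file, "some result"))

        @follows(produce_results)
        @originate("summary.txt", result_queue)
        def summarise_results(output_file, result_queue):
            with open(output_file, "w") as oo:
                # pop messages off at the other end
                while not result_queue.empty():
                    oo.write("%s\n" % (result_queue.get(),))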


==============================================================================
How do I share large amounts of data efficiently across processes?
==============================================================================

    A. If it is really impractical to use data files on disk, you can put the data in shared memory.

    It is possible to create shared objects using shared memory which can be inherited by child processes or passed as Ruffus parameters.
    This is probably most efficiently done via the `array  <http://docs.python.org/2/library/multiprocessing.html#multiprocessing.Array>`_
    interface. Again, it is easy to create locks and proxies for synchronised access:


    .. code-block:: python

        import multiprocessing
        from multiprocessing import Process, Lock
        from multiprocessing.sharedctypes import Value, Array
        from ctypes import Structure, c_double

        manager = multiprocessing.Manager()

        lock_proxy          = manager.Lock()
        int_array_proxy     = manager.Array('i', [123] * 100)

        @transform(previous_task, suffix(".foo"), ".bar", lock_proxy, int_array_proxy)
        def next_task(input_file, output_file, lock_proxy, int_array_proxy):
            with lock_proxy:
                int_array_proxy[23] = 71