##################################################################################################################################
########################## ########################################
########################## Trinity PBS job submission with multi part dependencies ########################################
##########################          User-modifiable input file                            ########################################
########################## ########################################
##################################################################################################################################
### Author: Josh Bowden, CSIRO IM&T, Alexie Papanicolaou CSIRO CES
### Email: alexie@butterflybase.org
### Version 1.0
###
### Configuration file for script to split the Trinity workflow into multiple stages so as to efficiently request
### and use appropriate resources (walltime and number of cores) on a computer cluster / supercomputer.
###
### User must set all the variables in this file to appropriate values
### and then run the trinity_pbs.sh script with this file as input as shown below:
###
### Command line usage:
### To start (or re-start) an analysis:
### >trinity_pbs.sh TRINITY.CONFIG.template
### To stop previously started PBS jobs on the queue:
### >trinity_pbs.sh --rm OUTPUTDIR
### Where:
### TRINITY.CONFIG.template = user specific job details (i.e. the current file)
###       OUTPUTDIR               = path to the output data directory (set below)
###
### If any stage fails, the jobs may be resubmitted and only the scripts that have not completed
### will be resubmitted to the batch system. Either the scripts can be re-run (by using trinity_pbs.sh)
### with original (or new) inputs from the current file (changing the variables below)
### or the original scripts created by trinity_pbs.sh can be re-run by finding them in the output directory.
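### As an illustrative session (the config filename and job name below are hypothetical), the commands above are typically used like this:

```shell
# Copy and edit the template, then submit all stages with dependencies:
cp TRINITY.CONFIG.template my_assembly.config
trinity_pbs.sh my_assembly.config

# If a stage ran out of walltime, raise the matching WALLTIME_Px value
# in the config and resubmit; only incomplete scripts are re-queued:
trinity_pbs.sh my_assembly.config

# Remove all queued jobs for this analysis (OUTPUTDIR as set in the config):
trinity_pbs.sh --rm "$WORKDIR"/trinityrnaseq/myjob
```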
###
### Each stage and each array job will have a PBS output file sent to the output directory when the job finishes.
### This means there will be many output files from the PBS system when the array jobs run (mostly from parts 4 and 5).
### If any part fails, errors will be specified in these output files.
###
### N.B. The trinity_pbs.sh file must have a number of system specific variables set by a system administrator
###
##################################################################################################################################
# USER must edit these:
###### Set an email to which job progress and status will be sent to.
UEMAIL=
###### Set a valid account (if available), otherwise leave blank.
ACCOUNT="#PBS -A sf-CSIRO"
###### Select a value for JOBPREFIX that is no longer than 7 characters; no spaces or other non-alphanumeric characters
JOBPREFIX=
###### Set output data directory (OUTPUTDIR)
###### OUTPUTDIR is where PBS scripts will be written and also Trinity results will be stored
###### This area requires a large amount of space (possibly hundreds of GB) and a high file count
###### ($WORKDIR is a standard area on some systems, however users should check that it is valid on their machine)
OUTPUTDIR="$WORKDIR"/trinityrnaseq/"$JOBPREFIX"
###### Set input data directory. This has to be explicitly set as it is used in other internal scripts.
###### Make sure you include the final forward slash. Defaults to current directory
DATADIRECTORY=$PWD/
###### Set input filenames; you can use wildcards if you enclose the filename in 'single quotes'
FILENAMELEFT='*_left.fasta' # change this
FILENAMERIGHT='*_right.fasta' #change this
FILENAMESINGLE=single.fasta # change this or set it to empty
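### As a concrete sketch (sample filenames hypothetical): two paired-end lanes plus a file of unpaired reads could be declared as:

```shell
FILENAMELEFT='lane*_left.fasta'    # matches lane1_left.fasta, lane2_left.fasta in DATADIRECTORY
FILENAMERIGHT='lane*_right.fasta'  # mate files must pair up with the left files
FILENAMESINGLE=orphans.fasta       # or leave empty if there are no unpaired reads
```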
SEQTYPE=fa # change this to fa (FASTA), fq (FASTQ), or cfa/cfq (colour-space FASTA/FASTQ from SOLiD ABI)
# User may opt to change these:
# We set --max_reads_per_graph to 1 million because otherwise very high I/O is needed; it is unlikely that a transcript needs more than 1 million reads to be assembled.
FILENAMEINPUT=" --seqType $SEQTYPE --left $DATADIRECTORY$FILENAMELEFT --right $DATADIRECTORY$FILENAMERIGHT --max_reads_per_graph 1000000 "
### STANDARD_JOB_DETAILS sets analysis specific input to Trinity.pl
### N.B. do not use the --CPU or --JM flags, as these are appended automatically
STANDARD_JOB_DETAILS="Trinity.pl $FILENAMEINPUT --output $OUTPUTDIR"
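### For reference, with hypothetical values (DATADIRECTORY=/data/run1/, OUTPUTDIR=/work/trinityrnaseq/myjob) the assembled command line expands to roughly:

```shell
# Trinity.pl --seqType fa --left /data/run1/*_left.fasta --right /data/run1/*_right.fasta \
#            --max_reads_per_graph 1000000 --output /work/trinityrnaseq/myjob
# (trinity_pbs.sh appends the per-stage --CPU and --JM flags automatically)
```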
# This is where you specify resource limits. We provide some default values;
# ultimately the right settings depend on your data size and complexity.
# Steps that exceed their walltime will not complete. Increase these values and resubmit.
### Stage P1: Time and resources required for Inchworm stage
### Only use, at most, half the available CPUs on a node
# - Inchworm will not use more than 4 CPUs efficiently, and requesting more only lengthens the wait for resources to be assigned
WALLTIME_P1="2:00:00"
MEM_P1="20gb" # will use it for --JM
NCPU_P1="4"
PBSNODETYPE_P1="any" # ask your system administrator what node types exist
### Stage P2: Time and resources required for Chrysalis stage
### Starts with Bowtie alignment and post-processing of alignment file
### All CPUs present can be used for the Chrysalis parts.
# They may take a while to be provisioned, so the fewer CPUs requested, the faster the job may turn around.
# One step (the parallel sort) needs as much memory as specified in P1; less memory means more I/O for sorting.
# Increase for more lanes of data: for 3 lanes, 60gb of RAM and 24h of walltime should suffice.
# The Bowtie step will take a considerable amount of time with more data.
WALLTIME_P2="12:00:00"
MEM_P2="20gb" # will use it for the parallel sort of the SAM after alignment
NCPU_P2="6"
PBSNODETYPE_P2="any"
### Stage P3: This is a backup stage for Chrysalis - only runs if time ran out in P2 above.
### This will need about 1 day per lane
WALLTIME_P3="18:00:00"
MEM_P3="8gb"
NCPU_P3="6"
PBSNODETYPE_P3="medium"
### Stage P4: The QuantifyGraph stage runs in many parallel parts
### Tasks that fail will remain in the OUTPUTDIR/chrysalis/quantifyGraph_commands.XYZ files (XYZ is a number)
### The remaining tasks can be run by issuing the job submission command "trinity_pbs.sh <config_file>" again.
NUMPERARRAYITEM_P4=5000
WALLTIME_P4="00:30:00"
MEM_P4="4gb"
NCPU_P4="1"
PBSNODETYPE_P4="medium"
### Stage P5: Butterfly options. Some Butterfly jobs can take exceedingly long; users may need to restart trinity_pbs multiple times to complete.
### Tasks that fail will remain in the OUTPUTDIR/chrysalis/butterfly_commands.adj.XYZ files (XYZ is a number)
### Often a few tasks take much longer than the others, so multiple submissions to the cluster may be required.
### The remaining tasks can be run by issuing the job submission command "trinity_pbs.sh <config_file>" again.
### Running the long-running tasks separately may also be a good option.
NUMPERARRAYITEM_P5=5000
WALLTIME_P5="01:00:00"
MEM_P5="10gb"
NCPU_P5="1"
PBSNODETYPE_P5="medium"