1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
|
=====
Usage
=====
Important notes
---------------
* For best results, cmp.h5 or BAM files generated from short library sequencing should be paired with the SMsn protocol.
* Conversely, cmp.h5 or BAM files containing long library-sequenced reads are best when using the SMp protocol.
* A CCS fastq file is only *required* when the --align option is specified [DEPRECATED USAGE; see below]. This calls an alignment step that is used to mask out sequencing errors on each molecule. --align should never be used with the SMp protocol and long libraries, as CCS sequencing only works with short libraries.
* UPDATE: because the --align option has been deprecated, just specify NONE as the fastq path in the input file.
Example usage
-------------
.. code-block:: console
Usage: smalr [--help] [options] <inputs_file>
This program will take a native aligned_reads.cmph5 and compare each molecule's
kinetics profile with a matching WGA aligned_reads.cmp.h5. There are two protocols
available for use within the pipeline, SMsn and SMp. The single argument for both
protocols is an inputs_file containing paths to the relevant files.
The inputs_file has the following format:
native_cmph5 : /path/to/native/aligned_reads.cmp.h5
fastq : /path/to/native/CCS_reads.fastq (DEPRECATED; specify NONE)
wga_cmph5 : /path/to/WGA/aligned_reads.cmp.h5
ref : /path/to/sample/reference.fasta
With the updated support for aligned BAM files (in addition to *.cmp.h5 files),
please supply the path to the aligned BAM file using the 'native_cmph5' and
'wga_cmph5' lines in the input file.
SMsn: Single-molecule, single nucleotide analysis
Each motif site on each sequencing molecule is assessed for methylation
status. This is designed for use with short (~250bp) sequencing library
preps, where the long read lengths of SMRT reads enables multiple passes
over each motif site. The reliability of the SMsn scores increases with
more passes (i.e. higher single-molecule coverage).
Example usage for SMsn analysis:
smalr -i --SMsn --motif=GATC --mod_pos=2 --procs=4 -c 5 input.txt
SMp: Single-molecule, motif-pooled analysis
All motif sites on a sequencing molecule are pooled together and the
molecule-wide methylation status for the given motif is assessed. This
is designed for use with long (10Kb+) sequencing library preps, where each
single long subread can span many distinct motif sites. The reliability of
the SMp scores increases with increasing number of distinct motif sites
contained in the subread.
Example usage for SMp analysis:
smalr -i --SMp --motif=GATC --mod_pos=2 --procs=4 -c 5 input.txt
Options:
-h, --help Show this help message and exit
-d, --debug Increase verbosity of logging
-i, --info Add basic logging
--logFile=LOGFILE Write logging to file [log.out]
--out=OUT Filename to output SMsn/SMp results [<SMsn/SMp>.out]
-c NATIVECOVTHRESH, --nativeCovThresh=NATIVECOVTHRESH Per mol/strand coverage threshold below which to ignore molecules [10]
-m MOTIF, --motif=MOTIF (Required) The sequence motif to be analyzed [None]
-s MOD_POS, --mod_pos=MOD_POS (Required) The modified position in the motif to be analyzed (e.g. for
Gm6ATC, mod_pos=2) [None]
--wgaCovThresh=WGACOVTHRESH Aggregate WGA coverage threshold below which to skip analysis at that
position [10]
--SMsn Use short-library, single-nucleotide detection protocol. [False]
--SMp Use long-library epigenetic phasing protocol (pool IPDs from each
subread). [False]
--procs=PROCS Number of processors to use [4]
--align [DEPRECATED; not supported for BAM input] Align native reads to
reference to avoid real SNP positions. Only use when expecting sequence
heterogeneity in sample (i.e. mtDNA). [False]
--upstreamSkip=UPSTREAMSKIP Number of bases 5' of a CCS-detected, molecule-level SNP to skip in
analysis (only when using --align) [10]
--downstreamSkip=DOWNSTREAMSKIP Number of bases 3' of a CCS-detected, molecule-level SNP to skip in
analysis (only when using --align) [10]
--minSubreadLen=MINSUBREADLEN Minimum length of a subread to analyze [100]
--minAcc=MINACC Minimum accuracy of a subread to analyze [0.8]
--natProcs=NATPROCS Number of processors to use for native molecule analysis [same as procs]
--leftAnchor=LEFTANCHOR Number of left bp to exclude around subread-level alignment errors [1]
--rightAnchor=RIGHTANCHOR Number of right bp to exclude around subread-level alignment errors [1]
--write_vars=WRITE_VARS Write mol-specific variant calls to this filename (only when using --align)
[None]
--useZMW Index molecules using ZMW/movie ID's instead of molecule IDs (if all
alignments all have unique molecule IDs) [False]
Because the ``--align`` option has been deprecated due to limited use-cases, several of the above options are no longer necessary or useful. These include the following options, which can be disregarded:
* ``--align``
* ``--upstreamSkip``
* ``--downstreamSkip``
* ``--leftAnchor``
* ``--rightAnchor``
* ``--write_vars``
Pipeline output
---------------
One output directory will be created for each contig in the reference. If there is only one contig, the results will be placed in the folder named for that contig. These results include a log detailing the analysis of that contig, the motif positions in that contig (forward and reverse strand), a fasta file of that contig, and a results file (SMsn.out or SMp.out). This results file contains the following informtation:
Column headers
^^^^^^^^^^^^^^^
1. Contig strand
2. Contig motif position (for SMp, pooled motif sites are summarized by smallest site position)
3. SMsn or SMp score (native score - WGA score)
4. Molecule ID
5. Native score (mean of subread-normalized ln(IPD) values; site- and molecule-specific)
6. WGA score (mean of subread-normalized ln(IPD) values; site-specific accross all WGA molecules)
7. Number of data points used to get the molecule-level native score
8. Number of data points used to get the aggregate WGA score
9. Mean length of subreads from the native molecule
|