File: getting_started.rst

package info (click to toggle)
python-seqcluster 1.2.7%2Bds-1
  • links: PTS, VCS
  • area: contrib
  • in suites: bullseye
  • size: 113,592 kB
  • sloc: python: 5,327; makefile: 184; sh: 122; javascript: 55
file content (195 lines) | stat: -rw-r--r-- 7,616 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
.. _getting_started:


***************
Getting started
***************

Best practices are implemented in a `python framework`_.

.. _python framework: https://github.com/chapmanb/bcbio-nextgen

clustering of small RNA sequences
---------------------------------

seqcluster generates a list of clusters of small RNA sequences, their genome location, their annotation and the abundance in all the sample of the project

.. image:: seqcluster.png
   :scale: 25%
   
**REMOVE ADAPTER**

I am currently using ``cutadapt``:

::

    cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq 

**COLLAPSE READS**

To reduce computational time, I recommend to collapse sequences, also it would help to apply filters based on abundances.
Like removing sequences that appear only once.

::

   seqcluster collapse -f sample1_clean.fastq -o collapse

Here I am only using sequences that had the adapter, meaning that for sure are small fragments.

This is compatible with UMI barcodes. If you have in the read name ``UMI_ATCGAT ``, then the tool will remove PCR dupiclates as well. To confirm this happened, the tool should output this sentence during the processing of the file: ``Find UMI tags in read names, collapsing by UMI``.

**PREPARE SAMPLES**

::

    seqcluster prepare -c file_w_samples -o res --minl 17 --minc 2 --maxl 45

the file_w_samples should have the following format:

::

	lane1_sequence.txt_1_1_phred.fastq      cc1
	lane1_sequence.txt_2_1_phred.fastq      cc2
	lane2_sequence.txt_1_1_phred.fastq      cc3
	lane2_sequence.txt_2_1_phred.fastq      cc4

two columns file, where the first column is the name of the file with the small RNA sequences for each sample, and the second column in the name of the sample.

The fastq files should be like this:

::

    @seq_1_x11
    CCCCGTTCCCCCCTCCTCC
    +
    QUALITY_LINE
    @seq_2_x20
    TGCGCAGTGGCAGTATCGTAGCCAATG
    +
    QUALITY_LINE
    </pre>

Where _x[09]  indicate the abundance of that sequence, and the middle number is the index of the sequence.

This script will generate: ``seqs.fastq`` and ``seqs.ma``. 
* seqs.fastq: have unique sequences and unique ids
* seqs.ma: is the abundance matrix of all unique sequences in all samples

**ALIGNMENT**

You should use an aligner to map seqs.fa to your genome. A possibility is bowtie or STAR. 
From here, we need a file in BAM format for the next step.
VERY IMPORTANT: the BAM file should be sorted

::

    bowtie -a --best --strata -m 5000 INDEX seqs.fastq -S | samtools view -Sbh /dev/stdin | samtools sort -o /dev/stdout temp > seqs.sort.bam


or 

::

    STAR --genomeDir $star_index_folder --readFilesIn res/seqs.fastq --alignIntronMax 1  --outFilterMultimapNmax 1000 --outSAMattributes NH HI NM --outSAMtype BAM SortedByCoordinate


**CLUSTERING**

::

    seqcluster cluster -a res/Aligned.sortedByCoord.out.bam  -m res/seqs.ma -g $GTF_FILE  -o res/cluster -ref PATH_TO_GENOME_FASTA --db example


* `-a` is the SAM file generated after mapped with your tool, which input has been seqs.fa
* `-m` the previous seqs.fa
* `-b` annotation files in bed format (see below examples) [deprecated]
* `-g` annotation files in gtf format (see below examples) [recommended]
* `-i` genome fasta file used in the mapping step (only needed if -s active)
* `-o` output folder
* `-ref` genome fasta file. Needs ``fai`` file as well there. (i.e ``hg19.fa``, ``hg19.fa.fai``)
* `-d` create debug logging
* `-s` construction of putative precursor (NOT YET IMPLEMENTED)
* `--db` (optional) will create sqlite3 database with results that will be used to browse data with html web page (under development)

Example of a bed file for annotation (the fourth column should be the name of the feature): 

::

    chr1    157783  157886  snRNA   0       -
    
Strongly recommend gtf format. Bed annotation is deprecated. Go `here <http://seqcluster.readthedocs.org/installation.html#data>`_ to know how to download data from hg19 and mm10.

Example of a gtf file for annotation (the **third** column should be the name of the feature and
the value after `gene name` attribute is the specific annotation): 

:: 

    chr1    source  miRNA      1       11503   .       +       .       gene name 'mir-102' ;


hint: scripts to generate human and mouse annotation are inside `seqcluster/scripts <https://github.com/lpantano/seqcluster/blob/master/scripts>`_ folder. 

**OUTPUTS**

* ``counts.tsv``: count matrix that can be input of downstream analyses
* ``size_counts.tsv``: size distribution of the small RNA by annotation group
* ``seqcluster.json``: json file containing all information 
* ``log/run.log``: all messages at debug level
* ``log/trace.log``: to keep trace of algorithm decisions


Interactive HTML Report
-----------------------

This will create html report using the following command assuming the output of `seqcluster cluster` is at `res`::

	seqcluster report -j res/seqcluster.json -o report -r $GENONE_FASTA_PATH 

where `$GENOME_FASTA_PATH` is the path to the genome fasta file used in the alignment.

**Note**: you can try our new `visualization tool <http://seqcluster.readthedocs.org/more_outputs.html#report>`_!

* ``report/html/index.html``: table with all clusters and the annotation with sorting option
* ``report/html/[0-9]/maps.html``: `summary`_ of the cluster with expression profile, annotation, and all sequences inside
* ``report/html/[0-9]/maps.fa``: putative precursor

.. _summary: https://rawgit.com/lpantano/seqcluster/master/data/examples_report/html/1/maps.html

An example of the output is below:

.. image:: http://i.makeagif.com/media/7-03-2016/M0GjW2.gif

Easy start with bcbio-nextgen.py
------------------------------------

**Note**:If you already are using bcbio, visit `bcbio <http://github.com/chapmanb/bcbio>`_ to run the pipeline there. 

To install the small RNA data::

	bcbio_nextgen.py upgrade -u development --tools --datatarget smallrna

**Options to run in a cluster**

It uses ipython-cluster-helper to send jobs to nodes in the cluster

* `--parallel` should set to `ipython`
* `--scheduler` should be set to `sge,lsf,slurm`
* `--num-jobs` indicates how much jobs to launch. It will run samples independently. If you have 4 samples, and set this to 4, 4 jobs will be launch to the cluster
* `--queue` the queue to use
* `--resources` allows to set any special parameter for the cluster, such as, email in sge system: `M=my@email.com`

Read complete usability here: https://github.com/roryk/ipython-cluster-helper
An examples in slurm system is::

	--parallel ipython --scheduler slurm --num-jobs 4 --queue general

**Output**

* one folder for each analysys, and inside one per sample
 * adapter: `*clean.fastq` is the file after adapter removal, `*clean_trimmed.fastq` is the collapse `clean.fastq`, `*fragments.fastq` is file without adapter, `*short.fastq` is file with reads < 16 nt.
 * align: BAM file results from align `trimmed.fastq`
 * mirbase: file with miRNA anotation and novel miRNA discovery with mirdeep2
 * tRNA: analysis done with tdrmapper [citation needed]
 * qc: `*_fastqc.html` is the fastqc results from the uncollapse fastq file
* seqcluster: is the result of running seqcluster. See its `documentation <http://seqcluster.readthedocs.org/getting_started.html#clustering-of-small-rna-sequences>`_ for further information.
* `report/srna-report.Rmd`: template to create a quick html report with exploration and differential expression analysis. See `example here <https://github.com/lpantano/mypubs/blob/master/srnaseq/mirqc/ready_report.md>`_