File: mirna_annotation.rst

package info (click to toggle)
python-seqcluster 1.2.9%2Bds-5
  • links: PTS, VCS
  • area: contrib
  • in suites: forky, sid, trixie
  • size: 113,636 kB
  • sloc: python: 5,308; makefile: 184; sh: 122; javascript: 55
file content (202 lines) | stat: -rw-r--r-- 8,146 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
.. _mirna_annotation:


***************
miRNA annotation
***************

miRNA annotation is running inside `bcbio small RNAseq pipeline <https://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#smallrna-seq>`_ together with other tools to do a complete
small RNA analysis.

For some comparison with other tools go `here <https://github.com/lpantano/mypubs/blob/master/mirna/mirannotation/stats.md>`_.

You can run samples after processing the reads as shown below.
Currently there are two version: JAVA 

**Naming**

See always up to date information here in `mirtop open project <https://github.com/miRTop/incubator/blob/master/isomirs/isomir_naming.md>`_.

It is a working process, but since 10-21-2015 isomiR naming has changed to:

* Nucleotide substitution: ``NUMBER|NUCLEOTIDE_ISOMIR|NUCLEOTIDE_REFERENCE`` means at the position giving by the number the nucleotide in the sequence has substituted the nucleotide in the reference. This, as well, is a post-transcriptional modification.
* Additions at 3' end: ``0/NA`` means no modification. ``UPPER CASE LETTER`` means addition at the end. Note these nucleotides don't match the precursor. So they are post-transcriptional modification.
* Changes at 5' end: ``0/NA`` means no modification. ``UPPER CASE LETTER`` means nucleotide insertions (sequence starts before miRBase mature position). ``LOWWER CASE LETTER`` means nucleotide deletions (sequence starts after miRBase mature position).
* Changes at 3' end: ``0/NA`` means no modification. ``UPPER CASE LETTER`` means nucleotide insertions (sequence ends after miRBase mature position). ``LOWWER CASE LETTER`` means nucleotide deletions (sequence ends before miRBase mature position).

Processing of reads
-------------------

**REMOVE ADAPTER**

I am currently using ``cutadapt``.

::

    cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq 

**COLLAPSE READS**

To reduce computational time, I recommend to collapse sequences, also it would help to apply filters based on abundances.
Like removing sequences that appear only once.

::

   seqcluster collapse -f sample1_clean.fastq -o collapse

Here I am only using sequences that had the adapter, meaning that for sure are small fragments. The output will be named as ``sample1_clean_trimmed.fastq``


Prepare databases
-----------------

For human or mouse, follows `this instruction <http://seqcluster.readthedocs.org/installation.html#data>`_ to download easily miRBase files. In general you only need hairpin.fa and miRNA.str from miRBase site. `mirGeneDB <http://mirgenedb.org>`_ is also supported, download the needed files `here <https://github.com/lpantano/small_rna_annotation/tree/master/mirgenedb>`_. 

**Highly recommended to filter hairpin.fa to contain only the desired species.**

miRNA/isomiR annotation with JAVA
---------------------------------

**MIRALIGNER**

Download the tool from `miraligner`_ repository. 

.. _miraligner: https://github.com/lpantano/seqbuster/raw/miraligner/modules/miraligner/miraligner.jar

Download the mirbase files (`hairpin`_ and `miRNA`_) from the ftp and save it to `DB` folder.

.. _hairpin: ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.zip
.. _miRNA: ftp://mirbase.org/pub/mirbase/CURRENT/miRNA.str.zip

You can map the miRNAs with.

::

     java -jar miraligner.jar -sub 1 -trim 3 -add 3 -s hsa -i sample1_clean_trimmed.fastq -db DB  -o output_prefix 


**Cite**

SeqBuster is a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells. Pantano L, Estivill X, Martí E. *Nucleic Acids Res. 2010 Mar;38(5):e34. Epub 2009 Dec 11.*

**NOTE:** `Check comparison of multiple tools <https://github.com/lpantano/mypubs/blob/master/mirna/mirannotation/stats.md>`_ for miRNA annotation.

Convert to GFF3-srna
---------------------

Use mirtop to convert to GFF3-srna format. This is the desired format to share the isomiR information and can be used to join multiple projects together easily.

See _`this <http://mirtop.readthedocs.io/en/dev/quick_start.html#from-seqbuster-miraligner-files-to-gff3>` to know how to convert all the output into a single file and share easily with collaborators::

    mirtop gff --format seqbuster --sps hsa --hairpin database/hairpin.fa --gtf database/hsa.gff3 -o test_out out_folder/*/*.mirna


.. _miRNAgff: ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3


Post-analysis with R
--------------------

Use the outputs to do differential expression, clustering and descriptive analysis with this package: `isomiRs <https://github.com/lpantano/isomiRs>`_

To load the data you can use `IsomirDataSeqFromFiles function  <http://lpantano.github.io/isomiRs/reference/IsomirDataSeqFromFiles.html>`_ and get the count data with `isoCounts <http://lpantano.github.io/isomiRs/reference/isoCounts.html>`_ to move to DESeq2 or similar packages.

Manual of miraligner(JAVA)
--------------------------

**options**

Add ``-freq`` if you have your fasta/fastq file with this format and you want a third column with the frequency (normally value after x character)::


    >seq_1_x4
    CACCGCTGTCGGGGAACCGCGCCAATTT


Add ``-pre`` if you want also sequences that map to the precursor but outside the mature miRNA


* Parameter `-sub`: mismatches allowed (0/1)
* Parameter `-trim`: nucleotides allowed for trimming (max 3)
* Parameter `-add`: nucleotides allowed for addition (max 3)
* Parameter `-s`: species (3 letter, human=>hsa)
* Parameter `-i`: fasta file
* Parameter `-db`: folder where miRBase files are(one copy at miraligner-1.0/DB folder)
* Parameter `-o`: prefix for the output files
* Parameter `-freq`: add frequency of the sequence to the output (just where input is fasta file with name matching this patter: >seq_3_x67)
* Parameter `-pre`: add sequences mapping to precursors as well

**input**

A fasta/fastq file reads::

    >seq
    CACCGCTGTCGGGGAACCGCGCCAATTT

or tabular file with counts information::

CACCGCTGTCGGGGAACCGCGCCAATTT 45

**output**

Track file *.mirna.opt: information about the process

Non mapped sequences will be on *.nomap

Header of the *.mirna.out file:

* seq: sequence
* freq/name: depending on the input this column contains counts (tabular input file) or name (fasta file)
* mir: miRNA name
* start: start of the sequence at the precursor
* end: end of the sequence at the precursor
* mism: nucleotide substitution position | nucleotide at sequence | nucleotide at precursor
* addition: nucleotides at 3 end added::


    precursor         => cctgtggttagctggttgcatatcc
    annotated miRNA   =>   TGTGGTTAGCTGGTTGCATAT
    sequence add:  TT =>   TGTGGTTAGCTGGTTGCATATTT


* tr5: nucleotides at 5 end different from the annonated sequence in miRBase::


	precursor 	      => cctgtggttagctggttgcatatcc
	annotated miRNA   =>   TGTGGTTAGCTGGTTGCATAT
	sequence tr5:  CC => CCTGTGGTTAGCTGGTTGCATAT
	sequence tr5:  tg =>     TGGTTAGCTGGTTGCATAT


* tr3: nucleotides at 3 end different from the annotated sequence in miRBase::


    precursor         => cctgtggttagctggttgcatatcc
    annotated miRNA   =>   TGTGGTTAGCTGGTTGCATAT
    sequence tr3: cc  =>   TGTGGTTAGCTGGTTGCATATCC
    sequence tr3: AT  =>   TGTGGTTAGCTGGTTGCAT

* s5: offset nucleotides at the begining of the annotated miRNAs::


    precursor         => agcctgtggttagctggttgcatatcc
    annotated miRNA   =>     TGTGGTTAGCTGGTTGCATAT
    s5                => AGCCTGTG


* s3:offset nucleotides at the ending of the annotated miRNAs::
 

    precursor         =>  cctgtggttagctggttgcatatccgc
    annotated miRNA   =>    TGTGGTTAGCTGGTTGCATAT
    s3                =>                     ATATCCGC


* type: mapped on precursor or miRNA sequences
* ambiguity: number of different detected precursors

Example::

    seq			miRNA		start	end	mism	tr5	tr3	add	s5	s3	DB amb
    TGGCTCAGTTCAGCAGGACC    hsa-mir-24-2    50      67      0       qCC     0       0       0       0       precursor 1
    ACTGCCCTAAGTGCTCCTTCTG  hsa-miR-18a*    47      68      0       0       0       tG      ATCTACTG        CTGGCA  miRNA 1