1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253
|
*** INSEGT - INtersecting SEcond Generation sequencing daTa with annotation ***
---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
1. Overview
2. Installation
3. Usage
4. Input Format
5. Output Format
6. Contact
---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------
INSEGT is a tool to analyse alignments of RNA-Seq reads (single-end or
paired-end) by using gene annotations.
It can measure exon, transcript and gene expression levels of given
annotations. If read alignments span more than one exon, INSEGT can also
compute possible exon combinations (tuple) and their expression levels.
According to requirements it computes maximal tuple or tuple with a certain
length. For paired-end reads INSEGT builds also tuple, whose exons are
connected by matepairs.
Briefly it searches the intervals of the read alignments in the intervals
of the given annotations. It counts the mapped reads for each annotation and
its parent annotation and stores the mapped annotations for each read.
INSEGT produces three output files:
The first contains the expression levels of the given annotations.
The second contains for each given read the mapped annotations.
If the read mapped in not annotated regions, an UNKNOWN_REGION in the output will mark
this.
And the third file contains exon tuple and their expression levels.
(For further informations see:
https://www.inf.fu-berlin.de/en/groups/abi/theses/bachelor/finished/krakau/bsc_thesis_krakau.pdf)
---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------
INSEGT is distributed with SeqAn - The C++ Sequence Analysis Library (see
https://www.seqan.de). To build INSEGT do the following:
Follow the "Getting Started" section on https://trac.seqan.de/wiki and check
out the latest Git repo. Instead of creating a project file in Debug mode,
switch to Release mode (-DCMAKE_BUILD_TYPE=Release) and compile insegt. This
can be done as follows:
1) git clone https://github.com/seqan/seqan.git
2) mkdir seqan/buld; cd seqan/build
3) cmake .. -DCMAKE_BUILD_TYPE=Release
4) make insegt
5) ./bin/insegt --help
---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------
Usage: insegt [OPTIONS] <ALIGNMENTS-FILE> <ANNOTATIONS-FILE>
INSEGT expects at least two files:
A SAM file which contains the alignments of the reads and a GFF file which
contains the gene annotations (annotations can be also given in the
GTF format, but this has would have to be specified additionally with the
option -z).
Without any additional parameters INSEGT would compute all tuple up to the
length 2. That means that it builds only 2-tuple, whose exons are connected
by matepairs.
The default behaviour can be modified by adding the following options to
the command line:
---------------------------------------------------------------------------
3.1. Options
---------------------------------------------------------------------------
[ -p PATH ], [ --output-path TEXT ]
Path to the directory, where the output files should be stored.
The default value is the insegt directory.
[ -n NUM ], [ --ntuple INT ]
Tuple length parameter. By default it is 2.
[ -o NUM ], [ --offset-interval INT ]
The offset-parameter to short alignment intervals to search them in the
given annotation regions. The default value is 5.
[ -t NUM ], [ --threshold-gaps INT ]
The number of gaps, which are allowed in the read alignments, for which
the gaps are not interpreted as introns. The default value is 5.
[ -c NUM ], [ --threshold-count INT ]
Threshold for min. count of tuple, for which the tuple are given out.
The default value is 1.
[ -r NUM ], [ --threshold-rpkm DOUBLE ]
Threshold for min. RPKM-value of tuple, for which the tuple are given out.
The default value is 0.0.
[ -m], [ --max-tuple ]
Only maximal tuple are computed, which are spanned by the whole read.
[ -e ], [ --exact-ntuple ]
Only tuple of the given length are computed. By default all tuple up to
the given length are computed. (If -m is set, -e will be ignored)
[ -u ], [ --unknown-orientation ]
Read-orientations are unknown. The alignment intervals are searched in the
annotations of both strands. By default they're just searched in the
annotations corresponding to the given orientation of the read.
[ -f], [ --fusion-genes ]
Check for fusion genes. A separate output for matepair tupel, which map in
different genes, is created. By default there is no check for fusion genes.
[ -z], [ --gtf-format ]
The gene annotations are given in a GTF file (instead of GFF).
By default INSEGT expects a GFF file.
---------------------------------------------------------------------------
4. Input General Feature Format
---------------------------------------------------------------------------
INSEGT expects a GFF file which contains the gene annotations.
The General Feature Format (GFF) is specified by the Sanger Institute as a tab-
delimited text format with the following columns:
<seqname> <src> <feat> <start> <end> <score> <strand> <frame> [attr] [cmts]
See also: https://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
INSEGT requires additional specifications for the input of the exon- and genenames.
Input columns:
(Irrelevant fields are represented as "<>", their contents doesn't matter)
<Sequenzname> < > < > <Start> <Ende> < > <Strang> < > <ID;ParentID>
Example:
chr1 . . 200 1000 . + . ID=ENSG1
chr1 . . 300 450 . + . ID=ENSG1-1;ParentID=ENSG1
chr1 . . 700 600 . - . ID=ENSG2-1;ParentID=ENSG2
The order of the row-entries doesn't matter.
If no entry exists for the parent of an exon, INSEGT generates it automatically.
Moreover parent entries don't need start- and end-positions.
---------------------------------------------------------------------------
5. Output Formats
---------------------------------------------------------------------------
INSEGT output files are GFF files. The first output file contains the annotations
similar to the GFF input and additionally the counts of the mapped
reads and the normalized expression levels in RPKM (reads per kilobase of
exon model per million mapped reads).
The two other output files have different specifications.
---------------------------------------------------------------------------
5.1. Output for annotations (annoOutput.gff):
---------------------------------------------------------------------------
Output columns:
<sequencename> < > < > <start> <end> <count> <strand> < > \
<ID=annotationname;[ParentID=parentname;]expression-level (RPKM)>
Example:
chr1 . . 30 . 60 + . ID=P1;30.0;
chr1 . . 75 300 20 + . ID=E1;ParentID=P1;25.0;
---------------------------------------------------------------------------
5.2. Output for reads (readOutput.gff):
---------------------------------------------------------------------------
The read-output of INSEGT contains the mapped annotations followed by their
parent annotation. If the read mapped in annotations of different parents,
the annotations are presented one after another corresponding to their parents.
Consecutive mapped annotations are separated by ":".
Overlapping annotations which are mapped by the same read interval are separated by ";".
Output columns:
<readname> <sequencename> <strand> <annotationnames> <parentname1> \
[<annotationnames> <parentname2> ...]
Example:
read_1 chr1 + E1:E2-1;E2-2:E3 P1 E10:UNKNOWN_REGION P2
---------------------------------------------------------------------------
5.3. Output for tuple (tupleOutput.gff):
---------------------------------------------------------------------------
In the tuple-output of INSEGT exons which are connected by a single read are
separated by ":" and exons which are connected by a matepair are separated by "^".
Output columns:
<sequencename> <parentname> <strand> <tupel of annotationnames> <count> <expression-level (RPKM)>
Example:
chr1 P1 + E1:E2-1:E3^E7 50 30.0
chr1 P1 + E1:E2-2:E3^E7 10 11.32
---------------------------------------------------------------------------
6. Example
---------------------------------------------------------------------------
There are small example files in the folder
seqan-trunk/apps/insegt/example/ containing alignments and
annotations. In the following you can see how to use INSEGT with default
paramters to analyze the alignments of the reads using the given annotations:
1) ./apps/insegt/insegt \
../../seqan-trunk/apps/insegt/example/alignments.sam \
../../seqan-trunk/apps/insegt/example/annotations.gff
This computes all tuple up to the length 2 and outputs them. To only output
tuple with a min. count of 2 you can use the option -c:
2) ./apps/insegt/insegt -c 2 \
../../seqan-trunk/apps/insegt/example/alignments.sam \
../../seqan-trunk/apps/insegt/example/annotations.gff
---------------------------------------------------------------------------
7. Contact
---------------------------------------------------------------------------
For questions or comments, contact:
David Weese <weese@inf.fu-berlin.de>
Marcel H. Schulz <marcel.schulz@molgen.mpg.de>
Sabrina Krakau <krakau@mi.fu-berlin.de>
|