File: README

package info (click to toggle)
seqan2 2.4.0+dfsg-11
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 223,504 kB
  • sloc: cpp: 256,886; ansic: 91,672; python: 8,339; sh: 995; xml: 570; makefile: 251; awk: 51
file content (291 lines) | stat: -rw-r--r-- 12,522 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
                                   NGS ROI
                       NGS Region of Interest Analysis

                     http://www.seqan.de/projects/ngs-roi
                                Version 0.2.2
                                October, 2013


------------------------------------------------------------------------------
Table of Contents
------------------------------------------------------------------------------

  1. Overview
  2. Examples
  3. Reference and Contact
  4. Reference and Contact
  5. File Formats

------------------------------------------------------------------------------
1. Overview
------------------------------------------------------------------------------

NGS ROI is a framework for the analysis of NGS mapping data in SAM/BAM format
based on the idea of decoupling mapping information from the strict linear
ordering of the genome.

------------------------------------------------------------------------------
2. Examples
------------------------------------------------------------------------------

You can find the binaries bam2roi, roi_feature_projection, and
roi_plot_thumbnails in the directory /usr/lib/seqan/bin/ and the example files
in /usr/share/doc/seqan-apps/ngs_roi/example/.

The following examples require a Unix environment such as Linux or Mac Os X.

------------------------------------------------------------------------------
2.1. Help
------------------------------------------------------------------------------

The following program calls print the help to the user.

  $ bam2roi --help
  $ roi_feature_projection --help
  $ roi_plot_thumbnails --help

------------------------------------------------------------------------------
2.2. Converting BAM to ROI
------------------------------------------------------------------------------

The file "example.bam" consists of ~10.000 records from the 2L chromosome from
a BAM file that was obtained by mapping the SRA dataset SRR618933 agains the
D. melanogaster genome and sorting by genomic coordinate.

We can generate a ROI file from this BAM file:

  $ bam2roi -if example.bam -of example.roi
  ...

  $ head -5 example.roi
  #ROI 0.3
  ##ref	start	end	name	length	strand	max_count	num_reads	gc_content	counts
  2L	9	60	region_0	52	+	3	4	0.4423	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,...
  2L	62	138	region_1	77	+	15	20	0.3896	2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,8,9,...
  2L	143	596	region_2	454	+	18	148	0.4009	2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,8,8,...
  2L	599	1252	region_3	654	+	21	207	0.4083	1,1,1,1,1,1,1,1,1,1,2,2,3,3,5,...
  2L	1255	2352	region_4	1098	+	26	340	0.4071	2,2,2,3,3,3,3,3,3,3,3,...
  2L	2357	2807	region_5	451	+	21	141	0.4058	1,1,1,1,1,1,1,1,1,1,6,6,8,...
  2L	2812	2888	region_6	77	+	9	16	0.3896	1,1,1,1,1,1,1,1,1,1,1,1,1,...
  2L	2890	3344	region_7	455	+	21	121	0.4110	1,1,1,1,1,2,2,2,2,3,3,5,5,...

------------------------------------------------------------------------------
2.2. Converting BAM to ROI
------------------------------------------------------------------------------

We can create a PNG thumbnail view of the ROI file:

  $ roi_plot_thumbnails -if example.roi -o plots
  ...

  $ ls plots*
  plots0.png

------------------------------------------------------------------------------
2.3. Projecting to BED
------------------------------------------------------------------------------

The file "dmel.bed" contains a subset of the gene annotations for the
D. melanogaster genome.  We had to translate some of the reference names
(e.g. chr2L to 2L) such that they match the reference name from the BAM and
ROI file.

The program roi_feature_projection reads a ROI and a BED file and create one
output ROI record for each BED record in the input.  The counts array will
have the length of the BED record and the counts of the ROI records
overlapping with the BED record will be projected to the counts of the new ROI
record.

The ROI file and the BED file have be sorted by coordinate (lexicographically
reference names, then by begin and end position).  You will find the script
bed_sort.sh in /usr/lib/seqan/bin/.  We first use it to sort the BED file
appropriately.

  $ head -5 dmel.bed
  2L	8384466	8388803	CG13384-RD	0	+
  2L	8384466	8388803	CG13384-RC	0	+
  2L	8384466	8388803	CG13384-RE	0	+
  2L	8384466	8388803	CG13384-RG	0	+
  2L	8384466	8388803	CG13384-RF	0	+

  $ bed_sort.sh -i dmel.bed -o dmel.sorted.bed

  $ head -5 dmel.sorted.bed
  2L	7528	9491	CG11023-RA	0	+
  2L	9835	18583	CG2671-RA	0	-
  2L	9835	18583	CG2671-RC	0	-
  2L	9835	21372	CG2671-RB	0	-
  2L	9835	21372	CG2671-RD	0	-

The ROI file will be sorted appropriately already after creation from BAM.
However, to demonstrate the script roi_sort.sh, here is how to sort the ROI
file by coordinate.

  $ sort_roi.sh -p -i exampe.roi -o example.sorted.roi

Then, we use the program roi_feature_projection to project the ROI records to
the BED records.

  $ roi_feature_projection -ir example.sorted.roi -if dmel.sorted.bed \
      -or projected_bed.roi
  ...

  $ head -5 projected_bed.roi
  ##ref   begin_pos       end_pos region_name     length  strand  max_count       counts
  2L      7528    9491    CG11023-RA_0    1964    +       0       0,0,0,0,0,0,0,0,0,0,...
  2L      9835    18583   CG2671-RA_0     8749    -       0       0,0,0,0,0,0,0,0,0,0,...
  2L      9835    18583   CG2671-RC_0     8749    -       55      1,1,1,1,1,1,1,1,1,1,...
  2L      9835    21372   CG2671-RB_0     11538   -       55      1,1,1,1,1,1,1,1,1,1,...

------------------------------------------------------------------------------
2.4. Projecting to GFF/GTF
------------------------------------------------------------------------------

Next, we demonstrate how to project the ROI file to a GTF file.  The procedure
is the same for GFF files.

First, we sort the GTF file by coordinate.

  $ head -5 dmel.gtf
  2L	dm3_flyBaseGene	exon	8384467	8384895	0.000000	+	.	gene_id "CG13384-RD"; transcript_id "CG13384-RD";
  2L	dm3_flyBaseGene	exon	8384997	8385097	0.000000	+	.	gene_id "CG13384-RD"; transcript_id "CG13384-RD";
  2L	dm3_flyBaseGene	start_codon	8386247	8386249	0.000000	+	.	gene_id "CG13384-RD"; transcript_id "CG13384-RD";
  2L	dm3_flyBaseGene	CDS	8386247	8386385	0.000000	+	0	gene_id "CG13384-RD"; transcript_id "CG13384-RD";
  2L	dm3_flyBaseGene	exon	8386217	8386385	0.000000	+	.	gene_id "CG13384-RD"; transcript_id "CG13384-RD";

  $ gff_sort.sh -i dmel.gtf -o dmel.sorted.gtf

  $ head -5 dmel.sorted.gtf
  2L	dm3_flyBaseGene	exon	7529	8116	0.000000	+	.	gene_id "CG11023-RA"; transcript_id "CG11023-RA";
  2L	dm3_flyBaseGene	start_codon	7680	7682	0.000000	+	.	gene_id "CG11023-RA"; transcript_id "CG11023-RA";
  2L	dm3_flyBaseGene	CDS	7680	8116	0.000000	+	0	gene_id "CG11023-RA"; transcript_id "CG11023-RA";
  2L	dm3_flyBaseGene	CDS	8229	8589	0.000000	+	1	gene_id "CG11023-RA"; transcript_id "CG11023-RA";
  2L	dm3_flyBaseGene	exon	8229	8589	0.000000	+	.	gene_id "CG11023-RA"; transcript_id "CG11023-RA";

Then, we project the ROI file to the GTF file, splicing together the exons
having the same transcript id.

  $ roi_feature_projection -ir example.roi -if sorted.gtf \
      -or projected_gtf.roi --gff-type exon --gff-group-by transcript_id
  ...
  $ head -5 projected_gtf.roi
  ##ref   begin_pos       end_pos region_name     length  strand  max_count       counts
  2L      7529    9491    CG11023-RA      1773    +       34      8,9,13,12,12,12,12,12,11...
  2L      9836    18583   CG2671-RA       5407    +       55      1,1,1,1,1,1,1,1,1,1,0,0,...
  2L      9836    18583   CG2671-RC       5264    -       55      1,1,1,1,1,1,1,1,1,1,0,0,...
  2L      9836    21372   CG2671-RB       5154    -       55      1,1,1,1,1,1,1,1,1,1,0,0,...
  2L      9836    21372   CG2671-RD       5229    -       55      1,1,1,1,1,1,1,1,1,1,0,0,...

------------------------------------------------------------------------------
2.5. Sorting and Plotting
------------------------------------------------------------------------------

We can now sort the ROI file by the the highest count (7th field, descending):

  $ roi_sort.sh -n 7 -r -i projected_gtf.roi -o projected_gtf.by_max_count.roi

  $ head -5 projected_gtf.by_max_count.roi
  ##ref   begin_pos       end_pos region_name     length  strand  max_count       counts
  2L      72388   76203   CG11372-RA      1832    +       69      0,0,0,1,1,1,1,1,1,1,1,1,1,..
  2L      9836    18583   CG2671-RA       5407    +       55      1,1,1,1,1,1,1,1,1,1,0,0,0,..
  2L      9836    18583   CG2671-RC       5264    -       55      1,1,1,1,1,1,1,1,1,1,0,0,0,..
  2L      9836    21372   CG2671-RB       5154    -       55      1,1,1,1,1,1,1,1,1,1,0,0,0,..

Let us extract the records that do not have a max_count of 0.

  $ awk '($7 > 0) { print $0) }' projected_gtf.by_max_count.roi > to_plot.roi
  $ roi_plot_9.sh -i to_plot.roi -o out.pdf

Note that the generated PDF file already contains links to a local IGV
instance listening on port 60151.  If you haven't already, you should start an
IGV instance, load the drosophila reference genome, add some feature tracks
and enable remote control via HTTP in IGV.  Clicking on a plot in the PDF file
will open this locus in IGV.

------------------------------------------------------------------------------
2.6. Computing Metrics
------------------------------------------------------------------------------

We can compute some metrics using the script roi_metrics.Rscript.  This script
can also serve as a starting point for your own scripts:

  $ roi_metrics.Rscript -i to_plot.roi -o metrics.roi

  $ head -3 metrics.roi
  # ROI written from R
  ##ref	begin_pos	end_pos	region_name	length	strand	max_count	min	median	mean	quantile75	quantile95	aoc	xreaXX	r3linear	distMax3p	distMax5p	counts
  2L	72388	76203	CG11372-RA	1832	+	69	0	7	11.8733624454148	19	44	0.172077716600215	0.328195209163977	0.00369582266667077	1286	1286	0,0,...

Let us now sort descendingly by mean coverage.

  $ roi_sort.sh -n 10 -r metrics.roi -o metrics.by_mean.roi

And we can also plot this:

  $ roi_plot_9.sh -i metric.by_mean.roi -o out.pdf

------------------------------------------------------------------------------
3. Reference and Contact
------------------------------------------------------------------------------

Reference

  Holtgrewe M., Coppee, J.-Y., Reinert, K., Jagla, B.  Novel Post-Alignment
  Visualization and Characterization of High-Throughput Sequencing
  Experiments.  Unpublished.

Contact

  Bernd Jagla <bernd.jagla@pasteur.fr>

------------------------------------------------------------------------------
4. Galaxy Integration
------------------------------------------------------------------------------

The integration of ngs_roi is available in the Galaxy Tool Shed (Testing)

  http://testtoolshed.g2.bx.psu.edu/view/holtgrewe/ngs_roi

------------------------------------------------------------------------------
5. File Formats
------------------------------------------------------------------------------

------------------------------------------------------------------------------
5.1. ROI Format
------------------------------------------------------------------------------

The file begins with a sequence of comments, followed by one column header,
followed by the records.

Comments start with a single hash (#) and are ignored in analysis.

The column header starts with two hashes (##) and is followed by the tab
separated column names. The column names must not contain spaces.

The columns 1-7 are fixed. The last column is always "counts". The columns are
as follows

  1. ref         -- The reference name.
  2. begin_pos   -- The begin position.
  3. end_pos     -- The end position.
  4. region_name -- The name of the region.
  5. length      -- The length of the region.
  6. strand      -- The strand, one of "+" and "-". "+" in case of being not
                    strand-specific.
  7. max_count   -- The largest value of the counts columns.

  additional columns
     Additional annotation data for the ROI.

  N. counts       -- Comma-separated list of length unsignd integers.

------------------------------------------------------------------------------
5.2. BED, GFF and GTF format
------------------------------------------------------------------------------

See the following external resources for a specification of these formats:

  BED http://genome.ucsc.edu/FAQ/FAQformat.html#format1
  GFF http://www.sanger.ac.uk/resources/software/gff/spec.html
      http://www.sequenceontology.org/gff3.shtml
  GTF http://mblab.wustl.edu/GTF22.html