1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291
|
NGS ROI
NGS Region of Interest Analysis
http://www.seqan.de/projects/ngs-roi
Version 0.2.2
October, 2013
------------------------------------------------------------------------------
Table of Contents
------------------------------------------------------------------------------
1. Overview
2. Examples
3. Reference and Contact
4. Reference and Contact
5. File Formats
------------------------------------------------------------------------------
1. Overview
------------------------------------------------------------------------------
NGS ROI is a framework for the analysis of NGS mapping data in SAM/BAM format
based on the idea of decoupling mapping information from the strict linear
ordering of the genome.
------------------------------------------------------------------------------
2. Examples
------------------------------------------------------------------------------
You can find the binaries bam2roi, roi_feature_projection, and
roi_plot_thumbnails in the directory /usr/lib/seqan/bin/ and the example files
in /usr/share/doc/seqan-apps/ngs_roi/example/.
The following examples require a Unix environment such as Linux or Mac Os X.
------------------------------------------------------------------------------
2.1. Help
------------------------------------------------------------------------------
The following program calls print the help to the user.
$ bam2roi --help
$ roi_feature_projection --help
$ roi_plot_thumbnails --help
------------------------------------------------------------------------------
2.2. Converting BAM to ROI
------------------------------------------------------------------------------
The file "example.bam" consists of ~10.000 records from the 2L chromosome from
a BAM file that was obtained by mapping the SRA dataset SRR618933 agains the
D. melanogaster genome and sorting by genomic coordinate.
We can generate a ROI file from this BAM file:
$ bam2roi -if example.bam -of example.roi
...
$ head -5 example.roi
#ROI 0.3
##ref start end name length strand max_count num_reads gc_content counts
2L 9 60 region_0 52 + 3 4 0.4423 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,...
2L 62 138 region_1 77 + 15 20 0.3896 2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,8,9,...
2L 143 596 region_2 454 + 18 148 0.4009 2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,8,8,...
2L 599 1252 region_3 654 + 21 207 0.4083 1,1,1,1,1,1,1,1,1,1,2,2,3,3,5,...
2L 1255 2352 region_4 1098 + 26 340 0.4071 2,2,2,3,3,3,3,3,3,3,3,...
2L 2357 2807 region_5 451 + 21 141 0.4058 1,1,1,1,1,1,1,1,1,1,6,6,8,...
2L 2812 2888 region_6 77 + 9 16 0.3896 1,1,1,1,1,1,1,1,1,1,1,1,1,...
2L 2890 3344 region_7 455 + 21 121 0.4110 1,1,1,1,1,2,2,2,2,3,3,5,5,...
------------------------------------------------------------------------------
2.2. Converting BAM to ROI
------------------------------------------------------------------------------
We can create a PNG thumbnail view of the ROI file:
$ roi_plot_thumbnails -if example.roi -o plots
...
$ ls plots*
plots0.png
------------------------------------------------------------------------------
2.3. Projecting to BED
------------------------------------------------------------------------------
The file "dmel.bed" contains a subset of the gene annotations for the
D. melanogaster genome. We had to translate some of the reference names
(e.g. chr2L to 2L) such that they match the reference name from the BAM and
ROI file.
The program roi_feature_projection reads a ROI and a BED file and create one
output ROI record for each BED record in the input. The counts array will
have the length of the BED record and the counts of the ROI records
overlapping with the BED record will be projected to the counts of the new ROI
record.
The ROI file and the BED file have be sorted by coordinate (lexicographically
reference names, then by begin and end position). You will find the script
bed_sort.sh in /usr/lib/seqan/bin/. We first use it to sort the BED file
appropriately.
$ head -5 dmel.bed
2L 8384466 8388803 CG13384-RD 0 +
2L 8384466 8388803 CG13384-RC 0 +
2L 8384466 8388803 CG13384-RE 0 +
2L 8384466 8388803 CG13384-RG 0 +
2L 8384466 8388803 CG13384-RF 0 +
$ bed_sort.sh -i dmel.bed -o dmel.sorted.bed
$ head -5 dmel.sorted.bed
2L 7528 9491 CG11023-RA 0 +
2L 9835 18583 CG2671-RA 0 -
2L 9835 18583 CG2671-RC 0 -
2L 9835 21372 CG2671-RB 0 -
2L 9835 21372 CG2671-RD 0 -
The ROI file will be sorted appropriately already after creation from BAM.
However, to demonstrate the script roi_sort.sh, here is how to sort the ROI
file by coordinate.
$ sort_roi.sh -p -i exampe.roi -o example.sorted.roi
Then, we use the program roi_feature_projection to project the ROI records to
the BED records.
$ roi_feature_projection -ir example.sorted.roi -if dmel.sorted.bed \
-or projected_bed.roi
...
$ head -5 projected_bed.roi
##ref begin_pos end_pos region_name length strand max_count counts
2L 7528 9491 CG11023-RA_0 1964 + 0 0,0,0,0,0,0,0,0,0,0,...
2L 9835 18583 CG2671-RA_0 8749 - 0 0,0,0,0,0,0,0,0,0,0,...
2L 9835 18583 CG2671-RC_0 8749 - 55 1,1,1,1,1,1,1,1,1,1,...
2L 9835 21372 CG2671-RB_0 11538 - 55 1,1,1,1,1,1,1,1,1,1,...
------------------------------------------------------------------------------
2.4. Projecting to GFF/GTF
------------------------------------------------------------------------------
Next, we demonstrate how to project the ROI file to a GTF file. The procedure
is the same for GFF files.
First, we sort the GTF file by coordinate.
$ head -5 dmel.gtf
2L dm3_flyBaseGene exon 8384467 8384895 0.000000 + . gene_id "CG13384-RD"; transcript_id "CG13384-RD";
2L dm3_flyBaseGene exon 8384997 8385097 0.000000 + . gene_id "CG13384-RD"; transcript_id "CG13384-RD";
2L dm3_flyBaseGene start_codon 8386247 8386249 0.000000 + . gene_id "CG13384-RD"; transcript_id "CG13384-RD";
2L dm3_flyBaseGene CDS 8386247 8386385 0.000000 + 0 gene_id "CG13384-RD"; transcript_id "CG13384-RD";
2L dm3_flyBaseGene exon 8386217 8386385 0.000000 + . gene_id "CG13384-RD"; transcript_id "CG13384-RD";
$ gff_sort.sh -i dmel.gtf -o dmel.sorted.gtf
$ head -5 dmel.sorted.gtf
2L dm3_flyBaseGene exon 7529 8116 0.000000 + . gene_id "CG11023-RA"; transcript_id "CG11023-RA";
2L dm3_flyBaseGene start_codon 7680 7682 0.000000 + . gene_id "CG11023-RA"; transcript_id "CG11023-RA";
2L dm3_flyBaseGene CDS 7680 8116 0.000000 + 0 gene_id "CG11023-RA"; transcript_id "CG11023-RA";
2L dm3_flyBaseGene CDS 8229 8589 0.000000 + 1 gene_id "CG11023-RA"; transcript_id "CG11023-RA";
2L dm3_flyBaseGene exon 8229 8589 0.000000 + . gene_id "CG11023-RA"; transcript_id "CG11023-RA";
Then, we project the ROI file to the GTF file, splicing together the exons
having the same transcript id.
$ roi_feature_projection -ir example.roi -if sorted.gtf \
-or projected_gtf.roi --gff-type exon --gff-group-by transcript_id
...
$ head -5 projected_gtf.roi
##ref begin_pos end_pos region_name length strand max_count counts
2L 7529 9491 CG11023-RA 1773 + 34 8,9,13,12,12,12,12,12,11...
2L 9836 18583 CG2671-RA 5407 + 55 1,1,1,1,1,1,1,1,1,1,0,0,...
2L 9836 18583 CG2671-RC 5264 - 55 1,1,1,1,1,1,1,1,1,1,0,0,...
2L 9836 21372 CG2671-RB 5154 - 55 1,1,1,1,1,1,1,1,1,1,0,0,...
2L 9836 21372 CG2671-RD 5229 - 55 1,1,1,1,1,1,1,1,1,1,0,0,...
------------------------------------------------------------------------------
2.5. Sorting and Plotting
------------------------------------------------------------------------------
We can now sort the ROI file by the the highest count (7th field, descending):
$ roi_sort.sh -n 7 -r -i projected_gtf.roi -o projected_gtf.by_max_count.roi
$ head -5 projected_gtf.by_max_count.roi
##ref begin_pos end_pos region_name length strand max_count counts
2L 72388 76203 CG11372-RA 1832 + 69 0,0,0,1,1,1,1,1,1,1,1,1,1,..
2L 9836 18583 CG2671-RA 5407 + 55 1,1,1,1,1,1,1,1,1,1,0,0,0,..
2L 9836 18583 CG2671-RC 5264 - 55 1,1,1,1,1,1,1,1,1,1,0,0,0,..
2L 9836 21372 CG2671-RB 5154 - 55 1,1,1,1,1,1,1,1,1,1,0,0,0,..
Let us extract the records that do not have a max_count of 0.
$ awk '($7 > 0) { print $0) }' projected_gtf.by_max_count.roi > to_plot.roi
$ roi_plot_9.sh -i to_plot.roi -o out.pdf
Note that the generated PDF file already contains links to a local IGV
instance listening on port 60151. If you haven't already, you should start an
IGV instance, load the drosophila reference genome, add some feature tracks
and enable remote control via HTTP in IGV. Clicking on a plot in the PDF file
will open this locus in IGV.
------------------------------------------------------------------------------
2.6. Computing Metrics
------------------------------------------------------------------------------
We can compute some metrics using the script roi_metrics.Rscript. This script
can also serve as a starting point for your own scripts:
$ roi_metrics.Rscript -i to_plot.roi -o metrics.roi
$ head -3 metrics.roi
# ROI written from R
##ref begin_pos end_pos region_name length strand max_count min median mean quantile75 quantile95 aoc xreaXX r3linear distMax3p distMax5p counts
2L 72388 76203 CG11372-RA 1832 + 69 0 7 11.8733624454148 19 44 0.172077716600215 0.328195209163977 0.00369582266667077 1286 1286 0,0,...
Let us now sort descendingly by mean coverage.
$ roi_sort.sh -n 10 -r metrics.roi -o metrics.by_mean.roi
And we can also plot this:
$ roi_plot_9.sh -i metric.by_mean.roi -o out.pdf
------------------------------------------------------------------------------
3. Reference and Contact
------------------------------------------------------------------------------
Reference
Holtgrewe M., Coppee, J.-Y., Reinert, K., Jagla, B. Novel Post-Alignment
Visualization and Characterization of High-Throughput Sequencing
Experiments. Unpublished.
Contact
Bernd Jagla <bernd.jagla@pasteur.fr>
------------------------------------------------------------------------------
4. Galaxy Integration
------------------------------------------------------------------------------
The integration of ngs_roi is available in the Galaxy Tool Shed (Testing)
http://testtoolshed.g2.bx.psu.edu/view/holtgrewe/ngs_roi
------------------------------------------------------------------------------
5. File Formats
------------------------------------------------------------------------------
------------------------------------------------------------------------------
5.1. ROI Format
------------------------------------------------------------------------------
The file begins with a sequence of comments, followed by one column header,
followed by the records.
Comments start with a single hash (#) and are ignored in analysis.
The column header starts with two hashes (##) and is followed by the tab
separated column names. The column names must not contain spaces.
The columns 1-7 are fixed. The last column is always "counts". The columns are
as follows
1. ref -- The reference name.
2. begin_pos -- The begin position.
3. end_pos -- The end position.
4. region_name -- The name of the region.
5. length -- The length of the region.
6. strand -- The strand, one of "+" and "-". "+" in case of being not
strand-specific.
7. max_count -- The largest value of the counts columns.
additional columns
Additional annotation data for the ROI.
N. counts -- Comma-separated list of length unsignd integers.
------------------------------------------------------------------------------
5.2. BED, GFF and GTF format
------------------------------------------------------------------------------
See the following external resources for a specification of these formats:
BED http://genome.ucsc.edu/FAQ/FAQformat.html#format1
GFF http://www.sanger.ac.uk/resources/software/gff/spec.html
http://www.sequenceontology.org/gff3.shtml
GTF http://mblab.wustl.edu/GTF22.html
|