\documentclass[12pt]{report}
\usepackage{graphicx,fancyvrb,url,comment,longtable,color,ccaption}
\usepackage{amsmath}
\textwidth=6.6in
\textheight=8.7in
\oddsidemargin=-0.1in
\evensidemargin=-0.1in
\headheight=0in
\headsep=0in
\topmargin=-0.1in
\DefineVerbatimEnvironment{Rcode}{Verbatim}{fontsize=\footnotesize}
\newcommand{\code}[1]{{\small\texttt{#1}}}
\newcommand{\Subread}{\textsf{Subread}}
\newcommand{\Subjunc}{\textsf{Subjunc}}
\newcommand{\Rsubread}{\textsf{Rsubread}}
\newcommand{\ExactSNP}{\textsf{ExactSNP}}
\newcommand{\limma}{\textsf{limma}}
\newcommand{\edgeR}{\textsf{edgeR}}
\newcommand{\DGEList}{\textsf{DGEList}}
\newcommand{\voom}{\textsf{voom}}
\newcommand{\featureCounts}{\textsf{featureCounts}}
\newcommand{\repair}{\textsf{repair}}
\newcommand{\R}{\textsf{R}}
\newcommand{\C}{\textsf{C}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\excludecomment{hide}
\makeindex
\begin{document}
\begin{titlepage}
\begin{center}
{\Huge\bf Subread/Rsubread Users Guide}\\
\vspace{1 cm}
{\centering\large Subread v1.5.1/Rsubread v1.23.4\\}
\vspace{1 cm}
\centering 25 August 2016\\
\vspace{5 cm}
\Large Wei Shi and Yang Liao\\
\vspace{1 cm}
\small
{\large Bioinformatics Division\\
The Walter and Eliza Hall Institute of Medical Research\\
The University of Melbourne\\
Melbourne, Australia\\}
\vspace{7 cm}
\centering Copyright \small{\copyright} 2011 - 2016\\
\end{center}
\end{titlepage}
\tableofcontents
\chapter{Introduction}
The Subread/Rsubread packages comprise a suite of high-performance software programs for processing next-generation sequencing data.
Included in these packages are \code{Subread} aligner, \code{Subjunc} aligner, \code{Subindel} long indel detection program, \code{featureCounts} read quantification program, \code{exactSNP} SNP calling program and other utility programs.
This document provides a detailed description of the programs included in the packages.
\code{Subread} and \code{Subjunc} aligners adopt a mapping paradigm called ``seed-and-vote'' \cite{liao}.
This is an elegantly simple multi-seed strategy for mapping reads to a reference genome.
This strategy chooses the mapped genomic location for the read directly from the seeds.
It uses a relatively large number of short seeds (called subreads) extracted from each read and allows all the seeds to vote on the optimal location.
When the read length is $<$160 bp, overlapping subreads are used.
More conventional alignment algorithms are then used to fill in detailed mismatch and indel information between the subreads that make up the winning voting block.
The strategy is fast because the overall genomic location has already been chosen before the detailed alignment is done.
It is sensitive because no individual subread is required to map exactly, nor are individual subreads constrained to map close by other subreads.
It is accurate because the final location must be supported by several different subreads. The strategy extends easily to find exon junctions, by locating reads that contain sets of subreads mapping to different exons of the same gene.
It scales up efficiently for longer reads.
\code{Subread} is a general-purpose read aligner.
It can be used to align reads generated from both genomic DNA sequencing and RNA sequencing technologies.
It has been successfully used in a number of high-profile studies \cite{TangNC2013,ManNI2013,SpangenbergSCR2013,tang,ezh2}.
\code{Subjunc} is specifically designed to detect exon-exon junctions and to perform full alignments for RNA-seq reads.
Note that \code{Subread} performs local alignments for RNA-seq reads, whereas \code{Subjunc} performs global alignments for RNA-seq reads.
\code{Subread} and \code{Subjunc} include a read re-alignment step in which reads are re-aligned using genomic variation data and junction data collected from the initial mapping.
The \code{Subindel} program carries out local read assembly to discover long insertions and deletions.
Read mapping should be performed before running this program.
The {\featureCounts} program is designed to assign mapped reads or fragments (paired-end data) to genomic features such as genes, exons and promoters.
It is a light-weight read counting program suitable for counting both gDNA-seq and RNA-seq reads for genomic features \cite{fcounts}.
The \textsf{Subread-featureCounts-limma/voom} pipeline has been found to be one of the best-performing pipelines for the analyses of RNA-seq data by the SEquencing Quality Control (SEQC) study, the third stage of the well-known MicroArray Quality Control (MAQC) project \cite{seqc}.
Also included in this software suite is a very efficient SNP caller -- {\ExactSNP}.
{\ExactSNP} measures local background noise for each candidate SNP and then uses that information to accurately call SNPs.
These software programs support a variety of sequencing platforms including Illumina GA/HiSeq, ABI SOLiD, Life Science 454, Helicos Heliscope and Ion Torrent. They are released in two packages -- SourceForge \emph{Subread} package and Bioconductor \emph{Rsubread} package.
\chapter{Preliminaries}
\section{Citation}
If you use {\Subread} or {\Subjunc} aligners, please cite:\\
\begin{quote}
Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013
\\
{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/23558742}} }
\end{quote}
If you use \featureCounts, please cite:\\
\begin{quote}
Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30, 2014
\\
{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/24227677}}}
\end{quote}
\section{Download and installation}
\subsection{SourceForge {\Subread} package}
\subsubsection{Installation from a binary distribution}
This is the easiest way to install the {\Subread} package onto your computer.
Download a {\Subread} binary distribution that suits your operating system from the SourceForge website {\color{blue}{\url{http://subread.sourceforge.net}}}. The operating systems currently supported include multiple variants of Linux (Debian, Ubuntu, Fedora and CentOS) and Mac OS X. Both 64-bit and 32-bit machines are supported. The executables can be found in the `bin' directory of the binary package.
To install the {\Subread} package on other operating systems such as FreeBSD and Solaris, you will have to install it from the source.
\subsubsection{Installation from the source package}
Download {\Subread} source package to your working directory from SourceForge \\
{\color{blue}{\url{http://subread.sourceforge.net}}}, and type the following command to uncompress it:\\
\code{tar zxvf subread-1.x.x.tar.gz}\\
Enter the \code{src} directory of the package and issue the following command to install it on a Linux operating system: \\
\code{make -f Makefile.Linux}\\
To install it on a Mac OS X operating system, issue the following command:\\
\code{make -f Makefile.MacOS}\\
To install it on a FreeBSD operating system, issue the following command:\\
\code{make -f Makefile.FreeBSD}\\
To install it on Oracle Solaris or OpenSolaris computer operating systems, issue the following command:\\
\code{make -f Makefile.SunOS}\\
To install it on a Windows computer, you will first need to install a Unix-like environment such as Cygwin and then install the {\Subread} package.\\
A new directory called \code{bin} will be created under the home directory of the software package, and the executables generated from the compilation are saved to that directory.
To enable easy access to these executables, you may copy them to a system directory such as \code{/usr/bin} or add the path to them to your search path (your search path is usually specified in the environment variable \code{`PATH'}).
\subsection{Bioconductor {\Rsubread} package}
You need to have {\R} installed on your computer to install this package.
Launch an {\R} session and issue the following command to install it:
\begin{Rcode}
source("http://bioconductor.org/biocLite.R")
biocLite("Rsubread")
\end{Rcode}
Alternatively, you may download the {\Rsubread} source package directly from {\color{blue}{\url{http://bioconductor.org/packages/release/bioc/html/Rsubread.html}} } and install it to your {\R} from the source.
\section{How to get help}
The Bioconductor mailing list (\url{http://bioconductor.org/}) and the SEQanswers forum (\url{http://www.seqanswers.com}) are the best places to get help and to report bugs.
Alternatively, you may contact Wei Shi (shi at wehi dot edu dot au) directly.
\chapter{The seed-and-vote mapping paradigm}
\section{Seed-and-vote}
We have developed a new read mapping paradigm called ``seed-and-vote'' for efficient, accurate and scalable read mapping \cite{liao}.
The seed-and-vote strategy uses a number of overlapping seeds from each read, called \emph{subreads}.
Instead of trying to pick the best seed, the strategy allows all the seeds to vote on the optimal location for the read.
The algorithm then uses more conventional alignment algorithms to fill in detailed mismatch and indel information between the subreads that make up the winning voting block.
The following figure illustrates the proposed seed-and-vote mapping approach with a toy example.
\begin{center}
\includegraphics[scale=0.3]{seed-and-vote.png}
\end{center}
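The voting step can be illustrated with a small {\R} sketch.
It is a toy illustration only and not the implementation used by \code{Subread}: it assumes 4bp subreads rather than 16bp ones, uses exact matching, and searches the reference by brute force instead of through an index.
The function names \code{extract\_subreads} and \code{vote} and the example sequences are hypothetical.
\begin{Rcode}
# Toy seed-and-vote sketch (not the Subread implementation).
# Assumptions: 4bp subreads, exact matching, brute-force search instead of an index.
extract_subreads <- function(read, k = 4, step = 2) {
  starts <- seq(1, nchar(read) - k + 1, by = step)
  data.frame(offset = starts - 1,
             seq = substring(read, starts, starts + k - 1),
             stringsAsFactors = FALSE)
}

vote <- function(read, genome, k = 4, step = 2) {
  sr <- extract_subreads(read, k, step)
  g_starts <- seq_len(nchar(genome) - k + 1)
  g_kmers <- substring(genome, g_starts, g_starts + k - 1)
  votes <- integer(0)
  for (i in seq_len(nrow(sr))) {
    hits <- g_starts[g_kmers == sr$seq[i]]
    # A subread hit at genomic position p implies the read starts at p - offset.
    votes <- c(votes, hits - sr$offset[i])
  }
  tab <- table(votes)
  list(location = as.numeric(names(tab)[which.max(tab)]),
       votes = as.integer(max(tab)))
}

genome <- "TTGACCATGCAAGTCCGATTAGCA"
exact  <- vote("ATGCAAGTCCGA", genome)   # read copied from positions 7 to 18
noisy  <- vote("ATGCATGTCCGA", genome)   # same read with one mismatched base
\end{Rcode}
In this sketch all five subreads of the exact read vote for position 7.
The read with one mismatch loses the two subreads that overlap the mismatch, but the remaining three still vote for position 7, which shows why no individual subread needs to map for the read to be placed.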
Two aligners have been developed under the seed-and-vote paradigm, including \code{Subread} and \code{Subjunc}.
\code{Subread} is a general-purpose read aligner, which can be used to map both genomic DNA-seq and RNA-seq read data.
Its running time is determined by the number of \emph{subreads} extracted from each read, not by the read length.
Thus it has excellent mapping scalability, ie. its running time increases only modestly as the read length increases.
\code{Subread} uses the largest mappable region in the read to determine its mapping location, therefore it automatically determines whether a global alignment or a local alignment should be found for the read.
For exon-spanning reads in an RNA-seq dataset, \code{Subread} performs local alignments to find the target regions in the reference genome that have the largest overlap with them.
Note that \code{Subread} does not perform global alignments for the exon-spanning reads and it soft clips those read bases which could not be mapped.
However, the \code{Subread} mapping result is sufficient for carrying out the gene-level expression analysis using RNA-seq data, because the mapped read bases can be reliably used to assign reads, including both exonic reads and exon-spanning reads, to genes.
To get the full alignments for exon-spanning RNA-seq reads, the \code{Subjunc} aligner can be used.
\code{Subjunc} is designed to discover exon-exon junctions from RNA-seq data, but it performs full alignments for all the reads at the same time.
The \code{Subjunc} mapping results should be used for detecting genomic variations in RNA-seq data, allele-specific expression analysis and exon-level gene expression analysis.
Section~\ref{sec:junction} describes how exon-exon junctions are discovered and how exon-spanning reads are aligned using the seed-and-vote paradigm.
\section{Detection of short indels}
\begin{center}
\includegraphics[scale=0.3]{indel.png}
\end{center}
The seed-and-vote paradigm is very powerful in detecting short indels (insertions and deletions).
The figure above shows how we use the \emph{subreads} to confidently detect short indels.
When a read contains an indel, the mapping locations of subreads extracted after the indel will be shifted to the left (insertion) or to the right (deletion), relative to the mapping locations of subreads at the left side of the indel.
Therefore, indels in the reads can be readily detected by examining the difference in mapping locations of the extracted subreads.
Moreover, the number of bases by which the mapping locations of subreads are shifted gives the precise length of the indel.
Since no mismatches are allowed in the mapping of the subreads, the indels can be detected with a very high accuracy.
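The shift described above can be made concrete with a short {\R} sketch.
It is an illustration under simplifying assumptions rather than the procedure used by the aligners: it assumes a single indel per read, assumes that the first and last subreads lie on opposite sides of it, and uses made-up mapping positions.
The function name \code{detect\_indel} is hypothetical.
\begin{Rcode}
# Toy indel detection from subread mapping locations.
# offset: 0-based start of each subread within the read
# pos:    genomic position at which each subread mapped (hypothetical values)
detect_indel <- function(offset, pos) {
  implied_start <- pos - offset     # read start implied by each subread
  shift <- implied_start[length(implied_start)] - implied_start[1]
  if (shift == 0) return(list(type = "none", length = 0))
  # Right shift (positive) indicates a deletion, left shift an insertion.
  list(type = if (shift > 0) "deletion" else "insertion",
       length = abs(shift))
}

offset <- c(0, 10, 20, 30, 40)
# Subreads after a 3bp deletion in the read map 3 bases further to the right.
del <- detect_indel(offset, pos = c(1000, 1010, 1020, 1033, 1043))
# Subreads after a 2bp insertion in the read map 2 bases further to the left.
ins <- detect_indel(offset, pos = c(1000, 1010, 1018, 1028, 1038))
\end{Rcode}
Here \code{del} is reported as a deletion of length 3 and \code{ins} as an insertion of length 2, because the size of the shift equals the indel length.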
\section{Detection of exon-exon junctions}
\label{sec:junction}
The seed-and-vote paradigm is also very useful in detecting exon-exon junctions, because the short subreads extracted across the entire read can be used to detect short exons in a sensitive and accurate way.
The figure below shows the schematic of detecting exon-exon junctions and mapping RNA-seq reads by \code{Subjunc}, which uses this paradigm.
The first scan detects all possible exon-exon junctions using the mapping locations of the subreads extracted from each read.
Matched donor (`GT') and acceptor (`AG') sites are required for calling junctions.
Exons as short as 16bp can be detected in this step.
The second scan verifies the putative exon-exon junctions discovered from the first scan by performing re-alignments for the junction reads.
The output from \code{Subjunc} includes the list of verified junctions and also the mapping results for all the reads.
The orientation of splice sites is indicated by the `XA' tag in the optional fields of the mapping output.
By default, \code{Subjunc} only reports canonical exon-exon junctions it has discovered (ie. presence of donor (`GT') and acceptor (`AG') sites is required).
However, users may turn on `--allJunctions' option to instruct \code{Subjunc} to report all junctions including both canonical and non-canonical ones.
\begin{center}
\includegraphics[scale=0.5]{junction.png}
\end{center}
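The canonical splice-site requirement can be sketched in {\R} as follows.
This is a minimal illustration and not the \code{Subjunc} code: it assumes a forward-strand reference held in a single string and 1-based inclusive intron coordinates, and it checks that a candidate intron starts with the donor dinucleotide `GT' and ends with the acceptor dinucleotide `AG'.
The function name \code{is\_canonical} and the example sequence are hypothetical.
\begin{Rcode}
# Check whether a candidate intron has canonical splice-site dinucleotides.
# genome: reference sequence as a single string (forward strand only)
# intron_start, intron_end: 1-based, inclusive coordinates of the intron
is_canonical <- function(genome, intron_start, intron_end) {
  donor    <- substring(genome, intron_start, intron_start + 1)
  acceptor <- substring(genome, intron_end - 1, intron_end)
  donor == "GT" && acceptor == "AG"
}

# Layout: exon (1-6), intron (7-16), exon (17-22)
genome <- "ACCTGAGTAAGCTTAGCTGACC"
ok  <- is_canonical(genome, 7, 16)   # intron starts with GT and ends with AG
bad <- is_canonical(genome, 8, 16)   # shifted by one base: starts with TA
\end{Rcode}
A junction whose boundaries are off by a single base fails the check, which is one reason the dinucleotide test helps to pin down the exact junction position.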
\section{Detection of structural variants (SVs)}
\code{Subread} and \code{Subjunc} can be used to detect SV events including long indels, duplications, inversions and translocations, in RNA-seq and genomic DNA-seq data.
Detection of long indels is conducted by performing local read assembly.
When the specified indel length (`-I' option in SourceForge \code{C} or `indels' argument in \code{Rsubread}) is greater than 16, \code{Subread} and \code{Subjunc} will automatically start the read assembly process to detect long indels (up to 200bp).
Breakpoints detected from SV events will be saved to a text file (`.breakpoint.txt'), which includes chromosomal coordinates of breakpoints and also the number of reads supporting each pair of breakpoints found from the same SV event.
For reads found to contain SV breakpoints, extra tags are added in the mapping output.
These tags include CC (chromosome name), CP (mapping position), CG (CIGAR string) and CT (strand), and they describe the secondary alignment of the read (the primary alignment is described in the main fields).
\section{Two-scan read alignment}
\code{Subread} and \code{Subjunc} aligners employ a two-scan approach for read mapping.
In the first scan, the aligners use the seed-and-vote method to identify candidate mapping locations for each read and also discover short indels, exon-exon junctions and structural variants.
In the second scan, they carry out final alignment for each read using the variant and junction information.
Variant and junction data (including chromosomal coordinates and number of supporting reads) will be output along with the read mapping results.
To the best of our knowledge, \code{Subread} and \code{Subjunc} are the first to employ a two-scan mapping strategy to achieve a superior mapping accuracy.
This strategy was later adopted by other aligners as well (called `two-pass').
\section{Multi-mapping reads}
Multi-mapping reads are those reads that map to more than one genomic location with the same similarity score (eg. number of mis-matched bases).
\code{Subread} and \code{Subjunc} aligners can effectively detect multi-mapping reads by closely examining candidate locations which receive the highest number of votes or second highest number of votes.
Numbers of mis-matched bases and matched bases are counted for each candidate location during the final re-alignment step and they are used for identifying multi-mapping reads.
For RNA-seq data, a read is called a multi-mapping read if it has two or more candidate mapping locations that have the same number of mis-matched bases and this number is the smallest among all candidate locations being considered.
For genomic DNA-seq data, a read is called a multi-mapping read if it has two or more candidate locations that have the same number of matched bases and this number is the largest among all candidate locations being considered.
Note that for both RNA-seq and genomic DNA-seq data, any alignment reported for a multi-mapping read must not have more than the threshold number of mis-matched bases (as specified in the `-M' parameter).
For the reporting of a multi-mapping read, users can choose to not report any alignment for the read (`-u' option) or report up to a pre-defined number of alignments (`-B' option).
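The two definitions can be written as a short {\R} sketch.
It is an illustration of the rules as stated in this section, not the aligners' code; the function name \code{is\_multi\_mapping} and the example counts are hypothetical, and the mismatch threshold is left out for brevity.
\begin{Rcode}
# Decide whether a read is multi-mapping from its candidate locations.
# mismatches, matches: one value per candidate location (hypothetical numbers)
is_multi_mapping <- function(mismatches, matches, type = c("rna", "dna")) {
  type <- match.arg(type)
  if (type == "rna") {
    # RNA-seq: two or more locations share the smallest mismatch count.
    sum(mismatches == min(mismatches)) >= 2
  } else {
    # DNA-seq: two or more locations share the largest matched-base count.
    sum(matches == max(matches)) >= 2
  }
}

# Three candidate locations for one read.
mismatches <- c(1, 1, 3)
matches    <- c(99, 97, 95)
rna_multi <- is_multi_mapping(mismatches, matches, "rna")
dna_multi <- is_multi_mapping(mismatches, matches, "dna")
\end{Rcode}
With these counts the read is multi-mapping under the RNA-seq rule (two locations tie on one mismatch) but not under the genomic DNA-seq rule (a single location has the most matched bases).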
\section{Mapping of paired-end reads}
For the mapping of paired-end reads, we use the following formula to obtain a list of candidate mapping locations for each read pair:
$$PE_{score} = w * (V_1 + V_2) $$
where $V_1$ and $V_2$ are the number of votes received from two reads from the same pair, respectively.
$w$ has a value of 1.3 if mapping locations of the two reads are within the nominal paired-end distance (or nominal fragment length), and has a value of 1 otherwise.
Up to 4,096 possible alignments will be examined for each read pair and a maximum of three candidate alignments with the highest $PE_{score}$ will be chosen for final re-alignment.
Total number of matched bases (for genomic DNA-seq data) or mis-matched bases (for RNA-seq data) will be used to determine the best mapping in the final re-alignment step.
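The scoring formula can be tried out with a few lines of {\R}.
This is a sketch under stated assumptions: ``within the nominal paired-end distance'' is interpreted here as a fragment length between the default minimum (50) and maximum (600) fragment lengths, and the function name \code{pe\_score} and the candidate values are hypothetical.
\begin{Rcode}
# Paired-end score: w * (V1 + V2), with w = 1.3 for properly spaced pairs.
# Assumption: "nominal distance" means min_frag <= distance <= max_frag.
pe_score <- function(v1, v2, distance, min_frag = 50, max_frag = 600) {
  w <- ifelse(distance >= min_frag & distance <= max_frag, 1.3, 1)
  w * (v1 + v2)
}

# Three candidate locations for one read pair (made-up vote counts).
cand <- data.frame(v1 = c(9, 8, 5),
                   v2 = c(8, 7, 4),
                   distance = c(25000, 300, 450))
cand$score <- pe_score(cand$v1, cand$v2, cand$distance)
ranking <- order(-cand$score)   # candidate indices, best score first
\end{Rcode}
The second candidate receives fewer votes in total (15 against 17) but ranks first, because its two reads lie within the expected fragment length and its score is multiplied by 1.3.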
\section{Recommended aligner setting}
It is recommended to report uniquely mapped reads only when running \code{Subread} and \code{Subjunc} aligners since this will give the most accurate mapping result.
By default, only uniquely mapped reads are reported when running aligners in Bioconductor {\Rsubread} package.
This however needs to be explicitly specified when running aligners in SourceForge {\Subread} package (\code{-u}).
\chapter{Mapping reads generated by genomic DNA sequencing technologies}
\label{chapter:subread-dnaseq}
\section{A quick start for using SourceForge {\Subread} package}
An index must be built for the reference first and then the read mapping can be performed.\\
{\noindent\bf Step 1: Build an index}\\
\noindent Build a base-space index (default). You can provide a list of FASTA files or a single FASTA file including all the reference sequences.\\
\code{subread-buildindex -o my\_index chr1.fa chr2.fa ...}\\
{\noindent\bf Step 2: Align reads}\\
\noindent Map single-end reads from a gzipped file using 5 threads and save mapping results to a BAM file:\\
\code{subread-align -t 1 -T 5 -i my\_index -r reads.txt.gz -o subread\_results.bam}\\
\noindent Detect indels of up to 16bp:\\
\code{subread-align -t 1 -I 16 -i my\_index -r reads.txt -o subread\_results.sam}\\
\noindent Report up to three best mapping locations:\\
\code{subread-align -t 1 -B 3 -i my\_index -r reads.txt -o subread\_results.sam}\\
\noindent Report uniquely mapped reads only:\\
\code{subread-align -t 1 -u -i my\_index -r reads.txt -o subread\_results.sam}\\
\noindent Map paired-end reads:\\
\code{subread-align -t 1 -d 50 -D 600 -i my\_index -r reads1.txt -R reads2.txt \newline -o subread\_results.sam}\\
\section{A quick start for using Bioconductor {\Rsubread} package}
An index must be built for the reference first and then the read mapping can be performed.\\
{\noindent\bf Step 1: Building an index}\\
\noindent To build the index, you must provide a single FASTA file (eg. ``genome.fa'') which includes all the reference sequences.
\begin{Rcode}
library(Rsubread)
buildindex(basename="my_index",reference="genome.fa")
\end{Rcode}
{\noindent\bf Step 2: Aligning the reads}\\
\noindent Map single-end reads using 5 threads:
\begin{Rcode}
align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",nthreads=5)
\end{Rcode}
\noindent Detect indels of up to 16bp:
\begin{Rcode}
align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",indels=16)
\end{Rcode}
\noindent Report up to three best mapping locations:
\begin{Rcode}
align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",nBestLocations=3)
\end{Rcode}
\noindent Map paired-end reads:
\begin{Rcode}
align(index="my_index",readfile1="reads1.txt.gz",readfile2="reads2.txt.gz",type="dna",
output_file="rsubread.bam",minFragLength=50,maxFragLength=600)
\end{Rcode}
\section{Index building}
\label{sec:index}
The \code{subread-buildindex} (\code{buildindex} function in \Rsubread) program builds an index for the reference genome by creating a hash table in which keys are 16bp mers (subreads) extracted from the genome and values are their chromosomal locations.
By default, subreads are extracted from the genome at a 2bp interval.
The reference sequences should be in FASTA format (the header line for each chromosomal sequence starts with ``$>$'').\\
Table 1 describes the arguments used by the \code{subread-buildindex} program.
\newpage
\begin{table}[h]
\raggedright{Table 1: Arguments used by the \code{subread-buildindex} program (\code{buildindex} function in \Rsubread) in alphabetical order.
Arguments in parentheses in the first column are used by \code{buildindex}.\newline\\}
\begin{tabular}{|p{4cm}|p{12cm}|}
\hline
Arguments & Description \\
\hline
chr1.fa, chr2.fa, ... \newline (\code{reference}) & Give names of chromosome files. Note that in {\Rsubread}, only a single FASTA file including all reference sequences should be provided.\\
\hline
-B \newline (\code{indexSplit=FALSE}) & Create one block of index. The built index will not be split into multiple pieces. This requires the largest amount of memory when running alignments, but it enables the maximum mapping speed to be achieved. This option overrides -M when both are provided.\\
\hline
-c \newline (\code{colorspace}) & Build a color-space index.\\
\hline
-f $<int>$ \newline (\code{TH\_subread}) & Specify the threshold for removing uninformative subreads (highly repetitive 16bp mers). Subreads will be excluded from the index if they occur more than the threshold number of times in the reference genome. Default value is 100.\\
\hline
-F \newline (\code{gappedIndex=FALSE}) & Build a full index for the reference genome. 16bp mers (subreads) will be extracted from every position of the reference genome. Under default setting (`-F' is not specified), subreads are extracted in every three bases from the genome.\\
\hline
-M $<int>$ \newline (\code{memory}) & Specify the size of requested memory (RAM) in megabytes, 8000MB by default. With the default value, the index built for a mammalian genome (eg. human or mouse genome) will be saved into one block, enabling the fastest mapping speed to be achieved. The amount of memory used is $\sim$ 7600MB for mouse or human genome (other species have a much smaller memory footprint), when performing read mapping. Using less memory will increase read mapping time.\\
\hline
-o $<basename>$ \newline (\code{basename}) & Specify the base name of the index to be created.\\
\hline
-v & Output version of the program. \\
\hline
\end{tabular}
\end{table}
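The structure of the index described in this section can be sketched in {\R}.
This is a toy illustration, not the on-disk format produced by \code{subread-buildindex}: a named list stands in for the hash table, positions are offsets within a single sequence, and the function name \code{build\_index} and its argument names are hypothetical.
The extraction interval is passed explicitly as \code{step}, and \code{max\_occurrences} plays the role of the threshold for removing uninformative subreads.
\begin{Rcode}
# Toy subread index: map each k-mer to the positions where it occurs.
build_index <- function(genome, step, k = 16, max_occurrences = 100) {
  starts <- seq(1, nchar(genome) - k + 1, by = step)
  kmers <- substring(genome, starts, starts + k - 1)
  index <- split(starts, kmers)     # named list: k-mer -> positions
  # Drop highly repetitive k-mers, since they are uninformative for voting.
  index[lengths(index) <= max_occurrences]
}

# Small example with 4bp mers extracted from every position.
genome <- "ACGTACGTACGTTTGACCA"
idx <- build_index(genome, step = 1, k = 4, max_occurrences = 2)
\end{Rcode}
In this example \code{idx[["CGTA"]]} holds positions 2 and 6, while `ACGT' occurs three times and is therefore excluded from the index.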
\newpage
\section{Read mapping}
The {\Subread} aligner (\texttt{subread-align} program in SourceForge {\Subread} package or \code{align} function in Bioconductor {\Rsubread} package) extracts a number of subreads from each read and then uses these subreads to vote for the mapping location of the read.
It uses the ``seed-and-vote'' paradigm for read mapping and reports the largest mappable region for each read.
Table 2 describes the arguments used by {\Subread} aligner (and also \code{Subjunc} aligner).
Arguments used in Bioconductor \code{Rsubread} package are included in parentheses.\\
\begin{longtable}{|p{4cm}|p{12cm}|}
\multicolumn{2}{p{16cm}}{Table 2: Arguments used by the \code{subread-align}/\code{subjunc} programs included in the SourceForge {\Subread} package in alphabetical order.
Arguments in parentheses in the first column are the equivalent arguments used in Bioconductor {\Rsubread} package.
Arguments used by \code{subread-align} only are marked with $^*$.
Arguments used by \code{subjunc} only are marked with $^{**}$.
\newline
}
\endfirsthead
\hline
Arguments & Description \\
\hline
-b \newline (\code{color2base=TRUE}) & Output base-space reads instead of color-space reads in mapping output for color space data (eg. LifeTech SOLiD data). Note that the mapping itself will still be performed in color space.\\
\hline
-B $<int>$ \newline (\code{nBestLocations}) & Specify the maximal number of equally-best mapping locations allowed to be reported for each read. 1 by default. `NH' tag is used to indicate how many alignments are reported for the read and `HI' tag is used for numbering the alignments reported for the same read, in the output. Note that \code{-u} option takes precedence over \code{-B}.\\
\hline
-d $<int>$ \newline (\code{minFragLength}) & Specify the minimum fragment/template length, 50 by default. Note that if the two reads from the same pair do not satisfy the fragment length criteria, they will be mapped individually as if they were single-end reads.\\
\hline
-D $<int>$ \newline (\code{maxFragLength}) & Specify the maximum fragment/template length, 600 by default.\\
\hline
-i $<index>$ \newline (\code{index}) & Specify the base name of the index.\\
\hline
-I $<int>$ \newline (\code{indels}) & Specify the number of INDEL bases allowed in the mapping. 5 by default. Indels of up to 200bp long can be detected.\\
\hline
-m $<int>$ \newline (\code{TH1}) & Specify the consensus threshold, which is the minimal number of consensus subreads required for reporting a hit. The consensus subreads are those subreads which vote for the same location in the reference genome for the read. If paired-end read data are provided, at least one of the two reads from the same pair must satisfy this criterion. 3 by default.\\
\hline
-M $<int>$ \newline (\code{maxMismatches}) & Specify the maximum number of mis-matched bases allowed in the alignment. 3 by default. Mis-matches found in soft-clipped bases are not counted.\\
\hline
-n $<int>$ \newline (\code{nsubreads}) & Specify the number of subreads extracted from each read, 10 by default.\\
\hline
-o $<output>$ \newline (\code{output\_file}) & Give the name of output file. The default output format is BAM. All reads are included in mapping output, including both mapped and unmapped reads, and they are in the same order as in the input file.\\
\hline
-p $<int>$ \newline (\code{TH2}) & Specify the minimum number of consensus subreads both reads from the same pair must have. This argument is only applicable for paired-end read data. The value of this argument should not be greater than that of `-m' option, so as to rescue those read pairs in which one read has a high mapping quality but the other does not. 1 by default.\\
\hline
-P $<3:6>$ \newline (\code{phredOffset}) & Specify the format of Phred scores used in the input data, `3' for phred+33 and `6' for phred+64. `3' by default. For the \code{align} function in \Rsubread, the possible values are `33' (for phred+33) and `64' (for phred+64). `33' by default.\\
\hline
-r $<input>$ \newline (\code{readfile1}) & Give the name of input file(s) (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}). For paired-end read data, this gives the first read file and the other read file should be provided via the -R option. Supported input formats include FASTQ/FASTA (uncompressed or gzip compressed)(default), SAM and BAM.\\
\hline
-R $<input>$ \newline (\code{readfile2}) & Provide the name of the second read file from paired-end data. The program will switch to paired-end read mapping mode if this file is provided. Multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}.\\
\hline
-S $<ff:fr:rf>$ \newline (\code{PE\_orientation}) & Specify the orientation of the two reads from the same pair. It has three possible values: `fr', `ff' and `rf'. Letter `f' denotes the forward strand and letter `r' the reverse strand. `fr' by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand).\\
\hline
$^*$ -t $<int>$ \newline (\code{type}) & Specify the type of input sequencing data. Possible values include \code{0}, denoting RNA-seq data, or \code{1}, denoting genomic DNA-seq data. User must specify the value. Character values including `rna' and `dna' can also be used in the {\R} function. For genomic DNA-seq data, the aligner takes into account both the number of matched bases and the number of mis-matched bases to determine the best mapping location after applying the `seed-and-vote' approach for read mapping. For RNA-seq data, only the number of mis-matched bases is considered for determining the best mapping location. \\
\hline
-T $<int>$ \newline (\code{nthreads}) & Specify the number of threads/CPUs used for mapping. The value should be between 1 and 32. 1 by default.\\
\hline
-u \newline (\code{unique=TRUE}) & Output uniquely mapped reads only. Reads that were found to have more than one best mapping location will not be reported.\\
\hline
$^{**}$$--$allJunctions \newline (\code{reportAllJunctions=TRUE}) & This option should be used with \code{subjunc} for detecting canonical exon-exon junctions (with `GT/AG' donor/acceptor sites), non-canonical exon-exon junctions and structural variants (SVs) in RNA-seq data. Detected junctions will be saved to a file with suffix name ``.junction.bed''. Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt'', which includes chromosomal coordinates of detected SV breakpoints and also number of supporting reads. In the read mapping output, each breakpoint-containing read will contain the following extra fields for the description of its secondary alignment: CC(Chr), CP(Position), CG(CIGAR) and CT(strand). The primary alignment (described in the main field) and secondary alignment give respectively the mapping results for the two segments from the same read that were separated by the breakpoint. Note that each breakpoint-containing read occupies only one row in mapping output. The mapping output includes mapping results for all the reads.\\
\hline
$--$BAMinput \newline (\code{input\_format="BAM"}) & Specify that the input read data are in BAM format.\\
\hline
$--$complexIndels & Detect multiple short indels that occur concurrently in a small genomic region (these indels could be as close as 1bp apart).\\
\hline
$--$DPGapExt $<int>$ \newline (\code{DP\_GapExtPenalty}) & Specify the penalty for extending the gap when performing the Smith-Waterman dynamic programming. 0 by default.\\
\hline
$--$DPGapOpen $<int>$ \newline (\code{DP\_GapOpenPenalty}) & Specify the penalty for opening a gap when applying the Smith-Waterman dynamic programming to detecting indels. -2 by default.\\
\hline
$--$DPMismatch $<int>$ \newline (\code{DP\_MismatchPenalty}) & Specify the penalty for mismatches when performing the Smith-Waterman dynamic programming. 0 by default.\\
\hline
$--$DPMatch $<int>$ \newline (\code{DP\_MatchScore}) & Specify the score for the matched base when performing the Smith-Waterman dynamic programming. 2 by default.\\
\hline
$--$rg $<string>$ \newline (\code{readGroup}) & Add a $<tag:value>$ to the read group (RG) header in the mapping output. \\
\hline
$--$rg-id $<string>$ \newline (\code{readGroupID}) & Specify the read group ID. If specified, the read group ID will be added to the read group header field and also to each read in the mapping output. \\
\hline
$--$SAMinput \newline (\code{input\_format="SAM"}) & Specify that the input read data are in SAM format.\\
\hline
$--$SAMoutput \newline (\code{output\_format="SAM"}) & Specify that mapping results are saved into a SAM format file. \\
\hline
$^*$$--$sv \newline (\code{detectSV=TRUE}) & This option should be used with \code{subread-align} for detecting structural variants (SVs) in genomic DNA sequencing data. Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt'', which includes chromosomal coordinates of detected SV breakpoints and also number of supporting reads for each SV event. In the read mapping output, each breakpoint-containing read will contain the following extra fields for the description of its secondary alignment: CC(Chr), CP(Position), CG(CIGAR) and CT(strand). The primary alignment (described in the main field) and secondary alignment give respectively the mapping results for the two segments from the same read that were separated by the breakpoint. Note that each breakpoint-containing read occupies only one row in mapping output. The mapping output includes mapping results for all the reads.\\
\hline
$--$trim5 $<int>$ \newline (\code{nTrim5}) & Trim off $<int>$ number of bases from 5' end of each read. 0 by default.\\
\hline
$--$trim3 $<int>$ \newline (\code{nTrim3}) & Trim off $<int>$ number of bases from 3' end of each read. 0 by default.\\
\hline
-v & Output version of the program. \\
\hline
\end{longtable}
\newpage
\section{Mapping quality scores}
{\Subread} and {\Subjunc} aligners assign a mapping quality score (MQS) to each mapped read to indicate the confidence of the mapping:\\
\[ MQS = \left\{
\begin{array}{l l}
\frac{40}{(N_c + N_{mm})} & \quad \text{if only one best location found}\\
& \\
& \\
0 & \quad \text{if $>1$ equally best locations were found}\\
\end{array} \right.\] \\
\noindent where $N_c$ is the number of candidate locations that are considered for the final full alignment (re-alignment step).
Such locations must have a vote number within the top three vote numbers counted from all locations considered in the subread-mapping step (first scan).
Up to three candidate locations are considered for each read in the re-alignment step.
$N_{mm}$ is the number of mismatches found in the alignment at each candidate location.
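
As a small illustration of the MQS formula, the following Python sketch evaluates it for given values of $N_c$ and $N_{mm}$.
The function name \code{mapping\_quality\_score} and its arguments are hypothetical and are not part of {\Subread}.
The sketch assumes that $N_{mm}$ is supplied as a single mismatch count for the reported alignment and it returns the unrounded value, since any rounding applied by the aligners is not described here.
\begin{verbatim}
def mapping_quality_score(n_candidates, n_mismatches, n_best_locations=1):
    """Evaluate MQS = 40 / (N_c + N_mm) for a mapped read (illustrative only)."""
    if n_best_locations > 1:
        # More than one equally best location: the mapping is ambiguous.
        return 0.0
    if n_candidates < 1:
        raise ValueError("a mapped read has at least one candidate location")
    return 40.0 / (n_candidates + n_mismatches)


print(mapping_quality_score(1, 0))     # one candidate, no mismatches
print(mapping_quality_score(3, 1))     # three candidates, one mismatch
print(mapping_quality_score(2, 0, 2))  # two equally best locations
\end{verbatim}
Under these assumptions, a read with a single candidate location and no mismatches receives the highest possible score of 40.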
% \[ MQS = \left\{
% \begin{array}{l l}
% (\sum_{i \in b_m} ( 1 - p_i) - \sum_{i \in b_{mm}} (1 - p_i)) \times 60 / L & \quad \text{if uniquely mapped}\\
% & \quad \text{\scriptsize{[MQS is reset to 0 if less than 0]}}\\
% & \\
% 0 & \quad \text{if $>1$ equally best locations found}\\
% \end{array} \right.\]
% where $L$ is the read length, $p_i$ is the base-calling $p$-value for the $i$th base in the read, $b_m$ is the set of locations of matched bases, and $b_{mm}$ is the set of locations of mismatched bases.
% Base-calling p values can be readily computed from the base quality scores.
% Read bases of high sequencing quality have low base-calling p values.
% Read bases that were found to be insertions are treated as matched bases in the MQS calculation.
% The MQS is a read-length normalized value and it is in the range [0, 60).
\section{Mapping output}
Read mapping results for each library will be saved to a BAM or SAM format file.
Short indels detected from the read data will be saved to a text file (`.indel').
If `--sv' is specified when running \code{subread-align}, breakpoints detected from structural variant events will be output to a text file for each library as well (`.breakpoints.txt').
\newpage
\chapter{Mapping reads generated by RNA sequencing technologies}
\section{A quick start for using SourceForge {\Subread} package}
\label{sec:rnaseq-subread}
An index must be built for the reference first and then the read mapping and/or junction detection can be carried out.\\
{\noindent\bf Step 1: Building an index}\\
\noindent The following command can be used to build a base-space index.
You can provide a list of FASTA files or a single FASTA file including all the reference sequences.\\
\code{subread-buildindex -o my\_index chr1.fa chr2.fa ...}\\
\noindent For more details about index building, see Section~\ref{sec:index}.\\
{\noindent\bf Step 2: Aligning the reads}\\
\noindent{{\Subread}}\\
\noindent If the purpose of an RNA-seq experiment is to quantify gene-level expression and discover differentially expressed genes, the {\Subread} aligner is recommended.
{\Subread} carries out local alignments for RNA-seq reads.
The commands used by {\Subread} to align RNA-seq reads are the same as those used to align gDNA-seq reads.
Below is an example of using {\Subread} to map single-end RNA-seq reads.\\
\code{subread-align -t 0 -i my\_index -r rnaseq-reads.txt -o subread\_results.bam}\\
\noindent Another RNA-seq aligner included in this package is the {\Subjunc} aligner.
{\Subjunc} not only performs read alignments but also detects exon-exon junctions.
The main difference between {\Subread} and {\Subjunc} is that {\Subread} does not attempt to detect exon-exon junctions in the RNA-seq reads.
For the alignments of the exon-spanning reads, {\Subread} just uses the largest mappable regions in the reads to find their mapping locations.
This makes {\Subread} more computationally efficient.
The largest mappable regions can then be used to reliably assign the reads to their target genes by using a read summarization program (eg. \featureCounts, see Section~\ref{sec:featureCounts}), and differential expression analysis can be readily performed based on the read counts yielded from read summarization.
Therefore, {\Subread} is sufficient for read mapping if the purpose of RNA-seq analysis is to perform a differential expression analysis.
Also, {\Subread} could report more mapped reads than {\Subjunc}.
For example, the exon-spanning reads that are not aligned by {\Subjunc} due to the lack of canonical GT/AG splicing signals can be aligned by {\Subread} as long as they have a good match with the reference sequence.\\
\noindent{{\Subjunc}}\\
For other purposes of RNA-seq data analysis, such as exon-exon junction detection, alternative splicing analysis and genomic mutation detection, the {\Subjunc} aligner should be used because exon-spanning reads need to be fully aligned.
Below is an example command that uses {\Subjunc} to perform global alignments for paired-end RNA-seq reads.
Note that there are two files produced after mapping: one is a BAM-format file including mapping results and the other a BED-format file including discovered exon-exon junctions.\\
\code{subjunc -i my\_index -r rnaseq-reads1.txt -R rnaseq-reads2.txt -o subjunc\_result}
\section{A quick start for using Bioconductor {\Rsubread} package}
An index must be built for the reference first and then the read mapping can be performed.\\
{\noindent\bf Step 1: Building an index}\\
\noindent To build the index, you must provide a single FASTA file (eg. ``genome.fa'') which includes all the reference sequences.
\begin{Rcode}
library(Rsubread)
buildindex(basename="my_index",reference="genome.fa")
\end{Rcode}
{\noindent\bf Step 2: Aligning the reads}\\
Please refer to Section~\ref{sec:rnaseq-subread} for the differences between {\Subread} and {\Subjunc} in mapping RNA-seq data.
Below is an example for mapping a single-end RNA-seq dataset using {\Subread}.
Useful information about \code{align} function can be found in its help page (type \code{?align} in your {\R} prompt).
\begin{Rcode}
align(index="my_index",readfile1="rnaseq-reads.txt.gz",output_file="subread_results.bam")
\end{Rcode}
Below is an example for mapping a single-end RNA-seq dataset using {\Subjunc}.
Useful information about \code{subjunc} function can be found in its help page (type \code{?subjunc} in your {\R} prompt).
\begin{Rcode}
subjunc(index="my_index",readfile1="rnaseq-reads.txt.gz",output_file="subjunc_results.bam")
\end{Rcode}
\section{Local read alignment}
{\Subread} and {\Subjunc} can both be used to map RNA-seq reads to the reference genome.
If the goal of the RNA-seq experiment is expression analysis, eg. finding genes that are differentially expressed between conditions, then {\Subread} is recommended.
{\Subread} performs fast local alignments for reads and reports the mapping locations that have the largest overlap with the reads.
These reads can then be assigned to genes for expression analysis.
For this type of analysis, global alignments for the exon-spanning reads are not required because local alignments are sufficient for reads to be accurately assigned to genes.
However, for other types of RNA-seq data analyses such as exon-exon junction discovery, genomic mutation detection and allele-specific gene expression analysis, global alignments are required.
The next section describes the {\Subjunc} aligner, which performs global alignments for RNA-seq reads.
\section{Global read alignment}
{\Subjunc} aligns each exon-spanning read by first using a large number of subreads extracted from the read to identify multiple target regions matching the selected subreads, and then using the splicing signals (donor and acceptor sites) to precisely determine the mapping locations of the read bases.
It also includes a verification step to compare the quality of mapping reads as exon-spanning reads with the quality of mapping reads as exonic reads to finally decide how to best map the reads.
Reads may be re-aligned if required.
Output of {\Subjunc} aligner includes a list of discovered exon-exon junction locations and also the complete alignment results for the reads.
Table 2 describes the arguments used by the {\Subjunc} program.\\
\section{Mapping output}
Read mapping results for each library will be saved to a BAM/SAM file.
Detected exon-exon junctions will be saved to a BED file for each library (`.junction.bed').
Detected short indels will be saved to a text file (`.indel').\\
\section{Mapping microRNA sequencing reads (miRNA-seq)}
To use {\Subread} aligner to map miRNA-seq reads, a full index must be built for the reference genome before read mapping can be carried out.
For example, the following command builds a full index for mouse reference genome \emph{mm10}:
\code{\\
subread-buildindex -F -B -o mm10\_full\_index mm10.fa \\
}
The full index includes 16bp mers extracted from every genomic location in the genome.
Note that if \code{-F} is not specified, \code{subread-buildindex} builds a gapped index which includes 16bp mers extracted every three bases in the reference genome, ie. there is a 2bp gap between each pair of neighbouring 16bp mers.
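
The difference between the two index types can be sketched in Python as follows.
This is only an illustration of which 16bp mers are sampled: the function \code{indexed\_start\_positions} is hypothetical, positions are 0-based, and sampling is assumed to start at the first base of the sequence, which may differ from what \code{subread-buildindex} does internally.
\begin{verbatim}
SUBREAD_LENGTH = 16  # length of each indexed 16bp mer


def indexed_start_positions(sequence_length, full_index):
    """Return 0-based start positions of the 16bp mers placed in the index."""
    step = 1 if full_index else 3  # gapped index: one 16bp mer every three bases
    last_start = sequence_length - SUBREAD_LENGTH
    return list(range(0, last_start + 1, step))


full = indexed_start_positions(30, full_index=True)
gapped = indexed_start_positions(30, full_index=False)
print(len(full), full[:4])
print(len(gapped), gapped)
\end{verbatim}
For a 30bp toy sequence, the full index samples 15 start positions whereas the gapped index samples 5, so the full index is roughly three times as dense.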
Once the full index has been built, read alignment can be performed.
Reads do not need to be trimmed before feeding them to the {\Subread} aligner since {\Subread} soft-clips sequences in the reads that cannot be properly mapped.
The parameters used for mapping miRNA-seq reads need to be carefully designed due to the very short length of miRNA sequences ($\sim$22bp).
The total number of subreads (16bp mers) extracted from each read should be the read length minus 15, which
is the maximum number of subreads that can be possibly extracted from a read.
The reason why we need to extract the maximum number of subreads is to achieve a high sensitivity in detecting the short miRNA sequences.
The threshold for the number of consensus subreads required for reporting a hit should be within the range of 2 to 7 consensus subreads inclusive.
The larger the number of consensus subreads required, the more stringent the mapping will be.
Using a threshold of 2 consensus subreads allows the detection of miRNA sequences of as short as 17bp, but the mapping error rate could be relatively high.
With this threshold, there will be at least 17 perfectly matched bases present in each reported alignment.
If a threshold of 4 consensus subreads is used, only miRNA sequences of 19bp or longer can be detected.
With this threshold, there will be at least 19 perfectly matched bases present in each reported alignment.
When a threshold of 7 consensus subreads is used, only miRNA sequences of 22bp or longer can be detected (at least 22 perfectly matched bases will be present in each reported alignment).
When aligning a mouse miRNA-seq dataset, we found a significant decrease in the number of mapped reads when the required number of consensus subreads increased from 4 to 5, suggesting that many miRNA sequences are only 19bp long.
We therefore used a threshold of 4 consensus subreads to map this dataset.
However, what we observed might not be the case for other datasets that were generated from different cell types and different species.
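
The arithmetic behind these parameter choices can be written down as a short Python sketch.
The helper names are hypothetical; the sketch only assumes that subreads are 16bp long and that consensus subreads start at consecutive positions in the read, so that $m$ consensus subreads cover $16 + m - 1$ bases.
\begin{verbatim}
SUBREAD_LENGTH = 16


def max_subreads(read_length):
    """Maximum number of 16bp subreads in a read: read length minus 15."""
    return max(read_length - (SUBREAD_LENGTH - 1), 0)


def min_matched_bases(consensus_threshold):
    """Bases covered by that many consecutive, overlapping 16bp subreads."""
    return SUBREAD_LENGTH + consensus_threshold - 1


print(max_subreads(50))
for threshold in (2, 4, 7):
    print(threshold, min_matched_bases(threshold))
\end{verbatim}
For 50bp reads this gives 35 subreads (the value to pass to `-n'), and thresholds of 2, 4 and 7 (the value passed to `-m') correspond to at least 17, 19 and 22 matched bases respectively.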
Below is an example of mapping 50bp long reads (adaptor sequences were included in the reads in addition to the miRNA sequences), with at least 4 consensus subreads required in the mapping.
Note that the `-t' option should have a value of 1 since miRNA-seq reads are more similar to gDNA-seq reads than to mRNA-seq reads from the read mapping point of view.
\code{\\
subread-align -t 1 -i mm10\_full\_index -n 35 -m 4 -M 3 -T 10 -I 0 -P 3 -B 10 \\
-r miRNA\_reads.fastq -o result.bam\\
}
The `-B 10' parameter instructs {\Subread} aligner to report up to 10 best mapping locations (equally best) in the mapping results.
The multiple locations reported for the reads could be useful for investigating their true origin, but they might need to be filtered out when assigning mapped reads to known miRNA genes to ensure a high-quality quantification of miRNA genes.
The miRBase database (\url{http://www.mirbase.org/}) is a useful resource that includes annotations for miRNA genes in many species.
The {\featureCounts} program can be readily used for summarizing reads to miRNA genes.
\chapter{Read summarization}
\section{Introduction}
Sequencing reads often need to be assigned to genomic features of interest after they are mapped to the reference genome.
This process is often called \emph{read summarization} or \emph{read quantification}.
Read summarization is required by a number of downstream analyses such as gene expression analysis and histone modification analysis.
The output of read summarization is a count table, in which the number of reads assigned to each feature in each library is recorded.
A particular challenge to the read summarization is how to deal with those reads that overlap more than one feature (eg. an exon) or meta-feature (eg. a gene).
Care must be taken to ensure that such reads are not over-counted or under-counted.
Here we describe the {\featureCounts} program, an efficient and accurate read quantifier.
{\featureCounts} has the following features:
\begin{itemize}
\item It carries out precise and accurate read assignments by taking care of indels, junctions and structural variants in the reads.
\item It takes only half a minute to summarize 20 million reads.
\item It supports GTF and SAF format annotation.
\item It supports strand-specific read counting.
\item It can count reads at feature (eg. exon) or meta-feature (eg. gene) level.
\item It is highly flexible in counting multi-mapping and multi-overlapping reads. Such reads can be excluded, fully counted or fractionally counted.
\item It gives users full control over the summarization of paired-end reads, including allowing them to check if both ends are mapped and/or if the fragment length falls within the specified range.
\item It reduces ambiguity in assigning read pairs by searching for features that overlap with both reads from the pair.
\item It allows users to specify whether chimeric fragments should be counted.
\item It automatically detects the input format (SAM or BAM).
\item It automatically sorts paired-end reads. Users can provide either location-sorted or name-sorted BAM files to featureCounts. Read sorting is implemented on the fly and incurs only minimal time cost.
\end{itemize}
\section{featureCounts}
\label{sec:featureCounts}
\subsection{Input data}
The data input to {\featureCounts} consists of (i) one or more files of aligned reads in either SAM or BAM format and (ii) a list of genomic features in either Gene Transfer Format (GTF) or General Feature Format (GFF) or Simplified Annotation Format (SAF). The format of input reads is automatically detected (SAM or BAM).
For paired-end reads that are location-sorted in the input, {\featureCounts} will automatically re-order the reads so that reads from the same pair are placed next to each other before they are counted.
We also provide a utility program {\repair} to allow users to pair up the reads before feeding them to {\featureCounts}.
Note that name-sorted paired-end reads generated by other programs may include incorrectly paired reads due to, for example, multi-mapping.
If this is the case, {\featureCounts} will re-sort them.
Both read alignment and read counting should use the same reference genome. For each read, the BAM/SAM file gives the name of the reference chromosome or contig the read mapped to, the start position of the read on the chromosome or contig/scaffold, and the so-called CIGAR string giving the detailed alignment information including insertions and deletions and so on relative to the start position.
The genomic features can be specified in either GTF/GFF or SAF format. The SAF format is simpler and includes only five required columns for each feature (see next section). In either format, the feature identifiers are assumed to be unique, in accordance with the commonly used Gene Transfer Format (GTF) refinement of GFF.
{\featureCounts} supports strand-specific read counting if strand-specific information is provided. Read mapping results usually include mapping quality scores for mapped reads. Users can optionally specify a minimum mapping quality score that the assigned reads must satisfy.
\subsection{Annotation format}
\label{sec:annotation}
The genomic features can be specified in either GTF/GFF or SAF format.
A definition of the GTF format can be found at UCSC website (\url{http://genome.ucsc.edu/FAQ/FAQformat.html#format4}).
The SAF format includes five required columns for each feature: feature identifier, chromosome name, start position, end position and strand.
These five columns provide the minimal sufficient information for read quantification purposes.
Extra annotation data are allowed to be added from the sixth column.
A SAF-format annotation file should be a tab-delimited text file.
It should also include a header line.
An example of a SAF annotation is shown below:
\code{\\
GeneID Chr Start End Strand\\
497097 chr1 3204563 3207049 -\\
497097 chr1 3411783 3411982 -\\
497097 chr1 3660633 3661579 -\\
100503874 chr1 3637390 3640590 -\\
100503874 chr1 3648928 3648985 -\\
100038431 chr1 3670236 3671869 -\\
...
}
The \code{GeneID} column includes gene identifiers that can be numbers or character strings.
Chromosomal names included in the \code{Chr} column must match the chromosomal names of reference sequences to which the reads were aligned.
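
To make the SAF layout concrete, the following Python sketch reads a small tab-delimited SAF annotation into memory.
The function \code{read\_saf} is hypothetical and is not part of {\featureCounts}; checking the header for the exact column names shown above is an assumption of this sketch rather than documented program behavior.
\begin{verbatim}
import csv
import io

# A small SAF annotation held in memory; columns are separated by tabs.
SAF_TEXT = (
    "GeneID\tChr\tStart\tEnd\tStrand\n"
    "497097\tchr1\t3204563\t3207049\t-\n"
    "497097\tchr1\t3411783\t3411982\t-\n"
    "100503874\tchr1\t3637390\t3640590\t-\n"
)

REQUIRED_COLUMNS = ["GeneID", "Chr", "Start", "End", "Strand"]


def read_saf(handle):
    """Parse a tab-delimited SAF annotation into a list of feature dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing SAF columns: " + ", ".join(missing))
    features = []
    for row in reader:
        features.append({
            "GeneID": row["GeneID"],  # kept as a string: IDs may be non-numeric
            "Chr": row["Chr"],
            "Start": int(row["Start"]),
            "End": int(row["End"]),
            "Strand": row["Strand"],
        })
    return features


features = read_saf(io.StringIO(SAF_TEXT))
print(len(features), features[0])
\end{verbatim}
Gene identifiers are deliberately kept as strings so that both numeric and character identifiers are handled in the same way.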
\subsection{Single and paired-end reads}
Reads may be paired or unpaired.
If paired reads are used, then each pair of reads defines a DNA or RNA fragment bookended by the two reads.
In this case, {\featureCounts} can be instructed to count fragments rather than reads.
{\featureCounts} automatically sorts reads by name if paired reads are not in consecutive positions in the SAM or BAM file, with minimal cost.
Users do not need to sort their paired reads before providing them to {\featureCounts}.
\subsection{Features and meta-features}
{\featureCounts} is a general-purpose read summarization function, which assigns mapped reads (RNA-seq reads or genomic DNA-seq reads) to genomic features or meta-features.
Each feature is an interval (range of positions) on one of the reference sequences. We define a meta-feature to be a set of features representing a biological construct of interest. For example, features often correspond to exons and meta-features to genes. Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. {\featureCounts} can summarize reads at either the feature or meta-feature levels.
We recommend using unique gene identifiers, such as NCBI Entrez gene identifiers, to cluster features into meta-features. Gene names are not recommended for this purpose because different genes may have the same name. Unique gene identifiers are included in many publicly available GTF annotations, which can be readily used for summarization. The Bioconductor {\Rsubread} package also includes NCBI RefSeq annotations for human and mouse. Entrez gene identifiers are used in these annotations.
\subsection{Overlap of reads with features}
{\featureCounts} performs precise read assignment by comparing the mapping location of every base in the read or fragment with the genomic region spanned by each feature.
It takes account of any gaps (insertions, deletions, exon-exon junctions or structural variants) that are found in the read.
It calls a hit if any overlap is found between read and feature.
Users may use `--minOverlap (\code{minOverlap} in \R)' and `--fracOverlap (\code{fracOverlap} in \R)' options to specify the minimum number of overlapping bases and minimum fraction of overlapping bases required for assigning a read to a feature, respectively.
The `--fracOverlap' option might be particularly useful for counting reads with variable lengths.
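
The overlap test described in this subsection can be sketched in Python as follows.
This is not the {\featureCounts} implementation: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, a read is represented only by its aligned blocks (so gaps between blocks are never counted), and the fraction is assumed to be taken relative to the total number of aligned read bases.
\begin{verbatim}
def overlapping_bases(read_blocks, feature_start, feature_end):
    """Count read bases that fall inside a feature.

    read_blocks: list of (start, end) aligned segments of the read,
    1-based and inclusive, so gaps between blocks are not counted.
    """
    total = 0
    for block_start, block_end in read_blocks:
        left = max(block_start, feature_start)
        right = min(block_end, feature_end)
        if left <= right:
            total += right - left + 1
    return total


def is_hit(read_blocks, feature_start, feature_end,
           min_overlap=1, frac_overlap=0.0):
    """Apply minOverlap / fracOverlap style thresholds (illustrative only)."""
    read_length = sum(end - start + 1 for start, end in read_blocks)
    n_overlap = overlapping_bases(read_blocks, feature_start, feature_end)
    return n_overlap >= min_overlap and n_overlap >= frac_overlap * read_length


# A spliced read with two aligned blocks; the feature covers part of the first.
blocks = [(100, 149), (300, 349)]
print(overlapping_bases(blocks, 120, 200))
print(is_hit(blocks, 120, 200))
print(is_hit(blocks, 120, 200, frac_overlap=0.5))
\end{verbatim}
In this example 30 of the 100 aligned bases overlap the feature, so the read is a hit when any overlap suffices but not when at least half of the read is required to overlap.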
% A hit is called for a meta-feature if the read or fragment overlaps any component feature of the meta-feature.
\subsection{Multi-mapping reads}
A multi-mapping read is a read that can be equally best mapped to more than one location in the reference genome.
Due to the mapping ambiguity, it is recommended that multi-mapping reads be excluded from read counting (the default behavior of the {\featureCounts} program) to produce counts that are as accurate as possible.
However, we do provide users with different options to deal with the counting of such reads.
Users can choose to discard multi-mapping reads, or fully count every alignment reported for a multi-mapping read (ie. each alignment carries 1 count) or count each alignment fractionally (ie. each alignment carries $1/n$ count where $n$ is the total number of alignments reported for the read).
Relevant parameters for counting multi-mapping reads include `-M' (\code{countMultiMappingReads} in \R) and `--fraction' (\code{fraction} in \R).
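
The three counting modes can be summarized by the following Python sketch, which returns the count carried by each reported alignment of a read.
The function \code{count\_per\_alignment} and its two boolean arguments are hypothetical stand-ins for the `-M' and `--fraction' options; the number of reported alignments is assumed to be known, for example from the `NH' tag.
\begin{verbatim}
from fractions import Fraction


def count_per_alignment(n_alignments, count_multi_mapping=False, fraction=False):
    """Count carried by each reported alignment of one read.

    n_alignments is the number of alignments reported for the read
    (for example the value of the NH tag).
    """
    if n_alignments <= 1:
        return Fraction(1)                # uniquely mapped read
    if not count_multi_mapping:
        return Fraction(0)                # default: multi-mapping reads are discarded
    if fraction:
        return Fraction(1, n_alignments)  # each alignment carries 1/n
    return Fraction(1)                    # each alignment carries a full count


print(count_per_alignment(4))
print(count_per_alignment(4, count_multi_mapping=True))
print(count_per_alignment(4, count_multi_mapping=True, fraction=True))
\end{verbatim}
Exact fractions are used here only to keep the arithmetic transparent; with fractional counting the alignments of one read always sum to a total count of 1.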
\subsection{Multi-overlap reads}
A multi-overlap read or fragment is one that overlaps more than one feature, or more than one meta-feature when summarizing at the meta-feature level. {\featureCounts} provides users with the option to exclude multi-overlap reads, or fully count them for each overlapping feature (ie. each overlapping feature receives a count of 1 from the read) or assign a fractional count to each overlapping feature.
Relevant parameters for counting multi-overlap reads include `-O' (\code{allowMultiOverlap} in \R) and `--fraction' (\code{fraction} in \R).
The decision whether or not to count these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments, because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined. On the other hand, we recommend that multi-overlap reads or fragments are counted for most ChIP-seq experiments because epigenetic modifications inferred from these reads may regulate the biological functions of all their overlapping genes.
Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.
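
The meta-feature-level rules described in this subsection can be illustrated with a short Python sketch.
The function \code{assign\_read} is hypothetical; it takes the gene identifier of every feature that a read overlaps and returns the counts contributed by that read.
Fractional counting of multi-overlap reads is left out to keep the sketch short.
\begin{verbatim}
def assign_read(overlapping_feature_genes, allow_multi_overlap=False):
    """Return {gene: count} for one read at the meta-feature level.

    overlapping_feature_genes lists the gene identifier of every feature
    (for example every exon) that the read overlaps.
    """
    # Several exons of the same gene collapse into one meta-feature.
    genes = sorted(set(overlapping_feature_genes))
    if len(genes) == 1:
        return {genes[0]: 1}
    if len(genes) > 1 and allow_multi_overlap:
        return {gene: 1 for gene in genes}
    return {}  # no feature, or ambiguous with multi-overlap counting disabled


print(assign_read(["497097", "497097"]))          # two exons, one gene
print(assign_read(["497097", "100503874"]))       # two genes, not counted
print(assign_read(["497097", "100503874"], True)) # two genes, both counted
\end{verbatim}
An exon-spanning read that overlaps two exons of the same gene therefore contributes exactly one count to that gene.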
\subsection{In-built annotations}
In-built gene annotations for genomes \emph{hg38}, \emph{hg19}, \emph{mm10} and \emph{mm9} are included in both Bioconductor {\Rsubread} package and SourceForge {\Subread} package.
These annotations were downloaded from NCBI RefSeq database and then adapted by merging overlapping exons from the same gene to form a set of disjoint exons for each gene.
Genes with the same Entrez gene identifiers were also merged into one gene.
Each row in the annotation represents an exon of a gene. There are five columns in the annotation data including Entrez gene identifier (\emph{GeneID}), chromosomal name (\emph{Chr}), chromosomal start position (\emph{Start}), chromosomal end position (\emph{End}) and strand (\emph{Strand}).
In {\Rsubread}, users can access these annotations via the {\textsf{getInBuiltAnnotation}} function.
In {\Subread}, these annotations are stored in the `annotation' directory under the home directory of the package.
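
The merging of overlapping exons into a set of disjoint exons can be sketched in Python as below.
This is an illustration of the general interval-merging idea rather than the procedure actually used to prepare the in-built annotations; the function name is hypothetical, coordinates are assumed to be 1-based and inclusive, and exons that are merely adjacent (without sharing a base) are assumed to be left unmerged.
\begin{verbatim}
def merge_overlapping_exons(exons):
    """Merge overlapping (start, end) intervals into disjoint ones.

    Coordinates are 1-based and inclusive; all exons are assumed to lie on
    the same chromosome and strand and to belong to the same gene.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if necessary.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


print(merge_overlapping_exons([(100, 200), (150, 300), (400, 500), (450, 480)]))
\end{verbatim}
In this toy example four exons collapse into the two disjoint exons (100, 300) and (400, 500).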
\subsection{Program output}
Output of {\featureCounts} program in SourceForge {\Subread} package is saved into a tab-delimited file, which includes annotation columns (`Geneid', `Chr', `Start', `End', `Strand' and `Length') and data columns (read counts for each gene in each library).
Annotation column `Length' contains total number of non-overlapping bases of each feature or meta-feature.
When for example summarizing RNA-seq reads to genes, this column will give total number of non-overlapping bases included in all exons belonging to the same gene, for each gene.
When performing summarization at meta-feature level, annotation columns including `Chr', `Start', `End', `Strand' and `Length' give the annotation information for every feature included in each meta-feature.
Therefore, each of these columns may include more than one value (semi-colon separated).
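
As an illustration of how such annotation columns could be assembled for one meta-feature, consider the following Python sketch.
The function \code{meta\_feature\_row} and the toy gene are hypothetical, and the sketch assumes that `Length' is reported as a single number in which every covered base is counted once, regardless of strand.
\begin{verbatim}
def meta_feature_row(gene_id, features):
    """Build annotation columns for one meta-feature.

    features: list of (chr, start, end, strand) tuples, 1-based inclusive.
    Length counts every covered base once, even where features overlap.
    """
    covered = {}
    for chrom, start, end, _strand in features:
        # A set of positions is simple but memory-hungry for long features.
        covered.setdefault(chrom, set()).update(range(start, end + 1))
    length = sum(len(positions) for positions in covered.values())
    return {
        "Geneid": gene_id,
        "Chr": ";".join(f[0] for f in features),
        "Start": ";".join(str(f[1]) for f in features),
        "End": ";".join(str(f[2]) for f in features),
        "Strand": ";".join(f[3] for f in features),
        "Length": length,
    }


row = meta_feature_row("geneA", [("chr1", 101, 200, "+"), ("chr1", 151, 250, "+")])
print(row["Start"], row["End"], row["Length"])
\end{verbatim}
The two toy exons are 100 bases each but share 50 bases, so the sketch reports a length of 150 rather than 200.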
Output of the {\featureCounts} program in the SourceForge {\Subread} package also includes summary statistics of the summarization results, which are saved to a separate tab-delimited file.
This file gives the total number of reads that are successfully assigned and also the numbers of reads that are not assigned for various reasons.
The reasons why reads may not be assigned are listed below:
\begin{itemize}
\item Unassigned\_Ambiguity: overlapping with two or more features (feature-level summarization) or meta-features (meta-feature-level summarization).
\item Unassigned\_MultiMapping: reads marked as multi-mapping in SAM/BAM input (the `NH' tag is checked by the program).
\item Unassigned\_NoFeatures: not overlapping with any features included in the annotation.
\item Unassigned\_Unmapped: reads are reported as unmapped in SAM/BAM input. Note that if the `--primary' option of featureCounts program is specified, the read marked as a primary alignment will be considered for assigning to features.
\item Unassigned\_MappingQuality: mapping quality scores lower than the specified threshold.
\item Unassigned\_FragmentLength: length of fragment does not satisfy the criteria.
\item Unassigned\_Chimera: two reads from the same pair are mapped to different chromosomes or have incorrect orientation.
\item Unassigned\_Secondary: reads marked as secondary alignments in the FLAG field in SAM/BAM input.
\item Unassigned\_Nonjunction: reads do not span two or more exons. Such reads will not be assigned if the `$--$countSplitAlignmentsOnly' option is specified.
\item Unassigned\_Duplicate: reads marked as duplicate in the FLAG field in SAM/BAM input.
\end{itemize}
All of this output is also provided by the {\featureCounts} function included in the Bioconductor {\Rsubread} package, except that the read summarization results are returned in an {\R} `list' object.
For more details, see the help page for {\featureCounts} function in {\Rsubread}.
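The summary statistics described above lend themselves to a quick sanity check. The Python sketch below computes the fraction of assigned reads from a made-up summary for a single library. The `Status' header, the `Assigned' row name and the counts are assumptions chosen for illustration; only the `Unassigned\_' categories are taken from the list above.

```python
# Hypothetical summary text in the two-column, tab-delimited layout assumed here.
EXAMPLE_SUMMARY = (
    "Status\tsample.bam\n"
    "Assigned\t800\n"
    "Unassigned_Ambiguity\t50\n"
    "Unassigned_MultiMapping\t0\n"
    "Unassigned_NoFeatures\t100\n"
    "Unassigned_Unmapped\t50\n"
)


def summarize(text):
    """Return (total reads, assigned fraction, non-zero unassigned categories)."""
    rows = [line.split("\t") for line in text.strip().splitlines()]
    body = rows[1:]  # the first row is the header
    counts = {status: int(value) for status, value in body}
    total = sum(counts.values())
    assigned_fraction = counts.get("Assigned", 0) / total if total else 0.0
    unassigned = {
        status: n
        for status, n in counts.items()
        if status.startswith("Unassigned_") and n > 0
    }
    return total, assigned_fraction, unassigned


total, assigned_fraction, unassigned = summarize(EXAMPLE_SUMMARY)
print(total, assigned_fraction, unassigned)
```

Listing only the non-zero categories makes it easy to see which of the reasons above accounts for most of the unassigned reads.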
\subsection{Program usage}
Table 3 describes the parameters used by the {\featureCounts} program.
\pagebreak
\begin{longtable}{|p{5cm}|p{11cm}|}
\multicolumn{2}{p{16cm}}{Table 3: Arguments used by the {\featureCounts} program included in the SourceForge {\Subread} package in alphabetical order.
Arguments included in parentheses are the equivalent parameters used by the {\featureCounts} function in the Bioconductor {\Rsubread} package.}
\endfirsthead
\hline
Arguments & Description \\
\hline
input\_files \newline (\code{files}) & Give the names of input read files that include the read mapping results. The program automatically detects the file format (SAM or BAM). Multiple files can be provided at the same time.\\
\hline
-a $<input>$ \newline (\code{annot.ext, annot.inbuilt}) & Give the name of an annotation file. \\
\hline
-A \newline (\code{chrAliases}) & Provide a chromosome name alias file to match chr names in annotation with those in the reads. This should be a two-column comma-delimited text file. Its first column should include chr names in the annotation and its second column should include chr names in the reads. Chr names are case sensitive. No column header should be included in the file.\\
\hline
-B \newline (\code{requireBothEndsMapped}) & If specified, only fragments that have both ends successfully aligned will be considered for summarization. This option should be used together with \code{-p} (or \code{isPairedEnd} in {\Rsubread} {\featureCounts}).\\
\hline
-C \newline (\code{countChimericFragments}) & If specified, the chimeric fragments (those fragments that have their two ends aligned to different chromosomes) will NOT be counted. This option should be used together with \code{-p} (or \code{isPairedEnd} in {\Rsubread} {\featureCounts}).\\
\hline
-d $<int>$ \newline (\code{minFragLength}) & Minimum fragment/template length, 50 by default. This option must be used together with \code{-p} and \code{-P}.\\
\hline
-D $<int>$ \newline (\code{maxFragLength}) & Maximum fragment/template length, 600 by default. This option must be used together with \code{-p} and \code{-P}.\\
\hline
-f \newline (\code{useMetaFeatures}) & If specified, read summarization will be performed at feature level (eg. exon level). Otherwise, it is performed at meta-feature level (eg. gene level).\\
\hline
-F \newline (\code{isGTFAnnotationFile}) & Specify the format of the annotation file. Acceptable formats include `GTF' and `SAF' (see Section~\ref{sec:annotation} for details). The {\C} version of {\featureCounts} program uses a GTF format annotation by default, but the R version uses a SAF format annotation by default. The R version also includes in-built annotations.\\
\hline
-g $<input>$ \newline (\code{GTF.attrType}) & Specify the attribute type used to group features (eg. exons) into meta-features (eg. genes) when GTF annotation is provided. `gene\_id' by default. This attribute type is usually the gene identifier. This argument is useful for the meta-feature level summarization.\\
\hline
-G $<input>$ \newline (\code{genome}) & Provide the name of a FASTA-format file that contains the reference sequences used in
read mapping that produced the provided SAM/BAM files. This optional argument can be used with the `-J' option to improve read counting for junctions.\\
\hline
-J \newline (\code{juncCounts}) & Count the number of reads supporting each exon-exon junction. Junctions are identified from those exon-spanning reads (containing `N' in CIGAR string) in input data. For each junction, the reported data include number of supporting reads, genes that the junction belongs to, chromosomal coordinates of splice sites etc.\\
\hline
-M \newline (\code{countMultiMappingReads}) & If specified, multi-mapping reads/fragments will be counted. A multi-mapping read will be counted up to N times if it has N reported mapping locations. The program uses the `NH' tag to find multi-mapping reads.\\
\hline
-o $<output>$ & Give the name of the output file. The output file contains the number of reads assigned to each meta-feature (or each feature if \code{-f} is specified). Note that the {\featureCounts} function in {\Rsubread} does not use this parameter. It returns a \code{list} object including read summarization results and other data. \\
\hline
-O \newline (\code{allowMultiOverlap}) & If specified, reads (or fragments if \code{-p} is specified) will be allowed to be assigned to more than one matched meta-feature (or feature if \code{-f} is specified). Reads/fragments overlapping with more than one meta-feature/feature will be counted more than once. Note that when performing meta-feature level summarization, a read (or fragment) will still be counted once if it overlaps with multiple features belonging to the same meta-feature but does not overlap with other meta-features. \\
\hline
-p \newline (\code{isPairedEnd}) & If specified, fragments (or templates) will be counted instead of reads. This option is only applicable for paired-end reads.\\
\hline
-P \newline (\code{checkFragLength}) & If specified, the fragment length will be checked when assigning fragments to meta-features or features. This option must be used together with \code{-p}. The fragment length thresholds should be specified using \code{-d} and \code{-D} options.\\
\hline
-Q $<int>$ \newline (\code{minMQS}) & The minimum mapping quality score a read must satisfy in order to be counted. For paired-end reads, at least one end should satisfy this criterion. 0 by default.\\
\hline
-R \newline (\code{reportReads}) & Output detailed read assignment results for each read (or fragment if paired end). They are saved to a tab-delimited file that contains four columns: read name, status (assigned or the reason if not assigned), name of target feature/meta-feature and total number of hits if the read/fragment is counted multiple times. Names of output files are the same as input file names except that a suffix string `.featureCounts' is added.\\
\hline
-s $<int>$ \newline (\code{isStrandSpecific}) & Indicate if strand-specific read counting should be performed. Acceptable values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default. For paired-end reads, strand of the first read is taken as the strand of the whole fragment and FLAG field of the current read is used to tell if it is the first read in the fragment.\\
\hline
-t $<input>$ \newline (\code{GTF.featureType}) & Specify the feature type. Only rows which have the matched feature type in the provided GTF annotation file will be included for read counting. `exon' by default.\\
\hline
-T $<int>$ \newline (\code{nthreads}) & Number of threads. The value should be between 1 and 32. 1 by default.\\
\hline
-v & Output version of the program. \\
\hline
$--$countSplit \newline AlignmentsOnly \newline (\code{splitOnly}) & If specified, only split alignments (CIGAR strings contain letter `N') will be counted. All the other alignments will be ignored. An example of split alignments is the exon-spanning reads in RNA-seq data. If exon-spanning reads need to be assigned to all their overlapping exons, `-f' and `-O' options should be provided as well.\\
\hline
$--$countNonSplit \newline AlignmentsOnly \newline (\code{nonSplitOnly}) & If specified, only non-split alignments (CIGAR strings do not contain letter `N') will be counted. All the other alignments will be ignored.\\
\hline
$--$donotsort \newline (\code{autosort}) & If specified, paired end reads will not be re-ordered even if reads from the same pair were found not to be next to each other in the input.\\
\hline
$--$fraction \newline (\code{fraction}) & Assign fractional counts to features. This option must be used together with `-M' or `-O' or both. When `-M' is specified, each reported alignment from a multi-mapping read (identified via `NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When `-O' is specified, each overlapping feature will receive a fractional count of 1/y, where y is the total number of features overlapping with the read. When both `-M' and `-O' are specified, each alignment will carry a fractional count of 1/(x*y).\\
\hline
$--$fracOverlap $<value>$ \newline (\code{fracOverlap}) & Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be within range [0,1]. 0 by default. If paired end, number of overlapping bases is counted from both reads. Soft-clipped bases are counted when calculating total read length (but ignored when counting overlapping bases). Both this option and the `$--$minOverlap' option need to be satisfied for read assignment. \\
\hline
$--$ignoreDup \newline (\code{ignoreDup}) & If specified, reads that were marked as duplicates will be ignored. Bit 0x400 in the FLAG field of the SAM/BAM file is used for identifying duplicate reads. In paired end data, the entire read pair will be ignored if at least one end is found to be a duplicate read.\\
\hline
$--$largestOverlap \newline (\code{largestOverlap}) & If specified, reads (or fragments) will be assigned to the target that has the largest number of overlapping bases.\\
\hline
$--$maxMOp $<int>$ \newline (\code{maxMOp}) & Specify the maximum number of `M' operations (matches or mis-matches) allowed in a CIGAR string. 10 by default. Both `X' and `=' operations are treated as `M' and adjacent `M' operations are merged in the CIGAR string. When the number of `M' operations exceeds the limit, only the first `maxMOp' number of `M' operations will be used in read assignment.\\
\hline
$--$minOverlap $<int>$ \newline (\code{minOverlap}) & Minimum number of overlapping bases in a read that is required for read assignment. 1 by default. If a negative value is provided, then a gap of up to specified size will be allowed between read and the feature that the read is assigned to. For assignment of read pairs (fragments), number of overlapping bases from each read from the same pair will be summed. \\
\hline
$--$primary \newline (\code{primaryOnly}) & If specified, only primary alignments will be counted. Primary and secondary alignments are identified using bit 0x100 in the FLAG field of SAM/BAM files. All primary alignments in a dataset will be counted regardless of whether they are from multi-mapping reads or not (ie. `-M' is ignored).\\
\hline
$--$read2pos $<int>$ \newline (\code{read2pos}) & The read is reduced to its 5' most base or 3' most base. Read summarization is then performed based on the single base position to which the read is reduced. By default, no read reduction will be performed.\\
\hline
$--$readExtension5 $<int>$ \newline (\code{readExtension5}) & Reads are extended upstream by $<int>$ bases from their 5' end. 0 by default.\\
\hline
$--$readExtension3 $<int>$ \newline (\code{readExtension3}) & Reads are extended downstream by $<int>$ bases from their 3' end. 0 by default.\\
\hline
$--$tmpDir $<string>$ \newline (\code{tmpDir}) & Directory under which intermediate files are saved (later removed). By default, intermediate files will be saved to the directory specified in `-o' argument (In \R, intermediate files are saved to the current working directory by default).\\
\hline
\end{longtable}
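The fractional counting rule described for the `$--$fraction' option in Table 3 can be illustrated with a short Python sketch. The function name \code{fractional\_count} and its arguments are hypothetical; the snippet only restates the 1/x, 1/y and 1/(x*y) arithmetic and is not taken from the {\featureCounts} source code.

```python
from fractions import Fraction


def fractional_count(n_alignments=1, n_overlapping_features=1,
                     use_M=False, use_O=False):
    """Count carried by one reported alignment to one feature under --fraction.

    n_alignments is x (alignments reported for the read) and
    n_overlapping_features is y (features overlapping the read).
    """
    if not (use_M or use_O):
        raise ValueError("--fraction must be used together with -M, -O or both")
    x = n_alignments if use_M else 1
    y = n_overlapping_features if use_O else 1
    return Fraction(1, x * y)


# A read with 4 reported alignments, counted with -M --fraction.
print(fractional_count(n_alignments=4, use_M=True))
# A read overlapping 3 features, counted with -O --fraction.
print(fractional_count(n_overlapping_features=3, use_O=True))
# Both situations at once, counted with -M -O --fraction.
print(fractional_count(n_alignments=2, n_overlapping_features=3,
                       use_M=True, use_O=True))
```

Exact fractions are used here only to keep the arithmetic easy to check by hand.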
\pagebreak
\section{A quick start for {\featureCounts} in SourceForge \Subread}
You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization.
The example commands below assume your annotation file is in GTF format.\\
\noindent Summarize SAM format single-end reads using 5 threads:\\
\noindent\code{featureCounts -T 5 -a annotation.gtf -t exon -g gene\_id \\
-o counts.txt mapping\_results\_SE.sam}\\
\noindent Summarize BAM format single-end read data:\\
\noindent\code{featureCounts -a annotation.gtf -t exon -g gene\_id \\
-o counts.txt mapping\_results\_SE.bam}\\
\noindent Summarize multiple libraries at the same time:\\
\noindent\code{featureCounts -a annotation.gtf -t exon -g gene\_id \\
-o counts.txt mapping\_results1.bam mapping\_results2.bam}\\
\noindent Summarize paired-end reads and count fragments (instead of reads):\\
\noindent\code{featureCounts -p -a annotation.gtf -t exon -g gene\_id \\
-o counts.txt mapping\_results\_PE.bam}\\
\noindent Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:\\
\noindent\code{featureCounts -p -P -d 50 -D 600 -a annotation.gtf -t exon -g gene\_id\\
-o counts.txt mapping\_results\_PE.bam}\\
\noindent Count fragments which have both ends successfully aligned without considering the fragment length constraint:\\
\noindent\code{featureCounts -p -B -a annotation.gtf -t exon -g gene\_id\\
-o counts.txt mapping\_results\_PE.bam}\\
\noindent Exclude chimeric fragments from the fragment counting:\\
\noindent\code{featureCounts -p -C -a annotation.gtf -t exon -g gene\_id\\
-o counts.txt mapping\_results\_PE.bam}
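\par\noindent When these commands are launched from a script, it can help to assemble the argument list programmatically. The Python sketch below is one possible way to do this; the helper \code{build\_featurecounts\_command} and its parameter names are hypothetical, and only options that appear in the examples above are used.

```python
import shlex


def build_featurecounts_command(annotation, output, inputs, threads=1,
                                paired=False, frag_length=None,
                                both_ends=False, exclude_chimeric=False,
                                feature_type="exon", attr_type="gene_id"):
    """Return a featureCounts argument list (helper and names are hypothetical).

    frag_length is an optional (minimum, maximum) tuple for -d and -D.
    """
    needs_pairs = frag_length is not None or both_ends or exclude_chimeric
    if needs_pairs and not paired:
        raise ValueError("-P/-d/-D, -B and -C are meant to be used with -p")
    cmd = ["featureCounts"]
    if threads != 1:
        cmd += ["-T", str(threads)]
    if paired:
        cmd.append("-p")
    if frag_length is not None:
        min_len, max_len = frag_length
        cmd += ["-P", "-d", str(min_len), "-D", str(max_len)]
    if both_ends:
        cmd.append("-B")
    if exclude_chimeric:
        cmd.append("-C")
    cmd += ["-a", annotation, "-t", feature_type, "-g", attr_type, "-o", output]
    cmd += list(inputs)
    return cmd


cmd = build_featurecounts_command("annotation.gtf", "counts.txt",
                                  ["mapping_results_PE.bam"],
                                  paired=True, frag_length=(50, 600))
# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to \code{subprocess.run} on a system where {\featureCounts} is installed.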
\section{A quick start for {\featureCounts} in Bioconductor \Rsubread}
You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization.
The example commands below assume your annotation file is in GTF format.\\
\noindent Load the {\Rsubread} library from your {\R} session:
\begin{Rcode}
library(Rsubread)
\end{Rcode}
\noindent Summarize single-end reads using built-in RefSeq annotation for mouse genome mm9:
\begin{Rcode}
featureCounts(files="mapping_results_SE.sam",annot.inbuilt="mm9")
\end{Rcode}
\noindent Summarize single-end reads using a user-provided GTF annotation file:
\begin{Rcode}
featureCounts(files="mapping_results_SE.sam",annot.ext="annotation.gtf",
isGTFAnnotationFile=TRUE,GTF.featureType="exon",GTF.attrType="gene_id")
\end{Rcode}
\noindent Summarize single-end reads using 5 threads:
\begin{Rcode}
featureCounts(files="mapping_results_SE.sam",nthreads=5)
\end{Rcode}
\noindent Summarize BAM format single-end read data:
\begin{Rcode}
featureCounts(files="mapping_results_SE.bam")
\end{Rcode}
\noindent Summarize multiple libraries at the same time:
\begin{Rcode}
featureCounts(files=c("mapping_results1.bam","mapping_results2.bam"))
\end{Rcode}
\noindent Summarize paired-end reads and count fragments (instead of reads):
\begin{Rcode}
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE)
\end{Rcode}
\noindent Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:
\begin{Rcode}
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,checkFragLength=TRUE,
minFragLength=50,maxFragLength=600)
\end{Rcode}
\noindent Count fragments which have both ends successfully aligned without considering the fragment length constraint:
\begin{Rcode}
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,requireBothEndsMapped=TRUE)
\end{Rcode}
\noindent Exclude chimeric fragments from fragment counting:
\begin{Rcode}
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,countChimericFragments=FALSE)
\end{Rcode}
\chapter{SNP calling}
\section{Algorithm}
SNPs (single nucleotide polymorphisms) are mutations of single nucleotides in the genome.
It has been reported that many diseases were initiated and/or driven by such mutations.
Therefore, successful detection of SNPs is very useful in designing better diagnosis and treatments for a variety of diseases such as cancer.
SNP detection is also an important subject of many population studies.
Next-gen sequencing technologies provide an unprecedented opportunity to identify SNPs at the highest resolution.
However, it is extremely computing-intensive to analyze the data generated from these technologies for the purpose of SNP discovery because of the sheer volume of the data and the large number of chromosomal locations to be considered.
To discover SNPs, reads need to be mapped to the reference genome first and then all the read data mapped to a particular site will be used for SNP calling for that site.
Discovery of SNPs is often confounded by many sources of errors.
Mapping errors and sequencing errors are often the major sources of errors causing incorrect SNP calling.
Incorrect alignments of indels, exon-exon junctions and structural variants in the reads can also result in wrong placement of blocks of continuous read bases, likely giving rise to consecutive incorrectly reported SNPs.
We have developed a highly accurate and efficient SNP caller, called \emph{exactSNP} \cite{exactSNP}.
\emph{exactSNP} calls SNPs for individual samples, without requiring control samples to be provided.
It tests the statistical significance of SNPs by comparing SNP signals to their background noises.
It has been found to be an order of magnitude faster than existing SNP callers.
\section{exactSNP}
Below is the command for running \code{exactSNP} program.
The complete list of parameters used by \code{exactSNP} can be found in Table 4.\\
\noindent\code{exactSNP [options] -i input -g reference\_genome -o output}\\
\begin{longtable}{|p{4.5cm}|p{11cm}|}
\multicolumn{2}{p{16cm}}{Table 4: Arguments used by the \code{exactSNP} program included in the SourceForge {\Subread} package in alphabetical order.
Arguments included in parentheses are the equivalent parameters used by the \code{exactSNP} function in the Bioconductor {\Rsubread} package.}
\endfirsthead
\hline
Arguments & Description \\
\hline
-a $<file>$ \newline (SNPAnnotationFile) & Specify name of a VCF-format file that includes annotated SNPs. Such annotation files can be downloaded from public databases such as the dbSNP database. Incorporating known SNPs into SNP calling has been found to be helpful. However note that the annotated SNPs may or may not be called for the sample being analyzed. \\
\hline
-b \newline (isBAM) & Indicate the input file provided via $-i$ is in BAM format. \\
\hline
-f $<float>$ \newline (minAllelicFraction) & Specify the minimum fraction of mis-matched bases a SNP-containing location must have. Its value must be between 0 and 1. 0 by default. \\
\hline
-g $<file>$ \newline (refGenomeFile) & Specify name of the file including all reference sequences. Only one single FASTA format file should be provided. \\
\hline
-i $<file>$ [-b if BAM] \newline (readFile) & Specify name of an input file including read mapping results. The format of input file can be SAM or BAM (\code{-b} needs to be specified if a BAM file is provided).\\
\hline
-n $<int>$ \newline (minAllelicBases) & Specify the minimum number of mis-matched bases a SNP-containing location must have. 1 by default.\\
\hline
-o $<file>$ \newline (outputFile) & Specify name of the output file. This program outputs a VCF format file that includes discovered SNPs. \\
\hline
-Q $<int>$ \newline (qvalueCutoff) & Specify the q-value cutoff for SNP calling at sequencing depth of 50X. 12 by default. The corresponding p-value cutoff is $10^{-Q}$. Note that this program automatically adjusts the q-value cutoff according to the sequencing depth at each chromosomal location.\\
\hline
-r $<int>$ \newline (minReads) & Specify the minimum number of mapped reads a SNP-containing location must have (ie. the minimum coverage). 1 by default. \\
\hline
-s $<int>$ \newline (minBaseQuality) & Specify the cutoff for base calling quality scores (Phred scores) read bases must satisfy to be used for SNP calling. 13 by default. Read bases that have Phred scores lower than the cutoff value will be excluded from the analysis. \\
\hline
-t $<int>$ \newline (nTrimmedBases) & Specify the number of bases trimmed off from each end of the read. 3 by default. \\
\hline
-T $<int>$ \newline (nthreads) & Specify the number of threads. 1 by default. \\
\hline
-v & Output version of the program. \\
\hline
-x $<int>$ \newline (maxReads) & Specify the maximum number of mapped reads a SNP-containing location could have. 3000 by default. Any location having more than the threshold number of reads will not be considered for SNP calling. This option is useful for removing PCR artefacts. \\
\hline
\end{longtable}
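To make the threshold arguments in Table 4 concrete, the Python sketch below applies the read-count and mismatch filters to a single candidate location and converts a q-value cutoff into its p-value cutoff. It is an illustration under stated assumptions, not the \code{exactSNP} implementation: the function names are hypothetical, the allelic fraction is assumed to be the number of mis-matched bases divided by the number of mapped reads at the location, and the depth-dependent adjustment of the q-value cutoff is not reproduced.

```python
def passes_site_filters(n_reads, n_mismatched, min_reads=1, max_reads=3000,
                        min_allelic_bases=1, min_allelic_fraction=0.0):
    """Apply the -r, -x, -n and -f style thresholds to one location."""
    if n_reads <= 0:
        return False
    # -r / -x: coverage must lie within [min_reads, max_reads].
    if n_reads < min_reads or n_reads > max_reads:
        return False
    # -n: minimum number of mis-matched bases.
    if n_mismatched < min_allelic_bases:
        return False
    # -f: minimum fraction of mis-matched bases (denominator is an assumption).
    return n_mismatched / n_reads >= min_allelic_fraction


def p_value_cutoff(q):
    """p-value cutoff corresponding to a q-value cutoff Q, i.e. 10^(-Q)."""
    return 10.0 ** (-q)


print(passes_site_filters(100, 20, min_allelic_fraction=0.1))
print(passes_site_filters(4000, 2000))
print(p_value_cutoff(12))
```

The second call returns \code{False} because the location has more reads than the default maximum of 3000.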
\chapter{Utility programs}
Usage information for each utility program can be displayed by typing the program name at the command prompt.
\section{repair}
This program takes as input a paired-end BAM file and places reads from the same pair next to each other in its output.
BAM files generated by {\repair} are compatible with the {\featureCounts} program, ie. they will not be re-sorted by {\featureCounts}.
Note that you do not have to run {\repair} before running {\featureCounts}.
{\featureCounts} calls {\repair} automatically if it finds that reads need to be re-sorted.
The {\repair} program uses a novel approach to quickly find reads from the same pair, rather than performing time-consuming sort of read names.
It takes only about half a minute to re-order a location-sorted BAM file including 30 million read pairs.
\section{coverageCount}
Compute the read coverage for each chromosomal location in the genome.
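As an illustration of what per-location read coverage means, the Python sketch below counts how many alignments cover each position of a toy chromosome. The function \code{per\_base\_coverage}, the 1-based inclusive coordinates and the input format are assumptions made for this example; it does not describe how \code{coverageCount} is implemented or what its output looks like.

```python
def per_base_coverage(chrom_length, alignments):
    """Return coverage for positions 1..chrom_length.

    alignments is an iterable of (start, end) pairs, 1-based and inclusive.
    """
    # Difference array: +1 where an alignment starts, -1 just after it ends.
    diff = [0] * (chrom_length + 2)
    for start, end in alignments:
        diff[start] += 1
        diff[end + 1] -= 1
    coverage = []
    running = 0
    for pos in range(1, chrom_length + 1):
        running += diff[pos]
        coverage.append(running)
    return coverage


# Two overlapping reads on a chromosome of length 10.
print(per_base_coverage(10, [(1, 4), (3, 6)]))
```

The difference array avoids touching every base of every read, so the cost grows with the number of reads plus the chromosome length.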
\section{propmapped}
Get number of mapped reads from a BAM/SAM file.
\section{qualityScores}
Retrieve Phred scores for read bases from a Fastq/BAM/SAM file.
\section{removeDup}
Remove duplicated reads from a SAM file.
\section{subread-fullscan}
Get all chromosomal locations that contain a genomic sequence sharing high homology with a given input sequence.
\chapter{Case studies}
\section{A Bioconductor R pipeline for analyzing RNA-seq data}
Here we illustrate how to use two Bioconductor packages - {\Rsubread} and {\limma} - to perform a complete RNA-seq analysis, including {\Subread} read mapping, {\featureCounts} read summarization, {\voom} normalization and {\limma} differential expression analysis.\\
{\noindent\bf Data and software.} The RNA-seq data used in this case study include four libraries: A\_1, A\_2, B\_1 and B\_2.
Sample A is Universal Human Reference RNA (UHRR) and sample B is Human Brain Reference RNA (HBRR).
A\_1 and A\_2 are two replicates of sample A (undergoing separate sample preparation), and B\_1 and B\_2 are two replicates of sample B.
In this case study, A\_1 and A\_2 are treated as biological replicates although they are more like technical replicates.
B\_1 and B\_2 are treated as biological replicates as well.
Note that these libraries only included reads originating from human chromosome 1 (according to {\Subread} aligner).
Reads were generated by the MAQC/SEQC Consortium.
Data used in this case study can be downloaded from\\
\url{http://bioinf.wehi.edu.au/RNAseqCaseStudy/data.tar.gz} (283MB).
Both read data and reference sequence for chromosome 1 of human genome (GRCh37) were included in the data.
After downloading the data, you can uncompress it and save it to your current working directory.
Launch {\R} and load {\Rsubread}, {\limma} and {\edgeR} libraries by issuing the following commands at your R prompt.
Your {\R} version should be 3.0.2 or later.
{\Rsubread} version should be 1.12.1 or later and {\limma} version should be 3.18.0 or later.
Note that this case study only runs on Linux/Unix and Mac OS X.
\begin{Rcode}
library(Rsubread)
library(limma)
library(edgeR)
\end{Rcode}
To install/update the {\Rsubread}, {\limma} and {\edgeR} packages, issue the following commands at your R prompt:
\begin{Rcode}
source("http://bioconductor.org/biocLite.R")
biocLite(pkgs=c("Rsubread","limma","edgeR"))
\end{Rcode}
{\noindent\bf Index building.} Build an index for human chromosome 1. This typically takes $\sim$3 minutes. Index files with basename `chr1' will be generated in your current working directory.
\begin{Rcode}
buildindex(basename="chr1",reference="hg19_chr1.fa")
\end{Rcode}
{\noindent\bf Alignment.} Perform read alignment for all four libraries and report uniquely mapped reads only. This typically takes $\sim$5 minutes. BAM files containing the mapping results will be generated in your current working directory.
\begin{Rcode}
targets <- readTargets()
align(index="chr1",readfile1=targets$InputFile,input_format="gzFASTQ",output_format="BAM",
output_file=targets$OutputFile,unique=TRUE,indels=5)
\end{Rcode}
{\noindent\bf Read summarization.} Summarize mapped reads to NCBI RefSeq genes.
This will only take a few seconds.
Note that the {\featureCounts} function contains built-in RefSeq annotations for human and mouse genes.
{\featureCounts} returns an {\R} `List' object, which includes raw read count for each gene in each library and also annotation information such as gene identifiers and gene lengths.
\begin{Rcode}
fc <- featureCounts(files=targets$OutputFile,annot.inbuilt="hg19")
fc$counts[1:5,]
A_1.bam A_2.bam B_1.bam B_2.bam
653635 642 522 591 596
100422834 1 0 0 0
645520 5 3 0 0
79501 0 0 0 0
729737 82 72 30 25
fc$annotation[1:5,c("GeneID","Length")]
GeneID Length
1 653635 1769
2 100422834 138
3 645520 1130
4 79501 918
5 729737 3402
\end{Rcode}
Create a {\DGEList} object.
\begin{Rcode}
x <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
\end{Rcode}
Calculate RPKM (reads per kilobase of exon per million mapped reads) values for genes:
\begin{Rcode}
x_rpkm <- rpkm(x,x$genes$Length,prior.count=0)
x_rpkm[1:5,]
A_1.bam A_2.bam B_1.bam B_2.bam
653635 939 905.0 709 736
100422834 19 0.0 0 0
645520 11 8.1 0 0
79501 0 0.0 0 0
729737 62 64.9 19 16
\end{Rcode}
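The RPKM definition can be written out as a one-line calculation. The Python sketch below assumes that the library size is the total number of reads counted for the library; the function name and the example numbers are hypothetical and are not taken from the case study data.

```python
def rpkm(count, gene_length, library_size):
    """Reads per kilobase of exon per million mapped reads.

    count: reads assigned to the gene
    gene_length: total exon length of the gene in bases
    library_size: total number of counted reads in the library (assumption)
    """
    kilobases = gene_length / 1e3
    millions = library_size / 1e6
    return count / kilobases / millions


# 500 reads on a 2000-base gene in a library of one million reads.
print(rpkm(500, 2000, 1_000_000))
```

Dividing by both gene length and library size is what makes values comparable across genes and across libraries.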
{\noindent\bf Filtering.} Only keep in the analysis those genes which had $>$10 reads per million mapped reads in at least two libraries.
\begin{Rcode}
isexpr <- rowSums(cpm(x) > 10) >= 2
x <- x[isexpr,]
\end{Rcode}
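The same filtering rule can be spelled out step by step. The Python sketch below is an illustrative re-statement of the rule, not a replacement for the {\edgeR} functions used above; the function names, gene labels and counts are made up, and counts per million are computed from the raw library sizes without any normalization factors.

```python
def cpm(count, library_size):
    """Counts per million for one gene in one library."""
    return count * 1e6 / library_size


def expressed_genes(counts, library_sizes, min_cpm=10.0, min_libraries=2):
    """Keep genes whose CPM exceeds min_cpm in at least min_libraries libraries."""
    kept = []
    for gene, row in counts.items():
        n_pass = sum(cpm(c, size) > min_cpm
                     for c, size in zip(row, library_sizes))
        if n_pass >= min_libraries:
            kept.append(gene)
    return kept


# Hypothetical counts for three genes in four libraries of one million reads.
library_sizes = [1_000_000] * 4
counts = {
    "geneA": [11, 12, 0, 0],   # above the cutoff in two libraries
    "geneB": [11, 5, 5, 5],    # above the cutoff in one library only
    "geneC": [9, 9, 9, 9],     # never above the cutoff
}
print(expressed_genes(counts, library_sizes))
```

Requiring two libraries matches the smallest group size in this design, so a gene expressed in only one cell type is still retained.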
{\noindent\bf Design matrix.} Create a design matrix:
\begin{Rcode}
celltype <- factor(targets$CellType)
design <- model.matrix(~0+celltype)
colnames(design) <- levels(celltype)
\end{Rcode}
{\noindent\bf Normalization.} Perform {\voom} normalization:
\begin{Rcode}
y <- voom(x,design,plot=TRUE)
\end{Rcode}
The figure below shows the mean-variance relationship estimated by voom.
\begin{center}
\includegraphics[scale=0.3]{voom_mean_variance.png}
\end{center}
{\noindent\bf Sample clustering.} Multi-dimensional scaling (MDS) plot shows that sample A libraries are clearly separated from sample B libraries.
\begin{Rcode}
plotMDS(y,xlim=c(-2.5,2.5))
\end{Rcode}
\begin{center}
\includegraphics[scale=0.5]{MDSplot.png}
\end{center}
{\noindent\bf Linear model fitting and differential expression analysis.} Fit linear models to genes and assess differential expression using eBayes moderated t statistic.
Here we compare sample B vs sample A.
\begin{Rcode}
fit <- lmFit(y,design)
contr <- makeContrasts(BvsA=B-A,levels=design)
fit.contr <- eBayes(contrasts.fit(fit,contr))
dt <- decideTests(fit.contr)
summary(dt)
BvsA
-1 922
0 333
1 537
\end{Rcode}
List top 10 differentially expressed genes:
\begin{Rcode}
options(digits=2)
topTable(fit.contr)
GeneID Length logFC AveExpr t P.Value adj.P.Val B
100131754 100131754 1019 1.6 16 113 3.5e-28 6.3e-25 54
2023 2023 1812 -2.7 13 -91 2.2e-26 1.9e-23 51
2752 2752 4950 2.4 13 82 1.5e-25 9.1e-23 49
22883 22883 5192 2.3 12 64 1.8e-23 7.9e-21 44
6135 6135 609 -2.2 12 -62 3.1e-23 9.5e-21 44
6202 6202 705 -2.4 12 -62 3.2e-23 9.5e-21 44
4904 4904 1546 -3.0 11 -60 5.5e-23 1.4e-20 43
23154 23154 3705 3.7 11 55 2.9e-22 6.6e-20 41
8682 8682 2469 2.6 12 49 2.2e-21 4.3e-19 39
6125 6125 1031 -2.0 12 -48 3.1e-21 5.6e-19 39
\end{Rcode}
\bibliographystyle{unsrt}
\bibliography{SubreadUsersGuide}
\end{document}