= Microbiome Utilities Portal of the Broad Institute =
image:images/broad-hmp-banner.gif["Broad HMP logo", width=800]
The Human Microbiome Project (HMP) is an exciting Roadmap initiative funded by the National Institutes of Health (NIH). The goal of the project is to understand how the microbial communities inhabiting our bodies contribute to normal human health, development, and disease (http://nihroadmap.nih.gov/hmp/[http://nihroadmap.nih.gov/hmp]).
The Broad Institute (http://www.broadinstitute.org[http://www.broadinstitute.org]) was launched in 2004 with the visionary philanthropic investment of Eli and Edythe Broad, who joined with leaders at Harvard and its affiliated hospitals, MIT, and the Whitehead Institute to pioneer a "new model" of collaborative science. The Broad Institute is organized as a transparent infrastructure that allows biology- and technology-focused scientists to work together to identify and overcome the most critical obstacles to realizing the full promise of genomic medicine.
The Broad Institute aggressively advances sequence-based technologies and the bioinformatics necessary to characterize the vast complexity of the human microbiome. In keeping with our mission, we make the microbiome analysis utilities developed by the Broad Institute available to the community in order to promote further innovation and collaborative research efforts. We appreciate your feedback.
The utilities developed by the Broad Institute and provided here apply to a range of challenges posed by the microbiome initiative, including:
- Sequence alignment (<<A_NASTiEr,NAST-iEr>>)
- Chimera detection (<<A_CS, ChimeraSlayer>>, <<A_WigeoN, WigeoN>>)
- Operational taxonomic unit (OTU) binning (<<A_TreeChopper, TreeChopper>>)
- Sequence assembly (<<A_AMOScmp, AmosCmp16Spipeline>>)
== Microbiome Analysis Utilities ==
[[A_CS]]
=== CMCS: ChimeraMaligner and ChimeraSlayer ===
CMCS http://sourceforge.net/project/showfiles.php?group_id=262346[(download)] is a chimeric sequence detection utility, compatible with near-full length Sanger sequences and shorter 454-FLX sequences (~500 bp).
The ChimeraSlayer (CS) Algorithm: Given a candidate chimera query sequence, candidate parental sequences are identified by a homology search. The ends of the query sequence are searched separately to identify these candidate parents. All sequences, including the query and candidate parents, are extracted in NAST format. Sequences most likely to correspond to parents of the chimera are identified by fitting an alignment of the query sequence through a multiple alignment of the candidate parents, allowing for breakpoints between aligned parental sequences such that the total alignment score is maximized (a Viterbi-like algorithm). This parent selection algorithm is called ChimeraMaligner (CM). The candidate parents identified by this alignment-fitting procedure are tested in all pairwise combinations as potential parents of the putative chimeric query sequence using a modified Bellerophon-like algorithm, the heart of ChimeraSlayer.
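The alignment-fitting step can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the ChimeraMaligner implementation: the function name `fit_query_through_parents`, the match/mismatch scores, and the `breakpoint_penalty` value are all hypothetical, and the sequences are assumed to share the same aligned (NAST-style) columns.

```python
def fit_query_through_parents(query, parents, match=1, mismatch=-1,
                              breakpoint_penalty=5):
    """Viterbi-like fit of an aligned query through aligned candidate parents.

    Returns (best_score, path), where path[i] is the index of the parent
    followed at alignment column i. Scores and penalty are illustrative
    assumptions, not values taken from ChimeraMaligner.
    """
    n = len(query)
    if any(len(p) != n for p in parents):
        raise ValueError("all sequences must share the same aligned length")
    k = len(parents)

    def col_score(p, i):
        return match if parents[p][i] == query[i] else mismatch

    # score[p] = best score of any path that ends on parent p at the current column
    score = [col_score(p, 0) for p in range(k)]
    back = []  # back[i - 1][p] = parent used at column i - 1 when p is used at column i
    for i in range(1, n):
        best_prev = max(range(k), key=lambda q: score[q])
        new_score, pointers = [], []
        for p in range(k):
            stay = score[p]
            switch = score[best_prev] - breakpoint_penalty
            if switch > stay:
                prev, base = best_prev, switch  # breakpoint between columns i - 1 and i
            else:
                prev, base = p, stay
            new_score.append(base + col_score(p, i))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final parent to recover the path.
    end = max(range(k), key=lambda p: score[p])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[end], path
```

Columns where consecutive entries of `path` differ mark candidate breakpoints, and the distinct parent indices in `path` are the candidates that would be passed on to the pairwise test.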
The Bellerophon-like algorithm of CS evaluates each pair of candidate parents and the potential breakpoints between the parents that could have given rise to the chimeric query sequence. Breakpoints that satisfy the phylogenetic relationship in which the (query, chimera) divergence is less than both the (query, parentA) and the (query, parentB) divergence are assessed for chimera support by bootstrapping. The bootstrap is performed by sampling 10% of the SNPs on each side of the breakpoint and assessing the support for the phylogenetic relationship. Breakpoints with at least 90% bootstrap support are flagged as chimeras.
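The bootstrap step might look roughly like the following. This is a simplified stand-in rather than the ChimeraSlayer code: it assumes SNPs are the columns where the two parents differ, samples without replacement, and replaces the divergence test with a simple "query agrees more with parent A on the left and with parent B on the right" check. The names `snp_columns` and `bootstrap_support` and the replicate count are hypothetical.

```python
import random


def snp_columns(parent_a, parent_b, start, stop):
    """Columns in [start, stop) where the two aligned parents differ."""
    return [i for i in range(start, stop) if parent_a[i] != parent_b[i]]


def bootstrap_support(query, parent_a, parent_b, breakpoint,
                      replicates=100, fraction=0.10, seed=0):
    """Fraction of replicates supporting an A-left / B-right chimera.

    Assumption: a replicate supports the chimera when the sampled SNPs left
    of the breakpoint agree more with parent A and those to the right agree
    more with parent B.
    """
    rng = random.Random(seed)
    left = snp_columns(parent_a, parent_b, 0, breakpoint)
    right = snp_columns(parent_a, parent_b, breakpoint, len(query))
    if not left or not right:
        return 0.0

    def closer_to(first, second, columns):
        first_hits = sum(query[i] == first[i] for i in columns)
        second_hits = sum(query[i] == second[i] for i in columns)
        return first_hits > second_hits

    supported = 0
    for _ in range(replicates):
        # Sample 10% of the SNP columns on each side (at least one column).
        left_sample = rng.sample(left, max(1, round(len(left) * fraction)))
        right_sample = rng.sample(right, max(1, round(len(right) * fraction)))
        if (closer_to(parent_a, parent_b, left_sample)
                and closer_to(parent_b, parent_a, right_sample)):
            supported += 1
    return supported / replicates
```

Under these assumptions, a breakpoint would be flagged when `bootstrap_support(...) >= 0.90`.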
[[A_WigeoN]]
=== WigeoN ===
WigeoN http://sourceforge.net/project/showfiles.php?group_id=262346[(download)] examines the sequence conservation between a query and a trusted reference sequence, both in NAST alignment format. Based on the sequence identity between the query and the reference sequence, there is an expected amount of variation across the alignment. If the observed variation is greater than the 95% quantile of the distribution of variation observed between non-anomalous sequences, the query is flagged as an anomaly.
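The quantile test itself is simple to sketch. The example below assumes that a single variation statistic has already been computed for the query and for a set of non-anomalous sequence pairs of comparable identity; how WigeoN computes that statistic is not shown here. The function names and the nearest-rank quantile definition are assumptions made for illustration.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile of a non-empty list, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def is_anomalous(observed_variation, reference_variations, q=0.95):
    """Flag the query when its variation exceeds the q-quantile of the
    variation seen between non-anomalous sequences (hypothetical inputs)."""
    return observed_variation > quantile(reference_variations, q)
```

A stricter or looser cutoff can be explored by changing `q`; the 95% value follows the description above.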
WigeoN is a flexible command-line based reimplementation of the http://www.bioinformatics-toolkit.org/Pintail/[Pintail] algorithm (http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16332745[Appl Environ Microbiol. 2005 Dec;71(12):7724-36]).
WigeoN is useful for flagging chimeras and anomalies only in near full-length 16S rRNA sequences. WigeoN lacks sensitivity with sequences shorter than 1000 bp.
[[A_NASTiEr]]
=== NAST-iEr ===
The NAST-iEr alignment utility http://sourceforge.net/project/showfiles.php?group_id=262346[(download)] aligns a single raw nucleotide sequence against one or more NAST formatted sequences.
The alignment algorithm performs a global dynamic programming alignment without any end-gap penalty (similar in principle to Pearson's align0 program) against a fixed template sequence containing arbitrary gap positions.
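A global alignment with free end gaps can be sketched as follows. This is an illustration of the general technique, not the NAST-iEr code: the template is treated as an ungapped string (projecting the result back onto the template's NAST columns is omitted), and the function name and scoring values are assumptions.

```python
def align_free_end_gaps(query, template, match=1, mismatch=-1, gap=-2):
    """Global alignment in which leading and trailing gaps cost nothing.

    Returns (score, aligned_query, aligned_template). Scoring values are
    illustrative assumptions.
    """
    n, m = len(query), len(template)
    # First row and column stay 0: leading gaps are free.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trailing gaps are free: the alignment may end anywhere on the last row or column.
    candidates = [(score[n][j], n, j) for j in range(m + 1)]
    candidates += [(score[i][m], i, m) for i in range(n + 1)]
    best, end_i, end_j = max(candidates)

    i, j = end_i, end_j
    core_q, core_t = [], []
    while i > 0 and j > 0:
        sub = match if query[i - 1] == template[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            core_q.append(query[i - 1])
            core_t.append(template[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            core_q.append(query[i - 1])
            core_t.append("-")
            i -= 1
        else:
            core_q.append("-")
            core_t.append(template[j - 1])
            j -= 1

    # Pad the unpenalized overhangs so both strings have equal length.
    aligned_q = (query[:i] + "-" * j + "".join(reversed(core_q))
                 + query[end_i:] + "-" * (m - end_j))
    aligned_t = ("-" * i + template[:j] + "".join(reversed(core_t))
                 + "-" * (n - end_i) + template[end_j:])
    return best, aligned_q, aligned_t
```

For example, `align_free_end_gaps("CGTA", "AACGTATT")` returns `(4, "--CGTA--", "AACGTATT")`: the overhanging template ends are not penalized.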
[[A_AMOScmp]]
=== AmosCmp16Spipeline ===
AmosCmp16Spipeline http://sourceforge.net/project/showfiles.php?group_id=262346[(download)] uses the AMOScmp software to assemble multiple, potentially overlapping 16S rRNA sequencing reads based on read mappings to a reference 16S rRNA gene.
The pipeline takes the following inputs:
- a FASTA file containing sequencing reads
- a file containing the corresponding qual values
- a file enumerating the accessions corresponding to reads of the same clone (individual assembly tasks)
- a reference database of 16S rRNA sequences
The single reference sequence that best matches all the reads is chosen. Lucy is used to trim low-quality termini from the sequence reads. An additional homology-trimming operation is performed to exclude regions of the sequence that lack homology to the reference. The resulting trimmed reads and quality values are used to generate a sequence assembly using the AMOScmp software. A scaffold sequence is generated, where Ns are used to fill in gaps according to estimated gap sizes based on reference sequence anchoring, and quality values are reported according to the scaffold sequence.
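The final scaffolding step can be sketched as below. This is a hypothetical illustration, not the pipeline's code: it assumes each contig is given as `(ref_start, ref_end, sequence, quals)` with 0-based, half-open reference coordinates, estimates each gap as the distance between consecutive anchors, and assigns quality 0 to the filled-in N positions.

```python
def build_scaffold(contigs):
    """Join reference-anchored contigs into one scaffold, filling gaps with Ns.

    contigs: iterable of (ref_start, ref_end, sequence, quals) tuples
    (a hypothetical layout). Returns (scaffold_sequence, scaffold_quals).
    """
    pieces, quals = [], []
    prev_end = None
    for ref_start, ref_end, seq, contig_quals in sorted(contigs, key=lambda c: c[0]):
        if prev_end is not None:
            # Estimated gap size from reference anchoring; overlapping anchors give 0.
            gap = max(0, ref_start - prev_end)
            pieces.append("N" * gap)
            quals.extend([0] * gap)  # assumption: N positions get quality 0
        pieces.append(seq)
        quals.extend(contig_quals)
        prev_end = ref_end
    return "".join(pieces), quals
```

The returned quality list has one entry per scaffold position, so it stays in step with the scaffold sequence.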
[[A_TreeChopper]]
=== TreeChopper ===
TreeChopper http://sourceforge.net/project/showfiles.php?group_id=262346[(download)] clusters tree leaf nodes according to phylogenetic distance.
A graph is constructed from the tree like so: all leaves are visited, and from each leaf, all neighboring leaves within a specified distance threshold are added to a graph with an edge placed between them. After building this graph, each edge connecting pairs of nodes is examined and a Jaccard similarity coefficient is computed (see http://www.biomedcentral.com/1741-7007/3/7[http://www.biomedcentral.com/1741-7007/3/7] for details). Those edges that loosely connect nodes as defined by this similarity coefficient are removed. The nodes connected by the remaining edges are clustered by transitive closure (single linkage clustering) and reported as OTUs.
The phylogenetic distance threshold between clustered nodes and the minimum similarity coefficient between nodes in the graph are tunable parameters.
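The clustering procedure can be sketched as follows. This is a minimal illustration, not the TreeChopper source: leaf-to-leaf tree distances are supplied through a hypothetical `distance` callback, and the Jaccard coefficient is assumed to be computed over each node's neighbor set including the node itself.

```python
from itertools import combinations


def cluster_leaves(leaves, distance, distance_threshold, min_jaccard):
    """Cluster leaves into OTUs: threshold graph, Jaccard edge filter,
    then single-linkage (connected components). Names are hypothetical."""
    # 1. Connect every pair of leaves within the distance threshold.
    neighbors = {leaf: {leaf} for leaf in leaves}
    for a, b in combinations(leaves, 2):
        if distance(a, b) <= distance_threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # 2. Keep only edges whose endpoints share enough neighbors.
    kept = {leaf: set() for leaf in leaves}
    for a in leaves:
        for b in neighbors[a]:
            if a == b:
                continue
            shared = len(neighbors[a] & neighbors[b])
            total = len(neighbors[a] | neighbors[b])
            if shared / total >= min_jaccard:
                kept[a].add(b)

    # 3. Transitive closure: each connected component is one OTU.
    seen, clusters = set(), []
    for leaf in leaves:
        if leaf in seen:
            continue
        seen.add(leaf)
        stack, component = [leaf], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in kept[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters
```

Raising `min_jaccard` removes loosely connected edges and splits chains of leaves that single linkage alone would merge.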
== Miscellaneous Remarks ==
- The eubacterial 16S rRNA is the primary target of these utilities.
== Questions, comments, etc? ==
Contact Brian Haas (bhaas at broadinstitute dot org)