File: OpenMS_Tutorial.doxygen

// --------------------------------------------------------------------------
//                   %OpenMS -- Open-Source Mass Spectrometry
// --------------------------------------------------------------------------
// Copyright The %OpenMS Team -- Eberhard Karls University Tuebingen,
// ETH Zurich, and Freie Universitaet Berlin 2002-2018.
//
// This software is released under a three-clause BSD license:
//  * Redistributions of source code must retain the above copyright
//    notice, this list of conditions and the following disclaimer.
//  * Redistributions in binary form must reproduce the above copyright
//    notice, this list of conditions and the following disclaimer in the
//    documentation and/or other materials provided with the distribution.
//  * Neither the name of any author or any participating institution
//    may be used to endorse or promote products derived from this software
//    without specific prior written permission.
// For a full list of authors, refer to the file AUTHORS.
// --------------------------------------------------------------------------
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
// ARE DISCLAIMED. IN NO EVENT SHALL ANY OF THE AUTHORS OR THE CONTRIBUTING
// INSTITUTIONS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
// WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
// OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
// ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// --------------------------------------------------------------------------
// $Maintainer: Oliver Alka  $
// $Authors: Oliver Alka, Timo Sachsenberg$
// --------------------------------------------------------------------------

//########################### Please read this carefully! ###########################

// Sections:
// - to add new pages you have to add them to:
//    - doc/OpenMS_tutorial/refman_overwrite.tex.in (pdf output)
//    - doc/doxygen/public/OpenMS_Tutorial_html.doxygen (html output)

// Conventions:
// - Please write a short introduction for each chapter that explains
//   what classes are described and where these classes can be found (folder)
// - Use @a to visually highlight class names, namespaces, etc
// - When using example code, put it in the %OpenMS/doc/code_examples folder
//   to make sure it can be compiled. The name of the file should be in the text
//   to make the file easy to find for the user.
// - When talking about %OpenMS in general, prefix it with a '%'. Otherwise a
//   link to the %OpenMS namespace is generated automatically

//####################################### INTRODUCTION  #######################################

/**

@page tutorial OpenMS Developer Quickstart Guide

@section tutorial_introduction Introduction
@subsection tutorial_gf General Information

Mass spectrometry (MS) is an essential analytical technique for high-throughput analysis in proteomics and metabolomics. The development of new separation techniques, precise mass analyzers and experimental protocols is a very active field of research. This leads to more complex experimental setups yielding ever increasing amounts of data. Consequently, analysis of the data is currently often the bottleneck for experimental studies. Although software tools for many data analysis tasks are available today, they are often hard to combine with each other or not flexible enough to allow for rapid prototyping of a new analysis workflow. <br>

<b>%OpenMS</b>, a software framework for rapid <b>application and method development</b> in mass spectrometry, has been designed to be portable, easy to use, and robust while offering rich functionality ranging from basic data structures to sophisticated algorithms for data analysis (https://www.nature.com/nmeth/journal/v13/n9/abs/nmeth.3959.html). <br>

<b>Ease of use:</b>
%OpenMS follows the <b>object-oriented</b> programming paradigm, which aims at mapping real-world entities to comprehensible data structures and interfaces. %OpenMS enforces a <b>coding style</b> that ensures consistent names of classes, methods and member variables which increases the usability as a software library. Another important feature of a software framework is documentation. We decided to use <b>doxygen</b> (www.doxygen.org/) to generate the class documentation from the <b>source code</b>, which ensures consistency of code and documentation. The documentation is generated in HTML format making it easy to read with a web browser. <br>

<b>Robustness:</b>
Robustness of algorithms is essential if a new method is to be applied routinely to large-scale datasets. Typically, there is a trade-off between performance and robustness. %OpenMS tries to address both issues equally. In general, we try to tolerate recoverable errors, e.g. files that do not entirely fulfill the format specifications. On the other hand, <b>exceptions</b> are used to handle fatal errors. To check for correctness, more than 1000 <b>unit tests</b> are implemented in total, covering public methods of classes. These tests check the behavior for both valid and invalid use. Additionally, <b>preprocessor macros</b> enable extra consistency checks in debug mode, such as enforcing <b>pre- and post-conditions</b>. These checks are disabled in release mode for performance reasons. <br>
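
The debug-only checks mentioned above can be pictured with a minimal, self-contained sketch. The macro name DEMO_PRECONDITION and the function maxIntensity are hypothetical and exist only for this illustration; they are not the macros or functions that %OpenMS itself provides. The sketch assumes the common convention that NDEBUG is defined in release builds.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical debug-only pre-condition macro (an assumption for this sketch,
// not the macro shipped with OpenMS). It throws in debug builds and expands
// to an empty statement when NDEBUG is defined.
#ifndef NDEBUG
#define DEMO_PRECONDITION(cond, msg) do { if (!(cond)) throw std::logic_error(msg); } while (false)
#else
#define DEMO_PRECONDITION(cond, msg) do { } while (false)
#endif

// Returns the largest intensity; the caller must pass a non-empty vector.
inline double maxIntensity(const std::vector<double>& intensities)
{
  DEMO_PRECONDITION(!intensities.empty(), "maxIntensity: empty input");
  double best = intensities[0];
  for (double value : intensities)
  {
    if (value > best) best = value;
  }
  return best;
}
```

In a debug build, calling maxIntensity with an empty vector raises an exception at the point of misuse. In a release build the check is compiled out, so the same call would be undefined behavior; that is the price paid for the performance gain.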

<b>Extensibility:</b>
Since %OpenMS is based on several <b>external libraries</b>, it is designed for the integration of external code. All classes are encapsulated in the %OpenMS namespace to avoid symbol clashes with other libraries. Through the use of <b>C++ templates</b>, many data structures are adaptable to specific use cases. Also, %OpenMS supports <b>standard formats</b> and is itself open-source software. The use of standard formats ensures that applications developed with %OpenMS can be easily integrated into existing analysis pipelines. %OpenMS source code is released under the permissive <b>three-clause BSD license</b> and hosted on <b>GitHub</b>, a hosting platform for open-source software. This allows users to participate in the project and to contribute to the code base. <br>

<b>Scriptable:</b>
%OpenMS exposes its functionality through Python bindings (<b>%pyOpenMS</b>). This eases the rapid development of algorithms in Python that can later be translated to C++. Please see our <a href="https://pyopenms.readthedocs.io">pyOpenMS documentation</a> for a description and walk-through of the pyOpenMS capabilities. <br>

<b>Portability:</b>
%OpenMS supports <b>Windows</b>, <b>Linux</b>, and <b>OS X</b> platforms. <br>

@subsection tutorial_structure The structure of the OpenMS Framework

The following image shows the overall structure of %OpenMS:

\htmlonly <style>div.image img[src="OpenMS_overview.png"]{max-width:1200px;}</style> \endhtmlonly 
@image html OpenMS_overview.png "Overall design of OpenMS. Kindly provided by Timo Sachsenberg." width=1200px
@image latex OpenMS_overview.png "Overall design of OpenMS." width=14cm

The structure of the OpenMS framework. <br>

The %OpenMS software framework consists of four main layers:
- <b>%OpenMS Library:</b> the object-oriented %OpenMS core library contains over 1,300 classes and is built on modern C++ infrastructure with native compiler support on Windows, Linux and OS X. The classes represent core concepts in mass spectrometry as well as the corresponding ontologies defined by the Human Proteome Organization Proteomics Standard Initiative (HUPO-PSI). 
<br>
- <b>Scripting:</b> a well-defined Python API offers scripting for rapid software prototyping and interactive data exploration by researchers with advanced scripting skills. The pyOpenMS interactive Python interface provides easy integration of the %OpenMS library with other scientific Python libraries.
<br>
- <b>TOPP tools:</b> a set of pre-built tools covering most core tasks in computational mass spectrometry. These tools are created using the %OpenMS library and form the building blocks that can be chained together into complex workflows. 
<br>
- <b>Workflow:</b> a set of over 185 different tools for common mass spectrometric tasks can be accessed by routine users through the KNIME and Galaxy workflow systems. 
<br>

Each level of increasing abstraction provides better usability, but limits the extensibility as the Python and workflow levels only have access to the exposed Python API or the available set of TOPP tools respectively. Increasing abstraction, however, makes it easier to design and execute complex analyses, even across multiple omics types. By following a layered design the different needs of bioinformaticians and life scientists are addressed. <br>

@subsection tutorial_developing Developing with OpenMS

Before we get started developing with %OpenMS, we would like to point to some information on the development model and conventions we follow to maintain a coherent code base.
<br>
<b>Development model</b><br>
%OpenMS follows the Gitflow development workflow, which is excellently described <a href="http://nvie.com/posts/a-successful-git-branching-model/">here</a>. Additionally, we encourage every developer (even those eligible to push directly to %OpenMS) to create their own fork (e.g. username). The GitHub people provide superb documentation on <a href="https://help.github.com/articles/fork-a-repo">forking</a> and how to keep your fork <a href="https://help.github.com/articles/syncing-a-fork">up-to-date</a>. With your own fork you can follow the Gitflow development model directly, but instead of merging into "develop" in your own fork you can open a <a href="https://help.github.com/articles/using-pull-requests">pull request</a>. Before opening the pull request, please check the <a href="https://github.com/OpenMS/OpenMS/wiki/Pull-Request-Checklist">checklist</a>.
<br>
Some more details and tips are collected here.
<br>

<b>Conventions</b><br>
See the manual for proper coding style: <a href="https://github.com/OpenMS/OpenMS/wiki/Coding-conventions">Coding conventions</a>. Also see the <a href="https://github.com/OpenMS/OpenMS/wiki/C&%2343;&%2343;-Guide">C++ Guide</a>.
We automatically test for common coding convention violations using a modified version of cpplint. Style testing can be enabled using CMake options. We also provide a configuration file for Uncrustify for automated style corrections (see "tools/uncrustify.cfg").
<br>

<b>Commit Messages</b><br>
In order to ease the creation of a CHANGELOG we use a defined format for our commit messages.
See the manual for proper commit messages: <a href="https://github.com/OpenMS/OpenMS/wiki/HowTo---Write-Commit-Messages">How to write commit messages</a>.<br>
<br>
<b>Automated Unit Tests</b><br>
Pull requests are automatically tested using our continuous integration platform. In addition we perform nightly test runs covering different platforms. Even if everything compiled well on your machine and all tests passed, please check if you broke another platform on the next day.
Nightly tests: <a href="http://cdash.openms.de/index.php?project=OpenMS">CDASH</a>
<br>
<br>
<b>Experimental Installers</b><br>
We automatically build installers for different platforms. These usually contain unstable or partially untested code - so use them at your own risk.
The nightly (unstable) installers are available <a href="http://ftp.mi.fu-berlin.de/OpenMS/nightly_binaries/">here</a>.
<br>
<br>
<b>Technical Documentation</b><br>
Documentation of classes and tools is automatically generated using doxygen:
See the documentation for <a href="http://www.openms.de/current_doxygen/html/">HEAD</a>
See the documentation for the latest <a href="https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/release/latest/html/index.html">release branch</a>
<br>
<br>
<b>Building %OpenMS</b><br>
Before you get started coding with %OpenMS you need to build it for your operating system. Please follow the <b>build instructions</b> from the documentation.
<br>
<a href="https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/release/2.3.0/html/install_linux.html"><b>Building %OpenMS on GNU/Linux</b></a>
<br>
<a href="https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/release/2.3.0/html/install_mac.html"><b>Building %OpenMS on Mac OS X</b></a>
<br>
<a href="https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/release/2.3.0/html/install_win.html"><b>Building %OpenMS on Windows</b></a>
<br>
Note that for development purposes, you might want to set the variable <i>CMAKE_BUILD_TYPE</i> to <i>Debug</i>. Otherwise, the default <i>Release</i> is applied, which disables pre-condition checks, post-condition checks, and assertions.
<br>
<br>
<b>Choice of an IDE</b> <br>
You are, of course, free to choose your favorite (or even no) IDE for %OpenMS development, but given the size of %OpenMS, not all IDEs perform equally well. We have had good experiences with Qt Creator on Linux and Mac, because it can directly import CMake projects and is rather fast at indexing all files. On Windows, Visual Studio is currently the preferred solution. Additionally, you may want to try JetBrains CLion (it is free for students, teachers and open source projects). Another option is Eclipse with C++ support, which can also import CMake projects directly with the respective CMake generator.

@subsection tutorial_terms Mass spectrometry terms

 The following terms for MS-related data are used in this tutorial and the %OpenMS class documentation:
 - <b>Raw or profile peak</b>: a typically Gaussian shaped mass peak measured by the instrument. <br>
 - <b>Centroid or picked peak</b>: a single m/z, intensity pair as obtained after using a peak picking (also: peak centroiding) algorithm. <br>
 - <b>Spectrum / Scan</b>: a mass spectrum containing either profile peaks (profile spectrum) or centroided peaks (peak spectrum). For example, a low-resolution profile spectrum (blue) and a centroided peak spectrum (pink) are shown in the figure below. <br>

  @image html Terms_Spectrum.png "Part of a raw spectrum (blue) with three peaks (red)"
  @image latex Terms_Spectrum.png "Part of a raw spectrum (blue) with three peaks (red)" width=12cm

 - <b>(Peak or Raw) Map</b>: a collection of spectra of a single LC-MS run. If spectra are recorded in profile mode, we usually use the term raw map. If spectra are already centroided we usually refer to them as peak map. <br>
 - <b>Feature</b>: a signal from a chemical entity detected in an HPLC-MS experiment, typically a peptide. <br>

The image below shows a peak map and the red circle highlights a feature. 

  @image html Terms_Map.png "Peak map with a marked feature (red)"
  @image latex Terms_Map.png "Peak map with a marked feature (red)" width=14cm
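
To make the relation between a profile peak and a centroided peak concrete, the following sketch reduces the sampled points of one profile peak to a single m/z, intensity pair. It uses a plain intensity-weighted mean for the position and the apex height for the intensity. This is an assumption chosen for brevity and is not claimed to be the method used by any %OpenMS peak picker. The names DemoPeak and centroid are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Toy data point (hypothetical type, not an OpenMS class).
struct DemoPeak
{
  double mz;
  double intensity;
};

// Collapses the sampled points of one profile peak into a single centroided
// peak: m/z is the intensity-weighted mean, intensity is the apex height.
// Returns {0, 0} if the profile carries no signal.
inline DemoPeak centroid(const std::vector<DemoPeak>& profile)
{
  double weighted_mz = 0.0;
  double total_intensity = 0.0;
  double apex = 0.0;
  for (const DemoPeak& point : profile)
  {
    weighted_mz += point.mz * point.intensity;
    total_intensity += point.intensity;
    if (point.intensity > apex) apex = point.intensity;
  }
  if (total_intensity <= 0.0) return {0.0, 0.0};
  return {weighted_mz / total_intensity, apex};
}
```

For a symmetric profile peak the weighted mean coincides with the apex position; for skewed peaks it shifts toward the heavier flank, which is one reason production peak pickers tend to use more elaborate models.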

@section tutorial_library OpenMS Library

The extensible %OpenMS library implements common mass spectrometric data processing tasks through a well defined API in C++ and Python using standardized open data formats.

@subsection tutorial_library_overview Overview on Central Algorithms and Methods

%OpenMS provides algorithms in many fields of computational metabolomics and proteomics. 
<br>
<b>The following list is intended to give algorithm and tool developers a starting point for finding the tools and classes relevant to their scientific question</b>. It does not include third-party tools, only tools that were implemented in %OpenMS. 
<br>

- Proteomics:
    - Signal processing:
        - Conversion from profile to centroided spectra (Tool PeakPickerHiRes)
        - Precursor mass correction (Tool HiResPrecursorMassCorrector)
    - Filtering:
        - Large number of basic filters applicable to different types of data (e.g., remove identified spectra, filter MS2, extract m/z ranges, … in Tool FileFilter and IDFilter)
    - Identification:
        - Database search:
            - Peptides (Tool %SimpleSearchEngine and its classes - started simple but is, by now, a rather complete peptide identification engine)
            - Protein-RNA cross-links (Tool %RNPxlSearch and its classes)
            - Protein-Protein cross-links (Tool OpenPepXL)
        - Spectral library search:
            - Tool SpecLibSearcher and its classes
        - DeNovo:
            - Tool CompNovoCID and its classes
    - Quantification:
        - Peptide Feature Detection:
            - Untargeted, label-free (Tools FeatureFinderCentroided, FeatureFinderMultiplex, and their classes)
            - ID-based label-free (Tool FeatureFinderIdentification “new”)
            - SILAC-labeling (Tool FeatureFinderMultiplex)
            - iTRAQ/TMT (Tool IsobaricAnalyzer)
            - Dynamically labeled (SIP) peptides (Tool MetaProSIP)
        - Retention Time Alignment:
            - Linear map alignment (Tool MapAlignerPoseClustering)
            - (Non-)linear map alignment (Tool MapAlignerIdentification “new”)
        - Peptide Feature linking (matching of features between runs):
            - fast, KD-tree based linking (Tool FeatureLinkerUnlabeledKD)
            - QT based clustering and linking (Tool FeatureLinkerUnlabeledQT)
        - Protein inference:
            - WIP (currently via third-party tool FIDO and Wrapper FidoAdapter)
        - Protein Quantification:
            - Tool ProteinQuantifier
        - Targeted data extraction: 
            - Analysis of data-independent acquisition or SWATH-MS data (Tool OpenSWATH)
        - Misc:
            - Theoretical spectra generators
<br>

- Metabolomics:
    - Quantification:
        - Small molecule feature detection:
            - Untargeted, label-free (Tool FeatureFinderMetabo)
        - Retention Time Alignment:
            - Linear map alignment (Tool MapAlignerPoseClustering)
        - Small molecule feature linking:
            - QT based clustering and linking (Tool FeatureLinkerUnlabeledQT)
            - fast, KD-tree based linking (Tool FeatureLinkerUnlabeledKD)
        - Adduct decharging:
            - Linear programming based determination of small molecule ion adducts and charges (Tool MetaboliteAdductDecharger)
        - Targeted data extraction: 
            - Analysis of data-independent acquisition or SWATH-MS data (Tool OpenSWATH)
    - Identification:
        - Spectral library search:
            - Tool MetaboliteSpectralMatcher
        - Accurate mass search: 
            - Tool AccurateMassSearch
 
<br>
- General:
    - Mass decomposition algorithms
    - Isotope pattern generators
    - Quality control (Tools QCCalculator, QCExtractor) metrics and file format (QcML)

<br>
<table>
<caption>Directory structure of src folder (/src)</caption>
 <tr><th>Folder</th><th>Description</th></tr>
 <tr><td>openms</td><td>Source code of core library</td></tr>
 <tr><td>openms_gui</td><td>Source code of GUI applications (e.g.: TOPPView)</td></tr>
 <tr><td>topp</td><td>Source code of (stable) %OpenMS Tools</td></tr>
 <tr><td>util</td><td>Source code of (experimental) %OpenMS Tools</td></tr>
 <tr><td>pyOpenMS</td><td>Source files providing the python bindings</td></tr>
 <tr><td>tests</td><td>Source code of class and tool tests</td></tr>
</table>
<br>
<table>
<caption>Directory structure of core library (/src/openms/include/OpenMS)</caption>
 <tr><th>Folder</th><th>Description</th></tr>
 <tr><td>ANALYSIS</td><td>Source code of high-level analysis like PeakPicking, Quantitation, Identification, MapAlignment</td></tr>
 <tr><td>APPLICATIONS</td><td>Source code for tool base and handling</td></tr>
 <tr><td>CHEMISTRY</td><td>Source code dealing with Elements, Enzymes, Residues, Modifications, Isotope distributions and amino acid sequences</td></tr>
 <tr><td>COMPARISON</td><td>Different scoring functions for clustering and spectra comparison</td></tr>
 <tr><td>CONCEPT</td><td>%OpenMS concepts (types, macros, ...)</td></tr>
 <tr><td>DATASTRUCTURES</td><td>Auxiliary data structures</td></tr>
 <tr><td>FILTERING</td><td>Filter</td></tr>
 <tr><td>FORMAT</td><td>Source code for I/O classes and file formats</td></tr>
 <tr><td>INTERFACES</td><td>Interfaces (WIP)</td></tr>
 <tr><td>KERNEL</td><td>Core data structures</td></tr>
 <tr><td>MATH</td><td>Source code for math functions and classes</td></tr>
 <tr><td>METADATA</td><td>Source code for classes that capture metadata about a MS or HPLC-MS experiment</td></tr>
 <tr><td>SIMULATION</td><td>Source code of MS simulator</td></tr>
 <tr><td>SYSTEM</td><td>Source code for basic functionality (file system, stopwatch)</td></tr>
 <tr><td>TRANSFORMATIONS</td><td>Feature detection (MS1 label-free and isotopic labelling) and PeakPickers (centroiding algorithms)</td></tr>
</table>
<br>

Within the ANALYSIS folder, the algorithms are grouped into the following subfolders:

<br>
<table>
<caption>Directory structure of the algorithmic part of the library (/src/openms/include/OpenMS/ANALYSIS)</caption>
 <tr><th>Folder</th><th>Description</th></tr>
 <tr><td>DECHARGING</td><td>Algorithms for de-charging (charge analysis) for peptides and metabolites</td></tr>
 <tr><td>DENOVO</td><td>Algorithms for "de-novo" identification tools including CompNovo</td></tr>
 <tr><td>ID</td><td>Source code dealing with identifications including ID conflict resolvers, metabolite spectrum matching and target-decoy models</td></tr>
 <tr><td>MAPMATCHING</td><td>Algorithms for retention time correction and feature matching (matching between runs)</td></tr>
 <tr><td>MRM</td><td>Algorithms for MRM Fragment selection</td></tr>
 <tr><td>OPENSWATH</td><td>OpenSWATH algorithms for targeted, chromatogram-based analysis of MRM, SRM, PRM, DIA and SWATH-MS data</td></tr>
 <tr><td>PIP</td><td>Peak intensity predictor</td></tr>
 <tr><td>QUANTITATION</td><td>Algorithms for quantitative analysis including isobaric labelling</td></tr>
 <tr><td>RNPXL</td><td>Algorithms for RNA cross-linking</td></tr>
 <tr><td>SVM</td><td>Algorithms for SVM</td></tr>
 <tr><td>TARGETED</td><td>Algorithms for targeted proteomics (MRM, SRM)</td></tr>
 <tr><td>XLMS</td><td>Algorithms for Cross-link mass spectrometry</td></tr>
</table>
<br>


For the sake of completeness, here is a short list of the third-party tools (THIRDPARTY) that are integrated into the %OpenMS framework via wrappers (usually named with the suffix -Adapter, e.g. SiriusAdapter).

Wrapper to third-party tools:
    - Search Engines (MSGFPLUS, XTandem, OMSSA, Comet, MyriMatch)
    - Protein Inference (Fido)
    - Spectral Library Search (SpectraST)
    - Metabolite Identification (Sirius)
    - Score calibration and FDR calculation (Percolator)

@subsection tutorial_library_kernelclasses Kernel Classes

The %OpenMS kernel contains the data structures that store the actual MS data.

For storing the basic MS data (spectra, chromatograms, and full runs) %OpenMS uses
- Peaks (Peak1D and ChromatogramPeak) stored in
- MSSpectrum and MSChromatogram, which in turn can both be stored in an
- MSExperiment

For storing quantified peptides or analytes in single MS runs, %OpenMS uses so called feature maps.

The main data structures for quantitative information are
- Features (for quantitative information in MS1 maps)
- MRMFeatures (for quantitative information in XIC traces on MS1 and MS2 level)
    - which are both stored in a FeatureMap

To store quantified peptides or analytes over several MS runs, %OpenMS uses so called consensus maps.
- ConsensusFeatures are stored in a
- ConsensusMap

To store identified peptides %OpenMS has classes
- PeptideHit, which corresponds to a peptide-spectrum match (PSM) and is stored in a
- PeptideIdentification object (which is associated with a single spectrum)

<br>
<table>
<caption>Kernel classes and the entities they store</caption>
 <tr><th>Stored Entity</th><th>Class Name</th></tr>
 <tr><td>Mass Peak (m/z + intensity)</td><td>Peak1D</td></tr>
 <tr><td>Elution Peak (rt + intensity)</td><td>ChromatogramPeak</td></tr>
 <tr><td>Spectrum of Mass Peaks</td><td>MSSpectrum</td></tr>
 <tr><td>Chromatogram of Elution Peaks</td><td>MSChromatogram</td></tr>
 <tr><td>Mass trace for small molecule detection</td><td>MassTrace</td></tr>
 <tr><td>Full MS run, containing both spectra and chromatograms</td><td>MSExperiment (alias PeakMap)</td></tr>
 <tr><td>Feature (isotopic pattern of eluting analyte)</td><td>Feature</td></tr>
 <tr><td>All features detected in an MS Run</td><td>FeatureMap</td></tr>
 <tr><td>Linked / Grouped feature (e.g., same Peptide quantified in several MS runs) </td><td>ConsensusFeature</td></tr>
 <tr><td>All grouped ConsensusFeatures of a multi-run experiment</td><td>ConsensusMap</td></tr>
 <tr><td>Peptide Spectrum Match</td><td>PeptideHit</td></tr>
 <tr><td>Identified Spectrum with one or several PSMs</td><td>PeptideIdentification</td></tr>
 <tr><td>Identified Protein</td><td>ProteinHit</td></tr>
</table>
<br>

@subsection tutorial_library_kernelclasses_peaks Peaks
%OpenMS provides one-, two- and d-dimensional data points, either with or without metadata attached to them.

  @image html Kernel_DataPoints.png "Data structure for MS data points"
  @image latex Kernel_DataPoints.png "Data structure for MS data points" width=14cm

One-dimensional data points:
One-dimensional data points (Peak1D) are the most important ones and used throughout %OpenMS. The two-dimensional and d-dimensional data points are needed rarely and used for special purposes only. Peak1D provides getter and setter methods to store the mass-to-charge ratio and intensity. 
<br>
Two-dimensional data points:
The two-dimensional data points store mass-to-charge, retention time and intensity. The most prominent example we will later take a closer look at is the Feature class, which stores a two-dimensional position (m/z and RT) and intensity of the eluting peptide or analyte. 
<br>
The base class of the two-dimensional data points is Peak2D. It provides the same interface as Peak1D and additional getter and setter methods for the retention time. RichPeak2D is derived from Peak2D and adds an interface for metadata. The Feature is derived from RichPeak2D and adds information about the convex hull of the feature, quality and so on.
<br>
For information on d-dimensional data points see the appendix.
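
The inheritance chain described above (Peak2D, then RichPeak2D, then Feature) can be mirrored with three toy classes. The sketch below only illustrates the layering; the Demo-prefixed classes, their members, and the string-based metadata interface are assumptions made for this example and do not reproduce the actual %OpenMS class definitions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a two-dimensional data point: position (m/z, RT) plus intensity.
class DemoPeak2D
{
public:
  double getMZ() const { return mz_; }
  void setMZ(double mz) { mz_ = mz; }
  double getRT() const { return rt_; }
  void setRT(double rt) { rt_ = rt; }
  double getIntensity() const { return intensity_; }
  void setIntensity(double intensity) { intensity_ = intensity; }

private:
  double mz_ = 0.0;
  double rt_ = 0.0;
  double intensity_ = 0.0;
};

// Adds a simple key/value metadata interface on top of the data point.
class DemoRichPeak2D : public DemoPeak2D
{
public:
  void setMetaValue(const std::string& key, const std::string& value) { meta_[key] = value; }
  // Returns an empty string if the key is not present.
  std::string getMetaValue(const std::string& key) const
  {
    auto it = meta_.find(key);
    return it == meta_.end() ? std::string() : it->second;
  }

private:
  std::map<std::string, std::string> meta_;
};

// Adds feature-level information, here reduced to a single quality score.
class DemoFeature : public DemoRichPeak2D
{
public:
  double getOverallQuality() const { return quality_; }
  void setOverallQuality(double quality) { quality_ = quality; }

private:
  double quality_ = 0.0;
};
```

Because each layer derives publicly from the previous one, code written against the base data point (for example a function taking a const DemoPeak2D reference) also accepts the richer types without modification.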

@subsection tutorial_library_kernelclasses_spectra Spectra
The most important container for raw/profile data and centroided peaks is MSSpectrum. The elements of an MSSpectrum are peaks (Peak1D). This combination is so common that it has its own typedef, PeakSpectrum. MSSpectrum is derived from SpectrumSettings, a container for the metadata of a spectrum (e.g. precursor information). Only MS data handling is explained here; SpectrumSettings is described in the subsection on the metadata of a spectrum. In the following example program (Tutorial_MSSpectrum.cpp), an MSSpectrum is filled with peaks, sorted according to mass-to-charge ratio, and a selection of peak positions is displayed.
<br><br>
<b>Example: Tutorial_MSSpectrum.cpp</b>
<br>
In this example, we create an MS1 spectrum at 1 minute and insert peaks with descending mass-to-charge ratios (for educational reasons). We then sort the peaks by ascending mass-to-charge ratio. Finally, we print the positions of the peaks between 800 and 1000 Thomson. To print all peaks in the spectrum, we would simply use the STL-conforming methods begin() and end(). In addition to iterator access, we can also access the peaks directly via vector indices (e.g. spectrum[0] is the first Peak1D object of the MSSpectrum). 

\snippet Tutorial_MSSpectrum.cpp MSSpectrum
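
Independent of the %OpenMS classes, the underlying pattern (sort by m/z, then select a range by binary search) can be sketched with the standard library alone. DemoPeak, sortByMZ and mzInRange are hypothetical names used only here, and treating both range bounds as inclusive is an assumption of this sketch rather than a statement about the MSSpectrum interface.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy peak (hypothetical type, not Peak1D).
struct DemoPeak
{
  double mz;
  double intensity;
};

// Sorts the peaks by ascending m/z.
inline void sortByMZ(std::vector<DemoPeak>& spectrum)
{
  std::sort(spectrum.begin(), spectrum.end(),
            [](const DemoPeak& a, const DemoPeak& b) { return a.mz < b.mz; });
}

// Returns the m/z positions of all peaks with low <= m/z <= high.
// The spectrum must already be sorted by m/z.
inline std::vector<double> mzInRange(const std::vector<DemoPeak>& sorted, double low, double high)
{
  auto first = std::lower_bound(sorted.begin(), sorted.end(), low,
                                [](const DemoPeak& p, double value) { return p.mz < value; });
  auto last = std::upper_bound(first, sorted.end(), high,
                               [](double value, const DemoPeak& p) { return value < p.mz; });
  std::vector<double> positions;
  for (auto it = first; it != last; ++it)
  {
    positions.push_back(it->mz);
  }
  return positions;
}
```

Sorting first is what makes the range lookup logarithmic; on an unsorted spectrum the binary searches would silently return wrong results, which is why the example sorts before selecting.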

@subsection tutorial_library_kernelclasses_chrom  Chromatograms
The most important container for targeted analysis / XIC data is MSChromatogram. The elements of an MSChromatogram are chromatogram peaks (ChromatogramPeak). MSChromatogram is derived from ChromatogramSettings, a container for the metadata of a chromatogram (e.g. containing precursor and product information), similar to SpectrumSettings. In the following example program (Tutorial_MSChromatogram.cpp), an MSChromatogram is filled with chromatographic peaks, sorted according to retention time, and a selection of peak positions is displayed. 
<br><br>
<b>Example: Tutorial_MSChromatogram.cpp</b>
<br>
Fill MSChromatogram with chromatographic peaks, sorted according to retention time
\snippet Tutorial_MSChromatogram.cpp MSChromatogram

Since much of the functionality is shared between MSChromatogram and MSSpectrum, further examples can be gathered from the MSSpectrum subsection.

@subsection tutorial_library_kernelclasses_precursor  Precursor
The precursor data stored along with MS/MS spectra contains invaluable information for MS/MS analysis (e.g., m/z, charge, activation mode, collision energy). This information is stored in Precursor objects that can be retrieved from each spectrum. For a complete list of functions, please see the Precursor class documentation.
<br><br>
<b>Example: Tutorial_Precursor.cpp</b>
<br>
Retrieve precursor information

\snippet Tutorial_Precursor.cpp Precursor

@subsection tutorial_library_kernelclasses_mrm MRMTransitionGroup
The targeted analysis of SRM or DIA (SWATH-MS) type of data requires a set of targeted assays as well as raw data chromatograms. The MRMTransitionGroup class allows users to map these two types of information and store them together with identified features conveniently in a single object. 
<br><br>
<b>Example: Tutorial_MRMTransitionGroup.cpp</b>
<br>
Create an empty MRMTransitionGroup with two dummy transitions

\snippet Tutorial_MRMTransitionGroup.cpp MRMTransitionGroup

Note how the identifiers of the chromatograms and the assay information (ReactionMonitoringTransition) are matched so that downstream algorithms can utilize the meta-information stored in the assays for data analysis.
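
The identifier matching mentioned above can be illustrated with a small stand-alone sketch: assay information and chromatograms are stored under the same key, so that a lookup by identifier yields both. All types and member names below (DemoTransition, DemoChromatogram, DemoTransitionGroup) are hypothetical and much simpler than the MRMTransitionGroup and ReactionMonitoringTransition classes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy assay entry (hypothetical, not ReactionMonitoringTransition).
struct DemoTransition
{
  std::string native_id;
  double precursor_mz;
  double product_mz;
};

// Toy chromatogram carrying the identifier it was recorded under.
struct DemoChromatogram
{
  std::string native_id;
  std::vector<double> intensities;
};

// Keeps transitions and chromatograms side by side, keyed by identifier.
class DemoTransitionGroup
{
public:
  void addTransition(const DemoTransition& transition) { transitions_[transition.native_id] = transition; }
  void addChromatogram(const DemoChromatogram& chromatogram) { chromatograms_[chromatogram.native_id] = chromatogram; }

  // True if both assay information and raw data exist for this identifier.
  bool isMatched(const std::string& id) const
  {
    return transitions_.count(id) > 0 && chromatograms_.count(id) > 0;
  }

  // Identifiers of transitions for which no chromatogram was added.
  std::vector<std::string> unmatchedTransitions() const
  {
    std::vector<std::string> ids;
    for (const auto& entry : transitions_)
    {
      if (chromatograms_.count(entry.first) == 0) ids.push_back(entry.first);
    }
    return ids;
  }

private:
  std::map<std::string, DemoTransition> transitions_;
  std::map<std::string, DemoChromatogram> chromatograms_;
};
```

Checking for unmatched identifiers before running downstream scoring is a cheap way to catch a mismatch between the assay library and the extracted chromatograms.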

@subsection tutorial_library_kernelclasses_map   Maps
Although raw data maps, peak maps and feature maps are conceptually very similar they are stored in different data types. For raw data and peak maps, the default container is MSExperiment, which is an array of MSSpectrum instances. 
In contrast to raw data and peak maps, feature maps are not a collection of one-dimensional spectra, but an array of two-dimensional feature instances. The main data structure for feature maps is called FeatureMap.

Although MSExperiment and FeatureMap differ in the data they store, they also have things in common. Both store metadata that is valid for the whole map, i.e. sample description and instrument description. This data is stored in the common base class ExperimentalSettings.

@subsection tutorial_library_kernelclasses_mse  MSExperiment
MSExperiment contains ExperimentalSettings (metadata of the MS run) and a vector<MSSpectrum>. The one-dimensional spectrum MSSpectrum is derived from SpectrumSettings (metadata of a spectrum).
<br><br>
<b>Example: Tutorial_MSExperiment.cpp</b>
<br>
The following example creates a MSExperiment containing four MSSpectrum instances. We then iterate over RT range (2,3) and m/z range (603,802) and print the peak positions using an AreaIterator. Then we show how we iterate over all spectra and peaks. In the commented out part, we show how to load/store all spectra and associated metadata from/to an mzML file.

\snippet Tutorial_MSExperiment.cpp MSExperiment
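
What an area iteration does conceptually is visit every peak whose spectrum lies in a retention time window and whose position lies in an m/z window. The following sketch shows this with toy types; DemoPeak, DemoSpectrum and peaksInArea are hypothetical names, the bounds are treated as inclusive by assumption, and no claim is made about how the AreaIterator is implemented.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy peak and spectrum (hypothetical types, not the OpenMS kernel classes).
struct DemoPeak
{
  double mz;
  double intensity;
};

struct DemoSpectrum
{
  double rt;
  std::vector<DemoPeak> peaks;
};

// Collects (RT, m/z) pairs of all peaks inside the rectangular window
// rt_min <= RT <= rt_max and mz_min <= m/z <= mz_max.
inline std::vector<std::pair<double, double>> peaksInArea(const std::vector<DemoSpectrum>& run,
                                                          double rt_min, double rt_max,
                                                          double mz_min, double mz_max)
{
  std::vector<std::pair<double, double>> hits;
  for (const DemoSpectrum& spectrum : run)
  {
    if (spectrum.rt < rt_min || spectrum.rt > rt_max) continue;
    for (const DemoPeak& peak : spectrum.peaks)
    {
      if (peak.mz >= mz_min && peak.mz <= mz_max)
      {
        hits.push_back(std::make_pair(spectrum.rt, peak.mz));
      }
    }
  }
  return hits;
}
```

This linear scan is sufficient for a handful of spectra. With spectra sorted by RT and peaks sorted by m/z, both loops could be replaced by binary searches.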

@subsection tutorial_library_kernelclasses_fmap FeatureMap
FeatureMap, the container for features, is simply a vector<Feature>. Additionally, it is derived from ExperimentalSettings to store the meta information. All peak and feature containers (MSSpectrum, MSExperiment, FeatureMap) are also derived from RangeManager. This class facilitates the handling of MS data ranges: it calculates and stores both the position range and the intensity range of the container.
<br><br>
<b>Example: Tutorial_FeatureMap.cpp</b>
<br>
The following example creates a FeatureMap containing two Feature instances. Then we iterate over all features and output the retention time and m/z. We then show how to use the underlying range manager to retrieve the FeatureMap boundaries in RT, m/z, and intensity.

\snippet Tutorial_FeatureMap.cpp FeatureMap
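
The idea behind the range handling can be illustrated without the library. The following is a minimal sketch using only the C++ standard library; SimpleFeature, Range and computeRange are hypothetical names chosen for this illustration and are not part of the %OpenMS API.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a feature (not the OpenMS Feature class).
struct SimpleFeature
{
  double rt;
  double mz;
  double intensity;
};

// Minimum and maximum of a single coordinate.
struct Range
{
  double min;
  double max;
};

// Computes the range of one coordinate (selected by 'get') over all features.
// Precondition: 'features' must not be empty.
template <typename Getter>
Range computeRange(const std::vector<SimpleFeature>& features, Getter get)
{
  auto less = [&get](const SimpleFeature& a, const SimpleFeature& b)
  {
    return get(a) < get(b);
  };
  auto mm = std::minmax_element(features.begin(), features.end(), less);
  return Range{get(*mm.first), get(*mm.second)};
}
```

Calling computeRange with a getter that returns the rt, mz or intensity member yields the corresponding boundaries. In this sketch the range is recomputed on every call, whereas the text above describes RangeManager as also storing the ranges.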

@subsection tutorial_fileformat File Formats

<table>
 <tr><td>mzML</td><td>The HUPO-PSI standard format for mass spectrometry data</td></tr>
 <tr><td>mzIdentML</td><td>The HUPO-PSI standard format for identification results from any search engine</td></tr>
 <tr><td>mzTab</td><td>The HUPO-PSI standard format for reporting MS-based proteomics and metabolomics results</td></tr>
 <tr><td>TraML</td><td>The HUPO-PSI standard format for exchange and transmission of transition lists for selected reaction monitoring (SRM) experiments</td></tr>
 <tr><td>featureXML</td><td>The %OpenMS format for quantitation results</td></tr>
 <tr><td>consensusXML</td><td>The %OpenMS format for grouping features in one map or across several maps</td></tr>
 <tr><td>idXML</td><td>The %OpenMS format for identification results</td></tr>
 <tr><td>trafoXML</td><td>The %OpenMS format for storing of transformations</td></tr>
 <tr><td>OpenSWATH</td><td></td></tr>
</table>

For further information on the HUPO Proteomics Standards Initiative please visit:
http://www.psidev.info/

@subsection tutorial_logging Logging
To make direct output to std::cout and std::cerr more consistent, %OpenMS provides several low-level macros:
<br>
LOG_FATAL_ERROR,
<br>
LOG_ERROR,
<br>
LOG_WARN,
<br>
LOG_INFO and
<br>
LOG_DEBUG
<br>
which should be used instead of the less descriptive std::cout and std::cerr streams.

If you are writing an %OpenMS tool, you can also use the ProgressLogger to indicate how much of the processing has already been completed:
<br><br>
<b>Example: Tutorial_Logger.cpp</b>
<br>
Logging the Tool Progress

\snippet Tutorial_Logger.cpp Logger

Depending on how the user configures the tool, this output is written to the command line or a log file.

@subsection tutorial_identifications Identifications 
Identifications of proteins, peptides, and the mapping between peptides and proteins (or groups of proteins) are stored in dedicated data structures. These data structures are typically stored to disk as idXML or mzIdentML files.
The highest-level structure is ProteinIdentification. It stores all identified proteins of an identification run as ProteinHit objects plus additional metadata (search parameters, etc.).
Each ProteinHit contains the actual protein accession, an associated score, and (optionally) the protein sequence. A PeptideIdentification object stores the data corresponding to a single identified spectrum or feature. It has members for the retention time, m/z, and a vector of PeptideHits. Each PeptideHit stores the information of a specific peptide-to-spectrum match (e.g., the score and the peptide sequence). Each PeptideHit also contains a vector of PeptideEvidence objects, each of which stores the reference to one protein and the position of the peptide therein. If the peptide maps to multiple proteins, the vector contains one PeptideEvidence per protein.
<br><br>
<b>Example: Tutorial_IdentificationClasses.cpp</b>
<br>
Create all identification data needed to store an idXML file

\snippet Tutorial_IdentificationClasses.cpp Identification

@subsection tutorial_chemistry Chemistry
@subsection tutorial_element Element, ElementDB, EmpiricalFormula 
An Element object is the representation of an element. It can store the name, symbol, mass (average/mono), and natural abundances of isotopes. Elements are retrieved from the ElementDB singleton which is created from the file “/OpenMS/CHEMISTRY/Elements.xml”. The EmpiricalFormula object can be used to represent the empirical formula of a compound as well as to extract its natural isotope abundance and weight.
<br><br>
<b>Example: Tutorial_Element.cpp</b>
<br>
Work with Element object
\snippet Tutorial_Element.cpp Element
<br><br>
<b>Example: Tutorial_EmpiricalFormula.cpp</b>
<br>
Extract isotope distribution and monoisotopic weight of an EmpiricalFormula object

\snippet Tutorial_EmpiricalFormula.cpp EmpiricalFormula

@subsection tutorial_aaseq AASequence - Representing a Peptide 
An AASequence object stores a (potentially chemically modified) peptide.
It can conveniently be constructed from the amino acid sequence (e.g., a string or a string literal “DEFIANGR”). Modifications may be encoded using the UniMod name.
Once constructed, many convenient functions are available to calculate peptide or ion properties.
<br><br>
<b>Example: Tutorial_AASequence.cpp</b>
<br>
Compute and output basic AASequence properties

\snippet Tutorial_AASequence.cpp AASequence

Internally, an AASequence object is composed of Residues.

@subsection tutorial_residue Residue, ResidueDB
Residues are the building blocks of AASequence objects. They store physico-chemical properties of specific amino acids. ResidueDB stores that data and is initialized from the file “data/CHEMISTRY/residues.xml”.
<br><br>
<b>Example: Tutorial_Residue.cpp</b>
<br>
Compute and output basic Residue properties

\snippet Tutorial_Residue.cpp Residue

@subsection tutorial_residuemod ResidueModification, ModificationsDB
If a residue is modified (e.g., by phosphorylation), the modification is described by the ResidueModification class, which stores information about chemical modifications of residues.
Each ResidueModification has an ID, the residue that can be modified with this modification and the difference in mass between the unmodified and the modified residue, among other information. The Residue class allows setting one modification per residue, and the mass difference of the modification is accounted for in the mass of the residue.
The class ModificationsDB is a database of ResidueModifications. These are mostly initialized from the file “/share/CHEMISTRY/unimod.xml” containing a slightly modified version of the UniMod database of modifications. ModificationsDB has functions to search for modifications by name or mass.
<br><br>
<b>Example: Tutorial_ResidueModification.cpp</b>
<br>
Set a ResidueModification on a Residue

\snippet Tutorial_ResidueModification.cpp ResidueModification

@subsection tutorial_tsg TheoreticalSpectrumGenerator
The TheoreticalSpectrumGenerator generates ion ladders from AASequences.
<br><br>
<b>Example: Tutorial_TheoreticalSpectrumGenerator.cpp</b>
<br>
Generate theoretical spectra

\snippet Tutorial_TheoreticalSpectrumGenerator.cpp TSG

@subsection tutorial_dep DigestionEnzymeProtein, ProteaseDB and ProteaseDigestion
%OpenMS provides the most common digestion enzymes (DigestionEnzymeProtein) used in MS. They are stored in the ProteaseDB singleton and loaded from “/share/CHEMISTRY/Enzymes.xml”. 
<br><br>
<b>Example: Tutorial_Enzyme.cpp</b>
<br>
Digest amino acid sequence

\snippet Tutorial_Enzyme.cpp Enzyme
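
The cleavage rule of an enzyme can be illustrated independently of the library. The following sketch implements the commonly used trypsin rule (cleave after K or R unless the next residue is P) with the C++ standard library only. trypticDigest is a hypothetical helper for this illustration, not the ProteaseDigestion API, and missed cleavages are not considered.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a protein sequence after every K or R that is not followed by P.
std::vector<std::string> trypticDigest(const std::string& protein)
{
  std::vector<std::string> peptides;
  std::size_t start = 0;
  for (std::size_t i = 0; i < protein.size(); ++i)
  {
    const bool last = (i + 1 == protein.size());
    const bool cleavable = (protein[i] == 'K' || protein[i] == 'R');
    const bool blocked = (!last && protein[i + 1] == 'P');
    if (last || (cleavable && !blocked))
    {
      // emit the peptide ending at position i and start a new one
      peptides.push_back(protein.substr(start, i - start + 1));
      start = i + 1;
    }
  }
  return peptides;
}
```

For the input "DEFIANGRMKPLK" the sketch returns the two peptides "DEFIANGR" and "MKPLK": the K before the P is not a cleavage site.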

@section tutorial_tooldev Tool development
@subsection tutorial_topp TOPP-Tool
TOPP (The %OpenMS Pipeline) tools are small command line applications built using the %OpenMS library. They act as building blocks for complex analysis workflows and perform tasks ranging from simple signal processing such as filtering to more complex tasks such as protein inference and quantitation over several MS runs. Common to all TOPP tools is a command line interface allowing automatic integration into workflow engines like KNIME. They are the preferred way to integrate novel methods as applications into %OpenMS.
When we first create a novel TOPP tool it is considered unstable. To set it apart from the stable and well tested tools, it is first created as a TOPP util (note: the name “util” has historic reasons and may be changed to unstable tools in the future).
Once it is well tested, it will be promoted to a stable tool in a future %OpenMS version.

Imagine that you want to create a new tool that allows filtering of sequence databases. You would usually first check whether this or similar functionality has already been implemented in any of the >150 TOPP tools. If you are unsure which one to use, just ask on the mailing list, the gitter chat or contact one of the developers directly. The following subsections demonstrate how the original “DatabaseFilter” tool was created from scratch and integrated into %OpenMS. Basically any tool you want to integrate needs to follow the steps outlined below.

But let’s first get started by defining what our tool should actually do:
The DatabaseFilter tool should provide functionality to reduce a fasta database by filtering its entries based on different criteria. A simple criterion could be the length of a protein. To make the task a bit more interesting and to show other parts of the %OpenMS library, we will start with a slightly more complex filtering step that keeps all entries from the fasta database that have been identified in a peptide search (e.g., using X!Tandem, Mascot or MSGF+). This functionality might come in handy if large databases need to be reduced to a manageable size. In addition, we want the user to be able to choose between keeping and removing entries with matching protein IDs.

@subsection tutorial_create Create and register a minimal tool in OpenMS
- Create an empty file src/utils/DatabaseFilter.cpp
- Add the scaffold code for a minimal TOPP tool. Text in bold will later be adapted to our DatabaseFilter tool. 

<b>Example: Tutorial_Template.cpp</b>
<br>
Template for %OpenMS tool development

\snippet Tutorial_Template.cpp Template

- Now add a line with %DatabaseFilter.cpp to src/utils/executables.cmake. This registers the novel tool in the %OpenMS build system. 
- Then add the tool to getUtilList() in src/openms/source/APPLICATIONS/ToolHandler.cpp 
This creates a manual (doxygen) page with the --help output of the tool (using TOPPDocumenter). This page must be included at the end of the doxygen documentation of your tool (see other tools for an example).
- Add yourself as Maintainer/Author
- Write the basic documentation (doxygen docu). You will probably need to refine it later, but you can already insert the correct tool name, etc.

@subsection tutorial_param_def Define tool parameters
Each TOPP tool defines a set of parameters that will be available from the command line, KNIME, and other workflow systems. This is done in the void registerOptionsAndFlags_() method. In our case we want to read a protein database (fasta format), a file containing identification data (idXML format), and an option to switch between keeping (whitelisting) and removing (blacklisting) entries based on the filter result. This is our input. The reduced database forms the output and should be written to a protein database in fasta format. This is easily done by adding the following lines to registerOptionsAndFlags_():
<br><br>
<b>Example: Tutorial_Final.cpp</b>
<br>
Registration of tool parameters

\snippet Tutorial_Final.cpp  Register

Functions, classes and references can be checked in the %OpenMS / TOPP documentation (ftp://ftp.mi.fu-berlin.de/pub/OpenMS/release-documentation/html/index.html)

@subsection tutorial_param_read Read tool parameters
After a tool is executed, the registered parameters are available in the main_ function of the TOPP tool and can be read using the getStringOption_ method. Special methods for integers, lists and floating point parameters exist and are described in the TOPPBase documentation, but they are not needed for this example.
<br><br>
<b>Example: Tutorial_Final.cpp</b>
<br>

\snippet Tutorial_Final.cpp InputParam

@subsection tutorial_read Read Input Files
First the different file formats and data structures for peptide identifications have to be included at the top of the file.
<br><br>
<b>Example: Tutorial_Final.cpp</b>
<br>
Add essential includes

\snippet Tutorial_Final.cpp Includes

<b> Read the input files </b>
<br>

\snippet Tutorial_Final.cpp InputRead

Note: both peptide_identifications and protein_identifications contain protein accessions. The difference between them is that protein_identifications only contains the inferred set of protein accessions while peptide_identifications contains all protein accessions the peptides map to. We consider only the larger set of protein accessions stored in the peptide identifications. In principle, it would be easy to add another parameter that adds a filter for the inferred accessions stored in protein_identifications.

@subsection tutorial_add Add the tool functionality
First, the accessions are extracted from the idXML file. Here knowledge of the data structure is needed to extract the protein accessions. The class PeptideIdentification stores general information about a single identified spectrum (e.g., retention time, precursor mass-to-charge). A vector of PeptideHits is stored in each PeptideIdentification object and represents the potentially multiple PSMs of a single spectrum. They can be returned by calling .getHits(). Each peptide sequence stored in a PeptideHit may map to one or multiple proteins. This peptide to protein mapping information is stored in a vector of PeptideEvidence accessible by .getPeptideEvidences(). From each of these evidences we can extract the protein accession with .getProteinAccession().

To store all protein accessions in the set id_accessions, we write:
<br><br>
<b>Example: Tutorial_Final.cpp</b>
<br>
Store protein accessions

\snippet Tutorial_Final.cpp Functionality_1
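
The nesting of the three levels (identification, hit, evidence) can also be seen in isolation. The sketch below mirrors the traversal with simplified stand-in structs and the C++ standard library. SimpleIdentification, SimpleHit, SimpleEvidence and collectAccessions are hypothetical names for this illustration; the real classes are accessed through the getter methods named above.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the identification classes.
struct SimpleEvidence
{
  std::string protein_accession;
};

struct SimpleHit
{
  std::vector<SimpleEvidence> evidences;
};

struct SimpleIdentification
{
  std::vector<SimpleHit> hits;
};

// Collects every protein accession referenced by any hit of any identification.
// A std::set is used so that accessions occurring several times are stored once.
std::set<std::string> collectAccessions(const std::vector<SimpleIdentification>& ids)
{
  std::set<std::string> accessions;
  for (const SimpleIdentification& id : ids)
  {
    for (const SimpleHit& hit : id.hits)
    {
      for (const SimpleEvidence& evidence : hit.evidences)
      {
        accessions.insert(evidence.protein_accession);
      }
    }
  }
  return accessions;
}
```

Using a set rather than a vector keeps the later lookup per fasta entry cheap and removes duplicates from peptides that map to the same protein.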

Now that we have assembled the set of all protein accessions, we are ready to compare them to the fasta_accessions. A fasta entry is copied to the new fasta database if its accession is contained in the set and the whitelist method was chosen, or if it is not contained in the set and the blacklist method was chosen.
<br><br>
<b>Example: Tutorial_Final.cpp</b>
<br>
Add method functionality
\snippet Tutorial_Final.cpp Functionality_2
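
The whitelist/blacklist decision reduces to a single comparison per entry. The following sketch shows this logic with the C++ standard library only; SimpleFastaEntry and filterEntries are hypothetical names for this illustration and do not reproduce the tool's actual code.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a fasta entry.
struct SimpleFastaEntry
{
  std::string identifier;
  std::string sequence;
};

// Keeps entries whose identifier is in 'id_accessions' (whitelist == true)
// or entries whose identifier is not in 'id_accessions' (whitelist == false).
std::vector<SimpleFastaEntry> filterEntries(const std::vector<SimpleFastaEntry>& db,
                                            const std::set<std::string>& id_accessions,
                                            bool whitelist)
{
  std::vector<SimpleFastaEntry> result;
  for (const SimpleFastaEntry& entry : db)
  {
    const bool found = id_accessions.count(entry.identifier) > 0;
    if (found == whitelist)
    {
      result.push_back(entry);
    }
  }
  return result;
}
```

Because both modes share one loop, every entry of the input ends up in exactly one of the two possible outputs.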

@subsection tutorial_write Write Output Files 
<b>Example: Tutorial_Final.cpp</b>
<br>
Write the output

\snippet Tutorial_Final.cpp output

@subsection tutorial_test Adding TOPP tests
Testing your tools is essential and required to promote your experimental util to an official TOPP tool. Providing a test for a util is not mandatory but appreciated.
For this test a .fasta and a compatible .idXML file have to be added to /src/tests/topp/. Furthermore, the test procedure has to be added to CMakeLists.txt in the same folder.
<br><br>
<b>Example: Tutorial_Test.cpp</b>
<br>
Add tests

\snippet Tutorial_Test.cpp Test

These tests run the program with the given parameters and then call a diff tool to compare the generated output to the expected output.

@subsection tutorial_doc Finish documentation
We add the tool to the UTILS docu page (in doc/doxygen/public/UTILS.doxygen).
Later (when we have a working application) we will write an application test. This is optional but recommended for Utils; for Tools it is mandatory. See TOPP tools above and add the test to the bottom of src/tests/topp/CMakeLists.txt.

@subsection tutorial_polish Polish your code 
This is how a util should look after code polishing:
Here, the support for different formats was extended (idXML and mzIdentML).
Since different filter criteria may be introduced in the future, the structure was slightly changed by moving the filtering by ID into its own function (filterByProteinIDs_). This allows more flexibility when adding new functionality later on.
<br><br>
<b>Example: Tutorial_Final.cpp</b>
<br>
Polish your code - add additional functionality

\snippet Tutorial_Final.cpp final 

@subsection tutorial_pull Open a pull request 
Afterwards you can commit your changes to a new branch “feature/DatabaseFilter” of your %OpenMS clone on GitHub and submit a pull request on your GitHub page. After a short review process by the %OpenMS Team, the tool will be added to the %OpenMS Library.

@section tutorial_appendix Appendix 
@subsection tutorial_d_dim D-dimensional data points 
The d-dimensional data points are needed in special cases only, e.g. in template classes that operate in any number of dimensions. The base class of the d-dimensional data points is DPeak. The methods to access the position are getPosition and setPosition. Note that the one-dimensional and two-dimensional data points also have the methods getPosition and setPosition. They are needed in order to be able to write algorithms that can operate on all data point types. It is, however, recommended not to use these members unless you really write such a generic algorithm.
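
As an illustration of such a generic algorithm, the sketch below computes a Euclidean distance for any data point type that offers getPosition. It uses only the C++ standard library; SimplePeak and euclideanDistance are hypothetical names for this illustration and do not reproduce the DPeak interface beyond the getPosition/setPosition accessors mentioned above.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical D-dimensional data point with position accessors.
template <std::size_t D>
class SimplePeak
{
public:
  using PositionType = std::array<double, D>;

  const PositionType& getPosition() const { return position_; }
  void setPosition(const PositionType& position) { position_ = position; }

private:
  PositionType position_{};
};

// Generic algorithm: works for every point type that provides getPosition()
// returning an indexable container with size().
template <typename PointType>
double euclideanDistance(const PointType& a, const PointType& b)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < a.getPosition().size(); ++i)
  {
    const double diff = a.getPosition()[i] - b.getPosition()[i];
    sum += diff * diff;
  }
  return std::sqrt(sum);
}
```

The same function template serves one-, two- and higher-dimensional points, which is the situation in which the position accessors are intended to be used.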

@subsection tutorial_ext_project OpenMS as external project
If %OpenMS TOPP_tools and UTILS_tools are not sufficient for a certain scenario, you can either request changes to %OpenMS or modify/extend your own fork of %OpenMS. A third alternative is using %OpenMS as a dependency while not touching %OpenMS itself. Once you've finished your new tool, and it runs on the development machine, you're done. If you want to develop with %OpenMS as an external project, have a look at the example code ( /share/%OpenMS/examples/external_code/).

*/