File: molform.Rmd

package info (click to toggle)
r-cran-rcdk 3.8.1%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 888 kB
  • sloc: makefile: 14; sh: 13
file content (139 lines) | stat: -rwxr-xr-x 5,975 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
title: "Handling Molecular Formulae"
author: "Rajarshi Guha"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true	
vignette: >
  %\VignetteIndexEntry{Handling Molecular Formulae}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Handling Molecular Formulae

## Introduction

The molecular formula is the simplest way to characterize a molecular compound.
It specifies the actual number of atoms of each element contained
in the molecule. A molecular formula is represented by the chemical symbol
of each constituent element. If a molecule contains more than one atom for
a particular element, the quantity is shown as subscript after the chemical
symbol. Otherwise, the number of neutrons (atomic mass) that an atom is
composed can differ. This different type of atoms are known as isotopes. The
number of nuclei is represented as superscripted prefix previous to the chemical
element. Generally it is not added when the isotope that characterizes
the element is the most occurrence in nature, e.g., $C_4H_{11}O^2D$.

## Parsing a Molecule To a Molecular Formula

Front a molecule, defined as conjunct of atoms helding together by chemical
bonds, we can simplify it taking only the information about the atoms. The `rcdk`
package provides a parser to translate molecules to molecular formlulas, the
`get.mol2formula` function.

```{r}
library(rcdk)
sp <- get.smiles.parser()
molecule <- parse.smiles('N')[[1]]
convert.implicit.to.explicit(molecule)
formula <- get.mol2formula(molecule,charge=0)
```

Note that the above formula object is a `CDKFormula-class`. This class
contains some attributes that defines a molecular formula. For example,
the mass, the charge, the isotopes, the character representation of the
molecular formula and the [IMolecularFormula](https://cdk.github.io/cdk/2.3/docs/api/org/openscience/cdk/interfaces/IMolecularFormula.html) `jobjRef` object.

The molecular mass, charge and string representation for this formula are given by
```{r}
formula@mass
formula@charge
formula@string
```
The isotopes for this formula. It is formed from  three attributes. `isoto` (the
symbol expression of the isotope), `number` (number of atoms for this isotope)
and `mass` (exact mass of this isotope).
```{r}
formula@isotopes
```
Depending of the circumstances, you may want to change the charge of the
molecular formula.
```{r}
formula <- set.charge.formula(formula, charge=1)
```

## Initializing a Formula from the Symbol Expression

Other way to create a `cdkFormula` is from the symbol expression. Thus,
setting the characters of the elemental formula, the function `get.formula`
parses it to an object of `cdkFormula-class`.
```{r}
formula <- get.formula('NH4', charge = 1)
formula
```

## Generating Molecular Formula

Mass spectrometry is an essential and reliable technique to determine the
molecular mass of compounds. Conversely, one can use the measured mass
to identify the compound via its elemental formula. One of the limitations
of the method is the precision and accuracy of the instrumentation. As
a result, rather than specify exact masses, we specify tolerances or ranges
of possible mass, resulting in multiple candidate formulae for a given mass
window. The `generate.formula` function returns a list of formulae which have
a given mass (within an error window)
```{r warn=FALSE}
mfSet <- generate.formula(18.03383, window=1,
    elements=list(c("C",0,50),c("H",0,50),c("N",0,50)),
    validation=FALSE)
mfSet
```
It is important to know if an elemental formula is valid. The method `isvalid.formula`
provides this function. Two constraints can be applied, the
[nitrogen rule](https://en.wikipedia.org/wiki/Nitrogen_rule) and the  [RDBE rule](https://fiehnlab.ucdavis.edu/projects/seven-golden-rules/ring-double-bonds) (Ring Double Bond Equivalent).
```{r}
formula <- get.formula('NH4', charge = 0)
isvalid.formula(formula,rule=c("nitrogen","RDBE"))
```
We can observe that the ammonium is only valid if it is defined with charge
of +1.
```{r}
formula <- get.formula('NH4', charge = 1)
isvalid.formula(formula,rule=c("nitrogen","RDBE"))
```
The `generate.formula` method can perform these validation tests during formula generation by setting `validation=TRUE`. However, this can significantly slow down the process of genersating formulae, especially with larger mass windows. When the default of `FALSE` is used, nonsensical formulae may be generated, and it's up to the user to filter them out.

In contrast, the `generate.formula.iter` method employs the Round Robin formula generation algorithm, which is significantly faster than the default method. In contrast to `generate/formula`, this method returns an iterator which allows you to process large formula sets without excessive memory usage.
```{r warn=FALSE}
mit <- generate.formula.iter(100, charge=0, window=0.1,
                             elements=list(c("C",0,50), c("H",0,50), c("N",0,50)))
hit <- itertools::ihasNext(mit)
while (itertools::hasNext(hit)) 
    print(iterators::nextElem(hit))
```
Note that by default, this function will not check for validity of the generated formulae.

## Calculating Isotope Pattern


Due to the measurement errors in medium resolution spectrometry, a given
error window can result in a massive number of candidate formulae. The
isotope pattern of ions obtained experimentally can be compared with the
theoretical ones. The best match is reflected as the most probable elemental
formula. `rcdk` provides the function `get.isotope.pattern` which predicts
the theoretical isotope pattern given a formula.

```{r}
formula <- get.formula('CHCl3', charge = 0)
isotopes <- get.isotopes.pattern(formula,minAbund=0.1)
isotopes
```

In this example we generate a formula for a possible compound with a charge
($z \approx 0$) containing the standard elements C, H, and Cl. The isotope
pattern can be visually inspected, as shown below.
```{r}
plot(isotopes, type="h", xlab="m/z", ylab="Intensity")
```