File: using-rcdk.Rmd

package info (click to toggle)
r-cran-rcdk 3.8.1%2Bdfsg-1
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 888 kB
sloc: makefile: 14; sh: 13
file content (455 lines) | stat: -rwxr-xr-x 22,049 bytes
parent folder | download | duplicates (4)
---
title: "Using the CDK from R"
author: "Rajarshi Guha"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Using the CDK from R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Using the CDK from R

## Introduction

Given that much of cheminformatics involves mathematical and statistical
modeling of chemical information, R is a natural platform for such work.
There are many cheminformatics applications that will generate useful information
such as descriptors, fingerprints and so on. While one can always
run these applications to generate data that is then imported into R, it can
be convenient to be able to manipulate chemical structures and generate
chemical information with the R environment.

The [CDK](https://cdk.github.io) is a Java library for cheminformatics that supports a wide variety
of cheminformatics functionality ranging from reading molecular file
formats, performing ring perception and aromaticity detection to fingerprint
generation and molecular descriptors. The CDK website provides links to useful documentation as well as complete [Javadocs](https://cdk.github.io/cdk/2.3/docs/api/index.html?overview-summary.html)

## Getting started

The goal of the `rcdk` package is to allow an R user to access the cheminformatics
functionality of the [CDK](https://cdk.github.io) from within R. While one can use the
[rJava](https://CRAN.R-project.org/package=rJava) package to make direct calls to specific methods in the [CDK](https://cdk.github.io), from R,
such usage does not usually follow common R idioms. Thus `rcdk` aims to 
 allow users to use the [CDK](https://cdk.github.io) classes and methods in an R-like fashion.

The library is loaded as follows
```{r}
library(rcdk)
```
We can also check the version of the [CDK](https://cdk.github.io) that is being used in the package
```{r}
cdk.version()
```
The package also provides an example data set, called `bpdata` which contains
277 molecules, in SMILES format and their associated boiling points (BP)
in Kelvin. The `data.frame` has two columns, viz., the SMILES and the BP.
Molecules names are used as row names:
```{r echo=FALSE}
data(bpdata)
str(bpdata)
```
## Input and Output

Chemical structures come in a variety of formats and the CDK supports
many of them. Many such formats are disk based and these files can be
parsed and loaded by specifying their full paths

```{r, eval=FALSE}
mols <- load.molecules( c('data1.sdf', '/some/path/data2.sdf') )
```

Note that the above function will load any file format that is supported by
the CDK, so there’s no need to specify formats. In addition one can specify a
URL (which should start with `http://`) to specify remote files as well. The
result of this function is a list of molecule objects. The molecule objects
are of class `jobjRef` (provided by the [rJava](https://CRAN.R-project.org/package=rJava) package). As a result,they
are pretty opaque to the user and are really meant to be processed using
methods from the `rcdk` or [rJava](https://CRAN.R-project.org/package=rJava) packages.

However, since it loads all the molecules from the specified file into a list,
large files can lead to out of memory errors. In such a situtation it is preferable
to iterate over the file, one structure at a time. Currently this behavior
is supported for SDF and SMILES files. An example of such a usage for a large
SD file would be
```{r eval=FALSE}
iter <- iload.molecules('verybig.sdf', type='sdf')
while(hasNext(iter)) {
 mol <- nextElem(iter)
 print(get.property(mol, "cdk:Title"))
}
```

### Parsing SMILES

Another common way to obtain molecule objects is by parsing SMILES
strings. The simplest way to do this is
```{r eval=FALSE}
smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
mol <- parse.smiles(smile)[[1]]
```
Usage is more efficient when multiple SMILE are supplied, since then a single
SMILES parser object is used to parse all the supplied SMILES.

If you plan on parsing a large number of SMILES, you may run into memory
issues, due to the large size of `IAtomContainer` objects. In such a case, it
can be useful to call the Java and R garbage collectors explicitly at the
appropriate time. In addition it can be useful to explicitly allocate a large
amount of memory for the JVM. For example,
```{r eval=FALSE}
options("java.parameters"=c("-Xmx4000m"))
library(rcdk)
for (smile in smiles) {
    m <- parse.smiles(smile)
    ## perform operations on this molecule
    
    jcall("java/lang/System","V","gc")
    gc()
}
```
Given a list of molecule objects, it is possible to serialize them to a file in
some specified format. Currently, the only output formats are SMILES or
SDF. To write molecules to a disk file in SDF format.
```{r eval=FALSE}
write.molecules(mols, filename='mymols.sdf')
```
By default, if mols is a list of multiple molecules, all of them will be written
to a single SDF file. If this is not desired, you can write each on to individual
files (which are prefixed by the value of filename):

```{r eval=FALSE}
 write.molecules(mols, filename='mymols.sdf', together=FALSE)
```
 
### Generating SMILES
 
 Finally, we can generate a SMILES representation of a molecule using
 
```{r}
smiles <- c('CCC', 'c1ccccc1', 'CCCC(C)(C)CC(=O)NC')
mols <- parse.smiles(smiles)
get.smiles(mols[[1]])
unlist(lapply(mols, get.smiles))
```

 
 The [CDK](https://cdk.github.io) supports a number of _flavors_ when generating SMILES. For example, you can generate a SMILES with or without chirality information or generate SMILES in [Kekule form](https://www.chemguide.co.uk/basicorg/bonding/benzene1.html). The `smiles.flavors` generates an object that represents the various flavors desired for SMILES output. See the [SmiFlavor javadocs](https://cdk.github.io/cdk/2.3/docs/api/index.html?org/openscience/cdk/smiles/SmiFlavor.html) for the full list of possible flavors. Example usage is
 
```{r}
smiles <- c('CCC', 'c1ccccc1', 'CCc1ccccc1CC(C)(C)CC(=O)NC')
mols <- parse.smiles(smiles)
get.smiles(mols[[3]], smiles.flavors(c('UseAromaticSymbols')))
get.smiles(mols[[3]], smiles.flavors(c('Generic','CxSmiles')))
```

Using the [CxSmiles](http://butane.chem.uiuc.edu/jsmoore/marvin/help/formats/cxsmiles-doc.html) flavors allows the user to encode a variety of information in the SMILES string, such as 2D or 3D coordinates.

```{r}
m <- parse.smiles('CCC')[[1]]
m <- generate.2d.coordinates(m)
m
get.smiles(m, smiles.flavors(c('CxSmiles')))
get.smiles(m, smiles.flavors(c('CxCoordinates')))
```

## Visualization

The `rcdk` package supports 2D rendering of chemical structures. This can be
used to view the structure of individual molecules or multiple molecules in
a tabular format. It is also possible to view a molecular-data table, where
one of the columns is the 2D image and the remainder can contain data
associated with the molecules.

Due to Java event handling issues on OS X, depictions are handled using an external helper, which means that 
depiction generation can be slower on OS X compared to other platforms. 

Molecule visualization is performed using the `view.molecule.2d` function. This handles individual molecules as well as a list of molecules. In the latter case, the depictions are arranged in a grid (with 4 columns by default).

```{r eval=FALSE}
smiles <- c('CCC', 'CCN', 'CCN(C)(C)',
            'c1ccccc1Cc1ccccc1',
            'C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
view.molecule.2d(mols[[1]])
view.molecule.2d(mols)
```

The CDK depiction routines allow for extensive customization. These customizations can be accessed by creating a depictor object using `get.depictor`, which allows you to specify the size of the depiction, the depiction style (black and white, color on white, etc.), atom annotations (e.g., atom index), whether functional group abbreviations should be used or not and so on.

```{r eval=FALSE}
depictor <- get.depictor(style='cob', abbr='reagents', width=300, height=300)
view.molecule.2d(mols[[5]], depictor=depictor)
```
Once you have a depictor object, you can set individual properties using the `$` notation. This can be useful if you plan to generate a lot of depictions so that a new depictor is not recreated for each new structure.
```{r eval=FALSE}
depictor <- get.depictor(style='cob', abbr='reagents', width=300, height=300)
view.molecule.2d(mols[[5]], depictor=depictor)
#depictor$setStyle('cow')
#view.molecule.2d(mols[[5]], depictor=depictor)
```

The method also allows you to highlight substructures using [SMARTS](https://en.wikipedia.org/wiki/Smiles_arbitrary_target_specification). This is useful in highlight commen substructures in a set of molecules
```{r eval=FALSE}
depictor <- get.depictor(style='cob', abbr='reagents', sma='N(C)(C)')
view.molecule.2d(mols, depictor=depictor)
```

In many cases, it is useful to view a “molecular spreadsheet”, which is a table
of molecular structures along with information (numeric or textual) related to the molecules being viewed. The data is arranged in a spreadsheet like manner, with one of the columns
being molecules and the remainder being textual or numeric information.

This can be achieved using the `view.table` method which takes a list of
molecule objects and a `data.frame` containing the associated data. As expected,
the number of rows in the `data.frame` should equal the length of
the molecule list. Note that currently, there is not explicit binding between the rows of the `data.frame` and the elements of the list containing the molecules. Thus the user should take care that the ordering of the `data.frame` matches that of the list.
```{r eval=FALSE}
smiles <- c('CCC', 'CCN', 'CCN(C)(C)','c1ccccc1Cc1ccccc1')
mols <- parse.smiles(smiles)
dframe <- data.frame(x = runif(4),
  toxicity = factor(c('Toxic', 'Toxic', 'Nontoxic', 'Nontoxic')),
  solubility = c('yes', 'yes', 'no', 'yes'))
view.table(mols, dframe)
```

While the `view.molecule.2d` function is useful to visualize structures, the depictions can't be included in other visualizations such as plots. For such use cases, the `view.image.2d` function produces a raster image that can be included in plots. This function handles one molecule at a time.
```{r}
img <- view.image.2d(parse.smiles("B([C@H](CC(C)C)NC(=O)[C@H](CC1=CC=CC=C1)NC(=O)C2=NC=CN=C2)(O)O")[[1]])
plot(1:10, 1:10, pch=19)
rasterImage(img, 1,6, 5,10)
```

Finally, the `copy.image.to.clipboard` function allows you to copy a depiction to the system clipboard, from where it can be pasted into other applications. This can be more convenient than saving a raster image.

## Manipulating Molecules

In general, given a `jobjRef` for a molecule object one can access all the
class and methods of the [CDK](https://cdk.github.io) library via rJava. However this can be
cumbersome. The `rcdk` package exposes methods and
classes that manipulate molecules.

### Adding Information to Molecules

In many scenarios it’s useful to associate information with molecules. Within
R, you could always create a `data.frame` and store the molecule objects
along with relevant information in it. However, when serializing the molecules,
you want to be able to store the associated information with the structure itself (though keep in mind that only certain chemical file formats support metadata along with the structure).

Using the [CDK](https://cdk.github.io) it’s possible to directly add information to a molecule object
using properties. Note that adding such properties uses a key-value
paradigm, where the key should be of class character. The value can be
of class `integer`, `double`, `character` or `jobjRef`. Obviously, after setting
a property, you can get a property by its key.

```{r}
mol <- parse.smiles('c1ccccc1')[[1]]
set.property(mol, "title", "Molecule 1")
set.property(mol, "hvyAtomCount", 6)
get.property(mol, "title")
```
It is also possible to get all available properties at once in the from of a list.
The property names are used as the list names.
```{r}
get.properties(mol)
```
After adding such properties to the molecule, you can write it out to an SD
file, so that the property values become SD tags.
```{r eval=FALSE}
write.molecules(mol, 'tagged.sdf', write.props=TRUE)
```

### Atoms and Bonds

Probably the most important thing to do is to get the atoms and bonds of
a molecule. The code below gets the atoms and bonds as lists of `jobjRef`
objects, which can be manipulated using rJava or via other methods of this
package.
```{r}
mol <- parse.smiles('c1ccccc1C(Cl)(Br)c1ccccc1')[[1]]
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
cat('No. of atoms =', length(atoms), '\n')
cat('No. of bonds =', length(bonds), '\n')
```
Given an atom the `rcdk` package does not offer a lot of methods
to operate on it. One must access the [CDK](https://cdk.github.io) directly.
In the future more
manipulators will be added. Right now, you can get the symbol for each
atom

```{r}
unlist(lapply(atoms, get.symbol))
```
It’s also possible to get the 3D (or 2D coordinates) for an atom.
```{r}
coords <- get.point3d(atoms[[1]])
```
Given this, it’s quite easy to get the 3D coordinate matrix for a molecule
```{r}
coords <- do.call('rbind', lapply(atoms, get.point3d))
```
Once you have the coordinate matrix, a quick way to check whether the
molecule is flat is to do
```{r}
if ( any(apply(coords, 2, function(x) length(unique(x))) == 1) ) {
    print("molecule is flat")
}
```
This is quite a simplistic check that just looks at whether the X, Y or
Z coordinates are constant. To be more rigorous one could evaluate the
moments of inertia about the axes.

### Substructure matching

The [CDK](https://cdk.github.io) library supports substructure searches using [SMARTS](https://en.wikipedia.org/wiki/Smiles_arbitrary_target_specification) (or SMILES)
patterns. The implementation allows one to check whether a target molecule
contains a substructure or not as well as to retrieve the atoms and bonds
of the target molecule that match the query substructure. At this point,
the `rcdk` only support the former operation - given a query pattern, does it
occur or not in a list of target molecules. The `matches` method of this package
returns a logical vector where the $i$'th element is `TRUE` if the $i$'th target molecules contains the query substructure. An example of its
usage would be to identify molecules that contain a carbon atom that has
exactly two bonded neighbors.
```{r}
mols <- parse.smiles(c('CC(C)(C)C','c1ccc(Cl)cc1C(=O)O', 'CCC(N)(N)CC'))
query <- '[#6D2]'
matches(query, mols)
```

## Molecular Descriptors

A key requirement for the predictive modeling of molecular properties and activities are [molecular descriptors](https://en.wikipedia.org/wiki/Molecular_descriptor) - numerical characterizations of the molecular structure.  The [CDK](https://cdk.github.io/) implements a variety of molecular descriptors, categorized into topological, constitutional, geometric, electronic and hybrid. It is possible to evaluate all available descriptors at one go, or evaluate individual descriptors.

First, we can take a look at the available descriptor categories.
```{r}
dc <- get.desc.categories()
dc
```
Given the categories we can get the names of the descriptors for a single
category. Of course, you can always provide the category name directly.
```{r}
dn <- get.desc.names(dc[4])
dn
```
Each descriptor name is actually a fully qualified Java class name for the
corresponding descriptor. These names can be supplied to `eval.desc` to
evaluate a single or multiple descriptors for one or more molecules.
```{r eval=FALSE}
aDesc <- eval.desc(mol, dn[4])
allDescs <- eval.desc(mol, dn)
```

The return value of eval.desc is a `data.frame` with the descriptors in the
columns and the molecules in the rows. For the above example we get a single
row. But given a list of molecules, we can easily get a descriptor matrix.

For example, let's build a linear regression model to predict boiling points
for the BP dataset. First we need a set of descriptors and so we evaluate
all available descriptors. Also note that since a descriptor might belong to
more than one category, we should obtain a unique set of descriptor names
```{r}
descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names)))
```
For the current discussion we focus on a few, manually selected descriptors
that we know will be related to boiling point.
```{r}
data(bpdata)
mols <- parse.smiles(bpdata[,1])
descNames <- c(
 'org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.APolDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor')
descs <- eval.desc(mols, descNames)
class(descs)
dim(descs)
```
When a descriptor value cannot be computed, it's value is set to `NA`. This may happen if a descriptor requires 3D coordinates, but only 2D coordinates are available. In this case, we have manually selected descriptors such that there will be no undefined values.

Given the ubiquity of certain descriptors, some of them are directly available
via their own functions. Specifically, one can calculate [TPSA](https://pubmed.ncbi.nlm.nih.gov/19149561/) (topological
polar surface area), [AlogP](https://pubs.acs.org/doi/abs/10.1021/jp980230o) and [XlogP](http://www.compchemcons.com/xlogp/xlogp.html) without having to go through
`eval.desc`. (Note that AlogP and XlogP assume that hydrogens are explicitly specified in the molecule. This may not be true if the molecules were obtained from SMILES)
```{r}
mol <- parse.smiles('CC(=O)CC(=O)NCN')[[1]]
convert.implicit.to.explicit(mol)
get.tpsa(mol)
get.xlogp(mol)
get.alogp(mol)
```

Now that we have a descriptor matrix, we easily build a linear
regression model. First, remove `NA`’s, correlated and constant columns. The
code is shown below, but since it involves a stochastic element, we will not
run it for this example. If we were to perform feature selection, then this
type of reduction would have to be performed.

```{r}
descs <- descs[, !apply(descs, 2, function(x) any(is.na(x)) )]
descs <- descs[, !apply( descs, 2, function(x) length(unique(x)) == 1 )]
r2 <- which(cor(descs)^2 > .6, arr.ind=TRUE)
r2 <- r2[ r2[,1] > r2[,2] , ]
descs <- descs[, -unique(r2[,2])]
```

Note that the above correlation reduction step is pretty crude and there
are better ways to do it. Given the reduced descriptor matrix, we can
perform feature selection (say using [leaps](https://CRAN.R-project.org/package=leaps), [caret](https://CRAN.R-project.org/package=caret) or a [GA](https://CRAN.R-project.org/package=GenAlgo) to identify a suitable
subset of descriptors. Given that we selected the descriptors by hand, we can skip this section, and directly build the model and generate a plot of predicted versus observed BP. (Note that this is a toy example and is not an example of good QSAR practice!)
```{r}
model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
summary(model)
plot(bpdata$BP, predict(model, descs),
     xlab="Observed BP", ylab="Predicted BP",
     pch=19, xlim=c(100, 700), ylim=c(100, 700))
abline(0,1, col='red')
```
 


## Fingerprints

Fingerprints are a common representation used for a variety of purposes such as similarity searching and predictive modeling. The [CDK](https://cdk.github.io/) provides a variety of fingerprints ranging from path-based hashed fingerprints to circular (specifically, an implementation fo the [ECFP](https://pubs.acs.org/doi/abs/10.1021/ci100050t) fingerprints) and signature fingerprints (based on the [signature](https://pubs.acs.org/doi/abs/10.1021/ci020345w) molecular descriptor). Some of the fingerprints are represented as binary strings and other by integer vectors. The `rcdk` employs the [fingerprint](https://CRAN.R-project.org/package=fingerprint) package to support operations on the resultant fingerprints.

In this section, we present an example of using fingerprints to generate a hierarchical clustering of a set of molecules from the included boiling point dataset. We first parse the SMILES for the molecules in the dataset and then compute the fingerprints, specifying the `circular` type.
```{r}
data(bpdata)
mols <- parse.smiles(bpdata[,1])
fps <- lapply(mols, get.fingerprint, type='circular')
```

With the fingerprints, we can then compute a pairwise similarity matrix using the [Tanimoto](https://en.wikipedia.org/wiki/Chemical_similarity) metric. Since R's `hclust` method requires a distance matrix, we convert the similarity matrix to a distance matrix
```{r}
fp.sim <- fingerprint::fp.sim.matrix(fps, method='tanimoto')
fp.dist <- 1 - fp.sim
```
Finally, we can perform the clustering. In this case we use the `hclust` method though any of R's clustering methods could be used.
```{r}
cls <- hclust(as.dist(fp.dist))
plot(cls, main='A Clustering of the BP dataset', labels=FALSE)
```

Another common task for fingerprints is similarity searching. That is, 
given a collection of _target_ molecules, find those molecules that are similar
to a _query_ molecule. This is achieved by evaluating a similarity metric
between the query and each of the target molecules. Those target molecules
exceeding a user defined cutoff will be returned. With the help of the [fingerprint](https://CRAN.R-project.org/package=fingerprint) package this is easily accomplished.

For example, we can identify all the molecules in the BP dataset that have
a Tanimoto similarity of 0.3 or more with acetalehyde, and then create a tabular
summary. Note that this could also be accomplished with
molecular descriptors, in which case you’d probably evaluate the Euclidean
distance between descriptor vectors.

```{r}
query.mol <- parse.smiles('CC(=O)')[[1]]
target.mols <- parse.smiles(bpdata[,1])
query.fp <- get.fingerprint(query.mol, type='circular')
target.fps <- lapply(target.mols, get.fingerprint, type='circular')
sims <- data.frame(sim=do.call(rbind, lapply(target.fps,
     fingerprint::distance,
     fp2=query.fp, method='tanimoto')))
subset(sims, sim >= 0.3)
```