1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194
|
---
title: "Performance Notes"
author: "Zaachary Charlop-Powers"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
vignette: >
%\VignetteIndexEntry{Performance Notes}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## rCDK Performance
In September 2022, of this year, [Stepehn Neumann](https://gist.github.com/sneumann)
[created a benchmark](https://gist.github.com/sneumann/959a6d205ea4ac73eaf1393da0ec0673) for moecular weight calculation that he [announced on twitter](https://twitter.com/sneumannoffice/status/1570070283083710465?s=20&t=RqJR3Bbh-DEcbCf2tWUEBQ) showing that rCDK had dismal performance relative to other tools in the R ecosystem. Something seemed a bit off so I looked into the code.
What I discovered is that the mass spec calculations were mediated by R classes instead of accessing the underlying Java code directly and if you write a function that does
that you get a speedup, and if you avoid reflection by creating static calls then you
get a really fast function.
This is a good example of how to think about using Java from within R where the
use of the helper rJava functions, `J` and `$` allow you to prototype code and benefit
from reflection so you can code sort of like you would in R. Then, if you need
performance, you can tighten down the code a bit by making the calls static. You can
see that progression in the code below which is accompanied by the outputs from those benchmarks.
```sh
will give (2/3) runtime in µs:
21 OrgMassSpecR
163 MetaboCoreUtils
197 enviPat
545 Rdisop
645 CHNOSZ
4863 ChemmineR
22510 rcdk
```
https://gist.github.com/sneumann/959a6d205ea4ac73eaf1393da0ec0673
## Benchmark
```{r, eval=FALSE}
# Bioconductor Packages. Use BiocManager::install()
# Rdisop MetaboCoreUtils ChemmineR ChemmineOB enviPat
library(plyr)
library(CHNOSZ)
library(enviPat)
library(MetaboCoreUtils)
library(rcdk)
library(ChemmineR)
library(OrgMassSpecR)
library(Rdisop)
#library(ChemmineOB)
data(isotopes)
# original
# https://github.com/CDK-R/cdkr/blob/master/rcdk/R/formula.R
# get.formula <- function(mf, charge=0) {
#
# manipulator <- get("mfManipulator", envir = .rcdk.GlobalEnv)
# if(!is.character(mf)) {
# stop("Must supply a Formula string");
# }else{
# dcob <- .cdkFormula.createChemObject()
# molecularformula <- .cdkFormula.createFormulaObject()
# molecularFormula <- .jcall(manipulator,
# "Lorg/openscience/cdk/interfaces/IMolecularFormula;",
# "getMolecularFormula",
# mf,
# .jcast(molecularformula,.IMolecularFormula),
# TRUE);
# }
#
# D <- new(J("java/lang/Integer"), as.integer(charge))
# .jcall(molecularFormula,"V","setCharge",D);
# object <- .cdkFormula.createObject(.jcast(molecularFormula,.IMolecularFormula));
# return(object);
# }
mfManipulator <- J("org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator")
silentchemobject <- J("org.openscience.cdk.silent.SilentChemObjectBuilder")
#' Rewrite the formual object and directly access Java
#'
get.formula2 <- function(mf) {
formula <- mfManipulator$getMolecularFormula(
"C2H3",
silentchemobject$getInstance())
mfManipulator$getMass(formula)
}
#' Add type hints
#'
get.formula3 <- function(mf) {
builderinstance <- .jcall(
silentchemobject,
"Lorg/openscience/cdk/interfaces/IChemObjectBuilder;",
"getInstance")
formula <- .jcall(
mfManipulator,
"Lorg/openscience/cdk/interfaces/IMolecularFormula;",
"getMolecularFormula",
mf,
builderinstance);
mfManipulator$getMass(formula)
}
#' Add type hints
#'
get.formula4 <- function(mf) {
builderinstance <- .jcall(
silentchemobject,
"Lorg/openscience/cdk/interfaces/IChemObjectBuilder;",
"getInstance")
formula <- .jcall(
mfManipulator,
"Lorg/openscience/cdk/interfaces/IMolecularFormula;",
"getMolecularFormula",
mf,
builderinstance);
.jcall(
mfManipulator,
"D",
"getMass",
formula)
}
benchmark <- microbenchmark::microbenchmark(
MetaboCoreUtils = MetaboCoreUtils::calculateMass("C2H6O"),
rcdk = rcdk::get.formula("C2H6O", charge = 0)@mass,
rcdk2 = get.formula2("C2H6O"),
rcdk3 = get.formula3("C2H6O"),
rcdk4 = get.formula4("C2H6O"),
Rdisop = Rdisop::getMolecule("C2H6O")$exactmass,
ChemmineR = ChemmineR::exactMassOB(ChemmineR::smiles2sdf("CCO")),
OrgMassSpecR = OrgMassSpecR::MonoisotopicMass(formula = OrgMassSpecR::ListFormula("C2H6O)"), charge = 0),
CHNOSZ = CHNOSZ::mass("C2H6O"),
enviPat = enviPat::isopattern(isotopes, "C2H6O", charge=FALSE, verbose=FALSE)[[1]][1,1]
, times=1000L)
masses <- c(
MetaboCoreUtils=MetaboCoreUtils::calculateMass("C2H6O"),
rcdk=rcdk::get.formula("C2H6O", charge = 0)@mass,
Rdisop=Rdisop::getMolecule("C2H6O")$exactmass,
#ChemmineR=ChemmineR::exactMassOB(ChemmineR::smiles2sdf("CCO")),
OrgMassSpecR=OrgMassSpecR::MonoisotopicMass(formula = OrgMassSpecR::ListFormula("C2H6O)"), charge = 0),
CHNOSZ=CHNOSZ::mass("C2H6O"),
enviPat=enviPat::isopattern(isotopes, "C2H6O", charge=FALSE, verbose=FALSE)[[1]][1,1]
)
options(digits=10)
t(t(sort(masses)))
summary(benchmark)[order(summary(benchmark)[,"median"]) , ]
clipr::write_clip(as.data.frame(summary(benchmark)[order(summary(benchmark)[,"median"]) , ] ))
```
## Results
```shell
expr min lq mean median uq
1 MetaboCoreUtils 69.479 122.8465 154.049427 139.6495 156.2700
10 enviPat 83.250 143.0935 170.429197 160.5360 179.6570
5 rcdk4 175.889 228.8605 324.182735 271.2955 327.7135
8 OrgMassSpecR 249.287 333.3135 392.479869 357.6665 401.5585
6 Rdisop 382.417 459.8790 538.068697 490.1505 557.9975
9 CHNOSZ 355.145 510.2910 588.186951 555.9165 632.2060
4 rcdk3 781.987 1004.7160 1294.507318 1133.3415 1339.4695
3 rcdk2 2078.465 2392.4950 2920.601088 2612.8025 2931.5465
7 ChemmineR 3227.320 3790.0455 4808.783873 4044.1410 4465.1000
2 rcdk 14823.815 16456.7715 19088.569430 17485.0800 19468.7195
```
|