\documentclass[a4paper]{article}

%\VignetteIndexEntry{Introduction to the CGDS R library}
%\VignettePackage{cgdsr}

% Definitions

\usepackage{url}

\title{The CGDS-R library}
\author{Anders Jacobsen and Augustin Luna}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\tableofcontents

\section{Introduction}

This package provides a basic set of R functions for querying
the Cancer Genomic Data Server (CGDS) hosted by the Computational
Biology Center (cBio) at the Memorial Sloan-Kettering Cancer Center (MSKCC). This
service is a part of the cBio Cancer Genomics Portal,
\url{http://www.cbioportal.org/}.

In summary, the library can issue the following types of queries:

\begin{itemize}
\item{
\texttt{getCancerStudies()} : What cancer studies are hosted on the server?
For example, TCGA glioblastoma or TCGA ovarian cancer.
}
\item{
\texttt{getGeneticProfiles()} : What genetic profile types are available for
cancer study X? For example, mRNA expression or copy number alterations.
}
\item{
\texttt{getCaseLists()} : What case lists are available for cancer study X? For
example, all samples or only samples corresponding to a given cancer subtype.
}
\item{
\texttt{getProfileData()}: Retrieve slices of genomic data.  For
example, a client can retrieve all mutation data for PTEN and EGFR in
TCGA glioblastoma.
}
\item{
\texttt{getClinicalData()}: Retrieve clinical data (e.g. patient
survival time and age) for a given cancer study and list of cases.
}
\end{itemize}

Each of these functions will be briefly described in the following
sections. The last part of this document includes some concrete examples
of how to access and plot the data.

The purpose of this document is to give the reader a quick overview of
the \texttt{cgdsr} package. Please refer to the corresponding R manual
pages for a more detailed explanation of arguments and output for each
function.

\section{The CGDS R interface}

\subsection{\texttt{CGDS()} : Create a CGDS connection object}
Initially, we will establish a connection to the public CGDS
server hosted by Memorial Sloan-Kettering Cancer Center. The function
for creating a CGDS connection object requires the URL of the CGDS
server, in this case \url{http://www.cbioportal.org/}, as an argument.

<<>>=
library(cgdsr)
# Create CGDS object
mycgds = CGDS("http://www.cbioportal.org/")
@

The variable \texttt{mycgds} is now a CGDS connection object
pointing at the URL for the public CGDS server. This connection object must
be included as an argument to all subsequent interface
calls. Optionally, we can now perform a set of simple tests of the data
returned from the CGDS connection object using the \texttt{test} function:

<<>>=
# Test the CGDS endpoint URL using a few simple API tests
test(mycgds)
@

Note that the tests may not work if you are connecting to a portal other
than the one in the above example. The tests can fail if the portal
instance does not contain the data that is being tested against, or
if you do not have authorization to access the data that is being tested against.

A verbose option can be set for the CGDS connection object. With it enabled,
function calls that retrieve data from cBioPortal also display the URL of the
underlying programming interface request. This is useful for debugging and
troubleshooting issues with the package.

<<>>=
# Set verbose flag
setVerbose(mycgds, TRUE)
@

Optionally, a data access token can be attached to a CGDS connection object
when it is created. This allows you to connect to cBioPortal instances that require
authentication. Data access tokens (when this feature is enabled) can be created
through the cBioPortal website. If you attempt to access data that you are not
authorized to access, you will get an \texttt{Unauthorized (HTTP 401)} error.
Note that the public portal at \url{http://www.cbioportal.org/} does not require
authentication, so you do not need a token to connect to it.

<<>>=
# Connect to a portal instance that requires authentication
mysecurecgds = CGDS("https://cbioportal.mskcc.org/",
                    token="fd0522cb-7972-40d0-9d83-cb4c14e8a337")
@

\subsection{\texttt{getCancerStudies()} : Retrieve a set of available cancer studies}

Having created a CGDS connection object, we can now retrieve a data
frame with available cancer studies using the \texttt{getCancerStudies} function:

<<>>=
# Get list of cancer studies at server
getCancerStudies(mycgds)[,c(1,2)]
@

Here we are only showing the first two columns, the cancer study ID and
short name, of the result data frame. There is also a third column,
a longer description of the cancer study. The cancer study ID must be
used in subsequent interface calls to retrieve case lists and genetic
data profiles (see below).
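
As a minimal sketch of working with this result, the chunk below selects the
study IDs whose short name matches a keyword. The helper
\texttt{findStudyIds()} is hypothetical (it is not part of \texttt{cgdsr}), and
the toy data frame, including its column names and rows, is an assumption that
stands in for the server response so that the chunk runs without a connection.

<<>>=
# Toy stand-in for the data frame returned by getCancerStudies();
# the column names and rows are illustrative assumptions.
studies = data.frame(
  cancer_study_id = c("gbm_tcga", "ov_tcga", "prad_mskcc"),
  name = c("Glioblastoma (TCGA)", "Ovarian (TCGA)", "Prostate (MSKCC)"),
  stringsAsFactors = FALSE)

# Hypothetical helper: return the IDs (column 1) of all studies whose
# short name (column 2) matches the pattern, ignoring case.
findStudyIds = function(studies, pattern) {
  studies[grepl(pattern, studies[, 2], ignore.case = TRUE), 1]
}

findStudyIds(studies, "tcga")
@

With a live connection, the same helper could be applied to the real result,
for example \texttt{findStudyIds(getCancerStudies(mycgds), "glioblastoma")}.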

\subsection{\texttt{getGeneticProfiles()} : Retrieve genetic data profiles for a specific cancer study}
This function queries the CGDS API and returns the available genetic
profiles, e.g. mutation or copy number profiles, stored about a
specific cancer study. Below we list the current genetic profiles for
the TCGA glioblastoma cancer study:

<<>>=
getGeneticProfiles(mycgds,'gbm_tcga')[,c(1:2)]
@

Here we are only listing the first two columns, genetic profile ID and
short name, of the resulting data frame. Please refer to the R manual
pages for a more extended specification of the arguments and output.


\subsection{\texttt{getCaseLists()} : Retrieve case lists for a specific cancer study}
This function queries the CGDS API and returns available case lists
for a specific cancer study. For example, within a particular study, only
some cases may have sequence data, and another subset of cases may
have been sequenced and treated with a specific therapeutic protocol.  Multiple
case lists may be associated with each cancer study, and this method
enables you to retrieve meta-data regarding all of these case
lists. Below we list the current case lists for the TCGA glioblastoma
cancer study:

<<>>=
getCaseLists(mycgds,'gbm_tcga')[,c(1:2)]
@

Here we are only listing the first two columns, case list ID and
short name, of the resulting data frame. Please refer to the R manual
pages for a more extended specification of the arguments and output.

\subsection{\texttt{getProfileData()} : Retrieve genomic profile data for genes and genetic profiles}
The function queries the CGDS API and returns data based on gene(s),
genetic profile(s), and a case list. A single call accepts either
a list of genes and a single genetic profile or, conversely,
a single gene and a list of genetic profiles. Importantly, the format of the output
data frame depends on whether a single gene or a list of genes was specified in
the arguments. Below we retrieve mRNA expression and copy number
alteration genetic profiles for the NF1 gene in all samples of the TCGA glioblastoma
cancer study:

<<>>=
getProfileData(mycgds, "NF1", c("gbm_tcga_gistic","gbm_tcga_mrna"), "gbm_tcga_all")[c(1:5),]
@

Here we show only the first five rows of the data frame. Entries with NaN
indicate missing values. In the next example, we retrieve mRNA expression data
for the MDM2 and MDM4 genes:

<<>>=
getProfileData(mycgds, c("MDM2","MDM4"), "gbm_tcga_mrna", "gbm_tcga_all")[c(25:30),]
@

Here we show only rows 25 to 30 of the data frame.
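
Because missing measurements are reported as NaN, it is often convenient to
drop incomplete cases before further analysis. The chunk below is a minimal
sketch of one way to do this in base R. The toy data frame, with made-up case
IDs and values, is an assumption that stands in for a
\texttt{getProfileData()} result for two genes.

<<>>=
# Toy stand-in for a two-gene getProfileData() result;
# case IDs and values are made up for illustration.
expr = data.frame(MDM2 = c(0.4, NaN, -1.2, 2.1),
                  MDM4 = c(1.0, 0.3, NaN, -0.5),
                  row.names = c("case_A", "case_B", "case_C", "case_D"))

# is.na() is TRUE for NaN as well as NA, so this keeps only the cases
# that have a measurement for every gene.
complete = expr[rowSums(is.na(expr)) == 0, ]
rownames(complete)
@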

\subsection{\texttt{getClinicalData()} : Retrieve clinical data for a list of cases}
The function queries the CGDS API and returns available clinical data (e.g. patient
survival time and age) for a given case list. Results are returned in
a data frame with a row for each case and a column for each clinical
attribute. The available clinical attributes are:

\begin{itemize}
\item{
\texttt{overall\_survival\_months}: Overall survival, in months.
}
\item{
\texttt{overall\_survival\_status}: Overall survival status, usually
indicated as ``LIVING'' or ``DECEASED''.
}
\item{
\texttt{disease\_free\_survival\_months}: Disease free survival, in months.
}
\item{
\texttt{disease\_free\_survival\_status}: Disease free survival status, usually
indicated as ``DiseaseFree'' or ``Recurred/Progressed''.
}
\item{
\texttt{age\_at\_diagnosis}: Age at diagnosis.
}
\end{itemize}

Below we retrieve clinical data for the TCGA ovarian cancer dataset (only the first five
cases/rows are shown):

<<>>=
getClinicalData(mycgds, "ova_all")[c(1:5),]
@
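
Since the result has one row per case and one column per attribute, standard
data frame operations apply directly. The chunk below is a minimal sketch that
summarizes survival time by survival status. The toy data frame uses the
attribute names listed above, but its case IDs and values are made up; the
column names returned by a particular portal instance may differ.

<<>>=
# Toy stand-in for a getClinicalData() result; all values are made up.
clin = data.frame(
  overall_survival_months = c(12.5, 30.1, 8.0, 45.2),
  overall_survival_status = c("DECEASED", "LIVING", "DECEASED", "LIVING"),
  age_at_diagnosis = c(61, 54, 70, 48),
  row.names = c("case_A", "case_B", "case_C", "case_D"))

# Median overall survival time (in months) within each status group
medianByStatus = tapply(clin$overall_survival_months,
                        clin$overall_survival_status, median)
medianByStatus
@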

\section{Examples}

\subsection{Example 1: Association of NF1 copy number alteration and mRNA expression in glioblastoma}
As a simple example, we will generate a plot of the association between
copy number alteration (CNA) status and mRNA expression change for the
NF1 tumor suppressor gene in glioblastoma. This plot is very similar
to Figure 2b in the TCGA research network paper on glioblastoma
(McLendon et al. 2008). The mRNA expression of NF1 has been
median adjusted on the gene level (by globally subtracting the median expression
level of NF1 across all samples).

\begin{center}
<<NF1plot1,fig=TRUE,echo=TRUE>>=
# Retrieve CNA status and mRNA expression for NF1
df = getProfileData(mycgds, "NF1", c("gbm_tcga_gistic","gbm_tcga_mrna"), "gbm_tcga_all")
head(df)
# Box plot without outlier symbols, overlaid with the individual samples
boxplot(df[,2] ~ df[,1], main="NF1 : CNA status vs mRNA expression", xlab="CNA status", ylab="mRNA expression", outpch = NA)
stripchart(df[,2] ~ df[,1], vertical=TRUE, add=TRUE, method="jitter", pch=1, col='red')
@
\end{center}

Alternatively, the generic \texttt{cgdsr} \texttt{plot()}
function can be used to generate a similar plot:

\begin{center}
<<NF1plot2,fig=TRUE,echo=TRUE>>=
plot(mycgds, "gbm_tcga", "NF1", c("gbm_tcga_gistic","gbm_tcga_mrna"), "gbm_tcga_all", skin = 'disc_cont')
@
\end{center}
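
The association shown in these plots can also be summarized numerically. The
chunk below is a minimal sketch that uses a rank-based (Spearman) correlation,
which is one reasonable choice for an ordinal CNA status and not a method taken
from the paper cited above. The toy data frame and its values are made up and
only mimic the two-column layout (CNA status, mRNA expression) used in this
example.

<<>>=
# Toy stand-in for the two-column data frame used in this example
# (column 1: CNA status, column 2: mRNA expression); values are made up.
toy = data.frame(cna  = c(-2, -1, -1, 0, 0, 0, 1, 1),
                 mrna = c(-2.1, -0.9, -1.2, 0.1, -0.2, 0.3, 0.8, 1.1))

# Drop cases with a missing value in either column
toy = toy[rowSums(is.na(toy)) == 0, ]

# Spearman correlation between CNA status and expression
rho = cor(toy$cna, toy$mrna, method = "spearman")
rho
@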

\subsection{Example 2: MDM2 and MDM4 mRNA expression levels in glioblastoma}
In this example, we evaluate the relationship of MDM2 and MDM4
expression levels in glioblastoma. mRNA expression levels of MDM2 and MDM4 have been
median adjusted on the gene level (by globally subtracting the median expression
level of the individual gene across all samples). Samples with ``NaN'' do not have
measurements.

\begin{center}
<<MDM2plot1,fig=TRUE,echo=TRUE>>=
df = getProfileData(mycgds, c("MDM2","MDM4"), "gbm_tcga_mrna", "gbm_tcga_all")
head(df)
plot(df, main="MDM2 and MDM4 mRNA expression", xlab="MDM2 mRNA expression", ylab="MDM4 mRNA expression")
@
\end{center}

Alternatively, the generic \texttt{cgdsr} \texttt{plot()}
function can be used to generate a similar plot:

\begin{center}
<<MDMplot2,fig=TRUE,echo=TRUE>>=
plot(mycgds, "gbm_tcga", c("MDM2","MDM4"), "gbm_tcga_mrna" ,"gbm_tcga_all")
@
\end{center}


\subsection{Example 3: Comparing expression of PTEN in primary and metastatic
  prostate cancer tumors}
In this example we plot the mRNA expression levels of PTEN in primary
and metastatic prostate cancer tumors.

\begin{center}
<<PTENplot,fig=TRUE,echo=TRUE>>=
df.pri = getProfileData(mycgds, "PTEN", "prad_mskcc_mrna_median_Zscores", "prad_mskcc_primary")
head(df.pri)
df.met = getProfileData(mycgds, "PTEN", "prad_mskcc_mrna_median_Zscores", "prad_mskcc_mets")
head(df.met)
# Box plot of both groups without outlier symbols, overlaid with the individual samples
boxplot(list(t(df.pri),t(df.met)), main="PTEN expression in primary and metastatic tumors", xlab="Tumor type", ylab="PTEN mRNA expression", names=c('primary','metastatic'), outpch = NA)
stripchart(list(t(df.pri),t(df.met)), vertical=TRUE, add=TRUE, method="jitter", pch=1, col='red')
@
\end{center}
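
A difference between two such groups can be checked with a two-sample test.
The chunk below is a minimal sketch using the Wilcoxon rank-sum test from base
R, chosen here only as an illustration. The two vectors hold made-up values
standing in for the PTEN expression of primary and metastatic tumors.

<<>>=
# Made-up expression values standing in for the two groups
pri = c(0.5, 0.1, -0.2, 0.8, 0.3)
met = c(-1.5, -0.7, -2.0, -0.4)

# Two-sided Wilcoxon rank-sum test of primary versus metastatic
res = wilcox.test(pri, met)
res$p.value
@

With the data retrieved above, the corresponding call would be
\texttt{wilcox.test(df.pri[,1], df.met[,1])}.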

\end{document}