File: context_potential_damage_analysis.Rd

package info (click to toggle)
r-bioc-mutationalpatterns 3.16.0%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 5,360 kB
  • sloc: sh: 8; makefile: 2
file content (103 lines) | stat: -rw-r--r-- 4,323 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/context_potential_damage_analysis.R
\name{context_potential_damage_analysis}
\alias{context_potential_damage_analysis}
\title{Potential damage analysis for the supplied mutational contexts}
\usage{
context_potential_damage_analysis(
  contexts,
  txdb,
  ref_genome,
  gene_ids,
  verbose = FALSE
)
}
\arguments{
\item{contexts}{Vector of mutational contexts to use for the analysis.}

\item{txdb}{Transcription annotation database}

\item{ref_genome}{BSgenome reference genome object}

\item{gene_ids}{Entrez gene ids}

\item{verbose}{Boolean. Determines whether progress is printed. (Default: FALSE)}
}
\value{
A tibble with the ratio of 'stop gain', 'mismatch', 'synonymous' and 
'splice site' mutations per mutation context.
}
\description{
The ratio of possible 'stop gain', 'mismatches', 'synonymous mutations' and
'splice site mutations' is counted per mutational context. This is done for
the supplied ENTREZ gene ids. This way it can be determined how damaging a
mutational context could be. N gives the total number of possible mutations
per context.
}
\details{
The function works by first selecting the longest transcript per gene. The
coding sequence (cds) of this transcript is then assembled. Next, the
function loops over the reference contexts. For each context (and it's
reverse complement), all possible mutation locations are determined. Splice
site mutations are removed at this stage. It's also determined whether these
locations are the first, second or third base of the cds codon (mut loc).
Each unique combination of codon and mut loc is then counted. For each
combination the reference amino acid and the possible alternative amino acids
are determined. By comparing the reference and alternative amino acids, the
number of 'stop_gains', 'mismatches' and 'synonymous mutations' is
determined. This is then normalized per mutation context.
For example, mutations with the ACA context could be located in the third
position of a codon like TAC. This might happen 200 times in the supplied
genes. This TAC codon could then be mutated in either a TAA, TAG or a TAT.
The first two of these options would induce a stop codon, while the third one
would be synonymous. By summing up all codons the number of stop_gains',
'mismatches' and 'synonymous mutations' is determined per mutation context.

For mismatches the blosum62 score is also calculated. This is a score based
on the BLOSUM62 matrix, that describes how similar two amino acids are. This
score is normalized over the total amount of possible mismatches. A lower
score means that the amino acids in the mismatches are more dissimilar. More
dissimilar amino acids are more likely to have a detrimental effect. 

To identify splice sites, sequences around the splice locations are used
instead of the cds. The 2 bases 5' and 2 bases 3' of a splice site are
considered to be splice site mutation locations.
}
\examples{

## See the 'mut_matrix()' example for how we obtained the
## mutation matrix information:
mut_mat <- readRDS(system.file("states/mut_mat_data.rds",
  package = "MutationalPatterns"
))

contexts <- rownames(mut_mat)

## Load the corresponding reference genome.
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)

## Load the transcription annotation database
## You can obtain the database from the UCSC hg19 dataset using
## Bioconductor:
# BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## Here we will use the Entrez Gene IDs from several cancer
## genes. In practice you might want to use larger gene lists,
## but here we only use a few to keep the runtime low.
## In this example we are using:
## TP53, KRAS, NRAS, BRAF, BRCA2
gene_ids <- c(7157, 3845, 4893, 673, 675)

## Run the function
context_potential_damage_analysis(contexts, txdb, ref_genome, gene_ids)

## The function can provide updates about its progress.
## This can be usefull when it's running slowly,
## which can happen when you are using many gene_ids.
## To reduce the example runtime, we don't re-run the analysis, but only show the command
## context_potential_damage_analysis(contexts, txdb, ref_genome, gene_ids, verbose = TRUE)

}