% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/determine_regional_similarity.R
\name{determine_regional_similarity}
\alias{determine_regional_similarity}
\title{Determine regional mutation pattern similarity}
\usage{
determine_regional_similarity(
  vcf,
  ref_genome,
  chromosomes,
  window_size = 100,
  stepsize = 25,
  extension = 1,
  oligo_correction = FALSE,
  exclude_self_mut_mat = TRUE,
  max_window_size_gen = 2e+07,
  verbose = FALSE
)
}
\arguments{
\item{vcf}{GRanges object}
\item{ref_genome}{BSgenome reference genome object}
\item{chromosomes}{Vector of chromosome/contig names of the reference genome
to be analyzed.}
\item{window_size}{The number of mutations in a window. (Default: 100)}
\item{stepsize}{The number of mutations that a window slides in each step.
(Default: 25)}
\item{extension}{The number of bases extracted upstream and downstream of
each base substitution to create the mutation matrices.
(Default: 1)}
\item{oligo_correction}{Boolean describing whether oligonucleotide frequency
correction should be applied. (Default: FALSE)}
\item{exclude_self_mut_mat}{Boolean describing whether the mutations in a
window should be subtracted from the global mutation matrix. (Default:
TRUE)}
\item{max_window_size_gen}{The maximum genomic size (in base pairs) that a
window may span. Windows spanning a larger region are removed.
(Default: 20,000,000)}
\item{verbose}{Boolean determining the verbosity of the function. (Default: FALSE)}
}
\value{
A "region_cossim" object containing both the cosine similarities and
the settings used in this analysis.
}
\description{
Calculate the cosine similarities between the global mutation profile and the
mutation profile of smaller genomic windows, using a sliding window approach.
Regions with a very different mutation profile can be identified in this way.
This function generally requires many mutations (~100,000) to work properly.
}
\details{
First a global mutation matrix is calculated using all mutations.
Next, a sliding window is used. This means that we create a window
containing the first x mutations. The cosine similarity, between the
mutation profiles of this window and the global mutation matrix, is then
calculated. The window then slides y mutations to the right and the cosine
similarity is again calculated. This process is repeated until the final
mutation on a chromosome is reached. This process is performed separately
per chromosome. Windows that span too large a region of the genome are
removed, because they are unlikely to contain biologically relevant
information.

The number of mutations that the window slides to the right in each step is
called the stepsize. The best stepsize depends on the window size. In
general, we recommend setting the stepsize between 25% and 100% of the
window size.

The analysis can be performed for trinucleotide contexts, for a larger
context, or for just the base substitutions. A smaller context might miss
detailed differences in mutation profiles, but is also less noisy. We
recommend using a smaller extension when analyzing small datasets.

It's possible to correct for the oligonucleotide frequency of the windows.
This is done by calculating the cosine similarity of the oligonucleotide
frequency between each window and the genome. The cosine similarity of the
mutation profiles is then divided by the oligonucleotide similarity. This
ensures that regions with an abnormal oligonucleotide frequency don't show
up as having a very different profile. The oligonucleotide frequency
correction slows down the function, so we advise the user to keep it off
for exploratory analyses and to only turn it on to validate interesting
results.

By default the mutations in a window are subtracted from the global
mutation matrix, before calculating the cosine similarity. This increases
sensitivity, but could also decrease specificity. This subtraction can be
turned off with the 'exclude_self_mut_mat' argument.
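
As a sketch of the quantities described above, write \eqn{w} for the mutation
profile of a window and \eqn{g} for the global mutation profile, both as
vectors of mutation counts per context. These symbols are notation introduced
here for illustration only. The cosine similarity is then the standard
\deqn{\cos(w, g) = \frac{\sum_i w_i g_i}{\sqrt{\sum_i w_i^2} \sqrt{\sum_i g_i^2}}}{cos(w, g) = sum(w_i * g_i) / (sqrt(sum(w_i^2)) * sqrt(sum(g_i^2)))}
When 'exclude_self_mut_mat' is TRUE, \eqn{g} is the global profile with the
window's own mutations subtracted. With oligonucleotide frequency correction,
writing \eqn{o_w} and \eqn{o_g} for the oligonucleotide frequencies of the
window and the genome, the reported value becomes
\deqn{\frac{\cos(w, g)}{\cos(o_w, o_g)}}{cos(w, g) / cos(o_w, o_g)}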
}
\examples{
## See the 'read_vcfs_as_granges()' example for how we obtained the
## following data:
grl <- readRDS(system.file("states/read_vcfs_as_granges_output.rds",
  package = "MutationalPatterns"
))
## We pool all the variants together, because the function doesn't work well
## with a limited number of mutations. Still, in practice we recommend using
## more mutations than in this example.
gr <- unlist(grl)
## Specify the chromosomes of interest.
chromosomes <- names(genome(gr)[1:3])
## Load the corresponding reference genome.
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)
## Determine the regional similarities. Here we use a small window size to
## make the function work. In practice, we recommend a larger window size.
regional_sims <- determine_regional_similarity(gr,
  ref_genome,
  chromosomes,
  window_size = 40,
  stepsize = 10,
  max_window_size_gen = 40000000
)
## Here we use an extension of 0 to reduce noise.
## We also turned verbosity on, so you can see at what step the function is.
## This can be useful on large datasets.
regional_sims_0_extension <- determine_regional_similarity(gr,
  ref_genome,
  chromosomes,
  window_size = 40,
  stepsize = 10,
  extension = 0,
  max_window_size_gen = 40000000,
  verbose = TRUE
)
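## A toy sketch of the sliding-window comparison described in the details,
## using base R only. This is NOT the package's internal implementation; the
## names 'toy_cos_sim', 'toy_types' and the simulated data are hypothetical,
## and details such as how the last window on a chromosome is handled may
## differ from the real function.
toy_cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
## Simulate 200 position-ordered mutations, each one of 6 substitution types.
set.seed(1)
toy_types <- sample(1:6, 200, replace = TRUE)
toy_global <- tabulate(toy_types, nbins = 6)
## Windows of 40 mutations that slide 10 mutations per step.
toy_window_size <- 40
toy_stepsize <- 10
toy_starts <- seq(1, length(toy_types) - toy_window_size + 1,
  by = toy_stepsize
)
toy_sims <- vapply(toy_starts, function(start) {
  window_profile <- tabulate(
    toy_types[start:(start + toy_window_size - 1)],
    nbins = 6
  )
  ## Subtract the window from the global profile before comparing, mirroring
  ## what the details describe for exclude_self_mut_mat = TRUE.
  toy_cos_sim(window_profile, toy_global - window_profile)
}, numeric(1))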
}
\seealso{
\code{\link{plot_regional_similarity}}
Other regional_similarity:
\code{\link{plot_regional_similarity}()}
}
\concept{regional_similarity}