File: determine_regional_similarity.Rd

package info (click to toggle)
r-bioc-mutationalpatterns 3.16.0%2Bdfsg-2
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 5,360 kB
sloc: sh: 8; makefile: 2
file content (145 lines) | stat: -rw-r--r-- 5,456 bytes
parent folder | download | duplicates (2)
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/determine_regional_similarity.R
\name{determine_regional_similarity}
\alias{determine_regional_similarity}
\title{Determine regional mutation pattern similarity}
\usage{
determine_regional_similarity(
  vcf,
  ref_genome,
  chromosomes,
  window_size = 100,
  stepsize = 25,
  extension = 1,
  oligo_correction = FALSE,
  exclude_self_mut_mat = TRUE,
  max_window_size_gen = 2e+07,
  verbose = FALSE
)
}
\arguments{
\item{vcf}{GRanges object}

\item{ref_genome}{BSgenome reference genome object}

\item{chromosomes}{Vector of chromosome/contig names of the reference genome
to be plotted.}

\item{window_size}{The number of mutations in a window. (Default: 100)}

\item{stepsize}{The number of mutations that a window slides in each step.
(Default: 25)}

\item{extension}{The number of bases, that's extracted upstream and
downstream of the base substitutions, to create the mutation matrices.
(Default: 1).}

\item{oligo_correction}{Boolean describing whether oligonucleotide frequency
correction should be applied. (Default: FALSE)}

\item{exclude_self_mut_mat}{Boolean describing whether the mutations in a
window should be subtracted from the global mutation matrix. (Default:
TRUE)}

\item{max_window_size_gen}{The maximum size of a window before it is removed.
(Default: 20,000,000)}

\item{verbose}{Boolean determining the verbosity of the function. (Default: FALSE)}
}
\value{
A "region_cossim" object containing both the cosine similarities and
  the settings used in this analysis.
}
\description{
Calculate the cosine similarities between the global mutation profile and the
mutation profile of smaller genomic windows, using a sliding window approach.
Regions with a very different mutation profile can be identified in this way.
This function generally requires many mutations (~100,000) to work properly.
}
\details{
First a global mutation matrix is calculated using all mutations.
  Next, a sliding window is used. This means that we create a window
  containing the first x mutations. The cosine similarity, between the
  mutation profiles of this window and the global mutation matrix, is then
  calculated. The window then slides y mutations to the right and the cosine
  similarity is again calculated. This process is repeated until the final
  mutation on a chromosome is reached. This process is performed separately
  per chromosome. Windows that span a too large region of the genome are
  removed, because they are unlikely to contain biologically relevant
  information.
  
  The number of mutations that the window slides to the right in each step is
  called the stepsize. The best stepsize depends on the window size. In
  general, we recommend setting the stepsize between 25% and 100% of the
  window size.

  The analysis can be performed for trinucleotides contexts, for a larger
  context, or for just the base substitutions. A smaller context might miss
  detailed differences in mutation profiles, but is also less noisy. We
  recommend using a smaller extension when analyzing small datasets.

  It's possible to correct for the oligonucleotide frequency of the windows.
  This is done by calculating the cosine similarity of the oligonucleotide
  frequency between each window and the genome. The cosine similarity of the
  mutation profiles is then divided by the oligonucleotide similarity. This
  ensures that regions with an abnormal oligonucleotide frequency don't show
  up as having a very different profile. The oligonucleotide frequency
  correction slows down the function, so we advise the user to keep it off
  for exploratory analyses and to only turn it on to validate interesting
  results.

  By default the mutations in a window are subtracted from the global
  mutation matrix, before calculating the cosine similarity. This increases
  sensitivity, but could also decrease specificity. This subtraction can be
  turned of with the 'exclude_self_mut_mat' argument.
}
\examples{

## See the 'read_vcfs_as_granges()' example for how we obtained the
## following data:
grl <- readRDS(system.file("states/read_vcfs_as_granges_output.rds",
  package = "MutationalPatterns"
))

## We pool all the variants together, because the function doesn't work well
## with a limited number of mutations. Still, in practice we recommend to use
## more mutations that in this example.
gr = unlist(grl)

## Specifiy the chromosomes of interest.
chromosomes <- names(genome(gr)[1:3])

 ## Load the corresponding reference genome.
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)

## Determine the regional similarities. Here we use a small window size to make the function work.
## In practice, we recommend a larger window size.
regional_sims = determine_regional_similarity(gr,
  ref_genome, 
  chromosomes,
  window_size = 40,
  stepsize = 10,
  max_window_size_gen = 40000000
)

## Here we use an extensiof of 0 to reduce noise.
## We also turned verbosity on, so you can see at what step the function is.
## This can be useful on large datasets.
regional_sims_0_extension = determine_regional_similarity(gr,
  ref_genome, 
  chromosomes,
  window_size = 40,
  stepsize = 10,
  extension = 0,
  max_window_size_gen = 40000000,
  verbose = TRUE
)
}
\seealso{
\code{\link{plot_regional_similarity}}

Other regional_similarity: 
\code{\link{plot_regional_similarity}()}
}
\concept{regional_similarity}