% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/proteinToX.R
\name{proteinToGenome}
\alias{proteinToGenome}
\title{Map within-protein coordinates to genomic coordinates}
\usage{
proteinToGenome(x, db, id = "name", idType = "protein_id")
}
\arguments{
\item{x}{\code{IRanges} with the coordinates within the protein(s). The
object must also provide some means to identify the protein (see
details).}
\item{db}{\code{EnsDb} object to be used to retrieve genomic coordinates of
encoding transcripts.}
\item{id}{\code{character(1)} specifying where the protein identifier can be
found. Has to be either \code{"name"} or one of \code{colnames(mcols(x))}.}
\item{idType}{\code{character(1)} defining what type of IDs are provided. Has to
be one of \code{"protein_id"} (default), \code{"uniprot_id"} or \code{"tx_id"}.}
}
\value{
\code{list}, each element being the mapping results for one of the input
ranges in \code{x} and names being the IDs used for the mapping. Each
element can be either a:
\itemize{
\item \code{GRanges} object with the genomic coordinates calculated on the
protein-relative coordinates for the respective Ensembl protein (stored in
the \code{"protein_id"} metadata column).
\item \code{GRangesList} object, if the provided protein identifier in \code{x} was
mapped to several Ensembl protein IDs (e.g. if Uniprot identifiers were
used). Each element in this \code{GRangesList} is a \code{GRanges} with the genomic
coordinates calculated for the protein-relative coordinates from the
respective Ensembl protein ID.
}
The following metadata columns are available in each \code{GRanges} in the result:
\itemize{
\item \code{"protein_id"}: the ID of the Ensembl protein for which the within-protein
coordinates were mapped to the genome.
\item \code{"tx_id"}: the Ensembl transcript ID of the encoding transcript.
\item \code{"exon_id"}: ID of the exons that have overlapping genomic coordinates.
\item \code{"exon_rank"}: the rank/index of the exon within the encoding transcript.
\item \code{"cds_ok"}: contains \code{TRUE} if the length of the CDS matches the length
of the amino acid sequence and \code{FALSE} otherwise.
\item \code{"protein_start"}: the within-protein sequence start coordinate of the
mapping.
\item \code{"protein_end"}: the within-protein sequence end coordinate of the mapping.
}
Genomic coordinates are returned ordered by the exon index within the
transcript.
}
\description{
\code{proteinToGenome} maps protein-relative coordinates to genomic coordinates
based on the genomic coordinates of the CDS of the encoding transcript. The
encoding transcript is identified using protein-to-transcript annotations
(and, where needed, Uniprot to Ensembl protein identifier mappings) from the
submitted \code{EnsDb} object (and thus based on annotations from Ensembl).
Not all coding regions for protein coding transcripts are complete, and the
function thus also checks whether the length of the coding region matches the
length of the protein sequence and throws a warning if that is not the case.
The genomic coordinates for the within-protein coordinates, the Ensembl
protein ID, the ID of the encoding transcript and the within protein start
and end coordinates are reported for each input range.
}
\details{
Protein identifiers (supported are Ensembl protein IDs and Uniprot IDs;
\code{idType} additionally accepts \code{"tx_id"}) can be passed to the
function as \code{names} of the \code{x} \code{IRanges} object, or
alternatively in any one of the metadata columns (\code{mcols}) of \code{x}.
}
\note{
While the mapping for Ensembl protein IDs to encoding transcripts (and
thus CDS) is 1:1, the mapping between Uniprot identifiers and encoding
transcripts (which is based on Ensembl annotations) can be one to many. In
such cases \code{proteinToGenome} calculates genomic coordinates for
within-protein coordinates for all of the annotated Ensembl proteins and
returns all of them. See below for examples.
Mapping using Uniprot identifiers also needs additional internal checks that
have a significant impact on the performance of the function. It is thus
strongly suggested to first identify the Ensembl protein identifiers for the
list of input Uniprot identifiers (e.g. using the \code{\link[=proteins]{proteins()}} function) and
use these as input for the mapping function.
A warning is thrown for proteins whose sequence length does not match the
coding sequence length of any encoding transcript. For such
proteins/transcripts a \code{FALSE} is reported in the respective
\code{"cds_ok"} metadata column. The most common reasons for such
discrepancies are incomplete 3' or 5' ends of the CDS. The positions within
the protein might not be correctly mapped to the genome in such cases and it
might be required to check the mapping manually in the Ensembl genome browser.
}
\examples{
library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome X to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")
## Define an IRanges with protein-relative coordinates within a protein for
## the gene SYP
syp <- IRanges(start = 4, end = 17)
names(syp) <- "ENSP00000418169"
res <- proteinToGenome(syp, edbx)
res
## Positions 4 to 17 within the protein span two exons of the encoding
## transcript.
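## A minimal sketch, assuming the protein identifier is stored in a metadata
## column of the IRanges instead of its names. The column name "my_id" is a
## hypothetical choice made for this illustration only; the id argument has
## to match whatever column name is actually used.
syp_mcol <- IRanges(start = 4, end = 17)
mcols(syp_mcol)$my_id <- "ENSP00000418169"
res_mcol <- proteinToGenome(syp_mcol, edbx, id = "my_id")
res_mcol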
## Perform the mapping for multiple proteins identified by their Uniprot
## IDs.
ids <- c("O15266", "Q9HBJ8", "unexistant")
prngs <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100))
names(prngs) <- ids
res <- proteinToGenome(prngs, edbx, idType = "uniprot_id")
## The result is a list, same length as the input object
length(res)
names(res)
## No protein/encoding transcript could be found for the last one
res[[3]]
## The first protein could be mapped to multiple Ensembl proteins. The
## mapping results using all of their encoding transcripts are returned
res[[1]]
## The coordinates within the second protein span two exons
res[[2]]
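## A sketch of the workflow suggested in the note: first look up the Ensembl
## protein IDs annotated to a Uniprot ID with proteins(), then use these IDs
## as input for the mapping. The variable names below are illustrative
## assumptions, and the same within-protein range (13 to 21) is simply reused
## for every Ensembl protein that is found.
prts <- proteins(edbx, filter = ~ uniprot_id == "O15266",
                 columns = c("protein_id", "uniprot_id"))
prts
rng_ens <- IRanges(start = rep(13, nrow(prts)), end = rep(21, nrow(prts)),
                   names = prts$protein_id)
res_ens <- proteinToGenome(rng_ens, edbx)
## One list element per Ensembl protein ID
names(res_ens)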
}
\seealso{
Other coordinate mapping functions:
\code{\link{cdsToTranscript}()},
\code{\link{genomeToProtein}()},
\code{\link{genomeToTranscript}()},
\code{\link{proteinToTranscript}()},
\code{\link{transcriptToCds}()},
\code{\link{transcriptToGenome}()},
\code{\link{transcriptToProtein}()}
}
\author{
Johannes Rainer based on initial code from Laurent Gatto and
Sebastian Gibb
}
\concept{coordinate mapping functions}