% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/measures.R
\name{cosine.similarity}
\alias{cosine.similarity}
\alias{similarity}
\alias{tversky.index}
\alias{overlap.coef}
\alias{morisitas.index}
\alias{jaccard.index}
\alias{horn.index}
\title{Set and vector similarity measures.}
\usage{
cosine.similarity(.alpha, .beta, .do.norm = NA, .laplace = 0)

tversky.index(x, y, .a = 0.5, .b = 0.5)

overlap.coef(.alpha, .beta)

jaccard.index(.alpha, .beta, .intersection.number = NA)

morisitas.index(.alpha, .beta, .do.unique = T)

horn.index(.alpha, .beta, .do.unique = T)
}
\arguments{
\item{.alpha, .beta, x, y}{A numeric vector for \code{cosine.similarity}; a vector of any values
(e.g. characters) for \code{tversky.index} and \code{overlap.coef}; a matrix or data.frame with two columns for \code{morisitas.index} and \code{horn.index};
either two sets or the two set sizes for \code{jaccard.index}.}

\item{.do.norm}{One of three values: NA, T or F. If NA, then check whether the input is a distribution (\code{sum(.data) == 1})
and normalise it if needed, using the given Laplace correction value. If T, then always normalise and apply the Laplace
correction. If F, then do neither normalisation nor Laplace correction.}

\item{.laplace}{Value for Laplace correction.}

\item{.a, .b}{Alpha and beta parameters for the Tversky index. By the formula in "Details", the default values (0.5, 0.5) give the Sorensen-Dice coefficient; setting both to 1 gives the Jaccard index.}

\item{.do.unique}{If T, then call \code{unique} on the first column of each given data.frame or matrix.}

\item{.intersection.number}{Number of elements in the intersection of the two sets. See "Details" for more information.}
}
\value{
Value of similarity between the given sets or vectors.
}
\description{
Functions for computing similarity between two vectors or sets. See "Details" for exact formulas.

- Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.

- Tversky index is an asymmetric similarity measure on sets that compares a variant to a prototype.

- Overlap coefficient is a similarity measure related to the Jaccard index. It measures the overlap between two sets and is defined as the size of the intersection divided by the size of the smaller of the two sets.

- Jaccard index is a statistic used for comparing the similarity and diversity of sample sets.

- Morisita's overlap index is a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples (Morisita 1959). This formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas).

- Horn's overlap index is based on Shannon's entropy.

Use the \link{repOverlap} function for computing similarities of clonesets.
}
\details{
For \code{morisitas.index} and \code{horn.index}, the input data are matrices or data.frames with two columns: the first column holds
the elements (species or individuals) and the second holds the count of each element in a population.

Formulas:

Cosine similarity: \code{cos(a, b) = sum(a * b) / (||a|| * ||b||)}

Tversky index: \code{S(X, Y) = |X and Y| / (|X and Y| + a*|X - Y| + b*|Y - X|)}

Overlap coefficient: \code{overlap(X, Y) = |X and Y| / min(|X|, |Y|)}

Jaccard index: \code{J(A, B) = |A and B| / |A U B|}
For the Jaccard index, the user can provide |A and B| in \code{.intersection.number}; otherwise it is computed
with the \code{base::intersect} function, in which case \code{.alpha} and \code{.beta} are expected to be vectors of elements.
If \code{.intersection.number} is provided, then \code{.alpha} and \code{.beta} are expected to be the numbers of elements in the two sets.

The formula for Morisita's overlap index is more involved; for a full description see \url{http://en.wikipedia.org/wiki/Morisita\%27s_overlap_index}
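
For reference, a sketch of the standard definition of Morisita's overlap index as commonly stated in the literature. The package's implementation may differ in detail. The symbols are introduced only for this illustration: \eqn{x_i} and \eqn{y_i} are the counts of element \eqn{i} in the two samples, and \eqn{X} and \eqn{Y} are the total counts in each sample.

\deqn{C_D = \frac{2 \sum_i x_i y_i}{(D_x + D_y) X Y}, \qquad D_x = \frac{\sum_i x_i (x_i - 1)}{X (X - 1)}}{C_D = 2 * sum(x_i * y_i) / ((D_x + D_y) * X * Y), where D_x = sum(x_i * (x_i - 1)) / (X * (X - 1))}

Here \eqn{D_y} is defined analogously from the \eqn{y_i}.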
}
\examples{
\dontrun{
# Jaccard index of two sets given as vectors of elements
jaccard.index(1:10, 2:20)
# Numbers of unique (CDR3 amino acid sequence, V gene) pairs in each repertoire
a <- nrow(unique(immdata[[1]][, c('CDR3.amino.acid.sequence', 'V.gene')]))
b <- nrow(unique(immdata[[2]][, c('CDR3.amino.acid.sequence', 'V.gene')]))
# The next call
jaccard.index(a, b, repOverlap(immdata[1:2], .seq = 'aa', .vgene = T))
# is equivalent to
repOverlap(immdata[1:2], 'jaccard', .seq = 'aa', .vgene = T)
}
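# A minimal base-R sketch of the formulas listed in the Details section,
# added for illustration only. It does not call the package functions and
# may differ from the package implementation (for example, no normalisation
# or Laplace correction is applied). All variable names below are
# hypothetical and chosen only for this example.
x <- c('a', 'b', 'c', 'd')
y <- c('c', 'd', 'e')
n.xy <- length(intersect(x, y))    # |X and Y|
n.x.only <- length(setdiff(x, y))  # |X - Y|
n.y.only <- length(setdiff(y, x))  # |Y - X|
# Tversky index with a = b = 0.5
tversky.sketch <- n.xy / (n.xy + 0.5 * n.x.only + 0.5 * n.y.only)
# Overlap coefficient
overlap.sketch <- n.xy / min(length(x), length(y))
# Jaccard index
jaccard.sketch <- n.xy / length(union(x, y))
# Cosine similarity of two numeric vectors (v is a multiple of u,
# so the two vectors point in the same direction)
u <- c(1, 2, 3)
v <- c(2, 4, 6)
cosine.sketch <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
print(c(tversky = tversky.sketch, overlap = overlap.sketch,
        jaccard = jaccard.sketch, cosine = cosine.sketch))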
}
\seealso{
\link{repOverlap}, \link{intersectClonesets}, \link{entropy}, \link{diversity}
}