% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/stringsim.R
\name{stringsim}
\alias{stringsim}
\alias{stringsimmatrix}
\title{Compute similarity scores between strings}
\usage{
stringsim(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
q = 1,
...
)
stringsimmatrix(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
q = 1,
...
)
}
\arguments{
\item{a}{R object (target); will be converted by \code{as.character}.}
\item{b}{R object (source); will be converted by \code{as.character}.}
\item{method}{Method for distance calculation. The default is \code{"osa"},
see \code{\link{stringdist-metrics}}.}
\item{useBytes}{Perform byte-wise comparison, see \code{\link{stringdist-encoding}}.}
\item{q}{Size of the \eqn{q}-gram; must be nonnegative. Only applies to
\code{method='qgram'}, \code{'jaccard'} or \code{'cosine'}.}
\item{...}{Additional arguments, passed on to \code{\link{stringdist}} (by
\code{stringsim}) and \code{\link{stringdistmatrix}} (by \code{stringsimmatrix}).}
}
\value{
\code{stringsim} returns a vector of similarities: values between 0 and 1,
where 1 corresponds to perfect similarity (distance 0) and 0 to
complete dissimilarity. \code{NA} is returned when \code{\link{stringdist}}
returns \code{NA}. Distances equal to \code{Inf} are mapped to a
similarity of 0. \code{stringsimmatrix} works the same way but, analogous to
\code{\link{stringdistmatrix}}, returns a similarity matrix instead of a
vector.
}
\description{
\code{stringsim} computes pairwise string similarities between elements of
\code{character} vectors \code{a} and \code{b}, where the vector with fewer
elements is recycled.
\code{stringsimmatrix} computes the string similarity matrix, with rows
corresponding to the elements of \code{a} and columns to the elements of \code{b}.
}
\details{
The similarity is calculated by first computing the distance using
\code{\link{stringdist}}, dividing the distance by the maximum
possible distance, and subtracting the result from 1.
This results in a score between 0 and 1, with 1
corresponding to complete similarity and 0 to complete dissimilarity.
Note that complete similarity implies equality only for distances that satisfy
the identity property. This is not the case for, e.g., q-gram based distances
(for example, if q=1, anagrams are completely similar).
For distances where weights can be specified, the maximum distance
is currently computed by assuming that all weights are equal to 1.
}
\examples{
# Calculate the similarity using the default method of optimal string alignment
stringsim("ca", "abc")
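# A sketch of the normalization described in Details, done by hand.
# Assumption (for illustration only, not stated on this page): for method='lv'
# the maximum possible distance is the length of the longer string.
d <- stringdist("kitten", "sitting", method='lv')
1 - d / max(nchar("kitten"), nchar("sitting"))
# Under that assumption, the value above should agree with
stringsim("kitten", "sitting", method='lv')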
# Calculate the similarity using the Jaro-Winkler method
# The p argument is passed on to stringdist
stringsim('MARTHA','MATHRA',method='jw', p=0.1)
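# With q=1, q-gram based methods do not distinguish anagrams, so two
# different strings can be completely similar (see Details)
stringsim("listen", "silent", method='qgram', q=1)
# Compute a similarity matrix: rows correspond to a, columns to b
# (the input strings here are arbitrary illustrations)
stringsimmatrix(c("foo", "bar"), c("baz", "boo", "far"))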
}