% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/stringsim.R
\name{stringsim}
\alias{stringsim}
\alias{stringsimmatrix}
\title{Compute similarity scores between strings}
\usage{
stringsim(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  q = 1,
  ...
)

stringsimmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  q = 1,
  ...
)
}
\arguments{
\item{a}{R object (target); will be converted by \code{as.character}.}

\item{b}{R object (source); will be converted by \code{as.character}.}

\item{method}{Method for distance calculation. The default is \code{"osa"}, 
see \code{\link{stringdist-metrics}}.}

\item{useBytes}{Perform byte-wise comparison, see \code{\link{stringdist-encoding}}.}

\item{q}{Size of the \eqn{q}-gram; must be nonnegative. Only applies to
\code{method='qgram'}, \code{'jaccard'} or \code{'cosine'}.}

\item{...}{Additional arguments, passed on to \code{\link{stringdist}} (by
\code{stringsim}) or \code{\link{stringdistmatrix}} (by \code{stringsimmatrix}).}
}
\value{
\code{stringsim} returns a vector of similarities: values between
0 and 1, where 1 corresponds to perfect similarity (distance 0) and 0 to
complete dissimilarity. \code{NA} is returned when \code{\link{stringdist}}
returns \code{NA}. Distances equal to \code{Inf} are truncated to a
similarity of 0. \code{stringsimmatrix} works the same way but, analogous to
\code{\link{stringdistmatrix}}, returns a similarity matrix instead of a
vector.
}
\description{
\code{stringsim} computes pairwise string similarities between elements of
\code{character} vectors \code{a} and \code{b}, where the vector with fewer
elements is recycled.
\code{stringsimmatrix} computes the string similarity matrix with rows
according to \code{a} and columns according to \code{b}.
}
\details{
The similarity is calculated by first computing the distance using
\code{\link{stringdist}}, dividing the distance by the maximum
possible distance, and subtracting the result from 1.
This results in a score between 0 and 1, with 1
corresponding to complete similarity and 0 to complete dissimilarity.
Note that complete similarity implies equality only for distances satisfying
the identity property. This is not the case for, e.g., q-gram based distances
(for example, if q=1, anagrams are completely similar).
For distances where weights can be specified, the maximum distance
is currently computed by assuming that all weights are equal to 1.
}
\examples{


# Calculate the similarity using the default method of optimal string alignment
stringsim("ca", "abc")
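
# A sketch of the calculation described in Details, written out by hand.
# Assumption (not stated on this page): for method = "osa" the maximum
# possible distance is the number of characters in the longer string.
d <- stringdist("ca", "abc", method = "osa")
max_d <- max(nchar("ca"), nchar("abc"))
# Distance divided by the assumed maximum, subtracted from 1
1 - d / max_d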

# Calculate the similarity using the Jaro-Winkler method
# The p argument is passed on to stringdist
stringsim('MARTHA','MATHRA',method='jw', p=0.1)
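
# The calls below illustrate behaviour described in the Description, Value
# and Details sections; the input strings are made up for illustration.

# Elements of the shorter vector are recycled
stringsim(c("hello", "world"), "hello")

# NA is returned when stringdist returns NA
stringsim(NA, "abc")

# The Hamming distance is Inf for strings of unequal length,
# which is truncated to a similarity of 0
stringsim("abc", "abcd", method = "hamming")

# With q = 1, q-gram based methods treat anagrams as completely similar
stringsim("abc", "cba", method = "qgram", q = 1)

# Similarity matrix with rows according to a and columns according to b
stringsimmatrix(c("foo", "bar"), c("foo", "baz", "boo"))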

}