% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/afind.R
\name{afind}
\alias{afind}
\alias{grab}
\alias{grabl}
\alias{extract}
\title{Stringdist-based fuzzy text search}
\usage{
afind(
  x,
  pattern,
  window = NULL,
  value = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine",
    "jaccard", "jw", "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

grab(x, pattern, maxDist = Inf, value = FALSE, ...)

grabl(x, pattern, maxDist = Inf, ...)

extract(x, pattern, maxDist = Inf, ...)
}
\arguments{
\item{x}{strings to search in}

\item{pattern}{strings to find (not a regular expression). For \code{grab},
\code{grabl}, and \code{extract} this must be a single string.}

\item{window}{width of moving window.}

\item{value}{toggle: whether to return the matrix with matched strings.}

\item{method}{Matching algorithm to use. See \code{\link{stringdist-metrics}}.}

\item{useBytes}{Perform byte-wise comparison. See \code{\link{stringdist-encoding}}.}

\item{weight}{For \code{method='osa'} or \code{'dl'}, the penalty for
deletion, insertion, substitution and transposition, in that order. When
\code{method='lv'}, the penalty for transposition is ignored. When
\code{method='jw'}, the weights associated with characters of \code{a},
characters from \code{b} and the transposition weight, in that order. 
Weights must be positive and not exceed 1. \code{weight} is ignored
completely when \code{method='hamming'}, \code{'qgram'}, \code{'cosine'},
\code{'jaccard'}, \code{'lcs'}, or \code{'soundex'}.}

\item{q}{q-gram size, only when method is \code{'qgram'}, \code{'jaccard'},
or \code{'cosine'}.}

\item{p}{Winkler's 'prefix' parameter for Jaro-Winkler distance, with
\eqn{0\leq p\leq0.25}. Only when method is \code{'jw'}.}

\item{bt}{Winkler's boost threshold. Winkler's prefix factor is
only applied when the Jaro distance is larger than \code{bt}.
Applies only to \code{method='jw'} and \code{p>0}.}

\item{nthread}{Number of threads used by the underlying C-code. A sensible
default is chosen, see \code{\link{stringdist-parallelization}}.}

\item{maxDist}{Only windows with distance \code{<= maxDist} are considered a match.}

\item{...}{passed to \code{afind}.}
}
\value{
For \code{afind}: a \code{list} of three matrices, each with
\code{length(x)} rows and \code{length(pattern)} columns. In each matrix,
element \eqn{(i,j)} corresponds to \code{x[i]} and \code{pattern[j]}. The 
names and descriptions of the matrices are as follows.
\itemize{
\item{\code{location}. \code{[integer]}, location of the start of the best matching window.
      When \code{useBytes=FALSE}, this corresponds to the location of a \code{UTF} code point
      in \code{x}, possibly after conversion from its original encoding.}
\item{\code{distance}. \code{[numeric]}, the string distance between the pattern and
      the best matching window.}
\item{\code{match}. \code{[character]}, the first of the best matching windows.}

}

For \code{grab}, an \code{integer} vector, indicating in which elements of
\code{x} a match was found with a distance \code{<= maxDist}. When
\code{value=TRUE}, the matched values are returned instead (equivalent to
\code{\link[base]{grep}}).

For \code{grabl}, a \code{logical} vector, indicating in which elements of
\code{x} a match was found with a distance \code{<= maxDist} (equivalent
to \code{\link[base:grep]{grepl}}).

For \code{extract}, a \code{character} matrix with \code{length(x)} rows and
\code{length(pattern)} columns. If a match was found, element \eqn{(i,j)}
contains the match, otherwise it is set to \code{NA}.
}
\description{
\code{afind} slides a window of fixed width over a string \code{x} and
computes the distance between each window and the sought-after
\code{pattern}. The location, content, and distance corresponding to the
window with the best match are returned.
}
\details{
Matching is case-sensitive.  Both \code{x} and \code{pattern} are converted
to \code{UTF-8} prior to search, unless \code{useBytes=TRUE}, in which case
the distances are measured bytewise.

Code is parallelized over the \code{x} variable: each value of \code{x}
is scanned for every element in \code{pattern} using a separate thread (when \code{nthread}
is larger than 1).

The functions \code{grab} and \code{grabl} are approximate string matching
functions that somewhat resemble base R's \code{\link[base]{grep}} and
\code{\link[base:grep]{grepl}}. They are implemented as convenience wrappers
of \code{afind}.
}
\section{Running cosine distance}{

This algorithm gains efficiency by exploiting the fact that two consecutive
windows have a large overlap in their q-gram profiles. It gives the same
result as the \code{"cosine"} distance, but is much faster.
}

\examples{
texts = c("When I grow up, I want to be"
       , "one of the harvesters of the sea"
       , "I think before my days are gone"
       , "I want to be a fisherman")
patterns = c("fish", "gone","to be")

afind(texts, patterns, method="running_cosine", q=3)
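
# A brief look at the structure of the result, following the Value section:
# a list of three matrices with one row per text and one column per pattern.
# The name 'res' is used only for this illustration.
res = afind(texts, patterns, method="running_cosine", q=3)
names(res)
dim(res$location)
# best matching window and its distance for texts[4] and patterns[1] ("fish")
res$match[4, 1]
res$distance[4, 1]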

grabl(texts,"grew", maxDist=1)
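
# For comparison with grabl: grab() reports which elements of 'texts' match,
# as an integer index by default, or as matched values when value=TRUE.
grab(texts, "grew", maxDist=1)
grab(texts, "grew", maxDist=1, value=TRUE)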
extract(texts, "harvested", maxDist=3)
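
# A minimal sketch of the sliding-window idea from the Description, written
# with stringdist() directly. It is an illustration, not the package's own
# implementation. It assumes a window as wide as the pattern; the names
# 'txt', 'pat', 'w', 'starts', 'windows', 'd' and 'i' are illustration names.
txt = texts[4]
pat = "fish"
w = nchar(pat)
# start position of every window of width w that fits in txt
starts = seq_len(nchar(txt) - w + 1)
windows = substring(txt, starts, starts + w - 1)
# distance between each window and the pattern
d = stringdist(windows, pat, method="osa")
# first window with the smallest distance: its location, distance and content
i = which.min(d)
starts[i]
d[i]
windows[i]
# the same search through afind, with the window width set explicitly
afind(txt, pat, window=w, method="osa")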


}
\seealso{
Other matching: 
\code{\link{amatch}()}
}
\concept{matching}