% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/clustering.R
\name{clustering_cv}
\alias{clustering_cv}
\title{Cluster Cross-Validation}
\usage{
clustering_cv(
  data,
  vars,
  v = 10,
  repeats = 1,
  distance_function = "dist",
  cluster_function = c("kmeans", "hclust"),
  ...
)
}
\arguments{
\item{data}{A data frame.}
\item{vars}{A vector of bare variable names to use to cluster the data.}
\item{v}{The number of partitions of the data set.}
\item{repeats}{The number of times to repeat the clustered partitioning.}
\item{distance_function}{Which function should be used for distance calculations?
Defaults to \code{\link[stats:dist]{stats::dist()}}. You can also provide your own
function; see \code{Details}.}
\item{cluster_function}{Which function should be used for clustering?
Options are either \code{"kmeans"} (to use \code{\link[stats:kmeans]{stats::kmeans()}})
or \code{"hclust"} (to use \code{\link[stats:hclust]{stats::hclust()}}). You can also provide your own
function; see \code{Details}.}
\item{...}{Extra arguments passed on to \code{cluster_function}.}
}
\value{
A tibble with classes \code{rset}, \code{tbl_df}, \code{tbl}, and \code{data.frame}.
The results include a column for the data split objects and
an identification variable \code{id}.
}
\description{
Cluster cross-validation splits the data into V disjoint groups by
clustering on the selected variables (k-means by default).
The analysis set of each resample consists of V-1 of the
folds/clusters, while the assessment set contains the remaining fold/cluster. In
basic cross-validation (i.e. no repeats), the number of resamples
is equal to V.
}
\details{
The variables in the \code{vars} argument are clustered, using either k-means
or hierarchical clustering, to split the data into disjoint sets.
These clusters are used as the folds for cross-validation. Depending on how
the data are distributed, there may not be an equal number of points
in each fold.

You can optionally provide a custom function to \code{distance_function}. The
function should take a data frame (as created via \code{data[vars]}) and return
a \code{\link[stats:dist]{stats::dist()}} object with distances between data points.

You can optionally provide a custom function to \code{cluster_function}. The
function must take three arguments:
\itemize{
\item \code{dists}, a \code{\link[stats:dist]{stats::dist()}} object with distances between data points
\item \code{v}, a length-1 numeric for the number of folds to create
\item \code{...}, to pass any additional named arguments to your function
}

The function should return a vector of cluster assignments of length
\code{nrow(data)}, with each element of the vector corresponding to the matching
row of the data frame.
}
\examples{
\dontshow{if (rlang::is_installed("modeldata")) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
data(ames, package = "modeldata")
clustering_cv(ames, vars = c(Sale_Price, First_Flr_SF, Second_Flr_SF), v = 2)
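
# The calls below are a minimal sketch of the custom-function interface
# described in Details. The helper names scaled_dist and average_linkage are
# hypothetical, chosen for illustration; they are not part of the package.

# A possible custom distance function: standardize the selected columns first,
# then compute Euclidean distances, so that no variable dominates because of
# its units. It takes a data frame and returns a dist object.
scaled_dist <- function(x) {
  stats::dist(scale(x))
}
clustering_cv(
  ames,
  vars = c(Sale_Price, First_Flr_SF, Second_Flr_SF),
  v = 2,
  distance_function = scaled_dist
)

# A possible custom clustering function: average-linkage hierarchical
# clustering. It accepts dists, v, and ... (unused in this sketch) and returns
# one cluster assignment per row of the data.
average_linkage <- function(dists, v, ...) {
  tree <- stats::hclust(dists, method = "average")
  unname(stats::cutree(tree, k = v))
}
clustering_cv(
  ames,
  vars = c(Sale_Price, First_Flr_SF, Second_Flr_SF),
  v = 2,
  cluster_function = average_linkage
)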
\dontshow{\}) # examplesIf}
}