File: make_strata.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/make_strata.R
\name{make_strata}
\alias{make_strata}
\title{Create or Modify Stratification Variables}
\usage{
make_strata(x, breaks = 4, nunique = 5, pool = 0.1, depth = 20)
}
\arguments{
\item{x}{An input vector.}

\item{breaks}{A single number giving the number of bins desired to stratify a
numeric stratification variable.}

\item{nunique}{An integer giving the threshold on the number of unique
values: numeric inputs with fewer unique values than this are treated as
categorical.}

\item{pool}{A proportion of data used to determine if a particular group is
too small and should be pooled into another group. We do not recommend
decreasing this argument below its default of 0.1 because of the dangers
of stratifying groups that are too small.}

\item{depth}{An integer that is used to determine the best number of
percentiles that should be used. The number of bins is based on
\code{min(5, floor(n / depth))}, where \code{n = length(x)}.
If \code{x} is numeric, there must be at least 40 rows in the data set
(when \code{depth = 20}) to conduct stratified sampling.}
}
\value{
A factor vector.
}
\description{
This function can create strata from numeric data and make non-numeric data
more conducive to stratification.
}
\details{
For numeric data, if the number of unique levels is less than
\code{nunique}, the data are treated as categorical data.

For categorical inputs, the function will find levels of \code{x} that
occur in the data with a proportion less than \code{pool}. The values from
these groups will be randomly assigned to the remaining strata (as will
data points that have missing values in \code{x}).

For numeric data with more unique values than \code{nunique}, the data
will be converted to categorical data based on percentiles of the data.
The percentile groups will have no more than 20 percent of the data in
each group. Again, missing values in \code{x} are randomly assigned
to groups.
}
\examples{
set.seed(61)
x1 <- rpois(100, lambda = 5)
table(x1)
table(make_strata(x1))

set.seed(554)
x2 <- rpois(100, lambda = 1)
table(x2)
table(make_strata(x2))

# small groups are randomly assigned
x3 <- factor(x2)
table(x3)
table(make_strata(x3))

# `oilType` data from `caret`
x4 <- rep(LETTERS[1:7], c(37, 26, 3, 7, 11, 10, 2))
table(x4)
table(make_strata(x4))
table(make_strata(x4, pool = 0.1))
table(make_strata(x4, pool = 0.0))
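
# A rough base-R sketch of the pooling check described in Details. It is an
# illustration only and not necessarily how make_strata() is implemented.
# `oil`, `props`, and `small` are names introduced here for the example.
oil <- rep(LETTERS[1:7], c(37, 26, 3, 7, 11, 10, 2))
props <- table(oil) / length(oil)
# levels whose share of the data falls below the default pool = 0.1
small <- names(props)[as.vector(props) < 0.1]
small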

# not enough data to stratify
x5 <- rnorm(20)
table(make_strata(x5))
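
# A minimal sketch of the bin-count arithmetic documented for `depth`, using
# the stated expression min(5, floor(n / depth)). `n_bins` is a hypothetical
# helper written for this example and is not part of rsample.
n_bins <- function(n, depth = 20) min(5, floor(n / depth))
n_bins(20) # 1: a single bin, so there is nothing to stratify on
n_bins(40) # 2: the smallest n that yields two bins when depth = 20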

set.seed(483)
x6 <- rnorm(200)
quantile(x6, probs = (0:10) / 10)
table(make_strata(x6, breaks = 10))
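
# A base-R sketch of percentile-based binning in the spirit of Details. This
# is an assumption for illustration, not necessarily the code make_strata()
# uses. `z`, `cuts`, and `bins` are names introduced here.
set.seed(483)
z <- rnorm(200)
# quartile cut points, then assign each value to its quartile group
cuts <- quantile(z, probs = (0:4) / 4)
bins <- cut(z, breaks = cuts, include.lowest = TRUE)
table(bins)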
}