---
title: "A quick summary of selection syntax in `{datawizard}`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{A quick summary of selection syntax in `{datawizard}`}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

pkgs <- c(
  "datawizard",
  "dplyr",
  "htmltools"
)

if (!all(sapply(pkgs, requireNamespace, quietly = TRUE))) {
  knitr::opts_chunk$set(eval = FALSE)
}
```

```{r load, echo=FALSE, message=FALSE}
library(datawizard)
library(dplyr)
library(htmltools)

set.seed(123)
iris <- iris[sample(nrow(iris), 10), ]
row.names(iris) <- NULL

row <- function(...) {
  div(
    class = "custom_note",
    ...
  )
}
```

```{css, echo=FALSE}
.custom_note {
  border-left: solid 5px hsl(220, 100%, 30%);
  background-color: hsl(220, 100%, 95%);
  padding: 5px;
  margin-bottom: 10px
}
```

This vignette can be referred to by citing the following:

Patil et al., (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. *Journal of Open Source Software*, *7*(78), 4684, https://doi.org/10.21105/joss.04684

# Selecting variables

## Quoted names

This is the simplest way to select one or several variables: use a character
vector containing variable names, as in base R.

```{r}
data_select(iris, c("Sepal.Length", "Petal.Width"))
```

## Unquoted names

It is also possible to use unquoted names. This is useful if we use the `tidyverse`
and want to be consistent about the way variable names are passed.

```{r}
iris %>%
  group_by(Species) %>%
  standardise(Petal.Length) %>%
  ungroup()
```


## Positions

In addition to variable names, `select` can also take the positions (indices) of
the variables to select in the data frame.

```{r}
data_select(iris, c(1, 2, 5))
```


## Functions

We can also pass a function to the `select` argument. This function will be applied
to all columns and should return `TRUE` or `FALSE`. For example, if we want to 
keep only numeric columns, we can use `is.numeric`.

```{r}
data_select(iris, is.numeric)
```

Note that we can provide any custom function to `select`, *provided it returns `TRUE` or `FALSE`* when applied to a column.

```{r}
my_function <- function(i) {
  is.numeric(i) && mean(i, na.rm = TRUE) > 3.5
}

data_select(iris, my_function)
```
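
As a rough mental model (an illustration in base R, not a description of how `datawizard` works internally), applying a predicate to each column yields one logical value per column, and the columns where it is `TRUE` are kept. The helper name `keep_column` is hypothetical, and the full `datasets::iris` is used so that the sketch runs on its own.

```{r}
# Hypothetical predicate: numeric columns whose mean exceeds 3.5
keep_column <- function(x) {
  is.numeric(x) && mean(x, na.rm = TRUE) > 3.5
}

# One TRUE/FALSE per column of the full iris data
keep <- vapply(datasets::iris, keep_column, logical(1))

# Names of the columns for which the predicate is TRUE
kept_names <- names(datasets::iris)[keep]
kept_names
```

Because `&&` short-circuits, `mean()` is never called on the non-numeric `Species` column, which is why the type check comes first.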


## Patterns

With larger datasets, it would be tedious to write out the names of all variables
to select, and relying on variable positions would be fragile because positions
may change later. Instead, we can use four select helpers: `starts_with()`,
`ends_with()`, `contains()`, and `regex()`. The first three can take several
patterns, while `regex()` takes a single regular expression.
 
```{r}
data_select(iris, starts_with("Sep", "Peta"))

data_select(iris, ends_with("dth", "ies"))

data_select(iris, contains("pal", "ec"))

data_select(iris, regex("^Sep|ies"))
```

```{r echo=FALSE}
row("Note: these functions are not exported by `datawizard` but are detected and
applied internally. This means that they won't be detected by autocompletion
when we write them.")
```

```{r echo=FALSE}
row("Note #2: because these functions are not exported, they will not create
conflicts with the ones that come from the `tidyverse` and that have the same name.
So we can still use `dplyr` and its friends; it won't change anything for selection
in `datawizard` functions!")
```
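
To make the matching rules of the four helpers concrete, here is an approximate base R equivalent applied to the column names of the full `datasets::iris`. This is only a sketch of the behaviour described above, not the code `datawizard` runs, and `match_any` is a hypothetical helper.

```{r}
cols <- names(datasets::iris)

# Hypothetical helper: TRUE where a name matches at least one pattern
match_any <- function(x, patterns, matcher) {
  hits <- lapply(patterns, function(p) matcher(x, p))
  Reduce(`|`, hits)
}

# Roughly starts_with("Sep", "Peta")
starts <- cols[match_any(cols, c("Sep", "Peta"), startsWith)]

# Roughly ends_with("dth", "ies")
ends <- cols[match_any(cols, c("dth", "ies"), endsWith)]

# Roughly contains("pal", "ec"): literal substring search
fixed_match <- function(x, p) grepl(p, x, fixed = TRUE)
has <- cols[match_any(cols, c("pal", "ec"), fixed_match)]

# Roughly regex("^Sep|ies"): a single regular expression
rx <- cols[grepl("^Sep|ies", cols)]

list(starts = starts, ends = ends, has = has, rx = rx)
```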


# Excluding variables

What if we want to keep all variables except for a few? There are two ways to
invert our selection.

The first way is to put a minus sign (`-`) in front of the value passed to `select`.

```{r}
data_select(iris, -c("Sepal.Length", "Petal.Width"))

data_select(iris, -starts_with("Sep", "Peta"))

data_select(iris, -is.numeric)
```

Note that if we use numeric indices, we can't mix negative and positive values. 
This means that we have to use `select = -(1:2)` if we want to exclude the first
two columns; `select = -1:2` will *not* work:

```{r}
data_select(iris, -(1:2))
```
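
The difference between `-1:2` and `-(1:2)` comes from R's operator precedence: unary minus binds more tightly than `:`, so `-1:2` is read as `(-1):2`. The following base R sketch (using the column names of the full `datasets::iris`, independently of `datawizard`) shows both sequences and the error that base R raises when negative and positive indices are mixed.

```{r}
# Unary minus is applied before `:`, so this is the sequence from -1 to 2
-1:2

# Parentheses negate the whole sequence instead
-(1:2)

cols <- names(datasets::iris)

# Dropping the first two names works with the parenthesised form
dropped <- cols[-(1:2)]
dropped

# Mixing negative and positive indices is an error in base R subsetting
mixed <- tryCatch(cols[-1:2], error = function(e) "error")
mixed
```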

The second way is to use the `exclude` argument, which accepts the same kinds of
input as `select`. Although it is rarely needed, `select` and `exclude` can also
be used at the same time.

```{r}
data_select(iris, exclude = c("Sepal.Length", "Petal.Width"))

data_select(iris, exclude = starts_with("Sep", "Peta"))
```

# Programming with selections

Since `datawizard` 0.6.0, it is possible to pass function arguments and loop indices
in `select` and `exclude` arguments. This makes it easier to program with
`datawizard`.

For example, if we want to let the user decide the selection they want to use:

```{r}
my_function <- function(data, selection) {
  find_columns(data, select = selection)
}
my_function(iris, c("Sepal.Length"))
my_function(iris, starts_with("Sep"))

my_function_2 <- function(data, pattern) {
  find_columns(data, select = starts_with(pattern))
}
my_function_2(iris, "Sep")
```

It is also possible to pass these values in loops. For example, suppose we have
a list of patterns and want to relocate columns based on these patterns, one by one:

```{r}
new_iris <- iris
for (i in c("Sep", "Pet")) {
  new_iris <- new_iris %>%
    data_relocate(select = starts_with(i), after = -1)
}
new_iris
```

In the loop above, all columns starting with `"Sep"` are first moved to the end of
the data frame, and then the same is done for all columns starting with `"Pet"`.
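
The resulting column order can be traced with a small base R sketch that works on the column names only. It assumes "move to the end" means appending the matching names after the non-matching ones while keeping their relative order. The variable names are illustrative, and the code is not what `data_relocate()` does internally.

```{r}
cols <- names(datasets::iris)

for (prefix in c("Sep", "Pet")) {
  hit <- startsWith(cols, prefix)
  # Non-matching names first, matching names appended at the end
  cols <- c(cols[!hit], cols[hit])
}

cols
```

After the first pass the `Sepal.*` names are last. The second pass then moves the `Petal.*` names behind them, so `Species` ends up first.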



# Useful to know

## Ignore the case

In every selection that uses variable names, we can make the matching
case-insensitive by setting `ignore_case = TRUE`.

```{r}
data_select(iris, c("sepal.length", "petal.width"), ignore_case = TRUE)

data_select(iris, ~ Sepal.length + petal.Width, ignore_case = TRUE)

data_select(iris, starts_with("sep", "peta"), ignore_case = TRUE)
```
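
Conceptually, case-insensitive matching amounts to comparing names after converting both sides to the same case. A minimal base R sketch of that idea follows. It assumes exact name matching on the full `datasets::iris` and is not the package's own implementation.

```{r}
cols <- names(datasets::iris)
wanted <- c("sepal.length", "petal.width")

# Compare lower-cased versions, then return the names as originally spelled
matched <- cols[match(tolower(wanted), tolower(cols))]
matched
```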

## Formulas

It is also possible to use formulas to select variables:

```{r}
data_select(iris, ~ Sepal.Length + Petal.Width)
```

This made it easier to use selection in custom functions before `datawizard` 
0.6.0, and is kept available for backward compatibility.
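
To see why a formula can stand in for a set of variable names, note that base R can recover the names from it with `all.vars()`. The sketch below only illustrates this general mechanism on the full `datasets::iris`. Whether `datawizard` processes formulas in exactly this way is an assumption.

```{r}
f <- ~ Sepal.Length + Petal.Width

# Variable names referenced in the formula, in order of appearance
vars <- all.vars(f)
vars

# The names can then be used for ordinary base R subsetting
head(datasets::iris[, vars], 3)
```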