File: separate_wider_delim.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/separate-wider.R
\name{separate_wider_delim}
\alias{separate_wider_delim}
\alias{separate_wider_position}
\alias{separate_wider_regex}
\title{Split a string into columns}
\usage{
separate_wider_delim(
  data,
  cols,
  delim,
  ...,
  names = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start", "align_end"),
  too_many = c("error", "debug", "drop", "merge"),
  cols_remove = TRUE
)

separate_wider_position(
  data,
  cols,
  widths,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  too_many = c("error", "debug", "drop"),
  cols_remove = TRUE
)

separate_wider_regex(
  data,
  cols,
  patterns,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  cols_remove = TRUE
)
}
\arguments{
\item{data}{A data frame.}

\item{cols}{<\code{\link[=tidyr_tidy_select]{tidy-select}}> Columns to separate.}

\item{delim}{For \code{separate_wider_delim()}, a string giving the delimiter
between values. By default, it is interpreted as a fixed string; use
\code{\link[stringr:modifiers]{stringr::regex()}} and friends to split in other ways.}

\item{...}{These dots are for future extensions and must be empty.}

\item{names}{For \code{separate_wider_delim()}, a character vector of output
column names. Use \code{NA} if there are components that you don't want
to appear in the output; the number of non-\code{NA} elements determines the
number of new columns in the result.}

\item{names_sep}{If supplied, output names will be composed
of the input column name followed by the separator followed by the
new column name. Required when \code{cols} selects multiple columns.

For \code{separate_wider_delim()} you can specify \code{names_sep} instead of
\code{names}, in which case the names will be generated from the source column
name, \code{names_sep}, and a numeric suffix.}

\item{names_repair}{Used to check that the output data frame has valid
names. Must be one of the following options:
\itemize{
\item \code{"minimal"}: no name repair or checks, beyond basic existence,
\item \code{"unique"}: make sure names are unique and not empty,
\item \code{"check_unique"}: (the default), no name repair, but check they are unique,
\item \code{"universal"}: make the names unique and syntactic,
\item a function: apply custom name repair,
\item \link{tidyr_legacy}: use the name repair from tidyr 0.8,
\item a formula: a purrr-style anonymous function (see \code{\link[rlang:as_function]{rlang::as_function()}}).
}

See \code{\link[vctrs:vec_as_names]{vctrs::vec_as_names()}} for more details on these terms and the
strategies used to enforce them.}

\item{too_few}{What should happen if a value separates into too few
pieces?
\itemize{
\item \code{"error"}, the default, will throw an error.
\item \code{"debug"} adds additional columns to the output to help you
locate and resolve the underlying problem. This option is intended to
help you debug and address the issue, and should not generally remain in
your final code.
\item \code{"align_start"} aligns the starts of short matches, adding \code{NA} on
the end to pad to the correct length.
\item \code{"align_end"} (\code{separate_wider_delim()} only) aligns the ends of short
matches, adding \code{NA} at the start to pad to the correct length.
}}

\item{too_many}{What should happen if a value separates into too many
pieces?
\itemize{
\item \code{"error"}, the default, will throw an error.
\item \code{"debug"} will add additional columns to the output to help you
locate and resolve the underlying problem.
\item \code{"drop"} will silently drop any extra pieces.
\item \code{"merge"} (\code{separate_wider_delim()} only) will merge together any
additional pieces.
}}

\item{cols_remove}{Should the input \code{cols} be removed from the output?
Always \code{FALSE} if \code{too_few} or \code{too_many} are set to \code{"debug"}.}

\item{widths}{A named numeric vector where the names become column names,
and the values specify the column width. Unnamed components will match,
but not be included in the output.}

\item{patterns}{A named character vector where the names become column names
and the values are regular expressions that match the contents of the
vector. Unnamed components will match, but not be included in the output.}
}
\value{
A data frame based on \code{data}. It has the same rows, but different
columns:
\itemize{
\item The primary purpose of the functions is to create new columns from
components of the string.
For \code{separate_wider_delim()} the names of new columns come from \code{names}.
For \code{separate_wider_position()} the names come from the names of \code{widths}.
For \code{separate_wider_regex()} the names come from the names of
\code{patterns}.
\item If \code{too_few} or \code{too_many} is \code{"debug"}, the output will contain additional
columns useful for debugging:
\itemize{
\item \verb{\{col\}_ok}: a logical vector which tells you if the input was ok or
not. Use to quickly find the problematic rows.
\item \verb{\{col\}_remainder}: any text remaining after separation.
\item \verb{\{col\}_pieces}, \verb{\{col\}_width}, \verb{\{col\}_matches}: number of pieces,
number of characters, and number of matches for \code{separate_wider_delim()},
\code{separate_wider_position()} and \code{separate_wider_regex()} respectively.
}
\item If \code{cols_remove = TRUE} (the default), the input \code{cols} will be removed
from the output.
}
}
\description{
\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#experimental}{\figure{lifecycle-experimental.svg}{options: alt='[Experimental]'}}}{\strong{[Experimental]}}

Each of these functions takes a string column and splits it into multiple
new columns:
\itemize{
\item \code{separate_wider_delim()} splits by delimiter.
\item \code{separate_wider_position()} splits at fixed widths.
\item \code{separate_wider_regex()} splits with regular expression matches.
}

These functions are equivalent to \code{\link[=separate]{separate()}} and \code{\link[=extract]{extract()}}, but use
\href{https://stringr.tidyverse.org/}{stringr} as the underlying string
manipulation engine, and their interfaces reflect what we've learned from
\code{\link[=unnest_wider]{unnest_wider()}} and \code{\link[=unnest_longer]{unnest_longer()}}.
}
\examples{
df <- tibble(id = 1:3, x = c("m-123", "f-455", "f-123"))
# There are three basic ways to split up a string into pieces:
# 1. with a delimiter
df \%>\% separate_wider_delim(x, delim = "-", names = c("gender", "unit"))
# 2. by length
df \%>\% separate_wider_position(x, c(gender = 1, 1, unit = 3))
# 3. defining each component with a regular expression
df \%>\% separate_wider_regex(x, c(gender = ".", ".", unit = "\\\\d+"))
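# A further sketch with made-up data: in separate_wider_delim(), an NA in
# `names` matches a piece but leaves it out of the output
df_codes <- tibble(id = 1:2, x = c("m-123-a", "f-455-b"))
res_codes <- separate_wider_delim(
  df_codes,
  x,
  delim = "-",
  names = c("gender", NA, "suffix")
)
res_codes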

# Sometimes you split on the "last" delimiter
df <- tibble(var = c("race_1", "race_2", "age_bucket_1", "age_bucket_2"))
# _delim doesn't help directly: it splits on every delimiter, and
# too_many = "merge" only keeps the split at the first one
try(df \%>\% separate_wider_delim(var, "_", names = c("var1", "var2")))
df \%>\% separate_wider_delim(var, "_", names = c("var1", "var2"), too_many = "merge")
# Instead, you can use _regex
df \%>\% separate_wider_regex(var, c(var1 = ".*", "_", var2 = ".*"))
# this works because * is greedy; you can mimic the _delim behaviour with .*?
df \%>\% separate_wider_regex(var, c(var1 = ".*?", "_", var2 = ".*"))

# If the number of components varies, it's most natural to split into rows
df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df \%>\% separate_longer_delim(x, delim = " ")
# But separate_wider_delim() provides some tools to deal with the problem
# The default behaviour tells you that there's a problem
try(df \%>\% separate_wider_delim(x, delim = " ", names = c("a", "b")))
# You can get additional insight by using the debug options
df \%>\%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )

# Or you can resolve the problems instead of raising an error
df \%>\%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "merge"
  )
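# Two other strategies, sketched with illustrative data:
# too_few = "align_end" pads short rows with NA at the start, and
# too_many = "drop" discards the extra pieces instead of merging them
df_var <- tibble(id = 1:3, x = c("x", "x y", "x y z"))
res_var <- separate_wider_delim(
  df_var,
  x,
  delim = " ",
  names = c("a", "b"),
  too_few = "align_end",
  too_many = "drop"
)
res_var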

# Or choose to automatically name the columns, producing as many as needed
df \%>\% separate_wider_delim(x, delim = " ", names_sep = "", too_few = "align_start")
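
# When `cols` selects several columns, names_sep is required; output names
# combine the source column name, names_sep, and the new name.
# cols_remove = FALSE keeps the input columns alongside the new ones.
# (The data and column names below are made up for illustration.)
df_span <- tibble(start = c("2020-01", "2021-06"), end = c("2020-12", "2022-03"))
res_span <- separate_wider_delim(
  df_span,
  c(start, end),
  delim = "-",
  names = c("year", "month"),
  names_sep = "_",
  cols_remove = FALSE
)
res_span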
}