% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/by.R
\name{dplyr_by}
\alias{dplyr_by}
\title{Per-operation grouping with \code{.by}/\code{by}}
\description{
There are two ways to group in dplyr:
\itemize{
\item Persistent grouping with \code{\link[=group_by]{group_by()}}
\item Per-operation grouping with \code{.by}/\code{by}
}

This help page is dedicated to explaining where and why you might want to use the latter.

Depending on the dplyr verb, the per-operation grouping argument may be named \code{.by} or \code{by}.
The \emph{Supported verbs} section below outlines this on a case-by-case basis.
The remainder of this page will refer to \code{.by} for simplicity.

Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of \code{.by} is to allow you to place that grouping specification alongside the code that actually uses it.
As an added benefit, with \code{.by} you no longer need to remember to \code{\link[=ungroup]{ungroup()}} after \code{\link[=summarise]{summarise()}}, and \code{summarise()} won't ever message you about how it's handling the groups!

This idea comes from \href{https://CRAN.R-project.org/package=data.table}{data.table}, which allows you to specify \code{by} alongside modifications in \code{j}, like: \code{dt[, .(x = mean(x)), by = g]}.
\subsection{Supported verbs}{
\itemize{
\item \code{\link[=mutate]{mutate(.by = )}}
\item \code{\link[=summarise]{summarise(.by = )}}
\item \code{\link[=reframe]{reframe(.by = )}}
\item \code{\link[=filter]{filter(.by = )}}
\item \code{\link[=slice]{slice(.by = )}}
\item \code{\link[=slice_head]{slice_head(by = )}} and \code{\link[=slice_tail]{slice_tail(by = )}}
\item \code{\link[=slice_min]{slice_min(by = )}} and \code{\link[=slice_max]{slice_max(by = )}}
\item \code{\link[=slice_sample]{slice_sample(by = )}}
}

Note that some dplyr verbs use \code{by} while others use \code{.by}.
This is a purely technical difference.
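
As a minimal sketch of the naming difference, the following uses a small, hypothetical \code{sales} data frame (the \code{store} and \code{units} columns are made up for illustration) with one verb that takes \code{.by} and one that takes \code{by}:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(dplyr)

# Hypothetical data, for illustration only
sales <- tibble(
  store = c("a", "a", "b", "b", "b"),
  units = c(5, 1, 4, 8, 3)
)

# `filter()` uses `.by`: keep rows above their own store's mean
above_mean <- filter(sales, units > mean(units), .by = store)

# `slice_head()` uses `by`: keep the first row of each store
first_rows <- slice_head(sales, n = 1, by = store)
}\if{html}{\out{</div>}}

In both calls the grouping applies to that one call only, and both results are ungrouped.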
}

\subsection{Differences between \code{.by} and \code{group_by()}}{\tabular{ll}{
   \code{.by} \tab \code{group_by()} \cr
   Grouping only affects a single verb \tab Grouping is persistent across multiple verbs \cr
   Selects variables with \link[=dplyr_tidy_select]{tidy-select} \tab Computes expressions with \link[rlang:args_data_masking]{data-masking} \cr
   Summaries use existing order of group keys \tab Summaries sort group keys in ascending order \cr
}
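
The tidy-select versus data-masking row can be illustrated with a small sketch. The data frame \code{df} and the derived column \code{big} are hypothetical. Because \code{.by} selects existing columns rather than computing expressions, one way to group by a computed value is to create the column with \code{mutate()} first:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(dplyr)

# Hypothetical data, for illustration only
df <- tibble(x = c(1, 12, 3, 15), y = c(10, 20, 30, 40))

# `group_by()` data-masks, so it can compute the grouping column on the fly
grouped_result <- summarise(group_by(df, big = x > 10), y = sum(y))

# `.by` tidy-selects, so the grouping column has to exist already
with_flag <- mutate(df, big = x > 10)
inline_result <- summarise(with_flag, y = sum(y), .by = big)
}\if{html}{\out{</div>}}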

}

\subsection{Using \code{.by}}{

Let's take a look at the two grouping approaches using this \code{expenses} data set, which tracks costs accumulated across various \code{id}s and \code{region}s:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)
expenses
#> # A tibble: 7 x 3
#>      id region  cost
#>   <dbl> <chr>  <dbl>
#> 1     1 A         25
#> 2     2 A         20
#> 3     1 A         19
#> 4     3 B         12
#> 5     1 B          9
#> 6     2 A          6
#> 7     3 A          6
}\if{html}{\out{</div>}}

Imagine that you wanted to compute the average cost per region.
You'd probably write something like this:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expenses \%>\%
  group_by(region) \%>\%
  summarise(cost = mean(cost))
#> # A tibble: 2 x 2
#>   region  cost
#>   <chr>  <dbl>
#> 1 A       15.2
#> 2 B       10.5
}\if{html}{\out{</div>}}

Instead, you can now specify the grouping \emph{inline} within the verb:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expenses \%>\%
  summarise(cost = mean(cost), .by = region)
#> # A tibble: 2 x 2
#>   region  cost
#>   <chr>  <dbl>
#> 1 A       15.2
#> 2 B       10.5
}\if{html}{\out{</div>}}

\code{.by} applies to a single operation. Because \code{expenses} is an ungrouped data frame, the result of a verb that uses \code{.by} is always an ungrouped data frame too, regardless of the number of grouping columns.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expenses \%>\%
  summarise(cost = mean(cost), .by = c(id, region))
#> # A tibble: 5 x 3
#>      id region  cost
#>   <dbl> <chr>  <dbl>
#> 1     1 A         22
#> 2     2 A         13
#> 3     3 B         12
#> 4     1 B          9
#> 5     3 A          6
}\if{html}{\out{</div>}}

Compare that with \code{group_by() \%>\% summarise()}, where \code{summarise()} generally peels off 1 layer of grouping by default, typically with a message that it is doing so:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expenses \%>\%
  group_by(id, region) \%>\%
  summarise(cost = mean(cost))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 5 x 3
#> # Groups:   id [3]
#>      id region  cost
#>   <dbl> <chr>  <dbl>
#> 1     1 A         22
#> 2     1 B          9
#> 3     2 A         13
#> 4     3 A          6
#> 5     3 B         12
}\if{html}{\out{</div>}}

Because \code{.by} grouping applies to a single operation, you don't need to worry about ungrouping, and it never needs to emit a message to remind you what it is doing with the groups.

Note that with \code{.by} we specified multiple columns to group by using the \link[=dplyr_tidy_select]{tidy-select} syntax \code{c(id, region)}.
If you have a character vector of column names you'd like to group by, you can do so with \code{.by = all_of(my_cols)}.
It will group by the columns in the order they were provided.
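
For instance, here is a short sketch of the \code{all_of()} approach, where \code{my_cols} is a hypothetical character vector of column names:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Hypothetical vector of grouping columns, e.g. built programmatically
my_cols <- c("region", "id")

# Groups by `region` first, then `id`, following the order of `my_cols`
by_cols <- summarise(expenses, cost = mean(cost), .by = all_of(my_cols))
names(by_cols)
}\if{html}{\out{</div>}}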

To prevent surprising results, you can't use \code{.by} on an existing grouped data frame:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expenses \%>\%
  group_by(id) \%>\%
  summarise(cost = mean(cost), .by = c(id, region))
#> Error in `summarise()`:
#> ! Can't supply `.by` when `.data` is a grouped data frame.
}\if{html}{\out{</div>}}
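
If you are handed a data frame that is already grouped, one possible way around this error is to drop the existing grouping with \code{ungroup()} before supplying \code{.by}. This is a sketch of that approach, not the only option:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# A grouped data frame, as might be returned by someone else's code
by_id <- group_by(expenses, id)

# Remove the persistent grouping, then group per-operation with `.by`
result <- summarise(ungroup(by_id), cost = mean(cost), .by = c(id, region))
}\if{html}{\out{</div>}}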

So far we've focused on the usage of \code{.by} with \code{summarise()}, but \code{.by} works with a number of other dplyr verbs.
For example, you could append the mean cost per region onto the original data frame as a new column rather than computing a summary:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{expenses \%>\%
  mutate(cost_by_region = mean(cost), .by = region)
#> # A tibble: 7 x 4
#>      id region  cost cost_by_region
#>   <dbl> <chr>  <dbl>          <dbl>
#> 1     1 A         25           15.2
#> 2     2 A         20           15.2
#> 3     1 A         19           15.2
#> 4     3 B         12           10.5
#> 5     1 B          9           10.5
#> 6     2 A          6           15.2
#> 7     3 A          6           15.2
}\if{html}{\out{</div>}}

Or you could slice out the maximum cost per combination of id and region:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Note that the argument is named `by` in `slice_max()`
expenses \%>\%
  slice_max(cost, n = 1, by = c(id, region))
#> # A tibble: 5 x 3
#>      id region  cost
#>   <dbl> <chr>  <dbl>
#> 1     1 A         25
#> 2     2 A         20
#> 3     3 B         12
#> 4     1 B          9
#> 5     3 A          6
}\if{html}{\out{</div>}}
}

\subsection{Result ordering}{

When used with \code{.by}, \code{summarise()}, \code{reframe()}, and \code{slice()} all keep the group keys in the order in which they first appear in the existing data.
This is different from \code{group_by()}, which has always sorted the group keys in ascending order.

\if{html}{\out{<div class="sourceCode r">}}\preformatted{df <- tibble(
  month = c("jan", "jan", "feb", "feb", "mar"),
  temp = c(20, 25, 18, 20, 40)
)

# Uses ordering by "first appearance" in the original data
df \%>\%
  summarise(average_temp = mean(temp), .by = month)
#> # A tibble: 3 x 2
#>   month average_temp
#>   <chr>        <dbl>
#> 1 jan           22.5
#> 2 feb           19  
#> 3 mar           40

# Sorts in ascending order
df \%>\%
  group_by(month) \%>\%
  summarise(average_temp = mean(temp))
#> # A tibble: 3 x 2
#>   month average_temp
#>   <chr>        <dbl>
#> 1 feb           19  
#> 2 jan           22.5
#> 3 mar           40
}\if{html}{\out{</div>}}

If you need sorted group keys, we recommend that you explicitly use \code{\link[=arrange]{arrange()}} either before or after the call to \code{summarise()}, \code{reframe()}, or \code{slice()}.
This also gives you full access to all of \code{arrange()}'s features, such as \code{desc()} and the \code{.locale} argument.
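
As a sketch of that recommendation, the following recreates the same \code{df} and sorts the summary afterwards. The names \code{avg}, \code{sorted}, and \code{hottest} are only illustrative:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(dplyr)

df <- tibble(
  month = c("jan", "jan", "feb", "feb", "mar"),
  temp = c(20, 25, 18, 20, 40)
)

# Keys come back in order of first appearance: jan, feb, mar
avg <- summarise(df, average_temp = mean(temp), .by = month)

# Sort the keys explicitly: feb, jan, mar
sorted <- arrange(avg, month)

# Or sort by the summary itself, largest first: mar, jan, feb
hottest <- arrange(avg, desc(average_temp))
}\if{html}{\out{</div>}}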
}

\subsection{Verbs without \code{.by} support}{

If a dplyr verb doesn't support \code{.by}, then that typically means that the verb isn't inherently affected by grouping.
For example, \code{\link[=pull]{pull()}} and \code{\link[=rename]{rename()}} don't support \code{.by}, because specifying columns to group by would not affect their implementations.

That said, there are a few exceptions: some dplyr verbs don't support \code{.by}, but \emph{do} have special support for grouped data frames created by \code{\link[=group_by]{group_by()}}.
This is typically because the verbs are required to retain the grouping columns, for example:
\itemize{
\item \code{\link[=select]{select()}} always retains grouping columns, with a message if any aren't specified in the \code{select()} call.
\item \code{\link[=distinct]{distinct()}} and \code{\link[=count]{count()}} place unspecified grouping columns at the front of the data frame before computing their results.
\item \code{\link[=arrange]{arrange()}} has a \code{.by_group} argument to optionally order by grouping columns first.
}
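
A brief sketch of two of these behaviours, using a small hypothetical data frame (\code{df}, with made-up columns \code{g}, \code{x}, and \code{y}):

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(dplyr)

# Hypothetical data, for illustration only
df <- tibble(g = c("b", "a", "b", "a"), x = c(4, 3, 2, 1), y = 1:4)
grouped <- group_by(df, g)

# `select()` keeps the grouping column `g` even though only `x` was
# requested, and emits a message saying that it added `g`
kept <- select(grouped, x)
names(kept)

# `arrange()` ignores the groups by default ...
plain <- arrange(grouped, x)

# ... but orders by `g` first, then `x`, when `.by_group = TRUE`
ordered <- arrange(grouped, x, .by_group = TRUE)
}\if{html}{\out{</div>}}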

If \code{group_by()} didn't exist, then these verbs would not have special support for grouped data frames.
}
}