% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/demean.R
\name{demean}
\alias{demean}
\alias{degroup}
\alias{detrend}
\title{Compute group-meaned and de-meaned variables}
\usage{
demean(
x,
select,
by,
nested = FALSE,
suffix_demean = "_within",
suffix_groupmean = "_between",
append = TRUE,
add_attributes = TRUE,
verbose = TRUE
)
degroup(
x,
select,
by,
nested = FALSE,
center = "mean",
suffix_demean = "_within",
suffix_groupmean = "_between",
append = TRUE,
add_attributes = TRUE,
verbose = TRUE
)
detrend(
x,
select,
by,
nested = FALSE,
center = "mean",
suffix_demean = "_within",
suffix_groupmean = "_between",
append = TRUE,
add_attributes = TRUE,
verbose = TRUE
)
}
\arguments{
\item{x}{A data frame.}
\item{select}{Character vector (or formula) with names of variables to select
that should be group- and de-meaned.}
\item{by}{Character vector (or formula) with the name of the variable that
indicates the group- or cluster-ID. For cross-classified or nested designs,
\code{by} can also identify two or more variables as group- or cluster-IDs. If
the data is nested and should be treated as such, set \code{nested = TRUE}. Else,
if \code{by} defines two or more variables and \code{nested = FALSE}, a cross-classified
design is assumed. Note that \code{demean()} and \code{degroup()} can't handle a mix
of nested and cross-classified designs in one model.
For nested designs, \code{by} can be:
\itemize{
\item a character vector with the names of the variables that indicate the
levels, ordered from \emph{highest} level to \emph{lowest} (e.g.
\code{by = c("L4", "L3", "L2")}).
\item a character vector with variable names in the format \code{by = "L4/L3/L2"},
where the levels are separated by \code{/}.
}
See also section \emph{De-meaning for cross-classified designs} and
\emph{De-meaning for nested designs} below.}
\item{nested}{Logical, if \code{TRUE}, the data is treated as nested. If \code{FALSE},
the data is treated as cross-classified. Only applies if \code{by} contains more
than one variable.}
\item{suffix_demean, suffix_groupmean}{String value, will be appended to the
names of the group-meaned and de-meaned variables of \code{x}. By default,
de-meaned variables will be suffixed with \code{"_within"} and
group-meaned variables with \code{"_between"}.}
\item{append}{Logical, if \code{TRUE} (default), the group- and de-meaned
variables will be appended (column bind) to the original data \code{x},
thus returning both the original and the de-/group-meaned variables.}
\item{add_attributes}{Logical, if \code{TRUE}, the returned variables gain
attributes to indicate the within- and between-effects. This is only
relevant when printing \code{model_parameters()} - in such cases, the
within- and between-effects are printed in separate blocks.}
\item{verbose}{Toggle warnings and messages.}
\item{center}{Method for centering. \code{demean()} always performs
mean-centering, while \code{degroup()} can use \code{center = "median"} or
\code{center = "mode"} for median- or mode-centering, and also \code{"min"}
or \code{"max"}.}
}
\value{
A data frame with the group-/de-meaned variables, which get the suffix
\code{"_between"} (for the group-meaned variable) and \code{"_within"} (for the
de-meaned variable) by default. For cross-classified or nested designs,
the name pattern of the group-meaned variables is the name of the centered
variable followed by the name of the variable that indicates the related
grouping level, e.g. \code{predictor_L3_between} and \code{predictor_L2_between}.
}
\description{
\code{demean()} computes group- and de-meaned versions of a variable that can be
used in regression analysis to model the between- and within-subject effect
(person-mean centering or centering within clusters). \code{degroup()} is more
generic in terms of the centering-operation. While \code{demean()} always uses
mean-centering, \code{degroup()} can also use the mode or median for centering.
}
\section{Heterogeneity Bias}{
Mixed models include different levels of sources of variability, i.e.
error terms at each level. When macro-indicators (or level-2 predictors,
or higher-level units, or more generally: \emph{group-level predictors that
\strong{vary} within and across groups}) are included as fixed effects (i.e.
treated as covariates at level-1), the variance that is left unaccounted for
by this covariate will be absorbed into the error terms of level-1 and level-2
(\emph{Bafumi and Gelman 2006; Gelman and Hill 2007, Chapter 12.6}):
"Such covariates contain two parts: one that is specific to the higher-level
entity that does not vary between occasions, and one that represents the
difference between occasions, within higher-level entities" (\emph{Bell et al. 2015}).
Hence, the error terms will be correlated with the covariate, which violates
one of the assumptions of mixed models (iid, independent and identically
distributed error terms). This bias is also called the \emph{heterogeneity bias}
(\emph{Bell et al. 2015}). To resolve this problem, level-2 predictors used as
(level-1) covariates should be separated into their "within" and "between"
effects by "de-meaning" and "group-meaning": After demeaning time-varying
predictors, "at the higher level, the mean term is no longer constrained by
Level 1 effects, so it is free to account for all the higher-level variance
associated with that variable" (\emph{Bell et al. 2015}).
}
\section{Panel data and correlating fixed and group effects}{
\code{demean()} is intended to create group- and de-meaned variables for panel
regression models (fixed effects models), or for complex
random-effect-within-between models (see \emph{Bell et al. 2015, 2018}), where
group-effects (random effects) and fixed effects correlate (see
\emph{Bafumi and Gelman 2006}). This can happen, for instance, when analyzing
panel data, which can lead to \emph{Heterogeneity Bias}. To control for correlating
predictors and group effects, it is recommended to include the group-meaned
and de-meaned version of \emph{time-varying covariates} (and group-meaned version
of \emph{time-invariant covariates} that are on a higher level, e.g. level-2
predictors) in the model. This way, one can fit complex multilevel models for
panel data, including time-varying predictors, time-invariant predictors and
random effects.
}
\section{Why mixed models are preferred over fixed effects models}{
A mixed models approach can model the causes of endogeneity explicitly
by including the (separated) within- and between-effects of time-varying
fixed effects and including time-constant fixed effects. Furthermore,
mixed models also include random effects, thus a mixed models approach
is superior to classic fixed-effects models, which lack information on
variation in the group-effects or between-subject effects. In addition,
fixed effects regression cannot include random slopes, which means that
fixed effects regressions are neglecting "cross-cluster differences in the
effects of lower-level controls (which) reduces the precision of estimated
context effects, resulting in unnecessarily wide confidence intervals and
low statistical power" (\emph{Heisig et al. 2017}).
}
\section{Terminology}{
The group-meaned variable is simply the mean of an independent variable
within each group (or id-level or cluster) represented by \code{by}, i.e. the
cluster-mean of that variable. The regression coefficient of a
group-meaned variable is the \emph{between-subject-effect}. The de-meaned variable
is the original variable centered on its group mean, i.e. each value minus
the corresponding value of the group-meaned variable. De-meaning is
sometimes also called person-mean centering or centering within clusters.
The regression coefficient of a de-meaned variable represents the
\emph{within-subject-effect}.
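
Expressed as a formula (a sketch for a single grouping variable; the notation
is introduced here for illustration only), let \eqn{x_{ij}}{x_ij} be the value
of observation \eqn{i} in group \eqn{j}, and \eqn{n_j} the number of
observations in that group. The group-meaned and de-meaned variables are then
\deqn{\bar{x}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{ij}, \qquad x^{within}_{ij} = x_{ij} - \bar{x}_j}{mean_j = sum_i(x_ij) / n_j,   x_within_ij = x_ij - mean_j}
so that the de-meaned variable has a mean of zero within every group, while
the group-meaned variable is constant within each group.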
}
\section{De-meaning with continuous predictors}{
For continuous time-varying predictors, the recommendation is to include
both their de-meaned and group-meaned versions as fixed effects, but not
the raw (untransformed) time-varying predictors themselves. The de-meaned
predictor should also be included as random effect (random slope). In
regression models, the coefficient of the de-meaned predictors indicates
the within-subject effect, while the coefficient of the group-meaned
predictor indicates the between-subject effect.
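
As a hedged illustration of this recommendation, a within-between model for
an outcome \eqn{y_{ij}}{y_ij} (observation \eqn{i} in group \eqn{j}) with one
continuous time-varying predictor \eqn{x_{ij}}{x_ij} could be written as
follows. The symbols are chosen here for illustration and are not used
elsewhere in this documentation:
\deqn{y_{ij} = \beta_0 + \beta_W (x_{ij} - \bar{x}_j) + \beta_B \bar{x}_j + u_{0j} + u_{1j} (x_{ij} - \bar{x}_j) + e_{ij}}{y_ij = b0 + bW * (x_ij - mean_j) + bB * mean_j + u0_j + u1_j * (x_ij - mean_j) + e_ij}
Here, \eqn{\beta_W}{bW} is the within-subject effect, \eqn{\beta_B}{bB} the
between-subject effect, \eqn{u_{0j}}{u0_j} the random intercept,
\eqn{u_{1j}}{u1_j} the random slope of the de-meaned predictor and
\eqn{e_{ij}}{e_ij} the residual. Assuming the default suffixes and a grouping
variable named \code{ID}, a matching \strong{lme4}-style formula would be
\code{y ~ x_within + x_between + (1 + x_within | ID)}.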
}
\section{De-meaning with binary predictors}{
For binary time-varying predictors, there are two recommendations. The first
is to include the raw (untransformed) binary predictor as fixed effect
only and the \emph{de-meaned} variable as random effect (random slope).
The alternative is to add the de-meaned version(s) of binary
time-varying covariates as additional fixed effects as well (instead of
adding them as random slopes). Centering time-varying binary variables to
obtain within-effects (level 1) isn't necessary. They have a sensible
interpretation when left in the typical 0/1 format (\emph{Hoffman 2015,
chapter 8-2.I}). \code{demean()} will thus coerce categorical time-varying
predictors to numeric to compute the de- and group-meaned versions for
these variables, where the raw (untransformed) binary predictor and the
de-meaned version should be added to the model.
}
\section{De-meaning of factors with more than 2 levels}{
Factors with more than two levels are demeaned in two ways: first, these
are also converted to numeric and de-meaned; second, dummy variables
are created (binary, with 0/1 coding for each level) and these binary
dummy-variables are de-meaned in the same way (as described above).
Packages like \strong{panelr} internally convert factors to dummies before
demeaning, so this behaviour can be mimicked here.
}
\section{De-meaning interaction terms}{
There are multiple ways to deal with interaction terms of within- and
between-effects.
\itemize{
\item A classical approach is to simply use the product term of the de-meaned
variables (i.e. introducing the de-meaned variables as interaction term
in the model formula, e.g. \code{y ~ x_within * time_within}). This approach,
however, might be subject to bias (see \emph{Giesselmann & Schmidt-Catran 2020}).
\item Another option is to first calculate the product term and then apply the
de-meaning to it. This approach produces an estimator "that reflects
unit-level differences of interacted variables whose moderators vary
within units", which is desirable if \emph{no} within interaction of
two time-dependent variables is required. This is what \code{demean()} does
internally when \code{select} contains interaction terms.
\item A third option, when the interaction should result in a genuine within
estimator, is to "double de-mean" the interaction terms
(\emph{Giesselmann & Schmidt-Catran 2020}). However, this is currently
not supported by \code{demean()}. If this is required, the \code{wmb()}
function from the \strong{panelr} package should be used.
}
To de-mean interaction terms for within-between models, simply specify
the term as interaction for the \code{select}-argument, e.g. \code{select = "a*b"}
(see 'Examples').
}
\section{De-meaning for cross-classified designs}{
\code{demean()} can handle cross-classified designs, where the data has two or
more groups at the higher (i.e. second) level. In such cases, the
\code{by}-argument can identify two or more variables that represent the
cross-classified group- or cluster-IDs. The de-meaned variables for
cross-classified designs are computed by subtracting all group means from each
individual value, i.e. \emph{fully cluster-mean-centering} (see \emph{Guo et al. 2024}
for details). Note that de-meaning for cross-classified designs is \emph{not}
equivalent to de-meaning of nested data structures from models with three or
more levels. Set \code{nested = TRUE} to explicitly assume a nested design. For
cross-classified designs, de-meaning is supposed to work for models like
\code{y ~ x + (1|level3) + (1|level2)}, but \emph{not} for models like
\code{y ~ x + (1|level3/level2)}. Note that \code{demean()} and \code{degroup()} can't
handle a mix of nested and cross-classified designs in one model.
}
\section{De-meaning for nested designs}{
\emph{Brincks et al. (2017)} have suggested an algorithm to center variables for
nested designs, which is implemented in \code{demean()}. For nested designs, set
\code{nested = TRUE} \emph{and} specify the variables that indicate the different
levels in descending order in the \code{by} argument. E.g.,
\verb{by = c("level4", "level3", "level2")} assumes a model like
\code{y ~ x + (1|level4/level3/level2)}. An alternative notation for the
\code{by}-argument would be \code{by = "level4/level3/level2"}, similar to the
formula notation.
}
\section{Analysing panel data with mixed models using lme4}{
A description of how to translate the formulas described in \emph{Bell et al. 2018}
into R using \code{lmer()} from \strong{lme4} can be found in
\href{https://easystats.github.io/parameters/articles/demean.html}{this vignette}.
}
\examples{
data(iris)
iris$ID <- sample(1:4, nrow(iris), replace = TRUE) # fake-ID
iris$binary <- as.factor(rbinom(150, 1, .35)) # binary variable
x <- demean(iris, select = c("Sepal.Length", "Petal.Length"), by = "ID")
head(x)
x <- demean(iris, select = c("Sepal.Length", "binary", "Species"), by = "ID")
head(x)
# demean interaction term x*y
dat <- data.frame(
a = c(1, 2, 3, 4, 1, 2, 3, 4),
x = c(4, 3, 3, 4, 1, 2, 1, 2),
y = c(1, 2, 1, 2, 4, 3, 2, 1),
ID = c(1, 2, 3, 1, 2, 3, 1, 2)
)
demean(dat, select = c("a", "x*y"), by = "ID")
# or in formula-notation
demean(dat, select = ~ a + x * y, by = ~ID)
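# A minimal base-R sketch of what group- and de-meaning compute for a single
# grouping variable. This is an illustration only, not the internal
# implementation of demean(); the toy data and column names are made up.
toy <- data.frame(
  id = c(1, 1, 2, 2),
  x = c(1, 3, 10, 14)
)
# group-meaned variable: mean of x within each level of id
toy$x_between <- ave(toy$x, toy$id)
# de-meaned variable: deviation of each value from its group mean
toy$x_within <- toy$x - toy$x_between
toy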
}
\references{
\itemize{
\item Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors
and Group Effects Correlate. Philadelphia, PA: Annual meeting of the
American Political Science Association.
\item Bell A, Fairbrother M, Jones K. 2019. Fixed and Random Effects
Models: Making an Informed Choice. Quality & Quantity, 53, 1051–1074.
\item Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects
Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
Research and Methods, 3(1), 133–153.
\item Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J.,
Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in
Three-Level Contextual Models. Multivariate Behavioral Research, 52(2),
149–163. https://doi.org/10.1080/00273171.2016.1256753
\item Gelman A, Hill J. 2007. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Analytical Methods for Social Research.
Cambridge, New York: Cambridge University Press
\item Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed
effects regression models. Sociological Methods & Research, 1–28.
https://doi.org/10.1177/0049124120914934
\item Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects
in cross-classified multilevel models. Behavior Research Methods, 56(4),
3023–3057.
\item Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity:
Why Multilevel Models May Benefit from Accounting for Cross-Cluster
Differences in the Effects of Controls. American Sociological Review 82
(4): 796–827.
\item Hoffman L. 2015. Longitudinal analysis: modeling within-person
fluctuation and change. New York: Routledge
}
}
\seealso{
If grand-mean centering (instead of centering within-clusters)
is required, see \code{\link[=center]{center()}}. See \code{\link[performance:check_heterogeneity_bias]{performance::check_heterogeneity_bias()}}
to check for heterogeneity bias.
}