\chapter{Frequently Asked Questions}
\section{How Do I Cite This Work?}
If you use \MatchIt, please cite\nocite{HoImaKin07,HoImaKin07a}
\begin{verse}
Daniel Ho; Kosuke Imai; Gary King; and Elizabeth Stuart (2007),
``Matching as Nonparametric Preprocessing for Reducing Model
Dependence in Parametric Causal Inference,'' \emph{Political
Analysis} 15(3): 199--236,
\url{http://gking.harvard.edu/files/abs/matchp-abs.shtml}.
and
Daniel Ho; Kosuke Imai; Gary King; and Elizabeth Stuart (2007b),
``MatchIt: Nonparametric Preprocessing for Parametric Causal
Inference,'' \emph{Journal of Statistical Software},
\url{http://gking.harvard.edu/matchit/}.
\end{verse}
In addition, the {\tt convex.hull} discard option is implemented via
the {\tt WhatIf} package \citep{KinZen06,KinZen07,StoKinZen05}.
Generalized linear distance measures are implemented via the {\tt
stats} package \citep{VenRip02}. Generalized additive distance
measures are implemented via the {\tt mgcv} package \citep{HasTib90}.
The neural network distance measure is implemented via the {\tt nnet}
package \citep{Ripley96}. The classification trees distance measure
is implemented via the {\tt rpart} package \citep{BreFriOls84}. Full
and optimal matching are implemented via the {\tt optmatch} package
\citep{Hansen04}. Genetic matching is implemented via the {\tt
Matching} package \citep{DiaSek05}. Coarsened exact matching is
implemented via the \texttt{cem} package
\citep{IacKinPor08,IacKinPor08b}.
\section{What If My Data Sets Are Big and Are Taking Up
Too Much Memory?}
{\tt matchit()} does not save the data set in its output object, but
it does save a matrix of the covariates. {\tt match.data()} will
create a matched data set. Once {\tt match.data()} has been called,
the original data set can be removed from the R workspace to save
memory with {\tt rm(name)}, where {\tt name} is the name of the data
set.
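
As a minimal sketch of this workflow (the model formula below is only
an illustration, and the exact arguments may differ across \MatchIt\
versions):
\begin{verbatim}
library(MatchIt)
data("lalonde", package = "MatchIt")

# Illustrative specification; substitute your own treatment and covariates.
m.out <- matchit(treat ~ age + educ + re74 + re75, data = lalonde,
                 method = "nearest")

# Build the matched data set first, while the original data still exist.
m.data <- match.data(m.out)

# The original data frame is no longer needed and can be removed.
rm(lalonde)
\end{verbatim}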
%\section{Can I use a Difference-in-Difference Estimator for Matched
% Data?}
%
%A difference-in-differences (DID) analysis can be easily conducted
%with \MatchIt. If we were interested in the DID matching estimate in
%the Lalonde data, we could simply include {\texttt re75} as a
%covariate in the preprocessing step. Then the analysis can be
%performed on the change in income from 1975 to 1978: {\tt re78}-{\tt
% re75}. Time-varying covariates (of which none exist in the Lalonde
%data) should of course also be differenced for the DID estimator.
%** we should show how to do this with zelig
\section{How Exactly Are the Weights Created?}
\label{subsec:weights}
Each type of matching method can be thought of as creating groups of
units with at least one treated unit and at least one control unit in
each. In exact matching, subclassification, or full matching, these
groups are the subclasses formed, and the number of treated and
control units will vary quite a bit across subclasses. In nearest
neighbor or optimal matching, the groups are the pairs (or sets) of
treated and control units matched. In 1:1 nearest neighbor matching
there will be one treated unit and one control unit in each group. In
2:1 nearest neighbor matching there will be one treated unit and two
control units in each group. Unmatched units receive a weight of 0.
All matched treated units receive a weight of 1. These weights are constructed
to estimate the average treatment effect on the treated, with the control group
essentially weighted to look like the treated group.
The weights for matched control units are formed as follows:
\begin{enumerate}
\item Within each group, each control unit is given a preliminary
weight of $n_{ti}/n_{ci}$, where $n_{ti}$ and $n_{ci}$ are the
number of treated and control units in group $i$, respectively.
\item If matching is done with replacement, each control unit's weight
is added up across the groups in which it was matched.
\item The control group weights are scaled to sum to the number of
uniquely matched control units.
\end{enumerate}
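
The three steps above can be traced by hand in a small hypothetical
example of 1:1 matching with replacement, in which control {\tt C1} is
matched to two treated units and control {\tt C2} to one. The group
layout and object names are made up for illustration; this is a sketch
of the arithmetic, not the internal \MatchIt\ code:
\begin{verbatim}
# Hypothetical matched groups: one row per (group, control) pairing.
groups <- data.frame(
  group   = c(1, 2, 3),
  control = c("C1", "C1", "C2"),
  n.t     = c(1, 1, 1),  # number of treated units in the group
  n.c     = c(1, 1, 1)   # number of control units in the group
)

# Step 1: preliminary weight n.t / n.c within each group.
prelim <- groups$n.t / groups$n.c

# Step 2: add up each control unit's weights across its groups.
summed <- tapply(prelim, groups$control, sum)

# Step 3: rescale so the weights sum to the number of uniquely
# matched control units.
w <- summed * length(summed) / sum(summed)
print(w)
\end{verbatim}
Here {\tt C1} receives a weight of $4/3$ and {\tt C2} a weight of
$2/3$, which sum to 2, the number of uniquely matched control units.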
With subclassification, when the analysis is done separately within
each subclass and then aggregated up across the subclasses, these
weights will generally not be used, but they may be used for full
matching or nearest neighbor matching if the number of control units
matched to each treated unit varies.
\section{How Do I Create Observation Names?}
\label{rnames}
Since the diagnostics often make use of the observation names of the
data frame, you may find it helpful to specify observation names for
the data input. Use the \texttt{row.names} command to achieve this.
For example, to assign the names ``Dan'', ``Kosuke'', ``Liz'' and
``Gary'' to a data frame with the first four observations in the
Lalonde data, type:
\begin{verbatim}
> test <- lalonde[1:4, ]
> row.names(test) <- c("Dan", "Kosuke", "Liz", "Gary")
> print(test)
age educ black hisp married nodegr re74 re75 re78 u74 u75 treat
Dan 37 11 1 0 1 1 0 0 9930 1 1 1
Kosuke 22 9 0 1 0 1 0 0 3596 1 1 1
Liz 30 12 1 0 0 0 0 0 24910 1 1 1
Gary 27 11 1 0 0 1 0 0 7506 1 1 1
\end{verbatim}
\section{How Can I See Outcomes of Matched Pairs?}
To obtain outcomes of matched pairs, recall that the original data set has unique row names, one for each
observation. In \texttt{match.matrix}, a component of the {\tt matchit()} output (here called \texttt{foo}), the
row names are the names of the treated units, and each cell contains the name of a control unit matched to
that treated unit. To obtain the matched outcomes, you can use:
\begin{verbatim}
cbind(lalonde[row.names(foo$match.matrix),"re78"], lalonde[foo$match.matrix,"re78"])
\end{verbatim}
\section{How Do I Ensure Replicability As \MatchIt\ Versions Develop?}
\label{subsec:vercontrol}
As the literature on matching techniques is rapidly evolving,
\MatchIt\ will strive to incorporate new developments. \MatchIt\ is
therefore an evolving program. Users may be concerned that analyses
written in a particular version may not be compatible with newer
versions of the program. The primary way to ensure that replication
archives remain valid is to record the version of \MatchIt\ that was
used in the analysis. Our website maintains binaries of all public
release versions, so that researchers can replicate results exactly
with the appropriate version (for Unix-based platforms, see
\hlink{http://gking.harvard.edu/src/contrib/}{http://gking.harvard.edu/src/contrib/};
for Windows, see
\hlink{http://gking.harvard.edu/bin/windows/contrib/}{http://gking.harvard.edu/bin/windows/contrib/}).
In addition, users may find it helpful to install packages with
version control, using the {\tt installWithVers} command with {\tt
install.packages}. So for example, in the Windows R console, users
may download the appropriate version from our website and install the
package with version control by:
\begin{verbatim}
install.packages(choose.files('',filters=Filters[c('zip','All'),]),
.libPaths()[1],installWithVers=T,CRAN=NULL)
\end{verbatim}
{\tt R CMD INSTALL} similarly permits users to specify this version
using the \\ {\tt --with-package-versions} option. After having
specified version control, different versions of the program may be
called as necessary. Similar advice may also be appropriate for
version control for R more generally.
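
One simple way to record the version used in an analysis is sketched
below. It relies only on standard R utilities; the output file name
{\tt matchit-version.txt} is a hypothetical choice:
\begin{verbatim}
# Look up the installed MatchIt version (requires MatchIt to be installed).
matchit.version <- as.character(utils::packageVersion("MatchIt"))

# Store it next to the analysis output; the file name is arbitrary.
writeLines(paste("MatchIt version:", matchit.version),
           "matchit-version.txt")

# sessionInfo() also reports the R version and all loaded packages.
print(sessionInfo())
\end{verbatim}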
\section{How Do I Use My Own Distance Measure with \MatchIt\,?}
A vector of your own distance measure can be used by specifying it as
the input for the {\tt distance} option in {\tt matchit()}.
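
For instance, the following sketch estimates a propensity score
separately and passes it to {\tt matchit()}. It assumes that the
installed version accepts a numeric vector with one value per
observation; the model specification is illustrative only:
\begin{verbatim}
library(MatchIt)
data("lalonde", package = "MatchIt")

# A user-supplied distance: fitted probabilities from a logistic regression.
ps.fit  <- glm(treat ~ age + educ + re74 + re75, data = lalonde,
               family = binomial)
my.dist <- fitted(ps.fit)  # one value per row of lalonde

# Pass the vector through the distance option.
m.out <- matchit(treat ~ age + educ + re74 + re75, data = lalonde,
                 method = "nearest", distance = my.dist)
\end{verbatim}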
\section{What Do I Do about Missing Data?}
\MatchIt\ requires complete data sets, with no missing values (other
than potential outcomes of course). If there are missing values in
the data set, imputation techniques should be used first to fill in
(``impute'') the missing values (both covariates and outcomes), or the
analysis should be done using only complete cases (which we do not in
general recommend). For imputation software, see Amelia at
\hlink{http://gking.harvard.edu/stats.shtml}{http://gking.harvard.edu/stats.shtml}
or other programs at
\hlink{http://www.multiple-imputation.com}{http://www.multiple-imputation.com}.
For more information on missing data and imputation methods, see
\cite{KinHonJos01}.
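
Before calling {\tt matchit()}, it can help to check where missing
values occur. The following base R sketch uses a small hypothetical
data frame; the variable names and values are made up for
illustration:
\begin{verbatim}
# Hypothetical data with two missing covariate values.
dat <- data.frame(treat = c(1, 0, 1, 0),
                  age   = c(37, NA, 30, 27),
                  educ  = c(11, 9, 12, NA))

# Number of missing values in each variable.
n.missing <- colSums(is.na(dat))
print(n.missing)

# Number of observations with at least one missing value.
n.incomplete <- sum(!complete.cases(dat))
print(n.incomplete)
\end{verbatim}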
\section{Why Preprocessing?}
The purpose of matching is to approximate an experimental template,
where the matching procedure approximates blocking prior to random
treatment assignment in order to balance covariates between treatment
and control groups. Separation of the estimation procedure into two
steps simulates the research design of an experiment, where no
information on outcomes is known at the point of experimental design
and randomization. The separation of the balancing process in
\MatchIt\ from the analysis process afterward helps keep clear the
goal of balancing control and treatment groups and makes it less
likely that the user will inadvertently cook the books in his or her
favor.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: