\chapter{Quantile regression}
\label{chap:quantreg}
\section{Introduction}
\label{sec:rq-intro}
In Ordinary Least Squares (OLS) regression, the fitted values,
$\hat{y}_i = X_i\hat{\beta}$, represent the \emph{conditional mean} of
the dependent variable---conditional, that is, on the regression
function and the values of the independent variables. In median
regression, by contrast and as the name implies, fitted values
represent the \emph{conditional median} of the dependent variable. It
turns out that the principle of estimation for median regression is
easily stated (though not so easily computed), namely, choose
$\hat{\beta}$ so as to minimize the sum of absolute residuals. Hence
the method is known as Least Absolute Deviations or LAD. While the OLS
problem has a straightforward analytical solution, LAD is a linear
programming problem.
Quantile regression is a generalization of median regression: the
regression function predicts the conditional $\tau$-quantile of the
dependent variable---for example the first quartile ($\tau = .25$)
or the ninth decile ($\tau = .90$).
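
In symbols, the $\tau$-quantile estimator solves
\[
\hat{\beta}(\tau) = \underset{\beta}{\mathrm{argmin}}
\sum_{i=1}^{n} \rho_\tau\left(y_i - X_i\beta\right), \qquad
\rho_\tau(u) = u\left(\tau - I(u < 0)\right)
\]
where $I(\cdot)$ denotes the indicator function. For $\tau = 0.5$ the
``check function'' $\rho_\tau$ reduces to $|u|/2$ and the problem is
just LAD.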
If the classical conditions for the validity of OLS are
satisfied---that is, if the error term is independently and
identically distributed, conditional on $X$---then quantile regression
is redundant: all the conditional quantiles of the dependent variable
will march in lockstep with the conditional mean. Conversely, if
quantile regression reveals that the conditional quantiles behave in a
manner quite distinct from the conditional mean, this suggests that
OLS estimation is problematic.
As of version 1.7.5, gretl offers quantile regression
functionality (in addition to basic LAD regression, which has been
available since early in gretl's history via the \texttt{lad}
command).\footnote{We gratefully acknowledge our borrowing from the
\texttt{quantreg} package for GNU \textsf{R} (version 4.17). The
core of the \texttt{quantreg} package is composed of Fortran code
written by Roger Koenker; this is accompanied by various driver and
auxiliary functions written in the \textsf{R} language by Koenker
and Martin M\"achler. The latter functions have been re-worked in C
for gretl. We have added some guards against potential
numerical problems in small samples.}
\section{Basic syntax}
The basic invocation of quantile regression is
\vspace{1em}
\noindent
\qquad \texttt{quantreg} \textsl{tau} \textsl{reglist}
\vspace{1em}
where
\begin{itemize}
\item \textsl{reglist} is a standard \textsf{gretl} regression list
(dependent variable followed by regressors, including the constant
if an intercept is wanted); and
\item \textsl{tau} is the desired conditional quantile, in the range
0.01 to 0.99, given either as a numerical value or the name of a
pre-defined scalar variable (but see below for a further option).
\end{itemize}
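As a minimal illustration (the series names here are hypothetical):
%
\begin{code}
# hypothetical series y, x1, x2 already present in the dataset
quantreg 0.25 y 0 x1 x2   # first quartile, tau given numerically

scalar tau = 0.90
quantreg tau y 0 x1 x2    # ninth decile, tau as a predefined scalar
\end{code}
%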
Estimation is via the Frisch--Newton interior point solver
\citep{portnoy97}, which is substantially faster than the
``traditional'' Barrodale--Roberts \citeyearpar{barrodale74} simplex
approach for large problems.
By default, standard errors are computed according to the asymptotic
formula given by \cite{koenker-bassett78}. Alternatively, if the
\option{robust} option is given, we use the sandwich estimator
developed in \cite{koenker-zhao94}.\footnote{These correspond to the
\texttt{iid} and \texttt{nid} options in \textsf{R}'s
\texttt{quantreg} package, respectively.}
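Continuing the hypothetical example above, the two variants would be
invoked as:
%
\begin{code}
quantreg .5 y 0 x1 x2            # Koenker-Bassett asymptotic std errors
quantreg .5 y 0 x1 x2 --robust   # Koenker-Zhao sandwich estimator
\end{code}
%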
\section{Confidence intervals}
An option \option{intervals} is available. When this is given we
print confidence intervals for the parameter estimates instead of
standard errors. These intervals are computed using the rank
inversion method and in general they are asymmetrical about the point
estimates---that is, they are not simply ``plus or minus so many
standard errors''. The specifics of the calculation are inflected by
the \option{robust} option: without this, the intervals are computed
on the assumption of IID errors \citep{koenker94}; with it, they use
the heteroskedasticity-robust estimator developed by
\cite{koenker-machado99}.
By default, 90 percent intervals are produced. You can change this by
appending a confidence value (expressed as a decimal fraction) to the
intervals option, as in
\vspace{1em}
\noindent
\qquad \texttt{quantreg} \textsl{tau} \textsl{reglist} \verb|--intervals=.95|
\vspace{1em}
When the confidence intervals option is selected, the parameter
estimates are calculated using the Barrodale--Roberts method. This is
simply because the Frisch--Newton code does not currently support the
calculation of confidence intervals.
Two further details. First, the mechanisms for generating confidence
intervals for quantile estimates require that the model has at least
two regressors (including the constant). If the \option{intervals}
option is given for a model containing only one regressor, an error is
flagged. Second, when a model is estimated in this mode, you can
retrieve the confidence intervals using the accessor \dollar{coeff\_ci}.
This produces a $k \times 2$ matrix, where $k$ is the number of
regressors. The lower bounds are in the first column, the upper
bounds in the second. See also section~\ref{sec:bigdata} below.
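A brief sketch of this retrieval, again with hypothetical variable
names:
%
\begin{code}
quantreg .5 y 0 x1 x2 --intervals=.95
matrix ci = $coeff_ci   # k x 2 matrix: lower bounds, upper bounds
print ci
\end{code}
%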
\section{Multiple quantiles}
As a further option, you can give \textsl{tau} as a matrix---either
the name of a predefined matrix or in numerical form, as in
\verb+{.05, .25, .5, .75, .95}+. The given model is estimated for all
the $\tau$ values and the results are printed in a special form, as
shown below (in this case the \option{intervals} option was also
given).
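A call on the following pattern (sketched using the variable names
from the output shown) produces such a display:
%
\begin{code}
matrix taus = {.05, .25, .5, .75, .95}
quantreg taus foodexp 0 income --intervals
\end{code}
%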
{\small
\begin{verbatim}
Model 1: Quantile estimates using the 235 observations 1-235
Dependent variable: foodexp
With 90 percent confidence intervals

  VARIABLE     TAU   COEFFICIENT      LOWER       UPPER

  const       0.05      124.880     98.3021     130.517
              0.25      95.4835     73.7861     120.098
              0.50      81.4822     53.2592     114.012
              0.75      62.3966     32.7449     107.314
              0.95      64.1040     46.2649     83.5790

  income      0.05     0.343361    0.343327    0.389750
              0.25     0.474103    0.420330    0.494329
              0.50     0.560181    0.487022    0.601989
              0.75     0.644014    0.580155    0.690413
              0.95     0.709069    0.673900    0.734441
\end{verbatim}
}
The gretl GUI has an entry for Quantile Regression (under
\textsf{/Model/Robust estimation}), and you can select multiple
quantiles there too. In that context, just give space-separated
numerical values (as per the predefined options, shown in a drop-down
list).
When you estimate a model in this way most of the standard menu items
in the model window are disabled, but one extra item is
available---graphs showing the $\tau$ sequence for a given coefficient
in comparison with the OLS coefficient. An example is shown in
Figure~\ref{fig:tau}. This sort of graph provides a simple means of
judging whether quantile regression is redundant (OLS is fine) or
informative.
In the example shown---based on data on household income and food
expenditure gathered by Ernst Engel (1821--1896)---it seems clear
that simple OLS regression is potentially misleading. The
``crossing'' of the OLS estimate by the quantile estimates is very
marked.
\begin{figure}
\centering
\includegraphics{figures/tau-sequence}
\caption{Regression of food expenditure on income; Engel's data}
\label{fig:tau}
\end{figure}
However, it is not always clear what implications should be drawn from
this sort of conflict. With the Engel data there are two issues to
consider. First, Engel's famous ``law'' claims an income-elasticity
of food consumption that is less than one, and talk of elasticities
suggests a logarithmic formulation of the model. Second, there are
two apparently anomalous observations in the data set: household 105
has the third-highest income but unexpectedly low expenditure on food
(as judged from a simple scatter plot), while household 138 (which
also has unexpectedly low food consumption) has much the highest
income, almost twice that of the next highest.
With $n = 235$ it seems reasonable to consider dropping these
observations. If we do so, and adopt a log--log formulation, we get
the plot shown in Figure~\ref{fig:tau2}. The quantile estimates still
cross the OLS estimate, but the ``evidence against OLS'' is much less
compelling: the 90 percent confidence bands of the respective
estimates overlap at all the quantiles considered.
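
A sketch of that respecification, assuming the full Engel dataset is
loaded and that the anomalous households are at observations 105 and
138 as described above:
%
\begin{code}
# drop the two anomalous households, then estimate in logs
smpl obs != 105 && obs != 138 --restrict
series l_foodexp = log(foodexp)
series l_income  = log(income)
matrix taus = {.05, .25, .5, .75, .95}
quantreg taus l_foodexp 0 l_income --intervals
\end{code}
%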
\begin{figure}
\centering
\includegraphics{figures/tau-sequence2}
\caption{Log--log regression; 2 observations dropped from full Engel data
set.}
\label{fig:tau2}
\end{figure}
\section{Large datasets}
\label{sec:bigdata}
As noted above, when you give the \option{intervals} option with the
\texttt{quantreg} command, which calls for estimation of confidence
intervals via rank inversion, gretl switches from the default
Frisch--Newton algorithm to the Barrodale--Roberts simplex method.
This is OK for moderately large datasets (up to, say, a few thousand
observations) but on very large problems the simplex algorithm may
become seriously bogged down. For example, \cite{koenker-hallock01}
present an analysis of the determinants of birth weights, using
198,377 observations and 15 regressors. Generating confidence
intervals via Barrodale--Roberts for a single value of $\tau$ took
about half an hour on a Lenovo Thinkpad T60p with 1.83GHz Intel Core 2
processor.
If you want confidence intervals in such cases, you are advised not to
use the \option{intervals} option, but to compute them using the
method of ``plus or minus so many standard errors''. (One
Frisch--Newton run took about 8 seconds on the same machine, showing
the superiority of the interior point method.) The script below
illustrates:
%
\begin{code}
# tau = 0.10; y and xlist (a named list of regressors) are predefined
quantreg .10 y 0 xlist
# critical value for a two-sided 90 percent interval
scalar crit = qnorm(.95)
matrix ci = $coeff - crit * $stderr
ci = ci ~ ($coeff + crit * $stderr)
print ci
\end{code}
%
The matrix \texttt{ci} will contain the lower and upper bounds of the
(symmetrical) 90 percent confidence intervals.
To avoid a situation where gretl becomes unresponsive for a very
long time we have set the maximum number of iterations for the
Barrodale--Roberts algorithm to the (somewhat arbitrary) value of
1000. We will experiment further with this, but in the meantime, if
you really want to use this method on a large dataset, and don't mind
waiting for the results, you can increase the limit using the
\texttt{set} command with parameter \verb|rq_maxiter|, as in
\begin{code}
set rq_maxiter 5000
\end{code}