\chapter{Degrees of freedom correction}
\label{chap:df}
\section{Introduction}
This chapter gives a brief account of the issue of correction for
degrees of freedom in the context of econometric modeling, leading
up to a discussion of the policies adopted in \app{gretl} in this
regard. We also explain how to supplement the results produced
automatically by \app{gretl} if you want to apply such a correction
where \app{gretl} does not, or vice versa.
\section{Back to basics}
It's well known that given a sample, $\{x_i\}$, of size $n$ from a
normally distributed population, the Maximum Likelihood (ML) estimator
of the population variance, $\sigma^2$, is
%
\begin{equation}
\label{eq:sigma-hat}
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2
\end{equation}
%
where $\bar{x}$ is the sample mean, $n^{-1} \sum_{i=1}^n x_i$.

It's also well known that $\hat{\sigma}^2$, while consistent,
is a biased estimator, and is commonly replaced by the ``sample
variance'', namely,
%
\begin{equation}
\label{eq:sample-variance}
s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
\end{equation}

The intuition behind the bias in (\ref{eq:sigma-hat}) is quite
straightforward. First, the quantity we seek to estimate is defined
as
%
\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2
\]
%
for a population of size $N$ with mean $\mu$. Second, $\bar{x}$ is an
estimator of $\mu$; it is easily shown to be the least-squares
estimator, and also (assuming normality) the ML estimator. It is
unbiased, but is of course subject to sampling error; in any given
sample it is highly unlikely that $\bar{x} = \mu$. Given that
$\bar{x}$ is the least-squares estimator, the sum of squared
deviations of the $x_i$ from \textit{any} value other than $\bar{x}$
must exceed the summation in (\ref{eq:sigma-hat}). And since $\mu$ is
almost certainly not equal to $\bar{x}$, the sum of squared
deviations of the $x_i$ from $\mu$ will surely exceed the sum of
squared deviations from $\bar{x}$. It follows that the expected value
of $\hat{\sigma}^2$ falls short of the population variance.

The proof that $s^2$ is indeed an unbiased estimator can be found in
any good statistics textbook, where we also learn that the divisor
$n-1$ in (\ref{eq:sample-variance}) falls under the general
description of the ``degrees of freedom'' of the calculation at hand.
Given $\bar{x}$, the $n$ sample values provide only $n-1$ independent
items of information, since the $n$th value can always be deduced via
the formula for $\bar{x}$.
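
For readers who prefer a numerical demonstration, the following
script (a minimal sketch in \app{gretl}'s hansl language, using
simulated standard-normal data so that $\sigma^2 = 1$ by
construction; the seed and number of replications are arbitrary)
compares the two estimators across many samples. The mean of the ML
estimates should come out close to $(n-1)/n$, while the mean of the
sample variances should be close to 1.
\begin{verbatim}
# Monte Carlo sketch: bias of the ML variance estimator
nulldata 30                # sample size n = 30
set seed 20240101          # arbitrary seed, for replicability
scalar reps = 10000
matrix R = zeros(reps, 2)
loop i=1..reps
    series x = normal()               # true variance is 1
    scalar ssd = sum((x - mean(x))^2) # sum of squared deviations
    R[i,1] = ssd / $nobs              # ML estimator (biased)
    R[i,2] = ssd / ($nobs - 1)        # sample variance (unbiased)
endloop
printf "mean of ML estimates:     %g\n", meanc(R)[1]
printf "mean of sample variances: %g\n", meanc(R)[2]
\end{verbatim}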
\section{Application to OLS regression}
The argument above carries over into the usual calculation of standard
errors in the context of OLS regression as applied to the linear model
$y = X\beta + u$. If the disturbances, $u$, are assumed to be
independently and identically distributed (IID), then the variance of
the OLS estimator, $\hat\beta$, is given by
%
\[
\mbox{Var}\left(\hat\beta\right) = \sigma^2 (X'X)^{-1}
\]
%
where $\sigma^2$ is the variance of the error term and $X$ is an
$n\times k$ matrix of regressors. But how should the unknown
$\sigma^2$ be estimated? The ML estimator is
%
\begin{equation}
\label{eq:ols-sigma2}
\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \hat{u}^2_i
\end{equation}
%
where the $\hat{u}_i$ are the regression residuals, $\hat{u}_i = y_i
- X_i\hat\beta$.
But this estimator is biased and we typically use the unbiased
counterpart
%
\begin{equation}
\label{eq:ols-s2}
s^2 = \frac{1}{n-k} \sum_{i=1}^n \hat{u}_i^2
\end{equation}
%
in which $n - k$ is the number of degrees of freedom given $n$
residuals from a regression where $k$ parameters are estimated.

The standard estimator of the variance of $\hat\beta$ in the context
of OLS is then $V = s^2 (X'X)^{-1}$, and the standard errors of the
individual parameter estimates, $s_{\hat{\beta}_i}$, being the square
roots of the diagonal elements of $V$, inherit a ``degrees of freedom
(df) correction'' from the estimator $s^2$.
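
To make the mechanics explicit, here is a short hansl sketch that
reconstructs $s^2$ and the df-corrected standard errors ``by hand''
from the OLS residuals. It uses the \texttt{data4-1} practice file
shipped with \app{gretl} purely for illustration; any dataset and
regressor list would do.
\begin{verbatim}
open data4-1
list L = const sqft
ols price L
matrix u = {$uhat}         # residuals as a column vector
scalar s2 = u'u / $df      # $df = n - k; should equal $sigma^2
matrix X = {L}             # n x k regressor matrix
matrix V = s2 * inv(X'X)   # estimated covariance matrix
matrix se = sqrt(diag(V))  # df-corrected standard errors
print se
\end{verbatim}
The vector \texttt{se} should agree with the standard errors shown in
the model output, and with the \texttt{\$stderr} accessor.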

Going one step further, consider hypothesis testing in the context of
OLS. Since the variance of $\hat\beta$ is unknown and must itself be
estimated, the sampling distribution of the OLS coefficients is not,
strictly speaking, normal. But if the \textit{disturbances} are
normally distributed (besides being IID) then, even in small samples,
the ratio $(\hat{\beta}_i-\beta_i)/s_{\hat{\beta}_i}$ follows a
distribution that can be specified exactly, namely the Student's $t$
distribution with degrees of freedom $\nu = n-k$.
That is, besides using a df correction in computing the standard
errors of the OLS coefficients, one uses the same $\nu$ in selecting
the particular distribution to which the ``$t$-ratio'',
$(\hat{\beta}_i-\beta^0)/s_{\hat{\beta}_i}$, should be referred in
order to determine the marginal significance level or $p$-value for
the null hypothesis $H_0: \beta_i = \beta^0$. This is the real
payoff to df correction: we get test statistics that follow a
known distribution in small samples. (In big enough samples
the point is moot, since the quantitative distinction between
$\hat\sigma^2$ and $s^2$ vanishes.)
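
In hansl terms (again just a sketch, reusing the regression from the
previous example), the small-sample and asymptotic ways of converting
a $t$-ratio into a $p$-value look like this:
\begin{verbatim}
ols price const sqft --quiet
scalar tstat = $coeff[2] / $stderr[2]
scalar p_t = 2 * pvalue(t, $df, abs(tstat))  # Student's t(n-k)
scalar p_z = 2 * pvalue(z, abs(tstat))       # asymptotic N(0,1)
printf "t-ratio %g: p = %g (t), p = %g (z)\n", tstat, p_t, p_z
\end{verbatim}
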
So far, so good. Everyone expects df correction in OLS standard
errors, just as we expect division by $n-1$ in the sample variance.
And users of econometric software expect that the $p$-values reported
for OLS coefficients will be based on the $t(\nu)$ distribution
(although they are not always sufficiently aware that the validity
of such statistics in small samples depends on the assumption
of normally distributed errors).
\section{Beyond OLS}
The situation is different when we move beyond estimation of
the classical linear model via OLS. We may wish to estimate nonlinear
models (sometimes by least squares), and many models of interest to
econometricians are commonly estimated via maximization of a
likelihood function, or via the generalized method of moments (GMM).
In such cases we do not, in general, have exact small-sample results
to rely upon; in particular, we cannot assume that coefficient
estimates follow the $t$ distribution. Rather, we typically appeal to
asymptotic results in statistical theory. We seek \textit{consistent}
estimators which, although they may be biased, nonetheless converge in
probability to the corresponding parameter values as the sample size
goes to infinity. Under the right conditions, laws of large numbers
and central limit theorems entitle us to expect that test statistics
will converge to the normal distribution, or the $\chi^2$ distribution
for multivariate tests, given big enough samples.
\subsection{To ``correct'' or not?}
The question arises, should we or should we not apply a df
``correction'' in reporting variance estimates and standard errors
for models that depart from the classical linear specification?

One argument against applying df adjustment is that it lacks a
theoretical basis in relation to hypothesis testing, in that it does
not produce test statistics that follow any known distribution in
small samples. In addition, if parameter estimates are obtained via
ML, it makes sense to report ML estimates of variances even if these
are biased, since it is the ML quantities that are used in computing
the criterion function and in forming likelihood-ratio tests.

On the other hand, pragmatic arguments for doing df adjustment are (a)
that it makes for closer comparability between regular OLS estimates
and nonlinear ones, and (b) that it provides a ``pinch of salt'' in
relation to small-sample results, even if it lacks rigorous
justification.

Note that even for fairly small samples, the difference between the
biased and unbiased estimators in equations (\ref{eq:sigma-hat}) and
(\ref{eq:sample-variance}) above will be small. For example, if
$n=30$ then $s^2 = \frac{30}{29} \hat\sigma^2$. In econometric
modelling proper, however, the difference can be quite substantial.
If $n=50$ and $k=10$, the $s^2$ defined in (\ref{eq:ols-s2}) will be
$50/40 = 1.25$ times as large as the $\hat\sigma^2$ in
(\ref{eq:ols-sigma2}), and standard errors will be about 12 percent
larger.\footnote{A fairly typical situation in time-series
macroeconometrics would be to have between 100 and 200 quarterly
observations, and to be estimating up to maybe 30 parameters
including lags. In this case df correction would make a difference
to standard errors on the order of 10 percent.} One might therefore
make a case for ``inflating'' the standard errors obtained via
nonlinear estimators as a precaution against taking results to be
``more precise than they really are''.
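
If you wish to impose such an inflation factor on estimates for which
\app{gretl} reports only uncorrected standard errors (or to remove
the factor where it is applied), the adjustment is easily scripted.
The sketch below assumes a model has just been estimated, so that the
\texttt{\$stderr}, \texttt{\$nobs} and \texttt{\$ncoeff} accessors
are available.
\begin{verbatim}
# inflate asymptotic standard errors by sqrt(n/(n-k));
# divide by the same factor to go the other way
scalar dfc = sqrt($nobs / ($nobs - $ncoeff))
matrix se_adj = dfc * $stderr
print se_adj
\end{verbatim}
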
In rejoinder to the last point, one might equally say that savvy
econometricians should know to apply a discount factor (albeit an
imprecise one) to small-sample estimates outside of the classical,
normal linear model---or even that they should distrust such results
and insist on large samples before making inferences. This line of
thinking suggests that test statistics such as $z =
\hat{\beta}_i/\hat\sigma_{\hat{\beta}_i}$ should be referred to the
distribution to which they conform asymptotically---in this case
$N(0,1)$ for $H_0: \beta_i = 0$---if and only if the conditions for
appealing to asymptotic results can be taken to be met. From this
point of view df adjustment may be seen as providing a false sense of
security.
\section{Consistency and ``grey'' cases}
Consistency (in the ordinary sense of uniformity of treatment) is a
bugbear when dealing with this issue. To give a simple example,
suppose an econometrics program follows the policy of applying df
correction for OLS estimation but not for ML estimation. One is, of
course, free to estimate the classical, normal linear model via ML, in
which case $\hat\beta$ should be numerically identical to that
obtained via OLS. But the user of the software will obtain two
different sets of standard errors depending on the estimation method.
This example is not very troublesome, however; presumably one would
apply ML to the classical linear model only to make a pedagogical
point.

Here is a more awkward case. An unrestricted vector autoregression
(VAR) is a system of equations, but the ML estimate of this system,
given normal errors, is equivalent to equation-by-equation OLS.
Should df correction be applied to VARs? Consistency with OLS argues
Yes. But wait\dots{} a popular extension of the VAR methodology is the
vector error-correction model (VECM). VECMs are closely related to
VARs and one might well be interested in making comparisons across the
two, but a VECM is a nonlinear system and the cointegrating vectors
that lie at the heart of this model must be estimated via ML. So
perhaps VAR results should \textit{not} be df adjusted, for
comparability with VECMs.

Another ``grey area'' is the class of single-equation Feasible
Generalized Least Squares (FGLS) estimators---for example, weighted
least squares following the estimation of a skedastic function, or
estimators designed to handle first-order autocorrelation such as
Cochrane--Orcutt. These depart from the classical linear model, and
the theoretical basis for inference in such models is asymptotic, yet
according to econometric tradition their standard errors are
generally df adjusted.
\section{What gretl does}
To be written.