File: autoregressive_distributed_lag.py

package info (click to toggle)
statsmodels 0.14.6%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 49,956 kB
  • sloc: python: 254,365; f90: 612; sh: 560; javascript: 337; asm: 156; makefile: 145; ansic: 32; xml: 9
file content (432 lines) | stat: -rw-r--r-- 15,308 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
#!/usr/bin/env python
# coding: utf-8

# DO NOT EDIT
# Autogenerated from the notebook autoregressive_distributed_lag.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# # Autoregressive Distributed Lag (ARDL) models
#
#
# ## ARDL Models
#
# Autoregressive Distributed Lag (ARDL) models extend Autoregressive
# models with lags of explanatory variables. While ARDL models are
# technically AR-X models, the key difference is that ARDL models focus on
# the exogenous variables and selecting the correct lag structure from both
# the endogenous variable and the exogenous variables.  ARDL models are also
# closely related to Vector Autoregressions, and a single ARDL is
# effectively one row of a VAR.  The key distinction is that an ARDL assumes
# that the exogenous variables are exogenous in the sense that it is not
# necessary to include the endogenous variable as a predictor of the
# exogenous variables.
#
# The full specification of ARDL models is
#
# $$
# Y_t = \underset{\text{Constant and Trend}}{\underbrace{\delta_0 +
# \delta_1 t + \ldots + \delta_k t^k}}
#       + \underset{\text{Seasonal}}{\underbrace{\sum_{i=0}^{s-1} \gamma_i
# S_i}}
#       + \underset{\text{Autoregressive}}{\underbrace{\sum_{p=1}^P \phi_p
# Y_{t-p}}}
#       + \underset{\text{Distributed Lag}}{\underbrace{\sum_{k=1}^M
# \sum_{j=0}^{Q_k} \beta_{k,j} X_{k, t-j}}}
#       + \underset{\text{Fixed}}{\underbrace{Z_t \Gamma}} + \epsilon_t
# $$
#
# The terms in the model are:
#
# * $\delta_i$: constant and deterministic time regressors.  Set using
# `trend`.
# * $S_i$ are seasonal dummies which are included if `seasonal=True`.
# * $X_{k,t-j}$ are the exogenous regressors. There are a number of
# formats that can be used to specify which lags are included. Note that the
# included lag lengths do no need to be the same. If `causal=True`, then the
# lags start with lag 1. Otherwise lags begin with 0 so that the model
# included the contemporaneous relationship between $Y_t$ and $X_t$.
# * $Z_t$ are any other fixed regressors that are not part of the
# distributed lag specification. In practice these regressors may be
# included when they do no contribute to the long run-relationship between
# $Y_t$ and the vector of exogenous variables $X_t$.
# * $\{\epsilon_t\}$ is assumed to be a White Noise process

import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style("darkgrid")
sns.mpl.rc("figure", figsize=(16, 6))
sns.mpl.rc("font", size=14)

# ### Data
#
# This notebook makes use of money demand data from Denmark, as first used
# in  S. Johansen and K. Juselius (1990).  The key variables are:
#
# * `lrm`: Log of real money measured using M2
# * `lry`: Log of real income
# * `ibo`: Interest rate on bonds
# * `ide`: Interest rate of bank deposits
#
# The standard model uses `lrm` as the dependent variable and the other
# three as exogenous drivers.
#
# Johansen, S. and Juselius, K. (1990), Maximum Likelihood Estimation and
# Inference on Cointegration – with Applications to the Demand for Money,
# Oxford Bulletin of Economics and Statistics, 52, 2, 169–210.
#
# We start by loading the data and examining it.

from statsmodels.datasets.danish_data import load
from statsmodels.tsa.api import ARDL
from statsmodels.tsa.ardl import ardl_select_order

data = load().data
data = data[["lrm", "lry", "ibo", "ide"]]
data.tail()

# We plot the demeaned data so that all series appear on the same scale.
# The `lrm` series appears to be non-stationary, as does `lry`. The
# stationarity of the other two is less obvious.

_ = (data - data.mean()).plot()

# ### Model Selection
#
# `ardl_select_order` can be used to automatically select the order. Here
# we use min the minimum AIC among all modes that consider up to 3 lags of
# the endogenous variable and 3 lags of each exogenous variable. `trend="c"`
# indicates that a constant should be included in the model.

sel_res = ardl_select_order(data.lrm,
                            3,
                            data[["lry", "ibo", "ide"]],
                            3,
                            ic="aic",
                            trend="c")
print(f"The optimal order is: {sel_res.model.ardl_order}")

# The optimal order is returned as the number of lags of the endogenous
# variable followed by each of the exogenous regressors.  The attribute
# `model` on `sel_res` contains the model `ARDL` specification which can be
# used to call `fit`. Here we look at the summary where the `L#` indicates
# that lag length (e.g., `L0` is no lag, i.e., $X_{k,t}$, `L2` is 2 lags,
# i.e., $X_{k,t-2}$).

res = sel_res.model.fit()
res.summary()

# ### Global searches
#
# The selection criteria can be switched the BIC which chooses a smaller
# model. Here we also use the `glob=True` option to perform a global search
# which considers models with any subset of lags up to the maximum lag
# allowed (3 here).  This option lets the model selection choose non-
# contiguous lag specifications.

sel_res = ardl_select_order(data.lrm,
                            3,
                            data[["lry", "ibo", "ide"]],
                            3,
                            ic="bic",
                            trend="c",
                            glob=True)
sel_res.model.ardl_order

# While the `ardl_order` shows the largest included lag of each variable,
# `ar_lags` and `dl_lags` show the specific lags included.  The AR component
# is regular in the sense that all 3 lags are included.  The DL component is
# not since `ibo` selects only lags 0 and 3 and ide selects only lags 2.

sel_res.model.ar_lags

sel_res.model.dl_lags

# We can take a look at the best performing models according to the BIC
# which are stored in the `bic` property. `ibo` at lags 0 and 3 is
# consistently selected, as is `ide` at either lag 2 or 3, and `lry` at lag
# 0. The selected AR lags vary more, although all of the best specifications
# select some.

for i, val in enumerate(sel_res.bic.head(10)):
    print(f"{i+1}: {val}")

# ### Direct Parameterization
#
# ARDL models can be directly specified using the `ARDL` class.  The first
# argument is the endogenous variable ($Y_t$). The second is the AR lags. It
# can be a constant, in which case lags 1, 2, ..., $P$ are included, or a
# list of specific lags indices to include (e.g., `[1, 4]`).  The third are
# the exogenous variables, and the fourth is the list of lags to include.
# This can be one of
#
# * An `int`: Include lags 0, 1, ..., Q
# * A dict with column names when `exog` is a `DataFrame` or numeric
# column locations when `exog` is a NumPy array (e.g., `{0:1, 1: 2, 2:3}`,
# would match the specification below if a NumPy array was used.
# * A dict with column names (DataFrames) or integers (NumPy arrays) that
# contains a list of specific lags to include (e.g., `{"lry":[0,2],
# "ibo":[1,2]}`).
#
# The specification below matches that model selected by
# `ardl_select_order`.

res = ARDL(data.lrm,
           2,
           data[["lry", "ibo", "ide"]], {
               "lry": 1,
               "ibo": 2,
               "ide": 3
           },
           trend="c").fit()
res.summary()

# ### NumPy Data
#
# Below we see how the specification of ARDL models differs when using
# NumPy arrays.  The key difference is that the keys in the dictionary are
# now integers which indicate the column of `x` to use. This model is
# identical to the previously fit model and all key value match exactly
# (e.g., Log Likelihood).

y = np.asarray(data.lrm)
x = np.asarray(data[["lry", "ibo", "ide"]])
res = ARDL(y, 2, x, {0: 1, 1: 2, 2: 3}, trend="c").fit()
res.summary()

# ### Causal models
#
# Using the `causal=True` flag eliminates lag 0 from the DL components, so
# that all variables included in the model are known at time $t-1$ when
# modeling $Y_t$.

res = ARDL(
    data.lrm,
    2,
    data[["lry", "ibo", "ide"]],
    {
        "lry": 1,
        "ibo": 2,
        "ide": 3
    },
    trend="c",
    causal=True,
).fit()
res.summary()

# ## Unconstrained Error Correction Models (UECM)
#
# Unconstrained Error Correction Models reparameterize ARDL model to focus
# on the long-run component of a time series.  The reparameterized model is
#
# $$
# \Delta Y_t = \underset{\text{Constant and Trend}}{\underbrace{\delta_0 +
# \delta_1 t + \ldots + \delta_k t^k}}
#       + \underset{\text{Seasonal}}{\underbrace{\sum_{i=0}^{s-1} \gamma_i
# S_i}}
#       + \underset{\text{Long-Run}}{\underbrace{\lambda_0 Y_{t-1} +
# \sum_{b=1}^M \lambda_i  X_{b,t-1}}}
#       + \underset{\text{Autoregressive}}{\underbrace{\sum_{p=1}^P \phi_p
# \Delta Y_{t-p}}}
#       + \underset{\text{Distributed Lag}}{\underbrace{\sum_{k=1}^M
# \sum_{j=0}^{Q_k} \beta_{k,j} \Delta X_{k, t-j}}}
#       + \underset{\text{Fixed}}{\underbrace{Z_t \Gamma}} + \epsilon_t
# $$
#
#
# Most of the components are the same.  The key differences are:
#
# * The levels only enter at lag 1
# * All other lags of $Y_t$ or $X_{k,t}$ are differenced
#
# Due to their structure, UECM models _do not_ support irregular lag
# specifications, and so lags specifications must be integers. The AR lag
# length must be an integer or `None`, while the DL lag specification can be
# an integer or a dictionary of integers.  Other options such as `trend`,
# `seasonal`, and `causal` are identical.
#
# Below we select a model and then using the class method `from_ardl` to
# construct the UECM.  The parameter estimates prefixed with `D.` are
# differences.

from statsmodels.tsa.api import UECM

sel_res = ardl_select_order(data.lrm,
                            3,
                            data[["lry", "ibo", "ide"]],
                            3,
                            ic="aic",
                            trend="c")

ecm = UECM.from_ardl(sel_res.model)
ecm_res = ecm.fit()
ecm_res.summary()

# ### Cointegrating Relationships
#
# Because the focus is on the long-run relationship, the results of UECM
# model fits contains a number of properties that focus on the long-run
# relationship. These are all prefixed `ci_`, for cointegrating.
# `ci_summary` contains the normalized estimates of the cointegrating
# relationship and associated estimated values.

ecm_res.ci_summary()

# `ci_resids` contains the long-run residual, which is the error the
# drives figure changes in $\Delta Y_t$.

_ = ecm_res.ci_resids.plot(title="Cointegrating Error")

# ### Seasonal Dummies
#
# Here we add seasonal terms, which appear to be statistically
# significant.

ecm = UECM(data.lrm, 2, data[["lry", "ibo", "ide"]], 2, seasonal=True)
seasonal_ecm_res = ecm.fit()
seasonal_ecm_res.summary()

# All deterministic terms are included in the `ci_` prefixed terms.  Here
# we see the normalized seasonal effects in the summary.

seasonal_ecm_res.ci_summary()

# The residuals are somewhat more random in appearance.

_ = seasonal_ecm_res.ci_resids.plot(
    title="Cointegrating Error with Seasonality")

# ## The relationship between Consumption and Growth
#
# Here we look at an example from Greene's _Econometric analysis_ which
# focuses on teh long-run relationship between consumption and growth. We
# start by downloading the raw data.
#
# Greene, W. H. (2000). Econometric analysis 4th edition. International
# edition, New Jersey: Prentice Hall, 201-215.

greene = pd.read_fwf(
    "http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF5-2.txt")
greene.head()

# We then transform the index to be a pandas `DatetimeIndex` so that we
# can easily use seasonal terms.

index = pd.to_datetime(
    greene.Year.astype("int").astype("str") + "Q" +
    greene.qtr.astype("int").astype("str"))
greene.index = index
greene.index.freq = greene.index.inferred_freq
greene.head()

# We defined `g` as the log of real gdp and `c` as the log of real
# consumption.

greene["c"] = np.log(greene.realcons)
greene["g"] = np.log(greene.realgdp)

# ### Lag Length Selection
#
# The selected model contains 5 lags of consumption and 2 of growth (0 and
# 1). Here we include seasonal terms although these are not significant.

sel_res = ardl_select_order(greene.c,
                            8,
                            greene[["g"]],
                            8,
                            trend="c",
                            seasonal=True,
                            ic="aic")
ardl = sel_res.model
ardl.ardl_order

res = ardl.fit(use_t=True)
res.summary()

# `from_ardl` is a simple way to get the equivalent UECM specification.
# Here we rerun the selection without the seasonal terms.

sel_res = ardl_select_order(greene.c, 8, greene[["g"]], 8, trend="c", ic="aic")

uecm = UECM.from_ardl(sel_res.model)
uecm_res = uecm.fit()
uecm_res.summary()

# We see that for every % increase in consumption, we need a 1.05%
# increase in gdp. In other words, the saving rate is estimated to be around
# 5%.

uecm_res.ci_summary()

_ = uecm_res.ci_resids.plot(title="Cointegrating Error")

# ### Direct Specification of `UECM` models
#
# `UECM` can be used to directly specify model lag lengths.

uecm = UECM(greene.c, 2, greene[["g"]], 1, trend="c")
uecm_res = uecm.fit()
uecm_res.summary()

# The changes in the lag structure make little difference in the estimated
# long-run relationship.

uecm_res.ci_summary()

# ## Bounds Testing
#
# `UECMResults` expose the bounds test of Pesaran, Shin, and Smith (2001).
# This test facilitates testing whether there is a level relationship
# between a set of variables without identifying which variables are I(1).
# This test provides two sets of critical and p-values.  If the test
# statistic is below the critical value for the lower bound, then there
# appears to be no levels relationship irrespective of the order or
# integration in the $X$ variables.  If it is above the upper bound, then
# there appears to be a levels relationship again, irrespective of the order
# of integration of the $X$ variables. There are 5 cases covered in the
# paper that include different combinations of deterministic regressors in
# the model or the test.
#
#
# $$\Delta Y_{t}=\delta_{0} + \delta_{1}t + Z_{t-1}\beta +
# \sum_{j=0}^{P}\Delta X_{t-j}\Gamma + \epsilon_{t}$$
#
# where $Z_{t-1}$ includes both $Y_{t-1}$ and $X_{t-1}$.
#
# The cases determine which deterministic terms are included in the model
# and which are tested as part of the test.
#
# 1. No deterministic terms
# 2. Constant included in both the model and the test
# 3. Constant included in the model but not in the test
# 4. Constant and trend included in the model, only trend included in the
# test
# 5. Constant and trend included in the model, neither included in the
# test
#
# Here we run the test on the Danish money demand data set. Here we see
# the test statistic is above the 95% critical value for both the lower and
# upper.
#
#
# Pesaran, M. H., Shin, Y., & Smith, R. J. (2001). Bounds testing
# approaches to the analysis of level relationships. Journal of applied
# econometrics, 16(3), 289-326.

ecm = UECM(data.lrm, 3, data[["lry", "ibo", "ide"]], 3, trend="c")
ecm_fit = ecm.fit()
bounds_test = ecm_fit.bounds_test(case=4)
bounds_test

bounds_test.crit_vals

# Case 3 also rejects the null of no levels relationship.

ecm = UECM(data.lrm, 3, data[["lry", "ibo", "ide"]], 3, trend="c")
ecm_fit = ecm.fit()
bounds_test = ecm_fit.bounds_test(case=3)
bounds_test