#!/usr/bin/env python
# coding: utf-8

# DO NOT EDIT
# Autogenerated from the notebook autoregressive_distributed_lag.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# # Autoregressive Distributed Lag (ARDL) models
#
#
# ## ARDL Models
#
# Autoregressive Distributed Lag (ARDL) models extend Autoregressive
# models with lags of explanatory variables. While ARDL models are
# technically AR-X models, the key difference is that ARDL models focus on
# the exogenous variables and selecting the correct lag structure from both
# the endogenous variable and the exogenous variables.  ARDL models are also
# closely related to Vector Autoregressions, and a single ARDL is
# effectively one row of a VAR.  The key distinction is that an ARDL assumes
# that the exogenous variables are exogenous in the sense that it is not
# necessary to include the endogenous variable as a predictor of the
# exogenous variables.
#
# The full specification of ARDL models is
#
# $$
# Y_t = \underset{\text{Constant and Trend}}{\underbrace{\delta_0 +
# \delta_1 t + \ldots + \delta_k t^k}}
#       + \underset{\text{Seasonal}}{\underbrace{\sum_{i=0}^{s-1} \gamma_i
# S_i}}
#       + \underset{\text{Autoregressive}}{\underbrace{\sum_{p=1}^P \phi_p
# Y_{t-p}}}
#       + \underset{\text{Distributed Lag}}{\underbrace{\sum_{k=1}^M
# \sum_{j=0}^{Q_k} \beta_{k,j} X_{k, t-j}}}
#       + \underset{\text{Fixed}}{\underbrace{Z_t \Gamma}} + \epsilon_t
# $$
#
# The terms in the model are:
#
# * $\delta_i$: constant and deterministic time regressors.  Set using
# `trend`.
# * $S_i$ are seasonal dummies which are included if `seasonal=True`.
# * $X_{k,t-j}$ are the exogenous regressors. There are a number of
# formats that can be used to specify which lags are included. Note that the
# included lag lengths do no need to be the same. If `causal=True`, then the
# lags start with lag 1. Otherwise lags begin with 0 so that the model
# included the contemporaneous relationship between $Y_t$ and $X_t$.
# * $Z_t$ are any other fixed regressors that are not part of the
# distributed lag specification. In practice these regressors may be
# included when they do no contribute to the long run-relationship between
# $Y_t$ and the vector of exogenous variables $X_t$.
# * $\{\epsilon_t\}$ is assumed to be a White Noise process

import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style("darkgrid")
sns.mpl.rc("figure", figsize=(16, 6))
sns.mpl.rc("font", size=14)

# ### Data
#
# This notebook makes use of money demand data from Denmark, as first used
# in  S. Johansen and K. Juselius (1990).  The key variables are:
#
# * `lrm`: Log of real money measured using M2
# * `lry`: Log of real income
# * `ibo`: Interest rate on bonds
# * `ide`: Interest rate of bank deposits
#
# The standard model uses `lrm` as the dependent variable and the other
# three as exogenous drivers.
#
# Johansen, S. and Juselius, K. (1990), Maximum Likelihood Estimation and
# Inference on Cointegration – with Applications to the Demand for Money,
# Oxford Bulletin of Economics and Statistics, 52, 2, 169–210.
#
# We start by loading the data and examining it.

from statsmodels.datasets.danish_data import load
from statsmodels.tsa.api import ARDL
from statsmodels.tsa.ardl import ardl_select_order

data = load().data
data = data[["lrm", "lry", "ibo", "ide"]]
data.tail()

# We plot the demeaned data so that all series appear on the same scale.
# The `lrm` series appears to be non-stationary, as does `lry`. The
# stationarity of the other two is less obvious.

_ = (data - data.mean()).plot()

# ### Model Selection
#
# `ardl_select_order` can be used to automatically select the order. Here
# we use min the minimum AIC among all modes that consider up to 3 lags of
# the endogenous variable and 3 lags of each exogenous variable. `trend="c"`
# indicates that a constant should be included in the model.

sel_res = ardl_select_order(data.lrm,
                            3,
                            data[["lry", "ibo", "ide"]],
                            3,
                            ic="aic",
                            trend="c")
print(f"The optimal order is: {sel_res.model.ardl_order}")

# The optimal order is returned as the number of lags of the endogenous
# variable followed by each of the exogenous regressors.  The attribute
# `model` on `sel_res` contains the model `ARDL` specification which can be
# used to call `fit`. Here we look at the summary where the `L#` indicates
# that lag length (e.g., `L0` is no lag, i.e., $X_{k,t}$, `L2` is 2 lags,
# i.e., $X_{k,t-2}$).

res = sel_res.model.fit()
res.summary()

# ### Global searches
#
# The selection criteria can be switched the BIC which chooses a smaller
# model. Here we also use the `glob=True` option to perform a global search
# which considers models with any subset of lags up to the maximum lag
# allowed (3 here).  This option lets the model selection choose non-
# contiguous lag specifications.

sel_res = ardl_select_order(data.lrm,
                            3,
                            data[["lry", "ibo", "ide"]],
                            3,
                            ic="bic",
                            trend="c",
                            glob=True)
sel_res.model.ardl_order

# While the `ardl_order` shows the largest included lag of each variable,
# `ar_lags` and `dl_lags` show the specific lags included.  The AR component
# is regular in the sense that all 3 lags are included.  The DL component is
# not since `ibo` selects only lags 0 and 3 and ide selects only lags 2.

sel_res.model.ar_lags

sel_res.model.dl_lags

# We can take a look at the best performing models according to the BIC
# which are stored in the `bic` property. `ibo` at lags 0 and 3 is
# consistently selected, as is `ide` at either lag 2 or 3, and `lry` at lag
# 0. The selected AR lags vary more, although all of the best specifications
# select some.

for i, val in enumerate(sel_res.bic.head(10)):
    print(f"{i+1}: {val}")

# ### Direct Parameterization
#
# ARDL models can be directly specified using the `ARDL` class.  The first
# argument is the endogenous variable ($Y_t$). The second is the AR lags. It
# can be a constant, in which case lags 1, 2, ..., $P$ are included, or a
# list of specific lags indices to include (e.g., `[1, 4]`).  The third are
# the exogenous variables, and the fourth is the list of lags to include.
# This can be one of
#
# * An `int`: Include lags 0, 1, ..., Q
# * A dict with column names when `exog` is a `DataFrame` or numeric
# column locations when `exog` is a NumPy array (e.g., `{0:1, 1: 2, 2:3}`,
# would match the specification below if a NumPy array was used.
# * A dict with column names (DataFrames) or integers (NumPy arrays) that
# contains a list of specific lags to include (e.g., `{"lry":[0,2],
# "ibo":[1,2]}`).
#
# The specification below matches that model selected by
# `ardl_select_order`.

res = ARDL(data.lrm,
           2,
           data[["lry", "ibo", "ide"]], {
               "lry": 1,
               "ibo": 2,
               "ide": 3
           },
           trend="c").fit()
res.summary()

# ### NumPy Data
#
# Below we see how the specification of ARDL models differs when using
# NumPy arrays.  The key difference is that the keys in the dictionary are
# now integers which indicate the column of `x` to use. This model is
# identical to the previously fit model and all key value match exactly
# (e.g., Log Likelihood).

y = np.asarray(data.lrm)
x = np.asarray(data[["lry", "ibo", "ide"]])
res = ARDL(y, 2, x, {0: 1, 1: 2, 2: 3}, trend="c").fit()
res.summary()

# ### Causal models
#
# Using the `causal=True` flag eliminates lag 0 from the DL components, so
# that all variables included in the model are known at time $t-1$ when
# modeling $Y_t$.

res = ARDL(
    data.lrm,
    2,
    data[["lry", "ibo", "ide"]],
    {
        "lry": 1,
        "ibo": 2,
        "ide": 3
    },
    trend="c",
    causal=True,
).fit()
res.summary()

# ## Unconstrained Error Correction Models (UECM)
#
# Unconstrained Error Correction Models reparameterize ARDL model to focus
# on the long-run component of a time series.  The reparameterized model is
#
# $$
# \Delta Y_t = \underset{\text{Constant and Trend}}{\underbrace{\delta_0 +
# \delta_1 t + \ldots + \delta_k t^k}}
#       + \underset{\text{Seasonal}}{\underbrace{\sum_{i=0}^{s-1} \gamma_i
# S_i}}
#       + \underset{\text{Long-Run}}{\underbrace{\lambda_0 Y_{t-1} +
# \sum_{b=1}^M \lambda_i  X_{b,t-1}}}
#       + \underset{\text{Autoregressive}}{\underbrace{\sum_{p=1}^P \phi_p
# \Delta Y_{t-p}}}
#       + \underset{\text{Distributed Lag}}{\underbrace{\sum_{k=1}^M
# \sum_{j=0}^{Q_k} \beta_{k,j} \Delta X_{k, t-j}}}
#       + \underset{\text{Fixed}}{\underbrace{Z_t \Gamma}} + \epsilon_t
# $$
#
#
# Most of the components are the same.  The key differences are:
#
# * The levels only enter at lag 1
# * All other lags of $Y_t$ or $X_{k,t}$ are differenced
#
# Due to their structure, UECM models _do not_ support irregular lag
# specifications, and so lags specifications must be integers. The AR lag
# length must be an integer or `None`, while the DL lag specification can be
# an integer or a dictionary of integers.  Other options such as `trend`,
# `seasonal`, and `causal` are identical.
#
# Below we select a model and then using the class method `from_ardl` to
# construct the UECM.  The parameter estimates prefixed with `D.` are
# differences.

from statsmodels.tsa.api import UECM

sel_res = ardl_select_order(data.lrm,
                            3,
                            data[["lry", "ibo", "ide"]],
                            3,
                            ic="aic",
                            trend="c")

ecm = UECM.from_ardl(sel_res.model)
ecm_res = ecm.fit()
ecm_res.summary()

# ### Cointegrating Relationships
#
# Because the focus is on the long-run relationship, the results of UECM
# model fits contains a number of properties that focus on the long-run
# relationship. These are all prefixed `ci_`, for cointegrating.
# `ci_summary` contains the normalized estimates of the cointegrating
# relationship and associated estimated values.

ecm_res.ci_summary()

# `ci_resids` contains the long-run residual, which is the error the
# drives figure changes in $\Delta Y_t$.

_ = ecm_res.ci_resids.plot(title="Cointegrating Error")

# ### Seasonal Dummies
#
# Here we add seasonal terms, which appear to be statistically
# significant.

ecm = UECM(data.lrm, 2, data[["lry", "ibo", "ide"]], 2, seasonal=True)
seasonal_ecm_res = ecm.fit()
seasonal_ecm_res.summary()

# All deterministic terms are included in the `ci_` prefixed terms.  Here
# we see the normalized seasonal effects in the summary.

seasonal_ecm_res.ci_summary()

# The residuals are somewhat more random in appearance.

_ = seasonal_ecm_res.ci_resids.plot(
    title="Cointegrating Error with Seasonality")

# ## The relationship between Consumption and Growth
#
# Here we look at an example from Greene's _Econometric analysis_ which
# focuses on teh long-run relationship between consumption and growth. We
# start by downloading the raw data.
#
# Greene, W. H. (2000). Econometric analysis 4th edition. International
# edition, New Jersey: Prentice Hall, 201-215.

greene = pd.read_fwf(
    "http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF5-2.txt")
greene.head()

# We then transform the index to be a pandas `DatetimeIndex` so that we
# can easily use seasonal terms.

index = pd.to_datetime(
    greene.Year.astype("int").astype("str") + "Q" +
    greene.qtr.astype("int").astype("str"))
greene.index = index
greene.index.freq = greene.index.inferred_freq
greene.head()

# We defined `g` as the log of real gdp and `c` as the log of real
# consumption.

greene["c"] = np.log(greene.realcons)
greene["g"] = np.log(greene.realgdp)

# ### Lag Length Selection
#
# The selected model contains 5 lags of consumption and 2 of growth (0 and
# 1). Here we include seasonal terms although these are not significant.

sel_res = ardl_select_order(greene.c,
                            8,
                            greene[["g"]],
                            8,
                            trend="c",
                            seasonal=True,
                            ic="aic")
ardl = sel_res.model
ardl.ardl_order

res = ardl.fit(use_t=True)
res.summary()

# `from_ardl` is a simple way to get the equivalent UECM specification.
# Here we rerun the selection without the seasonal terms.

sel_res = ardl_select_order(greene.c, 8, greene[["g"]], 8, trend="c", ic="aic")

uecm = UECM.from_ardl(sel_res.model)
uecm_res = uecm.fit()
uecm_res.summary()

# We see that for every % increase in consumption, we need a 1.05%
# increase in gdp. In other words, the saving rate is estimated to be around
# 5%.

uecm_res.ci_summary()

_ = uecm_res.ci_resids.plot(title="Cointegrating Error")

# ### Direct Specification of `UECM` models
#
# `UECM` can be used to directly specify model lag lengths.

uecm = UECM(greene.c, 2, greene[["g"]], 1, trend="c")
uecm_res = uecm.fit()
uecm_res.summary()

# The changes in the lag structure make little difference in the estimated
# long-run relationship.

uecm_res.ci_summary()

# ## Bounds Testing
#
# `UECMResults` expose the bounds test of Pesaran, Shin, and Smith (2001).
# This test facilitates testing whether there is a level relationship
# between a set of variables without identifying which variables are I(1).
# This test provides two sets of critical and p-values.  If the test
# statistic is below the critical value for the lower bound, then there
# appears to be no levels relationship irrespective of the order or
# integration in the $X$ variables.  If it is above the upper bound, then
# there appears to be a levels relationship again, irrespective of the order
# of integration of the $X$ variables. There are 5 cases covered in the
# paper that include different combinations of deterministic regressors in
# the model or the test.
#
#
# $$\Delta Y_{t}=\delta_{0} + \delta_{1}t + Z_{t-1}\beta +
# \sum_{j=0}^{P}\Delta X_{t-j}\Gamma + \epsilon_{t}$$
#
# where $Z_{t-1}$ includes both $Y_{t-1}$ and $X_{t-1}$.
#
# The cases determine which deterministic terms are included in the model
# and which are tested as part of the test.
#
# 1. No deterministic terms
# 2. Constant included in both the model and the test
# 3. Constant included in the model but not in the test
# 4. Constant and trend included in the model, only trend included in the
# test
# 5. Constant and trend included in the model, neither included in the
# test
#
# Here we run the test on the Danish money demand data set. Here we see
# the test statistic is above the 95% critical value for both the lower and
# upper.
#
#
# Pesaran, M. H., Shin, Y., & Smith, R. J. (2001). Bounds testing
# approaches to the analysis of level relationships. Journal of applied
# econometrics, 16(3), 289-326.

ecm = UECM(data.lrm, 3, data[["lry", "ibo", "ide"]], 3, trend="c")
ecm_fit = ecm.fit()
bounds_test = ecm_fit.bounds_test(case=4)
bounds_test

bounds_test.crit_vals

# Case 3 also rejects the null of no levels relationship.

ecm = UECM(data.lrm, 3, data[["lry", "ibo", "ide"]], 3, trend="c")
ecm_fit = ecm.fit()
bounds_test = ecm_fit.bounds_test(case=3)
bounds_test
