File: statespace_forecasting.py

#!/usr/bin/env python

# DO NOT EDIT
# Autogenerated from the notebook statespace_forecasting.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# # Forecasting in statsmodels
#
# This notebook describes forecasting using time series models in
# statsmodels.
#
# **Note**: this notebook applies only to the state space model classes,
# which are:
#
# - `sm.tsa.SARIMAX`
# - `sm.tsa.UnobservedComponents`
# - `sm.tsa.VARMAX`
# - `sm.tsa.DynamicFactor`

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

macrodata = sm.datasets.macrodata.load_pandas().data
macrodata.index = pd.period_range('1959Q1', '2009Q3', freq='Q')

# ## Basic example
#
# A simple example is to use an AR(1) model to forecast inflation. Before
# forecasting, let's take a look at the series:

endog = macrodata['infl']
endog.plot(figsize=(15, 5))

# ### Constructing and estimating the model

# The next step is to formulate the econometric model that we want to use
# for forecasting. In this case, we will use an AR(1) model via the
# `SARIMAX` class in statsmodels.
#
# After constructing the model, we need to estimate its parameters. This
# is done using the `fit` method. The `summary` method produces several
# convenient tables showing the results.

# Construct the model
mod = sm.tsa.SARIMAX(endog, order=(1, 0, 0), trend='c')
# Estimate the parameters
res = mod.fit()

print(res.summary())

# ### Forecasting

# Out-of-sample forecasts are produced using the `forecast` or
# `get_forecast` methods from the results object.
#
# The `forecast` method gives only point forecasts.

# The default is to get a one-step-ahead forecast:
print(res.forecast())

# The `get_forecast` method is more general, and also allows constructing
# confidence intervals.

# Here we construct a more complete results object.
fcast_res1 = res.get_forecast()

# Most results are collected in the `summary_frame` attribute.
# Here we specify that we want a confidence level of 90%
print(fcast_res1.summary_frame(alpha=0.10))

# The default confidence level is 95%, but this can be controlled by
# setting the `alpha` parameter, where the confidence level is defined as
# $(1 - \alpha) \times 100\%$. In the example above, we specified a
# confidence level of 90%, using `alpha=0.10`.

# ### Specifying the number of forecasts
#
# Both of the functions `forecast` and `get_forecast` accept a single
# argument indicating how many forecasting steps are desired. One option,
# which is always available, is to provide an integer describing the
# number of steps ahead that you want.

print(res.forecast(steps=2))

fcast_res2 = res.get_forecast(steps=2)
# Note: since we did not specify the alpha parameter, the
# confidence level is at the default, 95%
print(fcast_res2.summary_frame())

# However, **if your data included a Pandas index with a defined
# frequency** (see the section at the end on Indexes for more information),
# then you can alternatively specify the date through which you want
# forecasts to be produced:

print(res.forecast('2010Q2'))

fcast_res3 = res.get_forecast('2010Q2')
print(fcast_res3.summary_frame())

# ### Plotting the data, forecasts, and confidence intervals
#
# Often it is useful to plot the data, the forecasts, and the confidence
# intervals. There are many ways to do this, but here's one example

fig, ax = plt.subplots(figsize=(15, 5))

# Plot the data (here we are subsetting it to get a better look at the
# forecasts)
endog.loc['1999':].plot(ax=ax)

# Construct the forecasts
fcast = res.get_forecast('2011Q4').summary_frame()
fcast['mean'].plot(ax=ax, style='k--')
ax.fill_between(fcast.index,
                fcast['mean_ci_lower'],
                fcast['mean_ci_upper'],
                color='k',
                alpha=0.1)

# ### Note on what to expect from forecasts
#
# The forecast above may not look very impressive, as it is almost a
# straight line. This is because this is a very simple, univariate
# forecasting model. Nonetheless, keep in mind that these simple forecasting
# models can be extremely competitive.

# ## Prediction vs Forecasting
#
# The results objects also contain two methods that allow for both in-sample
# fitted values and out-of-sample forecasting. They are `predict` and
# `get_prediction`. The `predict` method only returns point predictions
# (similar to `forecast`), while the `get_prediction` method also returns
# additional results (similar to `get_forecast`).
#
# In general, if your interest is out-of-sample forecasting, it is easier
# to stick to the `forecast` and `get_forecast` methods.

# ## Cross validation
#
# **Note**: some of the functions used in this section were first
# introduced in statsmodels v0.11.0.
#
# A common use case is to cross-validate forecasting methods by performing
# h-step-ahead forecasts recursively using the following process:
#
# 1. Fit model parameters on a training sample
# 2. Produce h-step-ahead forecasts from the end of that sample
# 3. Compare forecasts against test dataset to compute error rate
# 4. Expand the sample to include the next observation, and repeat
#
# Economists sometimes call this a pseudo-out-of-sample forecast
# evaluation exercise, or time-series cross-validation.

# ### Example

# We will conduct a very simple exercise of this sort using the inflation
# dataset above. The full dataset contains 203 observations, and for
# expositional purposes we'll use the first 80% as our training sample and
# only consider one-step-ahead forecasts.

# A single iteration of the above procedure looks like the following:

# Step 1: fit model parameters w/ training sample
training_obs = int(len(endog) * 0.8)

training_endog = endog[:training_obs]
training_mod = sm.tsa.SARIMAX(training_endog, order=(1, 0, 0), trend='c')
training_res = training_mod.fit()

# Print the estimated parameters
print(training_res.params)

# Step 2: produce one-step-ahead forecasts
fcast = training_res.forecast()

# Step 3: compute the forecast error
true = endog.reindex(fcast.index)
error = true - fcast

# Print out the results
print(
    pd.concat(
        [true.rename('true'),
         fcast.rename('forecast'),
         error.rename('error')],
        axis=1))

# To add on another observation, we can use the `append` or `extend`
# results methods. Either method can produce the same forecasts, but they
# differ in the other results that are available:
#
# - `append` is the more complete method. It always stores results for all
# training observations, and it optionally allows refitting the model
# parameters given the new observations (note that the default is *not* to
# refit the parameters).
# - `extend` is a faster method that may be useful if the training sample
# is very large. It *only* stores results for the new observations, and it
# does not allow refitting the model parameters (i.e. you have to use the
# parameters estimated on the previous sample).
#
# If your training sample is relatively small (less than a few thousand
# observations, for example) or if you want to compute the best possible
# forecasts, then you should use the `append` method. However, if that
# method is infeasible (for example, because you have a very large training
# sample) or if you are okay with slightly suboptimal forecasts (because the
# parameter estimates will be slightly stale), then you can consider the
# `extend` method.

# A second iteration, using the `append` method and refitting the
# parameters, would go as follows (note again that the default for `append`
# does not refit the parameters, but we have overridden that with the
# `refit=True` argument):

# Step 1: append a new observation to the sample and refit the parameters
append_res = training_res.append(endog[training_obs:training_obs + 1],
                                 refit=True)

# Print the re-estimated parameters
print(append_res.params)

# Notice that these estimated parameters are slightly different than those
# we originally estimated. With the new results object, `append_res`, we can
# compute forecasts starting from one observation further than the previous
# call:

# Step 2: produce one-step-ahead forecasts
fcast = append_res.forecast()

# Step 3: compute the forecast error
true = endog.reindex(fcast.index)
error = true - fcast

# Print out the results
print(
    pd.concat(
        [true.rename('true'),
         fcast.rename('forecast'),
         error.rename('error')],
        axis=1))

# Putting it all together, we can perform the recursive forecast evaluation
# exercise as follows:

# Setup forecasts
nforecasts = 3
forecasts = {}

# Get the number of initial training observations
nobs = len(endog)
n_init_training = int(nobs * 0.8)

# Create model for initial training sample, fit parameters
init_training_endog = endog.iloc[:n_init_training]
mod = sm.tsa.SARIMAX(init_training_endog, order=(1, 0, 0), trend='c')
res = mod.fit()

# Save initial forecast
forecasts[init_training_endog.index[-1]] = res.forecast(steps=nforecasts)

# Step through the rest of the sample
for t in range(n_init_training, nobs):
    # Update the results by appending the next observation
    updated_endog = endog.iloc[t:t + 1]
    res = res.append(updated_endog, refit=False)

    # Save the new set of forecasts
    forecasts[updated_endog.index[0]] = res.forecast(steps=nforecasts)

# Combine all forecasts into a dataframe
forecasts = pd.concat(forecasts, axis=1)

print(forecasts.iloc[:5, :5])

# We now have a set of three forecasts made at each point in time from
# 1999Q2 through 2009Q3. We can construct the forecast errors by subtracting
# each forecast from the actual value of `endog` at that point.

# Construct the forecast errors
forecast_errors = forecasts.apply(lambda column: endog - column).reindex(
    forecasts.index)

print(forecast_errors.iloc[:5, :5])

# To evaluate our forecasts, we often want to look at a summary value like
# the root mean square error. Here we can compute that for each horizon by
# first flattening the forecast errors so that they are indexed by horizon
# and then computing the root mean square error for each horizon.


# Reindex the forecasts by horizon rather than by date
def flatten(column):
    return column.dropna().reset_index(drop=True)


flattened = forecast_errors.apply(flatten)
flattened.index = (flattened.index + 1).rename('horizon')

print(flattened.iloc[:3, :5])

# Compute the root mean square error
rmse = (flattened**2).mean(axis=1)**0.5

print(rmse)

# #### Using `extend`
#
# We can check that we get similar forecasts if we instead use the
# `extend` method, but that they are not exactly the same as when we use
# `append` with the `refit=True` argument. This is because `extend` does not
# re-estimate the parameters given the new observation.

# Setup forecasts
nforecasts = 3
forecasts = {}

# Get the number of initial training observations
nobs = len(endog)
n_init_training = int(nobs * 0.8)

# Create model for initial training sample, fit parameters
init_training_endog = endog.iloc[:n_init_training]
mod = sm.tsa.SARIMAX(init_training_endog, order=(1, 0, 0), trend='c')
res = mod.fit()

# Save initial forecast
forecasts[init_training_endog.index[-1]] = res.forecast(steps=nforecasts)

# Step through the rest of the sample
for t in range(n_init_training, nobs):
    # Update the results by appending the next observation
    updated_endog = endog.iloc[t:t + 1]
    res = res.extend(updated_endog)

    # Save the new set of forecasts
    forecasts[updated_endog.index[0]] = res.forecast(steps=nforecasts)

# Combine all forecasts into a dataframe
forecasts = pd.concat(forecasts, axis=1)

print(forecasts.iloc[:5, :5])

# Construct the forecast errors
forecast_errors = forecasts.apply(lambda column: endog - column).reindex(
    forecasts.index)

print(forecast_errors.iloc[:5, :5])


# Reindex the forecasts by horizon rather than by date
def flatten(column):
    return column.dropna().reset_index(drop=True)


flattened = forecast_errors.apply(flatten)
flattened.index = (flattened.index + 1).rename('horizon')

print(flattened.iloc[:3, :5])

# Compute the root mean square error
rmse = (flattened**2).mean(axis=1)**0.5

print(rmse)

# By not re-estimating the parameters, our forecasts are slightly worse
# (the root mean square error is higher at each horizon). However, the
# process is faster, even with only 200 datapoints. Using the `%%timeit`
# cell magic on the cells above, we found a runtime of 570ms using `extend`
# versus 1.7s using `append` with `refit=True`. (Note that using `extend` is
# also faster than using `append` with `refit=False`).

# ## Indexes
#
# Throughout this notebook, we have been making use of Pandas date indexes
# with an associated frequency. As you can see, this index marks our data as
# at a quarterly frequency, between 1959Q1 and 2009Q3.

print(endog.index)

# In most cases, if your data has an associated date/time index with a
# defined frequency (like quarterly, monthly, etc.), then it is best to make
# sure your data is a Pandas series with the appropriate index. Here are
# three examples of this:

# Annual frequency, using a PeriodIndex
index = pd.period_range(start='2000', periods=4, freq='Y')
endog1 = pd.Series([1, 2, 3, 4], index=index)
print(endog1.index)

# Quarterly frequency, using a DatetimeIndex
index = pd.date_range(start='2000', periods=4, freq='QS')
endog2 = pd.Series([1, 2, 3, 4], index=index)
print(endog2.index)

# Monthly frequency, using a DatetimeIndex
index = pd.date_range(start='2000', periods=4, freq='ME')
endog3 = pd.Series([1, 2, 3, 4], index=index)
print(endog3.index)

# In fact, if your data has an associated date/time index, it is best to
# use that even if it does not have a defined frequency. An example of that
# kind of index is as follows - notice that it has `freq=None`:

index = pd.DatetimeIndex([
    '2000-01-01 10:08am', '2000-01-01 11:32am', '2000-01-01 5:32pm',
    '2000-01-02 6:15am'
])
endog4 = pd.Series([0.2, 0.5, -0.1, 0.1], index=index)
print(endog4.index)

# You can still pass this data to statsmodels' model classes, but you will
# get the following warning, that no frequency data was found:

mod = sm.tsa.SARIMAX(endog4)
res = mod.fit()

# What this means is that you cannot specify forecasting steps by dates,
# and the output of the `forecast` and `get_forecast` methods will not have
# associated dates. The reason is that without a given frequency, there is
# no way to determine what date each forecast should be assigned to. In the
# example above, there is no pattern to the date/time stamps of the index,
# so there is no way to determine what the next date/time should be (should
# it be in the morning of 2000-01-02? the afternoon? or maybe not until
# 2000-01-03?).
#
# For example, if we forecast one-step-ahead:

res.forecast(1)

# The index associated with the new forecast is `4`, because if the given
# data had an integer index, that would be the next value. A warning is
# given letting the user know that the index is not a date/time index.
#
# If we try to specify the steps of the forecast using a date, we will get
# the following exception:
#
#     KeyError: 'The `end` argument could not be matched to a location
# related to the index of the data.'
#

# Here we'll catch the exception to prevent printing too much of
# the exception trace output in this notebook
try:
    res.forecast('2000-01-03')
except KeyError as e:
    print(e)

# Ultimately there is nothing wrong with using data that does not have an
# associated date/time frequency, or even using data that has no index at
# all, like a Numpy array. However, if you can use a Pandas series with an
# associated frequency, you'll have more options for specifying your
# forecasts and get back results with a more useful index.