File: statespace_seasonal.py

package info (click to toggle)
statsmodels 0.13.5%2Bdfsg-7
  • links: PTS, VCS
  • area: main
  • in suites: bookworm
  • size: 46,912 kB
  • sloc: python: 240,079; f90: 612; sh: 467; javascript: 337; asm: 156; makefile: 131; ansic: 16; xml: 9
file content (402 lines) | stat: -rw-r--r-- 16,067 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
#!/usr/bin/env python
# coding: utf-8

# DO NOT EDIT
# Autogenerated from the notebook statespace_seasonal.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# # Seasonality in time series data
#
# Consider the problem of modeling time series data with multiple seasonal
# components with different periodicities.  Let us take the time series
# $y_t$ and decompose it explicitly to have a level component and two
# seasonal components.
#
# $$
# y_t = \mu_t + \gamma^{(1)}_t + \gamma^{(2)}_t
# $$
#
# where $\mu_t$ represents the trend or level, $\gamma^{(1)}_t$ represents
# a seasonal component with a relatively short period, and $\gamma^{(2)}_t$
# represents another seasonal component of longer period. We will have a
# fixed intercept term for our level and consider both $\gamma^{(2)}_t$ and
# $\gamma^{(2)}_t$ to be stochastic so that the seasonal patterns can vary
# over time.
#
# In this notebook, we will generate synthetic data conforming to this
# model and showcase modeling of the seasonal terms in a few different ways
# under the unobserved components modeling framework.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

# ### Synthetic data creation
#
# We will create data with multiple seasonal patterns by following
# equations (3.7) and (3.8) in Durbin and Koopman (2012).  We will simulate
# 300 periods and two seasonal terms parametrized in the frequency domain
# having periods 10 and 100, respectively, and 3 and 2 number of harmonics,
# respectively.  Further, the variances of their stochastic parts are 4 and
# 9, respectively.


# First we'll simulate the synthetic data
def simulate_seasonal_term(periodicity,
                           total_cycles,
                           noise_std=1.,
                           harmonics=None):
    duration = periodicity * total_cycles
    assert duration == int(duration)
    duration = int(duration)
    harmonics = harmonics if harmonics else int(np.floor(periodicity / 2))

    lambda_p = 2 * np.pi / float(periodicity)

    gamma_jt = noise_std * np.random.randn((harmonics))
    gamma_star_jt = noise_std * np.random.randn((harmonics))

    total_timesteps = 100 * duration  # Pad for burn in
    series = np.zeros(total_timesteps)
    for t in range(total_timesteps):
        gamma_jtp1 = np.zeros_like(gamma_jt)
        gamma_star_jtp1 = np.zeros_like(gamma_star_jt)
        for j in range(1, harmonics + 1):
            cos_j = np.cos(lambda_p * j)
            sin_j = np.sin(lambda_p * j)
            gamma_jtp1[j - 1] = (gamma_jt[j - 1] * cos_j +
                                 gamma_star_jt[j - 1] * sin_j +
                                 noise_std * np.random.randn())
            gamma_star_jtp1[j - 1] = (-gamma_jt[j - 1] * sin_j +
                                      gamma_star_jt[j - 1] * cos_j +
                                      noise_std * np.random.randn())
        series[t] = np.sum(gamma_jtp1)
        gamma_jt = gamma_jtp1
        gamma_star_jt = gamma_star_jtp1
    wanted_series = series[-duration:]  # Discard burn in

    return wanted_series


duration = 100 * 3
periodicities = [10, 100]
num_harmonics = [3, 2]
std = np.array([2, 3])
np.random.seed(8678309)

terms = []
for ix, _ in enumerate(periodicities):
    s = simulate_seasonal_term(periodicities[ix],
                               duration / periodicities[ix],
                               harmonics=num_harmonics[ix],
                               noise_std=std[ix])
    terms.append(s)
terms.append(np.ones_like(terms[0]) * 10.)
series = pd.Series(np.sum(terms, axis=0))
df = pd.DataFrame(data={
    'total': series,
    '10(3)': terms[0],
    '100(2)': terms[1],
    'level': terms[2]
})
h1, = plt.plot(df['total'])
h2, = plt.plot(df['10(3)'])
h3, = plt.plot(df['100(2)'])
h4, = plt.plot(df['level'])
plt.legend(['total', '10(3)', '100(2)', 'level'])
plt.show()

# ### Unobserved components (frequency domain modeling)
#
# The next method is an unobserved components model, where the trend is
# modeled as a fixed intercept and the seasonal components are modeled using
# trigonometric functions with primary periodicities of 10 and 100,
# respectively, and number of harmonics 3 and 2, respectively.  Note that
# this is the correct, generating model. The process for the time series can
# be written as:
#
# $$
# \begin{align}
# y_t & = \mu_t + \gamma^{(1)}_t + \gamma^{(2)}_t + \epsilon_t\\
# \mu_{t+1} & = \mu_t \\
# \gamma^{(1)}_{t} &= \sum_{j=1}^2 \gamma^{(1)}_{j, t} \\
# \gamma^{(2)}_{t} &= \sum_{j=1}^3 \gamma^{(2)}_{j, t}\\
# \gamma^{(1)}_{j, t+1} &= \gamma^{(1)}_{j, t}\cos(\lambda_j) + \gamma^{*,
# (1)}_{j, t}\sin(\lambda_j) + \omega^{(1)}_{j,t}, ~j = 1, 2, 3\\
# \gamma^{*, (1)}_{j, t+1} &= -\gamma^{(1)}_{j, t}\sin(\lambda_j) +
# \gamma^{*, (1)}_{j, t}\cos(\lambda_j) + \omega^{*, (1)}_{j, t}, ~j = 1, 2,
# 3\\
# \gamma^{(2)}_{j, t+1} &= \gamma^{(2)}_{j, t}\cos(\lambda_j) + \gamma^{*,
# (2)}_{j, t}\sin(\lambda_j) + \omega^{(2)}_{j,t}, ~j = 1, 2\\
# \gamma^{*, (2)}_{j, t+1} &= -\gamma^{(2)}_{j, t}\sin(\lambda_j) +
# \gamma^{*, (2)}_{j, t}\cos(\lambda_j) + \omega^{*, (2)}_{j, t}, ~j = 1,
# 2\\
# \end{align}
# $$
#
#
# where $\epsilon_t$ is white noise, $\omega^{(1)}_{j,t}$ are i.i.d. $N(0,
# \sigma^2_1)$, and  $\omega^{(2)}_{j,t}$ are i.i.d. $N(0, \sigma^2_2)$,
# where $\sigma_1 = 2.$

model = sm.tsa.UnobservedComponents(series.values,
                                    level='fixed intercept',
                                    freq_seasonal=[{
                                        'period': 10,
                                        'harmonics': 3
                                    }, {
                                        'period': 100,
                                        'harmonics': 2
                                    }])
res_f = model.fit(disp=False)
print(res_f.summary())
# The first state variable holds our estimate of the intercept
print("fixed intercept estimated as {0:.3f}".format(
    res_f.smoother_results.smoothed_state[0, -1:][0]))

res_f.plot_components()
plt.show()

model.ssm.transition[:, :, 0]

# Observe that the fitted variances are pretty close to the true variances
# of 4 and 9.  Further, the individual seasonal components look pretty close
# to the true seasonal components.  The smoothed level term is kind of close
# to the true level of 10.  Finally, our diagnostics look solid; the test
# statistics are small enough to fail to reject our three tests.

# ### Unobserved components (mixed time and frequency domain modeling)
#
# The second method is an unobserved components model, where the trend is
# modeled as a fixed intercept and the seasonal components are modeled using
# 10 constants summing to 0 and trigonometric functions with a primary
# periodicities of 100 with 2 harmonics total.  Note that this is not the
# generating model, as it presupposes that there are more state errors for
# the shorter seasonal component than in reality. The process for the time
# series can be written as:
#
# $$
# \begin{align}
# y_t & = \mu_t + \gamma^{(1)}_t + \gamma^{(2)}_t + \epsilon_t\\
# \mu_{t+1} & = \mu_t \\
# \gamma^{(1)}_{t + 1} &= - \sum_{j=1}^9 \gamma^{(1)}_{t + 1 - j} +
# \omega^{(1)}_t\\
# \gamma^{(2)}_{j, t+1} &= \gamma^{(2)}_{j, t}\cos(\lambda_j) + \gamma^{*,
# (2)}_{j, t}\sin(\lambda_j) + \omega^{(2)}_{j,t}, ~j = 1, 2\\
# \gamma^{*, (2)}_{j, t+1} &= -\gamma^{(2)}_{j, t}\sin(\lambda_j) +
# \gamma^{*, (2)}_{j, t}\cos(\lambda_j) + \omega^{*, (2)}_{j, t}, ~j = 1,
# 2\\
# \end{align}
# $$
#
# where $\epsilon_t$ is white noise, $\omega^{(1)}_{t}$ are i.i.d. $N(0,
# \sigma^2_1)$, and  $\omega^{(2)}_{j,t}$ are i.i.d. $N(0, \sigma^2_2)$.

model = sm.tsa.UnobservedComponents(series,
                                    level='fixed intercept',
                                    seasonal=10,
                                    freq_seasonal=[{
                                        'period': 100,
                                        'harmonics': 2
                                    }])
res_tf = model.fit(disp=False)
print(res_tf.summary())
# The first state variable holds our estimate of the intercept
print("fixed intercept estimated as {0:.3f}".format(
    res_tf.smoother_results.smoothed_state[0, -1:][0]))

fig = res_tf.plot_components()
fig.tight_layout(pad=1.0)

# The plotted components look good.  However, the estimated variance of
# the second seasonal term is inflated from reality.  Additionally, we
# reject the Ljung-Box statistic, indicating we may have remaining
# autocorrelation after accounting for our components.

# ### Unobserved components (lazy frequency domain modeling)
#
# The third method is an unobserved components model with a fixed
# intercept and one seasonal component, which is modeled using trigonometric
# functions with primary periodicity 100 and 50 harmonics. Note that this is
# not the generating model, as it presupposes that there are more harmonics
# then in reality.  Because the variances are tied together, we are not able
# to drive the estimated covariance of the non-existent harmonics to 0.
# What is lazy about this model specification is that we have not bothered
# to specify the two different seasonal components and instead chosen to
# model them using a single component with enough harmonics to cover both.
# We will not be able to capture any differences in variances between the
# two true components.  The process for the time series can be written as:
#
# $$
# \begin{align}
# y_t & = \mu_t + \gamma^{(1)}_t + \epsilon_t\\
# \mu_{t+1} &= \mu_t\\
# \gamma^{(1)}_{t} &= \sum_{j=1}^{50}\gamma^{(1)}_{j, t}\\
# \gamma^{(1)}_{j, t+1} &= \gamma^{(1)}_{j, t}\cos(\lambda_j) + \gamma^{*,
# (1)}_{j, t}\sin(\lambda_j) + \omega^{(1}_{j,t}, ~j = 1, 2, \dots, 50\\
# \gamma^{*, (1)}_{j, t+1} &= -\gamma^{(1)}_{j, t}\sin(\lambda_j) +
# \gamma^{*, (1)}_{j, t}\cos(\lambda_j) + \omega^{*, (1)}_{j, t}, ~j = 1, 2,
# \dots, 50\\
# \end{align}
# $$
#
# where $\epsilon_t$ is white noise, $\omega^{(1)}_{t}$ are i.i.d. $N(0,
# \sigma^2_1)$.

model = sm.tsa.UnobservedComponents(series,
                                    level='fixed intercept',
                                    freq_seasonal=[{
                                        'period': 100
                                    }])
res_lf = model.fit(disp=False)
print(res_lf.summary())
# The first state variable holds our estimate of the intercept
print("fixed intercept estimated as {0:.3f}".format(
    res_lf.smoother_results.smoothed_state[0, -1:][0]))

fig = res_lf.plot_components()
fig.tight_layout(pad=1.0)

# Note that one of our diagnostic tests would be rejected at the .05
# level.

# ### Unobserved components (lazy time domain seasonal modeling)
#
# The fourth method is an unobserved components model with a fixed
# intercept and a single seasonal component modeled using a time-domain
# seasonal model of 100 constants. The process for the time series can be
# written as:
#
# $$
# \begin{align}
# y_t & =\mu_t + \gamma^{(1)}_t + \epsilon_t\\
# \mu_{t+1} &= \mu_{t} \\
# \gamma^{(1)}_{t + 1} &= - \sum_{j=1}^{99} \gamma^{(1)}_{t + 1 - j} +
# \omega^{(1)}_t\\
# \end{align}
# $$
#
# where $\epsilon_t$ is white noise, $\omega^{(1)}_{t}$ are i.i.d. $N(0,
# \sigma^2_1)$.

model = sm.tsa.UnobservedComponents(series,
                                    level='fixed intercept',
                                    seasonal=100)
res_lt = model.fit(disp=False)
print(res_lt.summary())
# The first state variable holds our estimate of the intercept
print("fixed intercept estimated as {0:.3f}".format(
    res_lt.smoother_results.smoothed_state[0, -1:][0]))

fig = res_lt.plot_components()
fig.tight_layout(pad=1.0)

# The seasonal component itself looks good--it is the primary signal.  The
# estimated variance of the seasonal term is very high ($>10^5$), leading to
# a lot of uncertainty in our one-step-ahead predictions and slow
# responsiveness to new data, as evidenced by large errors in one-step ahead
# predictions and observations. Finally, all three of our diagnostic tests
# were rejected.

# ### Comparison of filtered estimates
#
# The plots below show that explicitly modeling the individual components
# results in the filtered state being close to the true state within roughly
# half a period.  The lazy models took longer (almost a full period) to do
# the same on the combined true state.

# Assign better names for our seasonal terms
true_seasonal_10_3 = terms[0]
true_seasonal_100_2 = terms[1]
true_sum = true_seasonal_10_3 + true_seasonal_100_2

time_s = np.s_[:50]  # After this they basically agree
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
idx = np.asarray(series.index)
h1, = ax1.plot(idx[time_s],
               res_f.freq_seasonal[0].filtered[time_s],
               label='Double Freq. Seas')
h2, = ax1.plot(idx[time_s],
               res_tf.seasonal.filtered[time_s],
               label='Mixed Domain Seas')
h3, = ax1.plot(idx[time_s],
               true_seasonal_10_3[time_s],
               label='True Seasonal 10(3)')
plt.legend([h1, h2, h3],
           ['Double Freq. Seasonal', 'Mixed Domain Seasonal', 'Truth'],
           loc=2)
plt.title('Seasonal 10(3) component')
plt.show()

time_s = np.s_[:50]  # After this they basically agree
fig2 = plt.figure()
ax2 = fig2.add_subplot(111)
h21, = ax2.plot(idx[time_s],
                res_f.freq_seasonal[1].filtered[time_s],
                label='Double Freq. Seas')
h22, = ax2.plot(idx[time_s],
                res_tf.freq_seasonal[0].filtered[time_s],
                label='Mixed Domain Seas')
h23, = ax2.plot(idx[time_s],
                true_seasonal_100_2[time_s],
                label='True Seasonal 100(2)')
plt.legend([h21, h22, h23],
           ['Double Freq. Seasonal', 'Mixed Domain Seasonal', 'Truth'],
           loc=2)
plt.title('Seasonal 100(2) component')
plt.show()

time_s = np.s_[:100]

fig3 = plt.figure()
ax3 = fig3.add_subplot(111)
h31, = ax3.plot(idx[time_s],
                res_f.freq_seasonal[1].filtered[time_s] +
                res_f.freq_seasonal[0].filtered[time_s],
                label='Double Freq. Seas')
h32, = ax3.plot(idx[time_s],
                res_tf.freq_seasonal[0].filtered[time_s] +
                res_tf.seasonal.filtered[time_s],
                label='Mixed Domain Seas')
h33, = ax3.plot(idx[time_s], true_sum[time_s], label='True Seasonal 100(2)')
h34, = ax3.plot(idx[time_s],
                res_lf.freq_seasonal[0].filtered[time_s],
                label='Lazy Freq. Seas')
h35, = ax3.plot(idx[time_s],
                res_lt.seasonal.filtered[time_s],
                label='Lazy Time Seas')

plt.legend([h31, h32, h33, h34, h35], [
    'Double Freq. Seasonal', 'Mixed Domain Seasonal', 'Truth',
    'Lazy Freq. Seas', 'Lazy Time Seas'
],
           loc=1)
plt.title('Seasonal components combined')
plt.tight_layout(pad=1.0)

# ##### Conclusions
#
# In this notebook, we simulated a time series with two seasonal
# components of different periods.  We modeled them using structural time
# series models with (a) two frequency domain components of correct periods
# and numbers of harmonics, (b) time domain seasonal component for the
# shorter term and a frequency domain term with correct period and number of
# harmonics, (c) a single frequency domain term with the longer period and
# full number of harmonics, and (d) a single time domain term with the
# longer period.  We saw a variety of diagnostic results, with only the
# correct generating model, (a), failing to reject any of the tests.  Thus,
# more flexible seasonal modeling allowing for multiple components with
# specifiable harmonics can be a useful tool for time series modeling.
# Finally, we can represent seasonal components with fewer total states in
# this way, allowing for the user to attempt to make the bias-variance
# trade-off themselves instead of being forced to choose "lazy" models,
# which use a large number of states and incur additional variance as a
# result.