File: plot_estimate_gpd_dowjones.py

"""
Estimate a GPD on the Dow Jones Index data
==========================================
"""

# %%
# In this example, we illustrate various techniques of extreme value modeling applied
# to the 5-year series of daily Dow Jones Index closing prices.
# Readers may refer to [coles2001]_, example 1.8, for more details.
import math as m
import openturns as ot
import openturns.viewer as otv
from openturns.usecases import coles
import pandas as pd

# %%
# First, we load the Dow Jones dataset and plot it over time.
# We can see that the process is non-stationary.
full = pd.read_csv(coles.Coles().dowjones, index_col=0, parse_dates=True)
print(full[:10])
graph = ot.Graph(
    "Daily closing prices of the Dow Jones Index", "Day index", "Index", True, ""
)
# Note: the x-axis is a day index, not a calendar date; non-working days are absent
# from the series, so the actual calendar period is longer than the index range suggests.
size = len(full)
days = ot.Sample([[i] for i in range(size)])
dataDowJones = ot.Sample.BuildFromDataFrame(full)
curve = ot.Curve(days, dataDowJones)
curve.setColor("red")
graph.add(curve)
graph.setIntegerXTick(True)
view = otv.View(graph)

# %%
# In this example, the time dependence cannot
# be explained by trends or seasonal cycles. Many empirical
# studies suggest working with the logarithms of ratios of
# successive observations (the log-daily returns) to obtain an
# approximately stationary series. We apply that transformation:
#
# .. math::
#     \tilde{X}_i = \log X_i - \log X_{i-1}.
#
# The resulting time series appears to be reasonably close to stationarity.
transfDataDJ = ot.Sample(
    [
        [m.log(dataDowJones[i, 0]) - m.log(dataDowJones[i - 1, 0])]
        for i in range(1, size)
    ]
)
curve = ot.Curve(days[:-1], transfDataDJ)
graph = ot.Graph(
    "Log-daily returns of the Dow Jones Index", "Day index", "Index", True, ""
)
graph.add(curve)
view = otv.View(graph)


# %%
# For convenience of presentation, we rescale the data:
#
# .. math::
#    \tilde{X}_i \rightarrow 100\tilde{X}_i.
#
scalTransfDataDJ = transfDataDJ * 100.0
size = len(scalTransfDataDJ)

# %%
# In order to select a threshold above which the GPD model will be fitted, we draw
# the mean residual life plot with an approximate :math:`95\%` confidence interval.
# The curve is initially linear and shows significant curvature for
# :math:`u \in [-1, 2]`. Then, for :math:`u \geq 2`, the curve can be considered
# reasonably linear when judged against the confidence intervals. Hence, we
# choose the threshold :math:`u=2`. There are 37 exceedances of :math:`u`.
factory = ot.GeneralizedParetoFactory()
graph = factory.drawMeanResidualLife(scalTransfDataDJ)
view = otv.View(graph)
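
# %%
# As a complementary diagnostic, we may draw the parameter threshold stability
# plot (a sketch mirroring other OpenTURNS GPD examples; we assume the
# ``drawParameterThresholdStability`` method applies here, with an illustrative
# threshold range): the estimates :math:`(\hat{\sigma}, \hat{\xi})` should be
# roughly constant above a valid threshold.
u_range = ot.Interval(0.5, 2.5)
stabGrid = factory.drawParameterThresholdStability(scalTransfDataDJ, u_range)
view = otv.View(stabGrid)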
u = 2.0
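
# %%
# As a quick check of the claim above, we count the exceedances of :math:`u`
# directly (a minimal sketch):
n_exc = sum(1 for x in scalTransfDataDJ if x[0] > u)
print("Number of exceedances of u =", n_exc)  # expected: 37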

# %%
# **Stationary GPD modeling assuming independence in data**
#
# We first assume that the dependence between the transformed data
# is negligible, and we consider the data as
# independent observations over the observation period.
# We estimate the parameters of the GPD distribution modeling the excesses
# above :math:`u=2` by maximizing the log-likelihood of the excesses.
result_LL = factory.buildMethodOfLikelihoodMaximizationEstimator(scalTransfDataDJ, u)

# %%
# We get the fitted GPD and the estimated parameters :math:`(\hat{\sigma}, \hat{\xi})`.
fitted_GPD = result_LL.getDistribution()
desc = fitted_GPD.getParameterDescription()
param = fitted_GPD.getParameter()
print(", ".join([f"{p}: {value:.3f}" for p, value in zip(desc, param)]))
print("log-likelihood = ", result_LL.getLogLikelihood())

# %%
# We get the asymptotic distribution of the estimator :math:`(\hat{\sigma}, \hat{\xi})`.
# The threshold :math:`u` is not estimated: keeping it fixed ensures the regularity
# of the model and hence the asymptotic properties of the maximum likelihood
# estimators. This is why it appears as a Dirac distribution centered on
# the chosen threshold.
# In that case, the asymptotic distribution of :math:`(\hat{\sigma}, \hat{\xi})`
# is normal.
parameterEstimate = result_LL.getParameterDistribution()
print("Asymptotic distribution of the estimator : ")
print(parameterEstimate)

# %%
# We get the covariance matrix and the standard deviation of
# :math:`(\hat{\sigma}, \hat{\xi}, u)` where :math:`u` is deterministic.
print("Cov matrix = \n", parameterEstimate.getCovariance())
print("Standard dev = ", parameterEstimate.getStandardDeviation())

# %%
# We get the marginal confidence intervals of order 0.95 of
# :math:`(\hat{\sigma}, \hat{\xi})`.
order = 0.95
for i in range(2):  # exclude u parameter (fixed)
    ci = parameterEstimate.getMarginal(i).computeBilateralConfidenceInterval(order)
    print(desc[i] + ":", ci)
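
# %%
# As a sanity check (a minimal sketch), the interval for :math:`\hat{\sigma}`
# can be recovered from the normal approximation
# :math:`\hat{\sigma} \pm z_{0.975} \, sd(\hat{\sigma})`:
z = ot.Normal().computeQuantile(0.975)[0]  # about 1.96
mean = parameterEstimate.getMean()
stddev = parameterEstimate.getStandardDeviation()
print(f"sigma CI (normal approx.): [{mean[0] - z * stddev[0]:.4f}, {mean[0] + z * stddev[0]:.4f}]")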

# %%
# Finally, we can check the quality of the inference using the four usual
# diagnostic plots:
#
# - the probability-probability plot,
# - the quantile-quantile plot,
# - the return level plot,
# - the data histogram and the density of the fitted model.
#
# We conclude that the goodness-of-fit shown by the quantile plot seems unconvincing,
# even though the other plots appear reasonable. This is because
# the excesses cannot be considered independent: the transformed series
# :math:`\tilde{X}_i` has a rich structure of temporal dependence.
validation = ot.GeneralizedParetoValidation(result_LL, scalTransfDataDJ)
graph = validation.drawDiagnosticPlot()
view = otv.View(graph)

# %%
# **Stationary GPD modeling taking into account the dependence in data**
#
# We illustrate the fact that the excesses of the transformed series occur in
# groups. Hence we use the declustering method,
# which filters the dependent observations exceeding a given threshold to obtain a
# set of threshold excesses that can be assumed independent.
#
# Consecutive exceedances of :math:`u` belong to the same cluster. Two distinct
# clusters are separated by :math:`r` consecutive observations under the
# threshold. Within each cluster, we select the maximum value that will be used to
# infer the GPD distribution. The cluster maxima are assumed to be independent.
#
# On the graph, we show the clusters over the threshold :math:`u=2`
# and all the maxima selected within each cluster.
# It is possible to extract the data belonging to the same cluster and the
# cluster maximum series.
# We denote by :math:`n_c` the number of clusters and
# by :math:`n_u` the number of exceedances above :math:`u`.
part = ot.SamplePartition(scalTransfDataDJ)
r = 3
peaks, clusters = part.getPeakOverThreshold(u, r)
nc = len(peaks)
nu = sum([1 if scalTransfDataDJ[i, 0] > u else 0 for i in range(size)])
print(f"nc={nc} nu={u} theta={nc / nu:.3f}")
graph = clusters.draw(u)
graph.setTitle(
    "Threshold exceedances and clusters by transformed Dow Jones Index series"
)
view = otv.View(graph)
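
# %%
# For instance, the cluster maxima used below can be inspected directly
# (a minimal sketch):
print("First cluster maxima:")
print(peaks[:5])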

# %%
# We now estimate a stationary GPD on the cluster maxima, which can be assumed
# independent, and report each parameter estimate with its standard deviation.
result_LL = factory.buildMethodOfLikelihoodMaximizationEstimator(peaks, u)
sigma, xi, _ = result_LL.getParameterDistribution().getMean()
sigma_stddev, xi_stddev, _ = result_LL.getParameterDistribution().getStandardDeviation()
print(
    f"u={u} r={r} nc={nc} sigma={sigma:.2f} ({sigma_stddev:.2f}) xi={xi:.2f} ({xi_stddev:.2f})"
)

# %%
# We evaluate the :math:`T=100`-year return level, which corresponds to the
# :math:`m`-observation return level with :math:`m = T \times n_y`, where :math:`n_y`
# is the number of observations per year. Here, we have daily observations, hence
# :math:`n_y = 365`. To calculate it, we evaluate the extremal index
# :math:`\theta`, which is the inverse of the mean cluster length,
# estimated by the ratio of the number of clusters to the number
# of exceedances of :math:`u`.
theta = nc / nu
ny = 365
T = 100
xm_100 = factory.buildReturnLevelEstimator(result_LL, scalTransfDataDJ, T * ny, theta)
print(f"x100={xm_100.getMean()} ({xm_100.getStandardDeviation()}) theta={theta:.3f}")

# %%
# We plot the return level for the new fitted model that takes into account
# dependence between excesses.
# We can see that the fitted model works well. However, the large confidence intervals
# obtained for extreme return levels make it difficult to give reliable
# predictions with any degree of certainty.
validation = ot.GeneralizedParetoValidation(result_LL, peaks)
grid = validation.drawDiagnosticPlot()
rlPlot = grid.getGraph(1, 0)
rlPlot.setTitle(rlPlot.getTitle() + f" (u={u} r={r})")
view = otv.View(rlPlot)

# %%
# We plot the whole series of validation graphs of the new fitted model.
view = otv.View(grid)

# %%
otv.View.ShowAll()