1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403
|
.. _sgd:
===========================
Stochastic Gradient Descent
===========================
.. currentmodule:: sklearn.linear_model
**Stochastic Gradient Descent (SGD)** is a simple yet very efficient
approach to discriminative learning of linear classifiers under
convex loss functions such as (linear) `Support Vector Machines
<http://en.wikipedia.org/wiki/Support_vector_machine>`_ and `Logistic
Regression <http://en.wikipedia.org/wiki/Logistic_regression>`_.
Even though SGD has been around in the machine learning community for
a long time, it has received a considerable amount of attention just
recently in the context of large-scale learning.
SGD has been successfully applied to large-scale and sparse machine
learning problems often encountered in text classification and natural
language processing. Given that the data is sparse, the classifiers
in this module easily scale to problems with more than 10^5 training
examples and more than 10^5 features.
The advantages of Stochastic Gradient Descent are:
+ Efficiency.
+ Ease of implementation (lots of opportunities for code tuning).
The disadvantages of Stochastic Gradient Descent include:
+ SGD requires a number of hyperparameters such as the regularization
parameter and the number of iterations.
+ SGD is sensitive to feature scaling.
Classification
==============
.. warning::
Make sure you permute (shuffle) your training data before fitting the
model or use `shuffle=True` to shuffle after each iterations.
The class :class:`SGDClassifier` implements a plain stochastic gradient
descent learning routine which supports different loss functions and
penalties for classification.
.. figure:: ../auto_examples/linear_model/images/plot_sgd_separating_hyperplane_1.png
:target: ../auto_examples/linear_model/plot_sgd_separating_hyperplane.html
:align: center
:scale: 75
As other classifiers, SGD has to be fitted with two arrays: an array X
of size [n_samples, n_features] holding the training samples, and an
array Y of size [n_samples] holding the target values (class labels)
for the training samples::
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2")
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, class_weight=None, eta0=0.0, fit_intercept=True,
learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
penalty='l2', power_t=0.5, rho=0.85, seed=0, shuffle=False,
verbose=0, warm_start=False)
After being fitted, the model can then be used to predict new values::
>>> clf.predict([[2., 2.]])
array([1])
SGD fits a linear model to the training data. The member `coef_` holds
the model parameters::
>>> clf.coef_
array([[ 9.90090187, 9.90090187]])
Member `intercept_` holds the intercept (aka offset or bias)::
>>> clf.intercept_ # doctest: +ELLIPSIS
array([-9.990...])
Whether or not the model should use an intercept, i.e. a biased
hyperplane, is controlled by the parameter `fit_intercept`.
To get the signed distance to the hyperplane use `decision_function`::
>>> clf.decision_function([[2., 2.]])
array([ 29.61357756])
The concrete loss function can be set via the `loss`
parameter. :class:`SGDClassifier` supports the following loss functions:
* `loss="hinge"`: (soft-margin) linear Support Vector Machine.
* `loss="modified_huber"`: smoothed hinge loss.
* `loss="log"`: Logistic Regression
The first two loss functions are lazy, they only update the model
parameters if an example violates the margin constraint, which makes
training very efficient. Log loss, on the other hand, provides
probability estimates.
In the case of binary classification and `loss="log"` you get a
probability estimate P(y=C|x) using `predict_proba`, where `C` is the
largest class label::
>>> clf = SGDClassifier(loss="log").fit(X, y)
>>> clf.predict_proba([[1., 1.]])
array([ 0.99999949])
The concrete penalty can be set via the `penalty` parameter. `SGD`
supports the following penalties:
* `penalty="l2"`: L2 norm penalty on `coef_`.
* `penalty="l1"`: L1 norm penalty on `coef_`.
* `penalty="elasticnet"`: Convex combination of L2 and L1; `rho * L2 + (1 - rho) * L1`.
The default setting is `penalty="l2"`. The L1 penalty leads to sparse
solutions, driving most coefficients to zero. The Elastic Net solves
some deficiencies of the L1 penalty in the presence of highly correlated
attributes. The parameter `rho` has to be specified by the user.
:class:`SGDClassifier` supports multi-class classification by combining
multiple binary classifiers in a "one versus all" (OVA) scheme. For each
of the `K` classes, a binary classifier is learned that discriminates
between that and all other `K-1` classes. At testing time, we compute the
confidence score (i.e. the signed distances to the hyperplane) for each
classifier and choose the class with the highest confidence. The Figure
below illustrates the OVA approach on the iris dataset. The dashed
lines represent the three OVA classifiers; the background colors show
the decision surface induced by the three classifiers.
.. figure:: ../auto_examples/linear_model/images/plot_sgd_iris_1.png
:target: ../auto_examples/linear_model/plot_sgd_iris.html
:align: center
:scale: 75
In the case of multi-class classification `coef_` is a two-dimensionaly
array of shape [n_classes, n_features] and `intercept_` is a one
dimensional array of shape [n_classes]. The i-th row of `coef_` holds
the weight vector of the OVA classifier for the i-th class; classes are
indexed in ascending order (see attribute `classes`).
:class:`SGDClassifier` supports both weighted classes and weighted
instances via the fit parameters `class_weight` and `sample_weight`. See
the examples below and the doc string of :meth:`SGDClassifier.fit` for
further information.
.. topic:: Examples:
- :ref:`example_linear_model_plot_sgd_separating_hyperplane.py`,
- :ref:`example_linear_model_plot_sgd_iris.py`
- :ref:`example_linear_model_plot_sgd_weighted_classes.py`
- :ref:`example_linear_model_plot_sgd_weighted_samples.py`
Regression
==========
The class :class:`SGDRegressor` implements a plain stochastic gradient
descent learning routine which supports different loss functions and
penalties to fit linear regression models. :class:`SGDRegressor` is
well suited for regression problems with a large number of training
samples (> 10.000), for other problems we recommend :class:`Ridge`,
:class:`Lasso`, or :class:`ElasticNet`.
.. figure:: ../auto_examples/linear_model/images/plot_sgd_ols_1.png
:target: ../auto_examples/linear_model/plot_sgd_ols.html
:align: center
:scale: 75
The concrete loss function can be set via the `loss`
parameter. :class:`SGDRegressor` supports the following loss functions:
* `loss="squared_loss"`: Ordinary least squares.
* `loss="huber"`: Huber loss for robust regression.
The Huber loss function is an epsilon insensitive loss function for
robust regression. The width of the insensitive region has to be
specified via the parameter `epsilon`.
.. topic:: Examples:
- :ref:`example_linear_model_plot_sgd_ols.py`,
.. currentmodule:: sklearn.linear_model.sparse
Stochastic Gradient Descent for sparse data
===========================================
.. note:: The sparse implementation produces slightly different results
than the dense implementation due to a shrunk learning rate for the
intercept.
There is built-in support for sparse data given in any matrix in a format
supported by scipy.sparse. For maximum efficiency, however, use the CSR
matrix format as defined in `scipy.sparse.csr_matrix
<http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html>`_.
.. topic:: Examples:
- :ref:`example_document_classification_20newsgroups.py`
Complexity
==========
The major advantage of SGD is its efficiency, which is basically
linear in the number of training examples. If X is a matrix of size (n, p)
training has a cost of :math:`O(k n \bar p)`, where k is the number
of iterations (epochs) and :math:`\bar p` is the average number of
non-zero attributes per sample.
Recent theoretical results, however, show that the runtime to get some
desired optimization accuracy does not increase as the training set size increases.
Tips on Practical Use
=====================
* Stochastic Gradient Descent is sensitive to feature scaling, so it
is highly recommended to scale your data. For example, scale each
attribute on the input vector X to [0,1] or [-1,+1], or standardize
it to have mean 0 and variance 1. Note that the *same* scaling
must be applied to the test vector to obtain meaningful
results. This can be easily done using :class:`Scaler`::
from sklearn.preprocessing import Scaler
scaler = Scaler()
scaler.fit(X_train) # Don't cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) # apply same transformation to test data
If your attributes have an intrinsic scale (e.g. word frequencies or
indicator features) scaling is not needed.
* Finding a reasonable regularization term :math:`\alpha` is
best done using :class:`GridSearchCV`, usually in the
range `10.0**-np.arange(1,7)`.
* Empirically, we found that SGD converges after observing
approx. 10^6 training samples. Thus, a reasonable first guess
for the number of iterations is `n_iter = np.ceil(10**6 / n)`,
where `n` is the size of the training set.
* If you apply SGD to features extracted using PCA we found that
it is often wise to scale the feature values by some constant `c`
such that the average L2 norm of the training data equals one.
.. topic:: References:
* `"Efficient BackProp" <yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf>`_
Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks
of the Trade 1998.
.. _sgd_mathematical_formulation:
Mathematical formulation
========================
Given a set of training examples :math:`(x_1, y_1), \ldots, (x_n, y_n)` where
:math:`x_i \in \mathbf{R}^n` and :math:`y_i \in \{-1,1\}`, our goal is to
learn a linear scoring function :math:`f(x) = w^T x + b` with model parameters
:math:`w \in \mathbf{R}^m` and intercept :math:`b \in \mathbf{R}`. In order
to make predictions, we simply look at the sign of :math:`f(x)`.
A common choice to find the model parameters is by minimizing the regularized
training error given by
.. math::
E(w,b) = \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)
where :math:`L` is a loss function that measures model (mis)fit and
:math:`R` is a regularization term (aka penalty) that penalizes model
complexity; :math:`\alpha > 0` is a non-negative hyperparameter.
Different choices for :math:`L` entail different classifiers such as
- Hinge: (soft-margin) Support Vector Machines.
- Log: Logistic Regression.
- Least-Squares: Ridge Regression.
All of the above loss functions can be regarded as an upper bound on the
misclassification error (Zero-one loss) as shown in the Figure below.
.. figure:: ../auto_examples/linear_model/images/plot_sgd_loss_functions_1.png
:align: center
:scale: 75
Popular choices for the regularization term :math:`R` include:
- L2 norm: :math:`R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2`,
- L1 norm: :math:`R(w) := \sum_{i=1}^{n} |w_i|`, which leads to sparse
solutions.
- Elastic Net: :math:`R(w) := \rho \frac{1}{2} \sum_{i=1}^{n} w_i^2 + (1-\rho) \sum_{i=1}^{n} |w_i|`, a convex combination of L2 and L1.
The Figure below shows the contours of the different regularization terms
in the parameter space when :math:`R(w) = 1`.
.. figure:: ../auto_examples/linear_model/images/plot_sgd_penalties_1.png
:align: center
:scale: 75
SGD
---
Stochastic gradient descent is an optimization method for unconstrained
optimization problems. In contrast to (batch) gradient descent, SGD
approximates the true gradient of :math:`E(w,b)` by considering a
single training example at a time.
The class :class:`SGDClassifier` implements a first-order SGD learning
routine. The algorithm iterates over the training examples and for each
example updates the model parameters according to the update rule given by
.. math::
w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w}
+ \frac{\partial L(w^T x_i + b, y_i)}{\partial w})
where :math:`\eta` is the learning rate which controls the step-size in
the parameter space. The intercept :math:`b` is updated similarly but
without regularization.
The learning rate :math:`\eta` can be either constant or gradually decaying. For
classification, the default learning rate schedule (`learning_rate='optimal'`)
is given by
.. math::
\eta^{(t)} = \frac {1}{\alpha (t_0 + t)}
where :math:`t` is the time step (there are a total of `n_samples * epochs`
time steps), :math:`t_0` is determined based on a heuristic proposed by Léon Bottou
such that the expected initial updates are comparable with the expected
size of the weights (this assuming that the norm of the training samples is
approx. 1). See `"The Tradeoffs of Large Scale Machine Learning" <http://leon.bottou.org/slides/largescale/lstut.pdf>`_ by Léon Bottou for further details.
For regression, the default learning rate schedule, inverse scaling
(`learning_rate='invscaling'`), is given by
.. math::
\eta^{(t)} = \frac{eta_0}{t^{power\_t}}
where :math:`eta_0` and :math:`power\_t` are hyperparameters choosen by the
user via `eta0` and `power_t`, resp.
For a constant learning rate use `learning_rate='constant'` and use `eta0`
to specify the learning rate.
The model parameters can be accessed through the members `coef\_` and
`intercept\_`:
- Member `coef\_` holds the weights :math:`w`
- Member `intercept\_` holds :math:`b`
.. topic:: References:
* `"Solving large scale linear prediction problems using stochastic
gradient descent algorithms"
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377>`_
T. Zhang - In Proceedings of ICML '04.
* `"Regularization and variable selection via the elastic net"
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696>`_
H. Zou, T. Hastie - Journal of the Royal Statistical Society Series B,
67 (2), 301-320.
Implementation details
======================
The implementation of SGD is influenced by the `Stochastic Gradient SVM
<http://leon.bottou.org/projects/sgd>`_ of Léon Bottou. Similar to SvmSGD,
the weight vector is represented as the product of a scalar and a vector
which allows an efficient weight update in the case of L2 regularization.
In the case of sparse feature vectors, the intercept is updated with a
smaller learning rate (multiplied by 0.01) to account for the fact that
it is updated more frequently. Training examples are picked up sequentially
and the learning rate is lowered after each observed example. We adopted the
learning rate schedule from Shalev-Shwartz et al. 2007.
For multi-class classification, a "one versus all" approach is used.
We use the truncated gradient algorithm proposed by Tsuruoka et al. 2009
for L1 regularization (and the Elastic Net).
The code is written in Cython.
.. topic:: References:
* `"Stochastic Gradient Descent" <http://leon.bottou.org/projects/sgd>`_ L. Bottou - Website, 2010.
* `"The Tradeoffs of Large Scale Machine Learning" <http://leon.bottou.org/slides/largescale/lstut.pdf>`_ L. Bottou - Website, 2011.
* `"Pegasos: Primal estimated sub-gradient solver for svm"
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.8513>`_
S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML '07.
* `"Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty"
<http://www.aclweb.org/anthology/P/P09/P09-1054.pdf>`_
Y. Tsuruoka, J. Tsujii, S. Ananiadou - In Proceedings of the AFNLP/ACL '09.
|