==========================================================================
Statistical learning: the setting and the estimator object in scikit-learn
==========================================================================
Datasets
=========
Scikit-learn deals with learning information from one or more
datasets that are represented as 2D arrays. They can be understood as a
list of multi-dimensional observations. We say that the first axis of
these arrays is the **samples** axis, while the second is the
**features** axis.
.. topic:: A simple example shipped with scikit-learn: iris dataset

    ::

        >>> from sklearn import datasets
        >>> iris = datasets.load_iris()
        >>> data = iris.data
        >>> data.shape
        (150, 4)

    It is made of 150 observations of irises, each described by 4
    features: their sepal and petal length and width, as detailed in
    ``iris.DESCR``.

When the data is not initially in the ``(n_samples, n_features)`` shape, it
needs to be preprocessed in order to be used by scikit-learn.
.. topic:: An example of reshaping data would be the digits dataset

    .. image:: /auto_examples/datasets/images/sphx_glr_plot_digits_last_image_001.png
        :target: ../../auto_examples/datasets/plot_digits_last_image.html
        :align: right
        :scale: 60

    The digits dataset is made of 1797 8x8 images of hand-written
    digits ::

        >>> digits = datasets.load_digits()
        >>> digits.images.shape
        (1797, 8, 8)
        >>> import matplotlib.pyplot as plt #doctest: +SKIP
        >>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r) #doctest: +SKIP
        <matplotlib.image.AxesImage object at ...>

    To use this dataset with scikit-learn, we transform each 8x8 image into a
    feature vector of length 64 ::

        >>> data = digits.images.reshape((digits.images.shape[0], -1))

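The reshaping step can be checked in isolation. The following is a minimal sketch that uses only NumPy and a small synthetic stack of images rather than the digits dataset; the names ``images`` and ``flat`` are illustrative and not part of scikit-learn.

```python
import numpy as np

# A hypothetical stack of 5 grayscale images of 8x8 pixels each.
images = np.arange(5 * 8 * 8, dtype=float).reshape(5, 8, 8)

# Keep the samples axis and flatten all remaining axes into one features axis.
flat = images.reshape((images.shape[0], -1))

print(flat.shape)  # (5, 64)

# Row i of the flattened array holds the pixels of image i in row-major order.
assert np.array_equal(flat[2], images[2].ravel())
```

Passing ``-1`` lets NumPy infer the size of the features axis, so the same line works for images of any resolution.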
Estimator objects
=================
.. Some code to make the doctests run

   >>> from sklearn.base import BaseEstimator
   >>> class Estimator(BaseEstimator):
   ...      def __init__(self, param1=0, param2=0):
   ...          self.param1 = param1
   ...          self.param2 = param2
   ...      def fit(self, data):
   ...          pass
   >>> estimator = Estimator()

**Fitting data**: the main API implemented by scikit-learn is that of the
`estimator`. An estimator is any object that learns from data;
it may be a classification, regression or clustering algorithm or
a *transformer* that extracts/filters useful features from raw data.
All estimator objects expose a ``fit`` method that takes a dataset
(usually a 2-d array):
>>> estimator.fit(data)
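To make the ``fit`` call concrete, here is a minimal sketch with one real estimator, ``StandardScaler``, chosen only for illustration; any other scikit-learn estimator could be used in its place. It is fitted on the iris data loaded earlier.

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data  # 2D array of shape (n_samples, n_features) = (150, 4)

# A transformer is one kind of estimator: it learns from the data in fit.
scaler = StandardScaler()
fitted = scaler.fit(X)

# scikit-learn estimators return themselves from fit, which allows chaining.
assert fitted is scaler
```

Supervised estimators such as classifiers and regressors take the target values as a second argument, as in ``fit(X, y)``.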
**Estimator parameters**: All the parameters of an estimator can be set
when it is instantiated or by modifying the corresponding attribute::

    >>> estimator = Estimator(param1=1, param2=2)
    >>> estimator.param1
    1

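The same two ways of setting parameters can be sketched with a real estimator. ``SVC`` and the parameter values below are arbitrary choices for illustration; ``get_params`` and ``set_params`` are the helpers that scikit-learn estimators inherit from ``BaseEstimator``.

```python
from sklearn.svm import SVC

# Parameters set at instantiation.
clf = SVC(C=1.0, kernel="rbf")

# Parameters changed afterwards, either by attribute or with set_params.
clf.C = 10.0
clf.set_params(kernel="linear")

# get_params reads the current values back as a dictionary.
params = clf.get_params()
print(params["C"], params["kernel"])  # 10.0 linear
```

``set_params`` returns the estimator itself and raises an error for unknown parameter names, which a plain attribute assignment does not do.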
**Estimated parameters**: When an estimator is fitted to data, its
parameters are estimated from the data at hand. All the estimated
parameters are attributes of the estimator object whose names end with an
underscore::

    >>> estimator.estimated_param_ #doctest: +SKIP

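As an illustration of estimated parameters, the sketch below fits ``LinearRegression`` to a few synthetic points that lie exactly on the line ``y = 2x + 1``. The data is made up for this example; ``coef_`` and ``intercept_`` are the fitted attributes of that estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2 * x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()

# Estimated parameters do not exist until the estimator has been fitted.
assert not hasattr(model, "coef_")

model.fit(X, y)

# After fitting, the learned slope and intercept are stored with a trailing underscore.
assert np.allclose(model.coef_, [2.0])
assert np.isclose(model.intercept_, 1.0)
```

The trailing underscore separates what was learned from the data (``coef_``) from what the user chose at construction time (for example ``fit_intercept``).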