File: metadata_routing.rst

package info (click to toggle)
scikit-learn 1.4.2%2Bdfsg-8
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 25,036 kB
  • sloc: python: 201,105; cpp: 5,790; ansic: 854; makefile: 304; sh: 56; javascript: 20
file content (302 lines) | stat: -rw-r--r-- 13,102 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
.. currentmodule:: sklearn

.. TODO: update doc/conftest.py once document is updated and examples run.

.. _metadata_routing:

Metadata Routing
================

.. note::
  The Metadata Routing API is experimental, and is not implemented yet for many
  estimators. Please refer to the :ref:`list of supported and unsupported
  models <metadata_routing_models>` for more information. It may change without
  the usual deprecation cycle. By default this feature is not enabled. You can
  enable this feature  by setting the ``enable_metadata_routing`` flag to
  ``True``::

    >>> import sklearn
    >>> sklearn.set_config(enable_metadata_routing=True)

This guide demonstrates how metadata such as ``sample_weight`` can be routed
and passed along to estimators, scorers, and CV splitters through
meta-estimators such as :class:`~pipeline.Pipeline` and
:class:`~model_selection.GridSearchCV`. In order to pass metadata to a method
such as ``fit`` or ``score``, the object consuming the metadata, must *request*
it. For estimators and splitters, this is done via ``set_*_request`` methods,
e.g. ``set_fit_request(...)``, and for scorers this is done via the
``set_score_request`` method. For grouped splitters such as
:class:`~model_selection.GroupKFold`, a ``groups`` parameter is requested by
default. This is best demonstrated by the following examples.

If you are developing a scikit-learn compatible estimator or meta-estimator,
you can check our related developer guide:
:ref:`sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py`.

.. note::
  Note that the methods and requirements introduced in this document are only
  relevant if you want to pass :term:`metadata` (e.g. ``sample_weight``) to a method.
  If you're only passing ``X`` and ``y`` and no other parameter / metadata to
  methods such as :term:`fit`, :term:`transform`, etc, then you don't need to set
  anything.

Usage Examples
**************
Here we present a few examples to show different common use-cases. The examples
in this section require the following imports and data::

  >>> import numpy as np
  >>> from sklearn.metrics import make_scorer, accuracy_score
  >>> from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
  >>> from sklearn.model_selection import cross_validate, GridSearchCV, GroupKFold
  >>> from sklearn.feature_selection import SelectKBest
  >>> from sklearn.pipeline import make_pipeline
  >>> n_samples, n_features = 100, 4
  >>> rng = np.random.RandomState(42)
  >>> X = rng.rand(n_samples, n_features)
  >>> y = rng.randint(0, 2, size=n_samples)
  >>> my_groups = rng.randint(0, 10, size=n_samples)
  >>> my_weights = rng.rand(n_samples)
  >>> my_other_weights = rng.rand(n_samples)

Weighted scoring and fitting
----------------------------

Here :class:`~model_selection.GroupKFold` requests ``groups`` by default. However, we
need to explicitly request weights for our scorer and the internal cross validation of
:class:`~linear_model.LogisticRegressionCV`. Both of these *consumers* know how to use
metadata called ``sample_weight``::

  >>> weighted_acc = make_scorer(accuracy_score).set_score_request(
  ...     sample_weight=True
  ... )
  >>> lr = LogisticRegressionCV(
  ...     cv=GroupKFold(), scoring=weighted_acc,
  ... ).set_fit_request(sample_weight=True)
  >>> cv_results = cross_validate(
  ...     lr,
  ...     X,
  ...     y,
  ...     params={"sample_weight": my_weights, "groups": my_groups},
  ...     cv=GroupKFold(),
  ...     scoring=weighted_acc,
  ... )

Note that in this example, ``my_weights`` is passed to both the scorer and
:class:`~linear_model.LogisticRegressionCV`.

Error handling: if ``params={"sample_weigh": my_weights, ...}`` were passed
(note the typo), :func:`~model_selection.cross_validate` would raise an error,
since ``sample_weigh`` was not requested by any of its underlying objects.

Weighted scoring and unweighted fitting
---------------------------------------

When passing metadata such as ``sample_weight`` around, all ``sample_weight``
:term:`consumers <consumer>` require weights to be either explicitly requested
or not requested (i.e. ``True`` or ``False``) when used in another
:term:`router` such as a :class:`~pipeline.Pipeline` or a ``*GridSearchCV``. To
perform an unweighted fit, we need to configure
:class:`~linear_model.LogisticRegressionCV` to not request sample weights, so
that :func:`~model_selection.cross_validate` does not pass the weights along::

  >>> weighted_acc = make_scorer(accuracy_score).set_score_request(
  ...     sample_weight=True
  ... )
  >>> lr = LogisticRegressionCV(
  ...     cv=GroupKFold(), scoring=weighted_acc,
  ... ).set_fit_request(sample_weight=False)
  >>> cv_results = cross_validate(
  ...     lr,
  ...     X,
  ...     y,
  ...     cv=GroupKFold(),
  ...     params={"sample_weight": my_weights, "groups": my_groups},
  ...     scoring=weighted_acc,
  ... )

If :meth:`linear_model.LogisticRegressionCV.set_fit_request` has not
been called, :func:`~model_selection.cross_validate` will raise an
error because ``sample_weight`` is passed in but
:class:`~linear_model.LogisticRegressionCV` would not be explicitly configured
to recognize the weights.

Unweighted feature selection
----------------------------

Setting request values for metadata are only required if the object, e.g. estimator,
scorer, etc., is a consumer of that metadata Unlike
:class:`~linear_model.LogisticRegressionCV`, :class:`~feature_selection.SelectKBest`
doesn't consume weights and therefore no request value for ``sample_weight`` on its
instance is set and ``sample_weight`` is not routed to it::

  >>> weighted_acc = make_scorer(accuracy_score).set_score_request(
  ...     sample_weight=True
  ... )
  >>> lr = LogisticRegressionCV(
  ...     cv=GroupKFold(), scoring=weighted_acc,
  ... ).set_fit_request(sample_weight=True)
  >>> sel = SelectKBest(k=2)
  >>> pipe = make_pipeline(sel, lr)
  >>> cv_results = cross_validate(
  ...     pipe,
  ...     X,
  ...     y,
  ...     cv=GroupKFold(),
  ...     params={"sample_weight": my_weights, "groups": my_groups},
  ...     scoring=weighted_acc,
  ... )

Advanced: Different scoring and fitting weights
-----------------------------------------------

Despite :func:`~metrics.make_scorer` and
:class:`~linear_model.LogisticRegressionCV` both expecting the key
``sample_weight``, we can use aliases to pass different weights to different
consumers. In this example, we pass ``scoring_weight`` to the scorer, and
``fitting_weight`` to :class:`~linear_model.LogisticRegressionCV`::

  >>> weighted_acc = make_scorer(accuracy_score).set_score_request(
  ...    sample_weight="scoring_weight"
  ... )
  >>> lr = LogisticRegressionCV(
  ...     cv=GroupKFold(), scoring=weighted_acc,
  ... ).set_fit_request(sample_weight="fitting_weight")
  >>> cv_results = cross_validate(
  ...     lr,
  ...     X,
  ...     y,
  ...     cv=GroupKFold(),
  ...     params={
  ...         "scoring_weight": my_weights,
  ...         "fitting_weight": my_other_weights,
  ...         "groups": my_groups,
  ...     },
  ...     scoring=weighted_acc,
  ... )

API Interface
*************

A :term:`consumer` is an object (estimator, meta-estimator, scorer, splitter)
which accepts and uses some :term:`metadata` in at least one of its methods
(``fit``, ``predict``, ``inverse_transform``, ``transform``, ``score``,
``split``). Meta-estimators which only forward the metadata to other objects
(the child estimator, scorers, or splitters) and don't use the metadata
themselves are not consumers. (Meta-)Estimators which route metadata to other
objects are :term:`routers <router>`. A(n) (meta-)estimator can be a
:term:`consumer` and a :term:`router` at the same time. (Meta-)Estimators and
splitters expose a ``set_*_request`` method for each method which accepts at
least one metadata. For instance, if an estimator supports ``sample_weight`` in
``fit`` and ``score``, it exposes
``estimator.set_fit_request(sample_weight=value)`` and
``estimator.set_score_request(sample_weight=value)``. Here ``value`` can be:

- ``True``: method requests a ``sample_weight``. This means if the metadata is
  provided, it will be used, otherwise no error is raised.
- ``False``: method does not request a ``sample_weight``.
- ``None``: router will raise an error if ``sample_weight`` is passed. This is
  in almost all cases the default value when an object is instantiated and
  ensures the user sets the metadata requests explicitly when a metadata is
  passed. The only exception are ``Group*Fold`` splitters.
- ``"param_name"``: if this estimator is used in a meta-estimator, the
  meta-estimator should forward ``"param_name"`` as ``sample_weight`` to this
  estimator. This means the mapping between the metadata required by the
  object, e.g. ``sample_weight`` and what is provided by the user, e.g.
  ``my_weights`` is done at the router level, and not by the object, e.g.
  estimator, itself.

Metadata are requested in the same way for scorers using ``set_score_request``.

If a metadata, e.g. ``sample_weight``, is passed by the user, the metadata
request for all objects which potentially can consume ``sample_weight`` should
be set by the user, otherwise an error is raised by the router object. For
example, the following code raises an error, since it hasn't been explicitly
specified whether ``sample_weight`` should be passed to the estimator's scorer
or not::

    >>> param_grid = {"C": [0.1, 1]}
    >>> lr = LogisticRegression().set_fit_request(sample_weight=True)
    >>> try:
    ...     GridSearchCV(
    ...         estimator=lr, param_grid=param_grid
    ...     ).fit(X, y, sample_weight=my_weights)
    ... except ValueError as e:
    ...     print(e)
    [sample_weight] are passed but are not explicitly set as requested or not for
    LogisticRegression.score

The issue can be fixed by explicitly setting the request value::

    >>> lr = LogisticRegression().set_fit_request(
    ...     sample_weight=True
    ... ).set_score_request(sample_weight=False)

At the end we disable the configuration flag for metadata routing::

    >>> sklearn.set_config(enable_metadata_routing=False)

.. _metadata_routing_models:

Metadata Routing Support Status
*******************************
All consumers (i.e. simple estimators which only consume metadata and don't
route them) support metadata routing, meaning they can be used inside
meta-estimators which support metadata routing. However, development of support
for metadata routing for meta-estimators is in progress, and here is a list of
meta-estimators and tools which support and don't yet support metadata routing.


Meta-estimators and functions supporting metadata routing:

- :class:`sklearn.calibration.CalibratedClassifierCV`
- :class:`sklearn.compose.ColumnTransformer`
- :class:`sklearn.feature_selection.SelectFromModel`
- :class:`sklearn.linear_model.ElasticNetCV`
- :class:`sklearn.linear_model.LarsCV`
- :class:`sklearn.linear_model.LassoCV`
- :class:`sklearn.linear_model.LassoLarsCV`
- :class:`sklearn.linear_model.LogisticRegressionCV`
- :class:`sklearn.linear_model.MultiTaskElasticNetCV`
- :class:`sklearn.linear_model.MultiTaskLassoCV`
- :class:`sklearn.model_selection.GridSearchCV`
- :class:`sklearn.model_selection.HalvingGridSearchCV`
- :class:`sklearn.model_selection.HalvingRandomSearchCV`
- :class:`sklearn.model_selection.RandomizedSearchCV`
- :func:`sklearn.model_selection.cross_validate`
- :func:`sklearn.model_selection.cross_val_score`
- :func:`sklearn.model_selection.cross_val_predict`
- :class:`sklearn.multiclass.OneVsOneClassifier`
- :class:`sklearn.multiclass.OneVsRestClassifier`
- :class:`sklearn.multiclass.OutputCodeClassifier`
- :class:`sklearn.multioutput.ClassifierChain`
- :class:`sklearn.multioutput.MultiOutputClassifier`
- :class:`sklearn.multioutput.MultiOutputRegressor`
- :class:`sklearn.linear_model.OrthogonalMatchingPursuitCV`
- :class:`sklearn.multioutput.RegressorChain`
- :class:`sklearn.pipeline.Pipeline`

Meta-estimators and tools not supporting metadata routing yet:

- :class:`sklearn.compose.TransformedTargetRegressor`
- :class:`sklearn.covariance.GraphicalLassoCV`
- :class:`sklearn.ensemble.AdaBoostClassifier`
- :class:`sklearn.ensemble.AdaBoostRegressor`
- :class:`sklearn.ensemble.BaggingClassifier`
- :class:`sklearn.ensemble.BaggingRegressor`
- :class:`sklearn.ensemble.StackingClassifier`
- :class:`sklearn.ensemble.StackingRegressor`
- :class:`sklearn.ensemble.VotingClassifier`
- :class:`sklearn.ensemble.VotingRegressor`
- :class:`sklearn.feature_selection.RFE`
- :class:`sklearn.feature_selection.RFECV`
- :class:`sklearn.feature_selection.SequentialFeatureSelector`
- :class:`sklearn.impute.IterativeImputer`
- :class:`sklearn.linear_model.RANSACRegressor`
- :class:`sklearn.linear_model.RidgeClassifierCV`
- :class:`sklearn.linear_model.RidgeCV`
- :class:`sklearn.model_selection.learning_curve`
- :class:`sklearn.model_selection.permutation_test_score`
- :class:`sklearn.model_selection.validation_curve`
- :class:`sklearn.pipeline.FeatureUnion`
- :class:`sklearn.semi_supervised.SelfTrainingClassifier`