#############################
Random Forests(TM) in XGBoost
#############################

XGBoost is normally used to train gradient-boosted decision trees and other
gradient-boosted models. Random forests use the same model representation and inference
as gradient-boosted decision trees, but a different training algorithm.  One can use
XGBoost to train a standalone random forest, or use a random forest as a base model for
gradient boosting.  Here we focus on training standalone random forests.

We have provided native APIs for training random forests since the early days, and added
a new Scikit-Learn wrapper after 0.82 (not included in 0.82).  Please note that the new
Scikit-Learn wrapper is still **experimental**, which means we might change the interface
whenever needed.

*****************************************
Standalone Random Forest With XGBoost API
*****************************************

The following parameters must be set to enable random forest training.

* ``booster`` should be set to ``gbtree``, as we are training forests. Note that as this
  is the default, this parameter needn't be set explicitly.
* ``subsample`` must be set to a value less than 1 to enable random selection of training
  cases (rows).
* One of the ``colsample_by*`` parameters must be set to a value less than 1 to enable
  random selection of columns. Normally, ``colsample_bynode`` would be set to a value
  less than 1 to randomly sample columns at each tree split.
* ``num_parallel_tree`` should be set to the size of the forest being trained.
* ``num_boost_round`` should be set to 1 to prevent XGBoost from boosting multiple random
  forests.  Note that this is a keyword argument to ``train()``, and is not part of the
  parameter dictionary.
* ``eta`` (alias: ``learning_rate``) must be set to 1 when training random forest
  regression.
* ``random_state`` can be used to seed the random number generator.


Other parameters should be set in a similar way to how they are set for gradient
boosting. For instance, ``objective`` will typically be ``reg:squarederror`` for
regression and ``binary:logistic`` for classification, ``lambda`` should be set according
to the desired regularization weight, etc.

If both ``num_parallel_tree`` and ``num_boost_round`` are greater than 1, training will
use a combination of the random forest and gradient boosting strategies. It will perform
``num_boost_round`` rounds, boosting a random forest of ``num_parallel_tree`` trees at
each round. If early stopping is not enabled, the final model will consist of
``num_parallel_tree`` * ``num_boost_round`` trees.
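
For illustration, here is a minimal sketch of such a combined setup; the synthetic data
and parameter values are illustrative only, not recommendations::

  import numpy as np
  import xgboost as xgb

  # Synthetic regression data, for illustration only.
  rng = np.random.default_rng(0)
  X = rng.standard_normal((256, 10))
  y = rng.standard_normal(256)
  dtrain = xgb.DMatrix(X, label=y)

  params = {
      "objective": "reg:squarederror",
      "subsample": 0.8,
      "colsample_bynode": 0.8,
      "num_parallel_tree": 4,  # forest size per boosting round
  }
  # 3 boosting rounds, each adding a forest of 4 trees.
  bst = xgb.train(params, dtrain, num_boost_round=3)
  # The model contains num_parallel_tree * num_boost_round = 12 trees.
  print(len(bst.get_dump()))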

Here is a sample parameter dictionary for training a random forest on a GPU using
xgboost::

  params = {
    "colsample_bynode": 0.8,
    "learning_rate": 1,
    "max_depth": 5,
    "num_parallel_tree": 100,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "tree_method": "hist",
    "device": "cuda",
  }

A random forest model can then be trained as follows::

  bst = train(params, dmatrix, num_boost_round=1)
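
Putting it together, a complete runnable sketch might look like the following; the
synthetic data is for illustration only, and ``device`` is set to ``cpu`` here in case
no GPU is available::

  import numpy as np
  import xgboost as xgb

  # Synthetic binary classification data, for illustration only.
  rng = np.random.default_rng(1994)
  X = rng.standard_normal((512, 8))
  y = (X[:, 0] + 0.5 * rng.standard_normal(512) > 0).astype(int)
  dmatrix = xgb.DMatrix(X, label=y)

  params = {
      "colsample_bynode": 0.8,
      "learning_rate": 1,
      "max_depth": 5,
      "num_parallel_tree": 100,
      "objective": "binary:logistic",
      "subsample": 0.8,
      "tree_method": "hist",
      "device": "cpu",  # use "cuda" to train on a GPU
  }
  # A single boosting round trains one forest of 100 trees.
  bst = xgb.train(params, dmatrix, num_boost_round=1)
  preds = bst.predict(dmatrix)  # predicted probabilities in [0, 1]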


***************************************************
Standalone Random Forest With Scikit-Learn-Like API
***************************************************

``XGBRFClassifier`` and ``XGBRFRegressor`` are SKL-like classes that provide random forest
functionality. They are basically versions of ``XGBClassifier`` and ``XGBRegressor`` that
train random forests instead of gradient boosting, with the default values and meanings of
some parameters adjusted accordingly. In particular:

* ``n_estimators`` specifies the size of the forest to be trained; it is converted to
  ``num_parallel_tree``, instead of the number of boosting rounds
* ``learning_rate`` is set to 1 by default
* ``colsample_bynode`` and ``subsample`` are set to 0.8 by default
* ``booster`` is always ``gbtree``
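
As a quick sanity check, these adjusted defaults can be inspected through the standard
scikit-learn ``get_params`` accessor; the expected values noted in the comments below
simply restate the defaults listed above::

    import xgboost as xgb

    # Inspect the random-forest-style defaults of the wrapper.
    reg = xgb.XGBRFRegressor()
    params = reg.get_params()
    print(params["learning_rate"])     # 1.0
    print(params["subsample"])         # 0.8
    print(params["colsample_bynode"])  # 0.8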

For a simple example, you can train a random forest regressor with::

    import xgboost as xgb
    from sklearn.model_selection import KFold

    # Your code ...

    kf = KFold(n_splits=2)
    for train_index, test_index in kf.split(X, y):
        xgb_model = xgb.XGBRFRegressor(random_state=42).fit(
            X[train_index], y[train_index])
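
For classification, ``XGBRFClassifier`` can be used in the same way. Here is a minimal
sketch using synthetic data (for illustration only)::

    import numpy as np
    import xgboost as xgb

    # Synthetic binary labels, for illustration only.
    rng = np.random.default_rng(7)
    X = rng.standard_normal((300, 6))
    y = (X[:, 0] > 0).astype(int)

    # n_estimators sets the forest size (it maps to num_parallel_tree).
    clf = xgb.XGBRFClassifier(n_estimators=100, random_state=42)
    clf.fit(X, y)
    proba = clf.predict_proba(X)  # class probabilities, shape (300, 2)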

Note that these classes have a smaller selection of parameters compared to using
``train()``. In particular, it is impossible to combine random forests with gradient
boosting using this API.


*******
Caveats
*******

* XGBoost uses a second-order approximation to the objective function. This can lead to
  results that differ from a random forest implementation that uses the exact value of
  the objective function.
* XGBoost does not perform replacement when subsampling training cases. Each training case
  can occur in a subsampled set either 0 or 1 time.