1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
|
############
Tree Methods
############
For training boosted tree models, there are 2 parameters used for choosing algorithms,
namely ``updater`` and ``tree_method``. XGBoost has 3 builtin tree methods, namely
``exact``, ``approx`` and ``hist``. Along with these tree methods, there are also some
free standing updaters including ``refresh``, ``prune`` and ``sync``. The parameter
``updater`` is more primitive than ``tree_method`` as the latter is just a
pre-configuration of the former. The difference is mostly due to historical reasons that
each updater requires some specific configurations and might has missing features. As we
are moving forward, the gap between them is becoming more and more irrelevant. We will
collectively document them under tree methods.
**************
Exact Solution
**************
Exact means XGBoost considers all candidates from data for tree splitting, but underlying
the objective is still interpreted as a Taylor expansion.
1. ``exact``: The vanilla gradient boosting tree algorithm described in `reference paper
<http://arxiv.org/abs/1603.02754>`_. During split-finding, it iterates over all
entries of input data. It's more accurate (among other greedy methods) but
computationally slower in compared to other tree methods. Further more, its feature
set is limited. Features like distributed training and external memory that require
approximated quantiles are not supported. This tree method can be used with the
parameter ``tree_method`` set to ``exact``.
**********************
Approximated Solutions
**********************
As ``exact`` tree method is slow in computation performance and difficult to scale, we
often employ approximated training algorithms. These algorithms build a gradient
histogram for each node and iterate through the histogram instead of real dataset. Here
we introduce the implementations in XGBoost.
1. ``approx`` tree method: An approximation tree method described in `reference paper
<http://arxiv.org/abs/1603.02754>`_. It runs sketching before building each tree
using all the rows (rows belonging to the root). Hessian is used as weights during
sketch. The algorithm can be accessed by setting ``tree_method`` to ``approx``.
2. ``hist`` tree method: An approximation tree method used in LightGBM with slight
differences in implementation. It runs sketching before training using only user
provided weights instead of hessian. The subsequent per-node histogram is built upon
this global sketch. This is the fastest algorithm as it runs sketching only once. The
algorithm can be accessed by setting ``tree_method`` to ``hist``.
************
Implications
************
Some objectives like ``reg:squarederror`` have constant hessian. In this case, the
``hist`` should be preferred as weighted sketching doesn't make sense with constant
weights. When using non-constant hessian objectives, sometimes ``approx`` yields better
accuracy, but with slower computation performance. Most of the time using ``hist`` with
higher ``max_bin`` can achieve similar or even superior accuracy while maintaining good
performance. However, as xgboost is largely driven by community effort, the actual
implementations have some differences than pure math description. Result might be
slightly different than expectation, which we are currently trying to overcome.
**************
Other Updaters
**************
1. ``Prune``: It prunes the existing trees. ``prune`` is usually used as part of other
tree methods. To use pruner independently, one needs to set the process type to update
by: ``{"process_type": "update", "updater": "prune"}``. With this set of parameters,
during training, XGBoost will prune the existing trees according to 2 parameters
``min_split_loss (gamma)`` and ``max_depth``.
2. ``Refresh``: Refresh the statistic of built trees on a new training dataset. Like the
pruner, To use refresh independently, one needs to set the process type to update:
``{"process_type": "update", "updater": "refresh"}``. During training, the updater
will change statistics like ``cover`` and ``weight`` according to the new training
dataset. When ``refresh_leaf`` is also set to true (default), XGBoost will update the
leaf value according to the new leaf weight, but the tree structure (split condition)
itself doesn't change.
There are examples on both training continuation (adding new trees) and using update
process on ``demo/guide-python``. Also checkout the ``process_type`` parameter in
:doc:`parameter`.
3. ``Sync``: Synchronize the tree among workers when running distributed training.
****************
Removed Updaters
****************
3 Updaters were removed during development due to maintainability. We describe them here
solely for the interest of documentation.
1. Distributed colmaker, which was a distributed version of exact tree method. It
required specialization for column based splitting strategy and a different prediction
procedure. As the exact tree method is slow by itself and scaling is even less
efficient, we removed it entirely.
2. ``skmaker``. Per-node weighted sketching employed by ``grow_local_histmaker`` is slow,
the ``skmaker`` was unmaintained and seems to be a workaround trying to eliminate the
histogram creation step and uses sketching values directly during split evaluation. It
was never tested and contained some unknown bugs, we decided to remove it and focus our
resources on more promising algorithms instead. For accuracy, most of the time
``approx`` and ``hist`` are enough with some parameters tuning, so removing them don't
have any real practical impact.
3. ``grow_local_histmaker`` updater: An approximation tree method described in `reference
paper <http://arxiv.org/abs/1603.02754>`_. This updater was rarely used in practice so
it was still an updater rather than tree method. During split finding, it first runs a
weighted GK sketching for data points belong to current node to find split candidates,
using hessian as weights. The histogram is built upon this per-node sketch. It was
faster than ``exact`` in some applications, but still slow in computation. It was
removed because it depended on Rabit's customized reduction function that handles all
the data structure that can be serialized/deserialized into fixed size buffer, which is
not directly supported by NCCL or federated learning gRPC, making it hard to refactor
into a common allreducer interface.
**************
Feature Matrix
**************
Following table summarizes some differences in supported features between 4 tree methods,
`T` means supported while `F` means unsupported.
+------------------+-----------+---------------------+------------------------+---------------------+------------------------+
| | Exact | Approx | Approx (GPU) | Hist | Hist (GPU) |
+==================+===========+=====================+========================+=====================+========================+
| grow_policy | Depthwise | depthwise/lossguide | depthwise/lossguide | depthwise/lossguide | depthwise/lossguide |
+------------------+-----------+---------------------+------------------------+---------------------+------------------------+
| max_leaves | F | T | T | T | T |
+------------------+-----------+---------------------+------------------------+---------------------+------------------------+
| sampling method | uniform | uniform | gradient_based/uniform | uniform | gradient_based/uniform |
+------------------+-----------+---------------------+------------------------+---------------------+------------------------+
| categorical data | F | T | T | T | T |
+------------------+-----------+---------------------+------------------------+---------------------+------------------------+
| External memory | F | T | P | T | P |
+------------------+-----------+---------------------+------------------------+---------------------+------------------------+
| Distributed | F | T | T | T | T |
+------------------+-----------+---------------------+------------------------+---------------------+------------------------+
Features/parameters that are not mentioned here are universally supported for all 3 tree
methods (for instance, column sampling and constraints). The `P` in external memory means
special handling. Please note that both categorical data and external memory are
experimental.
|