1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
|
################
Categorical Data
################
.. note::
As of XGBoost 1.6, the feature is experimental and has limited features. Only the
Python package is fully supported.
.. versionadded:: 3.0
Support for the R package using ``factor``.
Starting from version 1.5, the XGBoost Python package has experimental support for
categorical data available for public testing. For numerical data, the split condition is
defined as :math:`value < threshold`, while for categorical data the split is defined
depending on whether partitioning or onehot encoding is used. For partition-based splits,
the splits are specified as :math:`value \in categories`, where ``categories`` is the set
of categories in one feature. If onehot encoding is used instead, then the split is
defined as :math:`value == category`. More advanced categorical split strategy is planned
for future releases and this tutorial details how to inform XGBoost about the data type.
************************************
Training with scikit-learn Interface
************************************
The easiest way to pass categorical data into XGBoost is using dataframe and the
``scikit-learn`` interface like :class:`XGBClassifier <xgboost.XGBClassifier>`. For
preparing the data, users need to specify the data type of input predictor as
``category``. For ``pandas/cudf Dataframe``, this can be achieved by
.. code:: python
X["cat_feature"].astype("category")
for all columns that represent categorical features. After which, users can tell XGBoost
to enable training with categorical data. Assuming that you are using the
:class:`XGBClassifier <xgboost.XGBClassifier>` for classification problem, specify the
parameter ``enable_categorical``:
.. code:: python
# Supported tree methods are `approx` and `hist`.
clf = xgb.XGBClassifier(tree_method="hist", enable_categorical=True, device="cuda")
# X is the dataframe we created in previous snippet
clf.fit(X, y)
# Must use JSON/UBJSON for serialization, otherwise the information is lost.
clf.save_model("categorical-model.json")
Once training is finished, most of other features can utilize the model. For instance one
can plot the model and calculate the global feature importance:
.. code:: python
# Get a graph
graph = xgb.to_graphviz(clf, num_trees=1)
# Or get a matplotlib axis
ax = xgb.plot_tree(clf, num_trees=1)
# Get feature importances
clf.feature_importances_
The ``scikit-learn`` interface from dask is similar to single node version. The basic
idea is create dataframe with category feature type, and tell XGBoost to use it by setting
the ``enable_categorical`` parameter. See :ref:`sphx_glr_python_examples_categorical.py`
for a worked example of using categorical data with ``scikit-learn`` interface with
one-hot encoding. A comparison between using one-hot encoded data and XGBoost's
categorical data support can be found :ref:`sphx_glr_python_examples_cat_in_the_dat.py`.
********************
Optimal Partitioning
********************
.. versionadded:: 1.6
Optimal partitioning is a technique for partitioning the categorical predictors for each
node split, the proof of optimality for numerical output was first introduced by `[1]
<#references>`__. The algorithm is used in decision trees `[2] <#references>`__, later
LightGBM `[3] <#references>`__ brought it to the context of gradient boosting trees and
now is also adopted in XGBoost as an optional feature for handling categorical
splits. More specifically, the proof by Fisher `[1] <#references>`__ states that, when
trying to partition a set of discrete values into groups based on the distances between a
measure of these values, one only needs to look at sorted partitions instead of
enumerating all possible permutations. In the context of decision trees, the discrete
values are categories, and the measure is the output leaf value. Intuitively, we want to
group the categories that output similar leaf values. During split finding, we first sort
the gradient histogram to prepare the contiguous partitions then enumerate the splits
according to these sorted values. One of the related parameters for XGBoost is
``max_cat_to_onehot``, which controls whether one-hot encoding or partitioning should be
used for each feature, see :ref:`cat-param` for details.
**********************
Using native interface
**********************
The ``scikit-learn`` interface is user friendly, but lacks some features that are only
available in native interface. For instance users cannot compute SHAP value directly.
Also native interface supports more data types. To use the native interface with
categorical data, we need to pass the similar parameter to :class:`~xgboost.DMatrix` or
:py:class:`~xgboost.QuantileDMatrix` and the :func:`train <xgboost.train>` function. For
dataframe input:
.. code:: python
# X is a dataframe we created in previous snippet
Xy = xgb.DMatrix(X, y, enable_categorical=True)
booster = xgb.train({"tree_method": "hist", "max_cat_to_onehot": 5}, Xy)
# Must use JSON for serialization, otherwise the information is lost
booster.save_model("categorical-model.json")
SHAP value computation:
.. code:: python
SHAP = booster.predict(Xy, pred_interactions=True)
# categorical features are listed as "c"
print(booster.feature_types)
For other types of input, like ``numpy array``, we can tell XGBoost about the feature
types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatrix>`:
.. code:: python
# "q" is numerical feature, while "c" is categorical feature
ft = ["q", "c", "c"]
X: np.ndarray = load_my_data()
assert X.shape[1] == 3
Xy = xgb.DMatrix(X, y, feature_types=ft, enable_categorical=True)
For numerical data, the feature type can be ``"q"`` or ``"float"``, while for categorical
feature it's specified as ``"c"``. The Dask module in XGBoost has the same interface so
:class:`dask.Array <dask.Array>` can also be used for categorical data. Lastly, the
sklearn interface :py:class:`~xgboost.XGBRegressor` has the same parameter.
****************
Data Consistency
****************
XGBoost accepts parameters to indicate which feature is considered categorical, either through the ``dtypes`` of a dataframe or through the ``feature_types`` parameter. However, XGBoost by itself doesn't store information on how categories are encoded in the first place. For instance, given an encoding schema that maps music genres to integer codes:
.. code-block:: python
{"acoustic": 0, "indie": 1, "blues": 2, "country": 3}
XGBoost doesn't know this mapping from the input and hence cannot store it in the model. The mapping usually happens in the users' data engineering pipeline with column transformers like :py:class:`sklearn.preprocessing.OrdinalEncoder`. To make sure correct result from XGBoost, users need to keep the pipeline for transforming data consistent across training and testing data. One should watch out for errors like:
.. code-block:: python
X_train["genre"] = X_train["genre"].astype("category")
reg = xgb.XGBRegressor(enable_categorical=True).fit(X_train, y_train)
# invalid encoding
X_test["genre"] = X_test["genre"].astype("category")
reg.predict(X_test)
In the above snippet, training data and test data are encoded separately, resulting in two different encoding schemas and invalid prediction result. See :ref:`sphx_glr_python_examples_cat_pipeline.py` for a worked example using ordinal encoder.
*************
Miscellaneous
*************
By default, XGBoost assumes input categories are integers starting from 0 till the number
of categories :math:`[0, n\_categories)`. However, user might provide inputs with invalid
values due to mistakes or missing values in training dataset. It can be negative value,
integer values that can not be accurately represented by 32-bit floating point, or values
that are larger than actual number of unique categories. During training this is
validated but for prediction it's treated as the same as not-chosen category for
performance reasons.
**********
References
**********
[1] Walter D. Fisher. "`On Grouping for Maximum Homogeneity`_". Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
[2] Trevor Hastie, Robert Tibshirani, Jerome Friedman. "`The Elements of Statistical Learning`_". Springer Series in Statistics Springer New York Inc. (2001).
[3] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. "`LightGBM\: A Highly Efficient Gradient Boosting Decision Tree`_." Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.
.. _On Grouping for Maximum Homogeneity: https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479
.. _The Elements of Statistical Learning: https://link.springer.com/book/10.1007/978-0-387-84858-7
.. _LightGBM\: A Highly Efficient Gradient Boosting Decision Tree: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
|