GBDT for regression and classification

Posted May 25, 2020 · 5 min read

The boosting method uses an additive model (that is, a linear combination of basis functions) together with a forward stagewise algorithm. The boosting method that uses decision trees as basis functions is called the boosting tree. GBDT (Gradient Boosting Decision Tree) was first proposed by Friedman in the paper "Greedy Function Approximation: A Gradient Boosting Machine", and it can be used to solve classification problems (where the decision tree is a binary classification tree) as well as regression problems (where the decision tree is a binary regression tree).
The additive model of GBDT is defined as,
$$\begin{equation} f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m) \tag{1} \end{equation}$$
where $x$ is an input sample, $T(x; \Theta_m)$ is a decision tree, and $\Theta_m$ is the parameter of the decision tree.
The optimal model is obtained by minimizing the loss function,
$$\begin{equation} f_M^* = \mathop{\arg\min}_{f_M} \sum_{i=1}^{N} L(y^{(i)}, f_M(x^{(i)})) \tag{2} \end{equation}$$
$f_M^*$ is not easy to solve directly; a locally optimal solution can be obtained iteratively with a greedy strategy, that is, the forward stagewise algorithm.
First set the initial boosting tree $f_0(x) = 0$; the model at step $m$ is,
$$\begin{equation} f_m(x) = f_{m-1}(x) + T(x; \Theta_m) \tag{3} \end{equation}$$
where $f_{m-1}(x)$ is the current model: the model at step $m$ equals the model at step $m-1$ plus the base estimator learned at step $m$. The parameter $\Theta_m$ of the $m$-th decision tree is determined by minimizing the loss function,
$$\begin{equation} \hat{\Theta}_m = \mathop{\arg\min}_{\Theta_m} \sum_{i=1}^{N} L(y^{(i)}, f_{m-1}(x^{(i)}) + T(x^{(i)}; \Theta_m)) \tag{4} \end{equation}$$
Expanding $L(y, f_m(x))$ in a first-order Taylor series around $f_{m-1}(x)$ gives,
$$\begin{equation} L(y, f_m(x)) \approx L(y, f_{m-1}(x)) + L'(y, f_{m-1}(x)) \, T(x; \Theta_m) \tag{5} \end{equation}$$
By analogy with gradient descent, to make $L(y, f_m(x)) < L(y, f_{m-1}(x))$ we can take $T(x; \Theta_m) = -\eta L'(y, f_{m-1}(x))$. Letting $\eta = 1$ and substituting into equation (3),
$$\begin{equation} f_m(x) = f_{m-1}(x) - L'(y, f_{m-1}(x)) \tag{6} \end{equation}$$
Therefore, solving for the parameter $\Theta_m$ of the decision tree $T(x; \Theta_m)$ amounts to fitting a decision tree to the negative gradient of the loss function $L$ at the current model $f_{m-1}(x)$, which plays the role of the (approximate) residual in the boosting tree algorithm,
$$\begin{equation} -\left[ \frac{\partial L(y^{(i)}, f(x^{(i)}))}{\partial f(x^{(i)})} \right]_{f(x) = f_{m-1}(x)} \tag{7} \end{equation}$$
Note that if the loss function is the squared error loss, the negative gradient is exactly the residual.
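For example, with the squared error loss $L(y, f(x)) = \frac{1}{2}(y - f(x))^2$, the quantity in equation (7) reduces to the ordinary residual:
$$-\frac{\partial}{\partial f(x)} \left[ \frac{1}{2}(y - f(x))^2 \right] = y - f(x)$$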

Regression problem

First, compute $f_0(x)$,
$$\begin{equation} f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y^{(i)}, c) \tag{8} \end{equation}$$
In each round, the approximate residuals are computed for all samples,
$$\begin{equation} r_{m,i} = -\left[ \frac{\partial L(y^{(i)}, f(x^{(i)}))}{\partial f(x^{(i)})} \right]_{f(x) = f_{m-1}(x)} \tag{9} \end{equation}$$
Fit a CART regression tree to all $r_{m,i}, i = 1, \cdots, N$. Suppose the tree divides the input space into $J_m$ regions ($J_m$ leaf nodes); denote the $j$-th leaf-node region by $R_{m,j}, j = 1, \cdots, J_m$, and compute an output value $c_{m,j}$ for each leaf-node region,
$$\begin{equation} c_{m,j} = \arg\min_c \sum_{x^{(i)} \in R_{m,j}} L(y^{(i)}, f_{m-1}(x^{(i)}) + c) \tag{10} \end{equation}$$
Update the model at round $m$,
$$\begin{equation} f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} c_{m,j} I(x \in R_{m,j}) \tag{11} \end{equation}$$
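As a concrete illustration, here is a minimal Python sketch of the regression procedure above, assuming the squared error loss and using scikit-learn's `DecisionTreeRegressor` as the CART learner; the number of rounds `M`, the tree depth, and the shrinkage factor `eta` are illustrative choices, not values prescribed by the text.

```python
# Minimal GBDT regression sketch following equations (8)-(11), assuming squared error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gbdt_regression_fit(X, y, M=100, max_depth=3, eta=0.1):
    # Equation (8): for squared loss the optimal constant is the mean of y.
    f0 = float(np.mean(y))
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        # Equation (9): for squared loss the negative gradient is the residual y - f_{m-1}(x).
        residual = y - pred
        # Fit a CART regression tree to the residuals. With squared loss, the mean of the
        # residuals in each leaf is already the optimal c_{m,j} of equation (10), so the
        # tree's own leaf values can be used directly.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        # Equation (11), with an extra shrinkage factor eta (a common practical addition,
        # not part of the equations above).
        pred += eta * tree.predict(X)
        trees.append(tree)
    return f0, trees


def gbdt_regression_predict(X, f0, trees, eta=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += eta * tree.predict(X)
    return pred
```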

Classification problem

Note that even when GBDT is used for classification, the base learners are CART regression trees.

Multi-class classification

For multi-class classification, suppose the total number of classes is $K$ and the loss function is the log loss,
$$\begin{equation} L(y, f(x)) = -\sum_{k=1}^{K} y_k \log p_k(x) \tag{12} \end{equation}$$
where $y$ is the one-hot encoding of the label; $y_k = 1$ means the sample belongs to class $k$.
In the multi-class case, each round trains one CART regression tree per class, i.e., $K$ trees per round, so each class $k$ has its own prediction model $f_k(x)$. The predicted probability that sample $x$ belongs to class $k$ is,
$$\begin{equation} p_k(x) = \frac{\exp(f_k(x))}{\sum_{l=1}^{K} \exp(f_l(x))} \tag{13} \end{equation}$$
In the $m$-th round, the decision tree $T_{m,k}(x)$ for class $k$ is fitted to the negative gradient of the loss function at the current model $f_{m-1,k}(x)$; for the $i$-th sample, the pseudo-residual is,
$$\begin{equation} \begin{aligned} r_{m,k,i} &= -\left[ \frac{\partial L(y^{(i)}, f(x^{(i)}))}{\partial f(x^{(i)})} \right]_{f(x) = f_{m-1,k}(x)} \\ &= \sum_{l'=1, l' \neq k}^{K} \left(-y^{(i)}_{l'}\right) \frac{\exp(f_{m-1,k}(x^{(i)}))}{\sum_{l=1}^{K} \exp(f_{m-1,l}(x^{(i)}))} + y^{(i)}_k \left(1 - \frac{\exp(f_{m-1,k}(x^{(i)}))}{\sum_{l=1}^{K} \exp(f_{m-1,l}(x^{(i)}))}\right) \\ &= y^{(i)}_k - p_{m-1,k}(x^{(i)}) \end{aligned} \tag{14} \end{equation}$$
Fitting a decision tree to these pseudo-residuals gives the $m$-th base estimator $T_{m,k}(x)$ for class $k$. The tree divides the input space into $J_{m,k}$ regions (that is, $J_{m,k}$ leaf nodes), and the output value $c_{m,k,j}$ of the $j$-th region is computed as,
$$\begin{equation} c_{m,k,j} = \arg\min_c \sum_{x^{(i)} \in R_{m,k,j}} L(y^{(i)}, f_{m-1,k}(x^{(i)}) + c) \tag{15} \end{equation}$$
The original paper gives an approximate solution (the derivation is omitted here),
$$\begin{equation} c_{m,k,j} = \frac{K-1}{K} \; \frac{\sum\limits_{x^{(i)} \in R_{m,k,j}} r_{m,k,i}}{\sum\limits_{x^{(i)} \in R_{m,k,j}} |r_{m,k,i}| (1 - |r_{m,k,i}|)} \tag{16} \end{equation}$$
Update the model for class $k$ at round $m$,
$$\begin{equation} f_{m,k}(x) = f_{m-1,k}(x) + \sum_{j=1}^{J_{m,k}} c_{m,k,j} I(x \in R_{m,k,j}) \tag{17} \end{equation}$$
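Below is a minimal Python sketch of the multi-class procedure, assuming integer class labels $0, \cdots, K-1$ and scikit-learn's `DecisionTreeRegressor` as the base learner; the hyperparameters and the small constant added to the denominator are illustrative choices.

```python
# Minimal multi-class GBDT sketch following equations (12)-(17): each round fits one
# regression tree per class to the pseudo-residuals y_k - p_k, then replaces the leaf
# values with the (K-1)/K approximation of equation (16).
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def softmax(F):
    # Equation (13); F has shape (n_samples, K).
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def gbdt_multiclass_fit(X, y, K, M=50, max_depth=3):
    # y is assumed to hold integer class labels 0..K-1.
    Y = np.eye(K)[y]                       # one-hot labels y_k
    F = np.zeros((X.shape[0], K))          # f_{0,k}(x) = 0
    forests = [[] for _ in range(K)]
    for _ in range(M):
        P = softmax(F)                     # p_{m-1,k}(x), equation (13)
        for k in range(K):
            r = Y[:, k] - P[:, k]          # pseudo-residual, equation (14)
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, r)
            leaf = tree.apply(X)           # leaf-node index of every training sample
            gamma = np.zeros(tree.tree_.node_count)
            for j in np.unique(leaf):
                rj = r[leaf == j]
                denom = np.sum(np.abs(rj) * (1.0 - np.abs(rj))) + 1e-12
                gamma[j] = (K - 1) / K * rj.sum() / denom   # equation (16)
            F[:, k] += gamma[leaf]         # equation (17)
            forests[k].append((tree, gamma))
    return forests
```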

Binary classification

The loss is computed with the log loss function,
$$\begin{equation} L(y, f(x)) = -y \log \hat{y} - (1-y) \log(1 - \hat{y}) \tag{18} \end{equation}$$
where $\hat{y} = p(y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))}$.
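Substituting $\hat{y} = \frac{1}{1 + \exp(-f(x))}$ into equation (18) and differentiating shows where the pseudo-residual used below comes from:
$$L(y, f(x)) = \log\left(1 + e^{f(x)}\right) - y f(x), \qquad -\frac{\partial L(y, f(x))}{\partial f(x)} = y - \frac{1}{1 + e^{-f(x)}} = y - \hat{y}$$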
In round $m$, the pseudo-residual of sample $i$ is $r_{m,i} = y^{(i)} - \hat{y}^{(i)}_{m-1}$. Fit the $m$-th decision tree to these pseudo-residuals; the approximate output value of the $j$-th leaf-node region is,
$$\begin{equation} c_{m,j} = \frac{\sum_{x^{(i)} \in R_{m,j}} r_{m,i}}{\sum_{x^{(i)} \in R_{m,j}} (y^{(i)} - r_{m,i})(1 - y^{(i)} + r_{m,i})} \tag{19} \end{equation}$$
Note that if $y^{(i)} = 1$ then $r_{m,i} > 0$, and if $y^{(i)} = 0$ then $r_{m,i} < 0$, so equation (19) can be written as,
$$\begin{equation} c_{m,j} = \frac{\sum_{x^{(i)} \in R_{m,j}} r_{m,i}}{\sum_{x^{(i)} \in R_{m,j}} |r_{m,i}| (1 - |r_{m,i}|)} \tag{20} \end{equation}$$
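Finally, a minimal Python sketch of the binary case, assuming 0/1 labels and scikit-learn's `DecisionTreeRegressor`; the log-odds initialization and the small constant in the denominator are illustrative choices, and the leaf outputs follow equation (19).

```python
# Minimal binary-classification GBDT sketch (log loss), following equations (18)-(20):
# each round fits a regression tree to the pseudo-residuals y - y_hat and sets the leaf
# outputs with equation (19), whose denominator equals the sum of y_hat * (1 - y_hat).
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))


def gbdt_binary_fit(X, y, M=100, max_depth=3):
    # Initialize f_0 with the log-odds of the positive class (the minimizer of
    # equation (8) under the log loss).
    p = float(np.mean(y))
    f0 = np.log(p / (1.0 - p))
    F = np.full(X.shape[0], f0)
    trees = []
    for _ in range(M):
        y_hat = sigmoid(F)
        r = y - y_hat                      # pseudo-residual r_{m,i} = y - y_hat
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)
        leaf = tree.apply(X)
        gamma = np.zeros(tree.tree_.node_count)
        for j in np.unique(leaf):
            rj = r[leaf == j]
            pj = y_hat[leaf == j]
            # Equation (19): (y - r)(1 - y + r) = y_hat * (1 - y_hat).
            gamma[j] = rj.sum() / (np.sum(pj * (1.0 - pj)) + 1e-12)
        F += gamma[leaf]                   # model update
        trees.append((tree, gamma))
    return f0, trees
```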

  1. Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29, 1189-1232. https://doi.org/10.1214/aos/1...
  2. Li Hang, Statistical Learning Methods.
  3. https://zhuanlan.zhihu.com/p/...
  4. https://zhuanlan.zhihu.com/p/...
  5. https://zhuanlan.zhihu.com/p/...