File: CHANGES

package info (click to toggle)
r-cran-gbm 2.1.8-1
links: PTS, VCS
area: main
in suites: bullseye
size: 1,224 kB
sloc: cpp: 7,329; ansic: 265; makefile: 9
file content (341 lines) | stat: -rw-r--r-- 14,021 bytes
parent folder | download | duplicates (5)
Changes in version 2.1

- The cross-validation loop is now parallelized. The functions attempt
  to guess a sensible number of cores to use, or the user can specify
  how many through new argument n.cores.
- A fair amount of code refactoring.
- Added type='response' for predict when distribution='adaboost'.
- Fixed a bug that caused offset not to be used if the first element
  of offset was 0.
- Updated predict.gbm and plot.gbm to cope with objects created using
  gbm version 1.6.
- Changed default value of verbose to 'CV'. gbm now defaults to letting
  the user know which block of CV folds it is running. If verbose=TRUE
  is specified, the final run of the model also prints its progress
  to screen as in earlier versions.
- Fixed bug that caused predict to return wrong result when
  distribution == 'multinomial' and length(n.trees) > 1.
- Fixed bug that caused n.trees to be wrong in relative.influence
  if no CV or validation set was used.
- Relative influence was computed wrongly when distribution="multinomial". Fixed.
- Cross-validation predictions now included in the output object.
- Fixed bug in relative.influence that caused labels to be wrong when sort.=TRUE.
- Modified interact.gbm to do additional sanity check, updated help file
- Fixed bug in interact.gbm so that it now works for distribution="multinomial"
- Modified predict.gbm to improve performance on large datasets

Changes in version 2.0

Lots of new features added so it warrants a change to the first digit
of the version number.

Major changes:
- Several new distributions are now available thanks to Harry Southworth 
and Daniel Edwards: multinomial and tdist.
- New distribution 'pairwise' for Learning to Rank Applications (LambdaMART), 
  including four different ranking measures, thanks to Stefan Schroedl.

- The gbm package is now managed on R-Forge by Greg Ridgeway and 
Harry Southworth. Visit http://r-forge.r-project.org/projects/gbm/
to get the latest or to contribute to the package

Minor changes:
- the "quantile" distribution now handles weighted data

- relative.influence changed to give names to the returned vector

- Added print.gbm and show.gbm. These give basic summaries of the fitted 
model

- Added support function and reconstructGBMdata() to facilitate reconstituting
the data for certain plots and summaries

- gbm was not using the weights when using cross-validation due to 
a bug. That's been fixed (Thanks to Trevor Hastie for catching this)

- predict.gbm now tries to guess the number of trees, also defaults to using the training data if no newdata is given. 

- relative.influence has has 2 new arguments, scale. and sort. that default to FALSE. The returned vector now has names. 

- gbm now tries to guess what distribution you meant if you didn't specify. 

- gbm has a new argument, class.stratifiy.cv, to control if cross-validation is stratified by class with distribution is "bernoulli" or "multinomial". Defaults to TRUE for multinomial, FALSE for bernoulli. The purpose is to avoid unusable training sets. 

- gbm.perf now puts a vertical line at the best number of trees when method = "cv" or "test". Tries to guess what method you meant if you don't tell it. 

- .First.lib had a bug that would crash gbm if gbm was installed as a local library. Fixed. 

- plot.gbm has a new argument, type, defaulting to "link". For bernoulli, multinomial, poisson, "response" is allowed.

- models with large interactions (>24) were using up all the terminal nodes in the stack. The stack has been increased to 101 nodes allowing interaction.depth up to 49. A more graceful error is now issued if interaction.depth exceeds 49. (Thanks to Tom Dietterich for catching this).

- gbm now uses the R macro R_NaN in the C++ code rather than NAN, which would not compile on Sun OS.

- If covariates marked missing values with NaN instead of NA, the model fit would not be consistent (Thanks to JR Lockwood for noting this)

Changes in version 1.6

- Quantile regression is now available thanks to a contribution from
Brian Kriegler. Use list(name="quantile",alpha=0.05) as the
distribution parameter to construct a predictor of the 5% of the
conditional distribution

- gbm() now stores cv.folds in the returned gbm object

- Added a normalize parameter to summary.gbm that allows one to
choose whether or not to normalize the variable influence to sum to
100 or not

- Corrected a minor bug in plot.gbm that put the wrong variable
label on the x axis when plotting a numeric variable and a factor
variable

- the C function gbm_plot can now handle missing values. This does
not effect the R function plot.gbm(), but it makes gbm_plot
potentially more useful for computing partial dependence plots

- mgcv is no longer a required package, but the splines package is
needed for calibrate.plot()

- minor changes for compatibility with R 2.6.0 (thanks to Seth
Falcon)

- corrected a bug in the cox model computation when all terminal
nodes had exactly the minimum number of observations permitted,
which caused gbm and R to crash ungracefully. This was likely to
occur with small datasets (thanks to Brian Ring)

- corrected a bug in Laplace that always made the terminal node
predictions slightly larger than the median. Corrected again in a
minor release due to a bug caught by Jon McAuliffe

- corrected a bug in interact.gbm that caused it to crash for factors.
Caught by David Carslaw

- added a plot of cross-validated error to the plots generated by gbm.perf


Changes in version 1.5
- gbm would fail if there was only one x. Now drop=FALSE is set in all
data.frame subsetting (thanks to Gregg Keller for noticing this).

- Corrected gbm.perf() to check if bag.fraction=1 and skips trying to
create the OOB plots and estimates.

- Corrected a typo in the vignette specifying the gradient for the Cox
model.

- Fixed the OOB-reps.R demo. For non-Gaussian cases it was maximizing the
deviance rather than minimizing.

- Increased the largest factor variable allowed from 256 levels to 1024
levels. gbm stops if any factor variable exceeds 1024. Will try to make this
cleaner in the future.

- predict.gbm now allows n.trees to be a vector and efficiently computes
predictions for each indicated model. Avoids having to call predict.gbm
several times for different choices of n.trees.

- fixed a bug that occurred when using cross-validation for coxph. Was
computing length(y) when y is a Surv object which return 2*N rather than N.
This generated out-of-range indices for the training dataset.

- Changed the method for extracting the name of the outcome variable to work
around a change in terms.formula() when using "." in formulas.

Changes in version 1.4

- The formula interface now allows for "-x" to indicate not including certain
variables in the model fit.

- Fixed the formula interface to allow offset(). The offset argument has now
been removed from gbm().

- Added basehaz.gbm that computes the Breslow estimate of the baseline hazard.
At a later stage this will be substituted with a call to survfit, which is
much more general handling not only left-censored data.

- OOB estimator is known to be conservative. A warning is now issued when
using method="OOB" and there is no longer a default method for gbm.perf()

- cv.folds now an option to gbm and method="cv" is an option for gbm.perf.
Performs v-fold cross validation for estimating the optimal number of
iterations

- There is now a package vignette with details on the user options and the
mathematics behind the gbm engine.

Changes in version 1.3

- All likelihood based loss functions are now in terms of Deviance (-2*log
likelihood). As a result, gbm always minimizes the loss. Previous versions
minimized losses for some choices of distribution and maximized a likelihood
for other choices.

- Fixed the Poisson regression to avoid predicting +/- infinity which occurs
when a terminal node has only observations with y=0. The largest predicted
value is now +/-19, similar to what glm predicts for these extreme cases for
linear Poisson regression. The shrinkage factor will be applied to the -19
predictions so it will take 1/shrinkage gbm iterations locating pure terminal
nodes before gbm would actually return a predicted value of +/-19.

- Introduces shrink.gbm.pred() that does a lasso-style variable selection
  Consider this function as still in an experimental phase.

- Bug fix in plot.gbm

- All calls to ISNAN now call ISNA (avoids using isnan)

Changes in version 1.2

- fixed gbm.object help file and updated the function to check for missing
values to the latest R standard.

- gbm.plot now allows i.var to be the names of the variables to plot or the
index of the variables used

- gbm now requires "stats" package into which "modreg" has been merged

- documentation for predict.gbm corrected

Changes in version 1.1

- all calculations of loss functions now compute averages rather than totals.
That is, all performance measures (text of progress, gbm.perf) now report
average log-likelihood rather than total log-likelihood (e.g. mean squared
error rather than sum of squared error). A slight exception applies to
distribution="coxph". For these models the averaging pertains only to the
uncensored observations. The denominator is sum(w[i]*delta[i]) rather than the
usual sum(w[i]).

- summary.gbm now has an experimental "method" argument. The default computes
the relative influence as before. The option "method=permutation.test.gbm"
performs a permutation test for the relative influence. Give it a try and let
me know how it works. It currently is not implemented for "distribution=coxph".

- added gbm.fit, a function that avoids the model.frame call, which is
tragically slow with lots of variables. gbm is now just a formula/model.frame
wrapper for the gbm.fit function. (based on a suggestion and code from Jim
Garrett)

- corrected a bug in the use of offsets. Now the user must pass the offset
vector with the offset argument rather than in the formula. Previously, offsets
were being used once as offsets and a second time as a predictor.

- predict.gbm now has a single.tree option. When set to TRUE the function will
return predictions from only that tree. The idea is that this may be useful for
reweighting the trees using a post-model fit adjustment.

- corrected a bug in CPoisson::BagImprovement that incorrectly computed the
bagged estimate of improvement

- corrected a bug for distribution="coxph" in gbm() and gbm.more(). If there
was a single predictor the functions would drop the unused array dimension issuing
an error.

- corrected gbm() distribution="coxph" when train.fraction=1.0. The program would
set two non-existant observations in the validation set and issue a warning.

- if a predictor variable has no variation a warning (rather than an error) is
now issued

- updated the documentation for calibrate.plot to match the implementation

- changed the some of the default values in gbm(), bag.fraction=0.5,
train.fraction=1.0, and shrinkage=0.001.

- corrected a bug in predict.gbm. The C code producing the predictions would
go into an infinite loop if predicting an observation with a level of a
categorical variable not seen in the training dataset. Now the routine uses
the missing value prediction. (Feng Zeng)

- added a "type" parameter to predict.gbm. The default ("link") is the same as
before, predictions are on the canonical scale (gradient scale). The new
option ("response") converts back the same scale as the outcome (probability
for bernoulli, mean for gaussian, etc.).

- gbm and gbm.more now have verbose options which can be set to FALSE to
suppress the progress and performance indicators. (several users requested
this nice feature)

- gbm.perf no longer prints out verbose information about the best iteration
estimate. It simply returns the estimate and creates the plots if requested.

- ISNAN, since R 1.8.0, R.h changed declarations for ISNAN(). These changes
broke gbm 1.0. I added the following code to buildinfo.h to fix this
    #ifdef IEEE_754
    #undef ISNAN
    #define ISNAN(x)   R_IsNaNorNA(x)
    #endif
seems to work now but I'll look for a more elegant solution.


Changes in version 0.8

- Additional documentation about the loss functions, graphics, and methods is
now available with the package
- Fixed the initial value for the adaboost exponential loss. Prior
to version 0.8 the initial value was 0.0, now half the baseline
log-odds

- Changes in some headers and #define's to compile under gcc 3.2
(Brian Ripley)


Changes in version 0.7

- gbm.perf, the argument named best.iter.calc has been renamed
"method" for greater simplicity

- all entries in the design matrix are now coerced to doubles
(Thanks to Bonnie Ghosh)

- now checks that all predictors are either numeric, ordinal, or
factor

- summary.gbm now reports the correct relative influence when some
variables do not enter the model. (Thanks to Hugh Chipman)

- renamed several #define'd variables in buildinfo.h so they do
not conflict with standard winerror.h names.


Planned future changes

1. Add weighted median functionality to Laplace

2. Automate the fitting process, ie, selecting shrinkage and
number of iterations

3. Add overlay factor*continuous predictor plot as an option
rather than lattice plots

4. Add multinomial and ordered logistic regression procedures


Thanks to

RAND for sponsoring the development of this software through
statistical methods funding.

Kurt Hornik, Brian Ripley, and Jan De Leeuw for helping me get gbm up to the R
standard and into CRAN.

Dan McCaffrey for testing and evangelizing the utility of this
program.

Bonnie Ghosh for finding bugs.

Arnab Mukherji for testing and suggesting new features.

Daniela Golinelli for finding bugs and marrying me.

Andrew Morral for suggesting improvements and finding new
applications of the method in the evaluation of drug treatment
programs.

Katrin Hambarsoomians for finding bugs.

Hugh Chipman for finding bugs.

Jim Garrett for many suggestions and contributions.