This represents a major revision of the rpart code, driven by the desire
to add user-written split routines.
1. Bugfix -- the "maxdepth" option needed to be in front of "xval" in the
return list from rpart.control. The xval option can be of length 1 or n,
and the C code was assuming length 1. (If xval was a vector, its second value
was used for maxdepth.)
2. Changes to the rpart object
This was motivated partly by the fact that StatSci wants to incorporate rpart
into the product. We fixed a couple of design flaws now, before they got
cast into stone.
a. Removed the frame$splits component. The only routine that used it was
labels.rpart, and then it didn't use it by default. Now we compute the
labels as needed. Pre-computing was a bad idea when more/fewer digits were
wanted on the printout.
b. An additional component having to do with user-written split functions,
see below.
c. The component yval is now always the first component of the prediction.
If the prediction is of length >1, a second component yval2 is also returned:
a matrix containing the full response. For instance, each row of yval2 is
(event rate, #events) for poisson trees, and
(predicted class, class counts, class probabilities) for classification
trees.
Before, yval2 was the number of events for poisson, and the class
counts for classification, with yet another optional vector yprob for
the class probabilities. More discussion of why we did this is below.
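A quick way to see the yval/yval2 layout is to inspect the frame of a fitted
classification tree. The sketch below assumes the present-day R rpart package
and the built-in iris data, neither of which is part of these notes; the
current package may carry extra columns beyond those listed above.

```r
library(rpart)  # assumption: the current R rpart package is installed

fit <- rpart(Species ~ ., data = iris, method = "class")

# yval is a plain vector: one predicted class code per node.
head(fit$frame$yval)

# yval2 is a matrix with one row per node. Its first column repeats yval,
# followed by the class counts and the class probabilities.
y2 <- fit$frame$yval2
dim(y2)
```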
3. Simple printout change. Per the ongoing suggestion of Brian Ripley,
the print and summary routines now use options(digits), not digits-3.
This should be repaired in the survival routines as well; the -3 was not one
of my better ideas.
4. User-written splitting rules
A user can create their own splitting rules, and pass them to rpart as
a list of 3 functions: initialization, response, and splitting.
4a. Printing
One important side effect of this update is the printing of trees. The
print, summary, and text routines all had special if-then-else code to
treat each of the 4 current splitting methods as a special case for printout.
In order to make them extensible, this all had to go.
The initialization functions rpart.class, rpart.exp, rpart.poisson, and
rpart.anova now each return a set of formatting functions:
summary <- function(yval, dev, wt, ylevel, digits)
yval: a vector or matrix of response values
dev : a vector of deviance values
wt : a vector of weights
ylevel: if the left-hand-side of the model equation was a factor,
this contains its levels, otherwise NULL
digits: number of significant digits
The result should be a vector of character strings. For poisson splits for
instance, "events=54, estimated rate=0.057, mean deviance=1.32" is the
string that is created.
print <- function(yval, ylevel, digits)
Optional, currently only used by rpart.class. If missing the default is to
use yval as the last part of the line in print.rpart.
text <-
Not written yet
As a consequence, the summary and print routines no longer have special
code per method. (text.rpart will follow soon.)
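As an illustration of the summary formatting function described above, here
is a minimal sketch that produces the poisson-style string. The function name
poisson_summary is hypothetical, and the column layout of yval (event rate in
column 1, number of events in column 2) is an assumption taken from the
description of yval2 earlier in these notes, not from the actual code.

```r
# Hypothetical summary formatter with the documented argument list.
# Assumes yval is a matrix with columns (event rate, number of events).
poisson_summary <- function(yval, dev, wt, ylevel, digits) {
    paste0("events=", format(signif(yval[, 2], digits)),
           ", estimated rate=", format(signif(yval[, 1], digits)),
           ", mean deviance=", format(signif(dev / wt, digits)))
}

# One node with 54 events, rate 0.057, deviance 132, total weight 100.
node <- matrix(c(0.057, 54), nrow = 1)
line <- poisson_summary(node, dev = 132, wt = 100, ylevel = NULL, digits = 3)
print(line)
```

The result is one character string per node, which is what the summary
routine needs in order to stay free of per-method special cases.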
4b. The number of y variables needs to be passed into the routine, rather
than being a part of the func_table.h file.
Solution: all of the init functions (rpart.anova, rpart.exp, etc) now
return 'numy' as a part of their list.
4c. Callback
In order to get decent speed using a user written routine, I needed to use
the "trick" found originally in glm code and then later in penalized survival.
In 3.4 the technique is completely undocumented -- but I once got to see the
C code for glm as a Bell beta tester and copied it blindly.
The code here uses the approach outlined (thinly) in the green book. The
heart of the work is found in rpartcallback.s and rpart_callback.c. It is
intended that the same approach will replace what is currently used in the
survival routines. Note: I avoided the "ASSIGN_IN_FRAME" macro because of
a deficiency pointed out by Bill Dunlap.
An open question is how these can be mapped into R. I'm hoping, since
it's mostly macros, that it will be fairly easy.
Now, rpartcallback.s really isn't used. I'd like to keep its
functionality as a separate routine, particularly since the same lines
of code appear in both rpart() and xpred.rpart() (see rpart2.s for instance).
But, as soon as rpartcallback returns, some memory that I need gets released,
in particular the two expressions. I've tried putting a copy of them into
eframe, using COPY_ALL in the .c code, and a few other things, but nothing works.
The working code, rpart.s, has all the lines from rpartcallback copied inside
it right where the call to rpartcallback would have been. Of course, most
of the things I tried are pure guesswork, given the sparseness of the
documentation for .Call.
4d. The routines
In the test directory is a file "anovatest.s" that shows how to create
3 routines that replicate the built-in anova splitting method. It is
fairly heavily commented.
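The file anovatest.s is not reproduced here. As a rough idea of what three
such routines look like, the sketch below follows the user-split interface of
the present-day R rpart package (a list with components init, eval, and
split), which is an assumption and may differ in detail from the S code these
notes describe. The names init_anova, eval_anova, split_anova, and user_anova
are hypothetical, and mtcars is used only as convenient test data.

```r
library(rpart)  # assumption: present-day R rpart, not the S code described here

# Initialization: check the response and describe it to rpart.
init_anova <- function(y, offset, parms, wt) {
    if (is.matrix(y) && ncol(y) > 1) stop("matrix response not allowed")
    if (!missing(parms) && length(parms) > 0) warning("parms ignored")
    if (length(offset)) y <- y - offset
    sfun <- function(yval, dev, wt, ylevel, digits) {
        paste0("mean=", format(signif(yval, digits)),
               ", MSE=", format(signif(dev / wt, digits)))
    }
    list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
}

# Response: the node label (weighted mean) and its deviance (weighted RSS).
eval_anova <- function(y, wt, parms) {
    wmean <- sum(y * wt) / sum(wt)
    list(label = wmean, deviance = sum(wt * (y - wmean)^2))
}

# Splitting: goodness of every candidate split on one predictor.
split_anova <- function(y, wt, x, parms, continuous) {
    n <- length(y)
    y <- y - sum(y * wt) / sum(wt)          # center y
    if (continuous) {
        # x arrives sorted; candidate i puts observations 1..i on the left
        temp     <- cumsum(y * wt)[-n]
        left.wt  <- cumsum(wt)[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp / left.wt
        rmean <- -temp / right.wt
        list(goodness = (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2),
             direction = sign(lmean))
    } else {
        # order the categories by their means, then treat as continuous
        ux    <- sort(unique(x))
        wtsum <- tapply(wt, x, sum)
        ysum  <- tapply(y * wt, x, sum)
        ord   <- order(ysum / wtsum)
        k     <- length(ord)
        temp     <- cumsum(ysum[ord])[-k]
        left.wt  <- cumsum(wtsum[ord])[-k]
        right.wt <- sum(wt) - left.wt
        lmean <- temp / left.wt
        rmean <- -temp / right.wt
        list(goodness = (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2),
             direction = ux[ord])
    }
}

user_anova <- list(init = init_anova, eval = eval_anova, split = split_anova)
fit <- rpart(mpg ~ disp + hp + drat, data = mtcars, method = user_anova,
             control = rpart.control(minsplit = 10, xval = 0))
```

Fitting the same formula with method = "anova" and comparing the two frames
is a natural check on routines of this kind.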
4e. Speed
The last few lines of anovatest show an approximately 5-fold penalty for doing
the splits outside of C. But, the ability to prototype a new idea
quickly is really nice. For a simple example see anovatest2.s.
5. Cleaned up the labels.rpart() function.
a. Nothing calls rplabel.c or prlab anymore. The first of these
had been hard to standardize across the Unix `strings' libraries because
of one of the routines that I used. Most of what these routines did
has been moved into the labels.rpart code itself, hopefully allowing for
more transparency.
b. Deprecated the "pretty" arg to labels.rpart, replacing it with
"minlength", which is much more sensibly set up (see the comments at the
head of the routine for details). This also allows access to more of the
arguments of abbreviate().
c. In many places the code now makes use of a new routine formatg(),
which gives us the "g" format of printf. The routine is reminiscent of
"formatc" found on statlib, but with fewer options (and fewer checks).
If a more flexible format() appears one day in standard S we could
convert to using it.
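For readers unfamiliar with the "g" format, the sketch below shows the kind
of output it gives. The function format_g is a hypothetical stand-in written
with R's sprintf(); it is not the formatg() routine itself, whose source is
not shown in these notes.

```r
# Hypothetical stand-in for formatg(): printf-style "g" formatting,
# i.e. `digits` significant digits, switching to exponent form when needed.
format_g <- function(x, digits = getOption("digits")) {
    out <- sprintf(paste0("%.", digits, "g"), x)
    # keep the shape of a matrix argument
    if (is.matrix(x)) matrix(out, nrow = nrow(x)) else out
}

res <- format_g(c(1234.5678, 12.3456, 0.5), digits = 3)
print(res)
```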