\name{describe}
\alias{describe}
\alias{describe.default}
\alias{describe.vector}
\alias{describe.matrix}
\alias{describe.formula}
\alias{describe.data.frame}
\alias{plot.describe}
\alias{print.describe}
\alias{print.describe.single}
\alias{[.describe}
\alias{latex.describe}
\alias{latex.describe.single}
\alias{html.describe}
\alias{html.describe.single}
\alias{formatdescribeSingle}
\title{Concise Statistical Description of a Vector, Matrix, Data Frame,
	or Formula}
\description{
\code{describe} is a generic function that invokes \code{describe.data.frame},
\code{describe.matrix}, \code{describe.vector}, or
\code{describe.formula}. \code{describe.vector} is the basic 
function for handling a single variable.
This function determines whether the variable is character, factor,
category, binary, discrete numeric, or continuous numeric, and prints
a concise statistical summary appropriate to that type. A numeric variable is
deemed discrete if it has <= 10 distinct values. In this case,
quantiles are not printed. A frequency table is printed 
for any non-binary variable if it has no more than 20 distinct
values.  For any variable for which the frequency table is not printed,
the 5 lowest and highest values are printed.  This behavior can be
overridden for long character variables with many levels using the
\code{listunique} parameter, to get a complete tabulation.

\code{describe} is especially useful for
describing data frames created by \code{*.get}, as labels, formats,
value labels, and (in the case of \code{sas.get}) frequencies of special
missing values are printed.

For a binary variable, the sum (number of 1's) and mean (proportion of
1's) are printed. If the first argument is a formula, a model frame
is created and passed to \code{describe.data.frame}.  If a variable
is of class \code{"impute"}, a count of the number of imputed values is
printed.  If a date variable has an attribute \code{partial.date}
(this is set up by \code{sas.get}), counts of how many partial dates are
actually present (missing month, missing day, missing both) are also presented.
If a variable was created by the special-purpose function \code{substi} (which
substitutes values of a second variable if the first variable is NA),
the frequency table of substitutions is also printed.

For numeric variables, \code{describe} adds an item called \code{Info}
which is a relative information measure using the relative efficiency of
a proportional odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties.  \code{Info} is related to how
continuous the variable is, and ties are less harmful the more untied
values there are.  The formula for \code{Info} is one minus the sum of
the cubes of relative frequencies of values divided by one minus the
square of the reciprocal of the sample size.  The lowest information
comes from a variable having only one distinct value, followed by a
highly skewed binary variable.  \code{Info} is reported to
two decimal places.
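Written out, with \eqn{p_i}{p[i]} denoting the relative frequency of the
\eqn{i}th distinct value and \eqn{n} the sample size (symbols chosen
here for illustration only), the verbal definition above reads
\deqn{\mathrm{Info} = \frac{1 - \sum_i p_i^3}{1 - 1/n^2}}{Info = (1 - sum(p[i]^3)) / (1 - 1/n^2)}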

A \code{latex} method exists for converting the \code{describe} object to a
LaTeX file.  For numeric variables having more than 20 distinct values,
\code{describe} saves in its returned object the frequencies of 100
evenly spaced bins running from the minimum observed value to the maximum.
When there are 20 or fewer distinct values, the original
values are maintained.
\code{latex} and \code{html} insert a spike histogram displaying these
frequency counts in the tabular material using the LaTeX picture
environment.  For example output see
\url{http://biostat.mc.vanderbilt.edu/wiki/pub/Main/Hmisc/counties.pdf}.
Note that the \code{latex} method assumes that the \code{setspace} and
\code{relsize} LaTeX packages are installed.

The \code{html} method mimics the LaTeX output.  This is useful in the
context of Rmarkdown html and html notebook output.

The \code{plot} method is for \code{describe} objects run on data
frames.  It produces a graphic of spike histograms for
continuous variables and a dot chart for categorical variables, showing
category proportions.  The graphic format is \code{ggplot2} unless the
user has set \code{options(grType='plotly')}, in which case interactive
\code{plotly} graphics are produced; these can be placed into
an Rmarkdown html notebook.  The user must install the \code{plotly}
package for this to work.  When the user hovers the mouse over a bin for
a raw data value, the actual value will pop up (formatted using
\code{digits}).  When the user hovers over the minimum data value, most
of the information calculated by \code{describe} will pop up.  For each
variable, the number of missing values is used to assign the color to
the histogram or dot chart, and a legend is drawn.  Color is not used if
there are no missing values in any variable. For categorical variables,
hovering over the leftmost point for a variable displays details, and
for all points proportions, numerators, and denominators are displayed
in the popup.  If both continuous and categorical variables are present
and \code{which='both'} is specified, the \code{plot} method returns an
unclassed \code{list} containing two objects, named \code{'Categorical'}
and \code{'Continuous'}, in that order.

Sample weights may be specified to any of the functions, resulting
in weighted means, quantiles, and frequency tables.

Note: As discussed in Cox and Longton (2008), Stata Journal 8(4),
pp. 557-568, the term "unique" has been replaced with "distinct" in the
output (but not in parameter names).

When \code{weights} are not used, Gini's mean difference is computed for
numeric variables.  This is a robust measure of dispersion that is the
mean absolute difference between all pairs of observations.  In the
output Gini's mean difference is labeled \code{Gmd}.

\code{formatdescribeSingle} is a service function for \code{latex},
\code{html}, and \code{print} methods for single variables that is not
intended to be called by the user.
}
\usage{
\method{describe}{vector}(x, descript, exclude.missing=TRUE, digits=4,
         listunique=0, listnchar=12,
         weights=NULL, normwt=FALSE, minlength=NULL, \dots)
\method{describe}{matrix}(x, descript, exclude.missing=TRUE, digits=4, \dots)
\method{describe}{data.frame}(x, descript, exclude.missing=TRUE,
    digits=4, \dots)
\method{describe}{formula}(x, descript, data, subset, na.action,
    digits=4, weights, \dots)
\method{print}{describe}(x, \dots)
\method{latex}{describe}(object, title=NULL,
      file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
      append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      spacing=0.7, lspace=c(0,0), \dots)
\method{latex}{describe.single}(object, title=NULL, vname,
      file, append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      lspace=c(0,0), \dots)
\method{html}{describe}(object, size=85, tabular=TRUE,
      greek=TRUE, scroll=FALSE, rows=25, cols=100, \dots)
\method{html}{describe.single}(object, size=85,
      tabular=TRUE, greek=TRUE, \dots)
formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'),
           lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0),
           size=85, \dots)
\method{plot}{describe}(x, which=c('both', 'continuous', 'categorical'),
                          what=NULL,
                          sort=c('ascending', 'descending', 'none'),
                          n.unique=10, digits=5, \dots)
}
\arguments{
\item{x}{
  a data frame, matrix, vector, or formula.  For a data frame, the 
  \code{describe.data.frame}
  function is automatically invoked.  For a matrix, \code{describe.matrix} is
  called.  For a formula, \code{describe.data.frame(model.frame(x))}
  is invoked. The formula may or may not have a response variable.  For
  \code{print}, \code{latex}, \code{html}, or
	\code{formatdescribeSingle}, \code{x} is an object created by
  \code{describe}.
}
\item{descript}{
  optional title to print for \code{x}. The default is the name of the argument
  or the "label" attributes of individual variables. When the first argument
  is a formula, \code{descript} defaults to a character representation of
  the formula.
}
\item{exclude.missing}{
  set to \code{TRUE} to print the names of variables that contain only missing values.
  This list appears at the bottom of the printout, and no space is taken
  up for such variables in the main listing.
}
\item{digits}{
  number of significant digits to print.  For \code{plot.describe}, this is
	the number of significant digits to put in hover text for
	\code{plotly} when showing raw variable values.} 
\item{listunique}{
  For a character variable that is not an \code{mChoice} variable, that
  has its longest string length greater than \code{listnchar}, and that
  has no more than \code{listunique} distinct values, all values are
  listed in alphabetic order.  Any value having more than one occurrence
  has the frequency of occurrence after it, in parentheses.  Specify
  \code{listunique} equal to some value at least as large as the number
  of observations to ensure that all character variables will have all
  their values listed.  For purposes of tabulating character strings,
  multiple white spaces of any kind are translated to a single space,
  leading and trailing white space are ignored, and case is ignored.
}
\item{listnchar}{see \code{listunique}}
\item{weights}{
  a numeric vector of frequencies or sample weights.  Each observation
  will be treated as if it were sampled \code{weights} times.
}
\item{minlength}{value passed to \code{summary.mChoice}.}
\item{normwt}{
  The default, \code{normwt=FALSE}, results in the use of \code{weights} as
  weights in computing various statistics.  In this case the sample size
  is assumed to be equal to the sum of \code{weights}.  Specify
  \code{normwt=TRUE} to divide 
  \code{weights} by a constant so that \code{weights} sum to the number of
  observations (length of vectors specified to \code{describe}).  In this
  case the number of observations is taken to be the actual number of
  records given to \code{describe}.
}
\item{object}{a result of \code{describe}}
\item{title}{unused}
\item{data}{
}
\item{subset}{
}
\item{na.action}{
  These are used if a formula is specified.  \code{na.action} defaults to
  \code{na.retain} which does not delete any \code{NA}s from the data frame.
  Use \code{na.action=na.omit} or \code{na.delete} to drop any observation with
  any \code{NA} before processing.
}
\item{\dots}{
  arguments passed to \code{describe.default} which are passed to calls
  to \code{format} for numeric variables.  For example if using R
  \code{POSIXct} or \code{Date} date/time formats, specifying
  \code{describe(d,format='\%d\%b\%Y')} will print date/time variables as
  \code{"01Jan2000"}.  This is useful for omitting the time
  component.  See the help file for \code{format.POSIXct} or
  \code{format.Date} for more
  information.  For \code{plot} methods, \dots is ignored.
  For \code{html} and \code{latex} methods, \dots is used to pass
	optional arguments to \code{formatdescribeSingle}, especially the
	\code{condense} argument.
	}
\item{file}{
name of output file (should have a suffix of .tex).  Default name is
formed from the first word of the \code{descript} element of the
\code{describe} object, prefixed by \code{"describe"}.  Set
\code{file=""} to send LaTeX code to standard output instead of a file.
}
\item{append}{
set to \code{TRUE} to have \code{latex} append text to an existing file
named \code{file}
}
\item{size}{
LaTeX text size (\code{"small"}, the default, or \code{"normalsize"},
\code{"tiny"}, \code{"scriptsize"}, etc.) for the \code{describe} output
in LaTeX. For \code{html}, this is the percent of the prevailing font size to use for
the output.
}
\item{tabular}{
  set to \code{FALSE} to use verbatim rather than tabular (or html
	table) environment for the summary statistics output.  By default,
	tabular is used if the output is not too wide.}
\item{greek}{By default, the \code{latex} and \code{html} methods
  will change names of greek letters that appear in variable
  labels to appropriate LaTeX symbols in math mode, or html symbols,  unless
  \code{greek=FALSE}.}
\item{spacing}{By default, the \code{latex} method for \code{describe} run
  on a matrix or data frame uses the \code{setspace} LaTeX package with a
  line spacing of 0.7 so as not to waste space.  Specify \code{spacing=0}
  to suppress the use of \code{setspace}'s \code{spacing} environment,
  or specify another positive value to use this environment with a
  different spacing.}
\item{lspace}{extra vertical space, in character size units (i.e., "ex"
  is appended to the value).  When using certain font sizes, there is
  too much space left around LaTeX verbatim environments.  This
  two-vector specifies space to remove (i.e., the values are negated in
  forming the \code{vspace} command) before (first element) and after
  (second element of \code{lspace}) verbatim environments}
\item{scroll}{set to \code{TRUE} to create an html scrollable box for
	the html output}
\item{rows, cols}{the number of rows or columns to allocate for the
	scrollable box}
\item{vname}{unused argument in \code{latex.describe.single}}
\item{which}{specifies whether to plot numeric continuous or
	binary/categorical variables, or both.  When \code{"both"} is
	specified, a list with two elements is created.  Each element is a
	\code{ggplot2} or \code{plotly} object.  If there are no variables
	of one of the two types, a single \code{ggplot2} or
	\code{plotly} object is returned, ready to print.} 
\item{what}{character or numeric vector specifying which variables to
	plot; default is to plot all}
\item{sort}{specifies how and whether variables are sorted in order of
	the proportion of positives when \code{which="categorical"}.  Specify
	\code{sort="none"} to leave variables in the order they appear in the
	original data.}
\item{n.unique}{the minimum number of distinct values a numeric variable
	must have before \code{plot.describe} uses it in a continuous variable
	plot}
\item{condense}{specifies whether to condense the output with regard to
  the 5 lowest and highest values (\code{"extremes"}), the frequency table
  (\code{"frequencies"}), both, or neither (\code{"none"})
}
\item{lang}{specifies the markup language}
\item{verb}{set to 1 if a verbatim environment is already in effect for LaTeX}
}
\value{
a list containing elements \code{descript}, \code{counts},
\code{values}.  The list  is of class \code{describe}.  If the input
object was a matrix or a data 
frame, the list is a list of lists, one list for each variable
analyzed. \code{latex} returns a standard \code{latex} object.  For numeric
variables having at least 20 distinct values, an additional component
\code{intervalFreq} is included.  This component is a list with two elements,
\code{range} (containing two values) and \code{count}, a vector of 100 integer
frequency counts.
}
\details{
If \code{options(na.detail.response=TRUE)}
has been set and \code{na.action} is \code{"na.delete"} or
\code{"na.keep"}, summary  statistics on
the response variable are printed separately for missing and non-missing
values of each predictor.  The default summary function returns
the number of non-missing response values and the mean of the last
column of the response values, with a \code{names} attribute of
\code{c("N","Mean")}. 
When the response is a \code{Surv} object and the mean is used, this will
result in the crude proportion of events being used to summarize
the response.  The actual summary function can be designated through
\code{options(na.fun.response = "function name")}.

If you are modifying LaTeX \code{parskip} or certain other parameters,
you may need to shrink the area around \code{tabular} and
\code{verbatim} environments produced by \code{latex.describe}.  You can
do this using, for example,
\code{\\usepackage{etoolbox}\\makeatletter\\preto{\\@verbatim}{\\topsep=-1.4pt
	\\partopsep=0pt}\\preto{\\@tabular}{\\parskip=2pt
	\\parsep=0pt}\\makeatother} in the LaTeX preamble.
}
\author{
Frank Harrell
\cr
Vanderbilt University
\cr
\email{f.harrell@vanderbilt.edu}
}
\seealso{
\code{\link{sas.get}}, \code{\link{quantile}}, \code{\link{GiniMd}},
\code{\link{table}}, \code{\link{summary}},
\code{\link{model.frame.default}}, 
\code{\link{naprint}}, \code{\link{lapply}}, \code{\link{tapply}},
\code{\link[survival]{Surv}}, \code{\link{na.delete}},
\code{\link{na.keep}}, 
\code{\link{na.detail.response}}, \code{\link{latex}}
}
\examples{
set.seed(1)
describe(runif(200),dig=2)    #single variable, continuous
                              #get quantiles .05,.10,\dots

dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)
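
# Hand computation of two statistics defined in the description above.
# This is only an illustrative sketch in base R, not the internal code
# used by describe; the object names x, n, p, info, and gmd are
# hypothetical and chosen here for illustration.
x <- c(1, 1, 2, 3)
n <- length(x)
p <- table(x) / n                       # relative frequencies of distinct values
info <- (1 - sum(p^3)) / (1 - 1 / n^2)  # Info as defined in words above
info                                    # 0.9 for these data
gmd <- mean(dist(x))                    # mean absolute difference over all pairs
gmd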

\dontrun{
options(grType='plotly')
d <- describe(mydata)
p <- plot(d)   # create plots for both types of variables
p[[1]]; p[[2]] # or p$Categorical; p$Continuous
plotly::subplot(p[[1]], p[[2]], nrows=2)  # plot both in one
plot(d, which='categorical')    # categorical ones

d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d)      #describe entire data frame
attach(d, 1)
describe(relig)  #Has special missing values .D .F .M .R .T
                 #attr(relig,"label") is "Religious preference"

#relig : Religious preference  Format:relig
#    n missing  D  F M R T distinct 
# 4038     263 45 33 7 2 1        8
#
#0:none (251, 6\%), 1:Jewish (372, 9\%), 2:Catholic (1230, 30\%) 
#3:Jehovah's Witnes (25, 1\%), 4:Christ Scientist (7, 0\%) 
#5:Seventh Day Adv (17, 0\%), 6:Protestant (2025, 50\%), 7:other (111, 3\%) 


# Method for describing part of a data frame:
 describe(death.time ~ age*sex + rcs(blood.pressure))
 describe(~ age+sex)
 describe(~ age+sex, weights=freqs)  # weighted analysis

 fit <- lrm(y ~ age*sex + log(height))
 describe(formula(fit))
 describe(y ~ age*sex, na.action=na.delete)   
# report on number deleted for each variable
 options(na.detail.response=TRUE)  
# keep missings separately for each x, report on dist of y by x=NA
 describe(y ~ age*sex)
 options(na.fun.response="quantile")
 describe(y ~ age*sex)   # same but use quantiles of y by x=NA

 d <- describe(my.data.frame)
 d$age                   # print description for just age
 d[c('age','sex')]       # print description for two variables
 d[sort(names(d))]       # print in alphabetic order by var. names
 d2 <- d[20:30]          # keep variables 20-30
 page(d2)                # pop-up window for these variables

# Test date/time formats and suppression of times when they don't vary
 library(chron)
 d <- data.frame(a=chron((1:20)+.1),
                 b=chron((1:20)+(1:20)/100),
                 d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
                 f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=1:20,min=1:20,sec=1:20),
                 g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
 describe(d)

# Make a function to run describe, latex.describe, and use the kdvi
# previewer in Linux to view the result and easily make a pdf file

 ldesc <- function(data) {
  options(xdvicmd='kdvi')
  d <- describe(data, desc=deparse(substitute(data)))
  dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
 }

 ldesc(d)
}
}
\keyword{interface}
\keyword{nonparametric}
\keyword{category}
\keyword{distribution}
\keyword{robust}
\keyword{models}
\keyword{hplot}