1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373
|
#' Blood Brain Barrier Data
#'
#' Mente and Lombardo (2005) develop models to predict the log
#' of the ratio of the concentration of a compound in the brain
#' and the concentration in blood. For each compound, they computed
#' three sets of molecular descriptors: MOE 2D, rule-of-five and
#' Charge Polar Surface Area (CPSA). In all, 134 descriptors were
#' calculated. Included in this package are 208 non-proprietary
#' literature compounds. The vector \code{logBBB} contains the concentration
#' ratio and the data fame \code{bbbDescr} contains the descriptor values.
#'
#'
#' @name BloodBrain
#' @aliases BloodBrain bbbDescr logBBB
#' @docType data
#' @return \item{bbbDescr}{data frame of chemical descriptors} \item{logBBB}{vector of assay results}
#'
#' @source Mente, S.R. and Lombardo, F. (2005). A recursive-partitioning model
#' for blood-brain barrier permeation, \emph{Journal of Computer-Aided Molecular Design},
#' Vol. 19, pg. 465-481.
#'
#' @keywords datasets
NULL
#' COX-2 Activity Data
#'
#' From Sutherland, O'Brien, and Weaver (2003): "A set of 467 cyclooxygenase-2
#' (COX-2) inhibitors has been assembled from the published work of a single
#' research group, with in vitro activities against human recombinant enzyme
#' expressed as IC50 values ranging from 1 nM to >100 uM (53 compounds have
#' indeterminate IC50 values)."
#'
#' The data are in the Supplemental Data file for the article.
#'
#' A set of 255 descriptors (MOE2D and QikProp) were generated. To classify the
#' data, we used a cutoff of $2^2.5$ to determine activity
#'
#'
#' @name cox2
#' @aliases cox2 cox2Class cox2Descr cox2IC50
#' @docType data
#' @return \item{cox2Descr}{the descriptors} \item{cox2IC50}{the IC50 data used
#' to determine activity}
#'
#' \item{cox2Class}{the categorical outcome ("Active" or "Inactive") based on
#' the $2^2.5$ cutoff}
#' @source Sutherland, J. J., O'Brien, L. A. and Weaver, D. F. (2003).
#' Spline-Fitting with a Genetic Algorithm: A Method for Developing
#' Classification Structure-Activity Relationships, \emph{Journal of Chemical
#' Information and Computer Sciences}, Vol. 43, pg. 1906-1915.
#' @keywords datasets
NULL
#' Dihydrofolate Reductase Inhibitors Data
#'
#' Sutherland and Weaver (2004) discuss QSAR models for dihydrofolate reductase
#' (DHFR) inhibition. This data set contains values for 325 compounds. For each
#' compound, 228 molecular descriptors have been calculated. Additionally, each
#' samples is designated as "active" or "inactive".
#'
#' The data frame \code{dhfr} contains a column called \code{Y} with the
#' outcome classification. The remainder of the columns are molecular
#' descriptor values.
#'
#'
#' @name dhfr
#' @docType data
#' @return \item{dhfr}{data frame of chemical descriptors and the activity
#' values}
#' @source Sutherland, J.J. and Weaver, D.F. (2004). Three-dimensional
#' quantitative structure-activity and structure-selectivity relationships of
#' dihydrofolate reductase inhibitors, \emph{Journal of Computer-Aided
#' Molecular Design}, Vol. 18, pg. 309-331.
#' @keywords datasets
NULL
#' German Credit Data
#'
#' Data from Dr. Hans Hofmann of the University of Hamburg.
#'
#' These data have two classes for the credit worthiness: good or bad. There
#' are predictors related to attributes, such as: checking account status,
#' duration, credit history, purpose of the loan, amount of the loan, savings
#' accounts or bonds, employment duration, Installment rate in percentage of
#' disposable income, personal information, other debtors/guarantors, residence
#' duration, property, age, other installment plans, housing, number of
#' existing credits, job information, Number of people being liable to provide
#' maintenance for, telephone, and foreign worker status.
#'
#' Many of these predictors are discrete and have been expanded into several
#' 0/1 indicator variables
#'
#'
#' @name GermanCredit
#' @docType data
#' @source UCI Machine Learning Repository
#' @keywords datasets
NULL
#' Multidrug Resistance Reversal (MDRR) Agent Data
#'
#' Svetnik et al. (2003) describe these data: "Bakken and Jurs studied a set of
#' compounds originally discussed by Klopman et al., who were interested in
#' multidrug resistance reversal (MDRR) agents. The original response variable
#' is a ratio measuring the ability of a compound to reverse a leukemia cell's
#' resistance to adriamycin. However, the problem was treated as a
#' classification problem, and compounds with the ratio >4.2 were considered
#' active, and those with the ratio <= 2.0 were considered inactive. Compounds
#' with the ratio between these two cutoffs were called moderate and removed
#' from the data for twoclass classification, leaving a set of 528 compounds
#' (298 actives and 230 inactives). (Various other arrangements of these data
#' were examined by Bakken and Jurs, but we will focus on this particular one.)
#' We did not have access to the original descriptors, but we generated a set
#' of 342 descriptors of three different types that should be similar to the
#' original descriptors, using the DRAGON software."
#'
#' The data and R code are in the Supplemental Data file for the article.
#'
#'
#' @name mdrr
#' @aliases mdrr mdrrClass mdrrDescr
#' @docType data
#' @return \item{mdrrDescr}{the descriptors} \item{mdrrClass}{the categorical
#' outcome ("Active" or "Inactive")}
#' @source Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P.
#' Feuston, B. P (2003). Random Forest: A Classification and Regression Tool
#' for Compound Classification and QSAR Modeling, \emph{Journal of Chemical
#' Information and Computer Sciences}, Vol. 43, pg. 1947-1958.
#' @keywords datasets
NULL
#' Fatty acid composition of commercial oils
#'
#' Fatty acid concentrations of commercial oils were measured using gas
#' chromatography. The data is used to predict the type of oil. Note that
#' only the known oils are in the data set. Also, the authors state that there
#' are 95 samples of known oils. However, we count 96 in Table 1 (pgs. 33-35).
#'
#'
#' @name oil
#' @aliases oil oilType fattyAcids
#' @docType data
#' @return \item{fattyAcids}{data frame of fatty acid compositions: Palmitic,
#' Stearic, Oleic, Linoleic, Linolenic, Eicosanoic and Eicosenoic. When values
#' fell below the lower limit of the assay (denoted as <X in the paper), the
#' limit was used. } \item{oilType}{factor of oil types: pumpkin (A), sunflower
#' (B), peanut (C), olive (D), soybean (E), rapeseed (F) and corn (G).}
#' @source Brodnjak-Voncina et al. (2005). Multivariate data analysis in
#' classification of vegetable oils characterized by the content of fatty
#' acids, \emph{Chemometrics and Intelligent Laboratory Systems}, Vol.
#' 75:31-45.
#' @keywords datasets
NULL
#' Pottery from Pre-Classical Sites in Italy
#'
#' Measurements of 58 pottery samples.
#'
#'
#' @name pottery
#' @aliases pottery potteryClass
#' @docType data
#' @return \item{pottery}{11 elemental composition measurements }
#' \item{potteryClass}{factor of pottery type: black carbon containing bulks
#' (A) and clayey (B)}
#' @source R. G. Brereton (2003). \emph{Chemometrics: Data Analysis for the
#' Laboratory and Chemical Plant}, pg. 261.
#' @keywords datasets
NULL
#' Sacramento CA Home Prices
#'
#' This data frame contains house and sale price data for 932 homes in
#' Sacramento CA. The original data were obtained from the website for the
#' SpatialKey software. From their website: "The Sacramento real estate
#' transactions file is a list of 985 real estate transactions in the
#' Sacramento area reported over a five-day period, as reported by the
#' Sacramento Bee." Google was used to fill in missing/incorrect data.
#'
#'
#' @name Sacramento
#' @docType data
#' @return \item{Sacramento}{a data frame with columns '\code{city}',
#' '\code{zip}', '\code{beds}', '\code{baths}', '\code{sqft}', '\code{type}',
#' '\code{price}', '\code{latitude}', and '\code{longitude}'}
#' @source SpatialKey website:
#' \url{https://support.spatialkey.com/spatialkey-sample-csv-data/}
#' @keywords datasets
#' @examples
#'
#' data(Sacramento)
#'
#' set.seed(955)
#' in_train <- createDataPartition(log10(Sacramento$price), p = .8, list = FALSE)
#'
#' training <- Sacramento[ in_train,]
#' testing <- Sacramento[-in_train,]
#'
#'
NULL
#' Morphometric Data on Scat
#'
#' Reid (2015) collected data on animal feses in coastal California. The data
#' consist of DNA verified species designations as well as fields related to
#' the time and place of the collection and the scat itself. The data frame
#' \code{scat_orig} contains while \code{scat} contains data on the three main
#' species.
#'
#'
#' @name scat
#' @aliases scat scat_orig
#' @docType data
#' @return \item{scat_orig}{the entire data set in the Supplemental Materials}
#' \item{scat}{data on the three main species}
#' @source Reid, R. E. B. (2015). A morphometric modeling approach to
#' distinguishing among bobcat, coyote and gray fox scats. \emph{Wildlife
#' Biology}, 21(5), 254-262
#' @keywords datasets
NULL
#' Kelly Blue Book resale data for 2005 model year GM cars
#'
#' Kuiper (2008) collected data on Kelly Blue Book resale data for 804 GM cars (2005 model year).
#'
#' @name cars
#' @docType data
#' @return \item{cars}{data frame of the suggested retail price (column \code{Price}) and various
#' characteristics of each car (columns \code{Mileage}, \code{Cylinder}, \code{Doors}, \code{Cruise},
#' \code{Sound}, \code{Leather}, \code{Buick}, \code{Cadillac}, \code{Chevy}, \code{Pontiac}, \code{Saab},
#' \code{Saturn}, \code{convertible}, \code{coupe}, \code{hatchback}, \code{sedan} and \code{wagon})}
#' @source Kuiper, S. (2008). Introduction to Multiple Regression: How Much Is Your Car Worth?,
#' \emph{Journal of Statistics Education}, Vol. 16
#' \url{http://jse.amstat.org/jse_archive.htm#2008}.
#' @keywords datasets
NULL
#' Cell Body Segmentation
#'
#' Hill, LaPan, Li and Haney (2007) develop models to predict which cells in a
#' high content screen were well segmented. The data consists of 119 imaging
#' measurements on 2019. The original analysis used 1009 for training and 1010
#' as a test set (see the column called \code{Case}).
#'
#' The outcome class is contained in a factor variable called \code{Class} with
#' levels "PS" for poorly segmented and "WS" for well segmented.
#'
#' The raw data used in the paper can be found at the Biomedcentral website.
#' Versions of caret < 4.98 contained the original data. The version now
#' contained in \code{segmentationData} is modified. First, several discrete
#' versions of some of the predictors (with the suffix "Status") were removed.
#' Second, there are several skewed predictors with minimum values of zero
#' (that would benefit from some transformation, such as the log). A constant
#' value of 1 was added to these fields: \code{AvgIntenCh2},
#' \code{FiberAlign2Ch3}, \code{FiberAlign2Ch4}, \code{SpotFiberCountCh4} and
#' \code{TotalIntenCh2}.
#'
#' A binary version of the original data is at
#' \url{http://topepo.github.io/caret/segmentationOriginal.RData}.
#'
#'
#' @name segmentationData
#' @docType data
#' @return \item{segmentationData}{data frame of cells}
#' @source Hill, LaPan, Li and Haney (2007). Impact of image segmentation on
#' high-content screening data quality for SK-BR-3 cells, \emph{BMC
#' Bioinformatics}, Vol. 8, pg. 340,
#' \url{https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-340}.
#' @keywords datasets
NULL
#' Fat, Water and Protein Content of Meat Samples
#'
#' "These data are recorded on a Tecator Infratec Food and Feed Analyzer
#' working in the wavelength range 850 - 1050 nm by the Near Infrared
#' Transmission (NIT) principle. Each sample contains finely chopped pure meat
#' with different moisture, fat and protein contents.
#'
#' If results from these data are used in a publication we want you to mention
#' the instrument and company name (Tecator) in the publication. In addition,
#' please send a preprint of your article to
#'
#' Karin Thente, Tecator AB, Box 70, S-263 21 Hoganas, Sweden
#'
#' The data are available in the public domain with no responsibility from the
#' original data source. The data can be redistributed as long as this
#' permission note is attached."
#'
#' "For each meat sample the data consists of a 100 channel spectrum of
#' absorbances and the contents of moisture (water), fat and protein. The
#' absorbance is -log10 of the transmittance measured by the spectrometer. The
#' three contents, measured in percent, are determined by analytic chemistry."
#'
#' Included here are the traning, monitoring and test sets.
#'
#'
#' @name tecator
#' @aliases tecator absorp endpoints
#' @docType data
#' @return \item{absorp}{absorbance data for 215 samples. The first 129 were
#' originally used as a training set} \item{endpoints}{the percentages of
#' water, fat and protein}
#' @keywords datasets
#' @examples
#'
#' data(tecator)
#'
#' splom(~endpoints)
#'
#' # plot 10 random spectra
#' set.seed(1)
#' inSubset <- sample(1:dim(endpoints)[1], 10)
#'
#' absorpSubset <- absorp[inSubset,]
#' endpointSubset <- endpoints[inSubset, 3]
#'
#' newOrder <- order(absorpSubset[,1])
#' absorpSubset <- absorpSubset[newOrder,]
#' endpointSubset <- endpointSubset[newOrder]
#'
#' plotColors <- rainbow(10)
#'
#' plot(absorpSubset[1,],
#' type = "n",
#' ylim = range(absorpSubset),
#' xlim = c(0, 105),
#' xlab = "Wavelength Index",
#' ylab = "Absorption")
#'
#' for(i in 1:10)
#' {
#' points(absorpSubset[i,], type = "l", col = plotColors[i], lwd = 2)
#' text(105, absorpSubset[i,100], endpointSubset[i], col = plotColors[i])
#' }
#' title("Predictor Profiles for 10 Random Samples")
#'
NULL
|