File: README.source

r-cran-caret 6.0-81-2
Explanation for binary files inside source package according to
  http://lists.debian.org/debian-devel/2013/09/msg00332.html

This package contains sample data files for experimenting with
the implemented algorithms.

The individual data files are described below:

Files: data/*
Documentation: man/GenABEL.data-package.Rd
  GenABEL.data contains six data files used by the GenABEL examples:
  ge03d2.clean.RData, ge03d2c.RData, ge03d2ex.clean.RData, ge03d2ex.RData,
  ge03d2.RData and srdta.RData.

Files: data/ge03d2.RData
Documentation: man/ge03d2.Rd
        A small data set (approximately 1,000 people and 8,000 SNPs) containing
        data on 3 autosomes and the X chromosome. It is a good set for
        demonstrating QC procedures (different genotyping errors
        are introduced) and GWA analysis.
        This data set was developed for the "Advances in population-

Files: data/BloodBrain.RData
Documentation: man/BloodBrain.Rd
     Mente and Lombardo (2005) develop models to predict the log of the
     ratio of the concentration of a compound in the brain and the
     concentration in blood. For each compound, they computed three
     sets of molecular descriptors: MOE 2D, rule-of-five and Charge
     Polar Surface Area (CPSA). In all, 134 descriptors were
     calculated. Included in this package are 208 non-proprietary
     literature compounds. The vector ‘logBBB’ contains the
     concentration ratio and the data frame ‘bbbDescr’ contains the
     descriptor values.

Files: data/cars.RData
Documentation: man/cars.Rd
     Kuiper (2008) collected Kelley Blue Book resale data for
     804 GM cars (2005 model year).

Files: data/cox2.RData
Documentation: man/cox2.Rd
     From Sutherland, O'Brien, and Weaver (2003): "A set of 467
     cyclooxygenase-2 (COX-2) inhibitors has been assembled from the
     published work of a single research group, with in vitro
     activities against human recombinant enzyme expressed as IC50
     values ranging from 1 nM to >100 uM (53 compounds have
     indeterminate IC50 values)."

     The data are in the Supplemental Data file for the article.

     A set of 255 descriptors (MOE2D and QikProp) was generated. To
     classify the data, we used a cutoff of $2^{2.5}$ to determine
     activity.
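
     As a rough illustration of the thresholding step: 2^2.5 is about
     5.66. The sketch below (in Python, not the R code of the package)
     labels made-up IC50 values against that cutoff. The direction of
     the comparison (values below the cutoff count as "Active"), the
     example values and the function name are assumptions and are not
     taken from the package documentation.

```python
# Hypothetical sketch: label compounds by comparing IC50 to the 2^2.5 cutoff.
CUTOFF = 2 ** 2.5  # about 5.657


def classify_ic50(ic50, cutoff=CUTOFF):
    """Return "Active" below the cutoff (assumed direction), else "Inactive"."""
    return "Active" if ic50 < cutoff else "Inactive"


# Made-up IC50 values, for illustration only.
example_ic50 = [0.5, 5.0, 6.0, 120.0]
labels = [classify_ic50(value) for value in example_ic50]
```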

Files: data/dhfr.RData
Documentation: man/dhfr.Rd
     Sutherland and Weaver (2004) discuss QSAR models for dihydrofolate
     reductase (DHFR) inhibition. This data set contains values for 325
     compounds. For each compound, 228 molecular descriptors have been
     calculated. Additionally, each sample is designated as "active"
     or "inactive".

     The data frame ‘dhfr’ contains a column called ‘Y’ with the
     outcome classification. The remainder of the columns are molecular
     descriptor values.

Files: data/GermanCredit.RData
Documentation: man/GermanCredit.Rd
     Data from Dr. Hans Hofmann of the University of Hamburg.

     These data have two classes for creditworthiness: good or bad.
     The predictors cover attributes such as checking account status,
     duration, credit history, purpose of the loan, amount of the
     loan, savings accounts or bonds, employment duration, installment
     rate as a percentage of disposable income, personal information,
     other debtors/guarantors, residence duration, property, age,
     other installment plans, housing, number of existing credits, job
     information, number of people the applicant is liable to provide
     maintenance for, telephone, and foreign worker status.

     Many of these predictors are discrete and have been expanded into
     several 0/1 indicator variables.
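
     To make the 0/1 expansion concrete, here is a minimal Python
     sketch of turning one discrete predictor into indicator columns.
     The helper name, the level names and the "Prefix.Level" column
     naming are illustrative assumptions, not a description of how the
     packaged data were actually built.

```python
def expand_indicators(values, prefix):
    """Expand a discrete variable into one 0/1 column per observed level."""
    levels = sorted(set(values))
    return {
        f"{prefix}.{level}": [1 if value == level else 0 for value in values]
        for level in levels
    }


# Made-up housing values for four applicants.
housing = ["Own", "Rent", "Own", "ForFree"]
indicators = expand_indicators(housing, "Housing")
```

     Each applicant has exactly one 1 across the generated columns.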

Files: data/mdrr.RData
Documentation: man/mdrr.Rd
     Svetnik et al. (2003) describe these data: "Bakken and Jurs
     studied a set of compounds originally discussed by Klopman et al.,
     who were interested in multidrug resistance reversal (MDRR)
     agents. The original response variable is a ratio measuring the
     ability of a compound to reverse a leukemia cell's resistance to
     adriamycin. However, the problem was treated as a classification
     problem, and compounds with the ratio >4.2 were considered active,
     and those with the ratio <= 2.0 were considered inactive.
     Compounds with the ratio between these two cutoffs were called
     moderate and removed from the data for two-class classification,
     leaving a set of 528 compounds (298 actives and 230 inactives).
     (Various other arrangements of these data were examined by Bakken
     and Jurs, but we will focus on this particular one.) We did not
     have access to the original descriptors, but we generated a set of
     342 descriptors of three different types that should be similar to
     the original descriptors, using the DRAGON software."

     The data and R code are in the Supplemental Data file for the
     article.
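
     The labelling rule quoted above can be sketched as follows. This
     is a Python illustration with made-up ratios; the function name
     and the use of None for "moderate" compounds are assumptions, and
     the code is not the R code from the Supplemental Data file.

```python
def label_mdrr(ratio):
    """Apply the quoted cutoffs: >4.2 active, <=2.0 inactive, else moderate."""
    if ratio > 4.2:
        return "Active"
    if ratio <= 2.0:
        return "Inactive"
    return None  # moderate: dropped before two-class modelling


# Made-up ratios, including both boundary values.
ratios = [5.1, 1.3, 3.0, 4.2, 2.0]
kept = [(r, label_mdrr(r)) for r in ratios if label_mdrr(r) is not None]
```

     Note that a ratio of exactly 4.2 falls in the moderate band,
     while exactly 2.0 is labelled inactive.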


Files: data/oil.RData
Documentation: man/oil.Rd
     Fatty acid concentrations of commercial oils were measured using
     gas chromatography. The data are used to predict the type of oil.
     Note that only the known oils are in the data set. Also, the
     authors state that there are 95 samples of known oils. However, we
     count 96 in Table 1 (pp. 33-35).

Files: data/pottery.RData
Documentation: man/pottery.Rd
     Measurements of 58 pottery samples.
Source:
     R. G. Brereton (2003). _Chemometrics: Data Analysis for the
     Laboratory and Chemical Plant_, pg. 261.

Files: data/segmentationData.RData
Documentation: man/segmentationData.Rd
     Hill, LaPan, Li and Haney (2007) develop models to predict which
     cells in a high content screen were well segmented. The data
     consist of 119 imaging measurements on 2019 cells. The original
     analysis used 1009 cells for training and 1010 as a test set (see
     the column called ‘Case’).

     The outcome class is contained in a factor variable called ‘Class’
     with levels "PS" for poorly segmented and "WS" for well segmented.

     The raw data used in the paper can be found at the BioMed Central
     website. Versions of caret < 4.98 contained the original data. The
     version now contained in ‘segmentationData’ is modified. First,
     several discrete versions of some of the predictors (with the
     suffix "Status") were removed. Second, there are several skewed
     predictors with minimum values of zero (that would benefit from
     some transformation, such as the log). A constant value of 1 was
     added to these fields: ‘AvgIntenCh2’, ‘FiberAlign2Ch3’,
     ‘FiberAlign2Ch4’, ‘SpotFiberCountCh4’ and ‘TotalIntenCh2’.
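
     The reason for the added constant is that the logarithm is
     undefined at zero. A small Python sketch with made-up values
     shows the effect; the variable names and the choice of log base
     10 are assumptions made for illustration only.

```python
import math

# Made-up raw values for a skewed predictor whose minimum is zero.
raw = [0.0, 3.0, 99.0]

# Shift by the constant 1 so that every value is strictly positive.
shifted = [value + 1 for value in raw]

# The log transform is now defined for every observation.
logged = [math.log10(value) for value in shifted]
```

     Applying the log to the unshifted values would fail on the zero
     entry.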

     A binary version of the original data is at <URL:
     http://topepo.github.io/caret/segmentationOriginal.RData>.

Files: data/tecator.RData
Documentation: man/tecator.Rd
     "These data are recorded on a Tecator Infratec Food and Feed
     Analyzer working in the wavelength range 850 - 1050 nm by the Near
     Infrared Transmission (NIT) principle. Each sample contains finely
     chopped pure meat with different moisture, fat and protein
     contents.

     If results from these data are used in a publication we want you
     to mention the instrument and company name (Tecator) in the
     publication.  In addition, please send a preprint of your article
     to

     Karin Thente, Tecator AB, Box 70, S-263 21 Hoganas, Sweden

     The data are available in the public domain with no responsibility
     from the original data source. The data can be redistributed as
     long as this permission note is attached."

     "For each meat sample the data consists of a 100 channel spectrum
     of absorbances and the contents of moisture (water), fat and
     protein.  The absorbance is -log10 of the transmittance measured
     by the spectrometer. The three contents, measured in percent, are
     determined by analytic chemistry."
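
     The relation between absorbance and transmittance stated above
     can be written as a short Python sketch. The function name and
     the assumption that transmittance is given as a positive fraction
     (not a percentage) are illustrative and not part of the data
     description.

```python
import math


def absorbance(transmittance):
    """Absorbance as -log10 of the transmittance (a positive fraction)."""
    if transmittance <= 0:
        raise ValueError("transmittance must be positive")
    return -math.log10(transmittance)


# Made-up transmittance values: 100%, 10% and 1% of the light transmitted.
example = [absorbance(t) for t in (1.0, 0.1, 0.01)]
```

     Lower transmittance gives higher absorbance: each tenfold drop in
     transmitted light adds one absorbance unit.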

     Included here are the training, monitoring and test sets.

 -- Balint Reczey <balint@balintreczey.hu>, Sat, 12 Dec 2015 11:18:34 +0100