Explanation for binary files inside source package according to
http://lists.debian.org/debian-devel/2013/09/msg00332.html
This package contains sample data files for experimenting with
the implemented algorithms.
The individual data files are described below:
Files: data/*
Documentation: man/GenABEL.data-package.Rd
GenABEL.data contains six data files that are used by the GenABEL examples.
These are ge03d2.clean.RData, ge03d2c.RData, ge03d2ex.clean.RData, ge03d2ex.RData, ge03d2.RData and srdta.RData.
Files: data/ge03d2.Rdata
Documentation: man/ge03d2.Rd
A small data set (approximately 1,000 people and 8,000 SNPs) containing
data on 3 autosomes and the X chromosome. It is a good set for
demonstrating the QC procedures (different genotyping errors
are introduced) and GWA analysis.
This data set was developed for the "Advances in population-
Files: data/BloodBrain.RData
Documentation: man/BloodBrain.Rd
Mente and Lombardo (2005) develop models to predict the log of the
ratio of the concentration of a compound in the brain and the
concentration in blood. For each compound, they computed three
sets of molecular descriptors: MOE 2D, rule-of-five and Charge
Polar Surface Area (CPSA). In all, 134 descriptors were
calculated. Included in this package are 208 non-proprietary
literature compounds. The vector ‘logBBB’ contains the
concentration ratio and the data frame ‘bbbDescr’ contains the
descriptor values.
Files: data/cars.RData
Documentation: man/cars.Rd
Kuiper (2008) collected Kelley Blue Book resale data for
804 GM cars (2005 model year).
Files: data/cox2.RData
Documentation: man/cox2.Rd
From Sutherland, O'Brien, and Weaver (2003): "A set of 467
cyclooxygenase-2 (COX-2) inhibitors has been assembled from the
published work of a single research group, with in vitro
activities against human recombinant enzyme expressed as IC50
values ranging from 1 nM to >100 uM (53 compounds have
indeterminate IC50 values)."
The data are in the Supplemental Data file for the article.
A set of 255 descriptors (MOE2D and QikProp) were generated. To
classify the data, we used a cutoff of $2^{2.5}$ to determine
activity.
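As a rough illustration of the cutoff rule, the sketch below labels a few made-up values. The text gives neither the direction of the comparison nor the units, so the rule "at or below the cutoff is active" and the name `classify_ic50` are assumptions made only for this example.

```python
# Cutoff quoted in the text: 2 raised to the power 2.5 (about 5.66).
CUTOFF = 2 ** 2.5


def classify_ic50(ic50, cutoff=CUTOFF):
    """Label a compound from its IC50 value.

    Assumption (not stated in the source): values at or below the
    cutoff are "Active", values above it are "Inactive".
    """
    return "Active" if ic50 <= cutoff else "Inactive"


# Made-up example values, not taken from the data set.
labels = [classify_ic50(v) for v in (0.5, 5.0, 6.0, 100.0)]
```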
Files: data/dhfr.RData
Documentation: man/dhfr.Rd
Sutherland and Weaver (2004) discuss QSAR models for dihydrofolate
reductase (DHFR) inhibition. This data set contains values for 325
compounds. For each compound, 228 molecular descriptors have been
calculated. Additionally, each sample is designated as "active"
or "inactive".
The data frame ‘dhfr’ contains a column called ‘Y’ with the
outcome classification. The remainder of the columns are molecular
descriptor values.
Files: data/GermanCredit.RData
Documentation: man/GermanCredit.Rd
Data from Dr. Hans Hofmann of the University of Hamburg.
These data have two classes for the credit worthiness: good or
bad. There are predictors related to attributes, such as: checking
account status, duration, credit history, purpose of the loan,
amount of the loan, savings accounts or bonds, employment
duration, installment rate in percentage of disposable income,
personal information, other debtors/guarantors, residence
duration, property, age, other installment plans, housing, number
of existing credits, job information, number of people being
liable to provide maintenance for, telephone, and foreign worker
status.
Many of these predictors are discrete and have been expanded into
several 0/1 indicator variables.
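The expansion of a discrete predictor into 0/1 indicator variables can be sketched as follows. The helper `expand_indicators`, the `Housing` prefix and the level names are hypothetical and are not the column names used in the data set.

```python
def expand_indicators(values, prefix):
    """Turn one discrete predictor into a list of 0/1 indicator rows.

    Each distinct level becomes its own column named "<prefix>.<level>"
    (a naming scheme assumed here for illustration only).
    """
    levels = sorted(set(values))
    return [
        {f"{prefix}.{level}": int(value == level) for level in levels}
        for value in values
    ]


# Made-up levels for a single discrete predictor.
housing = ["Rent", "Own", "ForFree", "Own"]
rows = expand_indicators(housing, "Housing")
```

Every row has exactly one indicator set to 1, the one matching its original level.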
Files: data/mdrr.RData
Documentation: man/mdrr.Rd
Svetnik et al. (2003) describe these data: "Bakken and Jurs
studied a set of compounds originally discussed by Klopman et al.,
who were interested in multidrug resistance reversal (MDRR)
agents. The original response variable is a ratio measuring the
ability of a compound to reverse a leukemia cell's resistance to
adriamycin. However, the problem was treated as a classification
problem, and compounds with the ratio >4.2 were considered active,
and those with the ratio <= 2.0 were considered inactive.
Compounds with the ratio between these two cutoffs were called
moderate and removed from the data for two-class classification,
leaving a set of 528 compounds (298 actives and 230 inactives).
(Various other arrangements of these data were examined by Bakken
and Jurs, but we will focus on this particular one.) We did not
have access to the original descriptors, but we generated a set of
342 descriptors of three different types that should be similar to
the original descriptors, using the DRAGON software."
The data and R code are in the Supplemental Data file for the
article.
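The thresholding described in the quotation can be sketched as below. The function name `mdrr_class` and the example ratios are made up for illustration; only the two cutoffs (4.2 and 2.0) come from the text.

```python
def mdrr_class(ratio):
    """Map a reversal ratio to a class label.

    Returns "Active" for ratio > 4.2, "Inactive" for ratio <= 2.0 and
    None for the moderate range in between, which is dropped.
    """
    if ratio > 4.2:
        return "Active"
    if ratio <= 2.0:
        return "Inactive"
    return None


# Made-up ratios; moderate compounds (label None) are removed.
ratios = [5.0, 4.2, 3.1, 2.0, 0.7]
kept = [(r, mdrr_class(r)) for r in ratios if mdrr_class(r) is not None]
```

A ratio of exactly 4.2 falls in the moderate range, because the active class requires a value strictly greater than 4.2.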
Files: data/oil.RData
Documentation: man/oil.Rd
Fatty acid concentrations of commercial oils were measured using
gas chromatography. The data is used to predict the type of oil.
Note that only the known oils are in the data set. Also, the
authors state that there are 95 samples of known oils. However, we
count 96 in Table 1 (pgs. 33-35).
Files: data/pottery.RData
Documentation: man/pottery.Rd
Measurements of 58 pottery samples.
Source:
R. G. Brereton (2003). _Chemometrics: Data Analysis for the
Laboratory and Chemical Plant_, pg. 261.
Files: data/segmentationData.RData
Documentation: man/segmentationData.Rd
Hill, LaPan, Li and Haney (2007) develop models to predict which
cells in a high content screen were well segmented. The data
consists of 119 imaging measurements on 2019 cells. The original
analysis used 1009 for training and 1010 as a test set (see the
column called ‘Case’).
The outcome class is contained in a factor variable called ‘Class’
with levels "PS" for poorly segmented and "WS" for well segmented.
The raw data used in the paper can be found at the BioMed Central
website. Versions of caret < 4.98 contained the original data. The
version now contained in ‘segmentationData’ is modified. First,
several discrete versions of some of the predictors (with the
suffix "Status") were removed. Second, there are several skewed
predictors with minimum values of zero (that would benefit from
some transformation, such as the log). A constant value of 1 was
added to these fields: ‘AvgIntenCh2’, ‘FiberAlign2Ch3’,
‘FiberAlign2Ch4’, ‘SpotFiberCountCh4’ and ‘TotalIntenCh2’.
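The reason for the added constant can be seen in a small sketch: the logarithm of zero is undefined, while the logarithm of a value shifted by 1 is not. The helper `log_of_shifted` and the raw values are hypothetical; only the field names and the constant 1 come from the text.

```python
import math

# Fields that, according to the text, had a constant of 1 added.
SHIFTED_FIELDS = (
    "AvgIntenCh2",
    "FiberAlign2Ch3",
    "FiberAlign2Ch4",
    "SpotFiberCountCh4",
    "TotalIntenCh2",
)


def log_of_shifted(raw_value, shift=1.0):
    """Return log(raw_value + shift); defined even when raw_value is 0."""
    return math.log(raw_value + shift)


# Made-up raw values with a minimum of zero.
# math.log(0.0) would raise ValueError, so the shift is applied first.
raw_values = [0.0, 9.0, 99.0]
logged = [log_of_shifted(v) for v in raw_values]
```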
A binary version of the original data is at <URL:
http://topepo.github.io/caret/segmentationOriginal.RData>.
Files: data/tecator.RData
Documentation: man/tecator.Rd
"These data are recorded on a Tecator Infratec Food and Feed
Analyzer working in the wavelength range 850 - 1050 nm by the Near
Infrared Transmission (NIT) principle. Each sample contains finely
chopped pure meat with different moisture, fat and protein
contents.
If results from these data are used in a publication we want you
to mention the instrument and company name (Tecator) in the
publication. In addition, please send a preprint of your article
to
Karin Thente, Tecator AB, Box 70, S-263 21 Hoganas, Sweden
The data are available in the public domain with no responsibility
from the original data source. The data can be redistributed as
long as this permission note is attached."
"For each meat sample the data consists of a 100 channel spectrum
of absorbances and the contents of moisture (water), fat and
protein. The absorbance is -log10 of the transmittance measured
by the spectrometer. The three contents, measured in percent, are
determined by analytic chemistry."
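The relation between absorbance and transmittance quoted above can be written as a one-line function. This sketch assumes the transmittance is expressed as a fraction greater than 0 and at most 1; the function name and the sample values are illustrative only.

```python
import math


def absorbance(transmittance):
    """Absorbance as -log10 of the transmittance.

    Assumption: transmittance is a fraction in the range (0, 1].
    """
    return -math.log10(transmittance)


# Made-up transmittance values: lower transmittance gives higher absorbance.
samples = [1.0, 0.1, 0.01]
values = [absorbance(t) for t in samples]
```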
Included here are the training, monitoring and test sets.
-- Balint Reczey <balint@balintreczey.hu>, Sat, 12 Dec 2015 11:18:34 +0100