File: spam.Rd

package info (click to toggle)
r-cran-kernlab 0.9-33-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 2,216 kB
  • sloc: cpp: 6,011; ansic: 935; makefile: 2
file content (45 lines) | stat: -rw-r--r-- 1,804 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
\name{spam}
\alias{spam}
\title{Spam E-mail Database}
\description{A data set collected at Hewlett-Packard Labs, that classifies 4601
e-mails as spam or non-spam. In addition to this class label there are 57
variables indicating the frequency of certain words and characters in the
e-mail.}
\usage{data(spam)}
\format{A data frame with 4601 observations and 58 variables.

The first 48 variables contain the frequency of the variable name
(e.g., business) in the e-mail. If the variable name starts with num (e.g.,
num650) the it indicates the frequency of the corresponding number (e.g., 650).
The variables 49-54 indicate the frequency of the characters `;', `(', `[', `!',
`$', and `#'. The variables 55-57 contain the average, longest 
and total run-length of capital letters. Variable 58 indicates the type of the
mail and is either \code{"nonspam"} or \code{"spam"}, i.e. unsolicited
commercial e-mail.}

\details{
The data set contains 2788 e-mails classified as \code{"nonspam"} and 1813
classified as \code{"spam"}.

The ``spam'' concept is diverse: advertisements for products/web
sites, make money fast schemes, chain letters, pornography...
This collection of spam e-mails came from the collectors' postmaster and
individuals who had filed spam.  The collection of non-spam
e-mails came from filed work and personal e-mails, and hence
the word 'george' and the area code '650' are indicators of
non-spam.  These are useful when constructing a personalized
spam filter.  One would either have to blind such non-spam
indicators or get a very wide collection of non-spam to
generate a general purpose spam filter.
}
\source{
  \doi{10.24432/C53G6X}
}

\references{
  T. Hastie, R. Tibshirani, J.H. Friedman.
  \emph{The Elements of Statistical Learning}.
  Springer, 2001.
}

\keyword{datasets}