.\" -*- nroff -*-
.\" This manual page was written by Jim Van Zandt <jrv@debian.org>
.\" and is hereby placed in the public domain.
.TH MULTIMIX 1 "December 10, 2001"
.SH NAME
multimix, multimix-prep \- automatically discover classes in data
.SH SYNOPSIS
.B multimix
.sp
.B multimix-prep
.SH DESCRIPTION
\fBmultimix\fP fits a mixture of multivariate distributions to a set
of observations using the EM algorithm. The data file may contain both
categorical and continuous variables.
.P
\fBmultimix\fP prompts for the names of the data and parameter files.
.P
The assignment of the observations to groups and the posterior
probabilities are written to \fIGROUPS.OUT\fP. Parameter estimates,
convergence information, and group assignment probabilities are
written to \fIGENERAL.OUT\fP.
.P
If \fBmultimix\fP does not converge after \fIITER\fP=200 iterations,
the estimates of the parameters will be written to
\fIEMPARAMEST.OUT\fP. This file can then be used as the parameter
input file for \fBmultimix\fP if desired.
.P
\fBmultimix\fP is limited to a maximum of
.nf
1500 observations (\fIIOB\fP=1500)
6 groups (\fIIK6\fP=6)
15 attributes and partition cells (\fIIP15\fP=15)
10 levels of categories (\fIIM10\fP=10)
200 iterations to convergence (\fIITER\fP=200)
.fi
Recompilation is required to change these parameters.
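.P
Because these limits are fixed at compile time, it can be convenient to
check a problem against them before running \fBmultimix\fP. The
following Python sketch is not part of \fBmultimix\fP; the function name
\fIcheck_limits\fP and the dictionary \fILIMITS\fP are hypothetical, and
the numbers are simply the ones listed above.
.nf
```python
# Compile-time limits quoted in this manual page (assumed unchanged).
LIMITS = {"NOBS": 1500, "NG": 6, "NVAR": 15, "NPAR": 15, "NCAT": 10}


def check_limits(nobs, ng, nvar, npar, ncat):
    """Return one message for every limit that is exceeded.

    ncat is the list of category counts, one per attribute
    (0 for continuous attributes).
    """
    sizes = {"NOBS": nobs, "NG": ng, "NVAR": nvar, "NPAR": npar,
             "NCAT": max(ncat, default=0)}
    return ["%s=%d exceeds the limit of %d" % (name, value, LIMITS[name])
            for name, value in sizes.items() if value > LIMITS[name]]


# Hypothetical problem: too many observations and too many categories.
print(check_limits(nobs=2000, ng=3, nvar=7, npar=3, ncat=[0, 4, 12]))
```
.fi
An empty list means the problem fits within the limits listed above.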
.SH "DATA FILE"
The data file has one line for each observation. Each line has one
entry for each variable. Only the first \fINVAR\fP entries on each
line are read.
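.P
The layout described above can be pictured with a short Python sketch.
It is an illustration only: it assumes that entries are separated by
white space and are numeric, neither of which is stated on this page,
and the name \fIread_observations\fP is hypothetical.
.nf
```python
def read_observations(lines, nvar):
    """Keep the first nvar entries of every non-blank line as floats.

    Assumes whitespace-separated numeric entries (an assumption of this
    sketch, not a documented property of the multimix data file).
    """
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # this sketch skips blank lines
        if len(fields) < nvar:
            raise ValueError("line %d has fewer than %d entries" % (number, nvar))
        # Entries beyond the first nvar are ignored, as described above.
        rows.append([float(field) for field in fields[:nvar]])
    return rows


sample = ["1 5.2 3.1 ignored", "2 4.8 2.9"]
print(read_observations(sample, nvar=3))
```
.fi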
.SH "PARAMETER FILE"
The parameter file contains free field values which describe the data
and the fitting models. \fBmultimix-prep\fP will ask the user a
series of questions and write a suitable parameter file. If the
starting point for the fit is given by specifying initial group
assignments for the observations, then the user should prepare the
file of group assignments before starting \fBmultimix-prep\fP. The
file format is simple: the \fII\fPth line of the file contains an
integer between 1 and \fING\fP giving the group number of the
\fII\fPth observation. Experienced users may find it faster to edit an
existing parameter file than to run \fBmultimix-prep\fP again.
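.P
A file of initial group assignments in the format just described could
be produced along the following lines. This is a sketch in Python, not
part of \fBmultimix\fP; the function names and the choice of output path
are hypothetical.
.nf
```python
def group_file_lines(groups, ng):
    """Return one line of text per observation, checking 1 <= group <= ng."""
    lines = []
    for i, group in enumerate(groups, start=1):
        if not 1 <= group <= ng:
            raise ValueError("observation %d: group %d is not in 1..%d"
                             % (i, group, ng))
        lines.append(str(group))
    return lines


def write_group_file(path, groups, ng):
    """Write the assignments to path, one integer per line."""
    with open(path, "w") as handle:
        for line in group_file_lines(groups, ng):
            print(line, file=handle)


# Five observations assigned to NG=3 groups (made-up assignments).
print(group_file_lines([1, 2, 2, 3, 1], ng=3))
```
.fi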
.P
\fBmultimix\fP requires the variables in a partition cell to be stored
contiguously. The data are therefore rearranged as they are read, with
the column order given by \fIJP\fP(\fIJ\fP). \fIIVARTYPE\fP(\fIJ\fP) and
\fINCAT\fP(\fIJ\fP) both refer to the rearranged data.
.P
The first five values are
.TP
.I NG
The number of groups (distributions) in the finite mixture to be fitted.
.TP
.I NOBS
The number of observations.
.TP
.I NVAR
The number of attributes.
.TP
.I NPAR
The number of partition cells (sets of attributes associated within
each distribution).
.TP
.I ISPEC
Flag indicating how the starting point is specified for the fit:
.RS
\fB 1\fP Initial parameter estimates are specified.
.P
\fB 2\fP Observations are assigned to groups.
.RE
.PP
Next come eight arrays of data:
.TP
.I JP
.IR JP ( J )
is the column of the data array into which the
\fIJ\fPth attribute of the data file will be stored, where \fIJ\fP
varies from 1 to \fINVAR\fP. For example, suppose we want the third
attribute in the first column, attribute 4 in the second column,
attribute 7 in the 3rd column, and then attributes 1, 2, 5, and 6.
Then JP(J) = 4 5 1 2 6 7 3, for J=1,...,7.
.TP
.I IP
.IR IP ( L )
is the number of attributes in the \fIL\fPth
partition cell, \fIL\fP=1,...,\fINPAR\fP.
.TP
.I IPC
.IR IPC ( L )
is the number of continuous attributes in the
\fIL\fPth partition cell.
.TP
.I ISV
.IR ISV ( L )
gives the index \fIJ\fP of the start of partition
cell \fIL\fP. E.g. if attributes 6, 7, and 8 are in the same
partition cell \fIL\fP, then ISV(L)=6 and IEV(L)=8.
.TP
.I IEV
.IR IEV ( L )
gives the index \fIJ\fP of the end of partition cell \fIL\fP.
.TP
.I IPARTYPE
.IR IPARTYPE ( L )
is an indicator giving the type of model for partition \fIL\fP:
.RS
\fB 1\fP for a categorical model.
.P
\fB 2\fP for a multivariate normal model.
.P
\fB 3\fP for a location model.
.RE
.TP
.I IVARTYPE
.IR IVARTYPE ( J )
is an indicator for the type of attribute
\fIJ\fP:
.RS
\fB 1\fP for a categorical attribute.
.P
\fB 2\fP for a multivariate normal attribute.
.P
\fB 3\fP for a categorical attribute in a location model.
.P
\fB 4\fP for a multivariate normal attribute in a location model.
.RE
.TP
.I NCAT
.IR NCAT ( J )
is the number of categories for the \fIJ\fPth categorical attribute.
For continuous attributes, \fINCAT\fP(\fIJ\fP) should be 0.
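.PP
The rearrangement given by \fIJP\fP and the cell boundaries \fIISV\fP
and \fIIEV\fP can be illustrated with a short Python sketch. It is not
part of \fBmultimix\fP and the function names are hypothetical. The
second function assumes that the partition cells occupy consecutive
columns in the order \fIL\fP=1,...,\fINPAR\fP, which is an inference
from the contiguity requirement rather than something stated here.
.nf
```python
def rearrange(row, jp):
    """Store attribute J of the data file in column JP(J) (1-based)."""
    if sorted(jp) != list(range(1, len(jp) + 1)):
        raise ValueError("JP must be a permutation of 1..NVAR")
    columns = [None] * len(jp)
    for j, target in enumerate(jp):
        columns[target - 1] = row[j]
    return columns


def cell_bounds(ip):
    """Return the lists (ISV, IEV) for consecutive cells of sizes IP(L)."""
    isv, iev, start = [], [], 1
    for size in ip:
        isv.append(start)
        iev.append(start + size - 1)
        start += size
    return isv, iev


# The JP example from this page; each entry of the row is its own
# attribute number, so the result shows which attribute lands where.
print(rearrange([1, 2, 3, 4, 5, 6, 7], [4, 5, 1, 2, 6, 7, 3]))
# Hypothetical cell sizes chosen so that the third cell spans columns 6 to 8.
print(cell_bounds([2, 3, 3]))
```
.fi
With the example values, attributes 3, 4 and 7 end up in the first three
columns, and the third cell has ISV=6 and IEV=8.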
.PP
If observations are assigned to groups (\fIISPEC\fP=2), then those
assignments are next:
.TP
.I IGRP
.IR IGRP ( I )
is the index of the group that observation \fII\fP
is in.
.PP
If observations are not assigned to groups (\fIISPEC\fP=1), then
estimates of the parameters are next:
.TP
.I PI
.IR PI ( K )
is the estimated mixing proportion for group \fIK\fP
.RI ( K "=1,...," NG ).
.PP
The parameters for each group depend on the type of attribute:
.TP
.I THETA
.IR THETA ( K , J , M )
is the estimated probability that the \fIJ\fPth categorical attribute
is at level \fIM\fP, given that the observation is in group \fIK\fP.
Repeat for each attribute,
.IR J = ISV ( L ), IEV ( L ).
\fBcategorical attributes only\fP
.TP
.I EMU
.IR EMU ( K , L , J )
is the estimated mean vector for group \fIK\fP, partition cell \fIL\fP
and attribute \fIJ\fP.
\fBmultivariate normal model only\fP
.TP
.I THETA
.IR THETA ( K , J , M )
is the estimated probability that the \fIJ\fPth categorical attribute
in the location model is at level \fIM\fP, given that the observation
is in group \fIK\fP.
\fBcategorical attributes only\fP
.TP
.I EMUL
.IR EMUL ( K , L , J , M )
is the estimated mean vector for group \fIK\fP, partition cell \fIL\fP
and attribute \fIJ\fP, at the \fIM\fPth level of the categorical
attribute in the location model.
\fBmultivariate normal model only\fP
.TP
.I VARIX
.RI (( VARIX ( K , L , I , J ), J =1, IPC ( L )), " I" =1, IPC ( L ))
gives the order in which the covariances are read, with \fIJ\fP varying
fastest. Each entry of \fIVARIX\fP is the estimated covariance between
attributes \fII\fP and \fIJ\fP for group \fIK\fP, partition cell
\fIL\fP, where
.IR I =1,..., IPC ( L "), and " J =1,..., IPC ( L ).
.PP
The required parameters are read in for each partition cell,
.IR L =1,..., \fINPAR\fP .
For example, if the attributes within the partition cell are all
categorical, that is,
.IR IPARTYPE ( L )=1,
then
.IR THETA ( K , J , M ),
for
.IR M =1,..., NCAT ( J )
is required for each attribute in that partition cell.
.PP
If the attributes within the partition cell are continuous,
multivariate normal attributes, that is,
.IR IPARTYPE ( L )=2,
then estimates of
.IR EMU ( K , L , J )
are required for each attribute.
.PP
If the attributes within the partition cell follow the location model,
that is,
.IR IPARTYPE ( L )=3,
then
.IR THETA ( K , J , M ), M =1,..., NCAT ( J )
is required for the categorical attribute, and
.IR EMUL ( K , L , J , M ), M =1,..., IM ( L )
is required for each continuous multivariate normal attribute. (Note
that
.IR IM ( L )
is the number of categories of the categorical attribute associated
with the location model.)
.PP
The estimates are read in first for group 1, then for group 2, etc.
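.PP
When initial estimates are typed in by hand, a few consistency checks
can catch mistakes early: mixing proportions and category probabilities
should each sum to one, and a covariance matrix should be symmetric.
The Python sketch below illustrates such checks. It is not part of
\fBmultimix\fP, the function names are hypothetical, and whether
\fBmultimix\fP performs similar checks itself is not stated here.
.nf
```python
import math


def check_proportions(values, label, tol=1e-6):
    """Raise ValueError unless values are probabilities that sum to 1."""
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError(label + " has an entry outside [0, 1]")
    if not math.isclose(sum(values), 1.0, abs_tol=tol):
        raise ValueError(label + " does not sum to 1")


def is_symmetric(matrix, tol=1e-9):
    """True if the square matrix equals its transpose within tol."""
    n = len(matrix)
    return all(len(row) == n for row in matrix) and all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n) for j in range(i))


# Made-up estimates for one group K and one partition cell L.
pi = [0.5, 0.3, 0.2]              # PI(K), K=1,...,NG
theta = [0.25, 0.25, 0.5]         # THETA(K,J,M), M=1,...,NCAT(J)
varix = [[2.0, 0.3], [0.3, 1.0]]  # VARIX(K,L,I,J)
check_proportions(pi, "PI")
check_proportions(theta, "THETA")
print(is_symmetric(varix))
```
.fi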
.SH EXAMPLES
See \fI/usr/share/doc/multimix/examples\fP.
.SH FILES
.TP
.I GROUPS.OUT
\fBmultimix\fP output: the assignment of the
observations to groups and the posterior probabilities. If
observations were initially assigned to groups (\fIISPEC\fP=2), these
assignments may be different. Some are likely to be different if the
fitting distributions overlap.
.TP
.I GENERAL.OUT
\fBmultimix\fP output: parameter estimates,
convergence information, and group assignment probabilities.
.TP
.I EMPARAMEST.OUT
\fBmultimix\fP output on failure to converge:
current parameter estimates. This file can then be used as the
parameter input file for \fBmultimix\fP if desired.
.SH AUTHORS
Lynette A. Hunt <lah@waikato.ac.nz> and Murray Jorgensen
<maj@waikato.ac.nz>.
.\" This manual page was written by James R. Van Zandt
.\" <jrv@debian.org>, for the Debian GNU/Linux system (but may be
.\" used by others).
.SH "SEE ALSO"
.nf
.I /usr/share/doc/multimix/paper.ps.gz
.I /usr/share/doc/multimix/talk.ps.gz
.I /usr/share/doc/multimix/notes.ps.gz
.I /usr/share/doc/multimix/PPAPER.ps.gz
.I /usr/share/doc/multimix/alltables.ps.gz
.BR autoclass (1).
.fi