man multimix-prep (1): automatically discover classes in data

Other Alias

multimix

SYNOPSIS

multimix

multimix-prep

DESCRIPTION

multimix fits a mixture of multivariate distributions to a set of observations using the EM algorithm. The data file may contain both categorical and continuous variables.

multimix prompts for the names of the data and parameter files.

The assignment of the observations to groups and the posterior probabilities are written to GROUPS.OUT. Parameter estimates, convergence information, and group assignment probabilities are written to GENERAL.OUT.
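As an illustration of how posterior probabilities relate to group assignments in a finite mixture, the following Python sketch applies Bayes' rule to one observation. It is not taken from the multimix source. The function names are hypothetical, the per-group densities are assumed to be already evaluated, and assigning each observation to the group with the largest posterior is an assumption based on the usual convention.

```python
def posterior_probabilities(pi, densities):
    """Return posterior group probabilities for one observation.

    pi        -- mixing proportions, one per group
    densities -- density of the observation under each group's model
    """
    weighted = [p * f for p, f in zip(pi, densities)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("observation has zero density under every group")
    return [w / total for w in weighted]


def assign_group(posteriors):
    """Return the 1-based index of the group with the largest posterior."""
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return best + 1


# Two equally weighted groups; the second fits the observation better.
post = posterior_probabilities([0.5, 0.5], [0.2, 0.6])
group = assign_group(post)
print(post, group)
```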

If multimix does not converge after ITER=200 iterations, the estimates of the parameters will be written to EMPARAMEST.OUT. This file can then be used as the parameter input file for multimix if desired.

multimix is compiled with the following maximum limits:

       1500 observations (IOB=1500)
       6 groups (IK6=6)
       15 attributes and partition cells (IP15=15)
       10 levels of categories (IM10=10)
       200 iterations to convergence (ITER=200)

Recompilation is required to change these parameters.
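Before running multimix, a problem can be checked against these limits. The Python sketch below is a hypothetical helper, not part of multimix. It assumes the limits apply to the quantities named NOBS, NG, NVAR, NPAR and NCAT in the PARAMETER FILE section, and that the limit of 15 applies separately to attributes and to partition cells.

```python
# Compiled-in maxima quoted from the list above.
LIMITS = {"NOBS": 1500, "NG": 6, "NVAR": 15, "NPAR": 15, "NCAT": 10}


def check_limits(ng, nobs, nvar, npar, ncat):
    """Return one message per compiled-in limit that the problem exceeds.

    ncat is the list of category counts, one entry per attribute.
    """
    sizes = {
        "NOBS": nobs,
        "NG": ng,
        "NVAR": nvar,
        "NPAR": npar,
        "NCAT": max(ncat, default=0),
    }
    return [
        f"{name}={value} exceeds the limit of {LIMITS[name]}"
        for name, value in sizes.items()
        if value > LIMITS[name]
    ]


for message in check_limits(ng=7, nobs=2000, nvar=7, npar=3, ncat=[3, 0, 12]):
    print(message)
```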

DATA FILE

The data file has one line for each observation. Each line has one entry for each variable. Only the first NVAR entries on each line are read.

PARAMETER FILE

The parameter file contains free field values which describe the data and the fitting models. multimix-prep will ask the user a series of questions and write a suitable parameter file. If the starting point for the fit is given by specifying initial group assignments for the observations, then the user should prepare the file of group assignments before starting multimix-prep. The file format is simple: the Ith line of the file contains an integer between 1 and NG giving the group number of the Ith observation. (The experienced user finds it faster to edit old parameter files into new ones.)
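A file of initial group assignments in the format just described can be produced with a few lines of Python. This is only a sketch: the function name and the output file name groups.txt are hypothetical, and the assignments shown are made up for illustration.

```python
def format_group_file(groups, ng):
    """Return the file contents: line I holds the group of observation I."""
    lines = []
    for i, g in enumerate(groups, start=1):
        if not 1 <= g <= ng:
            raise ValueError(f"observation {i}: group {g} is not in 1..{ng}")
        lines.append(str(g))
    return "\n".join(lines) + "\n"


# Five observations assigned to NG=3 groups.
text = format_group_file([1, 2, 2, 1, 3], ng=3)

# Uncomment to write the file before starting multimix-prep.
# with open("groups.txt", "w") as fh:
#     fh.write(text)
```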

multimix requires the variables in a partition cell to be stored contiguously, so the data are read in with the variable order specified by JP(J). IVARTYPE(J) and NCAT(J) both refer to the rearranged data.

The first five values are

NG

The number of groups (distributions) in the finite mixture to be fitted.

NOBS

The number of observations.

NVAR

The number of attributes.

NPAR

The number of partition cells (sets of attributes associated within each distribution).

ISPEC

Flag indicating how the starting point is specified for the fit:

1 Initial parameter estimates are specified.

2 Observations are assigned to groups.

Next come eight arrays of data:

JP

JP(J) is the column of the data array into which the Jth attribute of the data file will be stored, where J varies from 1 to NVAR. For example, suppose we want the third attribute in the first column, attribute 4 in the second column, attribute 7 in the 3rd column, and then attributes 1, 2, 5, and 6. Then JP(J) = 4 5 1 2 6 7 3, for J=1,...,7.
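Because JP maps each data-file attribute to its destination column, it is the inverse of the desired column order. The following Python sketch derives JP from that order and reproduces the example above. The function name is hypothetical.

```python
def jp_from_column_order(column_order):
    """Build JP from the desired column order.

    column_order[c-1] is the data-file attribute to store in column c.
    The result jp[j-1] is the column that receives attribute j.
    """
    nvar = len(column_order)
    if sorted(column_order) != list(range(1, nvar + 1)):
        raise ValueError("column_order must be a permutation of 1..NVAR")
    jp = [0] * nvar
    for column, attribute in enumerate(column_order, start=1):
        jp[attribute - 1] = column
    return jp


# Attributes 3, 4, 7 first, then 1, 2, 5 and 6.
jp = jp_from_column_order([3, 4, 7, 1, 2, 5, 6])
print(" ".join(map(str, jp)))  # 4 5 1 2 6 7 3
```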

IP

IP(L) is the number of attributes in the Lth partition cell, L=1,...,NPAR.

IPC

IPC(L) is the number of continuous attributes in the Lth partition cell.

ISV

ISV(L) gives the index J of the start of partition cell L. E.g. if attributes 6, 7, and 8 are in the same partition cell L, then ISV(L)=6 and IEV(L)=8.

IEV

IEV(L) gives the index J of the end of partition cell L.
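Since the attributes of a partition cell are stored contiguously, ISV and IEV follow from the cell sizes IP. The Python sketch below assumes that the cells occupy the rearranged columns in order, starting at column 1. The function name is hypothetical.

```python
def cell_bounds(ip):
    """Return (ISV, IEV) as 1-based start and end indices for each cell."""
    isv, iev = [], []
    start = 1
    for size in ip:
        if size < 1:
            raise ValueError("every partition cell needs at least one attribute")
        isv.append(start)
        iev.append(start + size - 1)
        start += size
    return isv, iev


# Three cells holding 2, 3 and 3 attributes; the last spans columns 6 to 8.
isv, iev = cell_bounds([2, 3, 3])
print(isv, iev)
```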

IPARTYPE

IPARTYPE(L) is an indicator giving the type of model for partition cell L:

1 for a categorical model.

2 for a multivariate normal model.

3 for a location model.

IVARTYPE

IVARTYPE(J) is an indicator for the type of attribute J:

1 for a categorical attribute.

2 for a multivariate normal attribute.

3 for a categorical attribute in a location model.

4 for a multivariate normal attribute in a location model.

NCAT

NCAT(J) is the number of categories for the Jth categorical attribute. For continuous attributes, NCAT(J) should be 0.
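The arrays above constrain one another, so a hand-edited parameter file can be checked for consistency before it is used. The Python sketch below is a hypothetical checker, not part of multimix. It assumes that IVARTYPE values 2 and 4 mark continuous attributes, that IPC(L) counts exactly those attributes in cell L, and that a categorical attribute has at least one category.

```python
CONTINUOUS = {2, 4}   # IVARTYPE codes for multivariate normal attributes
CATEGORICAL = {1, 3}  # IVARTYPE codes for categorical attributes


def check_arrays(ip, ipc, ivartype, ncat):
    """Return a list of inconsistencies among IP, IPC, IVARTYPE and NCAT."""
    problems = []
    if sum(ip) != len(ivartype):
        problems.append("sum of IP does not equal the number of attributes")
    start = 0
    for cell, (size, n_cont) in enumerate(zip(ip, ipc), start=1):
        types = ivartype[start:start + size]
        found = sum(1 for t in types if t in CONTINUOUS)
        if found != n_cont:
            problems.append(f"cell {cell}: IPC={n_cont} but {found} continuous")
        start += size
    for j, (t, n) in enumerate(zip(ivartype, ncat), start=1):
        if t in CONTINUOUS and n != 0:
            problems.append(f"attribute {j}: continuous but NCAT={n}")
        if t in CATEGORICAL and n < 1:
            problems.append(f"attribute {j}: categorical but NCAT={n}")
    return problems


# One categorical cell (2 attributes) and one location-model cell (3 attributes).
ok = check_arrays([2, 3], [0, 2], [1, 1, 3, 4, 4], [3, 2, 2, 0, 0])
bad = check_arrays([2, 3], [0, 2], [1, 1, 3, 4, 4], [3, 2, 2, 5, 0])
print(ok, bad)
```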

If observations are assigned to groups (ISPEC=2), then those assignments are next:

IGRP: IGRP(I) is the index of the group that observation I is in.

If observations are not assigned to groups (ISPEC=1), then estimates of the parameters are next:

PI: PI(K) is the estimated mixing proportion for group K (K=1,...,NG).

The parameters for each group depend on the type of attribute:

THETA: THETA(K,J,M) is the estimated probability that the Jth categorical attribute is at level M, given that the observation is in group K. Repeat for each attribute, J=ISV(L),...,IEV(L). (Categorical attributes only.)
EMU: EMU(K,L,J) is the estimated mean vector for group K, partition cell L and attribute J. (Multivariate normal model only.)
THETA: THETA(K,J,M) is the estimated probability that the Jth categorical attribute in the location model is at level M, given that the observation is in group K. (Categorical attributes only.)
EMUL: EMUL(K,L,J,M) is the estimated mean vector for group K, partition cell L and attribute J, at the Mth level of the categorical attribute in the location model. (Multivariate normal model only.)
VARIX: ((VARIX(K,L,I,J),J=1,IPC(L)), I=1,IPC(L)) An entry in VARIX is the estimated covariance between attributes I and J for group K, partition cell L, where I=1,...,IPC(L), and J=1,...,IPC(L).

The required parameters are read in for each partition cell, L=1,...,NPAR. For example, if the attributes within the partition cell are all categorical, that is, IPARTYPE(L)=1, then THETA(K,J,M), for M=1,...,NCAT(J), is required for each attribute in that partition cell.

If the attributes within the partition cell are continuous multivariate normal attributes, that is, IPARTYPE(L)=2, then estimates of EMU(K,L,J) are required for each attribute.

If the attributes within the partition cell follow the location model, that is, IPARTYPE(L)=3, then THETA(K,J,M), M=1,...,NCAT(J), is required for the categorical attribute, and EMUL(K,L,J,M), M=1,...,IM(L), is required for each continuous multivariate normal attribute. (Note that IM(L) is the number of categories of the categorical attribute associated with the location model.)

The estimates are read in first for group 1, then for group 2, etc.

EXAMPLES

See /usr/share/doc/multimix/examples.

FILES

GROUPS.OUT multimix output: the assignment of the observations to groups and the posterior probabilities. If observations were initially assigned to groups (ISPEC=2), these assignments may be different. Some are likely to be different if the fitting distributions overlap.

GENERAL.OUT multimix output: parameter estimates, convergence information, and group assignment probabilities.

EMPARAMEST.OUT multimix output on failure to converge: current parameter estimates. This file can then be used as the parameter input file for multimix if desired.

AUTHORS

Lynette A. Hunt and Murray Jorgensen.