Clustering, also known as mixture modelling or intrinsic classification, is the problem of identifying and modelling components
(or clusters, or classes) in a body of data. We consider here the application of the Minimum Message Length (MML) principle
to a clustering problem of Gaussian and t distributions. Earlier work on MML clustering addressed the multinomial and Gaussian distributions
(Wallace and Boulton, 1968) and, in addition, the von Mises circular and Poisson distributions (Wallace and Dowe, 1994, 2000).
Our current work extends this by generalising the Gaussian distribution to the more general t distribution. Point estimation of the t distribution is performed using the MML approximation proposed by Wallace and Freeman (1987). A comparison of the MML estimates
of the t distribution with those of the Maximum Likelihood (ML) method, in terms of their Kullback-Leibler (KL) distances, is also provided.
Within each component, our application also performs model selection to decide whether a particular group of data is best modelled
as a Gaussian or as a t distribution. The proposed modelling method is then applied to several artificially generated datasets, and the results
are compared with those obtained using MML clustering of Gaussian distributions. Our modelling method compares
quite well to an alternative clustering program (EMMIX) which uses various modelling criteria such as the Akaike Information
Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC).
Keywords: Clustering - Machine Learning - Knowledge Discovery - Data Mining - Unsupervised Learning - Minimum Message Length - MML - Mixture Modelling - Classification - Intrinsic Classification - Numerical Taxonomy - Information Theory - Statistical Inference