With the development of e-Science service-oriented infrastructures based on the Grid and service computing, many data grids are being created and are becoming a new potential source of information available
to scientists and data analysts. However, mining distributed data sets remains a challenge. Under the assumption that data may not be easily transferred from one site to another, or gathered at a single
location (for performance, confidentiality, or security reasons), we present a framework for distributed clustering in which the data set is partitioned across several sites and the output is a Gaussian mixture model. Each data provider locally generates a clustering
model using different classic clustering techniques (currently K-Means and EM) and returns it to a central site using a standard PMML representation of the model. The central site uses these models as
starting observations for EM iterations to estimate the final model parameters. Although EM is known to be slower than simpler techniques, it is applied here to a relatively small set of observations and provides a probabilistic framework for combining the models.
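As a concrete illustration of the exchange format, a centre-based clustering model such as one produced by K-Means can be encoded in PMML roughly as follows (field names, coordinates, and cluster values are hypothetical; the exact schema used by the framework is not shown in this abstract):

```xml
<PMML version="3.0" xmlns="http://www.dmg.org/PMML-3_0">
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <ClusteringModel modelName="site1-kmeans" functionName="clustering"
                   modelClass="centerBased" numberOfClusters="2">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y"/>
    </MiningSchema>
    <ComparisonMeasure kind="distance">
      <squaredEuclidean/>
    </ComparisonMeasure>
    <Cluster name="c1">
      <Array n="2" type="real">0.4 1.7</Array>
    </Cluster>
    <Cluster name="c2">
      <Array n="2" type="real">5.1 4.8</Array>
    </Cluster>
  </ClusteringModel>
</PMML>
```

Each `Cluster` carries its centroid as an `Array`, which is exactly the kind of per-site summary the central site can collect and treat as observations.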
An initial version of the framework has been implemented and deployed on the Discovery Net infrastructure, which provides support for data and resource management and for workflow composition. We present
empirical results that demonstrate the advantages of this approach and show that the final model remains accurate while allowing the mining of very large distributed data sets.
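The combination step described above can be sketched as follows. This is a minimal illustration (not the paper's implementation): the centroids returned by the local sites are treated as observations, and a Gaussian mixture is fitted to them with plain EM iterations. The function names and the deterministic farthest-point initialisation are our own choices for the sketch.

```python
import numpy as np

def farthest_point_init(obs, k):
    """Deterministic, spread-out initialisation for the component means."""
    means = [obs[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((obs - m) ** 2, axis=1) for m in means], axis=0)
        means.append(obs[np.argmax(d2)])
    return np.array(means)

def combine_by_em(obs, k, iters=50):
    """Fit a k-component Gaussian mixture to the collected local
    centroids (obs, shape n x d) using plain EM iterations."""
    n, d = obs.shape
    means = farthest_point_init(obs, k)
    covs = np.array([np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each observation
        resp = np.empty((n, k))
        for j in range(k):
            diff = obs - means[j]
            inv = np.linalg.inv(covs[j])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[j]))
            resp[:, j] = weights[j] * norm * np.exp(
                -0.5 * np.sum(diff @ inv * diff, axis=1))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and (regularised) covariances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ obs) / nk[:, None]
        for j in range(k):
            diff = obs - means[j]
            covs[j] = ((resp[:, j, None] * diff).T @ diff / nk[j]
                       + 1e-6 * np.eye(d))
    return weights, means, covs
```

In the actual framework the local models would also carry information such as cluster sizes, which can weight the observations; this sketch treats all centroids equally.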