Computing Publications

Scalable clustering on the Data Grid

Patrick Wendel, Moustafa Ghanem, Yike Guo

Conference or Workshop Paper
4th UK e-Science All Hands Meeting 2005
September, 2005
ISBN 1-904425-53-4

With the development of e-Science service-oriented infrastructures based on the Grid and service computing, many data grids are being created and are becoming a new potential source of information for scientists and data analysts. However, mining distributed data sets remains a challenge. Following the assumption that data may not be easily transferred from one site to another, or gathered at a single location (for performance, confidentiality or security reasons), we present a framework for distributed clustering where the data set is partitioned between several sites and the output is a mixture of Gaussian models. Each data provider generates a clustering model locally using one of several classic clustering techniques (for the moment K-Means and EM) and returns it to one central site using a standard PMML representation for the model. The central site uses these models as starting observations for EM iterations to estimate the final model parameters. Although EM is known to be slower than simpler techniques, it is applied here to a relatively small set of observations and provides a probabilistic framework for the model combination.

An initial version of the framework has been implemented and deployed on the Discovery Net infrastructure, which provides support for data and resource management and for workflow composition. We present empirical results that show the advantages of this approach and show that the final model stays accurate while allowing the mining of very large distributed data sets.
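The two-stage scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how local models are weighted or how the central EM is initialised, so everything below is an assumption made for illustration. Each site runs plain K-Means and ships only its centroids and cluster sizes (passed as NumPy arrays here, in place of the PMML documents); the central site then runs a weighted EM with diagonal covariances over those centroids. The function names `local_kmeans` and `central_em`, the farthest-point initialisation, and the synthetic data are all hypothetical.

```python
import numpy as np

def local_kmeans(data, k, iters=20, seed=0):
    """Site-side step (assumed): Lloyd's K-Means, returns (centroids, counts)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, np.bincount(labels, minlength=k)

def central_em(points, weights, k, iters=50, min_var=1e-3):
    """Central step (assumed): weighted EM for a diagonal Gaussian mixture,
    treating each local centroid as an observation weighted by its count."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    means = [points[np.argmax(weights)]]
    for _ in range(1, k):
        d2 = np.min([np.sum((points - m) ** 2, axis=1) for m in means], axis=0)
        means.append(points[np.argmax(d2)])
    means = np.array(means)
    variances = np.tile(points.var(axis=0) + min_var, (k, 1))
    mix = np.full(k, 1.0 / k)
    w = weights / weights.sum()
    for _ in range(iters):
        # E-step: responsibilities computed in log space for stability.
        diff = points[:, None, :] - means[None, :, :]
        log_p = (np.log(mix)[None, :]
                 - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                 - 0.5 * np.sum(diff ** 2 / variances[None], axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: every statistic is weighted by the observation weight.
        r = resp * w[:, None]
        nk = r.sum(axis=0) + 1e-12
        mix = nk / nk.sum()
        means = (r.T @ points) / nk[:, None]
        diff = points[:, None, :] - means[None, :, :]
        variances = np.einsum('nk,nkd->kd', r, diff ** 2) / nk[:, None] + min_var
    return mix, means, variances

# Synthetic example: two Gaussian blobs spread over three hypothetical sites.
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 1.0, (300, 2)),
                  rng.normal(10.0, 1.0, (300, 2))])
rng.shuffle(data)
sites = np.array_split(data, 3)

# Only centroids and counts leave each site; raw data stays local.
summaries = [local_kmeans(site, k=4, seed=i) for i, site in enumerate(sites)]
points = np.vstack([c for c, _ in summaries])
weights = np.concatenate([n for _, n in summaries])
keep = weights > 0  # drop empty local clusters
mix, means, variances = central_em(points[keep], weights[keep], k=2)
```

In this sketch the central site sees only twelve weighted observations, which reflects the abstract's point that EM is applied to a relatively small set. Note that the variances estimated this way describe the spread of the local centroids rather than of the raw data; a fuller treatment would also carry each local cluster's covariance.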
