The parallelizing compiler community has traditionally focused its efforts on scientific applications. This paper gives an
overview of a compiler/runtime project targeting parallel and scalable execution of data mining algorithms. To the best of
our knowledge, this is the first project with such a focus.
Data mining is the process of analyzing large datasets to extract novel and useful patterns or models. Although considerable
effort has gone into developing parallel algorithms for data mining tasks, the expertise and effort currently required
to implement, maintain, and performance-tune a parallel data mining application are an impediment to the wide use of
parallel computers for data mining.
We have developed a data parallel dialect of Java that can be used for expressing common data mining algorithms at a high
level. Our compiler generates a middleware specification from this dialect of Java. The middleware supports both
distributed-memory and shared-memory parallelization, and performs a number of I/O optimizations to support efficient
processing of disk-resident datasets. Our ultimate goal is to start from declarative mining operators and translate them
to data parallel Java.
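As a rough illustration of the kind of shared-memory, chunk-at-a-time processing described above, the following plain-Java sketch counts item occurrences over a dataset that is split into chunks. It is not the data parallel dialect or the middleware interface, neither of which is shown here; the class name `ChunkedItemCount`, the method `countItems`, the chunk layout, and the choice of item counting as the mining task are all assumptions made for this example. Each worker thread accumulates into a private array, and the partial results are combined at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; not part of the compiler or middleware.
final class ChunkedItemCount {

    // Counts how often each item id in [0, numItems) occurs across all chunks.
    // Assumption: every chunk is an int[] of item ids, standing in for a block
    // of a disk-resident dataset that has already been read into memory.
    public static long[] countItems(List<int[]> chunks, int numItems, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<long[]>> partials = new ArrayList<>();
            for (int t = 0; t < numThreads; t++) {
                final int worker = t;
                partials.add(pool.submit(() -> {
                    // Local phase: each worker updates only its private copy,
                    // so no locking is needed while chunks are processed.
                    long[] local = new long[numItems];
                    for (int c = worker; c < chunks.size(); c += numThreads) {
                        for (int item : chunks.get(c)) {
                            local[item]++;
                        }
                    }
                    return local;
                }));
            }
            // Global phase: combine the per-worker partial counts.
            long[] global = new long[numItems];
            for (Future<long[]> partial : partials) {
                long[] local = partial.get();
                for (int i = 0; i < numItems; i++) {
                    global[i] += local[i];
                }
            }
            return global;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<int[]> chunks = Arrays.asList(
                new int[] {0, 1, 1, 2},
                new int[] {2, 2, 3},
                new int[] {0, 3, 3, 3});
        long[] counts = countItems(chunks, 4, 2);
        System.out.println(Arrays.toString(counts)); // prints [2, 2, 3, 4]
    }
}
```

Keeping a private accumulator per thread trades extra memory for lock-free updates; this is only one of several possible shared-memory strategies and is used here purely for simplicity.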
In this paper, we describe the commonality among different data mining algorithms, the middleware and its interface, the data
parallel dialect of Java, and the compilation techniques required for generating the middleware specification. Experimental
evaluations of the middleware and the compiler are also presented.
This work was supported by NSF grant ACR-9982097, NSF CAREER award ACI-9733520, and NSF grant CCR-980852.