Spectral Clusterer for WEKA

Overview

This is preliminary, but fully functional, implementation of the Spectral Clustering algorithm for the WEKA framework.

WEKA is an Open Source Knowledge Discovering and Data Mining system developed in Java by the University of Waikato in New Zealand It offers many algorithms and tools commonly available only in commercial systems, but, differently from the last ones, it is open and freely expandable.

The implementation relies on Java linear algebra package called COLT and written by Wolfgang Hoschek at CERN. WEKA is released under the GNU General Public License, while COLT is released under ad-hoc, but GNU GPL compatible, license, then the Spectral Clusterer is consequently released under GNU General Public License. It can be freely used, modified and distributed in conformity with the GNU General Public License.
This software is released under GNU General Public License. It can be freely used, modified and distributed in conformity with the GNU General Public License. The License Agreement is available here.
For any information about the GNU GPL and its implications please refer to GNU site.

The content of this site is provided “AS IS” without warranties of any kind either express or implied. Currently, I am working on to improve the performance (and correct its bug too). Please send me any comment or suggestion and report me any bug you notice. If you plan to employ this component in some DM or KD application, please notice it me (this is not conforming to GNU GPL, but I would know if my work is useful).

Change log

  • Upgraded to WEKA 3.6 (1.1) 22 Aug 2010
  • First release (1.0) 8 Jan 2002

Spectral Clustering

The proposed implementation is derived from [1] and [2]. In [3] is presented another version of spectral clustering algorithm, and a comparative analysis of various clustering algorithms based on spectral methods is in [4].

The spectral clustering algorithm is based on the concept of similarity between point instead of distance, as other algorithms do. The implemented algorithm is formulated as graph partition problem where the weight of each edge is the similarity between points that correspond to vertex connected by the edge.

The goal of the algorithm is find the minimum weight cuts in the graph, but this problem can be addressed by the means of linear algebra, in particular by the eigenvalue decomposition techniques, from which the term “spectral” derives.
Since the WEKA clustering framework relies on the distance between points, I have introduced a function to compute the similarity from the Euclidean distance:

s(x, y) = exp( - d(x, y) ^ 2 / (2 * sigma ^ 2))

where:

  • s(x, y) is the similarity between the points x and y;
  • d(x, y) is metric distance function (e.g., Euclidean distance);
  • sigma is an user specified scaling factor.

This a limit of the implementation, because one could need a different similarity function, i.e. to deal with nominal attribute. This goal could be achieved defining accordingly a distance function for the specific problem input space exploiting the WEKA DistanceFunction class.
More information about the algorithm can also be retrieved in source documentation.

Release notes

The component has been developed using the WEKA stable version 3.6 and the COLT version 1.2. It is no compatible with previous versions. It has been built with standard Sun JDK 1.6 compiler.

Since the algorithm relies on the similarity matrix it needs an amount of memory proportionate to square of the number of points. To reduce this amount there an option that switches to sparse matrix representation.

The time complexity of the implemented algorithm depends on the complexity of the eigenvalue decomposition algorithm (about O(n^3), where n is the number of rows/columns, or rather the number of items in the dataset), but it is possible to reduce the time complexity, since the algorithm needs certain eigenvectors only (which are corresponding to smallest eigenvalues in magnitude), I think it possible to implement an ad-hoc decomposition function based on Lanczos or sub-space iteration methods for sparse matrices.

A limit is the lack of concise representation of partitions. In this version each cluster is modelled as the set of its own points. To cluster a new instance the implementation search the set that contains the nearest point.
I am not aware of any proposal for modelling the clusters in this case, but I plan to improve the classification method by the introduction of a less naive algorithm (i.e. k-NN). In this case also I need to replace the distance function with a similarity one.

System requirements

In order to use this component, you need:

Installation instructions

The installation is a bit involved, because the WEKA environment lacks of a standard module registration procedure.

To use this module one must follow the subsequent instructions:

  1. Download and install the WEKA System. It can be downloaded from the University of Waikato website. Install it following its own instructions.
  2. Download and install the COLT numeric package. It can be downloaded from the COLT project’s website. To install it follow its own instructions.
  3. Download the Spectral Clusterer from here (The source code, according to GNU GPL, is included in the same file).
  4. To run WEKA, the Java runtime’s classpath should simply include the following jars: weka.jar, colt.jar, and weka-spectral-clusterer.jar.
    For instance, assuming that these files are in the current directory, the command to issue is: 

    java -cp weka.jar;weka-spectral-clusterer.jar;colt.jar weka.gui.GUIChooser

That’s all folk. Now you can start the WEKA knowledge explorer, the new algorithm is available in the Cluster tab of the Explorer.

Build instructions

The Spectral Clusterer component could be built from the source code available here. It should be enough adding the WEKA and COLT libraries to the compiler’s classpath, in order to compile it. Please refer to the documentation of your IDE for further details.

Documentation

For more information about the component and its options see the Javadoc file. It can be built from the source and it is available for convenience here. See the source comment and the cited references for specific details about the algorithm and its implementation.

References

[1]	Shi, J., and J. Malik (1997) "Normalized Cuts and Image Segmentation",
	in Proc. of IEEE Conf. on Comp. Vision and Pattern Recognition,
	Puerto Rico.

[2]	Kannan, R., S. Vempala, and A. Vetta (2000) "On Clusterings - Good, Bad
	and Spectral", Tech. Report, CS Dept., Yale University.

[3]	Ng, A. Y., M. I. Jordan, and Y. Weiss (2001) "On Spectral Clustering:
 	Analysis and an algorithm", in Advances in Neural Information Processing
	Systems 14.

[4]	Weiss, Y. (1999) "Segmentation using eigenvectors: a unifying view",
	Tech. Rep., CS. Dept., UC Berkeley.

 

8 Responses to “Spectral Clusterer for WEKA”


Leave a Reply to Yehonatan Cancel reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Powered by WordPress. Designed by elogi.