Spectral Clusterer for WEKA

by Luigi Dragone [e-Mail] [Home Page]

Overview

This is preliminary, but fully functional, implementation of the Spectral Clustering algorithm for the WEKA framework.

WEKA is an Open Source Knowledge Discovering and Data Mining system developed in Java by the University of Waikato in New Zealand It offers many algorithms and tools commonly available only in commercial systems, but, differently from the last ones, it is open and freely expandable.

The implementation relies on Java linear algebra package called COLT and written by Wolfgang Hoschek at CERN. WEKA is released under the GNU General Public License, while COLT is released under ad-hoc, but GNU GPL compatible, license, then the Spectral Clusterer is consequently released under GNU General Public License. It can be freely used, modified and distributed in conformity with the GNU General Public License.
This software is released under GNU General Public License. It can be freely used, modified and distributed in conformity with the GNU General Public License. The License Agreement is available here. For any information about the GNU GPL and its implications please refer to GNU site.

The content of this site is provided "AS IS" without warranties of any kind either express or implied. Currently, I am working on to improve the performance (and correct its bug too). Please send me any comment or suggestion and report me any bug you notice. If you plan to employ this component in some DM or KD application, please notice it me (this is not conforming to GNU GPL, but I would know if my work is useful).

To contact me please send an e-mail to luigi@luigidragone.com.
Click here to go to my home page (in Italian).

Change log

Spectral Clustering

The proposed implementation is derived from [1] and [2]. In [3] is presented another version of spectral clustering algorithm, and a comparative analysis of various clustering algorithms based on spectral methods is in [4].

The spectral clustering algorithm is based on the concept of similarity between point instead of distance, as other algorithms do. The implemented algorithm is formulated as graph partition problem where the weight of each edge is the similarity between points that correspond to vertex connected by the edge.

The goal of the algorithm is find the minimum weight cuts in the graph, but this problem can be addressed by the means of linear algebra, in particular by the eigenvalue decomposition techniques, from which the term "spectral" derives.
Since the WEKA clustering framework relies on the distance between points, I have introduced a function to compute the similarity from the Euclidean distance:

s(x, y) = exp( - d(x, y) ^ 2 / (2 * sigma ^ 2))
Where:

This a limit of the implementation, because one could need a different similarity function, i.e. to deal with nominal attribute. I am working on, but I think that I need to extend in some way the framework by the introduction of new concepts (then classes) to define arbitrary similarity function. More information about the algorithm can also be retrieved in source documentation.

Release notes

The component has been developed using the WEKA stable version 3.2 and the COLT version 1.0.1. I expect that it is compatible with their newer versions. It has been also built with standard Sun JDK 1.3.1_01 compiler, then I am expecting that it works fine with any standard Java 2 implementation.

Since the algorithm relies on the similarity matrix it needs an amount of memory proportionate to square of the number of points. To reduce this amount there an option that switches to sparse matrix representation.

The time complexity of the implemented algorithm depends on the complexity of the eigenvalue decomposition algorithm (about O(n^3), where n is the number of rows/columns, or rather the number of items in the dataset), but it is possible to reduce the time complexity, since the algorithm needs certain eigenvectors only (which are corresponding to smallest eigenvalues in magnitude), I think to implement an ad-hoc decomposition function based on Lanczos or sub-space iteration methods for sparse matrices.

Currently, the implementation runs fine on dataset of hundreds of points: on a AMD K7 500 MHz workstation w/512 MB of main memory it requires some minutes (4-5) to complete an execution with over 200 points (Windows 2000 Professional and Sun JDK 1.3.0_01).

A limit is the lack of concise representation of partitions. In this version each cluster is modeled as the set of its own points. To clusterize a new instance the implementation search the set that contains the nearest point.

I do not find any proposal for modeling the clusters in this case, but I plan to improve the classification method by the introduction of a less naive algorithm (i.e. k-NN). In this case also I need to replace the distance function with a similarity one.

System requirements

In order to use this component, you need:

Installation instructions

The installation is a bit involved, because the WEKA environment lacks of a standard module registration procedure.

To use this module one must follow the subsequent instructions:

  1. Download and install the WEKA System. It can be downloaded from the University of Waikato website. To install it follow its own instructions.
  2. Download and install the COLT numeric package. It can be downloaded from the CERN website. To install it follow its own instructions. (Add the colt.jar file to the CLASSPATH would be enough)
  3. Download the Spectral Clusterer from here. (The source code, according to GNU GPL, is available here)
  4. Unpack the WEKA jar file (it is named weka.jar and it is located in the WEKA installation directory).
  5. Copy the SpectralCluster.class file previously downloaded in the directory weka/clusterers.
  6. Edit the GenericObjectEditor.props file in the directory weka/gui, in the section weka.clusterers.Clusterer add the reference to the new component as it is shown below:
    # Lists the Clusterers I want to choose from
    weka.clusterers.Clusterer=\
     weka.clusterers.EM, \
     weka.clusterers.SimpleKMeans, \
     weka.clusterers.Cobweb, \
     weka.clusterers.SpectralClusterer
    
  7. Optionally, repack the weka directory in the jar file and replace the old one. To run the jar file directly it must contain the COLT library too. Alternatively, put a reference to both jar files (weka.jar and colt) in CLASSPATH, and run the WEKA main class directly:
    	java weka.gui.GUIChooser
    
That's all folk. Now you can start the WEKA knowledge explorer, the new algorithm is available in the Cluster tab of the Explorer.

Build instructions

To build the Spectral Clusterer from the source code it is enough to add the COLT library (colt.jar file, located in the installation directory) and the WEKA library (weka.jar file, located in the installation directory) to the compiler CLASSPATH.

Keep in mind that the class belongs to the weka.clusterers package, then it must be put in the directory weka/clusterers of the source tree.

Documentation

For more information about the component and its options see the Javadoc file. It can be built from the source and it is available for convenience here.
See the source comment and the cited references for specific details about the algorithm and its implementation.

References

[1]	Shi, J., and J. Malik (1997) "Normalized Cuts and Image Segmentation",
	in Proc. of IEEE Conf. on Comp. Vision and Pattern Recognition, 
	Puerto Rico.
[2]	Kannan, R., S. Vempala, and A. Vetta (2000) "On Clusterings - Good, Bad 
	and Spectral", Tech. Report, CS Dept., Yale University.
[3]	Ng, A. Y., M. I. Jordan, and Y. Weiss (2001) "On Spectral Clustering:
 	Analysis and an algorithm", in Advances in Neural Information Processing
	Systems 14.
[4]	Weiss, Y. (1999) "Segmentation using eigenvectors: a unifying view", 
	Tech. Rep., CS. Dept., UC Berkeley.


luigi@luigidragone.com.