This is preliminary, but fully functional, implementation of the Spectral Clustering algorithm for the WEKA framework.
WEKA is an Open Source Knowledge Discovering and Data Mining system developed in Java by the University of Waikato in New Zealand It offers many algorithms and tools commonly available only in commercial systems, but, differently from the last ones, it is open and freely expandable.
The implementation relies on Java linear algebra package called
COLT and
written by Wolfgang Hoschek at CERN. WEKA is released under the GNU General
Public License, while COLT is released under ad-hoc, but GNU GPL compatible,
license, then the Spectral Clusterer is consequently released under GNU General
Public License. It can be freely used, modified and distributed in conformity
with the GNU General Public License.
This software is released under GNU General Public License. It can be freely
used, modified and distributed in conformity with the GNU General Public License.
The License Agreement is available here.
For any information about the GNU GPL and its implications please refer to
GNU site.
The content of this site is provided "AS IS" without warranties of any kind either express or implied. Currently, I am working on to improve the performance (and correct its bug too). Please send me any comment or suggestion and report me any bug you notice. If you plan to employ this component in some DM or KD application, please notice it me (this is not conforming to GNU GPL, but I would know if my work is useful).
To contact me please send an e-mail to
luigi@luigidragone.com.
Click here to go to my home page
(in Italian).
The proposed implementation is derived from [1] and [2]. In [3] is presented another version of spectral clustering algorithm, and a comparative analysis of various clustering algorithms based on spectral methods is in [4].
The spectral clustering algorithm is based on the concept of similarity between point instead of distance, as other algorithms do. The implemented algorithm is formulated as graph partition problem where the weight of each edge is the similarity between points that correspond to vertex connected by the edge.
The goal of the algorithm is find the minimum weight cuts in the graph, but this
problem can be addressed by the means of linear algebra, in particular by the
eigenvalue decomposition techniques, from which the term "spectral"
derives.
Since the WEKA clustering framework relies on the distance between points, I
have introduced a function to compute the similarity from the Euclidean
distance:
s(x, y) = exp( - d(x, y) ^ 2 / (2 * sigma ^ 2))Where:
s(x, y) is the similarity between the points x and
y;d(x, y) is the Euclidean distance;This a limit of the implementation, because one could need a different similarity function, i.e. to deal with nominal attribute. I am working on, but I think that I need to extend in some way the framework by the introduction of new concepts (then classes) to define arbitrary similarity function. More information about the algorithm can also be retrieved in source documentation.
The component has been developed using the WEKA stable version 3.2 and the COLT version 1.0.1. I expect that it is compatible with their newer versions. It has been also built with standard Sun JDK 1.3.1_01 compiler, then I am expecting that it works fine with any standard Java 2 implementation.
Since the algorithm relies on the similarity matrix it needs an amount of memory proportionate to square of the number of points. To reduce this amount there an option that switches to sparse matrix representation.
The time complexity of the implemented algorithm depends on the complexity of
the eigenvalue decomposition algorithm (about O(n^3), where
n is the number of rows/columns, or rather the number of items in the
dataset), but it is possible to reduce the time complexity, since the algorithm
needs certain eigenvectors only (which are corresponding to smallest eigenvalues
in magnitude), I think to implement an ad-hoc decomposition function based on
Lanczos or sub-space iteration methods for sparse matrices.
Currently, the implementation runs fine on dataset of hundreds of points: on a AMD K7 500 MHz workstation w/512 MB of main memory it requires some minutes (4-5) to complete an execution with over 200 points (Windows 2000 Professional and Sun JDK 1.3.0_01).
A limit is the lack of concise representation of partitions. In this version each cluster is modeled as the set of its own points. To clusterize a new instance the implementation search the set that contains the nearest point.
I do not find any proposal for modeling the clusters in this case, but I plan to improve the classification method by the introduction of a less naive algorithm (i.e. k-NN). In this case also I need to replace the distance function with a similarity one.
In order to use this component, you need:
The installation is a bit involved, because the WEKA environment lacks of a standard module registration procedure.
To use this module one must follow the subsequent instructions:
colt.jar
file to the CLASSPATH would be enough)
weka.jar and it is located
in the WEKA installation directory).
weka/clusterers.
GenericObjectEditor.props file in the directory
weka/gui, in the section weka.clusterers.Clusterer add the
reference to the new component as it is shown below:
# Lists the Clusterers I want to choose from weka.clusterers.Clusterer=\ weka.clusterers.EM, \ weka.clusterers.SimpleKMeans, \ weka.clusterers.Cobweb, \ weka.clusterers.SpectralClusterer
weka directory in the jar file and replace the
old one. To run the jar file directly it must contain the COLT library too.
Alternatively, put a reference to both jar files (weka.jar and
colt) in CLASSPATH, and run the WEKA main class directly:
java weka.gui.GUIChooser
To build the Spectral Clusterer from the source
code it is enough to add the COLT library (colt.jar file, located
in the installation directory) and the WEKA library (weka.jar file,
located in the installation directory) to the compiler CLASSPATH.
Keep in mind that the class belongs to the weka.clusterers package,
then it must be put in the directory weka/clusterers of the source
tree.
For more information about the component and its options see the Javadoc file.
It can be built from the source and it is available for convenience
here.
See the source comment and the cited references for
specific details about the algorithm and its implementation.
[1] Shi, J., and J. Malik (1997) "Normalized Cuts and Image Segmentation",
in Proc. of IEEE Conf. on Comp. Vision and Pattern Recognition,
Puerto Rico.
[2] Kannan, R., S. Vempala, and A. Vetta (2000) "On Clusterings - Good, Bad
and Spectral", Tech. Report, CS Dept., Yale University.
[3] Ng, A. Y., M. I. Jordan, and Y. Weiss (2001) "On Spectral Clustering:
Analysis and an algorithm", in Advances in Neural Information Processing
Systems 14.
[4] Weiss, Y. (1999) "Segmentation using eigenvectors: a unifying view",
Tech. Rep., CS. Dept., UC Berkeley.