The latest in K-tree

November 2013

Development has started on LMW-tree. It is a generic C++ template library implementing EM-tree, K-tree and related research using Boost and Intel Thread Building Blocks. It includes K-tree, EM-tree and other algorithms. It reduces memory usage and increases execution speed. Parallelized versions of the algorithms have been implemented. Streaming and distributed implementations of the EM-tree are being developed for clustering of extremely large collections containing billions of examples. Development is taking place on GitHub.

A brief introduction

K-tree is a tree structured clustering algorithm. It is also refered to as a Tree Structured Vector Quantizer (TSVQ). The goal of cluster analysis is to group objects based on similarity. Each object in a K-tree is represented by an n-dimensional vector. All vectors in the tree must have the same number of dimensions.

The algorithm is a hybrid of the B+-tree and k-means algorithms. It uses a similar tree structure to the B+-tree and uses k-means to perform splits. The tree forms a nearest neighbour search tree. Unlike k-means the number of clusters does not need to be specified upfront. However, a tree order must be specified that restricts how many vectors can be stored in any node. Each level of the tree produces a different number of clusters.

The K-tree algorithm is useful for clustering large data sets with many features. It scales best in comparison to traditional approaches when there are many objects to cluster into a large number of clusters. In this scenario each cluster contains relatively few objects. For example, a document collection of three million documents can be clustered into one hundred thousand clusters.

Future directions

Currently K-tree has implementations in C++, C, Java and Python. The Python version has recently been written by Ulf and focusses on rapid prototyping for research. Development has recently started on the C++ version.

Licensed under LGPL and GPL

To download K-tree see the software page. Please cite papers from the publications page when citing K-tree.

K-tree developers

The following people have contributed to the development of K-tree (sorted lexicographically by last name)

Lance De Vine, QUT

Chris De Vries, QUT

Shlomo Geva, QUT

Ulf Großekathöfer, Bielefeld University

The development of K-tree has been proudly supported by the QUT Faculty of Science and Technology.

2 normal distributions in 2D