The latest in K-tree and friends

ClueWeb Clusters

October 2014

The ClueWeb09 and ClueWeb12 document collections are some of the largest document collections used for research. They contain 500 million and 733 million English language documents respectively. The Streaming EM-tree algorithm using binary document vectors produced by TopSig has been used to cluster these collections into more than 500,000 clusters. This has been done using a single 16 core machine in 15 hours using the LMW-tree C++ template library. The experiments were run in the QUT Big Data Lab. It is is expected that a distributed version of the algorithm will be able to cluster the entire searchable web of around 50 billion documents into 1 million clusters. The clusters produced by these algorithms have been made browsable online: ClueWeb09 clusters and ClueWeb12 clusters. The cluster mappings and software are also available.

Paper Accepted

September 2014

The paper "Clustering and Labeling a web scale document collection using Wikipedia clusters" using TopSig binary vectors using clustering algorithms from the LMW-tree C++ template library has been accepted to Web-scale Knowledge, Representation and Reasoning (Web-KR 2014). Congratulations and thanks goes to all the authors. The experiments were run in the QUT Big Data Lab which is supported by the Science and Engineering Faculty and and High Performance Computing centre.

PhD Thesis Online

September 2014

The PhD thesis introducing the EM-tree algorithm and document clustering with binary document vectors is available online. It is titled "Document Clustering Algorithms, Representations and Evaluation for Information Retrival".


November 2013

Development has started on LMW-tree. It is a generic C++ template library implementing EM-tree, K-tree and related research using Boost and Intel Thread Building Blocks. It includes K-tree, EM-tree and other algorithms. It reduces memory usage and increases execution speed. Parallelized versions of the algorithms have been implemented. Streaming and distributed implementations of the EM-tree are being developed for clustering of extremely large collections containing billions of examples. Development is taking place on GitHub.

A brief introduction

K-tree is a tree structured clustering algorithm. It is also refered to as a Tree Structured Vector Quantizer (TSVQ). The goal of cluster analysis is to group objects based on similarity. Each object in a K-tree is represented by an n-dimensional vector. All vectors in the tree must have the same number of dimensions.

The algorithm is a hybrid of the B+-tree and k-means algorithms. It uses a similar tree structure to the B+-tree and uses k-means to perform splits. The tree forms a nearest neighbour search tree. Unlike k-means the number of clusters does not need to be specified upfront. However, a tree order must be specified that restricts how many vectors can be stored in any node. Each level of the tree produces a different number of clusters.

The K-tree algorithm is useful for clustering large data sets with many features. It scales best in comparison to traditional approaches when there are many objects to cluster into a large number of clusters. In this scenario each cluster contains relatively few objects. For example, a document collection of three million documents can be clustered into one hundred thousand clusters.

Further Development
Future directions

Currently K-tree has implementations in C++, C, Java and Python. The Python version has recently been written by Ulf and focusses on rapid prototyping for research. Development has recently started on the C++ version.

Download K-tree
Licensed under LGPL and GPL

To download K-tree see the software page. Please cite papers from the publications page when citing K-tree.

Development Team
K-tree developers

The following people have contributed to the development of K-tree (sorted lexicographically by last name)
    Lance De Vine, QUT
    Chris De Vries, QUT
    Shlomo Geva, QUT
    Ulf Großekathöfer, Bielefeld University

The development of K-tree has been proudly supported by the QUT Faculty of Science and Technology.

K-tree Codebook Clusters in 2D
2 normal distributions in 2D