The ClueWeb09 and ClueWeb12 document collections are some of the largest document collections used for research. They contain 500 million and 733 million English-language documents respectively. The Streaming EM-tree algorithm, using binary document vectors produced by TopSig, has been used to cluster these collections into more than 500,000 clusters. This was done on a single 16-core machine in 15 hours using the LMW-tree C++ template library. The experiments were run in the QUT Big Data Lab. It is expected that a distributed version of the algorithm will be able to cluster the entire searchable web of around 50 billion documents into 1 million clusters. The clusters produced by these algorithms have been made browsable online: ClueWeb09 clusters and ClueWeb12 clusters. The cluster mappings and software are also available.
The paper "Clustering and Labeling a web scale document collection using Wikipedia clusters", which uses TopSig binary vectors and clustering algorithms from the LMW-tree C++ template library, has been accepted to Web-scale Knowledge, Representation and Reasoning (Web-KR 2014). Congratulations and thanks go to all the authors. The experiments were run in the QUT Big Data Lab, which is supported by the Science and Engineering Faculty and the High Performance Computing centre.
The PhD thesis introducing the EM-tree algorithm and document clustering with binary document vectors is available online. It is titled "Document Clustering Algorithms, Representations and Evaluation for Information Retrieval".
Development has started on LMW-tree, a generic C++ template library built on Boost and Intel Threading Building Blocks. It implements K-tree, EM-tree and other algorithms from related research. It reduces memory usage and increases execution speed. Parallelized versions of the algorithms have been implemented. Streaming and distributed implementations of the EM-tree are being developed for clustering extremely large collections containing billions of examples. Development is taking place on GitHub.
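To give a feel for why binary document vectors suit this kind of clustering, here is a minimal sketch of the two basic operations a centroid-based clusterer needs over bit vectors: finding the nearest centroid by Hamming distance, and recomputing a centroid by per-bit majority vote. This is not the LMW-tree API. The names `Signature`, `hammingDistance`, `nearestCentroid` and `majorityCentroid`, the 64-bit signature width, and the majority-vote update are all assumptions made for illustration only.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical signature width for illustration; not taken from TopSig or LMW-tree.
constexpr std::size_t kBits = 64;
using Signature = std::bitset<kBits>;

// Hamming distance: the number of bit positions in which a and b differ.
inline std::size_t hammingDistance(const Signature& a, const Signature& b) {
    return (a ^ b).count();
}

// Index of the centroid closest to x; ties go to the lowest index.
inline std::size_t nearestCentroid(const Signature& x,
                                   const std::vector<Signature>& centroids) {
    assert(!centroids.empty());
    std::size_t best = 0;
    std::size_t bestDist = hammingDistance(x, centroids[0]);
    for (std::size_t i = 1; i < centroids.size(); ++i) {
        const std::size_t d = hammingDistance(x, centroids[i]);
        if (d < bestDist) {
            best = i;
            bestDist = d;
        }
    }
    return best;
}

// Assumed centroid update: a bit is set when more than half of the
// member signatures have that bit set.
inline Signature majorityCentroid(const std::vector<Signature>& members) {
    Signature centroid;
    for (std::size_t bit = 0; bit < kBits; ++bit) {
        std::size_t ones = 0;
        for (const Signature& m : members) {
            if (m.test(bit)) ++ones;
        }
        if (2 * ones > members.size()) centroid.set(bit);
    }
    return centroid;
}
```

Because the distance is an XOR followed by a bit count, each comparison touches only a few machine words, which is one plausible reason binary vectors help with memory use and speed at this scale.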