What is SHReC?
SHReC is a Java package implementing a hierarchical document clustering algorithm based on a statistical co-occurrence measure called subsumption. The algorithm is particularly suited to the problem of on-line "search results" clustering, since it requires only small amounts of text data.
The package is of particular interest to Information Retrieval researchers, since it offers a robust framework for experimenting with document and search results clustering. As research software, it is oriented mostly toward flexibility, sometimes at the cost of performance.
The software is currently employed at the tumba! Portuguese Web search engine, where it is used to cluster search results into thematic categories.
SHReC was developed at the XLDB group of the Department of Informatics of the Faculty of Sciences of the University of Lisbon in Portugal. It was created to support the research paper "Hierarchically Clustering Web Search Results".
SHReC was written by Bruno Martins.
Document clustering is the act of collecting similar documents into bins, where similarity is some function defined over pairs of documents. It differs from other techniques (classification, taxonomy building, tagging, etc.) in that it is fully automated. The biggest challenge for document clustering has been to quickly find meaningful, hierarchically organized groups that are concisely described.
The central idea of our algorithm is that a good cluster is one which possesses a good, readable description. So, rather than form clusters relying mainly on mathematical optimization, and then figure out how to describe them, we only form well-described clusters in the first place.
SHReC is based on term subsumption as originally presented in Sanderson & Croft, "Deriving concept hierarchies from text". The original proposal was extended with more advanced pre-processing stages, which involve the proper use of meta-data available for the documents (e.g. anchor text or extracted named entities).
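To make the subsumption idea concrete, here is a minimal sketch of the test as we recall it from Sanderson & Croft: a term x subsumes a term y if P(x|y) >= 0.8 and P(y|x) < 1, with probabilities estimated from document co-occurrence counts. This is an illustration only and not SHReC's actual code: the class name SubsumptionSketch, its methods, the 0.8 threshold and the toy documents are all assumptions, and none of the extended pre-processing stages are shown.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; it is not part of the SHReC package.
class SubsumptionSketch {

    // Threshold as we recall it from Sanderson & Croft (assumption: SHReC may use another value).
    static final double THRESHOLD = 0.8;

    // Number of documents that contain the given term.
    static int docFreq(List<Set<String>> docs, String term) {
        int count = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(term)) count++;
        }
        return count;
    }

    // Number of documents that contain both terms.
    static int coDocFreq(List<Set<String>> docs, String x, String y) {
        int count = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(x) && doc.contains(y)) count++;
        }
        return count;
    }

    // True if x subsumes y: x appears in most documents where y appears,
    // while y does not appear in every document where x appears.
    static boolean subsumes(List<Set<String>> docs, String x, String y) {
        int dx = docFreq(docs, x);
        int dy = docFreq(docs, y);
        if (dx == 0 || dy == 0) return false; // a term never seen cannot take part
        int dxy = coDocFreq(docs, x, y);
        double pXgivenY = (double) dxy / dy;
        double pYgivenX = (double) dxy / dx;
        return pXgivenY >= THRESHOLD && pYgivenX < 1.0;
    }

    public static void main(String[] args) {
        // Toy "search results", each reduced to a set of terms (made-up data).
        List<Set<String>> docs = List.of(
            Set.of("jaguar", "car"),
            Set.of("jaguar", "car", "dealer"),
            Set.of("jaguar", "cat"),
            Set.of("jaguar", "car", "engine"));
        System.out.println(subsumes(docs, "jaguar", "car")); // general term over specific term
        System.out.println(subsumes(docs, "car", "jaguar")); // the reverse direction
    }
}
```

In this toy collection "jaguar" occurs in every document that mentions "car", but not the other way around, so "jaguar" would become the parent of "car" in the resulting hierarchy. Applying the same test to all term pairs is what yields a concept hierarchy under which documents can then be grouped.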
SHReC is released under the BSD License, which basically states that you can do anything you like with it as long as you mention the authors and make it clear that the library is covered by the BSD License. It also exempts us from any liability, should this library eat your hard disc or kill your cat.
The software is relatively easy to install and run. We encourage you to try it out and let us know of any problems you find. We would also be very happy to hear from people who are using this package.