Automatic Collection Description

This activity is now closed. The information on this page is provided for historical purposes only.


This project will investigate the creation of centroids from digital collections and evaluate their usefulness in creating automatic descriptions of the collections.


Centroids provide an abstract characterization of the database in a standard format [1] and have been used as a referral mechanism in distributed searching environments.

A centroid can be thought of as a simple inverted index that servers in a network share with one another to provide hints as to the location of data in a large, loosely coupled distributed database. A server or user client consults a centroid for hints as to which other servers might contain information relevant to a user's search. These hints are known as "forward knowledge" [2].
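
The referral idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the centroid is modeled as the set of unique words seen in each field of a collection, and the names build_centroid and candidate_servers, the tokenization rule, and the sample records are all hypothetical rather than taken from WHOIS++, ROADS, or Pears.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")  # assumed tokenization: lowercase alphanumeric runs


def build_centroid(records):
    """Return {field: set of unique words} for a list of {field: text} records."""
    centroid = defaultdict(set)
    for record in records:
        for field, text in record.items():
            centroid[field].update(WORD.findall(text.lower()))
    return dict(centroid)


def candidate_servers(centroids, field, query):
    """Return the names of servers whose centroid holds every query word in `field`."""
    terms = WORD.findall(query.lower())
    if not terms:
        return []
    return sorted(
        name
        for name, centroid in centroids.items()
        if all(term in centroid.get(field, set()) for term in terms)
    )


# Hypothetical servers and records, for illustration only.
centroids = {
    "server-a": build_centroid([{"title": "Deep sea fishing"}]),
    "server-b": build_centroid([{"title": "Sea birds of Ohio"}]),
}
print(candidate_servers(centroids, "title", "sea birds"))
```

Because a centroid records only that a word occurs somewhere in a field, not which records contain it, a match is a hint rather than a guarantee: a server may be suggested even though no single record there contains all of the query words.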


  1. Centroids will be created for WorldCat, ETDCAT and selected OAI repositories. It is hoped that a sufficiently large selection of web pages (possibly provided by Google) can also be used as a source for a centroid.
  2. Centroids made from samples of those collections will be compared with full centroids and their differences evaluated.
  3. Centroids from different collections will be compared with the expectation that those differences can be used to generate keyword descriptions of the collections.
  4. Software to generate the centroids will be derived from existing Pears software.
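
Steps 2 and 3 do not say how centroids will be compared, so the following Python sketch is only one plausible reading. It assumes a centroid that also keeps a document count per term, measures how much of a full centroid's vocabulary a sample recovers, and ranks terms by a smoothed log ratio of document frequencies between two collections. The function names, the smoothing, and the toy data are assumptions made for illustration, not the project's method.

```python
import math
from collections import Counter


def term_counts(documents):
    """Count, for each term, the number of documents it appears in."""
    counts = Counter()
    for doc in documents:
        counts.update(set(doc.lower().split()))
    return counts


def vocabulary_coverage(sample, full):
    """Fraction of the full centroid's terms that also occur in the sample centroid."""
    if not full:
        return 0.0
    return len(set(sample) & set(full)) / len(full)


def distinctive_terms(target, reference, n_target, n_reference, top=5):
    """Rank target terms by smoothed log ratio of document frequency vs. the reference."""
    scores = {}
    for term, df in target.items():
        p_target = (df + 1) / (n_target + 2)  # add-one smoothing (an assumption)
        p_reference = (reference.get(term, 0) + 1) / (n_reference + 2)
        scores[term] = math.log(p_target / p_reference)
    # Highest score first; ties broken alphabetically so the order is stable.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


# Toy collections, invented for illustration.
theses = ["thesis on polymer chemistry", "thesis on medieval history", "thesis on polymer physics"]
general = ["history of ohio", "cooking with polymer clay", "medieval history atlas", "guide on birds"]

full = term_counts(theses)
sample = term_counts(theses[:1])
reference = term_counts(general)

print(vocabulary_coverage(sample, full))
print(distinctive_terms(full, reference, len(theses), len(general), top=3))
```

In this reading, the reference collection plays the role the project envisions for WorldCat: a baseline against which the characteristic vocabulary of another collection stands out.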

Why OCLC is conducting this research and how it helps libraries

  • Users and owners of collections should benefit from this work.
  • Users will benefit by being able to make better-informed decisions about which collections to access.
  • Owners will benefit both from having the description done automatically and from the comprehensive nature of the evaluation.
  • Collection description and selection are problems for the entire information community, and this project may demonstrate WorldCat's value as a standard for comparing the intellectual content of other collections.
  • This work will give the collection development community new tools.

Research team

  • Ralph Levan (lead)


[1] "WHOIS++ & centroids (... plus some other stuff)". Martin Hamilton, Loughborough University. Available at <>; downloaded 25 November 2002.

[2] Jon P. Knight and Martin Hamilton, "The Use of Centroids in the ROADS Project". Originally available at ; downloaded 25 November 2002 from the Internet Archive: <>.