This activity is now closed. The information on this page is provided for historical purposes only.

The Webutils Open Source project offers Perl utilities to support web harvesting and metadata extraction.


This software may be used without charge in accordance with the terms of the OCLC Research Public License. A PDF version of the license is also available (PDF: 130K/3pp.).

As of 2006, we are issuing software under the Apache License, Version 2.0.

If you would like to use this software under the Apache License, please contact us; we may be able to update the software to use that license.


The Webutils code in the CVS repository is divided into modules for ease of retrieval.

The modules are listed below; documentation for each is viewable, and the Webutils code may be downloaded for use or evaluation without using CVS.

WWW::Harvester (v 1.15) Documentation Source
This module provides an extensible mechanism for harvesting web pages, i.e., it acts as a spider or robot.
HTML::Normalizer (v 1.04) Documentation Source
This module extracts and normalizes the text of an HTML page.
HTML::MetaExtor (v 1.08) Documentation Source
This module extracts metadata from the META elements of an HTML page. If supplied with a list of index terms, it will also report which of those terms appear in the page. (Note: MetaExtor depends on Normalizer.)
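The actual HTML::MetaExtor API is not documented here, but the idea it implements — pulling name/content pairs from META elements and checking a page's text against a supplied list of index terms — can be sketched with Python's standard-library `html.parser`. This is an illustration of the technique only, not the module's interface; the class and method names below are invented for the example.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect name/content pairs from META elements, plus the page's text."""

    def __init__(self):
        super().__init__()
        self.meta = {}      # META name -> content
        self._text = []     # accumulated character data

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"].lower()] = d["content"]

    def handle_data(self, data):
        self._text.append(data)

    def terms_present(self, terms):
        """Return the supplied index terms that occur in the page text."""
        text = " ".join(self._text).lower()
        return [t for t in terms if t.lower() in text]

html_doc = """<html><head>
<meta name="DC.Title" content="Web Harvesting Notes">
<meta name="keywords" content="harvesting, metadata">
</head><body><p>Notes on metadata extraction and web robots.</p></body></html>"""

p = MetaExtractor()
p.feed(html_doc)
print(p.meta["dc.title"])                       # -> Web Harvesting Notes
print(p.terms_present(["metadata", "spider"]))  # -> ['metadata']
```

A real extractor along these lines would also handle `http-equiv` META elements and would normalize the page text first (whitespace, entities), which is the role HTML::Normalizer plays for MetaExtor.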