The Webutils Open Source project offers Perl utilities to support web harvesting and metadata extraction.
As of 2006 we are issuing software under the Apache License, Version 2.0.
If you would like to use software that we released earlier under the Apache license, please contact us; we may be able to update that software to use the Apache license.
The Webutils code in the CVS repository is divided into modules for ease of retrieval.
The modules are listed below, and the documentation for each is viewable. The Webutils code may also be downloaded for use or evaluation without using CVS.
|This module provides an extensible mechanism for harvesting web pages, i.e., for use as a spider or robot.|
|This module extracts and normalizes the text of an HTML page.|
|This module extracts metadata from the META elements of an HTML page. If supplied with a list of index terms, it will also report which of those terms appear in the page. (Note: MetaExtor depends on Normalizer.)|
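
To make the last two descriptions concrete, here is a minimal sketch of the general idea: normalize the text of an HTML page, collect its META elements, and report which index terms appear in the text. It is written in Python with the standard library only and is not the Webutils Perl code or its API. The names `PageParser`, `parse_page`, and `terms_in_page` are hypothetical, and the normalization rules used here (skipping script and style content, collapsing whitespace, lowercasing) are assumptions, since this page does not say what Normalizer actually does.

```python
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Hypothetical helper: gathers visible text and META name/content pairs."""

    SKIPPED = {"script", "style"}  # assumption: their content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.meta = {}
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta":
            fields = dict(attrs)
            name, content = fields.get("name"), fields.get("content")
            if name and content is not None:
                # META names are compared case-insensitively here (assumption).
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def parse_page(html):
    """Return (normalized_text, meta_dict) for one HTML document."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    # Assumed normalization: join text chunks, collapse whitespace, lowercase.
    text = re.sub(r"\s+", " ", " ".join(parser.text_parts)).strip().lower()
    return text, parser.meta


def terms_in_page(text, index_terms):
    """Return the index terms that occur as whole words in the normalized text."""
    found = []
    for term in index_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text):
            found.append(term)
    return found


if __name__ == "__main__":
    page = (
        '<html><head><meta name="Keywords" content="soil, erosion"></head>'
        "<body><p>Erosion rates by county.</p></body></html>"
    )
    text, meta = parse_page(page)
    print(meta)
    print(terms_in_page(text, ["erosion", "drainage"]))
```

The term matching runs on the normalized text rather than on the raw HTML, which mirrors the stated dependency of MetaExtor on Normalizer: markup and inconsistent whitespace are removed before any term is looked up.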