Web harvesting

The first step in creating a topicmap is getting the web pages into a normalized form. The topicmap generation application requires each page to have its markup removed and its text normalized. A Perl script included in the distribution will harvest web pages and normalize them as the topicmap generation application requires. In addition to normalizing the pages, the script will extract metadata from pages where it is present. The metadata can be loaded directly into the database, without further processing.
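The sketch below illustrates this extract step: strip the markup, collapse whitespace, and pull name/content pairs from any <meta> elements. It is a minimal sketch only, assuming the HTML::TreeBuilder module from the HTML-Tree distribution; the shipped script may rely on other modules.

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    sub extract_page {
        my ($html) = @_;
        my $tree = HTML::TreeBuilder->new_from_content($html);

        # Metadata: name/content pairs from <meta> elements, where present.
        my %meta;
        for my $m ($tree->find_by_tag_name('meta')) {
            my ($name, $content) = ($m->attr('name'), $m->attr('content'));
            $meta{$name} = $content if defined $name && defined $content;
        }

        # Normalized text: markup removed, runs of whitespace collapsed.
        my $text = $tree->as_text;
        $text =~ s/\s+/ /g;
        $text =~ s/^\s+|\s+$//g;

        $tree->delete;    # free the parse tree
        return ($text, \%meta);
    }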

The command-line parameters for bin/harvest.pl are listed below. The script fetches the supplied page and then continues to fetch linked pages. As each page is fetched, its text is normalized and its metadata is extracted; normalized text and metadata accumulate in separate RDF files. The script will only fetch pages that have the same authority as the first URI fetched, where a URI's authority is its domain name plus the port number, if one is present.
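As an illustration of the authority test, the following sketch compares two URIs using the core URI module (an assumption; the harvester's own check may be written differently):

    use strict;
    use warnings;
    use URI;

    my $start = URI->new('http://www.example.org:8080/index.html');
    my $link  = URI->new('http://www.example.org:8080/docs/page.html');

    # authority() returns the host plus the port when one is given,
    # e.g. "www.example.org:8080".
    print "same authority, safe to fetch\n"
        if $link->authority eq $start->authority;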

All of the modules from Webutils must be installed before the harvester can be run.

[ bin/harvest.pl ]

Parameter  Required  Description
-u         YES       Initial URI to be fetched.
-d         NO        Depth to which the harvester will traverse the link
                     tree. 1 will fetch just the supplied URI; 0, the
                     default, means no limit.
-h         NO        How fetched pages will be handled: "download" to save
                     pages, or "extract" to extract metadata and normalize
                     HTML (default).
-l         NO        Directory into which downloaded pages will be put.
                     (default: ./mirror)
-m         NO        File into which metadata will be put, in RDF/XML
                     format. (default: Metadata.rdf)
-n         NO        File into which normalized HTML will be put, in
                     RDF/XML format. (default: Normalized_HTML.rdf)
-p         NO        How long to pause between page fetches, in seconds.
                     (default: 30)
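
For example, a hypothetical invocation that fetches http://www.example.org/, follows links two levels deep, and pauses 10 seconds between fetches:

    perl bin/harvest.pl -u http://www.example.org/ -d 2 -p 10

Since -m and -n are omitted, the metadata and normalized text accumulate in the default Metadata.rdf and Normalized_HTML.rdf files.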
