The RDF Topicmaps project explores subject navigation of Web sites using semi-automatically generated finding aids.
Many institutions struggle with their official Web sites: the contents change constantly, and the editors cannot exercise sufficient control. One result is that an institution's major presence on the Web is difficult to navigate.
The goals of the project include:
- representing subject/topic information obtained from Web sites
- investigating the value of hypothetical metadata-based navigation for a collection of related Web sites
- developing and evaluating the utility of Open Source prototyping tools based on RDF.
RDF topicmaps are created by:
- normalizing input data (e.g., Web page files) by removing HTML tags and other markup or script code
- extracting common terms from the text
- analyzing relationships among extracted terms
- filtering terms and relationships contextually
- structuring filtered terms for representation in an RDF graph.
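The steps listed above can be illustrated with a minimal sketch. It is not the project's software: the stopword list, the frequency thresholds, the use of same-sentence co-occurrence as the "relationship", and the `http://example.org/` URIs are all assumptions made for illustration, and simple word counting stands in for the real term extraction.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "and", "of", "a", "in", "to", "for", "is", "on", "with"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def normalize(html):
    """Step 1: strip tags and code, collapse whitespace, lowercase."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split()).lower()


def extract_terms(text, min_count=2):
    """Step 2: keep non-stopword words that occur at least min_count times."""
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOPWORDS]
    counts = Counter(words)
    return {w: c for w, c in counts.items() if c >= min_count}


def cooccurrences(text, terms):
    """Step 3: count pairs of extracted terms that share a sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]", text):
        present = sorted({w for w in re.findall(r"[a-z]+", sentence) if w in terms})
        pairs.update(combinations(present, 2))
    return pairs


def to_triples(pairs, min_pair_count=2, base="http://example.org/term/"):
    """Steps 4 and 5: filter weak pairs, then emit (subject, predicate, object)."""
    return [
        (base + a, "http://example.org/rel/relatedTo", base + b)
        for (a, b), n in sorted(pairs.items())
        if n >= min_pair_count
    ]
```

Chaining the functions (`normalize`, then `extract_terms`, `cooccurrences`, and `to_triples`) yields a list of triples that could be serialized as an RDF graph by any RDF toolkit.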
Some topics are generally unique to specific sites, while others are common to multiple sites. Dr. Godby gathered subject/topic metadata from Web sites by examining:
- HTML keywords
- subject lines in email messages
- an index of library/information science terms (relevant for the sites used in her demonstration)
- terms extracted automatically from text using natural-language-processing algorithms.
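As an illustration of the first source in the list above, the following sketch reads the `keywords` meta element from an HTML page using only the Python standard library. The class and function names are hypothetical, and the sketch assumes keywords are comma-separated, which is common practice but not stated in the project description.

```python
from html.parser import HTMLParser


class KeywordParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values do not, so compare case-insensitively.
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def html_keywords(html):
    """Return the page's declared keywords, lowercased, in document order."""
    parser = KeywordParser()
    parser.feed(html)
    parser.close()
    return parser.keywords
```

Pages that declare no keywords simply yield an empty list, which is one reason to combine this source with the others listed.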
Some of the term relationships found on the sites were identified as:
- coordination (e.g., library and information science: library science, information science)
- related (e.g., library: library classification scheme, library automation).
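The two relationship types can be made concrete with a small sketch. The project description does not say how these relationships were detected, so the string heuristics below are assumptions chosen only to reproduce the two examples given: a coordinated phrase is expanded by sharing its final word, and related terms are those that begin with a given term.

```python
def expand_coordination(phrase):
    """Split 'X and Y Z' into ['X Z', 'Y Z'], assuming the last word is the shared head."""
    words = phrase.lower().split()
    if "and" not in words:
        return []
    i = words.index("and")
    left, right = words[:i], words[i + 1:]
    if not left or len(right) < 2:
        return []
    head = right[-1]  # shared head noun (an assumption of this heuristic)
    return [" ".join(left + [head]), " ".join(right)]


def related_terms(term, vocabulary):
    """Return vocabulary entries that start with the given term followed by a space."""
    prefix = term.lower() + " "
    return sorted(t for t in vocabulary if t.lower().startswith(prefix))
```

A heuristic this simple will over-generate on phrases such as "research and development", which is part of why contextual filtering matters.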
The subject/topic extraction software is embedded in a library of Open Source code that:
- harvests Web pages from a list of URLs supplied by the user
- extracts simple metadata and encodes it in RDF
- normalizes the text for the NLP component
- creates a MySQL database of RDF relationships
- makes the results available to the user through XSL stylesheets.
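To show what storing RDF relationships in a relational database can look like, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The single `triples` table, the function names, and the choice of Dublin Core `title` and `subject` properties are assumptions for illustration, not the project's actual schema or vocabulary.

```python
import sqlite3

# Dublin Core element URIs, assumed here as the "simple metadata" vocabulary.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"


def page_triples(url, title, subjects):
    """Encode one harvested page as (subject, predicate, object) triples."""
    triples = [(url, DC_TITLE, title)]
    triples += [(url, DC_SUBJECT, s) for s in subjects]
    return triples


def store(conn, triples):
    """Create the triples table if needed and insert triples, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        "subject TEXT, predicate TEXT, object TEXT, "
        "UNIQUE (subject, predicate, object))"
    )
    conn.executemany("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()


def pages_about(conn, subject_term):
    """Return URLs of pages whose subject metadata includes the given term."""
    rows = conn.execute(
        "SELECT subject FROM triples WHERE predicate = ? AND object = ? "
        "ORDER BY subject",
        (DC_SUBJECT, subject_term),
    )
    return [r[0] for r in rows]
```

With a connection from `sqlite3.connect(":memory:")`, calling `store` for each harvested page and then `pages_about(conn, "library automation")` returns the pages sharing that topic, which is the kind of lookup a subject-navigation interface would need.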
Open issues include:
- RDF knowledge in the user interface
- whether to encode in RDF or XML
- the construction of knowledge ontologies.
Based on project results thus far, it appears that the enterprise succeeds or fails on the strength of the knowledge ontology. Sophisticated user interface design is required to exploit all of the encoded information.