Automatic Classification Research at OCLC
OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified.
Accordingly, OCLC has developed products, sponsored research projects, and encouraged participation in international standards communities. The outcomes have been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.
Our research explores the following questions:
- Can standard library classification schemes such as the Dewey Decimal Classification and the Library of Congress Classification be adapted to classify materials automatically, especially Web resources and other digitized electronic documents? Is there a role for indexes and topic maps that are obtained directly from source documents?
- What improvements to automatic classification systems can be made to get as close to human performance as possible? How useful is the result? Can the results be used in subject browsing and searching or the creation of minimal metadata records? Should an automatic classifier be included in toolkits for Webmasters or other human-mediated processes?
In an article in Scientific American, Tim Berners-Lee argues that the current Web will be transformed into the more intelligent Semantic Web when it is augmented with data for automated processing. For the Semantic Web to work, we must have access to structured collections of information, as well as appropriately annotated Web pages. Since some of the goals of our work are consistent with this vision, we are assessing the utility of the Resource Description Framework (RDF), a building block of the Semantic Web, in making Web documents more accessible by subject.
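As a rough illustration of what "accessible by subject" could mean in RDF terms, the sketch below records a class number for each Web document as a Dublin Core subject statement, writes the statements out as N-Triples, and groups documents by class number for browsing. It is a minimal sketch under stated assumptions, not a description of OCLC's software: the document URLs, the class numbers, and the function names are all hypothetical, and only the Dublin Core `subject` property URI is a real identifier.

```python
from collections import defaultdict

# Real Dublin Core element URI; everything else below is illustrative.
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(statements):
    """Serialize (document URL, class number) pairs as N-Triples lines."""
    lines = []
    for doc_url, class_number in statements:
        literal = escape_literal(class_number)
        lines.append(f'<{doc_url}> <{DC_SUBJECT}> "{literal}" .')
    return "\n".join(lines)


def browse_by_subject(statements):
    """Group document URLs under the class number assigned to them."""
    index = defaultdict(list)
    for doc_url, class_number in statements:
        index[class_number].append(doc_url)
    return dict(index)


# Hypothetical documents and hypothetical class numbers (assumptions).
statements = [
    ("http://example.org/doc1.html", "624.2"),
    ("http://example.org/doc2.html", "624.193"),
    ("http://example.org/doc3.html", "624.2"),
]

if __name__ == "__main__":
    print(to_ntriples(statements))
    for class_number, docs in sorted(browse_by_subject(statements).items()):
        print(class_number, docs)
```

In this sketch the class number is stored as a plain literal for simplicity. A fuller treatment might instead point to a resource that represents the class itself, so that captions and broader or narrower classes could be attached to it.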
We draw on our background in information retrieval, Web metadata creation, and natural language processing, and on our intellectual involvement in two of the world's most widely used library classification schemes, to address our research questions.
Tim Berners-Lee, James Hendler, and Ora Lassila. 2001. "The Semantic Web - Computers navigating tomorrow's Web will understand more of what's going on -- making it more likely that you'll get what you really want." Scientific American 284,5 (May): 34-43.
Project papers and presentations
Jean Godby and Devon Smith. 2002. "Strategies for Subject Navigation Using RDF Topic Maps." Presented at the Knowledge Technologies 2002 Conference. Seattle, Washington, March 2002.
Jean Godby and Jay Stuler. 2001. "The Library of Congress Classification as a knowledge base for automatic subject categorization." Presented at the IFLA Preconference, "Subject Retrieval in a Networked Environment," Dublin, Ohio, August 2001.
Jean Godby. 2001. "Terminology identification from full text: OCLC's WordSmith experience." Presentation at The Southern Ohio Chapter of the American Society for Information Science & Technology (SO-ASIST) meeting, "Aboutness: Automated Indexing & Categorization," Lexis-Nexis, Miamisburg, Ohio, June 21, 2001.
Jean Godby. 2001. "The automatic encoding of lexical knowledge in RDF topicmaps." Presentation at the Knowledge Technologies 2001 Conference, Austin, Texas. March 6, 2001.
Jean Godby and Ray Reighart. 2001. "Terminology identification in a collection of Web resources." The Journal of Internet Cataloging 4,1/2: 49-65. Also published in CORC: New Tools and Possibilities for Cooperative Electronic Resource Description, edited by Karen Calhoun and John J. Riemer. Haworth Information Press: 49-65.
Jean Godby and Diane Vizine-Goetz. 2000. "ISKO participants discuss ways librarianship can improve responsiveness of the Web." OCLC Newsletter 247 (September/October): 22-25.
Anders Ardö, Jean Godby, Andrew Houghton, Traugott Koch, Ray Reighart, Roger Thompson and Diane Vizine-Goetz. 2000. "Browsing Engineering Resources on the Web." In Beghtol, L., Howarth, L. and Williamson, N. (editors), Dynamism and Stability in Knowledge Organization: Proceedings of the Sixth ISKO Conference, 10-13 July, 2000: 385-390.
Jean Godby, Eric Miller and Ray Reighart. 2000. "Automatically Generated Topic Maps of World Wide Web Resources." Presentation at the Ninth International World Wide Web Conference, Developers' Day Session on the Semantic Web, May 15, 2000.
The Library of Congress Classification as a knowledge base for automatic subject assignment. On the Scorpion demo page, choose the database entitled "Schedules QRST with filtered WorldCat and LCSA Headings." It contains adaptations of the schedules for Science, Medicine, Technology and Agriculture. For inquiries, contact Jean Godby.
Links to related OCLC projects
- Dewey Services
- The Scorpion Project
- The WordSmith Project
Links to related external sites
- The Dublin Core Home Page
- The Electronic Engineering Library, Sweden
- The Library of Congress Home Page
- The Resource Description Framework
- The Semantic Web