DSpace Harvesting
Overview
In the DSpace harvesting project, OCLC Research will periodically harvest OAI-compliant metadata from the institutional repositories of interested DSpace users. OCLC Research will convert the harvested metadata into a format suitable for re-harvesting by non-OAI services and popular search engines.
Specific tasks involved in this process include:
- harvesting the DSpace metadata using OAI-PMH
- resolving DSpace handles so that originating institutions can be identified
- making the resulting URLs harvestable by search services such as Google
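The first and last of these tasks can be illustrated with a minimal sketch. It is not the project's actual code. It assumes records are requested with the OAI-PMH ListRecords verb in the oai_dc format, that each record carries its item URL in a dc:identifier element, and that a plain HTML page of links is an acceptable form for search engine crawlers. The function names, the sample record, and the repo.example.edu address are hypothetical.

```python
from html import escape
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        title = record.findtext(".//" + DC_NS + "title", default="(untitled)")
        # Assumption: the item URL is the first dc:identifier that looks like a URL.
        urls = [
            e.text.strip()
            for e in record.iter(DC_NS + "identifier")
            if e.text and e.text.strip().startswith("http")
        ]
        if urls:
            records.append({"title": title.strip(), "url": urls[0]})
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, (token.strip() if token and token.strip() else None)


def links_page(records):
    """Render harvested records as a simple HTML list of links for crawlers."""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(
            escape(r["url"], quote=True), escape(r["title"])
        )
        for r in records
    )
    return "<html><body><ul>\n" + items + "\n</ul></body></html>"


# Hypothetical response fragment used in place of a live repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.edu:1234/56</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
          <dc:identifier>http://hdl.handle.net/1234/56</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://repo.example.edu/oai/request"))
    records, token = parse_records(SAMPLE)
    print(links_page(records))
```

A real harvester would fetch each URL over HTTP and keep requesting pages until the response no longer carries a resumption token. The sketch parses a canned response so that it runs without network access.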
Background
Web harvesters miss much of the scholarly material on the Web, including repository metadata that complies with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). DSpace is one such OAI-PMH-compliant repository system.
Search engines such as Google and Yahoo have several problems harvesting OAI repositories, which differ from standard Web pages:
- The standard implementation of DSpace uses the Handle system (www.handle.net) to identify items. Handles purposely mask the identity of the host, making harvesting difficult to schedule.
- Some of the URLs involved in harvesting OAI-PMH sites are not persistent, which interferes with standard harvesting approaches.
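The handle problem can be made concrete with a short sketch. It assumes handles are published as proxy URLs of the form http://hdl.handle.net/prefix/suffix, and that the prefix (the naming authority) typically corresponds to one DSpace installation, so grouping by prefix gives a rough per-institution schedule. The function names and sample handles are hypothetical. resolve_host needs network access and relies on the proxy redirecting to the hosting repository.

```python
from collections import defaultdict
from urllib.parse import urlparse
from urllib.request import urlopen


def split_handle(handle_url):
    """Split a Handle proxy URL into (prefix, suffix)."""
    path = urlparse(handle_url).path.lstrip("/")
    prefix, sep, suffix = path.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("not a handle URL: " + handle_url)
    return prefix, suffix


def group_by_prefix(handle_urls):
    """Group handle URLs by prefix so each group can be scheduled together."""
    groups = defaultdict(list)
    for url in handle_urls:
        prefix, _suffix = split_handle(url)
        groups[prefix].append(url)
    return dict(groups)


def resolve_host(handle_url, timeout=10):
    """Follow the proxy's redirects and return the final host name.

    Requires network access; not called in the offline demo below.
    """
    with urlopen(handle_url, timeout=timeout) as response:
        return urlparse(response.geturl()).netloc


if __name__ == "__main__":
    # Made-up handles for illustration only.
    sample = [
        "http://hdl.handle.net/1234.5/10",
        "http://hdl.handle.net/1234.5/11",
        "http://hdl.handle.net/9876/2",
    ]
    for prefix, urls in group_by_prefix(sample).items():
        print(prefix, len(urls))
```

Resolving one handle per prefix would be enough to learn the host for the whole group under the assumption above, which keeps the number of lookups small.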
Importance and impact of this work
University libraries are becoming aware of the need for comprehensive management of their universities' digital assets, including the digitized resources and output of their faculties. Libraries have special expertise and experience to lead or join the effort to manage these resources.
DSpace is a leading open-source system for managing institutional repositories. By contributing to its development and making the resources it manages easier to access, OCLC Research applies the library community's investment in knowledge organization, information storage and retrieval, and metadata standards, together with its own technical expertise, in support of a system that is efficient, full-featured, and freely available to the entire community.
Related OCLC Research Projects
About DSpace
DSpace is a university repository system designed to capture, store, index, distribute and archive the intellectual output created in digital form by faculty and researchers.
A joint project of MIT Libraries and the Hewlett-Packard Company, the system is now freely available as open source software. The system provides a flexible storage and retrieval architecture that can be adapted to a range of data formats and research disciplines. Each research community submits items to DSpace through a customized portal that matches its practices.
DSpace supports OAI-PMH. OAI support was implemented using OCLC's OAICat open source software, which makes DSpace item records available for harvesting by other OAI-compliant harvesters.
Related links
- DSpace Federation
http://www.dspace.org/
- Open Archives Initiative
http://www.openarchives.org/
OCLC Research team
- Thom Hickey (Lead)
- Jeff Young