OCLC RESEARCH

Get a grip...on identifiers

As the Web has become ubiquitous, the need for persistent identifiers has exploded

BY STUART L. WEIBEL, Senior Research Scientist, OCLC Research


The topic of persistent identifiers is at once familiar and perplexing, seemingly simple and yet, on close examination, confounding and contentious in many respects. Libraries have dealt with a variety of such identifiers for many years—ISBNs, ISSNs and OCLC numbers are all commonly understood and widely used. The Internet brings us many more, and just as with the familiar ones, the new ones have their own special characteristics.

Identifiers are becoming a fundamental component of the digital infrastructure that pervades our lives. Think of an identifier as a handle: a convenient handhold for every information asset, person, invoice, object and even concept. Our computers have them, we have them, our cars have them, and increasingly our dogs, cats and even cows have them.

One of the challenges confronting us in the digital information services realm is to better understand our market and how people want it to change, and then to deliver the new services that follow. Identifiers will be one of the key components of the infrastructure necessary to accomplish this.

Important characteristics of identifiers

In the context of the Internet, the term identifier is often found in association with the phrase globally unique, persistent identifiers. The Web is itself built on globally unique identifiers—URLs (Uniform Resource Locators). The globally unique part is obvious and straightforward. It is the great virtue of the Web that files can be flexibly located in what amounts to a global file system whose naming elements are unique by virtue of the hierarchical structure of the Domain Name System at its upper levels, and the natural prohibition against duplicate file names at the local file system level.
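As a rough illustration of the two naming layers described above, the sketch below splits a URL into its Domain Name System hierarchy and its local file path using Python's standard urllib.parse module. The function name name_parts and the example.org address are assumptions made for illustration only; they do not come from any OCLC system.

```python
from urllib.parse import urlsplit


def name_parts(url):
    """Split a URL into its two naming layers.

    Returns the DNS labels ordered from the top of the hierarchy down,
    followed by the path that the host's own file system keeps unique.
    """
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return list(reversed(labels)), parts.path


# Hypothetical address used only for illustration.
hierarchy, local_path = name_parts("http://www.example.org/reports/2004/ids.html")
print(hierarchy)   # ['org', 'example', 'www']
print(local_path)  # /reports/2004/ids.html
```

Uniqueness comes from the pair: no two hosts share the same full domain name, and no two files on one host share the same path.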

Persistence, however, is another matter. The pace of technological change makes persistence challenging to achieve. The number 404 has attained a prominent place in the jargon of popular technology because it appears so frequently as the error code for “page not found.” The locator part of a URL is rather too fragile when we want our resources to be accessible for years and decades, not days and weeks.

Does persistence mean forever? Not necessarily. The FedEx delivery identifier need have a lifetime measured only in days. An identifier for a managed information asset, however, is likely to have a useful life measured in centuries. Governments, libraries and museums in particular are expected to preserve and manage information assets that are part of the fabric of the societies we live in, and are the natural homes for efforts to organize, preserve and provide access to our cultural artifacts and memories.

As the Web has become ubiquitous, the need for persistent identifiers has exploded, even to the extent of entering the public consciousness. Keeping track of the burgeoning bumper crop of identifiers is not easy. In the public information arena, OCLC and libraries will play a central role in this challenge. For this, we will increasingly rely on registries and directories of various types to assign, track, maintain and resolve the identifiers that are embedded in our systems.

Staff in Research and elsewhere in OCLC are involved in a variety of activities in the identifier arena:

  • OCLC has a proposal pending with ISO to develop, market and maintain the registry for the International Standard Text Code (ISTC), a globally unique permanent identifier intended to assist in the management of text assets, both digital and print. The ISTC will facilitate exchange of information between collecting societies and rights administrators, authors, agents, publishers, retailers, librarians and other interested parties.

  • OCLC’s PURL service arose as a result of our involvement in the long-laboring Uniform Resource Name (URN) working group in the Internet Engineering Task Force. PURLs were developed as a demonstration that simple, off-the-shelf technology could be brought to bear on the problem of maintaining identifier alignment as actual file system locations changed. PURLs continue to be a popular low-barrier technology for organizations managing namespaces without incurring cost or overhead.

  • OpenURLs were developed to address the so-called appropriate copy problem. In a diverse, heterogeneous information environment, a user needs to be directed to a copy of an information asset that he or she is authorized to access. OpenURLs depend on a consistent identification architecture that is independent of resolution. OCLC Research staff have been involved in prototyping registration and management infrastructure for OpenURLs.

  • The ‘info’ URI Internet draft provides a missing bit of Internet infrastructure that supports the separation of identity from resolution. The development of ‘info’ URIs was motivated by the need for an identifier architecture in OpenURLs, but has broader applicability as well.

  • ERRoLs are constructed, dynamic URLs that resolve to metadata, content and services related to items stored in a community of OAI repositories. ERRoLs are constructed by concatenating the ERRoL prefix (e.g., http://errol.oclc.org/), an OAI identifier and a metadata prefix or service extension. ERRoL resolution is generally accomplished by dynamically performing OAI-PMH requests to the home repository and transforming the responses using XSLT style sheets or HTTP redirects.

  • The VIAF activity (Virtual International Authority File) is a joint project of Die Deutsche Bibliothek in Frankfurt, the Library of Congress and OCLC to improve interoperability across national authority files. Language variants, collation conventions and character set issues all contribute to what librarians understand as the fog of naming. Persistent identifiers will be an important aspect of reducing this problem.

  • Registration of Dublin Core terminology namespaces (and URI schemes associated with them) is an essential part of making metadata modular and of supporting the need to reference legacy terminologies from within new Internet standards such as the Dublin Core. It has become evident that the need is more general, and that the topic of terminology identifiers is salient to many related technologies and standards. Effective use of terminologies on the Web is a fundamental requirement for realizing the potential of the Semantic Web, and identifiers are a foundational component of this technology.
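Two of the mechanisms in the list above, PURL-style redirection and ERRoL construction, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of OCLC's implementations: the lookup table, the example.org addresses, the sample OAI identifier, the use of HTTP status 302 for the redirect, and the "." separator placed before the metadata prefix are all hypothetical choices made for the example. Only the ERRoL prefix http://errol.oclc.org/ is taken from the text.

```python
# Hypothetical PURL-style table: a stable name mapped to the current location.
PURL_TABLE = {
    "/docs/identifiers": "http://www.example.org/2004/reports/ids.html",
}


def resolve_purl(path, table=PURL_TABLE):
    """Return (status, location) the way a redirecting resolver might.

    The stable name never changes; only the table entry is updated
    when the underlying file moves.
    """
    target = table.get(path)
    if target is None:
        return 404, None   # unknown name: "page not found"
    return 302, target     # assumed redirect status for this sketch


ERROL_PREFIX = "http://errol.oclc.org/"


def build_errol(oai_identifier, extension, prefix=ERROL_PREFIX, separator="."):
    """Concatenate prefix, OAI identifier and metadata prefix or extension.

    The separator is an assumption; the article says only that the
    three parts are concatenated.
    """
    return f"{prefix}{oai_identifier}{separator}{extension}"


print(resolve_purl("/docs/identifiers"))

# The file moves: the table changes, the published name does not.
PURL_TABLE["/docs/identifiers"] = "http://archive.example.org/ids.html"
print(resolve_purl("/docs/identifiers"))

print(build_errol("oai:example.org:item-1", "oai_dc"))
```

The point of both sketches is the same separation the article stresses: the name a user cites stays fixed, while the machinery that turns it into a current location can change behind it.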

Persistent identifiers represent an important strategic interest for OCLC and its constituents, both as infrastructural elements that require thoughtful design and management, and as part of the changing business environment of libraries. As such, they play a key role in demonstrating one of the fundamental value propositions of libraries: commitment to long-term access to the information assets of society.

