RedLightGreen
What happens when you take a massive database of
bibliographic descriptions and redesign it for the Web, not just as a
resource for librarians, but as a tool for students and the public at
large? That was the project that produced RLG’s RedLightGreen℠.
The RedLightGreen service ended on November 1, 2006.
RedLightGreen helped undergraduates and other
researchers zero in on the most authoritative, useful sources of
information—with the kind of interface and usability expected
by Web-savvy students. After its debut in September 2003, the service
was recommended by more than 100 libraries, which linked to
RedLightGreen from their Web sites.
Under the hood
RedLightGreen℠
was built on an IBM DB2 database containing XML data. XML offered
several advantages: it was an adaptable, extensible data format and was
widely accepted. "Once you have data in XML format you have lots of
flexibility in using that data internally or delivering it to outside
partners," says RLG dataloads specialist Joe Altimus.
However, extracting bibliographic data from the RLG
Union Catalog's mainframe databases to an XML format was "far
less straightforward than we expected," according to RLG software
development manager Judith Bush. Character set encoding was one
problem: data was stored in the mainframe database in EBCDIC format
and covered many European, Asian, and Middle Eastern scripts (including
Arabic, Hebrew, Cyrillic, Chinese, Japanese, and Korean). That data
needed to be converted to UTF-8, an encoding of the
Unicode™ character set suited to 8-bit environments and XML.
"What RLG learned was that it was a very complex process to write all
the rules needed to translate the Union Catalog EBCDIC data into UTF-8,
particularly for the Asian and Middle Eastern scripts," says Altimus.
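To make the conversion step concrete, here is a minimal sketch in Python covering only the simplest case. It assumes the input bytes use IBM code page 037, a standard Latin EBCDIC variant that Python ships as the `cp037` codec. The Union Catalog's actual encoding of Asian and Middle Eastern scripts is not described here and would have required the custom rules Altimus mentions. The function name `ebcdic_to_utf8` is hypothetical.

```python
def ebcdic_to_utf8(raw: bytes, codec: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to Unicode text, then re-encode as UTF-8.

    Assumption: `codec` names a standard EBCDIC code page known to Python.
    With errors="strict", any byte the chosen code page leaves undefined
    raises UnicodeDecodeError instead of being silently replaced.
    """
    text = raw.decode(codec, errors="strict")
    return text.encode("utf-8")


# Build sample EBCDIC input by encoding a known string, so the example
# does not depend on hand-typed byte values.
sample = "Café Müller".encode("cp037")
converted = ebcdic_to_utf8(sample)
print(converted.decode("utf-8"))
```

Going through Unicode text as the intermediate form keeps the two concerns separate: the decode step carries all the source-specific rules, while the UTF-8 encode step is the same for every script.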
Also challenging was the process of defining an
XML format (specified in a document type definition, or DTD) that supported
the full range of features needed by RedLightGreen. RLG started with
the Library of Congress MARC XML format. However, this format couldn't
effectively validate many of the records stored in the mainframe
database, because RLG's database added more than 40 custom fields to
those defined by the MARC 21 standard. In adding these fields to the
DTD, RLG learned that some mainframe database element names could not
be used because they violated XML rules for element names.
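The naming problem can be illustrated with a small check. The sketch below tests candidate element names against a simplified, ASCII-only version of the XML name rule (a name must start with a letter or underscore; later characters may also be digits, hyphens, or periods) and shows one possible repair. The `f` prefix, the helper names, and the sample names are illustrative assumptions, not RLG's actual mapping.

```python
import re

# Simplified ASCII-only approximation of the XML 1.0 Name production.
# The real rule also allows many non-ASCII letters, plus the colon,
# which is best left to namespace prefixes.
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_valid_element_name(name: str) -> bool:
    """Return True if `name` satisfies the simplified name rule."""
    return _XML_NAME.fullmatch(name) is not None


def to_element_name(name: str, prefix: str = "f") -> str:
    """Return `name` unchanged if valid, else a repaired version.

    The repair scheme (underscores plus a prefix) is hypothetical.
    """
    if is_valid_element_name(name):
        return name
    # Replace characters that are never allowed with underscores...
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    # ...then add a prefix if the first character still cannot start a name.
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned):
        cleaned = prefix + cleaned
    return cleaned


for candidate in ["title", "245", "call number", "$a"]:
    print(candidate, "->", to_element_name(candidate))
```

A purely numeric name such as a three-digit field tag fails the rule because an XML name cannot begin with a digit, which is why the sketch adds a prefix in that case.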
Eventually, RLG developed its own XML DTD for MARC
records—an iterative process that required modifying the
initial Library of Congress DTD, testing it on sample data, and then
modifying it again. The resulting DTD (version 16 of RLG's XML format for
MARC records) was somewhat "looser" than the Library of Congress's,
according to Altimus, but worked more effectively with the data actually
contained within the RLG Union Catalog.
By representing subfields with 2,000 fewer elements,
RLG's DTD was also about 20 percent of the size of the original Library of
Congress DTD, so data conversion and XML validation went
correspondingly faster. (Because RLG had already done MARC validation
of the field tags, indicators, and subfields in its mainframe databases,
it wasn't necessary to repeat this validation when migrating the data to
XML. Therefore, the LC MARC subfields could be removed without presenting
problems. Had data later entered the RedLightGreen database from a
source that was not prevalidated against MARC, unlike the RLG Union
Catalog, RLG would have had to develop a more rigorous MARC validation
system.)
While the RLG Union Catalog was unusually large and
heterogeneous, many other library catalogs may face similar issues when
migrating from legacy database systems to XML—particularly if
their catalog's records haven't been validated against the MARC
standard as they were entered. Indeed, the Library of Congress has
published a simplified XML schema and a number of migration
tools to assist libraries with solving such migration problems.
RLG needed a way to organize the vast number of records
in the RLG Union Catalog without overwhelming users with a deluge of
information about different editions. RLG turned to the Functional
Requirements for Bibliographic Records (FRBR), an emerging model
proposed by the International Federation of Library Associations and
Institutions. FRBR distinguishes between a work, its expressions (e.g.,
translations), manifestations of those expressions (specific editions),
and items (specific copies). The RedLightGreen database collapsed
FRBR's four levels into just two, displaying a work and various
manifestations of that work. This approach reduced a potentially
overwhelming number of editions to a smaller, more manageable set of
works that matched a user's search terms.
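The two-level display can be sketched as a grouping step. The example below is a minimal illustration, assuming each record carries an author, a title, and an edition statement, and that records sharing a normalized author and title belong to the same work. The `Record` type, the `work_key` rule, and the sample data are all hypothetical; how RedLightGreen actually clustered records is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    author: str
    title: str
    edition: str  # stands in for the manifestation-level details


def _norm(value: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.casefold().split())


def work_key(record: Record) -> tuple:
    """Crude work identifier (an assumption): normalized author and title."""
    return (_norm(record.author), _norm(record.title))


def group_by_work(records) -> dict:
    """Map each work key to the list of manifestations that share it."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record)
    return dict(works)


# Made-up sample records: two editions of one work, one edition of another.
records = [
    Record("Austen, Jane", "Pride and Prejudice", "edition A"),
    Record("Austen, Jane", "Pride and  prejudice", "edition B"),
    Record("Melville, Herman", "Moby-Dick", "edition C"),
]

works = group_by_work(records)
for (author, title), manifestations in works.items():
    print(f"{title} / {author}: {len(manifestations)} manifestation(s)")
```

A search interface built on this shape would list one line per work and reveal the individual editions only when a user asks for them.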
Timeline
1980: The RLG Union Catalog
is launched. It is a unique shared cataloging database, used primarily
by librarians and scholars within academic or research institutions. To
access it, subscribing institutions use RLG's Web-based Eureka®
interface, telnet connections to RLIN®, and a Z39.50 gateway.
2001: A grant from The
Andrew W. Mellon Foundation enables RLG to begin envisioning what form
the RLG Union Catalog might take if it were a Web application aimed at
undergraduates and nonspecialist researchers. Brainstorming, research,
and design work by RLG staff and outside consultants mark the early
phases of the project, then known as the "Union Catalog on the Web."
2002: Nicknamed
"RedLightGreen" by RLG staffers, the project gets a functional
specification, a database design, and a customized XML expression of
the MARC data structure used in the RLG Union Catalog. RLG and outside
consultants design "wireframes" (mockups) of the Web application, and
conduct two rounds of user tests with students from Stanford, San Jose
State, and Santa Clara Universities. RLG also contracts with Recommind
to use that company's MindServer software to facilitate keyword
searching, enhance relevance ranking for search results, and provide
the ability to expand searches to related categories.
2003: RLG extracts 4 million records
from the RLG Union Catalog, a small subset of its more than 45 million
records, and loads them into the RedLightGreen database in order to
test MindServer's performance, XML conversion, and other technical
functions. RLG also builds the Web application that will be
RedLightGreen's public face. In September, the pilot deployments using
the full RLG Union Catalog database begin at Swarthmore College, New
York University, and Columbia University.
2004: Princeton
University joins the pilot partners. RLG conducts usability studies
with students at partner campuses; the service receives high
marks for ease of use. RedLightGreen officially moves out of
pilot phase, and dozens of libraries link to RedLightGreen from their
Web sites.
2006: The RedLightGreen service ends.