What happens when you take a massive database of bibliographic descriptions and redesign it for the Web, not just as a resource for librarians, but as a tool for students and the public at large? That was the project that produced RLG’s RedLightGreen℠. The RedLightGreen service ended on November 1, 2006.
RedLightGreen helped undergraduates and other researchers zero in on the most authoritative, useful sources of information, with the kind of interface and usability expected by Web-savvy students. After its debut in September 2003, the service was recommended by more than 100 libraries, which linked to RedLightGreen from their Web sites.
Under the hood
RedLightGreen℠ was built on an IBM DB2 database containing XML data. XML offered many advantages: it was an adaptable, extensible, and widely accepted data format. "Once you have data in XML format you have lots of flexibility in using that data internally or delivering it to outside partners," says RLG dataloads specialist Joe Altimus.
However, extracting bibliographic data from the RLG Union Catalog's current mainframe databases to an XML format was "far less straightforward than we expected," according to RLG software development manager Judith Bush. Character set encoding was one problem: data was stored in the mainframe database in EBCDIC and covered many European, Asian, and Middle Eastern scripts (including Arabic, Hebrew, Cyrillic, Chinese, Japanese, and Korean). That data needed to be converted to UTF-8, an encoding of the Unicode™ character set suited to 8-bit environments and XML. "What RLG learned was that it was a very complex process to write all the rules needed to translate the Union Catalog EBCDIC data into UTF-8, particularly for the Asian and Middle Eastern scripts," says Altimus.
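The basic shape of such a conversion can be sketched in a few lines of Python. This is only an illustration, not RLG's converter: it assumes the standard `cp037` codec (EBCDIC for US/Canada Latin text), whereas the Union Catalog data used mappings for many non-Latin scripts that no single built-in codec covers, which is where the hand-written rules Altimus describes came in. The function name `ebcdic_to_utf8` is hypothetical.

```python
def ebcdic_to_utf8(raw: bytes, codec: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to Unicode text, then re-encode the text as UTF-8.

    Assumption: the input uses a code page Python ships with (cp037 here).
    Multi-script catalog data would need custom mapping tables instead.
    """
    text = raw.decode(codec)      # EBCDIC bytes -> Unicode string
    return text.encode("utf-8")   # Unicode string -> UTF-8 bytes


# Build a small EBCDIC sample, then convert it.
ebcdic_bytes = "Café".encode("cp037")
utf8_bytes = ebcdic_to_utf8(ebcdic_bytes)
print(utf8_bytes.decode("utf-8"))
```

The two-step design (decode to Unicode, then encode) keeps the source and target encodings independent, so only the decoding side has to change when a new script is added.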
Also challenging was the process of coming up with an XML format, defined in a document type definition (DTD), that supported the full range of features needed by RedLightGreen. RLG started with the Library of Congress MARC XML format. However, this format couldn't effectively validate many of the records stored in the mainframe database, because RLG's database added more than 40 custom fields to those defined by the MARC 21 standard. In adding these fields to the DTD, RLG learned that some mainframe database element names could not be used because they violated XML rules for element names.
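To make the element-name problem concrete, here is a minimal sketch of a name check. It is a simplification: it covers only the ASCII subset of the XML 1.0 name rules (a name must start with a letter, underscore, or colon, and may continue with letters, digits, hyphens, periods, underscores, or colons). The sample names are hypothetical and are not taken from RLG's database.

```python
import re

# ASCII-only approximation of the XML 1.0 "Name" production.
# The real rule also admits many non-ASCII letters, omitted here.
_ASCII_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")


def is_xml_name(candidate: str) -> bool:
    """Return True if candidate is a legal XML element name (ASCII subset)."""
    return _ASCII_XML_NAME.fullmatch(candidate) is not None


# Hypothetical legacy field names, for illustration only.
for name in ["245", "f245", "rec-type", "rec type"]:
    print(name, is_xml_name(name))
```

A purely numeric name such as `245` fails because XML names cannot begin with a digit, and a name containing a space fails because whitespace ends the name.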
Eventually, RLG developed its own XML DTD for MARC records—an iterative process that required modifying the initial Library of Congress DTD, testing it on sample data, and then modifying it again. The current DTD (version 16 of RLG's XML format for MARC records) is somewhat "looser" than the Library of Congress's, according to Altimus, but works more effectively with the data actually contained within the RLG Union Catalog.
By representing subfields with 2,000 fewer elements, RLG's DTD was also about 20 percent of the size of the original Library of Congress DTD, so data conversion and XML validation went correspondingly faster. (Because RLG had already done MARC validation of the field tags, indicators, and subfields in its mainframe databases, it wasn't necessary to repeat this validation when migrating the data to XML. Therefore, the LC MARC subfields could be removed without presenting problems. Had data at some point no longer entered the RedLightGreen database from a prevalidated MARC source such as the RLG Union Catalog, however, RLG would have had to develop a more rigorous MARC validation system.)
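One way a format can get by with far fewer element declarations is to use a single generic subfield element that carries its code as an attribute, rather than declaring a separate element for every field and subfield combination. The sketch below illustrates that idea only. The element and attribute names (`datafield`, `subfield`, `tag`, `ind1`, `ind2`, `code`) are borrowed from the Library of Congress MARCXML convention as an assumption; the source does not say what RLG's DTD actually looked like, and the helper `datafield` is hypothetical.

```python
import xml.etree.ElementTree as ET


def datafield(tag, ind1, ind2, subfields):
    """Build one MARC field using generic elements.

    The subfield code travels in an attribute, so the format needs just
    two element types here no matter how many codes the data uses.
    """
    field = ET.Element("datafield", {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        sub = ET.SubElement(field, "subfield", {"code": code})
        sub.text = value
    return field


# Illustrative title field; the values are made up for the example.
title = datafield("245", "1", "0",
                  [("a", "Moby Dick /"), ("c", "Herman Melville.")])
print(ET.tostring(title, encoding="unicode"))
```

The cost of this looseness is the one the text describes: the format no longer checks that a given subfield code is legal for a given field, so that check has to happen upstream.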
While the RLG Union Catalog was unusually large and heterogeneous, many other library catalogs may face similar issues when migrating from legacy database systems to XML—particularly if their catalog's records haven't been validated against the MARC standard as they were entered. Indeed, the Library of Congress has published a simplified XML schema and a number of migration tools to assist libraries with solving such migration problems.
RLG needed a way to organize the vast number of records in the RLG Union Catalog without overwhelming users with a deluge of information about different editions. RLG turned to the Functional Requirements for Bibliographic Records (FRBR), an emerging model proposed by the International Federation of Library Associations and Institutions. FRBR distinguished between a work, its expressions (e.g., translations), manifestations of those expressions (specific editions), and items (specific copies). The RedLightGreen database collapsed FRBR's four levels into just two, displaying a work and various manifestations of that work. This approach reduced a potentially overwhelming number of editions to a smaller, more manageable set of works that matched a user's search terms.
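The two-level arrangement can be pictured as grouping edition-level records under a shared work. The sketch below is a toy version under stated assumptions: the `Record` fields and the `work_key` rule (case-folded author plus title) are hypothetical, and the source does not describe how RedLightGreen actually decided that two records belonged to the same work.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One manifestation (a specific edition). Fields are illustrative."""
    author: str
    title: str
    publisher: str
    year: int


def work_key(rec: Record) -> tuple:
    """Hypothetical rule for deciding that two records share a work."""
    return (rec.author.casefold().strip(), rec.title.casefold().strip())


def group_by_work(records):
    """Collapse a flat list of editions into {work key: [manifestations]}."""
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec)].append(rec)
    return dict(works)


# Made-up sample data: two editions of one work, one edition of another.
records = [
    Record("Melville, Herman", "Moby Dick", "Harper", 1851),
    Record("Melville, Herman", "MOBY DICK", "Penguin", 1992),
    Record("Austen, Jane", "Emma", "Murray", 1815),
]
works = group_by_work(records)
for key, editions in works.items():
    print(key, len(editions))
```

A search interface built this way can list each work once and reveal its editions only when the user asks for them.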
1980: The RLG Union Catalog is launched. It is a unique shared cataloging database, used primarily by librarians and scholars within academic or research institutions. To access it, subscribing institutions use RLG's Web-based Eureka® interface, telnet connections to RLIN®, and a Z39.50 gateway.
2001: A grant from The Andrew W. Mellon Foundation enables RLG to begin envisioning what form the RLG Union Catalog might take if it were a Web application aimed at undergraduates and nonspecialist researchers. Brainstorming, research, and design work by RLG staff and outside consultants mark the early phases of the project, then known as the "Union Catalog on the Web."
2002: Nicknamed "RedLightGreen" by RLG staffers, the project gets a functional specification, a database design, and a customized XML expression of the MARC data structure used in the RLG Union Catalog. RLG and outside consultants design "wireframes" (mockups) of the Web application, and conduct two rounds of user tests with students from Stanford, San Jose State, and Santa Clara Universities. RLG also contracts with Recommind to use that company's MindServer software to facilitate keyword searching, enhance relevance ranking for search results, and provide the ability to expand searches to related categories.
2003: RLG extracts 4 million records from the RLG Union Catalog, a small subset of the over 45 million total records, and loads them into the RedLightGreen database in order to test MindServer's performance, XML conversion, and other technical functions. RLG also builds the Web application that will be RedLightGreen's public face. In September, the pilot deployments using the full RLG Union Catalog database begin at Swarthmore College, New York University, and Columbia University.
2004: Princeton University joins the pilot partners. RLG conducts usability studies with students at partner campuses; the service receives high marks for ease of use. RedLightGreen officially moves out of pilot phase, and dozens of libraries link to RedLightGreen from their Web sites.
2006: RedLightGreen service ends.