On-demand batch processing with the WorldCat Metadata API

The WorldCat Metadata API opens up new possibilities for institutions to contribute and manage their bibliographic data at scale. At the Dartmouth College Library, we have used the Bibliographic Resource and Holdings Resource operations of the API to create and update master records in WorldCat for the unique materials published through our Digital Library Program. Building on existing open-source components written for OCLC APIs, we developed a command-line utility in Ruby. This tool gives us direct control over uploading hundreds of records at a time and makes it convenient to repurpose the resultant metadata for use in a variety of systems. In this post, I will introduce our use case and describe how we've applied the API so far.

MODS is the primary descriptive metadata schema for our local digital collections, which comprise digitized and born-digital materials across a variety of format types, including maps, photographs, manuscripts, e-books, and films. The size of each collection may range from a handful of items to tens of thousands. We create item-level records in MODS XML through a combination of original cataloging and transformations from existing metadata sources, depending on the collection. Once created, these records are stored in a local metadata repository, but this is far from the end of the process. We subsequently transform metadata from MODS into other schemas to populate local and external systems, such as our catalog, our discovery platform, and our DOI registration agency. We knew that we would also benefit from having bibliographic and holdings information for these materials represented in WorldCat, so we sought an efficient method for sharing with OCLC the MARC metadata created during our workflows.
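
As one illustration of this pipeline, the sketch below batch-applies a MODS-to-MARCXML stylesheet with Nokogiri and aggregates the output into a single input file of the kind we later feed to the API tool. The stylesheet name and file paths are hypothetical placeholders; our production transformations are more involved.

# Sketch: apply a MODS-to-MARCXML XSLT stylesheet to every record in a
# directory and aggregate the results into one MARCXML collection.
# 'mods2marcxml.xsl' and the paths are hypothetical placeholders.
require 'nokogiri'

xslt = Nokogiri::XSLT(File.read('mods2marcxml.xsl'))
collection = Nokogiri::XML(
  '<collection xmlns="http://www.loc.gov/MARC21/slim"/>'
)

Dir.glob('mods/*.xml').each do |path|
  marc = xslt.transform(Nokogiri::XML(File.read(path)))
  collection.root << marc.root  # reparent each record into the collection
end

File.write('marc-batch.xml', collection.to_xml)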

A significant goal of this process was to support iterative metadata enhancement that could be propagated across multiple systems. We create metadata for Digital Library Program materials in a dual role as both cataloger and publisher, and we want to account for the ongoing possibility of making incremental updates that may affect a large number of records. MODS records constitute the master version of metadata for each resource, so any changes affecting a batch of records should originate in that environment, as opposed to within a MARC database. Therefore, for interaction with WorldCat, we needed a solution that would support updating records at scale as easily as it supports creating them.

The WorldCat Metadata API, which was introduced in 2013, offered a great opportunity to address this use case. It provides access to production-level WorldCat bibliographic and holdings data via HTTP, and it accepts and returns bibliographic data serialized as MARCXML. In addition, two essential components for implementing the API were already available as open-source Ruby libraries: a gem to manage authentication, developed by OCLC, and a wrapper around the various API endpoints, developed by Terry Reese. With these pieces of the puzzle in place, we began developing a tool that would provide a simple command-line interface for supplying a batch of input MARCXML to the API.
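
As a rough sketch of how these pieces fit together, the following single-record read goes through Terry Reese's wrapper gem, which in turn signs requests with OCLC's authentication library. The method and parameter names follow the gem's documentation as I understand it and should be treated as assumptions to verify against the gem itself; the credential values are placeholders.

# Sketch: a single-record read via the wc_metadata_api gem, which signs
# requests using OCLC's WSKey authentication library. Method and
# parameter names are assumptions based on the gem's documentation;
# all credential values below are placeholders.
require 'wc_metadata_api'

client = WC_METADATA_API::Client.new(
  :wskey        => 'YOUR_WSKEY',
  :secret       => 'YOUR_SECRET',
  :principalID  => 'YOUR_PRINCIPAL_ID',
  :principalDNS => 'YOUR_PRINCIPAL_DNS',
  :debug        => false
)

# Fetch one master record as MARCXML.
response = client.WorldCatGetBibRecord(
  :oclcNumber         => '22180390',
  :schema             => 'LibraryOfCongress',
  :holdingLibraryCode => 'YOUR_HOLDING_CODE',
  :instSymbol         => 'YOUR_SYMBOL'
)
puts response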

Provided as a Ruby executable, the tool offers three main technical features. The first is a straightforward usage pattern. Only two arguments are required at the command line:

  • the particular API operation requested and
  • the path to an input file containing either
    • one or more MARCXML records (when creating or updating master records) or
    • a plain-text list of OCLC record numbers (when reading a set of existing records).

$ dcl-wc-metadata-api create ~/Desktop/dcl-ruby/input/marc-batch-2017012513480113.xml
OCLC WorldCat Metadata API: Create operation
Created 525 records, 0 failed
Records written to wc-create-20170125142407.xml
Log written to wc-create-20170125142407-log.txt

$ dcl-wc-metadata-api read numbers.txt
OCLC WorldCat Metadata API: Read operation
Read 3 records, 1 failed
Records written to wc-read-20150723112649.xml
Log written to wc-read-20150723112649-log.txt

The second is a Manager class that handles the requested operations. Most of the Metadata API's operations do not natively support batch processing, but instead act on a single record or record number at a time. The Manager iterates through the input, making one API request per record and aggregating the resultant MARCXML, success/failure reports, and any error messages. This facilitates the third feature, which is automatic logging to disk. The MARCXML returned from a batch request is saved to one file, while the status report and any error messages are saved to another. Option flags can be passed to the program to modify the names given to these files and to have the status of each individual operation logged to the console during the batch process.
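
In simplified form, that pattern looks something like the sketch below. This illustrates the approach rather than the tool's actual implementation; the block passed to run stands in for a single-record call through the API wrapper.

# Simplified sketch of the Manager pattern: one API call per item,
# aggregating returned MARCXML and per-item status, then writing both
# to timestamped files. Illustrative only, not the tool's actual code.
class Manager
  def initialize(operation)
    @operation = operation        # e.g. 'create' or 'read'
    @records = []                 # aggregated MARCXML strings
    @log = []                     # per-item status lines
    @succeeded = @failed = 0
  end

  # items: MARCXML records (create/update) or OCLC numbers (read).
  # The block wraps a single-record request to the API.
  def run(items)
    items.each do |item|
      begin
        @records << yield(item)   # MARCXML returned for this item
        @succeeded += 1
        @log << "success: #{item.to_s[0, 40]}"
      rescue StandardError => e
        @failed += 1
        @log << "failure: #{item.to_s[0, 40]} (#{e.message})"
      end
    end
    stamp = Time.now.strftime('%Y%m%d%H%M%S')
    File.write("wc-#{@operation}-#{stamp}.xml", @records.join("\n"))
    File.write("wc-#{@operation}-#{stamp}-log.txt",
               (["#{@succeeded} succeeded, #{@failed} failed"] + @log).join("\n"))
    puts "#{@succeeded} succeeded, #{@failed} failed"
  end
end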

When new or updated master records are returned via the API, they contain standard administrative MARC fields, including the 001, 005, and 040. Rather than providing a new mechanism for locking records in WorldCat, OCLC evaluates the values of the 005 and 040 when it receives updates via the API, to prevent conflicts between multiple editors: if these fields in the submitted record do not match the current data in the master record, the update is rejected. It is therefore essential that we record this metadata in our MODS repository to support future updates. The process also requires us to review the status of master records for any changes before contributing our own batch updates, although the records created for our unique materials will not frequently be changed by other institutions.
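
A guard for that review step might look like the following sketch, assuming the current master record has already been fetched as MARCXML through the API's read operation; the 005 value in the usage comment is illustrative.

# Sketch: before submitting an update, check that the master record's
# 005 (latest transaction timestamp) still matches the value we stored
# in our MODS repository when the record was created or last updated.
require 'nokogiri'

MARC_NS = { 'marc' => 'http://www.loc.gov/MARC21/slim' }

# marcxml: the current master record, fetched via the API's read
# operation; stored_005: the 005 value recorded in our repository.
def safe_to_update?(marcxml, stored_005)
  doc = Nokogiri::XML(marcxml)
  current_005 = doc.at_xpath('//marc:controlfield[@tag="005"]', MARC_NS)&.text
  current_005 == stored_005
end

# e.g. safe_to_update?(File.read('master.xml'), '20170125142407.0')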

Upon receiving the returned MARCXML, now carrying both local and WorldCat record identifiers, we use a small XSL transformation to merge the administrative metadata back into our MODS records, and we use MarcEdit to generate MARC data for loading into our catalog.
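
In simplified form, the merge step looks something like the sketch below, which substitutes Nokogiri for our actual XSLT and uses illustrative file names and MODS target elements:

# Sketch: copy the OCLC record identifier (001) and latest transaction
# timestamp (005) from a returned MARCXML record into the matching MODS
# record. Nokogiri stands in for our actual XSLT, and the target MODS
# elements and file names are illustrative.
require 'nokogiri'

MARC_NS = { 'marc' => 'http://www.loc.gov/MARC21/slim' }
MODS_NS = { 'mods' => 'http://www.loc.gov/mods/v3' }

marc = Nokogiri::XML(File.read('wc-create-20170125142407.xml'))
mods = Nokogiri::XML(File.read('mods-record.xml'))

oclc_001 = marc.at_xpath('//marc:controlfield[@tag="001"]', MARC_NS).text
oclc_005 = marc.at_xpath('//marc:controlfield[@tag="005"]', MARC_NS).text

# Append the identifiers to the record's administrative metadata.
record_info = mods.at_xpath('//mods:recordInfo', MODS_NS)
record_info << %(<recordIdentifier source="OCoLC">#{oclc_001}</recordIdentifier>)
record_info << %(<recordChangeDate>#{oclc_005}</recordChangeDate>)

File.write('mods-record.xml', mods.to_xml)

With this tool for the Metadata API, we have been able to efficiently incorporate a pipeline to WorldCat into our workflow for metadata in the Digital Library Program. This approach could easily be extended to other sets of metadata that are created and processed at scale. Our utility is itself being iteratively enhanced, and is also available open source on GitHub. I'll be sharing further details about its development and use at next week's inaugural DEVCONNECT conference.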

Shaun Akhtar, Metadata Librarian, Dartmouth College