Defending Differences from Duplicate Detection

Jay Weitz


WorldCat is certainly the largest database of bibliographic data on Earth. Probably the universe, but let’s stick with what we know for sure. Among those 360 million bibliographic records, we figure there must be at least a few duplicates. In a database that size, built by tens of thousands of catalogers in thousands of institutions over the course of four decades, duplicate records are an unfortunate fact.

For most of those four decades, OCLC has also been hard at work trying to reduce the number of duplicate records through both manual and automated means. We began merging duplicates manually in 1983, and recently, specially trained members of the Cooperative have been merging records as part of our ongoing Merge Pilot. From 1991 through 2005, OCLC’s automated Duplicate Detection and Resolution (DDR) software ran through WorldCat sixteen times, merging over 1.5 million duplicates. That original DDR dealt with book records only. We at OCLC were painfully aware of this limitation and constantly reminded of it by you, our users.

Between 2005 and 2010, we methodically developed, tested, and refined a vastly improved version of DDR that could deal with all bibliographic records, not just books. Since we implemented expanded DDR in 2010, upwards of 19 million duplicate records have been eliminated. More than two dozen points of comparison drawn from nearly every part of a bibliographic record get parsed, analyzed, and weighed against each other in the process of trying to identify and merge duplicates. Thanks in great part to the reports you send to us about missed duplicates and the occasional incorrect merge, we constantly work to improve DDR’s complex algorithms.

Both sides of DDR

DDR’s intended purpose is to accurately identify and merge records that represent the same bibliographic resource. But the flip side of that is the equally vital identification and protection of records that represent legitimately distinct resources — records that should not be merged.

In the development of DDR, we have consistently erred on the side of not merging records when we could not be quite sure that we were merging them correctly. To that end, we have purposely set aside certain categories of records, securing them behind a bibliographic barricade, so that they are not even processed by DDR. Because of the nature of the resources themselves or as a result of cataloging practices and traditions, making the proper distinctions algorithmically has proven too difficult for these protected categories: most rare and archival resources, including anything with a publication or creation date earlier than 1801; all cartographic materials with dates earlier than 1901; all SCIPIO records for art and rare book sales catalogs; Digital Collection Gateway records; and all photographs.
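For readers who think in code, the protected categories can be pictured as a filter applied before any duplicate matching happens. The following is only an illustrative sketch, not DDR's actual implementation: the Record class, its field names, and the "SCIPIO" and "DCG" source labels are hypothetical, and the single rare-or-archival flag simplifies "most rare and archival resources" into a yes/no value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a bibliographic record."""
    material_type: str                 # assumed labels: "book", "map", "photograph", ...
    year: Optional[int] = None         # publication or creation date, if known
    is_rare_or_archival: bool = False  # assumed flag; the real criteria are more nuanced
    source: Optional[str] = None       # assumed labels: "SCIPIO", "DCG"


def is_excluded_from_ddr(rec: Record) -> bool:
    """Return True if the record falls into a protected category."""
    if rec.is_rare_or_archival:
        return True
    # Anything with a publication or creation date earlier than 1801.
    if rec.year is not None and rec.year < 1801:
        return True
    # Cartographic materials with dates earlier than 1901.
    if rec.material_type == "map" and rec.year is not None and rec.year < 1901:
        return True
    # SCIPIO sales catalog records and Digital Collection Gateway records.
    if rec.source in {"SCIPIO", "DCG"}:
        return True
    # All photographs, regardless of date.
    if rec.material_type == "photograph":
        return True
    return False
```

Under these assumptions, a map dated 1880 would be skipped entirely, while a map dated 1901 would go on to be compared against its possible duplicates.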

Cataloging defensively

When we introduced the improved version of DDR in 2010, we wanted to let users know a bit about how it worked. But even more, we wanted to arm users with the sword of MARC and the shield of AACR2 and RDA, reminding catalogers about the powers built into both the bibliographic format and the descriptive cataloging conventions to make sure that DDR would recognize differences that justify legitimately distinct records.

A webinar entitled Cataloging Defensively: When to Input a New Record in the Age of DDR was presented twice in October and November 2010, advising catalogers of valid ways to rebuff the attacks of DDR when sometimes subtle differences in resources allow for separate records.

In the years since 2010, Resource Description and Access (RDA) has become a dominant descriptive cataloging convention and the MARC format has changed in substantial ways in order to accommodate it. Questions from, and conversations with, catalogers within the OCLC cooperative strongly suggested the need for more information about DDR and more specific suggestions regarding particular types of materials.

For the January 2016 ALA Midwinter meeting of the Cataloging and Classification Committee of the Map and Geospatial Information Round Table (MAGIRT), I presented Cataloging Maps Defensively. For the annual meeting of the Music OCLC Users Group held in March 2016 in conjunction with the Music Library Association, I presented Cataloging Sound Recordings Defensively.

Preserve, protect, defend

The responses to these format-specific presentations have been gratifying. The MAGIRT/ALCTS/CaMMS Cartographic Resources Cataloging Interest Group has invited me to present Cataloging Maps Defensively a second time at the ALA Annual Conference in June 2016. In coming months, I intend to create and make available on the OCLC “Cataloging Defensively” page additional versions devoted to video recordings, musical scores, rare materials, books, and possibly others.

We shouldn’t think of WorldCat as a battlefield, but catalogers can use strategic knowledge of MARC, of the descriptive conventions, and of DDR itself to preserve, protect, and defend records that represent legitimately distinct resources.

What are your thoughts on the balance between eliminating duplicate records and “cataloging defensively”? Let us know with #OCLCnext on Twitter.