Library Linked Data in the Cloud

OCLC's Experiments with New Models of Resource Description


OCLC is a nonprofit library cooperative providing research, programs, and services that help libraries share the world’s knowledge. OCLC manages WorldCat, the largest and most comprehensive catalog of library resources from around the world. In the time period covered by this book, WorldCat contained more than 300 million bibliographic records that represented more than 2 billion items held by participating libraries. OCLC is also the custodian of the Dewey Decimal Classification, which has been used by libraries for over a hundred years to organize their collections. In addition, OCLC hosts the Virtual International Authority File, or VIAF, the largest aggregation of authoritative information collected by libraries about people and organizations and the creative works they have published. Resources such as these make it easier for libraries to fulfill their public mission of connecting library patrons to the works that satisfy their quest for information.

OCLC Research was founded in 1978 and has made significant contributions to the development of the library and Web standards that form the conceptual underpinning of WorldCat and other library resources. In the mid-1990s, OCLC researchers began making the case that libraries need to be integrated into the Web, because that’s where the information seekers are. As Semantic Web technologies have matured, this argument has only become more urgent.

OCLC researchers have participated in the Library Linked Data Incubator Group sponsored by the World Wide Web Consortium. They have been vocal members of the Schema Bib Extend Community Group, which recommends extensions to Schema.org, the indexing vocabulary recommended by Google, Yahoo, Bing, and Yandex. OCLC researchers have also worked with the Wikipedia community to facilitate the cross-directional linking between library resources and Wikipedia articles. The results guide human readers from Wikipedia to libraries and enable machine processes to consume richer linked data through Wikipedia’s association with the Wikidata project. In addition, OCLC researchers have served as advisers to the BIBFRAME standard sponsored by the Library of Congress, whose goal is to replace MARC, the legacy standard for bibliographic description, with a linked data model. And in the past five years, OCLC has become a significant publisher of linked data, producing models and publicly accessible datasets containing billions of RDF triples describing the objects and concepts referenced in VIAF; the Dewey Decimal Classification, or DDC; Faceted Application of Subject Headings, or FAST; and the WorldCat catalog. It is from this rich experience that this volume emerges.

This book is about OCLC’s experiments in the redesign of traditional library resource descriptions as linked data. In practical terms, the goal of the work reported in this book is to define the first draft of an entity-relationship model of creative works and the events in the library community that impact them. The model is realized by mining the data stores maintained at OCLC and republishing them as large RDF datasets. Though the work is necessarily anchored to a particular point in time, we hope that readers will gain insight into the collective thinking of the world’s largest library cooperative, and that our solutions will spur development by others, who might benefit from our trials and errors as well as our successes. In a program whose goal is to express library metadata as linked data, we are doing work that is consistent with the core values of our profession, which places a premium on collaboration and openness. In return, we are confident that the Web of Data will be enriched by the collective expertise of over a hundred years of librarianship.

The impetus for this book arose from a 2012 conversation between Lorcan Dempsey, OCLC Vice President of Research and Chief Strategist, and Ying Ding, Associate Professor of Information Science at Indiana University. The outcome was an invitation to propose a monograph for the series Synthesis Lectures on the Semantic Web: Theory and Technology published by Morgan and Claypool. Once the proposal was accepted, Jean Godby, Senior Research Scientist with OCLC Research, was tasked with organizing contributions from colleagues and contributing to the volume herself. Beginning in 2013 and stretching into the first half of 2014, material was contributed, refined, and in some cases rewritten as researchers and their technical allies at OCLC moved the work forward. In describing OCLC’s projects, the aim is to tell a story about a large collection of interconnected projects. Each chapter is designed as a lecture on a problem that must be addressed if library data is to be transformed successfully into a format that better serves the information-seeking public.

Many OCLC colleagues have contributed intellectual content in addition to the three authors of this book: Lorcan Dempsey, Jonathan Fausey, Ted Fons, Janifer Gatenby, Thom Hickey, Maximilian Klein, Michael Panzer, Tod Matola, Ed O’Neill, Stephan Schindehette, Jenny Toves, Diane Vizine-Goetz, Richard Wallis, and Jeff Young. The authors are especially indebted to Karen Smith-Yoshimura, whose own important contributions to research on library metadata and whose thoughtful comments on the entire manuscript produced so many improvements that we realize, in hindsight, that she should have been a co-author. We are also grateful to OCLC colleagues who helped us with editorial and production tasks, including Eric Childress, Chris Galvin, Brad Gauder, Jenny Johnson, Jeanette McNichol, and JD Shipengrover.

In addition, we have benefited from engagement with colleagues in the library community, who mentored us, commented on the manuscript, tested some of our ideas at their own institutions, and joined with us in lengthy and often passionate discussion that produced many photos of whiteboards, some of which have found their way into the illustrations in this book. In particular, we are grateful to Kenning Arlitsch, Montana State University; Ray Denenberg, Library of Congress; Ying Ding, Indiana University; Kevin Ford, formerly of the Library of Congress; Paul Groth, Vrije Universiteit Amsterdam; Antoine Isaac, Vrije Universiteit Amsterdam; Nannette Naught, IMT Associates; Patrick OBrien, Montana State University; Philip Schreur, Stanford University; and Marcia Zeng, Kent State University. And of course, we are deeply indebted to our former OCLC colleagues Eric Miller and Stu Weibel, without whose groundbreaking work this enterprise might never have taken shape.

Carol Jean Godby, Shenghui Wang, and Jeffrey K. Mixter
March 2015