Data from the global bibliographic database WorldCat reveal transnational patterns in literary publishing, the preservation of individual countries’ literary heritage, and the cultural diversity present in the books.


Globally and nationally, books represent a central kind of cultural heritage. The UNESCO Institute for Statistics has been exploring library statistics for worldwide book consumption, and helped to found the European Expert Meeting on Book and Library Statistics. These bodies, as well as the International Federation of Library Associations, are especially interested in any global patterns in the book world as expressions of cultural diversity and heritage.

Such data, however, are not widely collected by any national publishing organizations or library statistics agencies. UNESCO maintains a database of translated works worldwide, but is unable on its own to access worldwide monographic statistics.

The increasingly global reach of the WorldCat database, on the other hand, makes it an obvious source to mine for such data. OCLC’s bibliographic database represents more than 142 million items, with 1.43 billion copies held by libraries worldwide (numbers which are ever increasing); in addition, the database is becoming increasingly global in scope with the ingest of dozens of national libraries’ bibliographic data. OCLC researchers have already produced a prototype application which graphically displays worldwide patterns in bibliographic holdings.


The basic objectives of the project were to mine WorldCat's "overwhelmingly" monographic records and to parse these data by date, country of publication, and language. On the importance of language, there is an axiomatic concept in Cognitive Anthropology that (in the words of Benjamin Lee Whorf), “Language shapes the way we think, and determines what we think about.” In other words, the language(s) spoken by a culture help to determine that culture’s perception of the world, and its expression of itself within that world.

Within this broad scope, we set the following limits. We gathered details of non-serial textual materials, excluding dissertations and government documents. The date of publication had to be a valid number less than 2010. We excluded works with publication dates such as “19xx,” since we could not fold them reliably into the rest of the data; we included books whose "date of publication" was as early as 1000 A.D., believing that archival and special collections data may be well represented in WorldCat. A pre-test that profiled six countries (deliberately highlighting non-English works and non-English cataloging) was followed by refinements to the data extraction techniques, and then by extraction of the rest of the world’s data.
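The inclusion rules above can be illustrated with a minimal Python sketch. It is not the project's actual extraction code: the record layout, the field names (`material_type`, `date`, `country`, `language`), and the material-type labels are all hypothetical, and the lower bound of 1000 is an assumption based on the earliest dates mentioned above.

```python
from collections import Counter

# Hypothetical labels for the excluded material types; real WorldCat
# records encode these differently.
EXCLUDED_TYPES = {"serial", "dissertation", "government_document"}

EARLIEST_YEAR = 1000   # assumed lower bound ("as early as 1000 A.D.")
CUTOFF_YEAR = 2010     # dates must be strictly less than this


def parse_year(raw):
    """Return the year as an int, or None if it is not fully numeric.

    Partially known dates such as "19xx" are rejected rather than guessed.
    """
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None
    return int(text)


def in_scope(record):
    """Apply the material-type and publication-date limits to one record."""
    if record["material_type"] in EXCLUDED_TYPES:
        return False
    year = parse_year(record["date"])
    return year is not None and EARLIEST_YEAR <= year < CUTOFF_YEAR


def tally(records):
    """Count in-scope records by (country, language, year)."""
    counts = Counter()
    for record in records:
        if in_scope(record):
            key = (record["country"], record["language"], parse_year(record["date"]))
            counts[key] += 1
    return counts


# Invented sample records, for illustration only.
sample = [
    {"material_type": "book", "date": "1987", "country": "fr", "language": "fre"},
    {"material_type": "book", "date": "19xx", "country": "fr", "language": "fre"},
    {"material_type": "serial", "date": "1995", "country": "de", "language": "ger"},
    {"material_type": "book", "date": "2010", "country": "jp", "language": "jpn"},
    {"material_type": "book", "date": "1000", "country": "it", "language": "lat"},
]
counts = tally(sample)
```

In this sketch only the first and last sample records are counted: the “19xx” date is not a valid number, the serial is out of scope, and 2010 is not less than the cutoff.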

The project produced a rich data portrait of the global literary arts (as reflected in the WorldCat database), with emphasis on cultural literary heritage by country and region. Researchers are able to track the overall annual publishing for every country of the world, the libraries that collect and even import a country’s works, the “foreign” monographs their libraries import, and the proportion of publications in various official and native languages; in addition, the interaction between a culture’s languages and the rest of the world is captured in data on translated works across time. The results provide a global overview of the publishing arts, and a wealth of case studies in single countries’ practices in both literary publishing and the preservation of their literary heritage.

This effort was one of several data mining projects whereby OCLC Research sought to extract intelligence from the data we have, and use it in different ways that provide value to libraries.


  • January 2009-November 2009


