Data-Mining Research Area

Overview: Making Data Work Harder

Libraries have made huge investments in creating and maintaining rich, structured information describing the resources in their collections.

This data already delivers considerable value by supporting access and inventory control. It also represents potential value in terms of:

  • knowing more about the characteristics of library collections
  • generating interesting and innovative data displays
  • providing intelligence to support a range of library decision-making needs, including
    • collection development
    • digitization
    • preservation.

There is untold value in bibliographic information, but it is largely untapped. If libraries are to realize the full value of their bibliographic data—or, put another way, if libraries are to maximize the return on the investments they make to create this data—steps must be taken to release this value in innovative and useful ways.

Internet giants such as Amazon and Google provide valuable lessons on the importance of squeezing the full value from available data. Whether in the form of book recommendations (if you like this book, you'll also like . . .), search result rankings, targeted advertising, or collection views (e.g., Google Scholar), the "Amazoogle" companies make a concerted effort to release as much value as possible from the data at hand.

Libraries possess rich reservoirs of data. However, this data needs to be made to work harder in order to create value for librarians and users. To this end, the OCLC Research Data-Mining Research Area will focus on projects aimed at creating value from the bibliographic information in WorldCat and other library data sources.

Current projects

OCLC Research has a number of projects currently underway in the Data-Mining Research Area, with plans for several future projects as well.


  • Books as Expressions of Global Cultural Diversity: WorldCat data reveal transnational patterns in literary publishing, the preservation of individual countries’ literary heritage, and the cultural diversity present in the books.
  • The Systemwide Print Book Collection: analyzes the size and characteristics of aggregate print book holdings, with an emphasis on implications for digitization and preservation decision-making. (PowerPoint: 300K, 35 slides)
    (A version of this presentation was given to the May 2005 meeting of the OCLC Members Council Digital Libraries Research Interest Group.)
  • Anatomy of Aggregate Collections: The Example of Google Print for Libraries: This D-Lib Magazine article offers some perspectives on the Google Print Library Project in light of what is known about library print book collections in general, and those of the Google 5 in particular, from information in OCLC's WorldCat bibliographic database and holdings file.
  • Audience Levels: infer materials' target audience, or audience level, using holdings information.
  • "Last Copy": identify rare or unique materials in individual library collections. This activity was reported in:
    Connaway, Lynn Silipigni, Edward T. O'Neill, and Chandra Prabha. 2006. "Last Copies: What's at Risk?" College and Research Libraries, 67,4 (July): 370-379. Pre-print available online (PDF: 151K, 24 pp.).
  • WorldMap: visualize geographic distribution of selected library data. Currently available data include holdings and titles, each by place of publication (from OCLC WorldCat) and number of libraries, librarians, users, volumes, and annual expenditures (from other sources).
  • Mining for Digital Resources: identification and characterization of digital resources cataloged in WorldCat. (PowerPoint: 112K, 15 slides)
  • Comparative Collection Assessment: looks at collection development, assessment, and resource sharing for print- and e-book collections.
  • Publisher Name Server: prototype service that resolves ISBN prefixes to publisher names; resolves variant publisher names to a preferred form; and captures and makes available various publisher attributes (e.g., location, language, genre/format, and dominant subject domain of the publisher's output).
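
Several of the projects above, such as "Last Copy," rest on counting how many libraries hold a given title. The sketch below is a minimal illustration of that idea and is not OCLC's actual method. The record layout (pairs of title identifier and library identifier), the function name find_scarce_titles, the max_holders threshold, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def find_scarce_titles(holdings, max_holders=1):
    """Map each library to the titles it holds that few libraries hold.

    `holdings` is an iterable of (title_id, library_id) pairs; this layout
    is hypothetical, not the WorldCat holdings format.
    A title counts as scarce when at most `max_holders` distinct libraries
    hold it. With the default of 1, only unique ("last copy") titles remain.
    """
    # Collect the set of distinct libraries holding each title.
    holders = defaultdict(set)
    for title_id, library_id in holdings:
        holders[title_id].add(library_id)

    # Keep only scarce titles and group them by holding library.
    scarce = defaultdict(list)
    for title_id, libraries in holders.items():
        if len(libraries) <= max_holders:
            for library_id in libraries:
                scarce[library_id].append(title_id)

    return {library: sorted(titles) for library, titles in scarce.items()}


# Invented sample data: four titles spread across three libraries.
sample_holdings = [
    ("t1", "LIB_A"), ("t1", "LIB_B"), ("t1", "LIB_C"),
    ("t2", "LIB_A"),
    ("t3", "LIB_B"), ("t3", "LIB_C"),
    ("t4", "LIB_C"),
]

last_copies = find_scarce_titles(sample_holdings)
print(last_copies)
```

Raising max_holders widens the result from strictly unique titles to rare ones. What threshold counts as "rare" is a policy choice that this sketch leaves to the caller.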