Leveraging machine learning technology as part of ongoing WorldCat quality measures

OCLC Metadata Quality teams implement a variety of measures—both manual and automated—to improve the quality and usefulness of WorldCat data. These extensive and ongoing efforts ensure that WorldCat data supports the needs of our membership and our global network of thousands of libraries across a wide range of services. As the technologies and tools that allow us to do this important work evolve, we are continually exploring new methods for enriching, repairing, and de-duplicating WorldCat records—data that powers the global discovery and sharing of library resources.

Cleaning up duplicate records is one of the most impactful ways to improve the quality of WorldCat. Manual efforts by metadata professionals—and technology like our duplicate detection software—have had significant success in reducing the number of duplicates. And now we’re leveraging machine learning to accelerate that progress.

In December 2022, we invited the cataloging community to participate in a data labeling exercise to validate our machine learning model’s understanding of duplicate records in WorldCat. Over the following four and a half months, a total of 336 users labeled more than 34,000 “possible duplicates” using a simple, intuitive online interface. Thank you to every individual who participated in the project. Your collaboration helps advance the profession and the mission of libraries worldwide.

With the data we collected, we can now better scale the automated resolution of duplicate records in WorldCat, saving countless hours and improving the experience for the entire library community.

We will soon implement this machine learning model as part of our ongoing efforts to mitigate and resolve duplicate records in WorldCat. On 19 August 2023, an initial run of one million records (500,000 pairs) will be processed by the model. This will result in 500,000 duplicate record merges in WorldCat, one per pair, which will improve cataloging, discovery, and interlibrary loan experiences for both library staff and end users.

For additional information about the project and using the machine learning model to merge duplicate records in WorldCat, please read our Hanging Together blog post.