Using OCLC APIs in OpenRefine

One question I get asked a lot about OCLC APIs is how people who don’t really write code can take advantage of them. Typically, I point them to existing tools that take advantage of OCLC APIs. However, another interesting use case for using OCLC APIs came up last month. I was asked how these APIs might be used in OpenRefine.

What is OpenRefine

OpenRefine is an open-source desktop tool for manipulating data. It enables users to transform data from one format into another and to extend the data with web services and external sources.

OpenRefine can import several different types of datasets, including Excel, TSV, CSV, XML, RDF, JSON, Google Sheets, and Fusion Tables. Once data is imported, users can filter rows of data or create new columns of data based on data in an existing column.

Catalogers and metadata librarians have been using OpenRefine to “wrangle” library metadata for quite a while. Owen Stephens has a nice series of posts on this. Library Carpentry has a module on using OpenRefine. Additionally, MarcEdit provides integration with OpenRefine that catalogers can leverage.

Why use OCLC APIs with OpenRefine?

Using OCLC APIs with OpenRefine allows one to enhance an existing dataset with data that exists in OCLC systems. Doing this with OCLC APIs that are unauthenticated, like VIAF, is a fairly well documented process. However, it is also possible to use OpenRefine with OCLC APIs that require authentication. Additionally, data that is exported from OCLC systems into an OpenRefine-compatible format can also be manipulated using OpenRefine. This blog post and others in this series will explore how to use OpenRefine and OCLC APIs in combination.

OpenRefine + WorldCat Search API = finding alternative ISBNs

Problem

As a metadata librarian, I want to find all the ISBNs related to another ISBN so that I can create a list of alternative identifiers to search by.

Solving this problem requires several steps.

  1. Load data from spreadsheet format into OpenRefine.
  2. Fetch related data from WorldCat.
  3. Parse the data returned from WorldCat for each query into a comma separated list of related ISBNs.

Step 1: Loading my spreadsheet data

Once you’ve installed OpenRefine and have it running, start a new project by getting data. TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported. In my case, I used an Excel file on my local computer that I navigated to and selected. OpenRefine will offer a preview of the data.

file

If you like what you see, you can use the “Create Project” button to pull the data into OpenRefine to work with. Once the data is in OpenRefine, it should look a bit like a spreadsheet.

file

Step 2: Fetching data from WorldCat

For each ISBN in my spreadsheet, I want to:

  1. make a call to WorldCat Search API;
  2. find all the records with this ISBN present, irrespective of FRBR grouping (i.e., don’t de-duplicate); and
  3. get the results back in Dublin Core format (this format will be easier than MARC to deal with in OpenRefine).

To do this, I need to add a column that fetches data based on the ISBN column in my dataset. On the column in my “sheet,” that is the ISBN. Use the dropdown to “Edit Column -> Add Column by Fetching URLs.”

file

Name the column ISBNFetch and input the following:

"http://www.worldcat.org/webservices/catalog/search/sru?wskey={myAPIKey}&query=srw.bn=" 
+ value + "&recordSchema=info%3Asrw%2Fschema%2F1%2Fdc&frbrGrouping=off"

Then click “OK.” You’ll need to wait while OpenRefine makes the API calls and adds a new column with the API responses for each request.

Important! Make sure you put in a valid WSKey to get data back from the API. (Get more information on how to request a WSKey)

file

Step 3: Parsing the data returned to get ISBNs

Now that I’ve got the data back, I need to parse it and put it in a useful format in a column. This requires me to

  1. look in each record in the result set for all the ISBN values, and
  2. create a list of unique ISBNs that are related to the one I passed in my query.

In OpenRefine, this means adding a column based on my “ISBNFetch” column data. Use the dropdown on the “ISBNFetch” column to “Edit Column -> Add column based on this column” and use the following expression:

join(uniques(forEach(value.split("<oclcdcs>"), v, v.partition("<dc:identifier>")[2].partition("</dc:identifier>")[0])), ",")




This code might look kind of scary, but it actually isn’t that complicated if you break it apart. So, let’s break this code down so it isn’t so scary. The inner most portion takes each element and extracts out the dc:identifier elements within.

forEach(value.split("<oclcdcs>"), v, v.partition("<dc:identifier>")[2].partition("</dc:identifier>")[0])

The next layer takes the result and removes all duplicates.

uniques(…)

The last bit joins together all the values into a comma separated list.

join(…, “,”)

https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language

Step 4: Parsing the data returned to get OCLC numbers

When I fetched the data from WorldCat, I retrieved a whole bunch of a basic metadata for each record in the result set, so I can extract more information than the ISBNs.

I can also extract the OCLC numbers using similar steps. Once again, I’m going to the “ISBNFetch” column and “Edit Column -> Add column based on this column.” I’m using a slightly different expression this time.

join(uniques(forEach(value.split("<oclcdcs>"), v, v.partition("<oclcterms:recordIdentifier>")[2].partition("</oclcterms:recordIdentifier>")[0])), ",")

Now I have a list of all the possible OCLC numbers for records that might match the ISBN. I can even return a count of these using the following expression:

forEach(value.split("<oclcdcs>"), v, v.partition("<oclcterms:recordIdentifier>")[2].partition("</oclcterms:recordIdentifier>")[0])

The key difference with this expression is that it takes each element and extracts out the oclcterms:recordIdentifier elements within.

I’m not interested in the actual number, but only how many matches exist, I can return a count with this expression:

uniques(forEach(value.split("<oclcdcs>"), v, v.partition("<oclcterms:recordIdentifier>")[2].partition("</oclcterms:recordIdentifier>")[0])).length()
file

Next steps

This post has demonstrated how to use the WorldCat Search API in OpenRefine. These same techniques can also be applied to other OCLC APIs that

  • use the HTTP GET method, and
  • don’t use authentication or allow the WSKey to be passed as a query parameter in the request URL.

It is possible to authenticate to OCLC APIs that use Access Tokens via OpenRefine. I’ll talk about how to do that in the next post in this series.

  • Karen Coombs

    Karen Coombs

    Senior Product Analyst