Learning Linked Data: Making Your Data Harvestable via RDFa

Most of my posts so far have focused on the cool things the WorldCat Discovery API inspired me to learn and explore while consuming Linked Data.

More recently though, I’ve been turning my attention to another important piece of Linked Data support in the Discovery API: expressing your collection as Linked Data using RDFa or JSON-LD serialization.

The Value of Harvestable Data

Before we delve into the gory technical details of how to do this, let's talk a bit about why libraries might want to expose their collection information as Linked Data. One significant reason is to make the things in a library's collection part of the larger web of things. Doing so exposes the data to the wider web community, making it easily discoverable by other sites, services, and ultimately consumers. The ultimate goal is to make the things in a library's collection part of the sites and services that library users frequent daily: Google, Wikipedia, etc. This same reasoning is a driving factor behind why OCLC is exposing bibliographic entities such as Works.

Data for Humans and Machines Side by Side

So how can libraries do this and what does it have to do with the WorldCat Discovery API? Most of my work with Linked Data at OCLC has been focused on purely machine readable serializations, that is, data that only a machine can read. However, while building my demonstration application for the WorldCat Discovery API I thought it would be useful to expose the data in both a human and machine readable format via the display screens. The great thing is that, because I'm retrieving Linked Data from the WorldCat Discovery API, all the data is already “marked up”.

My goal was to provide a proof of concept of how to make library data harvestable within such a UI using data that was retrieved from the WorldCat Discovery API. To make library data harvestable it needs to be available in a format that the harvesters can read. This doesn't just mean using Schema.org as the ontology for marking up the elements. It also means providing a serialization that harvesters are able to read.

Choosing a Serialization to Share

Google's documentation indicates that they will read data embedded in web pages via microdata, microformats, RDFa or JSON-LD. Neither microdata nor microformats would be readable by a Linked Data parser. That wasn't acceptable to me. This meant my choices were RDFa or JSON-LD.

If I only cared about Google harvesting then I'd choose to use JSON-LD for a number of reasons. First, because I'm starting with an existing graph and have an RDF library at my disposal, serializing JSON-LD and embedding it is a snap. Second, from my perspective JSON-LD keeps the HTML a little leaner and cleaner.
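To give a rough idea of what embedding JSON-LD in a page involves, here is a minimal Python sketch. It uses only the standard library and writes the JSON-LD by hand; with an RDF library you would serialize the existing graph instead. The helper name jsonld_script_tag is hypothetical, and the two URIs are the ones used in the example statements later in this post.

```python
import json

# Hand-written JSON-LD for illustration only. In practice this would be
# serialized from the graph by an RDF library rather than typed out.
book = {
    "@context": "http://schema.org/",
    "@id": "http://www.worldcat.org/oclc/179793524",
    "@type": "Book",
    "author": {
        "@id": "http://viaf.org/viaf/75785466",
        "@type": "Person",
        "name": "Karen Coombs",
    },
}


def jsonld_script_tag(data: dict) -> str:
    """Wrap a JSON-LD document in a script element for embedding in HTML."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a string value containing "</script>" cannot close
    # the element early. "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(jsonld_script_tag(book))
```

The resulting element sits alongside the visible HTML and can be dropped into the page without touching the markup that controls the look and feel, which is what keeps the HTML leaner.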

However, Google's support for JSON-LD is more recent and it is unclear if other harvesters would support this as well or if the standard Linked Data parsers which retrieve HTML would be capable of extracting the embedded JSON-LD. Given these factors, I ultimately decided that I had to include RDFa to cover more use cases.

Encoding RDFa from a graph

If you've never had to encode RDFa before, the Schema.org site provides both information about the ontology and examples marked up in microdata, RDFa and JSON-LD. If you have a good RDF library at your disposal it will also output RDFa for any graph, which you can use as a model for the RDFa you are trying to produce.

Unfortunately, an RDF library isn't going to magically solve your problems because you'll get plain HTML with RDFa and it won't match your UI's look and feel at all. So to achieve the results you want, you're going to have to mark up the RDFa by hand. On the surface this might seem like a time-consuming but trivial task. The truth is that appearances can be deceiving.

One of the things I find most difficult about working with RDFa is making sure the semantics one is trying to achieve aren't lost in the RDFa serialization. What do I mean by this? In simple terms, if the graph is supposed to say that

URI <http://viaf.org/viaf/75785466>

is a Person

who has name “Karen Coombs”

and

URI <http://www.worldcat.org/oclc/179793524>

is a book

that has author <http://viaf.org/viaf/75785466>

then changing the serialization shouldn't change or remove those statements.
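As a rough sketch, hand-written RDFa carrying those statements inside ordinary page markup might be rendered like this in Python. The function name, the "author" CSS class and the choice of http://schema.org/ as the vocab are assumptions for illustration; they are not taken from the demonstration application.

```python
from html import escape


def render_book_rdfa(book_uri: str, author_uri: str, author_name: str) -> str:
    """Render the example statements as RDFa inside ordinary UI markup."""
    return (
        # The outer element names the book and gives it a type.
        f'<div vocab="http://schema.org/" typeof="Book" resource="{escape(book_uri)}">\n'
        f'  <p class="author">by\n'
        # property + typeof + resource: the book has an author, and that
        # author is the Person identified by author_uri.
        f'    <span property="author" typeof="Person" resource="{escape(author_uri)}">\n'
        # Nested inside the author element, so the name belongs to the
        # person, not to the book.
        f'      <span property="name">{escape(author_name)}</span>\n'
        f'    </span>\n'
        f'  </p>\n'
        f'</div>'
    )


print(render_book_rdfa(
    "http://www.worldcat.org/oclc/179793524",
    "http://viaf.org/viaf/75785466",
    "Karen Coombs",
))
```

The nesting is where semantics tend to get lost. If the resource attribute is dropped from the author element, the person becomes an anonymous node instead of the VIAF URI. If both typeof and resource are dropped, the author is reduced to a plain text value.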

In the case of my sample application, because it is based on an existing graph, the semantics are already predefined. So I just need to make sure the RDFa I produce replicates them. If it doesn't, then I've lost semantics in the RDFa. My favorite way of checking this is to run my RDFa through a tool that will convert it to a serialization where it is easier to read the semantics. I prefer Turtle for this and often use the EasyRDF PHP library “Converter” page. The library has a built-in function that lets you export a graph in any serialization it supports. This page exposes that functionality for easy public use.
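The same check can be expressed as a comparison between the statements you expect and the statements the markup actually yields. The Python sketch below is a deliberately simplified stand-in for a real RDFa parser: it only understands the vocab, typeof, resource and property attributes, assumes well-formed HTML, and assumes every typeof is paired with a resource. It is meant to illustrate the idea, not to replace a conforming tool such as the converter mentioned above. All class and function names are hypothetical.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"
BOOK = "http://www.worldcat.org/oclc/179793524"
PERSON = "http://viaf.org/viaf/75785466"
VOID = {"br", "img", "hr", "meta", "link", "input"}  # elements with no end tag


class TinyRDFa(HTMLParser):
    """Toy extractor for a small RDFa subset (NOT a conforming parser)."""

    def __init__(self):
        super().__init__()
        self.triples = set()
        self.stack = [{"subject": None, "vocab": "", "literal": None}]

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        a = dict(attrs)
        parent = self.stack[-1]
        vocab = a.get("vocab", parent["vocab"])
        frame = {"subject": parent["subject"], "vocab": vocab, "literal": None}
        prop, resource = a.get("property"), a.get("resource")
        if prop and resource is None:
            # No resource: the element's text becomes a literal value.
            frame["literal"] = (parent["subject"], vocab + prop, [])
        else:
            if prop:
                self.triples.add((parent["subject"], vocab + prop, resource))
            if resource:
                frame["subject"] = resource  # children now describe this
            if "typeof" in a:
                self.triples.add((frame["subject"], RDF_TYPE, vocab + a["typeof"]))
        self.stack.append(frame)

    def handle_data(self, data):
        for frame in self.stack:
            if frame["literal"]:
                frame["literal"][2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or len(self.stack) == 1:
            return
        frame = self.stack.pop()
        if frame["literal"]:
            s, p, chunks = frame["literal"]
            # Literals are wrapped in quotes to tell them apart from URIs.
            self.triples.add((s, p, '"' + "".join(chunks).strip() + '"'))


def extract(html: str) -> set:
    parser = TinyRDFa()
    parser.feed(html)
    parser.close()
    return parser.triples


expected = {
    (BOOK, RDF_TYPE, SCHEMA + "Book"),
    (BOOK, SCHEMA + "author", PERSON),
    (PERSON, RDF_TYPE, SCHEMA + "Person"),
    (PERSON, SCHEMA + "name", '"Karen Coombs"'),
}

good_html = f"""
<div vocab="{SCHEMA}" typeof="Book" resource="{BOOK}">
  <span property="author" typeof="Person" resource="{PERSON}">
    <span property="name">Karen Coombs</span>
  </span>
</div>"""

# Same visible text, but the author has been flattened to a string.
bad_html = f"""
<div vocab="{SCHEMA}" typeof="Book" resource="{BOOK}">
  <span property="author">Karen Coombs</span>
</div>"""

print("good markup lost:", expected - extract(good_html))
print("bad markup lost:", expected - extract(bad_html))
```

Both fragments look identical to a human reader, yet the second one no longer says who the author is as an entity. That kind of silent loss is what converting back to Turtle makes visible.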

Additionally, you can test your RDFa output using Google's Structured Data Testing Tool. You can provide a URL to a public web page or cut and paste HTML into the tester. This will give you some idea of how Google and other search engines will “see” and extract your data.

Creating RDFa without an existing graph

If you are encoding data that doesn't already exist in a graph form, the process of creating RDFa is similar but inherently more difficult. In these cases several steps need to happen. First, you need to determine what entities you are encoding. Second, you need to decide what statements you want to make about the entities and their relationships. You can use an existing vocabulary like Schema.org for your entities and statements, but you still have to figure out what you want to say. The last step is to properly encode the semantics so that machines can read them. We'll be talking about a specific example of this in our next blog post in this series.

  • Karen Coombs

    Senior Product Analyst