In my last post, I talked about exposing data as RDFa. But what does RDFa embedded in websites mean for developers?
Finding Harvestable Data
With Google's new interest in structured data on the web, more sites are including RDFa. For developers, this means there might be sites that have RDFa but no other Linked Data serialization. How do you find these? One way is to use the Green Turtle RDFa Chrome extension, which adds a green turtle icon to the address bar. Clicking the icon displays an N-Triples version of the graph.
Try it out on http://worldcat.org/oclc/056948048
OK, so WorldCat.org has RDFa. But other Linked Data serializations of this data are available as well, so why should I care about RDFa?
RDFa in WorldCat Identities
Have you ever found yourself wishing that WorldCat Identities output Linked Data? Technically it does. "What?" you say. "That isn't reflected in the API documentation." That is because the Linked Data is part of the text/html serialization - WorldCat Identities returns HTML with RDFa embedded. The data is encoded using Schema.org and includes Person, Organization and CreativeWork entities. Below is a Turtle representation for my WorldCat Identities page:
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .

<http://worldcat.org/identities/lccn-n2008026273> rdfa:usesVocabulary schema: .

<http://www.worldcat.org/identities/lccn-n2008026273> a schema:Person ;
    schema:name "Coombs, Karen A. " ;
    schema:knows <http://www.worldcat.org/identities/lccn-n2010059828>,
        <http://www.worldcat.org/identities/lccn-n2008026275> ;
    schema:sameAs <http://id.loc.gov/authorities/names/n2008026273>,
        <http://viaf.org/viaf/75785466> .

<http://www.worldcat.org/oclc/761325163> a schema:CreativeWork ;
    schema:contributor <http://www.worldcat.org/identities/lccn-n2008026273> ;
    schema:url <http://www.worldcat.org/oclc/761325163> ;
    schema:name "Open source Web applications for libraries" ;
    schema:author <http://www.worldcat.org/identities/lccn-n2008026273> ;
    schema:datePublished "2010" ;
    schema:inLanguage "English", "Undetermined" ;
    schema:description "Interest in open source software has never been stronger, yet a general lack of information about specific tools and benefits-along with nagging concerns about dependability and support-has hampered adoption in libraries. In Open Source Web Applications for Libraries, authors Coombs and Hollister address these issues and provide librarians with guidance on a range of applications that can be used to improve reference services, instruction, and outreach to library users" .

<http://www.worldcat.org/oclc/179793524> a schema:CreativeWork ;
    schema:contributor <http://www.worldcat.org/identities/lccn-n2008026273> ;
    schema:url <http://www.worldcat.org/oclc/179793524> ;
    schema:name "Library blogging" ;
    schema:author <http://www.worldcat.org/identities/lccn-n2008026273> ;
    schema:datePublished "2008" ;
    schema:inLanguage "English" ;
    schema:description "Thinking of setting up a blog for your school, academic, or public library? This book is for you! Learn all about the blogosphere and its place in your library. Learn the nitty gritty of setting up and hosting your library blog. Find out just what you need in hardware and software to make your blog work like a charm. See examples of groundbreaking uses for your library blog! Library Blogging is an overview of the world of blogs in libraries, including both use and technological discussions. These technology gurus bring you the \"why's\" of using a blog in a library context, the strengths of using blogs, and the actual how-to information. The book will gives an overview of the different options available for a library blog, the appropriateness of each option, and the possibilities of each program or service. This is all the information you need on the topic of library blogging" .

<http://www.worldcat.org/identities/lccn-n2008026273/> a schema:ProfilePage ;
    schema:about <http://www.worldcat.org/identities/lccn-n2008026273> .
I was able to create this Turtle serialization using the EasyRDF PHP library and some PHP code to read the RDFa and then output it as Turtle.
Knowing that RDFa is out on the web and having a library to consume it can open up lots of doors. I've been playing with the WorldCat Identities data to:
- Get URIs for Creative Works by a given identity
- Get URIs for related WorldCat Identities
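Both lookups above boil down to filtering triples by predicate. Here is a minimal sketch in Python, assuming the RDFa has already been parsed into (subject, predicate, object) tuples. The helper names `works_by` and `related_identities` are hypothetical, the sample triples are copied from the Turtle above, and treating schema:knows as the "related identity" link is my assumption about how to read this data.

```python
SCHEMA = "http://schema.org/"

# A few triples copied from the Turtle above, as (subject, predicate, object).
ME = "http://www.worldcat.org/identities/lccn-n2008026273"
TRIPLES = [
    (ME, SCHEMA + "knows", "http://www.worldcat.org/identities/lccn-n2010059828"),
    (ME, SCHEMA + "knows", "http://www.worldcat.org/identities/lccn-n2008026275"),
    ("http://www.worldcat.org/oclc/761325163", SCHEMA + "author", ME),
    ("http://www.worldcat.org/oclc/761325163", SCHEMA + "contributor", ME),
    ("http://www.worldcat.org/oclc/179793524", SCHEMA + "author", ME),
]


def works_by(triples, identity):
    """Return sorted URIs of works naming `identity` as author or contributor."""
    roles = {SCHEMA + "author", SCHEMA + "contributor"}
    return sorted({s for s, p, o in triples if p in roles and o == identity})


def related_identities(triples, identity):
    """Return sorted URIs that `identity` points to via schema:knows (assumed)."""
    knows = SCHEMA + "knows"
    return sorted({o for s, p, o in triples if s == identity and p == knows})


for uri in works_by(TRIPLES, ME):
    print("work:", uri)
for uri in related_identities(TRIPLES, ME):
    print("related:", uri)
```

Using a set before sorting removes the duplicate that appears when a work lists the same identity as both author and contributor.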
Expanding the boundaries of the Graph
I can now create a graph that encompasses a lot of information about a particular identity and gives me the opportunity to display that data in a way that is akin to Google's Knowledge Cards. But to truly understand the doors that Linked Data opens, you need to think about where information about a Person might live and how it might all be pulled together - especially when we consider data beyond the library world's knowledge of a person.
Take Jason Clark as an example. First, he is an author who has published a book that is in WorldCat, so he exists in WorldCat Identities. Second, because he is an author, he has a controlled name in authority files, which have been merged together, so an authority cluster exists for him in VIAF. Last, Jason's library web page lists him as staff and has information about him embedded in RDFa. Jason's web page includes a statement:
<http://www.lib.montana.edu/people/about/23> schema:sameAs <http://scholar.google.com/citations?user=2vZSuAwAAAAJ&hl=en>, <http://orcid.org/0000-0002-3588-6257>, <http://www.worldcat.org/identities/lccn-n2012016641>, <http://viaf.org/viaf/232895228> ;
This connects all of Jason's information together and allows a single graph to be built which crosses the boundaries of three different data sets:
- WorldCat Identities
- VIAF
- Montana State University's Library website
The graph could be extended even further. Because VIAF has sameAs statements which reference dbpedia, if Jason had a Wikipedia entry we could connect to that, too. The graph could also include more detailed information about Jason's creative works from WorldCat. The point here is that when you're consuming Linked Data you can choose to expand the graph at will.
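Expanding the graph like this is essentially a breadth-first walk over sameAs links. The sketch below is one way to do it in Python, under some assumptions: `fetch` is a hypothetical caller-supplied function that returns (subject, predicate, object) tuples for a URI, only schema:sameAs is followed (real sources may use other predicates such as owl:sameAs), and the in-memory `FAKE_WEB` stands in for real HTTP requests.

```python
from collections import deque

SAME_AS = "http://schema.org/sameAs"


def expand_graph(start, fetch, max_hops=2):
    """Breadth-first walk over sameAs links, collecting triples from each URI.

    `fetch(uri)` is a hypothetical function returning a list of
    (subject, predicate, object) tuples for that URI.
    """
    seen = {start}
    queue = deque([(start, 0)])
    graph = []
    while queue:
        uri, hops = queue.popleft()
        triples = fetch(uri)
        graph.extend(triples)
        if hops == max_hops:
            continue  # keep this node's data but do not follow its links
        for _s, p, o in triples:
            if p == SAME_AS and o not in seen:
                seen.add(o)
                queue.append((o, hops + 1))
    return graph, seen


# Stand-in for the web: URIs from Jason's statement above, with no real data.
JASON = "http://www.lib.montana.edu/people/about/23"
VIAF = "http://viaf.org/viaf/232895228"
IDENTITY = "http://www.worldcat.org/identities/lccn-n2012016641"
FAKE_WEB = {
    JASON: [(JASON, SAME_AS, VIAF), (JASON, SAME_AS, IDENTITY)],
    VIAF: [],
    IDENTITY: [],
}

graph, visited = expand_graph(JASON, lambda uri: FAKE_WEB.get(uri, []))
print(len(graph), "triples from", len(visited), "URIs")
```

The `max_hops` limit is the "at will" part: it lets you decide how far from the starting URI the graph is allowed to grow.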
Mashing up data to create and publish new Linked Data graphs
Once you think about the possibilities of expanding the graph in this way, you can create your own notion of what you want to include about a particular Person. Just remember that when you're traversing multiple data sets, you're making multiple HTTP calls, which can slow processing down.
This is why locally caching particular data and graphs can be helpful. Local caching also provides you with the opportunity to enhance the data and make decisions about which statements from which sources you want to include in your graph. Maybe you only trust or find useful some of the data within dbpedia. Processing and caching the data locally allows you to make those decisions and store them locally. Caching locally also affords you the opportunity to coin your own URI for the mashed up graph or graphs you are creating.
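A rough sketch of what that can look like in Python: a small wrapper that remembers what it has already fetched and keeps only the statements you have decided to trust. The class name `CachedSource`, the `fetch` function, and the example URI are all hypothetical, and a real cache would more likely live on disk or in a triple store than in a dictionary.

```python
class CachedSource:
    """Wrap a fetch function with an in-memory cache and a trust filter."""

    def __init__(self, fetch, trusted_predicates=None):
        self.fetch = fetch                  # hypothetical: uri -> list of triples
        self.trusted = trusted_predicates   # None means keep every statement
        self.cache = {}
        self.remote_calls = 0

    def get(self, uri):
        if uri not in self.cache:
            self.remote_calls += 1
            triples = self.fetch(uri)
            if self.trusted is not None:
                # Keep only statements whose predicate we chose to trust.
                triples = [t for t in triples if t[1] in self.trusted]
            self.cache[uri] = triples
        return self.cache[uri]


LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
ABSTRACT = "http://dbpedia.org/ontology/abstract"
EXAMPLE = "http://dbpedia.org/resource/Example"  # placeholder URI


def fake_fetch(uri):
    """Stand-in for an HTTP request plus parsing."""
    return [(uri, LABEL, "Example"), (uri, ABSTRACT, "A long abstract")]


source = CachedSource(fake_fetch, trusted_predicates={LABEL})
first = source.get(EXAMPLE)
second = source.get(EXAMPLE)  # served from the cache, no second remote call
print(source.remote_calls, first)
```

Filtering before caching means the decision about which statements to keep is stored along with the data, so it does not have to be repeated on every read.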
Attributing Your Data
If you are creating these mashed-up Linked Data graphs and republishing them, it is really important to attribute your data. In fact, attribution is required by the license under which most of the data sets OCLC offers are released. You can see this in the Linked Data itself. First you'll see a statement like:
<http://www.worldcat.org/title/-/oclc/41266045> void:inDataset <http://purl.oclc.org/dataset/WorldCat> ;
This is basically saying that the URI <http://www.worldcat.org/title/-/oclc/41266045> is part of the dataset with the URI <http://purl.oclc.org/dataset/WorldCat>.
If I examine the Linked Data at the URI for the dataset, I see statements about the license and usage. Technically, if you are just connecting your dataset to a WorldCat URI via a "sameAs" style relationship, the URIs themselves are attribution enough. For more detailed information on the various use cases, see OCLC's Data License and Attribution page.
Attributing your data is important, but by the same token giving users consuming your data the same information about your dataset is also crucial so that they can use and attribute it appropriately in turn. So include a description of your dataset in the graphs you publish which tells people the license under which the data can be used.
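As an illustration, here is a small Python sketch that builds such a dataset description as Turtle. It assumes the VoID and Dublin Core Terms vocabularies are a reasonable fit for this; the function name `describe_dataset`, the example.org dataset and license URIs, and the title are placeholders, not anything OCLC publishes.

```python
def describe_dataset(dataset_uri, title, license_uri, sources=()):
    """Return a Turtle description of a dataset, its license and its sources."""
    # Escape backslashes and double quotes so the title is a valid literal.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    statements = [
        "a void:Dataset",
        f'dcterms:title "{escaped}"',
        f"dcterms:license <{license_uri}>",
    ]
    if sources:
        # Point back at the datasets this one was built from.
        statements.append(
            "dcterms:source " + ", ".join(f"<{s}>" for s in sources)
        )
    body = " ;\n    ".join(statements)
    return (
        "@prefix void: <http://rdfs.org/ns/void#> .\n"
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n\n"
        f"<{dataset_uri}>\n    {body} .\n"
    )


ttl = describe_dataset(
    "http://example.org/dataset/people",      # placeholder dataset URI
    "People mash-up",                         # placeholder title
    "http://example.org/license",             # placeholder license URI
    sources=["http://purl.oclc.org/dataset/WorldCat"],
)
print(ttl)
```

Each resource in your published graph can then point at this description with a void:inDataset statement, the same pattern WorldCat uses above.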
We've discussed how RDFa gives libraries as data publishers the opportunity to share their data in a structured way, using HTML without additional weighty infrastructure, and how useful that data can be to others in creating their own graphs. But what if your library has data that is published via APIs or exists in databases? Is RDFa really the best option? In the next post in this series we'll look at another lightweight option for providing additional Linked Data serializations - JSON-LD.
Are you working on a Linked Data project? The OCLC Developer Network community would love to hear about it. Share your experiences - and questions - on wc-devnet-l (subscribe here).
Senior Product Analyst