Multilingual Bibliographic Structure

The Multilingual Bibliographic Structure activity is designed to leverage the multilingual content of WorldCat, the world's largest network of library content and services, so that bibliographic information can be presented in the preferred language and script of the user. This project is experimenting with mining the data from bibliographic records representing works and their translations, parsing the data elements into linked data statements (with the language of each textual data element), and filling some gaps with data from Wikidata.

OCLC is also generating work-translation ("expression level") records—including the translated title and translator with links to the original work and the author—and adding them to VIAF (Virtual International Authority File), flagged as "xR".

Presentations and blog posts

  • Using the Semantic Web to Improve Knowledge of Translations
    Presentation (.pptx) | 26 October 2017
    by Karen Smith-Yoshimura
    How integrating information about translations from both WorldCat and Wikidata may improve our ability to present frequently-translated works in the preferred language and script of the user.
    Presented at the 2017 International Conference on Dublin Core and Metadata Applications, Washington, D.C.

Impact

Identifying the records that represent translations will make it possible to present a work in the user's preferred language, where available. This work will also give us a better understanding of the extent to which information is shared across cultures, e.g., the percentage of non-English works that are translations of English works, and vice versa.

Work-translation xR records continue to be added to VIAF.

As of June 2018, more than two million xR records have been added to VIAF, representing:

  • 837,000 works
  • 1,174,000 translations

575,000 of the translations have translators associated with them.

The number of "expressions" for certain popular works has increased substantially. For example, Jane Austen's Pride and Prejudice now has 50 translations represented compared to 13 before the xR records were added; Sense and Sensibility increased from 13 to 35.

The works associated with popular Japanese mystery writer Higashino Keigo increased from 13 to 47 and translations of his works in Chinese and Korean are represented in VIAF for the first time.

In 2018, as part of the OCLC Research Linked Data Wikibase Prototype, we loaded data from WorldCat and the VIAF “expression” records that matched entries in Wikidata into Wikibase, a collection of applications to store, manage, and discover structured data. The WorldCat data is enriched by Wikidata information that the bibliographic records lack, such as titles in the original script (instead of romanized-only data in WorldCat) and first date of publication. This is possible for only a small “short head” of titles, however. We are also investigating automating the ingest of works and their translations from WorldCat with a “classifier” algorithm that can parse all data elements within a MARC record to determine whether the record represents a translation. The Wikibase applications also provide visualization tools that give an overview of any work and its translations and help improve data quality by highlighting outliers.

Background

More than half of the 400 million bibliographic records in WorldCat represent resources in languages other than English. These records are clustered together in worksets. Each workset may include multiple bibliographic records for the same title with data elements represented in different languages of cataloging, that is, the language of the metadata used to describe the resource. This metadata, such as notes and subject headings, is supplied by catalogers rather than transcribed from the resource. Since OCLC member institutions span the world, WorldCat records include many different languages of cataloging. For any one resource, there may be multiple records with summaries, subject headings, and notes in various languages and scripts.

Although WorldCat.org offers interfaces in different languages and scripts, the displayed bibliographic content is based on only one record from the workset. Bibliographic records vary in their detail and coding accuracy. Given the years of cooperation among libraries, it is common to see "hybrid" records where information has been added in different languages, e.g., subject headings in multiple languages. The combined information within a workset can be very rich, but it is not fully exploited because WorldCat.org currently presents content from only a single record. Thus, even if a user has selected a preferred interface language, the WorldCat records displayed within the interface are unlikely to be in that user's preferred language and script.

This activity focuses on translations, the means by which the most valued corpus of the world's cultural and knowledge heritage is shared. The Multilingual Bibliographic Structure project has mined WorldCat's bibliographic records for translated works, with the goal of improving work clustering, presentation, and linked data representations, and of contributing more generally to global knowledge. Since a work may have several translations in the same language, we are also parsing the WorldCat records to identify translators. OCLC is generating work-translation ("expression level") records, which include the translated title and translator with links to the original work and the author, and adding them to VIAF™ (Virtual International Authority File), flagged as "xR". At the same time, OCLC Research is working with the Linked Data Wikibase Prototype pilot partners on developing best practices for representing the relationship of each work with its associated translations and translators as linked data statements and properties that can be shared in the Semantic Web.
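The work-translation relationship described above can be pictured with a small data model. The following is only an illustrative sketch in Python, not the structure OCLC uses in VIAF or Wikibase: the class names, field names, and the `title_for` helper are all assumptions made for this example, and the sample data is illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Translation:
    """One 'expression' of a work: a translated title tagged with its language."""
    title: str
    language: str                     # language tag such as "fr" or "ja" (assumed convention)
    translator: Optional[str] = None  # many records carry no translator


@dataclass
class Work:
    """An original work linked to its author and its known translations."""
    original_title: str
    original_language: str
    author: str
    translations: List[Translation] = field(default_factory=list)


def title_for(work: Work, preferred_language: str) -> str:
    """Return the title in the preferred language, falling back to the original."""
    for translation in work.translations:
        if translation.language == preferred_language:
            return translation.title
    return work.original_title


# Illustrative sample data only.
work = Work(
    original_title="Pride and Prejudice",
    original_language="en",
    author="Austen, Jane",
    translations=[Translation(title="Orgueil et préjugés", language="fr")],
)
print(title_for(work, "fr"))  # Orgueil et préjugés
print(title_for(work, "ko"))  # Pride and Prejudice (no Korean title in the sample)
```

Falling back to the original title is one possible design choice; a real display layer might instead prefer a translation in a related language or script.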

Details

A small group of authors is responsible for the translated works with the most editions and the most holdings in WorldCat. Only one million persons are associated with titles in more than one language; only 7,000 names are associated with titles in 10 or more languages. Focusing on this relatively small group of titles is expected to have the most impact on users looking for well-known works in a given language.

The effort to associate translations with the original works includes overcoming variations in cataloging practices such as: titles with (or missing) subtitles; different forms of uniform title or missing uniform titles; different transliteration schemes, or errors in the transliteration; inverted titles; incorrect coding of the language of the work.
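Some of these variations can be reduced by comparing titles on a normalized key rather than on the raw string. The sketch below is a hypothetical illustration in Python, not OCLC's matching algorithm: the `title_key` function and the short `ARTICLES` list are assumptions, and the sketch handles only subtitles, diacritics, punctuation, and inverted articles. It does not address differing transliteration schemes, missing uniform titles, or miscoded languages.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of articles; a real list would be
# language-aware and much longer.
ARTICLES = {"the", "a", "an", "le", "la", "les", "der", "die", "das", "el", "los", "las"}


def title_key(title: str) -> str:
    """Build a rough comparison key for a title string."""
    # Keep only the main title: drop anything after the first colon (the subtitle).
    main = title.split(":", 1)[0]
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", main)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, and split into words.
    words = re.sub(r"[^\w\s]", " ", stripped.lower()).split()
    # Drop an article at the start ("The ...") or at the end of an
    # inverted title ("..., The").
    if words and words[0] in ARTICLES:
        words = words[1:]
    elif words and words[-1] in ARTICLES:
        words = words[:-1]
    return " ".join(words)


print(title_key("Les Misérables"))                 # miserables
print(title_key("Misérables, Les"))                # miserables
print(title_key("Pride and prejudice : a novel"))  # pride and prejudice
```

A key like this only narrows the candidates; deciding whether two records really describe the same work would still need other evidence, such as the author and the coded language.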

Many records for translations do not have an added entry for the translator; records that do have added entries often lack a role designator (neither a $4 nor a $e) to indicate that the entry is for a translator. OCLC parsed records' statements of responsibility (the 245 $c) for strings in different languages that mean "translator."
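The two checks described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not OCLC's parser: the cue list in `TRANSLATOR_CUES` is a tiny hypothetical sample, the function names are invented for this example, and an added entry is represented here simply as a dictionary of subfield codes to values. The relator code "trl" is the MARC code for translator.

```python
import re

# Hypothetical, very small sample of phrases that introduce a translator in a
# statement of responsibility; real coverage would need many more languages.
TRANSLATOR_CUES = [
    r"translated (?:from the \w+ )?by",
    r"traduit (?:de l'\w+ |du \w+ )?par",
    r"übersetzt von",
    r"traducción de",
    r"traduzione di",
]

# A cue, then whitespace, then the name up to the next ";" or "/" separator.
CUE_PATTERN = re.compile(
    r"(?:%s)\s+(?P<name>[^;/]+)" % "|".join(TRANSLATOR_CUES),
    re.IGNORECASE,
)


def is_translator_entry(subfields: dict) -> bool:
    """True if an added entry carries a translator role in $4 or $e."""
    code = subfields.get("4", "").strip().lower()
    # Only the English relator term is checked in this sketch.
    term = subfields.get("e", "").strip(" .,").lower()
    return code == "trl" or term == "translator"


def find_translator(statement_of_responsibility: str):
    """Return the name following a translator cue in a 245 $c string, or None."""
    match = CUE_PATTERN.search(statement_of_responsibility)
    if match is None:
        return None
    return match.group("name").strip(" .,")


# Placeholder names, for illustration only.
print(is_translator_entry({"a": "Doe, Jane", "4": "trl"}))             # True
print(find_translator("Jane Austen ; translated by Jane Doe."))        # Jane Doe
print(find_translator("Jane Austen ; traduit de l'anglais par J. Doe ; préface de X"))  # J. Doe
print(find_translator("by Jane Austen."))                              # None
```

A string match of this kind returns the name as transcribed, so linking it to an authority record would be a separate step.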