The problem of automatically recognizing, extracting, and disambiguating named entities (e.g., the names of people, places, and organizations) in digitized text has received considerable attention from the library, computer science, and linguistics communities in the past five years. Name identification and extraction tools, particularly when integrated with an authority file, can make subject access to a document collection more reliable and make the collection more discoverable to end users.
The Extracting Metadata for Preservation (EMP) Project, funded by the National Digital Information Infrastructure and Preservation Program (NDIIPP), is addressing this ongoing challenge of identifying proper names to improve the accuracy of end-user information access via web-based search and retrieval.
EMP is a three-way collaboration among the University of Illinois at Urbana-Champaign, OCLC, and the University of Maryland. Its researchers bring multidisciplinary perspectives from the library, computer science, and linguistics communities to the problem of high-quality identification and disambiguation of names.
Our work has three goals:
- To advance the state of the art in automated name identification and disambiguation from unstructured text.
- To link the outputs of these programs to longstanding efforts in the library community to manage names and identities in the published record.
- To lower the barrier of access to sophisticated text-processing tools.
This project will:
- Contribute open source utilities that will enable software engineers to extract names from a digitized text repository and make its content more discoverable.
- Help bridge the gap between collections of unstructured text and structured metadata, which will enable many kinds of library applications to be linked.
- This project focuses on extracting names from unstructured text. Issues relating to document or resource collections that represent names enclosed in explicit descriptive markup are out of scope.
- The primary interest is in personal names. The extraction and resolution software also recognize organizational and geographic names, but these categories are not given the same careful attention.
- The most important outcome is a process for creating richer markup in unstructured text, which can be submitted to third-party processes that index, collocate, normalize, and display the data to the end user.
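To make the idea of "richer markup in unstructured text" concrete, the following is a minimal sketch, not the EMP tools themselves. It assumes a tiny hypothetical authority file (`AUTHORITY`) mapping name variants to an authorized heading and a made-up identifier, and it uses plain string matching rather than a trained name recognizer. The `tag_names` function and the `persName` element with `ref` and `key` attributes are illustrative choices, not the project's actual output format.

```python
import re
from html import escape

# Hypothetical authority file: name variant -> (authorized heading, identifier).
# The identifiers are made-up placeholders, not real authority record numbers.
AUTHORITY = {
    "Mark Twain": ("Twain, Mark, 1835-1910", "auth:0001"),
    "Samuel Clemens": ("Twain, Mark, 1835-1910", "auth:0001"),
    "Jane Austen": ("Austen, Jane, 1775-1817", "auth:0002"),
}


def tag_names(text, authority=AUTHORITY):
    """Wrap each known name variant in a persName element carrying its identifier."""
    if not authority:
        return escape(text)
    # Try longer variants first so a longer name wins over a shorter overlapping one.
    variants = sorted(authority, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b")

    pieces = []
    pos = 0
    for match in pattern.finditer(text):
        # Copy the untouched text before the match, escaping markup characters.
        pieces.append(escape(text[pos:match.start()]))
        heading, ident = authority[match.group(1)]
        pieces.append(
            '<persName ref="%s" key="%s">%s</persName>'
            % (ident, escape(heading), escape(match.group(1)))
        )
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(tag_names("Samuel Clemens wrote as Mark Twain."))
```

Because both variants resolve to the same identifier, a downstream indexer could collocate mentions of "Samuel Clemens" and "Mark Twain" under one heading. A real system would replace the dictionary lookup with statistical name recognition and context-based disambiguation.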
System building and evaluation.
- Open source software for a named entity resolution tool and an identity resolution tool that can be incorporated into digital library applications, accompanied by reproducible workflows for doing so.
- One or more demonstration projects showing links between structured and unstructured text.
Since this project focuses on infrastructure improvements, it is intended to benefit software developers who manage institutional repositories.
- Complete the requirements specified in the EMP project plan, which ends in December 2009.
- Apply the Named Entity Extractor and Identity Resolution tools to the problem of extracting book and author names from book and article citations and linking them to collections of structured metadata.
Presentations by the project team
- Name This! Automating Metadata Extraction through a Named Entity Recognition Tool. Digital Library Federation Spring Forum; Raleigh, North Carolina; May 6, 2009.
- Who’s Who in Your Digital Collection? Developing a Tool for Name Disambiguation and Identity Resolution. In preparation.
- Godby, Carol Jean, Patricia Hswe, Larry Jackson, Judith Klavans, Lev Ratinov, and Dan Roth. 2010. "Who’s Who in Your Digital Collection: Developing a Tool for Name Disambiguation and Identity Resolution." Proceedings of the Chicago Colloquium on Digital Humanities and Computer Science 1(2). Available online at: https://letterpress.uchicago.edu/index.php/jdhcs/article/view/58.
Funding from the National Digital Information Infrastructure and Preservation Program is gratefully acknowledged. http://www.digitalpreservation.gov/