Scorpion database design

General concepts

The Scorpion database consists of an unordered set of concepts that are useful for classifying documents. Each concept is defined in a database record. Following the familiar vector-space model of information retrieval (Salton and Buckley 1988), documents are submitted as queries against the database. The query returns a list of database records that contain terms found in the document, ranked in order of their similarity to the document. The result can be interpreted as a prioritized list of concepts that roughly characterize the content of the document.
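The ranking step described above can be sketched in a few lines of Python. This is only an illustration of the general vector-space idea, assuming raw term-frequency vectors and cosine similarity; the weighting that Scorpion itself applies is not specified here, and the function names and toy records are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_concepts(document, records):
    """Return (concept, score) pairs for records that share terms
    with the document, ordered from most to least similar."""
    query = Counter(tokenize(document))
    scored = [
        (name, cosine(query, Counter(tokenize(terms))))
        for name, terms in records.items()
    ]
    matches = [(name, score) for name, score in scored if score > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy records: concept name -> terms describing the concept.
records = {
    "Robots": "robots robotics evolutionary robotics parallel robots",
    "Data processing": "data processing computers",
    "Physics": "physics nuclear particle",
}

# The document is the query; the result is a prioritized list of concepts.
ranking = rank_concepts("Parallel robots and robotics in industry", records)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Records that share no terms with the document are dropped, so the output contains only candidate concepts, best match first.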

In our studies, we have created Scorpion databases from two library classification schemes: the Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC). We include a test database derived from a portion of the LCC in this installation and refer to it here to illustrate a simple Scorpion database design.

However, Scorpion does not require the DDC or LCC database. Users of our Open Source software might have access to other classification schemes or subject thesauri that have been created from scratch or have been modified from existing resources. The major requirement for an effective Scorpion database is a set of concepts that cover the subjects of interest and are distinct from one another. Descriptions of our efforts to adapt the DDC and LCC to satisfy this requirement are found in Thompson, et al. (1996) and Godby and Stuler (2001).

Record structure

As shown in the file lccSample.sgml, the Scorpion database has five fields, properly nested in SGML tags:

The name of the concept.
A number that locates the concept in the classification scheme.
Terms that define the concept.
Terms that have been statistically mapped to the concept.
The upward hierarchy in the classification scheme that contains the concept.

The Scorpion record structure is easily illustrated with reference to a DDC concept. In Edition 21 of the DDC (Chan, et al. 1996), Robots is the caption for the concept 629.892, which is a leaf node in the hierarchy:

                    
                    
    600 Technology (Applied sciences)
        620 Engineering and allied operations
            629 (Other branches of engineering)
                629.89 Computer control
                    629.892 Robots

The DDC concept Robots shown here also has two sources of related terms. The first set contains the index terms assigned by the Dewey editors, such as robotics, evolutionary robotics and parallel robots. In the Scorpion database, we have assigned these terms to the field that holds the terms defining the concept. The second set contains terms that have been statistically mapped to the concept using an external knowledge source; these are stored in a field of their own. Both sources of terms are valuable for defining the concept for the purposes of automatic classification, but we keep them separate because they vary in reliability. See Vizine-Goetz (2001) for more discussion of this issue.

In sum, the Scorpion database represents a single concept in the fields described above, as shown:


Robots
629.892
robotics, evolutionary robotics, parallel robots
industrial robots; robotics
Technology (Applied sciences); Engineering and allied operations; (Other branches of engineering); Computer control
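
As a rough illustration, the Robots record above could be held in memory as follows. The class and attribute names are hypothetical and are not the SGML tag names used in lccSample.sgml; the sketch only mirrors the five kinds of data listed earlier and keeps the two term sources separate, as the text recommends.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConceptRecord:
    """One concept record; attribute names are illustrative only."""
    name: str                      # the name (caption) of the concept
    number: str                    # location in the classification scheme
    editorial_terms: List[str] = field(default_factory=list)  # terms that define the concept
    mapped_terms: List[str] = field(default_factory=list)     # statistically mapped terms
    hierarchy: List[str] = field(default_factory=list)        # upward hierarchy, top first

    def all_terms(self, include_mapped=True):
        """Return the defining terms, optionally followed by any
        statistically mapped terms not already present."""
        terms = list(self.editorial_terms)
        if include_mapped:
            terms += [t for t in self.mapped_terms if t not in terms]
        return terms


robots = ConceptRecord(
    name="Robots",
    number="629.892",
    editorial_terms=["robotics", "evolutionary robotics", "parallel robots"],
    mapped_terms=["industrial robots", "robotics"],
    hierarchy=[
        "Technology (Applied sciences)",
        "Engineering and allied operations",
        "(Other branches of engineering)",
        "Computer control",
    ],
)

print(robots.all_terms())
print(robots.all_terms(include_mapped=False))
```

Keeping the two term lists as separate attributes lets a caller decide whether to include the statistically mapped terms, since the two sources vary in reliability.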

The fit between the Scorpion record structure and the native LCC record structure is slightly more abstract. Though the LCC has fields that contain data that correspond to the , and tags, the tag doesn't have a direct correspondence and must be constructed. In our definition, the data enclosed in the tags consists of the concept name plus all of the terms found in the field. In this form, the field provides the complete context for interpreting the concept names, which may have ambiguous meanings because they are relevant to several subjects. For example, Data processing has a "physics" as well as a "biology" sense because it is the lowest node in both of the following hierarchies:

                    
                    
    Physics
        Nuclear and particle physics
            Atomic energy. Radioactivity
                Data processing

and

                    
                    
    Biology (General)
        Methods of research
            Technique
                Data processing
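
The effect of adding context to an ambiguous concept name can be sketched as follows. This assumes that the constructed field combines the concept name with the captions of its upward hierarchy, which is one reading of the description above; the function name is hypothetical.

```python
def context_terms(name, hierarchy):
    """Combine a concept name with its upward hierarchy captions.

    Assumption: the hierarchy captions are what supply the context
    needed to interpret an ambiguous concept name.
    """
    return [name] + list(hierarchy)


physics_dp = context_terms(
    "Data processing",
    ["Physics", "Nuclear and particle physics", "Atomic energy. Radioactivity"],
)
biology_dp = context_terms(
    "Data processing",
    ["Biology (General)", "Methods of research", "Technique"],
)

# The two concepts share a name but differ in every other term.
shared = set(physics_dp) & set(biology_dp)
only_physics = set(physics_dp) - set(biology_dp)
only_biology = set(biology_dp) - set(physics_dp)

print(sorted(shared))
print(sorted(only_physics))
print(sorted(only_biology))
```

With the name alone, the two records would be indistinguishable to a query; with the hierarchy captions included, a document about nuclear physics and a document about biological research methods can match different records.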
                  
                  
  

A new Scorpion database can be created by imitating the structure of lccSample.sgml with data of your choice. See Building the Database for information about how to create the database when your SGML file is finished.

References

Godby, C.J., Stuler, J. The Library of Congress Classification as a Subject Base for Automatic Classification. Presented at the IFLA Preconference "Subject Retrieval in a Networked Environment," Dublin, Ohio, August 2001. Accessible at: http://staff.oclc.org/~godby/auto_class/godby-ifla.html

Chan, L.M., Comaromi, J.P., Mitchell, J.S., Satija, M.P. Dewey Decimal Classification: A Practical Guide, 2nd edition. OCLC Forest Press, Albany NY, 1996.

Salton, G., Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523, 1988.

Thompson, R., Shafer, K., Vizine-Goetz, D. Evaluating Dewey Concepts as a Knowledge Base for Automatic Subject Assignment. 1996. Accessible at: http://orc.rsch.oclc.org:6109/eval_dc.html
