Scorpion database design
General concepts
The Scorpion database consists of an unordered set of concepts that are useful for classifying documents. Each concept is defined in a database record. Following the familiar vector-space model of information retrieval (Salton and Buckley 1988), documents are submitted as queries against a database. The query returns a list of database records that contain terms found in the document, ranked in order of their similarity to the document. The result can be interpreted as a prioritized list of concepts that roughly characterize the content of the document.
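The ranking step described above can be sketched in a few lines of Python. This is a minimal illustration of the vector-space idea, not Scorpion's own code: it assumes raw term counts and cosine similarity, whereas the weighting Scorpion actually applies is not described here. The function names and the toy records are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_concepts(document, records):
    """Score every record against the document; keep matches, best first."""
    query = term_vector(document)
    scored = []
    for concept, terms in records.items():
        score = cosine(query, term_vector(terms))
        if score > 0:  # records sharing no term with the document are dropped
            scored.append((concept, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy records: concept name -> terms (illustrative data, not a real database).
records = {
    "Robots": "robots robotics evolutionary robotics parallel robots",
    "Computer control": "computer control automatic control systems",
    "Physics": "physics nuclear particle atomic energy radioactivity",
}

ranking = rank_concepts("A survey of parallel robots and robotics control", records)
for concept, score in ranking:
    print(f"{concept}: {score:.3f}")
```

The document is treated exactly like a query: it is turned into a term vector, compared with every record, and the records that share at least one term come back in descending order of similarity.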
In our studies, we have created Scorpion databases from two library classification schemes: the Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC). We include a test database derived from a portion of the LCC in this installation and refer to it here to illustrate a simple Scorpion database design.
However, Scorpion does not require the DDC or LCC database. Users of our Open Source software might have access to other classification schemes or subject thesauri that have been created from scratch or have been modified from existing resources. The major requirement for an effective Scorpion database is a set of concepts that cover the subjects of interest and are distinct from one another. Descriptions of our efforts to adapt the DDC and LCC to satisfy this requirement are found in Thompson, et al. (1996) and Godby and Stuler (2001).
Record structure
As shown in the file lccSample.sgml, each record in the Scorpion database has four fields, properly nested in SGML:
- The name of the concept.
- A number that locates the concept in the classification scheme.
- Terms that define the concept.
- Terms that have been statistically mapped to the concept.
- The upward hierarchy in the classification scheme that contains the concept.
The Scorpion record structure is easily illustrated with reference to a DDC concept. In Edition 21 of the DDC (Chan, et al. 1996), Robots is the caption for the concept 629.892, which is a leaf node in the hierarchy:
600 Technology (Applied sciences)
  620 Engineering and allied operations
    629 (Other branches of engineering)
      629.89 Computer control
        629.892 Robots
The DDC concept robots shown here also has two sources of related terms. The first set contains the index terms assigned by the Dewey editors, such as robotics, evolutionary robotics and parallel robots. In the Scorpion database, we have assigned these terms to the
In sum, the Scorpion database represents a single concept in the
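As a rough illustration, the DDC concept Robots could be held in memory as follows. This is only a sketch: the attribute names are hypothetical and are not the SGML tag names used in lccSample.sgml, and placing the editors' index terms among the defining terms is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConceptRecord:
    """One concept. Attribute names are hypothetical, not Scorpion's SGML tags."""
    name: str                      # the name (caption) of the concept
    number: str                    # locates the concept in the scheme
    defining_terms: List[str] = field(default_factory=list)
    mapped_terms: List[str] = field(default_factory=list)  # statistically mapped
    hierarchy: List[str] = field(default_factory=list)     # ancestors, top first

    def path(self, separator=" > "):
        """Join the upward hierarchy and the concept name into one string."""
        return separator.join(self.hierarchy + [self.name])


robots = ConceptRecord(
    name="Robots",
    number="629.892",
    # Assumption: the Dewey editors' index terms are stored as defining terms.
    defining_terms=["robotics", "evolutionary robotics", "parallel robots"],
    hierarchy=[
        "Technology (Applied sciences)",
        "Engineering and allied operations",
        "(Other branches of engineering)",
        "Computer control",
    ],
)

print(robots.number, robots.path())
```

Keeping the hierarchy as an ordered list, top level first, makes it easy to rebuild the full path of captions that leads down to the concept.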
The fit between the Scorpion record structure and the native LCC record structure is slightly more abstract. Though the LCC has fields that contain data that correspond to the
Physics Nuclear and particle physics Atomic energy. Radioactivity Data processing
and
Biology (General) Methods of research Technique Data processing
A new Scorpion database can be created by imitating the structure of lccSample.sgml with data of your choice. See Building the Database for information about how to create the database when your SGML file is finished.
References
Godby, C.J., Stuler, J., "The Library of Congress Classification as a Subject Base for Automatic Classification," presented at the IFLA Preconference "Subject Retrieval in a Networked Environment," Dublin, Ohio, August 2001. Accessible at: http://staff.oclc.org/~godby/auto_class/godby-ifla.html
Chan, L.M., Comaromi, J.P., Mitchell, J.S., Satija, M.P., Dewey Decimal Classification: A Practical Guide, 2nd edition, OCLC Forest Press, Albany NY, 1996.
Salton, G., Buckley, C., "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 24(5), 513-523, 1988.
Thompson, R., Shafer, K., Vizine-Goetz, D., "Evaluating Dewey Concepts as a Knowledge Base for Automatic Subject Assignment," 1996. Accessible at: http://orc.rsch.oclc.org:6109/eval_dc.html