Please note: This research project has concluded.
Information on this page is provided for historical purposes only. Some of this content may be out of date and may include broken links. Please visit the OCLC Research website to learn more about our current research.

Pears Indexing Classes

An index routine is declared as one of the parameter settings for an index definition within a Pears database description configuration file. It determines how the OCLC SiteSearch Pears software extracts index terms from input data and how the software acts on that data (e.g., handling punctuation, extracting codes). Index routines create index terms in one of two basic formats: keyword, where each word is its own index entry, or phrase, where the contents of an entire field form a single index entry. Pears provides a wide range of index routines for both keyword and phrase indexes. The Words and Phrase routines are the most commonly used.
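The difference between the two formats can be illustrated with a minimal sketch. This is not Pears code: the function names are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
def keyword_terms(field):
    """Keyword format: each word in the field becomes its own index entry."""
    return field.split()


def phrase_terms(field):
    """Phrase format: the whole field becomes one index entry.

    Runs of whitespace are collapsed to single spaces (an assumption).
    """
    return [" ".join(field.split())]


title = "The Art of Computer Programming"
print(keyword_terms(title))  # five separate entries
print(phrase_terms(title))   # one entry holding the whole title
```

A keyword index of this title would answer a search for any single word, while a phrase index would match only the complete title.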

Routines

The following table lists the current index routines used by Pears to extract terms from input data to build indexes:

Routine	Description
Words Routines
`ORG.oclc.pears.Words`	Extends Phrase and extracts and stores individual terms from fields in a record. The parameters listed after this table define more specifically how the Words routine extracts terms to build an index.
`ORG.oclc.pears.PluralWords`	Extends Words to stem plural endings from terms as they are extracted so that only the singular form of the term is stored in the index
`ORG.oclc.pears.StopwordEnforcer`	Ensures that stop words are not stored as terms in an index
`ORG.oclc.pears.SmartWords`	Extends PluralWords to ensure that terms are greater than two characters in length
`ORG.oclc.pears.WordsMinusBoundPhrases`	Allows you to declare opening and closing boundaries (such as quotation marks) to identify data within a phrase that is to be ignored during the extraction process. Note: When using this indexing routine, you must also use the `bounds` parameter within the index definition. Values for the `bounds` parameter must be declared in character pairs, for example `bounds = ""`. In this example, anything between double quotes would not be indexed.
Phrase Routines
`ORG.oclc.pears.Phrase`	Creates simple bound phrases by extracting the contents of a field as a single index term. The parameters listed after this table define more specifically how the Phrase routine extracts terms to build an index.
MARC Routines
`ORG.oclc.pears.MarcBibliographicLevel`	Extends Words to find the bibliographic level byte in the leader string of a MARC record and generates an index term based on the code that it finds there
`ORG.oclc.pears.MarcFormat`	Extends Words to find the record type and bibliographic level bytes in the leader string of a MARC record and generates an index term based on the codes that it finds in those two places. The parameter listed after this table defines more specifically how the MarcFormat routine functions.
`ORG.oclc.pears.MarcTypeOfMaterial`	Extends Words to find the type of material byte in the leader string of a MARC record and generates an index term based on what it finds there. The parameter listed after this table defines more specifically how the MarcTypeOfMaterial routine functions.
Number Routines
`ORG.oclc.pears.Numbers`	Extends the Words routine and extracts only digit strings
`ORG.oclc.pears.LCCardNumber`	Extends Phrase to convert the LC card number field in a MARC record into a searchable term
Date Routines
`ORG.oclc.pears.PublicationDate`	Extends the Words routine to extract and normalize the publication date field in a MARC record
Language Routines
`ORG.oclc.pears.ISO639Language`	Works with the HandleChinaMarc record handling routine to change Chinese two-character language codes into their English equivalent search terms
`ORG.oclc.pears.MarcLanguage`	Extends Words to convert the MARC three-letter language codes into English equivalent search terms. The parameter listed after this table defines more specifically how the MarcLanguage routine functions.
Miscellaneous Routines
`ORG.oclc.pears.IndexRoutines`	Abstract class that contains base methods for extracting index terms. Note: The Phrase routine implements IndexRoutines and all other Pears indexing routines extend Phrase.
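The way the Words-based routines build on one another can be pictured as a chain of small filters. The sketch below is an illustration only, not the Pears implementation: the tokenizer, the stop word list, and the plural stemming rules are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical stop word list; a real database would configure its own.
STOPWORDS = {"a", "an", "and", "in", "of", "the"}


def words(field):
    """Stand-in for Words: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", field.lower())


def numbers(field):
    """Stand-in for Numbers: keep only the terms that are digit strings."""
    return [term for term in words(field) if term.isdigit()]


def singular(term):
    """Naive plural stemming (an assumption, not the Pears algorithm)."""
    if term.endswith("ies") and len(term) > 4:
        return term[:-3] + "y"
    if term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        return term[:-1]
    return term


def plural_words(field):
    """Stand-in for PluralWords: store only singular forms."""
    return [singular(term) for term in words(field)]


def smart_words(field):
    """Stand-in for SmartWords: singular forms longer than two characters."""
    return [term for term in plural_words(field) if len(term) > 2]


def drop_stopwords(terms):
    """Stand-in for StopwordEnforcer: never store a stop word."""
    return [term for term in terms if term not in STOPWORDS]


text = "The histories of the books in a class"
print(drop_stopwords(smart_words(text)))
print(numbers("Published 1968 in 2 volumes"))
```

Each function reuses the one before it, which mirrors the "extends" relationships described in the table.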

Parameters (Words routine)
`delimiters=\t\n\r+-=<>(){}[]:;/\\\"!?`
`extraDelimiters=`
`removeDelimiters=`
`minWordLength=`
`maxWordLength=`
`maxWords=`
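The page does not describe these parameters individually, so the following sketch rests on what their names suggest. It assumes that `extraDelimiters` adds characters to the default set, that `removeDelimiters` takes characters out of it, that words shorter than `minWordLength` or longer than `maxWordLength` are skipped, that `maxWords` caps the number of terms taken from a field, and that a space always separates words. None of this is confirmed Pears behavior.

```python
# Default delimiter set copied from the `delimiters` parameter above.
DEFAULT_DELIMITERS = "\t\n\r+-=<>(){}[]:;/\\\"!?"


def extract_words(field, extra_delimiters="", remove_delimiters="",
                  min_word_length=1, max_word_length=None, max_words=None):
    """Split a field into words using a configurable delimiter set."""
    # Assumption: the space character is a delimiter even though it is not listed.
    delimiters = set(DEFAULT_DELIMITERS) | {" "} | set(extra_delimiters)
    delimiters -= set(remove_delimiters)

    terms = []
    current = ""
    for ch in field:
        if ch in delimiters:
            if current:
                terms.append(current)
            current = ""
        else:
            current += ch
    if current:
        terms.append(current)

    # Assumption: words outside the length limits are skipped, not truncated.
    kept = [
        term for term in terms
        if len(term) >= min_word_length
        and (max_word_length is None or len(term) <= max_word_length)
    ]
    return kept if max_words is None else kept[:max_words]


print(extract_words("Title: sub-title/part (one)?"))
print(extract_words("Title: sub-title/part (one)?", remove_delimiters="-"))
```

Removing the hyphen from the delimiter set keeps `sub-title` together as one term, which is the kind of adjustment these parameters appear to be meant for.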

Parameters (WordsMinusBoundPhrases routine)
`bounds=`
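A rough sketch of how paired boundary characters could be used to skip bounded text before words are extracted. It assumes that `bounds` is read two characters at a time as (open, close) pairs, and that an opening boundary with no matching close hides the rest of the field. Both points are assumptions, and the function names are hypothetical.

```python
def strip_bounded(text, bounds):
    """Remove everything between each open/close boundary pair."""
    if len(bounds) % 2 != 0:
        raise ValueError("bounds must be declared in character pairs")
    # Map each opening character to its closing character.
    closers = {bounds[i]: bounds[i + 1] for i in range(0, len(bounds), 2)}

    kept = []
    closing = None  # the close character being waited for, if any
    for ch in text:
        if closing is None:
            if ch in closers:
                closing = closers[ch]
            else:
                kept.append(ch)
        elif ch == closing:
            closing = None
    return "".join(kept)


def words_minus_bound_phrases(text, bounds):
    """Extract whitespace-separated words outside the bounded spans."""
    return strip_bounded(text, bounds).split()


print(words_minus_bound_phrases('He said "ignore this" loudly', '""'))
```

With `bounds` set to a pair of double quotes, the quoted span contributes no terms to the index.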

Parameter	Description
`Collapse=<list of characters>`	Removes any of the characters in the list from the field
`ExtraTrimChars=<list of characters>`	Adds the list of characters to the default list of TrimChars for the current index only
`TrimChars=<list of characters>`	Removes any of the characters in the list from the beginning or end of the field (default set: ' & . , : *)
`MaxLength=<number>`	Shortens the field to the specified number of characters
`StartOffset=<number>`	Ignores the first specified number of characters in the field. Note: The offset is performed before any other trim or collapse rules are applied.
`ExtraIndex=<index ID>`	Sends any terms extracted for this index to the specified index ID as well
`indicator1=<list of characters>`	Requires that indicator1 for this field have a value from the specified list of characters. Note: This can be used only with MARC-like records.
`indicator2=<list of characters>`	Requires that indicator2 for this field have a value from the specified list of characters. Note: This can be used only with MARC-like records.
`indicators=<list of character pairs>`	Requires that the two indicators have a value from the specified list of character pairs. Note: This can be used only with MARC-like records.
`notIndicator1=<list of characters>`	Requires that indicator1 for this field not have a value from the specified list of characters. Note: This can be used only with MARC-like records.
`notIndicator2=<list of characters>`	Requires that indicator2 for this field not have a value from the specified list of characters. Note: This can be used only with MARC-like records.
`notIndicators=<list of character pairs>`	Requires that the two indicators not have a value from the specified list of character pairs. Note: This can be used only with MARC-like records.
`NonFilingIndicator1=true`	Value of the first indicator determines the number of characters to remove from the beginning of the field
`NonFilingIndicator2=true`	Value of the second indicator determines the number of characters to remove from the beginning of the field
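The six indicator parameters amount to an accept or reject test on a field. A minimal sketch of that test follows. The dictionary keys reuse the parameter names from the table, but the data shapes are assumptions: single-character lists are modeled as strings and pair lists as Python lists of two-character strings, since the page does not show how the lists are written in a configuration file.

```python
def passes_indicator_rules(ind1, ind2, rules):
    """Return True if a field's two indicators satisfy every rule given."""
    pair = ind1 + ind2
    if "indicator1" in rules and ind1 not in rules["indicator1"]:
        return False
    if "indicator2" in rules and ind2 not in rules["indicator2"]:
        return False
    if "indicators" in rules and pair not in rules["indicators"]:
        return False
    if "notIndicator1" in rules and ind1 in rules["notIndicator1"]:
        return False
    if "notIndicator2" in rules and ind2 in rules["notIndicator2"]:
        return False
    if "notIndicators" in rules and pair in rules["notIndicators"]:
        return False
    return True


rules = {"indicator1": "01", "notIndicator2": "9"}
print(passes_indicator_rules("1", "0", rules))  # indicator1 allowed, indicator2 not excluded
print(passes_indicator_rules("1", "9", rules))  # rejected by notIndicator2
```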

Example: Since titles often have a trailing slash that needs to be removed...

[title] index=1 routine=ORG.oclc.pears.IndexRoutines.Phrase tagpath=245/1 extratrimchars=/ nonFilingIndicator2=true
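The effect of the trimming parameters on a title like the one in this example can be sketched as follows. The table states only that the offset is applied before any trim or collapse rule. The rest of the ordering here (non-filing characters, then collapse, then trim, then maximum length) and the trimming of surrounding spaces are assumptions made for illustration.

```python
# Default trim characters as listed in the table: ' & . , : *
DEFAULT_TRIM_CHARS = "'&.,:*"


def build_phrase(field, start_offset=0, non_filing_chars=0, collapse="",
                 extra_trim_chars="", max_length=None):
    """Turn one field into a single phrase term."""
    text = field[start_offset:]      # StartOffset is applied first
    text = text[non_filing_chars:]   # count taken from a non-filing indicator
    if collapse:
        text = "".join(ch for ch in text if ch not in collapse)
    # Assumption: spaces are trimmed along with the trim characters.
    text = text.strip(DEFAULT_TRIM_CHARS + extra_trim_chars + " ")
    if max_length is not None:
        text = text[:max_length]
    return text


# A title whose second indicator is 4 ("The " is skipped for filing).
print(build_phrase("The art of programming /",
                   non_filing_chars=4, extra_trim_chars="/"))
```

Here the non-filing count removes the leading article and the extra trim character removes the trailing slash, leaving only the title text as the index term.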

Bib Level Code	Type of Material	Index Term Returned
a	analytic monograph	analytic
b	analytic serial	analytic
m	monograph	monograph
s	serial	serial
c	collection	collection
d	subunit	subunit

Example: [BibLevel] index=1 routine=ORG.oclc.pears.IndexRoutines.MarcBibliographicLevel tagpath=0 startOffset=1
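The table above is in effect a lookup from one leader byte to an index term. A minimal sketch, assuming the standard MARC 21 layout in which the bibliographic level is leader position 07 (counting from zero). The function name is hypothetical, and returning `None` for an unlisted code is an assumption.

```python
# Code-to-term mapping taken from the table above.
BIB_LEVEL_TERMS = {
    "a": "analytic",
    "b": "analytic",
    "m": "monograph",
    "s": "serial",
    "c": "collection",
    "d": "subunit",
}


def bib_level_term(leader):
    """Return the index term for the bibliographic level byte of a leader."""
    if len(leader) < 8:
        return None
    return BIB_LEVEL_TERMS.get(leader[7])


print(bib_level_term("00714cam a2200205 a 4500"))  # leader of a monograph
print(bib_level_term("00000nas a2200000 a 4500"))  # leader of a serial
```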

Record Type	Bibliographic Level Code	Abbreviation	Type of Material
a, t	m, c, a, d	bks	Books
e, f	any	map	Maps
p, b	any	mix	Mixed Materials
m	any	com	Computer Files
c, d	any	sco	Scores
any	s, b	ser	Serials
i, j	any	rec	Sound Recordings
g, k, o, r	any	vis	Visual Material

Parameter	Description
`DebugMarcFormat=`	Turns on internal debugging

Example: To extract material type from a MARC leader . . .

[format] index=1 routine=ORG.oclc.pears.IndexRoutines.MarcFormat tagpath=0 startOffset=1
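Because the format depends on two leader bytes, the table can be read as an ordered list of rules. The sketch below assumes the standard MARC 21 positions (record type at 06, bibliographic level at 07) and assumes that the rows are tried in the order shown in the table, with the first match winning. The page does not say how overlapping rows are resolved, so that ordering is a guess.

```python
# (record types, bibliographic levels, abbreviation); None means "any".
FORMAT_RULES = [
    ("at", "mcad", "bks"),
    ("ef", None, "map"),
    ("pb", None, "mix"),
    ("m", None, "com"),
    ("cd", None, "sco"),
    (None, "sb", "ser"),
    ("ij", None, "rec"),
    ("gkor", None, "vis"),
]


def marc_format(leader):
    """Return the format abbreviation for a MARC leader, or None."""
    if len(leader) < 8:
        return None
    record_type, bib_level = leader[6], leader[7]
    for types, levels, abbreviation in FORMAT_RULES:
        type_ok = types is None or record_type in types
        level_ok = levels is None or bib_level in levels
        if type_ok and level_ok:
            return abbreviation
    return None


print(marc_format("00714cam a2200205 a 4500"))  # language material, monograph
print(marc_format("00000nas a2200000 a 4500"))  # language material, serial
```

Under this reading, language material with a serial bibliographic level falls through the Books row and is picked up by the Serials row.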

Type Code	Abbreviation	Type of Material
a, t	bks	Books
e, f	map	Maps
p	mix	Mixed Materials
m	com	Computer Files
c, d	sco	Scores
s	ser	Serials
i, j	rec	Sound Recordings
g, k, o, r	vis	Visual Materials

Parameter	Description
`DebugMarcTypeOfMaterial=`	Turns on internal debugging

Example: To extract material type from MARC 006 . . .

[materialtype] index=1 routine=ORG.oclc.pears.IndexRoutines.MarcTypeOfMaterial tagpath=6

Parameter	Description
`DebugMarcLanguage=`	Turns on internal debugging

Example: To extract language from the MARC 008 field . . .

[language] index=1 routine=ORG.oclc.pears.IndexRoutines.MarcLanguage tagpath=8 startOffset=35
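The offset of 35 in this example matches the position of the three-letter language code in a MARC 008 field (positions 35 to 37). A minimal sketch of the conversion follows. The mapping holds only a few entries from the MARC language code list, and the exact form of the search term that Pears stores is not documented here, so the English names used as terms are an assumption.

```python
# A small excerpt of MARC three-letter language codes (not the full list).
LANGUAGE_TERMS = {
    "chi": "Chinese",
    "eng": "English",
    "fre": "French",
    "ger": "German",
    "spa": "Spanish",
}


def language_term(field_008, start_offset=35):
    """Return the English language name for the code in an 008 field."""
    code = field_008[start_offset:start_offset + 3]
    return LANGUAGE_TERMS.get(code.lower())


# Build a 40-character 008 field with "eng" at positions 35 to 37.
field_008 = "850101s1985    nyu".ljust(35) + "eng" + " d"
print(language_term(field_008))
```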
