
Please note: This research project has concluded.
Information on this page is provided for historical purposes only. Some of this content may be out of date and may include broken links. Please visit the OCLC Research website to learn more about our current research.

Pears Indexing Classes

An index routine is declared as one of the parameter settings for an index definition within a Pears database description configuration file. It determines how the OCLC SiteSearch Pears software extracts index terms from input data and how the software acts on that data (e.g., handling punctuation, extracting codes). Index routines create index terms in one of two basic formats: keyword, where each word is its own index entry, or phrase, where the contents of an entire field form a single index entry. Pears provides a wide range of index routines for both keyword and phrase indexes. The Words and Phrase routines are the most commonly used.
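
To make the two formats concrete, here is a minimal Python sketch. It is not Pears code: the function names, the lowercasing, and the rule of splitting on non-alphanumeric characters are assumptions chosen for illustration, and the actual Words and Phrase routines may behave differently depending on their parameters.

```python
import re


def keyword_terms(field: str) -> list[str]:
    """Keyword format: each word in the field becomes its own index entry."""
    # Assumption: words are runs of letters and digits; case is folded to lower.
    return [word.lower() for word in re.split(r"[^0-9A-Za-z]+", field) if word]


def phrase_terms(field: str) -> list[str]:
    """Phrase format: the contents of the whole field become one index entry."""
    # Assumption: internal whitespace is normalized to single spaces.
    return [" ".join(field.lower().split())]


title = "Gone with the wind"
print(keyword_terms(title))  # ['gone', 'with', 'the', 'wind']
print(phrase_terms(title))   # ['gone with the wind']
```

A keyword index built this way would match a search for any single word of the title, while the phrase index would match only the title as a whole.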

Routines

The following table lists the current index routines used by Pears to extract terms from input data to build indexes:

Routine Description
Words Routines
ORG.oclc.pears.Words

Extends Phrase and extracts and stores individual terms from fields in a record

The following table lists parameters that you can use with the Words routine to more specifically define how it extracts terms to build an index:

ORG.oclc.pears.PluralWords Extends Words to stem plural endings from terms as they are extracted so that only the singular form of the term is stored in the index
ORG.oclc.pears.StopwordEnforcer Ensures that stop words are not stored as terms in an index
ORG.oclc.pears.SmartWords Extends PluralWords to ensure that terms are greater than two characters in length
ORG.oclc.pears.WordsMinusBoundPhrases

Allows you to declare opening and closing boundary characters (such as quotation marks) that mark data within a phrase to be ignored during the extraction process

Note: When using this indexing routine, you must also set the bounds parameter within the index definition. Values for the bounds parameter must be declared in character pairs. For example, with bounds="" anything between double quotes is not indexed.

Phrase Routines
ORG.oclc.pears.Phrase

Creates simple bound phrases by extracting the contents of a field as a single index term

The following table lists parameters that you can use with the Phrase routine to more specifically define how it extracts terms to build an index:


MARC Routines
ORG.oclc.pears.MarcBibliographicLevel

Extends Words to find the bibliographic level byte in the leader string in a Marc record and generates an index term based on the code that it finds there


ORG.oclc.pears.MarcFormat

Extends Words to find the record type and bibliographic level bytes in the leader string in a Marc record and generates an index term based on the codes that it finds in those two places

The following is a parameter that you can use with the MarcFormat routine to more specifically define how it functions:

ORG.oclc.pears.MarcTypeOfMaterial

Extends Words to find the type of material byte in the leader string in a Marc record and generates an index term based upon what it finds there

The following is a parameter that you can use with the MarcTypeOfMaterial routine to more specifically define how it functions:

Number Routines
ORG.oclc.pears.Numbers Extends the Words routine and only extracts digit strings
ORG.oclc.pears.LCCardNumber Extends Phrase to convert the LCCard number field in a Marc record into a searchable term
Date Routines
ORG.oclc.pears.PublicationDate Extends the Words routine in order to extract and normalize the publication date field in a Marc record
Language Routines
ORG.oclc.pears.ISO639Language Works with the HandleChinaMarc record handling routine to change Chinese two-character language codes into their English equivalent search terms
ORG.oclc.pears.MarcLanguage

Extends Words to convert the Marc three-letter language codes into English equivalent search terms

The following is a parameter that you can use with the MarcLanguage routine to more specifically define how it functions:

Miscellaneous Routines
ORG.oclc.pears.IndexRoutines

Abstract class that contains base methods for extracting index terms

Note: The Phrase routine implements IndexRoutines and all other Pears indexing routines extend Phrase.


Parameters (Words routine)
delimiters=\t\n\r+-=<>(){}[]:;/\\\"!?
extraDelimiters=
removeDelimiters=
minWordLength=
maxWordLength=
maxWords=
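
This page lists the Words parameter names but does not describe them, so the following Python sketch is only one plausible reading. It assumes that extraDelimiters and removeDelimiters adjust the delimiter set, that whitespace always separates words, that words outside the minWordLength/maxWordLength range are dropped rather than truncated, and that maxWords caps the number of terms kept. None of these behaviors is confirmed by the source, and the function name is hypothetical.

```python
# Default delimiter set as given on this page: \t\n\r+-=<>(){}[]:;/\\\"!?
DEFAULT_DELIMITERS = "\t\n\r+-=<>(){}[]:;/\\\"!?"


def extract_words(field, delimiters=DEFAULT_DELIMITERS, extra_delimiters="",
                  remove_delimiters="", min_word_length=1,
                  max_word_length=None, max_words=None):
    """Split a field into words using an adjustable delimiter set (assumed semantics)."""
    active = (set(delimiters) | set(extra_delimiters)) - set(remove_delimiters)
    words, current = [], []
    # The trailing space forces the last word to be flushed.
    for ch in field + " ":
        if ch.isspace() or ch in active:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Assumption: out-of-range words are discarded, not shortened.
    kept = [w for w in words
            if len(w) >= min_word_length
            and (max_word_length is None or len(w) <= max_word_length)]
    return kept if max_words is None else kept[:max_words]


print(extract_words("time-series (annual)"))
print(extract_words("time-series", remove_delimiters="-"))
```

Under these assumptions the first call splits on the hyphen and parentheses, while the second keeps "time-series" as one term because the hyphen was removed from the delimiter set.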
Parameters (WordsMinusBoundPhrases routine)
bounds=
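
The effect of the bounds parameter can be sketched in Python as follows. This is an illustration of the described behavior (data between a declared pair of boundary characters is ignored), not the Pears implementation; the function names are hypothetical, and the handling of an unclosed opening character (the rest of the field is ignored) is an assumption.

```python
def strip_bound_phrases(text: str, bounds: str) -> str:
    """Remove everything between each declared open/close character pair."""
    if len(bounds) % 2:
        raise ValueError("bounds must be declared in character pairs")
    # Map each opening character to its closing character, e.g. '""' or '()'.
    pairs = {bounds[i]: bounds[i + 1] for i in range(0, len(bounds), 2)}
    kept = []
    closer = None  # closing character we are waiting for, if inside a bound phrase
    for ch in text:
        if closer is not None:
            if ch == closer:
                closer = None
            continue  # skip bounded data and the closing character itself
        if ch in pairs:
            closer = pairs[ch]
            continue  # skip the opening character
        kept.append(ch)
    return "".join(kept)


def words_minus_bound_phrases(text: str, bounds: str) -> list[str]:
    """Keyword terms from the text that remains outside the bound phrases."""
    return strip_bound_phrases(text, bounds).split()


print(words_minus_bound_phrases('reviews of "Moby Dick" in print', '""'))
# ['reviews', 'of', 'in', 'print']
```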
Parameter Description
Collapse=<list of characters> Removes any of the characters in the list from the field
ExtraTrimChars=<list of characters> Adds the list of characters to the default list of trimChars for the current index only
TrimChars=<list of characters> Removes any of the characters in the list from the beginning or end of the field (default set: ' & . , : *)
MaxLength=<number> Shortens the field to the specified number of characters
StartOffset=<number> Ignores the first specified number of characters in the field
Note: The offset is performed before any other trim or collapse rules are applied.
ExtraIndex=<index ID> Sends any terms extracted for this index to the specified index ID as well
indicator1=<list of characters> Requires that indicator1 for this field have a value from the specified list of characters
Note: This can be used only with MARC-like records.
indicator2=<list of characters> Requires that indicator2 for this field have a value from the specified list of characters
Note: This can be used only with MARC-like records.
indicators=<list of character pairs> Requires that the two indicators have a value from the specified list of character pairs
Note: This can be used only with MARC-like records.
notIndicator1=<list of characters> Requires that indicator1 for this field not have a value from the specified list of characters
Note: This can be used only with MARC-like records.
notIndicator2=<list of characters> Requires that indicator2 for this field not have a value from the specified list of characters
Note: This can be used only with MARC-like records.
notIndicators=<list of character pairs> Requires that the two indicators not have a value from the specified list of character pairs
Note: This can be used only with MARC-like records.
NonFilingIndicator1=true Value of the first indicator determines the number of characters to remove from the beginning of the field
NonFilingIndicator2=true Value of the second indicator determines the number of characters to remove from the beginning of the field
Example: Since titles often have a trailing slash that needs to be removed...

[title]
index=1
routine=ORG.oclc.pears.IndexRoutines.Phrase
tagpath=245/1
extratrimchars=/
nonFilingIndicator2=true
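
As a rough illustration of what a definition like this does to a title field, the Python sketch below applies the Phrase parameters in sequence. Only the rule that StartOffset runs first comes from this page; the order of the remaining steps, the stripping of surrounding spaces, and the function and argument names are assumptions. The non_filing argument stands in for the numeric value read from the indicator when NonFilingIndicator1 or NonFilingIndicator2 is true.

```python
# Default trim set as listed on this page: ' & . , : *
DEFAULT_TRIM_CHARS = "'&.,:*"


def normalize_phrase(field, start_offset=0, non_filing=0,
                     trim_chars=DEFAULT_TRIM_CHARS, extra_trim_chars="",
                     collapse="", max_length=None):
    """Apply offset, non-filing skip, trim, collapse, and length rules to a field."""
    text = field[start_offset:]   # StartOffset is applied before any other rule
    text = text[non_filing:]      # drop the non-filing characters (e.g. "The ")
    # Assumption: surrounding spaces are trimmed along with the trim characters.
    text = text.strip(trim_chars + extra_trim_chars + " ")
    if collapse:
        text = "".join(ch for ch in text if ch not in collapse)
    if max_length is not None:
        text = text[:max_length]
    return text


# A 245 title with second indicator 4 and a trailing slash, as in the example above.
print(normalize_phrase("The old man and the sea /", non_filing=4, extra_trim_chars="/"))
# old man and the sea
```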
Bib Level Code  Type of Material    Index Term Returned
a               analytic monograph  analytic
b               analytic serial     analytic
m               monograph           monograph
s               serial              serial
c               collection          collection
d               subunit             subunit
Example:

[BibLevel]
index=1
routine=ORG.oclc.pears.IndexRoutines.MarcBibliographicLevel
tagpath=0
startOffset=1
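
The code-to-term table for MarcBibliographicLevel amounts to a simple lookup, sketched here in Python. The mapping is taken from the table; the function name, the use of leader position 07 (where MARC 21 stores the bibliographic level, counting from 0), and returning None for an unlisted code are assumptions rather than documented Pears behavior.

```python
# Bibliographic level code -> index term, as listed in the table above.
BIB_LEVEL_TERMS = {
    "a": "analytic",    # analytic monograph
    "b": "analytic",    # analytic serial
    "m": "monograph",
    "s": "serial",
    "c": "collection",
    "d": "subunit",
}


def bib_level_term(leader: str):
    """Return the index term for the bibliographic level byte of a MARC leader."""
    code = leader[7]  # MARC 21 leader position 07 (0-based)
    # Assumption: codes not in the table produce no index term.
    return BIB_LEVEL_TERMS.get(code)


print(bib_level_term("00714cam a2200205 a 4500"))  # monograph
print(bib_level_term("01142cas a2200301 a 4500"))  # serial
```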
Record Type  Bibliographic Level Code  Abbreviation  Type of Material
a, t         m, c, a, d                bks           Books
e, f         any                       map           Maps
p, b         any                       mix           Mixed Materials
m            any                       com           Computer Files
c, d         any                       sco           Scores
any          s, b                      ser           Serials
i, j         any                       rec           Sound Recordings
g, k, o, r   any                       vis           Visual Material
Parameter Description
DebugMarcFormat= Turns on internal debugging
Example: To extract material type from a MARC leader . . .

[format]
index=1
routine=ORG.oclc.pears.IndexRoutines.MarcFormat
tagpath=0
startOffset=1
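
The record type and bibliographic level table for MarcFormat can be read as a rule list, as in the Python sketch below. The rules are copied from the table, but the source does not say which row wins when more than one matches (for example, record type e with bibliographic level s), so evaluating rows in table order is an assumption, as are the function name and the use of leader positions 06 and 07.

```python
# (record types, bibliographic levels, abbreviation); None means "any".
FORMAT_RULES = [
    ("at",   "mcad", "bks"),  # Books
    ("ef",   None,   "map"),  # Maps
    ("pb",   None,   "mix"),  # Mixed Materials
    ("m",    None,   "com"),  # Computer Files
    ("cd",   None,   "sco"),  # Scores
    (None,   "sb",   "ser"),  # Serials
    ("ij",   None,   "rec"),  # Sound Recordings
    ("gkor", None,   "vis"),  # Visual Material
]


def marc_format(leader: str):
    """Return the format abbreviation for a MARC leader, or None if no rule matches."""
    record_type, bib_level = leader[6], leader[7]  # MARC 21 leader positions 06 and 07
    # Assumption: the first matching row in table order decides the result.
    for types, levels, abbreviation in FORMAT_RULES:
        type_ok = types is None or record_type in types
        level_ok = levels is None or bib_level in levels
        if type_ok and level_ok:
            return abbreviation
    return None


print(marc_format("00714cam a2200205 a 4500"))  # bks
print(marc_format("01142cas a2200301 a 4500"))  # ser
```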
Type Code   Abbreviation  Type of Material
a, t        bks           Books
e, f        map           Maps
p           mix           Mixed Materials
m           com           Computer Files
c, d        sco           Scores
s           ser           Serials
i, j        rec           Sound Recordings
g, k, o, r  vis           Visual Materials
Parameter Description
DebugMarcTypeOfMaterial= Turns on internal debugging
Example: To extract material type from MARC 006 . . .

[materialtype]
index=1
routine=ORG.oclc.pears.IndexRoutines.MarcTypeOfMaterial
tagpath=6
Parameter Description
DebugMarcLanguage= Turns on internal debugging
Example: To extract language from the MARC 008 field . . .

[language]
index=1
routine=ORG.oclc.pears.IndexRoutines.MarcLanguage
tagpath=8
startOffset=35
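
The language definition skips the first 35 characters of the 008 field, which is where MARC 21 stores the three-letter language code (positions 35-37). The Python sketch below shows the idea only: the function name is hypothetical, the code table is a small illustrative subset of the MARC language codes, and the exact form of the index term that MarcLanguage generates (here a lowercase English name) is an assumption.

```python
# Illustrative subset of MARC three-letter language codes.
LANGUAGE_TERMS = {
    "eng": "english",
    "fre": "french",
    "ger": "german",
    "spa": "spanish",
    "chi": "chinese",
}


def marc_language_term(field_008: str):
    """Return an English search term for the language code in a MARC 008 field."""
    code = field_008[35:38].lower()  # three characters starting at offset 35
    # Assumption: unknown codes produce no index term.
    return LANGUAGE_TERMS.get(code)


# Build a 40-character sample 008 field with "eng" at positions 35-37.
sample_008 = "850423s1985    nyu".ljust(35) + "eng" + " d"
print(marc_language_term(sample_008))  # english
```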