Search Functions on Web Sites
The search function commonly found on Web sites may lead to material that is not accessible to a Web harvester, which simply follows anchor links. This note describes an evaluation of the Web site search functions found in the 2000 Web Characterization Project sample. The prevalence, character, and content of the search functions were estimated, with the goal of assessing the completeness of the annual Web sample.
The evaluation was a two-step process. First, the sample was scanned using a routine to identify pages in the sample that contained forms that might be used for searching. The second step was to review each potential search form manually. Consistent with the Web Characterization Project definitions, only search functions internal to the site were reviewed. If the search was directed to an external site, such as an area provided to search Yahoo! from the current site, the search was ignored.
The potential search function on the live site was polled to confirm it was a search. If it was, the type of data searched was noted and the amount of material unavailable to the harvester was estimated.
At a minimum, an HTML search function must contain a text input area and a submit button. Initial review found no searches with multiple text input areas that were not accompanied on the same site by a simpler search form with a single text input area. Therefore, only forms with a single text input area were selected for manual review.
The routine identifies HTML forms with a single text input area as potential search functions. Within a site, forms with parameters and values identical to those of previously identified forms were eliminated from review as duplicates. The action parameter of the form and the text of the submit button were also checked for words indicating other common uses of text input forms, such as "password", "buy", and "login". The routine thus eliminated from manual review forms that were clearly not associated with searching.
The evaluation routine steps through each HTML page in the harvest sample and identifies each form on a page that might be used for searching, ready for manual review.
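The selection rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine actually used for the evaluation: the names FormScanner, candidate_search_forms, and EXCLUDE_WORDS are hypothetical, the exclusion list contains only the three words quoted above, and a form counts as a duplicate when its site, action, and input fields all match an earlier form.

```python
from html.parser import HTMLParser

# Hypothetical exclusion list: only the three words quoted in the text.
EXCLUDE_WORDS = ("password", "buy", "login")


class FormScanner(HTMLParser):
    """Collect one summary dict per closed <form> element on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "",
                             "text_inputs": 0, "submit_text": "", "fields": []}
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text field.
            kind = (attrs.get("type") or "text").lower()
            self._current["fields"].append(
                (kind, attrs.get("name") or "", attrs.get("value") or ""))
            if kind == "text":
                self._current["text_inputs"] += 1
            elif kind == "submit":
                self._current["submit_text"] += " " + (attrs.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def candidate_search_forms(pages):
    """pages: iterable of (site, html) pairs.

    Returns (site, form) pairs that pass the automatic filters and
    would go on to manual review.
    """
    seen = set()
    candidates = []
    for site, html in pages:
        scanner = FormScanner()
        scanner.feed(html)
        scanner.close()
        for form in scanner.forms:
            # Keep only forms with exactly one text input area.
            if form["text_inputs"] != 1:
                continue
            # Drop forms whose action or submit text suggests another use.
            label = (form["action"] + " " + form["submit_text"]).lower()
            if any(word in label for word in EXCLUDE_WORDS):
                continue
            # Drop forms identical to one already seen on the same site.
            key = (site, form["action"], tuple(form["fields"]))
            if key in seen:
                continue
            seen.add(key)
            candidates.append((site, form))
    return candidates
```

Given the pages of a harvested site as (site, html) pairs, candidate_search_forms returns the forms left for manual review. The sketch only looks at input elements, so forms that submit through a button element or a script would need additional handling.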
Two broad types of search results are defined here for working purposes. The results of a search are characterized as either a Site Search or a Record Search. A Site Search is a navigation function: the information returned is, directly or through a link, pages of information that exist on the site. In addition to HTML pages found on the site, these searches may return archived materials or user manuals, up to full-text articles or books. This is best understood in contrast to the Record Search as defined here, where only limited information is returned, and the information returned is itself the end goal as far as information on the site is concerned. Searches of this nature can return phone listings, shopping catalog items, or brief records describing information that is off-site, as in a Yahoo!-like directory. These two categories are not completely distinct; the boundary between them is fuzzy when a unique abstract is used as a pointer to additional off-site information.
The search function on the live site was used for review, since the search functionality is not harvestable. First, the harvested version was examined to ensure the site had not substantially changed or moved since it was harvested. The search function was then exercised, often using words typical of the site. For foreign-language sites, cutting and pasting words from the site usually worked well. A single successful result was often sufficient to tell whether this was a search function, and whether it was a Site Search or a Record Search.
The Site or Record Search distinction is useful in understanding the information that can be obtained from the search. For the Record Search, the information is often too limited or structured to estimate the amount of data searched by polling techniques, and seemingly unique information is often just a reorganized version of information returned in other searches. However, the results of these searches can often be cleanly thought of as a phone book, a catalog, a directory, and so on.
The more robust Site Search can be polled through various keywords, and the links returned compared with the pages that were harvested in the Web sample. This can give a statistically valid estimate of the amount of information that is hidden from harvesters, often referred to as dark information. For example, if a keyword search returns ten pages, five of which were harvested, a point estimate that 50% of the site was harvested can be made. Several keyword trials are needed to provide a meaningful estimate of the number of pages not harvested. This, in addition to site-specific information, was used to estimate the number of missed pages.
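The polling estimate can be illustrated with a short Python sketch. The function names are hypothetical, and the extrapolation step is an assumption made here for illustration: it treats the pooled search results as a representative sample of the site's pages, so that the site total is roughly the harvested page count divided by the harvested fraction. The evaluation itself also drew on site-specific information that this sketch does not model.

```python
def harvested_fraction(returned_urls, harvested_urls):
    """Fraction of the pages returned by one keyword search that were harvested."""
    returned = set(returned_urls)
    if not returned:
        return None  # an empty result set gives no estimate
    return len(returned & set(harvested_urls)) / len(returned)


def estimate_missed_pages(trials, harvested_urls):
    """Pool several keyword trials and extrapolate the number of missed pages.

    trials: list of URL lists, one list per keyword search.
    Assumes (for illustration) that the pooled results are a
    representative sample of all pages on the site.
    """
    harvested = set(harvested_urls)
    returned = set()
    for urls in trials:
        returned.update(urls)
    found = len(returned & harvested)
    if found == 0:
        return None  # no overlap, so the fraction cannot be extrapolated
    fraction = found / len(returned)
    estimated_total = len(harvested) / fraction
    return estimated_total - len(harvested)
```

With ten returned pages of which five were harvested, harvested_fraction gives 0.5, matching the example above. If 20 pages of that site had been harvested in total, estimate_missed_pages would put the site at about 40 pages, with about 20 missed.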
The entire 2000 Web Characterization Project sample was scanned using the identification routine, which identified 518 potential search functions. After manual review, 110 were determined not to be searches, and 75 were at sites that were gone or so substantially revised that the original functionality could not be determined. The remaining 333 unique searches were initially categorized as follows:
Search functions in the Web sample
64 External to the site
5 Private and could not be accessed
169 Site Searches
95 Record Searches
Evaluation of site searches
The 169 Site Search functions were extensively evaluated, primarily looking for information on the site that was not accessible through point-and-click navigation. If a few pages were found that were obviously drafts or otherwise not intended to be made available, these were ignored. The goal was to find information that was available, but intended to be found only through the local search function. Of the 169, 15 search functions were found to index information that was not accessible by simple linking. Thirteen sites accounted for an estimated total of 9,000 pages. Mostly, these were archives of older materials, such as newsletters, that probably had been available on the site previously. One site had three versions of the Bible. The final site reviewed had an obviously large amount of indexed material in a complex format, and the amount of material for that site was not estimated. This site was the Ask Jeeves reference site.
Since the 2000 sample identified three-quarters of a million pages, the loss of the 9,000 pages in the 13 Site Searches with hidden information (about 1.2% of the sample) is clearly negligible. Of the nearly 3,000 sites evaluated, only two had a significant amount of material unavailable to harvesters. Although a good deal of material on the Web is unavailable to harvesters, the number of sites with a significant store of such material is limited, in the range of a few thousand sites. Indexers of the Web can easily identify and provide access to these sites manually.
Evaluation of record searches
It is generally not possible to estimate the amount of material available through a Record Search function. To clarify their nature, these searches have been separated into a few categories.
Directory: This is used to describe line-item information. Examples would be phone lists, e-mail lists, addresses, etc.
Catalog: The information in these is more robust than in Directories, but it is a brief description of something that is not actually on the site. Library and shopping catalogs are examples of these.
Calculations: These are more focused on data manipulation than on the data itself. Examples would be mortgage calculators and plotting stock value trends.
Discussion Lists: Although these are more verbose than Directory listings, they are still sets of limited-focus items, typically shorter and less structured than the full pages of content that are the focus of the Site Search category.
The most appropriate category was chosen for each of the 95 Record Searches reviewed. These are:
Types of record searches
5 Discussion Lists
Although this information is often valuable, it is difficult to estimate the number of items searched. As an observation, much of the material is of interest only to users already on the site. For example, going to the "City Hospital" Web site to search for a City Hospital phone number is reasonable, and having the information indexed by a global search engine is not very useful. Similarly, searching the catalog of a local library is probably of interest only to likely patrons of that library.
A second observation is that the amount of material available is usually small. A hospital phone listing would be a few hundred numbers. Most catalogs were for a narrow set of products, more typically tens of items rather than thousands. As with the Site Search finding, large troves of information were uncommon.
Search-only access to Web pages is an uncommon practice on Web sites, although in a few instances a large amount of material may be available only through that route. Less than 10% of all Web sites had search functionality, and only a few had a large amount of information available only through searching. It is estimated that on the entire Web, only a few thousand sites have large databases of otherwise inaccessible information.
Most Web sites that had search functionality provided it as an aid to discovery, not as the only method of access. Excluding search-only material from Web page count estimates does not significantly affect the accuracy of the count.