Non-anchor Navigation and its Impact on Harvesting Completeness
A Web site is usually viewed as a collection of individual pages interconnected by simple URL links. This is the common basis for Web harvesting engines, where these pages are harvested and indexed, and the search results are made available to end-users. As Web sites become increasingly large and sophisticated, it is worthwhile to see how prevalent simple linking is, or whether other Web page navigation techniques are replacing the simple linking model.
This research note looks at html form constructs used as a replacement for anchor or link tags. Often, links are put into html form drop down boxes to remove clutter from a page. Another common use is an html form submit button, which simply transfers the user to another page. This is an easy way to make a labeled button for a link. Other forms, of course, are part of a more complex interaction with the user.
The sites harvested in the 2000 Web Characterization Project sample are used to investigate the nature and frequency of form-based links, and to estimate the number of pages inaccessible to a traditional harvester due to the use of this type of linking.
The evaluation was a two-step process. First, the sample was scanned using a routine to identify pages in the sample that contained forms that might be used for navigation. The second step was to review each potential navigation form manually. If the form was used for navigation, an estimate of the pages directly missed was made. No effort was made to estimate the number of subsequent pages that could have been harvested.
As a working definition here, a form is used for navigation if it leads to a unique page of content. If the form allows you to modify the current or next page, or if the purpose of the form is to provide information to the publisher of the page, it is not considered to be a navigation form.
The simplest use of a form for navigation is a submit button with a page URL as the action. This has the same function as an anchor tag, but allows the Web developer to make a simple button with text in it without going to a second software package to develop a button image file.
A form of this kind produces a button that transfers the user to NextPage (not functional here).
A second simple use is a drop down menu, with each item having a URL to a new page. Web developers often use this to remove a fixed table menu of links, reducing the clutter on a page.
A form of this kind produces a simple drop down that transfers the user to the NextPage page (not functional here).
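The two simple cases above might be marked up roughly as follows. This is only an illustrative sketch: the markup, the placeholder URL NextPage.html, the onchange script, and the helper name form_targets are all assumptions, not the original examples from this note. The short parser shows where the target URL sits in each case, which is exactly the place a harvester that only follows anchor tags never looks.

```python
from html.parser import HTMLParser

# Hypothetical markup for a submit button used as a link.
BUTTON_FORM = """
<form action="NextPage.html" method="get">
  <input type="submit" value="Next Page">
</form>
"""

# Hypothetical markup for a drop down menu used as a list of links.
# The onchange script is one common way to do this; a server-side
# redirect script named in the form action is another.
DROPDOWN_FORM = """
<form>
  <select onchange="location.href = this.value">
    <option value="">Go to...</option>
    <option value="NextPage.html">Next Page</option>
  </select>
</form>
"""


class FormTargets(HTMLParser):
    """Collect URLs a form could lead to: form actions and option values."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("action"):
            self.targets.append(attrs["action"])
        elif tag == "option" and attrs.get("value"):
            # Empty values (such as a "Go to..." prompt) are skipped.
            self.targets.append(attrs["value"])


def form_targets(html):
    """Return the candidate target URLs found in the forms of one page."""
    parser = FormTargets()
    parser.feed(html)
    parser.close()
    return parser.targets


print(form_targets(BUTTON_FORM))
print(form_targets(DROPDOWN_FORM))
```

Neither snippet contains an anchor tag, so the link to NextPage.html is invisible to a harvester that extracts only href attributes.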
Although these two cases are both relatively simple and commonly used, it is difficult to separate them from more complicated uses. The flexibility of forms also allows customized pages to be built and returned on demand. It would be difficult for an automatic harvester to distinguish fixed unique pages from a large series of similar but essentially identical pages. The use of more than one multiple-option widget on the same page complicates the issue further. The combination of all possible options can quickly lead to a large number of potential return pages.
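To see how quickly the combinations grow, the number of potential return pages for a single form is the product of the option counts of its widgets. The sketch below uses made-up option counts and a hypothetical helper name, potential_pages; it is not taken from the sample data.

```python
from math import prod


def potential_pages(option_counts):
    """Number of distinct submissions for a form whose widgets each
    offer the given number of options (one choice per widget)."""
    return prod(option_counts)


# Hypothetical form with three drop downs of 10, 12 and 5 options.
print(potential_pages([10, 12, 5]))  # 600
```

A harvester that tried every combination would have to request 600 pages from this one form, with no guarantee that any two of them differ in content.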
The evaluation routine was written to find the forms on the harvested html pages. Text input areas cannot be used for simple navigation, so forms containing these were ignored. The SELECT, OPTION, and INPUT tags were evaluated with their parameters, and this information was used to prevent the review of duplicate forms within a site.
The evaluation routine steps through each html type page in the harvest sample, and identifies each form on a page that might be used for navigation.
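The routine described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the code actually used in the study: the class and function names are hypothetical, and the choice of which input types count as text input areas is a guess. It finds each form on a page, ignores forms that contain text input, and builds a signature from the SELECT, OPTION, and INPUT tags and their parameters so that exact duplicates within a site are reviewed only once.

```python
from html.parser import HTMLParser

# Assumption: these input types count as "text input areas".
TEXT_INPUT_TYPES = {"text", "password"}


class FormScanner(HTMLParser):
    """Collect a signature for every form that has no text input."""

    def __init__(self):
        super().__init__()
        self.forms = []        # signatures of candidate navigation forms
        self._current = None   # tags seen so far in the open form
        self._has_text = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = [("form", attrs.get("action") or "")]
            self._has_text = False
        elif self._current is None:
            return  # tag is outside any form
        elif tag == "textarea":
            self._has_text = True
        elif tag == "input":
            kind = (attrs.get("type") or "text").lower()
            if kind in TEXT_INPUT_TYPES:
                self._has_text = True
            self._current.append(
                ("input", kind, attrs.get("name") or "", attrs.get("value") or "")
            )
        elif tag in ("select", "option"):
            self._current.append(
                (tag, attrs.get("name") or "", attrs.get("value") or "")
            )

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            if not self._has_text:
                self.forms.append(tuple(self._current))
            self._current = None


def candidate_forms(pages):
    """Return the unique candidate form signatures for one site.

    `pages` is an iterable of HTML strings, one per harvested page.
    """
    seen = set()
    unique = []
    for html in pages:
        scanner = FormScanner()
        scanner.feed(html)
        scanner.close()
        for signature in scanner.forms:
            if signature not in seen:
                seen.add(signature)
                unique.append(signature)
    return unique
```

Because the signature includes every option value, a drop down that omits the current page from its list produces a different signature on each page. Exact matching of this kind cannot remove such near duplicates.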
Each page identified by the evaluation routine was put into a browser and reviewed. The html source or live site was also reviewed as needed to confirm the observation. When the form was used for navigation, the pages linked to were counted.
Sites frequently had near-duplicate navigation forms that the evaluation routine could not remove. A very common example is a drop down box where the current page is removed from the option list. These near duplicates were identified and counted only once.
A sub-sample of the 2000 harvest was manually reviewed for form usage to estimate the impact on the harvest. The sub-sample contains 213 sites, or 7% of harvested sites. (The size of the sub-sample was chosen to fit on a single CD.) There are 28,000 harvested pages in this sample, or roughly 8% of the harvested pages. Of the 213 sites, 57 were found by the analysis tool to potentially have navigation forms. The analysis tool had removed duplicates when it could, and 1773 form entries were left for manual evaluation. Most forms were near duplicates of other forms, leaving only 323 unique forms at 27 sites. These are summarized below:
Uses of forms found
27 Input forms, such as questionnaires, quizzes, etc.
4 Very complex selection lists, providing a list of pages based on the input. These were considered to be more like a search result than a navigation link, and not counted.
223 Were used for navigation. These are described in more detail below.
68 Other types. Most of these were on a single site where the drop down menu feature was used to provide a list of product attributes but no submission functionality. Others were non-functional ("junk html" left over from the editing process) or other simple functions.
Review of navigation links
Occurrences    Link Count
89             1
2              2
3              3
11             4
2              5
5              6
34             11
36             30
Other link counts occur only a few times each. The forms with link counts of 30 and 11 are primarily on single sites. The largest forms have 226, 129, and 100 links.
Total Links 3278
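As a rough cross-check, the listed rows can be tallied against the reported totals. The sketch below assumes that each row pairs a number of forms (occurrences) with the number of links on each of those forms; the variable names are illustrative only.

```python
# (occurrences, link count) pairs as listed in the review of navigation links.
rows = [(89, 1), (2, 2), (3, 3), (11, 4), (2, 5), (5, 6), (34, 11), (36, 30)]

NAVIGATION_FORMS = 223  # forms used for navigation, from the summary
TOTAL_LINKS = 3278      # reported total

listed_forms = sum(count for count, _ in rows)
listed_links = sum(count * links for count, links in rows)

print(listed_forms, listed_links)    # 182 1640
print(NAVIGATION_FORMS - listed_forms,
      TOTAL_LINKS - listed_links)    # 41 1638
```

Under this reading, the listed rows cover 182 forms and 1640 links, leaving 41 forms and 1638 links for the link counts that are not listed individually, including the three largest forms.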
The links were not reviewed to see whether the pages they pointed to could have been harvested through a separate anchor tag, or whether the links were off site. However, the count does give an indication of how much of these sites is not harvestable due to the use of forms for navigation. The total page count for the sub-sample was 28,000 pages, so up to 10% of the Web may not be harvestable due to the use of forms for navigation.
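The note does not show how the 10% figure was derived. The arithmetic below works through two plausible readings of the reported counts, plus the share of sites with unique forms. Which reading was intended is an assumption, and the variable names are illustrative.

```python
links_via_forms = 3278      # pages reachable only through navigation forms
harvested_pages = 28_000    # pages harvested in the sub-sample
sites_with_forms = 27       # sites with unique forms after manual review
sites_in_subsample = 213

# Reading 1: missed pages relative to the pages that were harvested.
share_of_harvested = links_via_forms / harvested_pages

# Reading 2: missed pages relative to harvested plus missed pages.
share_of_all_pages = links_via_forms / (harvested_pages + links_via_forms)

# Share of sub-sample sites where unique forms were found.
share_of_sites = sites_with_forms / sites_in_subsample

print(f"{share_of_harvested:.1%}")   # 11.7%
print(f"{share_of_all_pages:.1%}")   # 10.5%
print(f"{share_of_sites:.1%}")       # 12.7%
```

Both readings give an upper bound, since some of the 3278 linked pages may be off site or reachable through ordinary anchor tags.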
The use of html forms for navigation in place of anchors can have a significant impact on the ability of Web harvesters to access portions of a Web site. About 12% of Web sites use these html constructs, blocking harvester access to about 10% of the pages at those sites. Pages referred to by these pages are also inaccessible.
Web designers who want their site to be fully indexed need to provide links around these navigation forms, such as a site index page.