Web Sites: Concepts, Issues, and Definitions
A Web site is generally understood to be a collection of Web pages. Moving beyond this somewhat vague definition requires an explanation of how the cluster of Web pages comprising a given Web site is determined. From this perspective, defining the term "Web site" involves a trade-off between emphasizing the Web site as a unit of information and emphasizing it as an object delineated by physical criteria. In other media, information units and physical units are often equivalent: for example, a book is a discrete physical unit, as well as a meaningful information unit. This convenience is absent on the Web, where clustering Web pages by physical criteria and by information criteria can yield disparate results.
To illustrate, consider two possible definitions of a Web site: one relying strictly on physical (network infrastructure) criteria, and the other on information (content-oriented) criteria:
Web Site (Physical Definition): the set of Web pages located at one IP address.
- Location must be defined by the IP address, because some Web sites (OCLC estimates about 20%) do not have domain names. A location identified by an IP address is the lowest common denominator for all sites.
Web Site (Information Definition): a set of related Web pages that, in the aggregate, form a composite object of informational relevance.
- Informational relevance implies that the object in question addresses a non-trivial information need.
In some circumstances, the two definitions can be equivalent. Typically, some common theme can be found to relate the pages at a given IP address, apart from their shared location. However, this is not always possible; furthermore, even when a common theme can be identified, that theme is not always the appropriate criterion for defining a meaningful information unit. For example, the Web pages located at the IP address 22.214.171.124 all share a common association with OCLC. But it may be more useful to define Web sites at a finer degree of granularity: for example, the Web pages devoted to OCLC's Office of Research may be considered a distinct Web site. The deconstruction can proceed still further: perhaps an even greater level of specificity is desired, such that the Web pages associated with the Office of Research's Web Characterization Project are viewed as a separate Web site.
As the OCLC example suggests, use of the information definition tends to cast the size of the Web in much larger terms than the physical definition. OCLC's Web Characterization Project employed a refined version of the physical definition to place the size of the Web at about 2,035,000 Web sites. In contrast, the Web navigation service Alexa estimates that there are about 20 million "content areas" on the Web, where a content area is defined as "top-level pages of sites, individual home pages, and significant subsections of corporate Web sites." Alexa's use of the term "content area" is oriented toward the information definition given above.
The use of either definition has both advantages and disadvantages. In the case of the physical definition, the primary advantage lies in ease of application. The physical definition utilizes an unambiguous criterion - the IP address - to uniquely assign a Web page to a Web site; therefore, it can be uniformly and objectively applied by network agents or computer processing routines, based solely on information contained within the URL. However, this definition can also produce arbitrary results from an information perspective. For example, consider an ISP that supplies hosting services for small businesses on its server. The physical definition implies that if all pages are accessible through a single shared IP address, then they comprise a single Web site. This would include the pages devoted to the ISP itself, in addition to each subset of pages pertaining to a business hosted at that location. Clearly, it would be preferable to treat each subset of Web pages associated with a distinct entity as a separate Web site.
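The mechanical nature of the physical definition can be illustrated with a minimal Python sketch. It is not the method used by any particular project; the function name `group_by_ip` and the `resolve` parameter are assumptions introduced here for illustration. The sketch takes a list of URLs, extracts the host from each, resolves it to an IP address, and groups the URLs by that address.

```python
import socket
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_ip(urls, resolve=socket.gethostbyname):
    """Cluster URLs into 'physical' Web sites keyed by IP address.

    `resolve` maps a hostname (or literal IPv4 address) to an IPv4
    address string. It is a parameter so that a lookup table can be
    substituted for live DNS; this is an illustrative assumption.
    """
    sites = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # hostname or literal IP in the URL
        sites[resolve(host)].append(url)
    return dict(sites)
```

Applied to the ISP example, with a hypothetical lookup table in place of DNS (the `.example` names and 192.0.2.x addresses are placeholders), the ISP's own pages and a hosted business's pages fall into the same group, which is the arbitrary result described above:

```python
table = {"isp.example": "192.0.2.10", "shop.example": "192.0.2.10"}
urls = ["http://isp.example/", "http://shop.example/catalog.html"]
sites = group_by_ip(urls, resolve=table.__getitem__)
# Both URLs are listed under the single key "192.0.2.10".
```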
Adopting an information emphasis alleviates the problem of arbitrary Web site definitions, but harbors an important difficulty of its own. The point at which an appropriate level of informational relevance is reached is clearly not the same across all sites. The number of "logical sites", if any, present within a collection of Web pages located at a single IP address is likely to be a function of the degree of heterogeneity present across the content of the Web pages. Clearly, the level of heterogeneity in Web page content will vary across sites. It is difficult to write a program that can analyze a collection of Web pages and cluster them according to their logical information units. Alternatively, a human being can manually review the pages and formulate appropriate Web sites, but this is a labor-intensive process, and, therefore, expensive, time-consuming, and prone to inconsistency. Given the vastness of the Web, and the amount of data that must be processed in order to estimate reliable metrics, the costs of such an approach are likely to be prohibitive.
It is recommended that researchers in the area of Web characterization adopt the physical definition as the standard, with several caveats that take into account some of the key insights of the information approach. The caveats correspond to a set of cases where use of the physical definition introduces ambiguity from an information perspective. These cases are enumerated below, along with recommendations for treating each case.
Canonical Web site
A collection of Web pages, located at one and only one IP address, that in aggregate form a composite object of informational relevance. If the IP address is mapped to a domain name, the mapping is one to one.
Variations of the canonical case
1) One domain name mapped to multiple IP addresses
It is not uncommon for a single domain name to be mapped to multiple IP addresses (e.g., for load balancing). Which IP address is actually accessed is invisible (and immaterial) to a client accessing the site through the domain name.
Example: Invoking nslookup on "www.microsoft.com" returns:
Recommendation: The set of all IP addresses mapped to one domain name should be treated as one logical Web site. The primary reason is that from the perspective of the client, there is only one site, uniquely identified by the domain name. Therefore, from an information perspective, there is only one true information unit.
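As a rough sketch of how this recommendation might be applied, the Python fragment below collects every IPv4 address returned for a domain name and treats two addresses as belonging to one logical Web site when both appear in that set. It relies on the standard library call `socket.gethostbyname_ex`, which returns a tuple of canonical name, alias list, and address list. The function names `addresses_for` and `same_logical_site`, and the `lookup` parameter, are assumptions made for this illustration.

```python
import socket


def addresses_for(hostname, lookup=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses for a domain name."""
    _canonical, _aliases, addresses = lookup(hostname)
    return sorted(set(addresses))


def same_logical_site(ip_a, ip_b, hostname, lookup=socket.gethostbyname_ex):
    """True if both addresses are among those the domain name maps to."""
    pool = addresses_for(hostname, lookup)
    return ip_a in pool and ip_b in pool
```

Note that a single lookup may not reveal every address behind a name (some load-balancing schemes rotate or subset their answers), so a practical survey would likely need repeated lookups.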
2A) Multiple domain names mapped to one IP address - aliases
Some domain names have alternate forms, or "aliases", that can be used to access a given IP address.
Example: "www.oclc.org" and "ora.rsch.oclc.org" both map to 126.96.36.199
Recommendation: Since all of the aliases and the canonical form refer to the same set of Web pages, located at a single IP address, it is logical to treat all of these cases as one Web site.
2B) Multiple domain names mapped to one IP address - virtual hosting
The case of virtual hosting is more problematic. A virtual host - a single machine providing multiple Web services, each accessible through a separate domain name - clearly contains multiple logical Web sites. In HTTP 1.0, this situation did not present any difficulties, because each domain name required its own unique IP address. With HTTP 1.1, in which each request names the intended host in a Host header, this is no longer required.
Example: Invoking nslookup on "www.homepageproductions.com" returns:
Invoking nslookup on "www.home4rent.com" returns:
Invoking nslookup on "www.wearcon.com" returns:
Recommendation: Each "virtual Web site" can be accessed via a unique domain name, without accessing the virtual hosting service provider's Web site first. In this sense, each virtual Web site has a unique URL and exists as a separate entity (apart from the shared IP address), and should therefore be treated as a separate Web site.
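The mechanism that lets several domain names share one IP address under HTTP 1.1 is the Host request header: the server reads it to decide which virtual site should answer. The sketch below only builds the request text and opens no network connection. The function name `build_get_request` and the `.example` host names are hypothetical.

```python
def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes.

    The Host header carries the domain name, so a server at one IP
    address can tell requests for different virtual sites apart.
    """
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Two requests that could be sent to the same IP address, differing
# only in the Host header (hypothetical host names):
request_a = build_get_request("site-a.example")
request_b = build_get_request("site-b.example")
```

Because the two requests differ only in the Host line, the domain name, not the IP address, is what distinguishes one virtual Web site from another.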
3) Logical Web sites contained within Web sites
The content of a Web site defined by the IP address criterion can be heterogeneous enough that it is useful to consider subsets of Web pages as distinct sites.
Example: OCLC main site: 188.8.131.52/
OCLC Office of Research Web site: 184.108.40.206/oclc/research/
OCLC Web Characterization Project Web site: 220.127.116.11/oclc/research/projects/webstats/
Recommendation: Since identifying logical Web sites is difficult without resorting to manual review, and since logical Web sites have no unique hostname (like a domain name) that can be used to access the logical site without reference to the main site, the presence of logical Web sites within larger Web sites should be ignored in the context of defining Web sites. This can be contrasted with the case of virtual hosting, where there does exist a unique hostname that distinguishes the virtual site as a separate entity.
It may be useful to devise a separate term for logical Web sites (as Alexa did with their term "content area"), and view these objects as being distinct from the concept of a Web site - essentially an intermediate object between Web site and Web page.
4) Multihomed hosts
Multihomed hosts (more than one IP address corresponding to the same host machine) are not an issue, because the physical definition does not refer to physical location, but rather to network location, summarized by the IP address.
5) Web pages located at multiple IP addresses, but logically the same site
In some cases, a collection of Web pages is distributed over multiple IP addresses, but the intention of the Web site provider is to have the Web site consumer view these multiple IP addresses as one logical site.
Example: Most of the Web pages associated with various projects underway in the OCLC Office of Research are mounted on 18.104.22.168. However, some projects have their Web pages mounted on 22.214.171.124. There is no intent to make any distinction between the projects based on IP address location, and the multiple locations are essentially invisible to the Web site viewer.
Recommendation: Although this solution is not ideal from an information standpoint, in the interest of consistency it is best to treat pages located at another IP address as a separate Web site, even though the intent may have been to create a seamless integration of the Web pages located at multiple IP addresses. It is too difficult to establish consistent criteria for determining when a link to another IP address should be treated as internal or external to the site in question. In the extreme, the "linked" nature of the Web opens the door for an argument that the Web itself is one grand Web site.
6) Web pages located at an IP address have no common theme apart from shared location
It is likely that this case will only be encountered rarely; however, should the case arise, the physical definition should take precedence over information concerns.
7) Two collections of Web pages share the same IP address and domain name, but are located on different ports
TCP/IP permits the simultaneous availability of multiple Internet services on the same machine, or even multiple instances of the same Internet service.
Recommendation: Despite their shared IP address and domain name, the different port numbers are sufficient to distinguish the two collections as separate Web sites.
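One way to make the port distinction concrete is to derive a site key from each URL that pairs the host with the port number, filling in the scheme's default port when the URL does not state one. The sketch below is an assumption-laden illustration, not a prescribed method: the name `site_key` and the `DEFAULT_PORTS` table are introduced here, and alias detection (case 2A) is not attempted.

```python
from urllib.parse import urlsplit

# Default ports assumed when a URL does not give one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def site_key(url):
    """Return (host, port) identifying the Web site a URL belongs to.

    The path is ignored, so logical sites nested inside a larger site
    (case 3) share its key; different ports yield different keys.
    """
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.hostname, port)


# Same host, different ports: two distinct site keys.
main_site = site_key("http://www.example.com/index.html")
other_site = site_key("http://www.example.com:8080/index.html")
```

Here `main_site` and `other_site` share a host but differ in port, so under the recommendation they count as separate Web sites.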
In the cases described above where modifications are made to a strict interpretation of the physical definition in order to account for information (content-related) issues, the purpose is generally to bring the Web site definition in line with the perspective of the Web site consumer. Of course, Web site consumers, or parties mainly interested in Web site content, are not the only users of Web characterization data. Other parties, such as Web site suppliers or Web infrastructure providers, are also important users of this kind of data (see J. Meadows 1/5/99 message on the mailing list). It may be that some parties are interested in more detailed breakdowns of HTTP activity across the Internet (for example, how many IP addresses respond to HTTP connection attempts). It is suggested that in these cases, a different set of terms be used.
Most of the cases described were in fact encountered in the OCLC Web Characterization Project's Web sample. In this sense, these cases are not theoretical, but are real-world phenomena which must be treated when analyzing Web data.
An important issue associated with the cases described above is how to identify them in practice. Methods for identifying some of these cases are described in "A Methodology for Sampling the World Wide Web".
Identifying the other cases is a topic for further research.