Impact of Virtual Hosting on the Analysis

Introduction

One limitation of the Web Characterization Project (WCP) methodology is that Virtually Hosted sites are not identified and analyzed. The WCP methodology selects sites through the IP address, which can be completely and randomly surveyed. In Virtual Hosting, more than one Virtual site may exist at a single IP address, and only one, or perhaps none, is available by accessing the IP address directly. This leads to an undercount of sites, and potentially to a biased sample of sites if the characteristics of Virtual sites differ from those of the sites found at the IP address ("Root sites" here).

The Netcraft survey [1] claims that over half of the unique domain names registered are Virtual, and that these are serviced from about 100,000 IP addresses. Although our definitions of a unique site and of a public site differ from Netcraft's, our counts of Root Web sites are roughly compatible. Netcraft found 7.4 million active sites on 3.4 million IP addresses in 2000. The latter number is similar to the nearly 3 million public sites that we found that same year. Since both studies look for unique content at the sites and the Root site counts are similar, we assume that we would also consider the Netcraft Virtual Web sites to be public sites with unique content.

The strengths of our analysis are a longitudinal study of Web growth and manual evaluation of a random sample of the content (language, country, economic activity). This review looks at the impact of Virtually Hosted sites on our analysis.

Assumptions

Virtual and Root Sites are indistinguishable to the end user. Any differences in the statistics related to these sites, such as how frequently they are cited, site size, etc., must be related to the site content and not whether it is Virtual or not.

The content or publisher of a site probably makes a difference as to whether a site is Virtual or Root. Free or low-cost hosting services are clearly attractive to small local businesses or people wanting personal Web sites. A larger business that chooses to subcontract its Web site maintenance may be more likely to end up with a Virtual site than one that does not, and that choice is likely to be influenced by the line of business it is in and by its in-house expertise. Regulations or other regional considerations may affect these choices as well.

Issues

I. Virtual Sites are missed, leading to an undercount of total sites and an apparent lower growth rate of the Web.

We conclude from the Netcraft numbers that we are only counting about half of the Web sites. Virtual Web sites, made possible by the HTTP/1.1 protocol, were not prevalent when the Web Characterization Project survey started, so the growth rate of Web sites is also underestimated.

To gauge the importance of this, the harvested sample was scanned for links to domain names external to the host site. A simple testing routine was written to compare the page returned when requesting the domain name to the page returned when requesting the IP address. If these pages were the same, the domain name was clearly not that of a Virtual site. If the returned pages differed, the domain name was most likely Virtual. This does not hold in all cases, but the test is sufficiently accurate for an estimate of the number of Virtual sites.
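The comparison described above can be sketched as follows. This is only an illustration of the idea, not the routine actually used in the study: the function names, the three result labels, and the exact byte-for-byte comparison are assumptions made here, and a real test would likely need to tolerate redirects and dynamically generated pages.

```python
import http.client
import socket
import urllib.request


def fetch_page(url, timeout=10):
    """Return the response body for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (OSError, http.client.HTTPException):
        # URLError and socket errors are OSError subclasses.
        return None


def classify_domain(domain, resolve=socket.gethostbyname, fetch=fetch_page):
    """Label a domain 'root', 'virtual?' or 'unknown' (hypothetical labels).

    'root'     : the page served by name equals the page served by IP.
    'virtual?' : the pages differ, or the IP address serves nothing.
    'unknown'  : the name does not resolve or serves no page.
    """
    try:
        ip = resolve(domain)
    except OSError:
        return "unknown"
    by_name = fetch(f"http://{domain}/")
    if by_name is None:
        return "unknown"
    by_ip = fetch(f"http://{ip}/")
    if by_ip is not None and by_name == by_ip:
        return "root"
    return "virtual?"
```

The resolver and fetcher are passed in as parameters so that the classification logic can be exercised without network access; calling classify_domain("example.org") with the defaults performs the live requests.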

The 3045 public (excluding adult) sites were scanned for external domain names, resulting in over 107,000 unique (by text comparison) links. This is an average of 35 links per site, showing the Web is highly interconnected.

A random sample of nearly 1800 of these was tested, resulting in 22% that were potentially Virtual. From the Netcraft numbers we would expect more than 50% of the sites to be Virtual. Clearly the Virtual sites are less frequently cited than the Root sites.
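As a rough check on how precise the 22% figure is, the sketch below computes a normal-approximation interval for a sample proportion. It assumes a simple random sample of exactly 1800 links (the text says "nearly 1800") and a 95% confidence level; neither the function name nor the interval is part of the original analysis.

```python
import math


def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a sample proportion.

    p_hat : observed proportion (0.22 for potentially Virtual links)
    n     : sample size (assumed 1800 here)
    z     : normal quantile, 1.96 for roughly 95% confidence
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


low, high = proportion_interval(0.22, 1800)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 20% to 24%, so sampling error alone does not account for the gap between the observed share and the more than 50% expected from the Netcraft numbers.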

Linking is a defining attribute of the Web. Web traffic is concentrated in a relatively small percentage of the sites, and power laws have been used to characterize the interlinking of Web sites [2]. Google [3] has developed a successful search engine strategy based on ranking search results by the linking frequency of a page. All of these interpret the Web as a well-connected core with a less well-connected periphery, and treat the well-connected core as the most relevant portion of the Web.

From this viewpoint, our survey captures the more relevant 80% of the Web, even if only 50% of the Web by unique domain name is reviewed: about 78% of the externally cited domain names tested were not Virtual and so are reachable through the IP address. For the most important portion of the Web, our Web count and growth numbers are only slightly under-representative.
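The two coverage figures can be reproduced from numbers quoted earlier in this review. The sketch below is illustrative only; it assumes one Root site per IP address in the Netcraft counts, and the function names are introduced here for the example.

```python
def coverage_by_domain_name(root_sites, total_sites):
    """Share of all sites that are Root sites, i.e. reachable by IP."""
    return root_sites / total_sites


def coverage_by_citation(virtual_share_of_cited):
    """Share of externally cited sites that are Root sites."""
    return 1.0 - virtual_share_of_cited


# Netcraft, 2000: 7.4 million active sites on 3.4 million IP addresses.
# Assumption: each IP address corresponds to exactly one Root site.
by_name = coverage_by_domain_name(3.4e6, 7.4e6)

# 22% of the sampled external links were potentially Virtual.
by_citation = coverage_by_citation(0.22)

print(f"by domain name: {by_name:.0%}, by citation: {by_citation:.0%}")
```

These work out to roughly 46% and 78%, which the text rounds to 50% and 80%.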

II. What is the expected impact on the content analysis?

At least two separate components of the content analysis can potentially be biased by the underrepresentation of Virtually Hosted sites. First, the analysis of Web content providers using NAICS codes can be affected if providers in some NAICS categories are more likely to select Virtual Hosting than providers in other categories. Small businesses and people publishing personal Web sites, in particular, are likely to be attracted by free or low-cost Virtual Hosting services. An additional evaluation is planned to address this bias more completely.

The second potential bias is geographical. For example, if Virtual Hosting were more prevalent in Japan than in the United States, then Japan, the Japanese language, and Japanese economic activity on the Web would all be underrepresented. To test this postulate, external domain names were selected from harvested sites of well-represented individual countries. The Virtual testing scan was then performed on these sub-samples. Countries where the external links were concentrated in very few sites are not reported, since those may not be representative. The evaluation was not performed for Canada and the United Kingdom, since these are expected to be well connected to the dominant English-language Web sites of the United States.

Virtual Hosting by country

Country        Sites in Sample    Virtual External Links (%)
Germany        155                26%
Japan          125                24%
Netherlands    51                 40%
Korea          43                 34%

For the two non-English-speaking countries with the largest Web presence, the percentage of Virtual sites is similar to the overall average of 22%, so no significant bias is expected. The Virtual percentage is somewhat larger for Web sites of the next two countries, but the impact on the overall results will be negligible because of their small initial representation in the sample. Overall, we conclude that regional differences in the use of Virtual Web hosting will have a small effect on the Web Characterization Project statistics.

Summary

The Web Characterization Project sampling methodology is likely to undercount the total number of Web sites by about 50% when all Virtual sites are included, but to undercount the more relevant sites, those likely to be referenced from other sites, by only about 20%. The distributions of country of origin, language, and economic activity are reasonably representative, due both to the large percentage actually analyzed and to the reasonable expectation that the important Virtual sites will show similar characteristics.

References

[1] The Netcraft Web Server Survey, Netcraft, reporting their July 2000 survey.
[2] The Web's Hidden Order, Adamic, L, and Huberman, B, Communications of the ACM, September 2001, Vol. 44, No. 9.
[3] Our Search: Why Use Google, Google.
