Research Commentary by David Novak of The Spire Project

10 Billion pages or more
By David Novak

"This is a webpage."
[ Instructor points to a webpage displayed on the wall ]
"There are over 10 billion webpages."
[ pause for effect ]
"Finding the one you want is not easy."

I am thinking of starting my seminar this way but I can't seem to get away from the number 10 billion. Estimates of the size of the web are all over the place. I've heard 10 billion, 7 billion, 35 billion and 300 billion.

Much of the confusion emerges from how you define a webpage. Does it count if you have to use a form to get to the information? Does it count if you need a password? These are not trivial questions. The answers determine whether you count every book listed in Amazon.com as a separate webpage (it is online, though you do need to use a form) or whether you count everything available in DialogWeb as a webpage (it is online, though you do need your credit card and Dialog password). So, for example, do we count intranets? Do we count email messages sent to a mailing list that get archived?

When the Internet Archive connected its collection of past information to the internet, did the internet suddenly swell by an additional 10 billion webpages? (Yes, multiple copies of past webpages, 1996 to the present, 10 billion in total.)

It may be tempting to say no to all of these, but then Altavista occasionally indexes Amazon.com book descriptions, and many search engines index archived mailing list messages.

As you can see, as we try to nail it down, it gets very slippery.

It is tempting to sidestep this question, but be careful. One pseudo-logical trap is to draw a judgement on the relative size of Google, from experience searching Google. To say "Google appears to index maybe a third of the pages I know…" is misguided. We can't estimate what is not indexed by searching what is.

Perhaps we can approach it another way. Doug Elix, Senior Vice President of IBM, speaking at WCIT2002 on February 28th, said, "Our [IBM] research labs project that internet-accessible data is increasing at an annual rate of 300%". [See: http://www.worldcongress2002.org - day 2 - Doug Elix - presentation.pdf]

This is an awe-inspiring projection. 300% each year! But the focus on relative growth does miss the point a little.

Absolute size is not trivial.

We use absolute size to assess the value of the tools we use to find information. Google's much lauded mega-database includes records from a fraction over 2 billion webpages as of June 2002. This represents roughly 29%, 20%, 6% or a minuscule 0.7% of all web material on the internet, depending simply on whether we say the web is 7, 10, 35 or 300 billion webpages in girth.
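As a rough check of that arithmetic, here is a minimal sketch that divides Google's index by each of the four size estimates quoted earlier. It assumes the index is rounded down to exactly 2 billion pages; the function name `index_share` is just an illustrative label.

```python
# Rough check: what share of the web does a 2-billion-page index cover?
# Assumption: "a fraction over 2 billion" is rounded down to exactly 2 billion.
GOOGLE_INDEX = 2e9

# The four web-size estimates mentioned in this commentary, in pages.
WEB_SIZE_ESTIMATES = [7e9, 10e9, 35e9, 300e9]


def index_share(indexed, total):
    """Return the indexed portion of the web as a percentage."""
    return 100.0 * indexed / total


for total in WEB_SIZE_ESTIMATES:
    share = index_share(GOOGLE_INDEX, total)
    print(f"{total / 1e9:>4.0f} billion pages: {share:.1f}% indexed")
```

The spread between the largest and smallest answers is more than fortyfold, which is why the definition of a webpage matters so much.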

Thinking it through again, the IBM projection probably had it right. If we sidestep the question of current size, and allow ourselves the luxury of a long view, we get the same conclusion regardless of the present size of the web. The web is growing fast, far faster than we are keeping pace with. It will dwarf everything.
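To see why the starting size drops out of the conclusion, consider a small sketch under loudly hypothetical assumptions: the IBM figure is read as the web quadrupling each year (a 300% increase; it could also be read as tripling), and a search index is imagined to double each year. Neither multiplier comes from measured data, and `projected_share` is an illustrative name only.

```python
def projected_share(web_now, index_now, years, web_growth=4.0, index_growth=2.0):
    """Fraction of the web indexed after `years` of constant yearly growth.

    web_growth=4.0 is one reading of "300% a year" (an assumption).
    index_growth=2.0 is a purely hypothetical index growth rate.
    """
    web = web_now * web_growth ** years
    index = index_now * index_growth ** years
    return index / web


# Compare a small and a large guess at today's web size, 2 billion pages indexed.
for web_now in (7e9, 300e9):
    shares = [projected_share(web_now, 2e9, year) for year in range(4)]
    print(f"{web_now / 1e9:.0f} billion today:",
          ", ".join(f"{share:.2%}" for share in shares))
```

Whatever the starting guess, the indexed share halves every year under these assumptions. Only the starting point differs, not the direction.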

And that conclusion has some very important implications for the way we expect the internet to grow, and for how we conduct research on the internet.

For starters, on many occasions we need to be looking for 'footprints' of information, or evidence that the information we seek is nearby. It is increasingly unlikely the webpage we are looking for will be indexed directly by Google/Altavista/what-have-you.

For example, when we are creating a list, we need to admit, loudly, that we are not doing anything definitive. We are not even doing something remotely complete. We are just making a reference to a few sites that hit our fancy. To be even nearly complete takes considerable effort, and continuous effort. This in turn has some important implications for the market value of Yahoo, once billed as THE Directory of Internet Resources. Hah! As if such a thing could exist today.

In research, it means that even with the best search, and the best searching skills, it helps to assume we have missed some vital, valuable and vibrant resource. Then we can get on with asking whether it is worth our time to find it.

The growth of the internet population has come off the exponential curve it was following for many years. Even so, for reasons described more fully in another article titled "The growth of the Internet should still scare you" (http://spireproject.com/art10.htm), we have entered a phase where the quantity of information will grow at a geometric or greater rate for several more years. No matter how many pages we say are here today, it will be truly chaotic soon.

Sometimes I fear we avert our eyes from such a vision of chaos.

There is a partial solution. If the internet is getting more cumbersome to search, we need to consider becoming better at searching it. There are some simple as well as complex skills and concepts we can employ to tame the internet. Research, after all, is a skill. Internet research is just the same.

It helps if you understand how to title-search the web. It helps if you are good with URL interpretation, juggling windows, and recognizing Context, Format and Source. It helps to align your previous library-focused experience with this new medium. It helps to know some of the better tools. There is much that can be done to make your time online more rewarding. This is, after all, what the Spire Project is about.

Oh, and for not very good reasons, I'm currently guessing the web is between 15 and 20 billion webpages, not counting commercial or intranet or form-based data...

* * *
David Novak, founder of the Spire Project, delivers seminars on Exceptional Internet Research around the world. I hope to see you one day. See SpireProject.com/seminar/ for details.

The Spire Project - better ways to find information.
Copyright © David Novak 2002.