Research Commentary on the Spire Project

This article first appeared in print as the title article of ONLINE Jan/Feb 2003, a publication of Information Today, Inc.

Evolution of Internet Research: Shifting Allegiances
By David Novak

It has taken me several years to grasp just how vast a gulf lies between searching and researching the Internet. Searching the Internet is computer science. We practice our understanding of search technologies and search engines. Researching the Internet is library science. We act upon our understanding of how information is arranged.

These two approaches are very distinct.

I shall attempt to trace a gradual evolution in how we find information using the Internet. I believe we have been moving from Internet searching to Internet research - from computer science to library science. If I am right, this portends perhaps the single most dramatic change to library science in decades: a renaissance of library science and librarianship.

The two or three most effective ways to search the Internet change every year or two. It comes as a bit of a shock to realize, but even the very short history of the Internet has seen a wide range of tools and techniques come and go. Today, there appears to be a consensus that Google is the primary search tool for searching for Internet information. And yet this same conviction was directed to Yahoo! just 2 years ago. What has happened?

In the very early days, before the Web arrived, I remember pleading with my Internet service provider to mirror a copy of the many guidebooks that made up the Internet Clearinghouse Project. You may know of this project as its later reincarnation: the Argus Clearinghouse. In its heyday it was internationally famous. One of its typical text guidebooks, "Not Just Cows," described in detail all of the better Internet resources and active mailing lists for agriculture. When I met this archive, it was racing past 130 guidebooks.

Archie complemented this as a database of all the publicly accessible files found on FTP sites. Actually, Archie was not a complete database but was thought to index well over 95% of all FTP material. This coverage was so complete, it started the tradition that the publisher was responsible for informing a nearby Archie if a new FTP site was launched.

How far we have come today. Most of the guidebooks have grown up or disintegrated in time. Argus has not been updated for several years and is being folded into the Internet Public Library (IPL) directory. Lou Rosenfeld (of Argus) formed his own consulting company and gives seminars in conjunction with the Nielsen Norman Group. Argus' direct competitor, AlphaSearch, is gone too. Even Archie gave way to a successor, which was then purchased by C|net and then lost all pretence at completeness. But much more was lost. The idea that a single person could organize all the resources in a given topic was one casualty. So was the idea of a search engine that indexed all Internet resources, as Archie did for FTP. The Internet simply outgrew these ideas. In the early days both were possible, and both were brilliantly executed.

With the arrival of Gophers, Veronica stepped in and became a third vital approach to finding Internet information. Veronica was a quasi-definitive list of all Gopher categories. It never attained the completeness that Archie had for FTP resources and its fame slipped rapidly away once it became apparent that the Web was going to be far more interesting than Gopherspace.

The early search engines, with names like the World Wide Web Worm and Webcrawler, changed this environment significantly. These search engines indexed most of the Web, certainly achieving initially over 50% coverage, then slipping to 30% as the Web grew. These tools were as famous as Google and Yahoo! today. Everyone used them. And when the Web was young, they sparkled.

Unfortunately, the search algorithms used by early search engines were of the kind used by commercial databases of the day. A search for "Internet Research" returned a list of Web pages ranked by frequency and title. Web pages with "Internet Research" in their titles would lead the list, followed by pages with the words "Internet research" occurring several times in the text. This gave rise to the uninspired marketing maxim that you must place your primary keywords in the title and three or four times in the first paragraph.
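
A minimal sketch of that style of ranking, in Python. The page data, the function name `rank_pages`, and the exact ordering rule (title matches first, then a count of the phrase in the body) are illustrative assumptions, not a record of how any particular engine worked.

```python
def rank_pages(pages, query):
    """Order pages as described above: title hits first,
    then by how often the query phrase occurs in the body text."""
    q = query.lower()

    def sort_key(page):
        in_title = q in page["title"].lower()
        frequency = page["body"].lower().count(q)
        # False sorts before True, so "not in_title" puts title hits first;
        # negating the count puts higher frequencies first.
        return (not in_title, -frequency)

    return sorted(pages, key=sort_key)


# Hypothetical pages, invented purely for illustration.
pages = [
    {"url": "a.example/notes", "title": "Notes",
     "body": "Internet research tips. More internet research."},
    {"url": "b.example/guide", "title": "Internet Research Guide",
     "body": "A short guide."},
    {"url": "c.example/misc", "title": "Misc",
     "body": "Internet research once."},
]

ranked_urls = [p["url"] for p in rank_pages(pages, "Internet Research")]
print(ranked_urls)
```

A page that merely repeats the phrase climbs above one that mentions it once, which is exactly why the keyword-stuffing maxim worked for a while.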

These early search engines also invited and even expected publishers to inform them of new Web pages. The search engines would dutifully send out their spiders, sometimes immediately. For some reason, though, I don't remember much use of field searching in these early days. Perhaps the early search engines did not permit Title and URL searching, or perhaps we didn't know we needed these tools.

Complementing these early search engines were two simple techniques that gave Internet surfing its motion. Initially, we would search for a hotlinks page. A search for "Accounting Hotlinks" would likely unearth a page created by someone who had just finished a scan of accounting resources. If it was a month or two old, it served as a very fine starting point for your efforts to do the same.

About a year later, as Hotlinks stopped being the word du jour, we would visit the "further links" section of an interesting Web site. Publishers were kindly creating these lists more and more, pointing out and linking to comparable sites. This may have been where the habit of surfing arose - you could hop on and gradually move from one Web site, to its further links page, to the next Web site, to its further links page - surfing to the information that peers recognized as useful.

The World Wide Web Virtual Library, soon followed by Yahoo!, began to succeed as the guidebooks began to falter. Yahoo! required much less effort to update, so rapidly delivered a far more extensive list of resources - though sadly listing few of the cherished mailing lists.

Yahoo! really made its move at a time when the early search engines were struggling to make the transition to popularity ranking. There were too many resources out there. The basic search algorithms that had delivered such brilliant results only a year earlier were now increasingly exasperating. They didn't work any more. The best information was often buried deep within a mass of other information.

Essentially, as the Web grew, and search engine databases struggled unsuccessfully to keep pace, the search engine results deteriorated. It did not help that these early search engines defaulted to OR, so that even a simple search for three blind mice would deliver millions of results. Adding the + symbol before each word - making an explicit request for a Boolean AND search - initially tamed this mess, but the trouble was more fundamental. It required a major rethink in how information was ranked to revitalize these search engines.
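
The difference between the two defaults can be sketched in a few lines of Python. The tiny document list and the function name `matches` are assumptions made for illustration; real engines used inverted indexes rather than scanning text.

```python
def matches(document, terms, mode="OR"):
    """Return True if the document satisfies the query under the given mode.

    OR:  any one of the terms is enough (the early default).
    AND: every term must be present (what the + prefix requested).
    """
    words = set(document.lower().split())
    hits = [term.lower() in words for term in terms]
    return all(hits) if mode == "AND" else any(hits)


# Hypothetical documents, invented purely for illustration.
docs = [
    "three blind mice see how they run",
    "three little pigs",
    "field mice in winter",
    "the blind watchmaker",
    "a history of cheese",
]
terms = ["three", "blind", "mice"]

or_results = [d for d in docs if matches(d, terms, "OR")]
and_results = [d for d in docs if matches(d, terms, "AND")]

print(len(or_results), "documents match under OR")
print(len(and_results), "document matches under AND")
```

Under OR, every page mentioning any one of the words is returned; under AND, only the page containing all three survives. Scale the collection up to the whole Web and the OR default produces the millions of results described above.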

In this chaotic transition, Yahoo! reigned supreme. Suddenly you could not move fast enough to see what Yahoo! had to offer. The age of the directory also heralded a raging business model that, through massive promotion, made Yahoo! synonymous with Internet research for a time.

The growth of the Internet continued. When Google introduced ranking technologies, it changed everything. Here was a way to float the more popular and coincidentally the more recognized resources to the top of the long search engine lists. With the default changed to AND, the search engines began to work again as an effective research tool. Then the databases searched by search engines swelled in size.
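
The idea can be sketched by simply counting inbound links. This is a deliberately crude stand-in: the link data and the function name `rank_by_inlinks` are invented for illustration, and real ranking systems weigh links far more subtly than a raw count.

```python
from collections import Counter


def rank_by_inlinks(links, candidates):
    """Sort candidate pages so that those with more inbound links come first.

    `links` is a list of (source, target) pairs; self-links are ignored.
    Ties are broken alphabetically so the order is stable.
    """
    inlinks = Counter(target for source, target in links if source != target)
    return sorted(candidates, key=lambda page: (-inlinks[page], page))


# Hypothetical link graph, invented purely for illustration.
links = [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("d.example", "c.example"),
    ("a.example", "b.example"),
    ("c.example", "b.example"),
    ("d.example", "d.example"),  # a self-link, which does not count
]
candidates = ["a.example", "b.example", "c.example", "d.example"]

popular_first = rank_by_inlinks(links, candidates)
print(popular_first)
```

The page that other publishers point to most often floats to the top, regardless of how many times it repeats the search words.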

There were fundamental shifts taking place. With these new algorithms, the search engines no longer required the assistance of publishers to index the best information. Initially, they began asking for email addresses - often bathing a publisher in spam as a price for indexing - and then some gradually stopped altogether. At the same time, as databases grew, the potential pay-off for a publisher shrank. Most new publishers would only occasionally see a visitor sent their way from any effort in informing the search engines of new pages.

When Google crested one billion records, the limitations of Yahoo! were becoming increasingly apparent. No directory could ever index the complete volume of the Internet effectively, it was said, forgetting that only a few years earlier Archie had effectively indexed all FTP resources. What had happened, of course, was rapid Internet growth that diluted earlier achievements to the point of being inadequate. It did not help that at this time Yahoo! began to charge a consideration fee for publishers wishing to be indexed.

Another change happened. The search engines allowed for field searching, and those in the know began to make much greater use of additional techniques to further refine their searching. A title search could be most helpful in certain circumstances. AlltheWeb permitted a title search using title.normal:words. This was later changed to match AltaVista's simpler title:words, though Google persisted for a long time in not inviting users to use its title search capability.

Almost by accident, many researchers began extending a skill I refer to as URL interpretation. From an early understanding that .gov means government and .au Australia, researchers could intuit additional information from the Web address. On a good day, I can tell the format, date, publisher, and type of author from the URL. Guessing these elements helps me to anticipate type and quality of information on the site.
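
Some of this guesswork can be mechanized. The sketch below, in Python, pulls a few clues out of a Web address. The lookup tables are tiny and hypothetical, the function name `interpret_url` is my own, and the output is only a hint: a year in a path is not proof of a publication date.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Small illustrative tables; a real list would be much longer.
SECTOR_HINTS = {"gov": "government", "edu": "education",
                "com": "commercial", "org": "organization"}
COUNTRY_HINTS = {"au": "Australia", "uk": "United Kingdom", "ca": "Canada"}


def interpret_url(url):
    """Guess sector, country, file format and year from a URL."""
    parts = urlparse(url)
    host = parts.hostname or ""
    clues = {
        "publisher_host": host,
        "sector": None,
        "country": None,
        # File extension of the last path segment, e.g. "pdf" or "html".
        "format": PurePosixPath(parts.path).suffix.lstrip(".") or None,
        "year": None,
    }
    # Only the last two host labels are checked (e.g. "gov" and "au").
    for label in host.split(".")[-2:]:
        if label in SECTOR_HINTS:
            clues["sector"] = SECTOR_HINTS[label]
        if label in COUNTRY_HINTS:
            clues["country"] = COUNTRY_HINTS[label]
    # A standalone four-digit number starting 19 or 20 is treated as a year.
    year = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", parts.path)
    if year:
        clues["year"] = int(year.group(0))
    return clues


clues = interpret_url("http://www.example.gov.au/reports/2002/annual-report.pdf")
print(clues)
```

For the made-up address above, the clues point to an Australian government publisher, a PDF document, and a likely date of 2002, all before the page has been opened.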

Region also came into play. A simple domain restriction would limit results to Australia. Even more effectively, Bryan Strome's site would (and still does) lead you quickly to a regional search engine - an Australian-only search engine. Predictions swept the Web that the next great step forward would be in regional Webspace and in topic-specific search engines. Both predictions, I am mindful, play as yet minor roles in Internet research.

As the Internet grows further, search engines begin to run into trouble again. Google stands at just about 2 1/2 billion records now but the Web races ahead at a much faster pace. There are complex reasons for this pace - not least that the number of people capable of Internet publishing grows at an exponential rate. I've explained my views elsewhere. This growth is real and seriously disrupts popularity ranking. Estimating an absolute size of the Web is perilous, but if you accept an estimate of 15 billion Web pages, only about 17% of the Web is indexed. Next year, as this figure surely dips below 7%, ranking technology takes on a whole new meaning.

Where once ranking would float the best information to our attention, by next year it will retreat to become similar to Yahoo! with its emphasis on site, time, and money. Google is not losing its battle but is definitely losing the technological war on organizing chaos. However, this war is being fought more successfully on other fronts.

There is more to this evolution than a change in tools. This is really a story about a change in approach. In the early days we expected almost all FTP resources to be indexed by Archie. With the early search engines, we expected most important Web pages to be represented. Tomorrow, we will expect most important Web sites to be represented. Yes, we will leap from Web pages to Web sites.

There is another message here. Over time, we discover better ways to find information.

For a simple illustration, consider how we judge the quality of Web-based information. In the early days, there were murmurs about assessing quality based on the .gov versus .com or perhaps just assuming the worst. Even today, some online advice suggests an assessment based on the presence of a copyright notice and date. Is the author identified on the article? Are the links working? Is the spelling correct?

Thankfully, we've progressed. We now look to context, format, and source. Who wrote it - and if we have a name, what else have they written (found with a simple search)? Make an assessment of the author and publisher based on other items they have published. (Hack the URL or query Google with a URL field search to find information logically located nearby.) Look for evidence of peer review by considering the format in which the information was prepared. Perhaps consider Web site popularity (found with a link field search). We can still consider spelling.
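
Hacking the URL is simple enough to sketch in Python: strip the address back one path segment at a time and visit each parent in turn. The function name `parent_urls` and the example address are hypothetical, and there is no guarantee that any given parent address actually serves a page.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the chain of parent addresses, nearest first, ending at the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    while segments:
        segments.pop()  # drop the last path segment
        path = "/" + "/".join(segments) + ("/" if segments else "")
        # Query string and fragment are discarded on purpose.
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


trail = parent_urls("http://www.example.edu/dept/library/guides/research.html")
for address in trail:
    print(address)
```

Each step up the trail tends to reveal a little more about who published the page and what else sits beside it.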

Let's have a research example. One of my frequent tasks as a traveling public speaker is to find suitable auditoriums. This is not simple. Bluntly querying Google for a list of auditoriums in Dallas will only give me a list of those with Web sites, primarily those with some popularity. What I really want is a list of auditoriums. It turns out two organizations create such lists. The local convention and visitors bureau often has a list of meeting room venues that include auditoriums. The state agency involved in disability legislation also may have a definitive list of auditoriums and their respective handicap access status.

I learned this through a bit of feedback research. After I stumbled upon two such lists in other cities, I began to actively seek such lists with a purpose. The key, however, is to realize that Google rarely indexes these lists. But knowing they exist, I'll first strike out and find the local convention and visitors bureau (with the help of Google or a list of convention centers) and then move through the Web site towards the list of meeting facilities. I may also consult a directory of museum Web sites - since they occasionally have auditoriums.

What has happened? Simple. Searching failed me. Without library science - knowledge of source, anticipating information, feedback research - I would have to admit defeat and choose a hotel.

Internet research continues to mature. About a year ago I had a delightful afternoon with Lecturer Theresa Anderson at University of Technology Sydney (UTS). She was completing her thesis on the criteria experienced researchers use to select information. With the help of multiple video cameras and computer memory, she has traced how skilled commercial-zone searchers interact with the information world dynamically, predicting what was out there, selecting and guiding their attention based on clues.

As we watched while I executed a difficult Internet search, we saw the same techniques at play. I was intimately aware of what I thought was out there, what I was finding, and constantly comparing the two. There was an internal dialogue selecting, reformulating, seeking a certain type of information, and being frustrated when I didn't find it. At the experiential level, Internet research techniques merge with commercial and information research techniques.

We have witnessed a voyage away from an era where the Internet was controlled and deeply understood from a computer science perspective. Internet research was initially about technically searching the Internet. It extended from search engines, to Boolean logic, to popularity ranking - all elements of computer science. Because most early adopters were computer techies, Internet research adopted this computer tech mantle.

This is changing, and the change is accelerating.

Over time, the Internet has grown. It has gradually morphed from a shallow pool, into a deep lake, into an ocean where the depths are largely unknown and not directly searchable. We simply can no longer see much of the information from a single vantage point.

The Internet transformed into the very beast found in the older information world - very much requiring library science and a research heritage distilled from years of working with incompletely indexed information with multiple and overlapping layers of organization.

The Internet did not become congested or chaotic; it is clearly neither. The Internet began to grow up, add weight, and resemble its information birth parent.

Evidence of this lies in the amusement we now hold for early search techniques. Why don't we still search for hotlink pages? Why can't a single person write a guidebook organizing all the resources in agriculture? Do we really need to use quotes with search engines?

The one ill-fitting piece to this jigsaw is the early guidebooks I long held so dear. It reminds me that even in the early pre-Web era, library science was there, evident, and making an impact. But that impact was initially minor compared to the results of computer science and visibility of commercially viable search engines.

As the Internet has grown up, dwarfing our simplistic search tools and techniques, we have put in their place more and more library science to deliver us from confusion. This trend will continue. In fact it will continue until the very nature of Internet research shifts monumentally from computer science to library science.

The relative gifts of computer science will be eclipsed by an understanding that Internet research is more about finding information than about searching - and finding information is intimately library science.

Yes, the whole concept of Internet research will detach itself from computer science and emerge as a discipline of library science. It will shift allegiance. The move is inevitable and I personally think it will take about 3 years.

What else could transpire? Could computer science absorb library science? Not likely. In the vast Internet, resembling in so many ways the reality of information research, computer science is relegated to a role in organizing discrete baskets of information - not the task of guiding research itself. The computing aspect of searching will become a sub-topic to the concept of Internet research.

As an aside, Internet cataloguing actually runs the opposite risk, of being absorbed into computer science. The relative gifts of thesaurus and classification schemes can be eclipsed by the more visible gifts of computer science - but that is another story.

How will we find information on an Internet with 50 billion records, where the largest index is but 3 or 4 billion records in size? The answer is with intellect, with skill, and primarily with the arsenal provided by library science. We will have a multi-tiered approach, where individuals with more skill will dig deeper and be more effective. We have been moving in this direction for a decade.

The totality and inevitability of this move is the inspiring event. Slowly, Internet searching will come to be seen as an element of Internet research. Internet research will assume the undisputed mantle of library science.

The digitizing of our lives never altered the need for assistance - just the type of assistance the community required. The new forms of assistance will relate to digital information. Viewing the library community in its widest context, that of assisting and facilitating access to information, the library community belongs here. This is your home. I see three effects:

1) There is no urgency to selling a message that the Internet needs a librarian. There is no need to sell your role to the community: There is only the need to be there when they learn they need you.

2) Priorities within the library community are changing. There are ways to prepare for these changes with training and legislation. I personally want to see libraries involved in teaching Internet research to the community. Soon the community will come to you seeking advice on how to undertake a challenging bit of Internet research. Will you be ready?

3) This should inspire the library community. Its destiny is assured. Librarians will be as important as they've always been.

History will describe the early Internet as an aberration; the one time when the Internet did not resemble the whole information sphere, in all its complexity, organization and beauty. History will remember these last few years as the one time when Internet research was not part of library science.

* * *
David Novak, founder of the Spire Project, delivers seminars on Exceptional Internet Research around the world. I hope to see you one day.

