Research Commentary on the Spire Project

Search Engines need Finesse.
By David Novak

Searching the web is often just wishful thinking. I truly question the sense in sending a single word to a large search engine, then looking at the top 20 webpages from a list of 50,000. How silly, then, to assume you have found what the Internet has to offer.

This is computing at its most optimistic!

And on this point, may I also express my amusement when a friend exclaims, "I searched the Internet and didn't find a thing." Huh? You searched the Internet? Please! Even searching four of the top search engines scarcely qualifies as a search of the web. And there is so much more to the Internet than the web. Besides, you probably just typed in the wrong words.

To do a proper search, you need to understand a little bit about how search engines work. Search engines reach into a database filled with the text of a great many webpages. Search engines like Altavista (altavista.com) and All-the-Web (alltheweb.com) have databases of over 350 million webpages.

During the building of these databases, certain parts of each webpage are recorded separately, like the title and web address. The technical term for each of these parts is a 'field'. It works in much the same way that books in a library are indexed by title, author and subject. The US Library of Congress catalogue (catalog.loc.gov) has more than forty fields, but search engines tend to have just a few: web address, title, maybe language and link.
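To make the idea of a field concrete, here is a minimal sketch in Python. It is not how any real search engine stores its data; the PageRecord class, the field_contains helper and the two example pages are all invented for illustration. It only shows why a search limited to the title field returns fewer pages than a search of the full text.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """One indexed webpage, split into separately searchable fields (hypothetical layout)."""
    url: str
    title: str
    body: str


def field_contains(record: PageRecord, field: str, word: str) -> bool:
    """Return True if `word` occurs as a whole word in the named field."""
    text = getattr(record, field)
    return word.lower() in text.lower().split()


# Made-up pages, purely for illustration.
pages = [
    PageRecord("http://example.com/a", "Patent Research Guide",
               "How to search patent databases"),
    PageRecord("http://example.com/b", "Gardening Tips",
               "One reader holds a patent on a trowel"),
]

# Searching the body finds both pages; searching the title finds only the first.
title_hits = [p.url for p in pages if field_contains(p, "title", "patent")]
body_hits = [p.url for p in pages if field_contains(p, "body", "patent")]
```

Here the word 'patent' appears somewhere on both pages, but only one page considers it important enough to put in the title.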

Which leads me to ask, when was the last time you searched for a webpage by title?

Search engines also record the order in which words appear on each webpage, so we can craft a search request for words appearing next to each other on the webpage. The term for this is 'proximity' and it allows us to search for a phrase like "patent research". Commercial databases, like the ones on CD-ROM in your local library, let you dictate precise distances between desired words. On the web, we usually just have quotation marks (" ") to keep words together.
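The following Python sketch shows how recorded word positions could support both phrase searching and the 'within so many words' searching that commercial databases offer. It is a toy under stated assumptions: the function names are invented, words are split on spaces only, and real engines use far more elaborate indexes.

```python
def word_positions(text):
    """Map each lowercase word to the list of positions where it appears."""
    positions = {}
    for i, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(i)
    return positions


def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively, in order, in `text`."""
    words = phrase.lower().split()
    positions = word_positions(text)
    if not words or any(w not in positions for w in words):
        return False
    # Try every position of the first word as a possible start of the phrase.
    return any(
        all(start + offset in positions[w] for offset, w in enumerate(words))
        for start in positions[words[0]]
    )


def within_distance(text, word1, word2, max_gap):
    """True if the two words occur at most `max_gap` positions apart, in either order."""
    positions = word_positions(text)
    return any(
        abs(a - b) <= max_gap
        for a in positions.get(word1.lower(), [])
        for b in positions.get(word2.lower(), [])
    )
```

With these helpers, contains_phrase("a guide to patent research online", "patent research") is true, while the same phrase is not found in "research on a patent", even though both words are present.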

Field searching and proximity are the keys to a more valuable search. Search engines have made great strides in using fields, proximity, +/- and something called link analysis to automatically rank the more useful sites in the first twenty they show you. Don't trust this ranking system. You can do so much better simply by crafting a more precise search request.

Let's now look at the use of the plus + symbol. If you search for patent research, you are asking for all the webpages with the words patent or research. This is why the number of matches is so very large. If you want just the webpages with both words, you need to type: +patent +research.
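The difference between the two requests is the difference between combining two lists of pages and keeping only the pages common to both. A small Python sketch, using an invented index of page numbers, makes the point. The data and the function names are assumptions for illustration only.

```python
# Toy index: each word maps to the set of page ids that contain it (made-up data).
index = {
    "patent": {1, 2, 3, 5},
    "research": {2, 3, 4, 6, 7},
}


def match_any(index, words):
    """Pages containing at least one of the words (a plain search: patent research)."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return result


def match_all(index, words):
    """Pages containing every word (a search with plus signs: +patent +research)."""
    sets = [index.get(word, set()) for word in words]
    return set.intersection(*sets) if sets else set()


any_hits = match_any(index, ["patent", "research"])
all_hits = match_all(index, ["patent", "research"])
```

In this toy index the plain search matches seven pages, while requiring both words leaves only two.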

Please don't confuse the plus + symbol with AND. The two are equivalent in theory, but AND is much more confusing to use. For reasons beyond my comprehension, word1 and word2 often means word1 is optional.

NOT, or the minus - symbol, works very predictably. The other option, OR, does not work reliably across the many search engines and is best ignored.

By mixing fields, proximity and the +/- symbols together, you can be very specific about the webpages you want to consider. In all cases, aim to get the number of matches you will want to look at; certainly no more than a hundred. If you have more, limit your search further. Simply add more words, fields, proximity and +/- symbols to your search.
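As a rough illustration of mixing these pieces, the Python sketch below assembles a search request from required words, excluded words and an optional title restriction. The build_query function is hypothetical, and it uses the title: prefix in the Altavista style; other search engines spell their fields differently, so treat the output as a pattern rather than something every engine accepts.

```python
def build_query(required=(), excluded=(), optional=(), title=None):
    """Assemble a search request string from its parts (hypothetical helper).

    Multi-word terms are wrapped in quotes so they are searched as a phrase.
    """
    def quote(term):
        return f'"{term}"' if " " in term else term

    parts = []
    if title:
        # Assumes the Altavista-style title: field prefix.
        parts.append(f"+title:{quote(title)}")
    parts += [f"+{quote(term)}" for term in required]
    parts += [f"-{quote(term)}" for term in excluded]
    parts += [quote(term) for term in optional]
    return " ".join(parts)


narrow = build_query(title="patent", required=["patent research"])
narrower = build_query(title="patent research")
with_exclusion = build_query(required=["patent", "research"], excluded=["lawyer"])
```

Each extra piece narrows the request. If the number of matches is still above a hundred, add another required word, phrase or exclusion and try again.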

Some occasions only need a simple search. If you know the name of a site, and just need the web address, most any search will do. There are advantages to using Google (google.com) or a meta-search engine (like www.debriefing.com) but you can certainly use your favourite search engine too.

There are numerous pitfalls to crafting search queries. Altavista can catch you with capital letters. Google has problems with plurals. Field searches are poorly described or only available through the advanced search pages, and no two search engines use fields in the same way. (title: works on Altavista, normal.title: for All-the-Web.)

Search engines are presented as the simplest way to navigate the web. They often are, especially if you have just a little finesse in crafting your search request. As our examples illustrate, there is a world of difference between a search for 'patents' and a search for +title:"patent research".
* * *
David Novak manages The Spire Project, an Internet research resource and thinktank.

Search Request                        Number of Matches
patent research                              29,600,081
+patent +research                               418,576
"patent research"                                 1,910
+"patent research" +"search guide"                    9
Source: All-the-Web, 14th October 2000

Search Request                        Number of Matches
patent research                               1,509,565
+title:patent +research                           4,491
+title:patent +"patent research"                     93
+title:"patent research"                             17
Source: Altavista, 14th October 2000


Copyright © David Novak 2002.