Research Commentary by David Novak of The Spire Project

This article first appeared in print in ONLINE July/August 2003 (Vol.27 No.4), a publication of Information Today, Inc.
Please note, minor corrections have been made to this article.

Manipulating Forms to Improve Research
By David Novak

The Internet is resplendent with technologies well developed from a computing perspective but incompletely applied from an information/library perspective. Consider field searching. The use of field searching in quality assessment and traversing the Internet is poorly understood. Consider URL interpretation. So much more can be found buried in the URL than just country and organizational type. Consider another technology: Forms.

A form is the simple box for entering information on a Web page. Forms are found everywhere and used widely as doorways to databases. Delightfully, forms are portable and adaptable by nature. With minimal fuss, we can move, limit, restrict, and generally personalize the forms to publicly accessible databases - making it possible to improve and speed research.

Forms allow us to enter information on a Web page. We type our search terms into a form every time we visit Yahoo! or Google. Forms are an integral part of the Hypertext Markup Language (HTML), so all browsers support them. The underlying HTML tags are simple to understand. There are just five. Forms themselves are simple. It is only in connecting forms to something else that it quickly becomes complicated.

The wonderful nature of forms is their portability and adaptability. A form from one Web page can easily be shifted to another. Once moved, forms can be altered to make them simpler or more specific. Limits can be pre-defined with as little as a single HTML tag. Add JavaScript or an intervening script for even more flexibility. From an information research perspective, working with forms offers interesting opportunities to change the way we work with information.

HIDDEN INFORMATION REVEALED
The first opportunity involves breaking the barriers separating us from non-standard Web material. As the fine cataloguing work of Gary Price demonstrates [www.resourceshelf.com], there is a vast section of the Internet beyond the reach of the search engines. Colloquially known as the hidden Web, the invisible Web, or the deep Web, this region of the Internet is generally of higher quality and has tighter organization. A perhaps typical example would be the contents of the MEDLINE database - a collection of articles and abstracts about medical research. Information is enfolded within, available for all to see, if we first negotiate the search function to find the information.

Time has shown that developing large, free public databases can be enormously effective in distributing information. MEDLINE, ERIC, CRIS, LOCOC, EDGAR, SEDAR, US Patents, MOCAT: the list of large, commercial-quality databases goes on. When funded externally, they are very successful at reaching large audiences. These large publicly accessible databases are one of the few formats perfectly in tune with the Internet medium.

Unfortunately, their very existence as databases - accessed through a form - limits the way we can approach this information. To reach into and retrieve information from a database, we are restricted to the tools made available on the Web site hosting the database. As information professionals, we can do little more than point a patron to the database and exclaim: "Look! The information is in there somewhere."

Ah, but that is not quite accurate. If we can manipulate the forms to these databases, we can present patrons with a selection of just the material we think will interest them.

CONVERTING FORMS INTO LINKS
A simple approach to achieving this involves converting forms into static URLs. This is possible with forms that use Method=GET. Most notably, this includes the global search engines and MEDLINE.

For example, a search on Google.com for form manipulation will generate a Web page with the Web address of:

http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=form+manipulation

This page lists the top ten matches from the Google database with the words: form AND manipulation. If we were to visit this address, we would directly repeat this search. Make the above address the destination of a link, and we have a link to the results of this search. A search on a form thus converts into a static URL.
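As a minimal sketch, the search could sit on a page as an ordinary link. The link text is an illustrative choice; the address is the one given for this search, with each & written as &amp; as HTML expects inside an attribute value:

```html
<!-- A canned Google search: following the link repeats the search -->
<a href="http://www.google.com/search?hl=en&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=form+manipulation">Google results for "form manipulation"</a>
```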

My exploration of forms received a major boost from an article by Sunny Worel ("Integrating Medical Information into Web pages." EContent, August/September 1999) describing how MEDLINE information could be integrated more closely into Web pages. When the MEDLINE Database became available through PubMed, it became possible to craft a Web address that would point directly at a specific MEDLINE record. MEDLINE article abstracts can be treated just like Web pages. Link to them, frame them, and reference them, even though they are buried in a database.

URLs can also be created for specific PubMed searches. Like the Google search above, an important search can be cemented into a link. Anyone who clicks the link would repeat the search. As an elegant demonstration, a form of Current Awareness Service can be crafted: "Show me all the articles on corneal transplants in the last year." Such a search can be converted into a URL and then used again a month or a year later.

MEDLINE allows for this kind of manipulation. The PubMed interface was designed with this in mind. Interestingly, MEDLINE is not alone. This interconnectedness is more a feature of the Internet than of MEDLINE. Almost all Method=GET databases can be manipulated in this manner.

MOVING FORMS
We can do better. Many of our actions on the Internet are repetitive. If these actions are remotely related to a form, then moving forms offers a way to shorten and speed a search. We can often shave two or three steps from the research process. We can embed numerous related forms into one single page for our convenience. We can embed forms in information needed to use the tool effectively.

Consider a search of Google. If we could move the Google search form, we could place it on our own page, residing on our own computer. We would not need to wait the half-second for the page to return from Google. More significantly, on our page we can add information we need to effectively search Google; perhaps a reminder of Google's hidden field search terms.

A form starts with an HTML <form> tag. It ends with a </form> tag. The <form> tag has two primary attributes: a method (either GET or POST) and an action (a pointer to a Web address).

A basic form looks like this:
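A minimal sketch, assuming a hypothetical search program at www.example.org and a hypothetical variable name (neither comes from a real site):

```html
<!-- example.org and the field name "query" are placeholders for illustration -->
<form method="GET" action="http://www.example.org/cgi-bin/search">
  <input type="text" name="query" size="30">
  <input type="submit" value="Search">
</form>
```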

There are a few simple rules. 1) Forms can't overlap. We must close one form before starting another. 2) GET and POST are different methods of sending information and the difference can be important, kind of. 3) The action=address is the location of the program that will interpret the information we are sending.

What information is sent? Another tag, called an <input> tag, comes in a variety of different flavors depending on how it is displayed on the Web page. We have the <input type=text> for a textbox. We also have the <input type=radio> for radio buttons, <input type=checkbox> for checkboxes, and then of course the <input type=submit> for a button that triggers the sending of the information to wherever it is going. One further input box is special: the <input type=hidden>. It holds information hidden from view, meaning not displayed on the Web page.

Each input tag has a name. Think of this as the variable name. Each input tag may have further attributes: perhaps a size (how long a textbox do we want), a check (which radio button do we want selected at the start), or a value (is there a word already in our textbox?). Don't concern yourself with the numerous additional attributes. Most are cosmetic or self-explanatory. Greater help with form tags can be found at "HTML 2.0: Forms and Obscurities" [www.cwru.edu/help/interHTML/toc.html].

Two other tags have a slightly different construction. The multiple line textbox looks like this: <textarea name= rows= cols=> </textarea>. The select box, where we select from a list of existing values, looks like this: <select name=> <option> <option> </select>.
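To see the tags side by side, here is a sketch of one form that uses each of them. The action address and every variable name are hypothetical, chosen only to show the shape of each tag:

```html
<!-- All names and the example.org address are placeholders -->
<form method="GET" action="http://www.example.org/cgi-bin/search">
  <!-- textbox -->
  <input type="text" name="query" size="30" value="">
  <!-- radio buttons share one name; "checked" marks the initial choice -->
  <input type="radio" name="sort" value="date" checked> By date
  <input type="radio" name="sort" value="relevance"> By relevance
  <!-- checkbox -->
  <input type="checkbox" name="fulltext" value="yes"> Full text only
  <!-- select box: choose from a list of existing values -->
  <select name="count">
    <option value="10">10 results</option>
    <option value="40">40 results</option>
  </select>
  <!-- multiple line textbox -->
  <textarea name="notes" rows="3" cols="40"></textarea>
  <!-- hidden: sent with the form but not displayed on the page -->
  <input type="hidden" name="lang" value="en">
  <!-- the button that triggers the sending of the information -->
  <input type="submit" value="Search">
</form>
```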

Here is the form for Yahoo!, as retrieved in January 2003.

Don't worry about the style= on line 1. It's cosmetic. We may also notice there is no <form method=???. When not defined, it defaults to GET. Similarly when <input type=??? is not defined, as on line 2, it is assumed to be a textbox. Thus, we have a simple form starting with a <form action=???>, including a textbox (named 'p' and 30 characters long), a submit button (titled: Search) and an end to the form.

Keep in mind these tags are squeezed between other tags defining a table, some images, and perhaps a few words. I had to remove this unrelated information. To view the HTML of only the form, open the Web page with the form in Internet Explorer. Select the View drop-down menu, then select Source. The HTML for the page opens in Notepad. Remove everything above the <form> tag, then manually delete everything from there on that is not part of the form. The HTML for the form remains.

The form for the Google search, as found on Google.com, looks like this:

Again there is some less than critical information. Basically it starts with a <form action="???"> (line 1), proceeds to three hidden variables (line 2, 3 & 4), a textbox (line 5), and then two submit buttons aptly called "Google Search" and "I'm Feeling Lucky". Then the form ends.
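A sketch consistent with that description, not a verbatim copy of Google's page: the hidden variables are assumed to be the hl, ie, and oe that appear in the Google search address earlier, and the textbox size and the names of the two buttons (btnG, btnI) are assumptions.

```html
<form action="/search">
<input type="hidden" name="hl" value="en">
<input type="hidden" name="ie" value="UTF-8">
<input type="hidden" name="oe" value="UTF-8">
<input type="text" name="q" size="55">
<input type="submit" name="btnG" value="Google Search">
<input type="submit" name="btnI" value="I'm Feeling Lucky">
</form>
```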

The form is a simple technique for communicating information from a Web page to a computer program. To move a form, all we must do is tease out the elements of the form from the HTML page then - very importantly - add in the destination domain, which is often left out. Thus, if it reads <form action=/search>, we replace it with <form action=http://google.com/search>. When the action=address is relative, we must make it absolute.
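The change can be sketched as a before and after. The textbox and button here are stripped to the minimum; q is the variable name seen in the Google search address earlier.

```html
<!-- As found on the original page: the action is relative to google.com -->
<form action="/search">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>

<!-- After moving to our own page: the action carries the full address -->
<form action="http://google.com/search">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
```

Left relative, the first form would send its request to /search on whatever site hosts our page, which is why the domain must be added.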

For a time, I set about embedding numerous forms as part of an effort to explore and present guidance on Internet research from the perspective of information research. The Spire Project once included 25 articles, each teasing out the better tools required to accomplish particular searches. Over 125 embedded forms brought similar tools together, each sunk into discussion about when to use specific tools and how to search most effectively. Further details are at "The Spire Project: innovative gateway on the process of finding information", The New Review of Information Networking, Volume 6, 2000.

There is a significant advantage to aggregating information together. From one page, search the Library of Congress, WHSmith's Internet bookshop, MOCAT, and a database of free Internet books; forms stacked one after another, sequestered within another two dozen links, and ample discussion on relative size and search criteria. The result is quite impressive. Certainly a step towards reducing confusion. Consider which is simpler: a form on the page where we introduce a resource, or asking our patrons to visit an unfamiliar Web site, select a specific kind of search, then enter their search request into a form we know is more complicated than they require.

SIMPLIFYING FORMS
The joy in working with forms is we can do much more than just move the form around. We can change forms too. Make hidden information visible, make visible information hidden, and generally alter the form as we desire.

For example, perhaps we dislike how Google only presents us with ten matches at a time. Google allows for a special variable called num that tells Google how many items to display on the results page. The default for Google is a miserly ten matches. I prefer 40. Also, Google's three hidden variables are not really required. So, I will place on my homepage a simple form for a search of Google that returns 40 matches. It will look like this:

This form points to the right place, includes a box, a hidden variable, and a submit button.
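A sketch of a form matching that description. The textbox size and the button label are illustrative choices; q is the variable name seen in the Google search address earlier.

```html
<form action="http://www.google.com/search">
  <!-- the box for the search terms -->
  <input type="text" name="q" size="30">
  <!-- the hidden variable: ask for 40 matches instead of ten -->
  <input type="hidden" name="num" value="40">
  <input type="submit" value="Google Search">
</form>
```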

How did I know that Google has a variable called "num"? If we visit the Google Advanced Search page [www.google.com/advanced_search] we can see it has a little dropdown box with a few numbers in it. The HTML for this advanced page calls that variable "num." Just look for the line that says:

<select name=num>
<option value="100">100 results
</select>

All I did was replace the explicit select box with:

That is, I assigned the variable num a value of 40 but in a way not visible on the Web page.
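Side by side, the swap might be sketched like this (the closing option tag is added for tidiness):

```html
<!-- On the advanced search page: the visitor picks the value from a list -->
<select name="num">
  <option value="100">100 results</option>
</select>

<!-- In the simplified form: the value is fixed and nothing is displayed -->
<input type="hidden" name="num" value="40">
```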

All the form tags assign variables in the same way. We can swap among hidden variables, select boxes, textboxes, radio buttons, or check boxes, as we prefer. An <input type=text name=q value="form manipulation"> can be converted into <input type=hidden name=q value="form manipulation"> and so on.

The advanced and simple search forms generally send information to the same program for interpretation, so a variable that works on the advanced search will work on the simple search too. Furthermore, many hidden variables may not be necessary. If left unassigned, variables often have meaningful default values. In the Google form above, <input type=hidden name=hl value=en> defines the preferred language as English (coded as en). If left out, it defaults to English anyway.

Just remember, software designers are trying to make their software simple, so we have their help in our quest to make their information more useful.

Example 1: A search form for NewsBlip is placed beside a search form for NewsIndex. Both are newswire meta-searches. To keep matters simple, I set the search category to All and results ordered by Date. There are other options on Newsblip.com but a simple form is less confusing.

Example 2: I wish to enhance a group of links to regional search engines with the form to translate a foreign language Web page. I move the form for the Babel Fish translation engine (found on AltaVista) into place. However, the Babel Fish form is longer than I need, so I drop the section for translating a block of text and retain just the translation for a Web address.

Example 3: In an article on periodicals, I restrict the Library of Congress Online Catalog to just a search of periodicals held there. It is a good alternative to the more definitive Ulrich's International Periodicals Directory.

I was most fortunate to discover Google's hidden field search terms about two years ago. As I was working with the form to Google's advanced search, I came upon a variable called "allintitle". I already knew that on occasion, such variables also work from within the textbox. (Type allintitle:word into Google some day.) So I pondered. If there is an allintitle, could intitle exist too? Yes. Google accepts intitle: for a title search and inurl: for a URL field search.

FURTHER WAYS TO ADAPT A FORM
Sometimes adapting a form can be accomplished by adding something to the search request. Perhaps we wish to create a Google form that searches just Australian Web sites. There are two methods to arrange this. Firstly, arrange for site:au to appear within the search box when it is initially displayed. If left in the box, a search would be limited to Australia. Add an initial value like this:
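A minimal sketch of the first method; the textbox size and the button label are illustrative choices:

```html
<form action="http://www.google.com/search">
  <!-- The box starts out containing "site:au "; search terms are typed after it -->
  <input type="text" name="q" size="30" value="site:au ">
  <input type="submit" value="Search Australia">
</form>
```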

The second method involves a little JavaScripting - in this case, embedded in the <form> tag.

When the submit button is pressed, the onSubmit JavaScript is run, adding site:au to the contents of the textbox.
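One way this might be written, as a sketch rather than a tested recipe for every browser. The textbox is assumed to be named q, and the button label is illustrative:

```html
<!-- onSubmit runs just before the form is sent; "this" is the form itself -->
<form action="http://www.google.com/search"
      onSubmit="this.q.value = this.q.value + ' site:au'; return true;">
  <input type="text" name="q" size="30">
  <input type="submit" value="Search Australia">
</form>
```

Unlike the first method, the restriction never appears in the box while the patron is typing, so it cannot be deleted by accident.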

Forms can also be simplified to be nothing more than a button. Turn all the variables into hidden variables, leaving just the submit button (to trigger the sending of information). This creates a situation similar to converting forms into links as described earlier. And this time it works for the method=POST forms. Yes, all forms can be converted into static direct links or canned searches. If necessary, a redirection Perl script allows us to convert a method=POST into method=GET.
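A sketch of such a button, using Google and the search from earlier in this article; the button label is an illustrative choice:

```html
<!-- A canned search: every variable is hidden, only the button shows -->
<form method="GET" action="http://www.google.com/search">
  <input type="hidden" name="q" value="form manipulation">
  <input type="hidden" name="num" value="40">
  <input type="submit" value="Form manipulation on Google">
</form>
```

For a database that expects POST, the same pattern would apply with method="POST" and that database's own action address and variable names.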

REACHING FURTHER WITH SCRIPTS
Scripts offer greater flexibility in many situations. Sometimes to access a database, we must initially retrieve a current UserID number before we are given the form to submit a search. The Library of Congress Catalog is a fine example of this. The solution is again to harness the interconnected nature of the Internet, and go fetch a current UserID number. I do this with a little spider written in Perl. A form directs the request to my Perl script, which nips over to grab a current UserID number, then submits the request using the newly retrieved UserID. The results appear as a one-step search of the Library of Congress catalog.

This script, minus the spider (spiders can be dangerous), can be found at http://spireproject.com/forms1.htm

While grazing the Internet, I found a JavaScript that places the cursor waiting in the textbox when a page is first loaded. This creates a slightly faster search, since we no longer need to reach for a mouse to click on the search box before we begin typing. It also allows us to copy and paste words into a search box faster. This script can be found at http://spireproject.com/forms2.htm
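As a rough sketch of the idea, and not a copy of the script at that address, one common approach is a single line of JavaScript in the <body> tag. It assumes the search form is the first form on the page and its textbox is named q:

```html
<!-- When the page finishes loading, put the cursor in the textbox named q -->
<body onLoad="document.forms[0].q.focus();">
<form action="http://www.google.com/search">
  <input type="text" name="q" size="30">
  <input type="submit" value="Search">
</form>
</body>
```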

On another occasion, I wished for a single textbox to search several different databases. Which database would depend on which radio button was selected. Essentially, I sought to overthrow the "Forms can't overlap" rule mentioned earlier. After some research I uncovered a suitable script and set about adapting that script to suit my purpose.
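A minimal sketch of the idea, not the script mentioned above: each radio button carries the address of a search program, and a short JavaScript function sends the browser there with the search terms attached. The second engine (search.example.org and its query variable) is hypothetical, and the names runSearch, terms, and engine are illustrative choices.

```html
<script type="text/javascript">
// Send the browser to the address held by the checked radio button,
// with the contents of the textbox added to the end.
function runSearch(form) {
  for (var i = 0; i < form.engine.length; i++) {
    if (form.engine[i].checked) {
      window.location.href =
        form.engine[i].value + encodeURIComponent(form.terms.value);
    }
  }
  return false; // cancel the normal submission; we have already moved on
}
</script>

<form onSubmit="return runSearch(this);">
  <input type="text" name="terms" size="40">
  <input type="radio" name="engine" checked
         value="http://www.google.com/search?num=40&amp;q="> Google
  <!-- hypothetical second engine, for illustration only -->
  <input type="radio" name="engine"
         value="http://search.example.org/find?query="> Example engine
  <input type="submit" value="Search">
</form>
```

Because the script builds the address itself, each engine can use its own variable name for the search terms, which a single ordinary form cannot do.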

A UNIFIED SEARCH ENGINE FORM
What is the ideal homepage? I have researched, scripted, and adapted forms into what I call the Unified Search Engine Form. It facilitates access to several of the top global search engines. It also demonstrates most of the techniques expressed in this article. I isolated and moved the forms to five global search engines, placing them on the same Web page. I then modified an interesting JavaScript capable of merging several forms into a single form - a single textbox and five radio buttons - each radio button a different search engine.

To further enhance this form, I asked each search engine to provide 40 or so matches instead of a miserly ten. I next arranged for the cursor to wait in the textbox when the Web page is first loaded. I then instructed the form to translate the field search syntax and, meaningfully, to exchange 'http://' for 'url:' - so if I copy and paste in a Web address, it becomes a URL field search for additional local information. Lastly, this page is my home page, on my computer, available at a moment's notice whenever I press my home button on my Web browser.

It is a strange result. Not a meta-search, but a unified search; a search tool to help me use the Internet faster and more effectively. And this is my offering on just how far we can adapt and improve the forms to publicly accessible databases [http://spireproject.com/plus.htm]. I invite anyone to place it where they wish.

THE BEAUTY OF FORMS
Forms are portable and adaptable. This arises from the very nature of the Internet. It challenges us to link, reference, and incorporate information in novel and innovative ways. It entreats us to improve on existing information. It requests we recognize all information is fluid and can be co-opted on the Internet.

Form manipulation is more than a timesaving parlor trick. It minimizes confusion and draws attention to specific information or collections of information in a very precise and informative manner.

* * *
David Novak, founder of the Spire Project, delivers seminars on Internet Research around the world. I hope to see you one day. See SpireProject.com for details.

Copyright © David Novak 2003.