School of Computer Science and Software Engineering,
Monash University, Melbourne, Australia
Jason.Lowder@csse.monash.edu.au and Xindong.Wu@csse.monash.edu.au
Searching the Internet can be done in a number of ways, none of which may be immediately obvious to the Web novice. Browsing, hierarchical menus, and keyword searches all provide access to information, and some searches require mathematics-like syntax in order to return meaningful results. Tools exist to allow comprehensive searches of the WWW through multiple search engines (such as Savvy Search). A problem with present search-based WWW information retrieval is that it does not support a novice who cannot categorise information, cannot recognise keywords with which to retrieve it, and is unfamiliar with the use of search engines. [Pinkerton 1994] points out that the average query is only 1.5 words long. AltaVista's Java-based category browser tries to improve on this by providing a graphical, network-like term structure. If a user is still learning a topic domain, however, their ability to describe the desired information and the relationships within that domain is limited.
Web search results are normally a set of links, with a selection of pages as targets. These pages must be stored to keep the links, while the relationships to the ideas and information which prompted the query are lost entirely. This forfeits much of the learning power hypertext provides through the forming of relationships. Moreover, because hypertext can be updated, relationships can be lost when a document's content, focus or structure is altered. To ensure that a link remains relevant, some kind of "snippet" of information describing the original document should be kept. These are problems with the reading tools, not problems that can be solved on a site-by-site basis. At present, a learner who wishes to search the Web must come to grips with a new indexing method and develop their own strategy for collecting information before they can deal with the structure of the material itself. Any new interface for searching should therefore provide: consistency across platforms, browsers and hyperdocuments; metaphors drawn from normal reading tasks, for learnability; and new features through re-use of present interface elements.
[Kabbash and Buxton 1995] discuss the use of "Area Selection" as a mechanism for improving the accuracy of selection: selecting a graphical object with a wider cursor captures the "area" of the user's interest. Areas of text also trigger the drive to find more related information, and some of these areas are often not seen as relevant or related by an author. In a distributed hypertext environment, it makes sense to let readers form their own links between pages. Basing this idea around the already familiar cut/copy mechanism provided by windowing systems, we can build a searching metaphor called the Wide Area Selection Search Interface (WASSI). This interface allows users to select relevant areas of a document and submit these as a search query, while preserving the text used and making the returned matches persistently linked to this selection of text. These kinds of links we refer to as "squishy links" (somewhere between a hard link and a soft link). A squishy link, as the name suggests, is malleable, may or may not dangle, and is reader controlled, the links occurring only in the reader's browser/WASSI interface.
After selection, keyword extraction takes place, involving prioritisation of the words according to their significance within the document. If a user wishes to collect multiple area selections, they can return to the browser, select another area, and add this to the query. A prioritisation window allows vetting via a list of words with checkboxes, through which words can be included in or excluded from the query. A Cut Off number sets the number of occurrences of a word in the text required to turn the word's check box on by default. A No Stopwords check box, when turned on, keeps stopwords out of the query entirely; otherwise, words which occur in an extended stopword list go to the list as unchecked words, provided their occurrences exceed the Cut Off setting. A stopword threshold is a user-set percentage: once that percentage of overall queries have used a stopword as a keyword, the word is removed from the stopword list, partially automating the recognition of stopwords which have become keywords.
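The vetting behaviour described above can be sketched as follows. This is a minimal illustration, not WASSI's implementation: the stopword list, function name and tuple layout are assumptions made for the example.

```python
from collections import Counter

# A small illustrative stopword list; a real deployment would use the
# extended list the paper describes (this set is an assumption).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that"}

def build_vetting_list(selection, cut_off=2, no_stopwords=False):
    """Return (word, count, checked) entries for the prioritisation window.

    A non-stopword is checked by default when its occurrence count
    reaches the Cut Off. With No Stopwords on, stopwords are dropped
    entirely; otherwise a stopword whose count exceeds the Cut Off is
    listed unchecked, so the user may still promote it to a keyword.
    """
    counts = Counter(w.strip(".,;:!?\"'()").lower() for w in selection.split())
    counts.pop("", None)
    entries = []
    for word, n in counts.most_common():
        if word in STOPWORDS:
            if no_stopwords or n < cut_off:
                continue
            entries.append((word, n, False))   # listed but unchecked
        else:
            entries.append((word, n, n >= cut_off))
    return entries

text = "The squishy link links the selection to the search results"
for word, n, checked in build_vetting_list(text, cut_off=2):
    print(word, n, checked)
```

With the sample selection above, only "the" exceeds the Cut Off, and because it is a stopword it is listed unchecked; every content word appears once and so defaults to unchecked as well.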
The method for search term extraction is based on [Augusti and Marchetti 1992]. Terms are considered firstly against the entire collection of documents of interest and, secondly, against the conceptual plane on which they occur. Difficulties arise here because distributed hyperdocuments may have no obvious topic boundary, making the conceptual plane hard to define. Search terms are considered more important if they occur in the title or in any describing meta information, and a meta keyword is considered a more accurate indicator than the document description. If a conceptual plane is to be constructed, links can be traversed until keywords no longer occur within a document, indicating that the document is outside the scope of the search for term ranking.
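The location-based weighting just described (title above meta keywords, meta keywords above the description, body text last) might be sketched as below. The specific weight values and the dictionary-shaped document representation are assumptions for illustration, not figures from [Augusti and Marchetti 1992].

```python
# Illustrative weights reflecting the ordering in the text:
# title > meta keywords > meta description > body (values are assumptions).
LOCATION_WEIGHTS = {
    "title": 3.0,
    "meta_keywords": 2.0,
    "meta_description": 1.5,
    "body": 1.0,
}

def score_terms(document):
    """document maps a location name to a list of lowercase terms;
    each occurrence contributes its location's weight to the term score."""
    scores = {}
    for location, terms in document.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        for term in terms:
            scores[term] = scores.get(term, 0.0) + weight
    return scores

doc = {
    "title": ["hypertext"],
    "meta_keywords": ["search", "hypertext"],
    "body": ["search", "links", "search"],
}
ranked = sorted(score_terms(doc).items(), key=lambda kv: -kv[1])
```

Here "hypertext" outranks "search" despite fewer occurrences, because its occurrences fall in the title and meta keywords.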
The WASSI search engine interface comprises two components. A Search Manager controls the operation of a number of Search Engine Interfaces, each tailored to communicate with a specific search engine. The Search Manager controls activation of the search engine interfaces (determined by the user), gives them a standard query and collates their returned results. Some search results include a match relevancy figure, or at least return the most closely matching documents first; this ordering, and any ranking that is provided, is used to rank and combine the returned documents. For search results that do not include such indices, a match index percentage can be calculated based on the formula suggested by [Salton 1989].
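One way to collate such results might look like the sketch below. The cosine measure here is a standard vector-space similarity in the spirit of [Salton 1989], not necessarily the exact formula the system uses, and the function names and result format are assumptions.

```python
import math
from collections import Counter

def cosine_match(query_terms, doc_terms):
    """Cosine similarity between term-frequency vectors: a simplified
    vector-space match index in the spirit of Salton's model."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def collate(results_by_engine, query_terms):
    """Merge per-engine results into one ranked list. Each result is
    (url, relevance_or_None, doc_terms); when an engine supplies no
    relevance figure, fall back to the computed match index."""
    merged = {}
    for results in results_by_engine.values():
        for url, relevance, doc_terms in results:
            score = relevance if relevance is not None else \
                cosine_match(query_terms, doc_terms)
            merged[url] = max(merged.get(url, 0.0), score)
    return sorted(merged.items(), key=lambda kv: -kv[1])
```

Taking the maximum score for a document seen by multiple engines is one simple combination policy; weighted averaging across engines would be an equally plausible choice.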
WASSI is an advance on search interfaces and information collection in distributed hypermedia. Disorientation and the forgetting of links are reduced because links, searches and results are persistent, forming squishy links between previously unlinked documents. Query accuracy improves because relevant keywords are more likely to appear together in adjacent selected text. Link text is stored to keep a persistent record of link context. Users are presented with a familiar metaphor, that of the highlighter, with which to mark and collect information. Moreover, users can form squishy links between documents by highlighting text which is of interest, vetting the keywords to be sent to the search mechanism and organising the relevant links.