Intranet Search Engines
May/Jun 2002 Issue, posted May 1, 2002

As intranets grow and provide access to more and more documents, their value grows. But the larger the collection, the harder it becomes to find that important presentation, contract, or HR form. Enterprise Information Portals (EIPs) provide a starting point to intranets, and a search engine helps locate information, including archives and unstructured data. Search engines need to be tuned, and their indexes kept up to date, to provide the best answers.

How a Search Engine Works
An intranet search engine works in much the same way as Web-wide search engines such as AltaVista or Google. The search engine locates the documents, extracts the text, and stores it in an index file, making an entry for each word. When an employee types a word into an HTML form and clicks the Search button, the browser sends the query to the server. The search engine receives the query, looks for matching words in the index file, gathers related document information, sorts the documents by relevance, formats the results into HTML, and sends the page back to the user, all within a few seconds. The best search engines have good defaults in all these aspects and allow search administrators to override them when appropriate.
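The index-then-query cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the function names and sample documents are hypothetical, and a real engine would add relevance ranking, file format conversion, and persistent storage.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, for illustration only.
documents = {
    "hr/vacation.html": "Vacation request form for employees",
    "legal/contract.html": "Standard contract template",
    "hr/benefits.html": "Employee benefits and vacation policy",
}
index = build_index(documents)
matches = search(index, "vacation form")
```

Here `matches` holds only the vacation request page, because it is the only document containing both query words.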

Several indexing aspects require attention from the intranet site manager. Indexing integrates content from many sources: pages on internal sites, content management systems, internal databases, mailing list archives, e-mail public folders, analyst reports, data feeds, journal subscriptions, and more.

Gathering documents for indexing
Search engine indexing robots start with the server host names and follow links on Web pages to index the text. This is how the giant Web-wide search engines, such as AltaVista and Google, locate pages, and it works pretty well for intranets. Search engine administration systems should allow control over which hosts get indexed and exclude problem areas. For example, you may want to index the current calendar pages, but not those for 5 years out.
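As a rough sketch of this kind of control, the check below decides whether a robot should index a URL. The host names and excluded path prefixes are assumptions made up for the example; commercial products expose similar rules through their administration tools.

```python
from urllib.parse import urlparse

# Hypothetical host names and path prefixes, for illustration only.
ALLOWED_HOSTS = {"intranet.example.com", "hr.example.com"}
EXCLUDED_PREFIXES = ("/calendar/2007/", "/cgi-bin/")

def should_index(url):
    """Return True if the URL is on an allowed host and outside excluded areas."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    # str.startswith accepts a tuple and checks each prefix in turn.
    return not parts.path.startswith(EXCLUDED_PREFIXES)
```

With these settings, a robot would index the current calendar pages under /calendar/2002/ but skip the far-future pages under /calendar/2007/.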

Search engines can index mounted file servers directly, without going through a Web interface. Some can interface with e-mail server public folders, content management systems, databases (usually via ODBC or JDBC connections), and external data feeds such as Reuters news.

The index should be kept current: As soon as the new content is published, it should be indexed. Publishing or content management systems can notify the indexer of new data; otherwise, index the frequently changing areas more often. If the search engine cannot respond to queries when updating, use mirrored servers or switch search engines.

Indexing Content
In addition to HTML, XML, and text, intranet search engines deal with binary file formats such as PDF, MS Office formats (including Word, Excel, and PowerPoint), WordPerfect, and others. Most indexers come with file format converters, and if this is not the case, translators are available from such vendors as Stellent (Outside In, formerly from Inso). It is important to keep translators up-to-date, as the binary formats may change and make the new files invisible to the search engine.

The index should store the entire content of every file, even very long documents. It should keep every word and the word position in the document, for later phrase searching and match highlighting. Many intranets span countries, but some search engines are limited to English. If the intranet has text in other languages, the search engine must handle extended characters such as é and ß, recognize alternate punctuation, apply appropriate stemming, and provide an interface in the users' languages.
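To show why word positions matter, here is a minimal sketch of a positional index and a phrase search built on it. The function names are hypothetical. In Python 3 the `\w` pattern is Unicode-aware for strings, so words containing characters such as ß are kept whole.

```python
import re
from collections import defaultdict

def build_positional_index(documents):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return the IDs of documents containing the words of the phrase in sequence."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return set()
    matches = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every later word must sit at the expected offset from the first.
            if all(start + offset in index.get(word, {}).get(doc_id, [])
                   for offset, word in enumerate(words)):
                matches.add(doc_id)
                break
    return matches
```

The same stored positions can drive match highlighting, since they identify exactly where each query word appears in the document.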

Intranets generally include various levels of security and access controls, and the index should store this information, so it can show only the accessible content in the search results. For high-security content, it is a good idea to create a separate index file to avoid commingling private and public text.
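One simple way to picture this is a filter applied to the matching documents before the results page is built. The sketch below assumes the index stores a set of allowed groups with each document; the group names and URLs are invented for the example.

```python
def visible_results(results, user_groups):
    """Keep only the results whose allowed groups overlap the user's groups."""
    return [doc for doc in results if doc["groups"] & user_groups]

# Hypothetical search hits with the access groups stored in the index.
results = [
    {"url": "/products/features.html", "groups": {"everyone"}},
    {"url": "/hr/salaries.xls", "groups": {"hr-managers"}},
    {"url": "/eng/roadmap.doc", "groups": {"engineering", "executives"}},
]
anonymous = visible_results(results, {"everyone"})
engineer = visible_results(results, {"everyone", "engineering"})
```

An anonymous visitor sees only the product features page, while a logged-in engineer also sees the roadmap. Neither sees the salary spreadsheet.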

Intranet Metadata, for Better or Worse
Metadata is information about information. It can be metatags in an HTML header section, XML tag names, or database field names. Metadata is helpful when it is implemented well, but can cause problems if it is inconsistent or repetitive.

The basic HTML metadata are the title tag, the file size, and the file modification date. Some people include an abstract of the page in the meta description tag. The Dublin Core standard provides some other basic metadata tags such as author, publication information and the actual date of changes in the publication (dc.date.modified).

In many cases, the metadata is not consistent or even coherent, even for title tags. People forget to edit titles, little realizing how annoying this is in the search results; the problem is particularly common in framed pages. Binary files (such as MS Word) are even worse: People tend to copy templates without changing the Properties tags, so the tags are duplicated. While it is possible to train intranet site managers and publishers to check titles and tags before posting to the intranet, it is an extra burden on them. One intranet manager gave up and stopped tracking binary file metadata, while another implemented a publication process that required a title before the page could be posted.
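A pre-publication check for missing or duplicated titles could look something like the sketch below. It uses Python's standard `html.parser` module; the class and function names are hypothetical, and the check covers HTML pages only, not binary file properties.

```python
from collections import defaultdict
from html.parser import HTMLParser

class MetadataParser(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_metadata(html):
    """Return (title, {meta name: content}) for one HTML page."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta

def find_title_problems(pages):
    """Return (URLs with no title, {title: URLs} for titles used more than once)."""
    by_title = defaultdict(list)
    missing = []
    for url, html in pages.items():
        title, _ = extract_metadata(html)
        if title:
            by_title[title].append(url)
        else:
            missing.append(url)
    duplicates = {t: urls for t, urls in by_title.items() if len(urls) > 1}
    return missing, duplicates
```

A publishing workflow could run `find_title_problems` over new pages and refuse to post any that appear in either list.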

Important Features for Intranet Search

Search functionality is divided into several parts: the search form and query options, the search engine retrieval and relevance ranking, and results display.

Search Forms
The Web has trained everyone to expect a simple free-text search field, so intranets should provide a familiar field with a Search button on the home page and on navigation lists. It is nice to provide a simple search page with the most useful options, such as search zones—big categories such as departments or countries. Save the advanced search for power users and librarians.

Search forms must take into account the security culture of an organization. For example, one telecommunications company intranet offers an option to search standard data, such as product features and company holidays, without requiring anyone to log in. Once the employee does log in, the search engine only displays documents that the person and their groups are allowed to access.

Search Functionality
When the user clicks the Search button, the browser sends a query to the search engine server, which looks for the words in the index file. Some search engines also use stemming to locate singular and plural forms of words, a spell-checker to catch typing mistakes, or a synonym list, so a person looking for "doctor" would also find "physician." By adjusting synonyms, one health benefits search engine significantly improved the quality of search results.
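The sketch below illustrates query expansion with a synonym list and a very crude plural-stripping rule. Both are assumptions for the example: the synonym entries are invented, and real engines use proper stemming algorithms rather than this shortcut.

```python
# Hypothetical synonym list; a real one is tuned to the organization's vocabulary.
SYNONYMS = {
    "doctor": {"physician"},
    "physician": {"doctor"},
}

def naive_stem(word):
    """Strip a trailing plural 's'. A rough stand-in, not a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def expand_query(query):
    """Return the set of terms to look up in the index for this query."""
    terms = set()
    for word in query.lower().split():
        stem = naive_stem(word)
        terms.add(stem)
        terms |= SYNONYMS.get(stem, set())
    return terms
```

For this to work, the words in the index must be stemmed with the same rule, so that "doctors" in a document and "doctor" in a query meet at the same entry.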

Once it locates the matches, the search engine gets information about the associated documents, such as URLs and titles. It sorts the documents by relevance, as defined by an internal set of rules based on the frequency of matched terms in the documents, whether they occur as phrases, and their location in the document. Some search engines let administrators specify which areas and URL paths are more important, such as product home pages, and which are less valuable, such as online conferencing and old calendars. With personalization and collaborative filtering, search results are weighted according to employee roles and documents others have found useful.
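As a toy example of these rules, the sketch below scores documents by term frequency, gives extra credit for title matches, and then applies administrator-defined weights by URL path. The weights, the factor of 3 for titles, and the sample paths are arbitrary assumptions; each product has its own internal formula.

```python
def path_weight(url, path_weights):
    """Return the factor for the first matching URL prefix, or 1.0 if none match."""
    for prefix, factor in path_weights:
        if url.startswith(prefix):
            return factor
    return 1.0

def score(doc, query_terms, path_weights):
    """Count term matches, favoring the title, then scale by the path weight."""
    words = doc["text"].lower().split()
    title_words = doc["title"].lower().split()
    base = 0.0
    for term in query_terms:
        base += words.count(term)
        base += 3 * title_words.count(term)  # arbitrary boost for title matches
    return base * path_weight(doc["url"], path_weights)

def rank(docs, query_terms, path_weights):
    """Return the URLs of matching documents, most relevant first."""
    scored = [(score(d, query_terms, path_weights), d["url"]) for d in docs]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [url for s, url in ordered if s > 0]

# Hypothetical administrator settings: boost product pages, demote calendars.
PATH_WEIGHTS = [("/products/", 2.0), ("/calendar/", 0.25)]
```

With these weights, a product home page that mentions a term once can outrank an old calendar page that repeats it several times.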

Search Results Pages
Search results are not a place to surprise users with experimental interfaces. It is best to conform to the basic conventions of Web search results, with a listing of documents showing titles and descriptions. The results pages of the public Web search engines are a good place to look for useful features.

Search Problems and No-Matches Pages
Searches fail for various reasons:

• The user forgets to type anything in the search field
• The user is searching for text that is not in the scope of the index
• The user is using a term that is not used in the index (such as sick day vs. PTO)
• The user has made a spelling or typing mistake
• The user is doing a search in which not all the query requirements are met (for example, one word was matched but the other was not).

To avoid common search failures, create a page that explains these errors and helps users understand what is within the scope of the search engine. If a taxonomy or hierarchy exists, display it on the page to allow users to drill down through the categories.
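A no-matches page can also tailor its advice to the likely cause of the failure. The sketch below is one possible approach, assuming the engine can supply the list of indexed words and a phrase-to-preferred-term mapping; it uses the standard `difflib` module to suggest spelling corrections. The function name and message wording are hypothetical.

```python
import difflib

def diagnose_no_match(query, vocabulary, preferred_terms):
    """Return a list of hints for a query that found no documents."""
    words = query.lower().split()
    if not words:
        return ["Please type one or more words in the search field."]
    phrase = " ".join(words)
    if phrase in preferred_terms:
        return [f'No pages use "{phrase}"; try "{preferred_terms[phrase]}".']
    hints = []
    for word in words:
        if word in vocabulary:
            continue
        close = difflib.get_close_matches(word, vocabulary, n=1)
        if close:
            hints.append(f'Did you mean "{close[0]}"?')
        else:
            hints.append(f'"{word}" does not appear in the indexed documents.')
    if not hints:
        hints.append("Each word was found, but no document contains all of "
                     "them; try fewer words.")
    return hints
```

Each branch corresponds to one of the failure reasons listed above: an empty query, a term the organization does not use, a typing mistake, text outside the index, or a query whose words never occur together.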

Intranet Search—Scaling to Millions
Intranets tend to grow exponentially, and this puts a big load on the intranet search engine. Concurrent searches and especially the increasing number of documents indexed add to requirements for RAM and disk space. In the long run, large intranets have to move to multiple search servers, whether that means separating collections onto separate servers or using a high-end search engine that automatically distributes the search among servers.
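When collections are split across servers, something has to combine the separate result lists into one. The sketch below shows the merging step only, assuming each server returns (score, URL) pairs already sorted from best to worst and that scores are comparable between servers, which is not guaranteed in practice.

```python
import heapq
from itertools import islice

def merge_results(per_server_results, limit=10):
    """Merge per-server lists of (score, url), each sorted by descending score."""
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*per_server_results, key=lambda hit: -hit[0])
    return list(islice(merged, limit))

# Hypothetical results from two search servers.
server_a = [(0.9, "/sales/forecast.html"), (0.4, "/sales/archive.html")]
server_b = [(0.7, "/hr/policy.html"), (0.1, "/hr/old-policy.html")]
top_three = merge_results([server_a, server_b], limit=3)
```

Because the merge is lazy, only as many hits as the results page needs are pulled from each server's list.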

EIP Search Administration
Search engines can make life hard or easy for search administrators. The more expensive products tend to have better interfaces. Free open source search engines require significant technical resources, not only to compile and build, but also to make any little changes in indexing rules, schedules, results page interface, and other aspects of the search engine. Some commercial search engines are similarly limited, while others, from the low-priced Phantom, MondoSearch, and dtSearch, to the high-end Inktomi and AltaVista, provide nice graphical user interfaces so search administrators can make changes without having to bother the folks with root access.

Search administration tools should provide the ability to set the indexing paths and schedules, metadata and tags, search zones or collections, synonym listings, adjustments to relevance weightings, and results page customizations. The best products allow multiple results pages for different zones and languages, each customized accordingly.

There are several options to go beyond the turnkey search engine, for example, to integrate with other applications such as personalization or complex backend databases. Free open-source search engines, notably SWISH-E, ht://Dig, mnoGoSearch, Lucene, and ASPSeek, allow programmers to make appropriate interfaces very easily. Other search engines are available in the form of code libraries, such as LexTek Onix, Lucene, MPS, and ZNOW. Many high-end turnkey search engines come with SDKs that provide an interface to the core query engine, so programmers can integrate it into the system: These include AltaVista, Inktomi, FAST Search, Verity, and IBM Intelligent Miner.

Search Log Analysis
Search logs are a great window into the minds of intranet users. A useful search log tracks each query and the number of matches it found. This makes it possible to count the 25 or 100 most popular search terms and to make sure these topics are adequately covered. It is also possible to track the most common terms that do not find matches and to address these problems by adding material that covers these topics, including the terms as metadata in appropriate pages, or adding them to the synonym listings.
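Both counts are easy to produce once the log has been parsed. The sketch below assumes the log has already been reduced to (query, number of matches) pairs; the actual log format varies by search engine, and the function name is hypothetical.

```python
from collections import Counter

def analyze_log(entries, top_n=25):
    """Return (most popular queries, most common queries with no matches)."""
    all_queries = Counter()
    no_match = Counter()
    for query, match_count in entries:
        # Normalize case and whitespace so "PTO" and "pto " count together.
        normalized = " ".join(query.lower().split())
        if not normalized:
            continue
        all_queries[normalized] += 1
        if match_count == 0:
            no_match[normalized] += 1
    return all_queries.most_common(top_n), no_match.most_common(top_n)
```

The second list is the one to act on first: each entry is a topic users want that the intranet either lacks or describes in different words.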

Conclusions
Intranet search engines provide vital access to online documents, open the resources of the institution, extend the value of research, and provide information across distance and time. For the best results, index as much data as possible, though it is important to keep the index current. Using the best practices described above, set up search forms, results pages, and relevance ranking that address the true information needs of the users, and watch the search logs to keep up with changes. Search engines are powerful tools, but they are not magic: They work best when they are fine-tuned for the requirements of the situation.


 
