Note: This article appeared in Intranet Professional, prior to its re-launch as Intranets (in 2004).
This article covers when, and why, you should consider the technology route for taxonomy development and deployment (see Cyborg Categorization: The Salvation of Search in the last issue of IP for an introduction to auto-categorization). It outlines some key variables for comparing auto-categorization software and describes product offerings from selected software vendors, with the aim of developing criteria that can help you distinguish and choose between them.
Why Consider the Technology Route?
Reasons that might lead to the purchase of auto-categorization software for an enterprise portal or intranet are as follows:
• The organization is planning a portal, content management, or search initiative, and the role and value of a browsable directory is understood.
• The organization currently classifies documents manually, but the growing volume of documents and document types that populate the intranet or portal is making the process more time consuming and costly. A related problem may be the difficulty of keeping up with demands for currency: documents may remain uncategorized for so long that business opportunities are missed, or the topical headings and resulting categorization may no longer reflect the business of the organization.
• End-users are expressing difficulty in locating the information they need due to the growing volume of documents and possible inefficiencies of the search capability.
• Categorization has been a collaborative activity that occurs throughout the organization, but a lack of consistency and a proliferation of thesauri and indexes may be creating problems.
Key Factors in Comparing Auto-Categorization Products
Vendor Background and Experience
An increasing number of newcomers are entering the field, some of them formerly focused on the related but different field of data mining. Consider how long the company has been active in auto-categorization specifically, and compare the backgrounds of the management and R&D teams. Because auto-categorization is based on scientific and mathematical principles and technologies, you will want to see evidence of expertise in the area. Some vendors have solutions that best suit very large enterprises, while others may be better suited to smaller organizations. Some vendors are tailoring their solutions to the needs of specific industries. Learn who the existing customers for the product are and compare them to your own organization in terms of industry fit and enterprise size.
The Approach
It is helpful to recall the three broad categories into which automated categorization products fall: rules-based, catalog-by-example, and statistical clustering. As Tom Reamy pointed out in his article in the January/February issue of Intranet Professional, rules-based categorization is becoming a component of all products. This leaves us to understand the differences between the latter two approaches. Why is this important? An understanding of the processes involved will assist in determining the appropriateness for your organization and provide an understanding of what your role as a user of the product will be.
Steps to Deploy an Auto-Categorization Solution for "Categorize by Example" Products
• If the organization does not already have a taxonomy, work with the vendor to develop a high-level taxonomy of categories. Business rules may be set up for each category, for example, the type or length of document eligible to be in the category.
• Training sets of "exemplary" documents that represent a category are selected, one set per category. The vendor may have efficiency tools to assist in this process. The number of documents per training set varies. The variables include the degree of accuracy sought (the more sets, the more accuracy) and the nature of the subject categories. Whatever the number, this stage is a time-consuming one for the information manager.
• The categorization software compares new documents from internal sources (file servers, networks) and external sources (Internet, news feeds) against the training sets. Every categorization product has proprietary methods and algorithms designed for this task.
• Information managers review the results and provide input. Some products, referred to as "black box" products, do not provide insight into how documents were categorized and do not make it easy to revise the results; the only way to do so may be to select and add more training sets. Some newer products, labeled "white box," permit an information manager to see how and why categorization is occurring and to intervene easily to revise the results and improve the classification algorithms.
• Over time the auto-categorization tool "learns" from the exemplary documents and from the decisions that the information manager makes.
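The steps above can be illustrated with a minimal sketch. It is not any vendor's method: commercial products use proprietary algorithms, and this example swaps in a simple word-count profile per category with cosine similarity. The category names, sample documents, function names, and the review threshold of 0.2 are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(training_sets):
    """Add up word counts across each category's exemplary documents."""
    profiles = {}
    for category, documents in training_sets.items():
        counts = Counter()
        for document in documents:
            counts.update(tokenize(document))
        profiles[category] = counts
    return profiles


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(document, profiles, review_threshold=0.2):
    """Return (best category, score, needs_review) for a new document."""
    vector = Counter(tokenize(document))
    scores = {cat: cosine(vector, profile) for cat, profile in profiles.items()}
    best = max(scores, key=scores.get)
    # Low-scoring matches are flagged for the information manager (assumed rule).
    return best, scores[best], scores[best] < review_threshold


# Hypothetical training sets: one list of exemplary documents per category.
training_sets = {
    "finance": [
        "quarterly earnings rose as bank lending grew",
        "the fund reported strong bond and equity returns",
    ],
    "pharma": [
        "the drug trial showed fewer side effects in patients",
        "regulators approved the new vaccine after clinical trials",
    ],
}
profiles = build_profiles(training_sets)
category, score, needs_review = categorize(
    "bank earnings and bond returns improved", profiles
)

# A manager's correction becomes a new exemplary document, and the
# profiles are rebuilt: a crude stand-in for the tool "learning" over time.
training_sets["pharma"].append("patients reported side effects after the trial")
profiles = build_profiles(training_sets)
```

In this sketch the scores are visible to the reviewer, which loosely corresponds to the "white box" idea; hiding them and allowing changes only through new training documents would resemble the "black box" case.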
Steps to Deploy an Auto-Categorization Solution for "Statistical Clustering" Products
• Software products in this category can use statistical analysis to create a machine-generated taxonomy from the initial set of documents. However, greater success results from starting with a predetermined high-level taxonomy—one developed by the organization or one customized by the vendor for the organization. Business rules are set up at this stage.
• Text is collected from internal and external sources.
• Working from the content of the document and using linguistic and statistical algorithms, the software identifies concepts, then either clusters them to generate a taxonomy or matches them to the predetermined taxonomy.
• Information managers review the results and suggest revisions and additional categories.
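As a rough illustration of the clustering route, the sketch below groups documents without any training sets. It assumes a very simple single-pass ("leader") clustering over word counts, which stands in for the proprietary linguistic and statistical algorithms that real products use. The stopword list, the similarity threshold of 0.25, the sample documents, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "in", "to", "for", "on"}


def vectorize(text):
    """Turn text into a word-count vector, ignoring stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(documents, threshold=0.25):
    """Assign each document to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index]}
    for index, text in enumerate(documents):
        vector = vectorize(text)
        best, best_score = None, 0.0
        for candidate in clusters:
            score = cosine(vector, candidate["centroid"])
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            best["centroid"].update(vector)  # fold the document into the cluster
            best["members"].append(index)
        else:
            clusters.append({"centroid": Counter(vector), "members": [index]})
    return clusters


def suggest_label(found_cluster, top_n=2):
    """Propose a category label from the cluster's most frequent words."""
    return [word for word, _ in found_cluster["centroid"].most_common(top_n)]


documents = [
    "drug trial results for patients",
    "bank lending rates and bond yields",
    "patients enrolled in new drug trial",
    "bond yields rise as bank lending slows",
]
clusters = cluster(documents)
labels = [suggest_label(c) for c in clusters]
```

The suggested labels are only candidate category names; as the last step above notes, an information manager would still review them and propose revisions or additional categories.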
Accuracy
Accuracy relates to precision (is the document in the right category?) and recall (have all the relevant documents been included in the category?). There is agreement that 60-70 percent accuracy is typical of primarily automated solutions. Results improve with more human intervention and depend on other variables, such as the number of categories and the homogeneity of documents and document types. Product literature is replete with hyperbole, but there are few benchmarks for comparing actual performance claims. Vendor-specific benchmarks are likely to be biased, and few organizations that have tested multiple products have made their findings public. Nor has the industry as a whole developed benchmarking tools and data. Auto-categorization products have basic differences in their approach and underlying technologies that can make it difficult to compare specific variables. As a result, it is impossible to include this criterion in a direct-comparison chart.
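If you run your own test on a sample of documents, precision and recall for a category can be computed directly from the software's assignments and a human-reviewed "correct" assignment. The sketch below assumes one category per document; the document IDs, category names, and function name are hypothetical.

```python
def precision_recall(assigned, correct, category):
    """Compute (precision, recall) for one category.

    assigned and correct each map a document ID to a single category.
    """
    placed = {doc for doc, cat in assigned.items() if cat == category}
    relevant = {doc for doc, cat in correct.items() if cat == category}
    hits = len(placed & relevant)
    # Precision: share of documents placed in the category that belong there.
    precision = hits / len(placed) if placed else 0.0
    # Recall: share of documents belonging to the category that were placed there.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Human-reviewed categories for a small sample (hypothetical data).
correct = {"d1": "finance", "d2": "finance", "d3": "finance",
           "d4": "pharma", "d5": "pharma"}
# Categories assigned by the software under test (hypothetical data).
assigned = {"d1": "finance", "d2": "finance", "d3": "pharma",
            "d4": "finance", "d5": "pharma"}

precision, recall = precision_recall(assigned, correct, "finance")
```

Here two of the three documents placed in "finance" belong there, and two of the three true "finance" documents were found, so both measures come to two-thirds for this sample.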
Integration
Auto-categorization products are not stand-alone, so it is important to know which stage your organization has reached in relation to portal technology implementation. Integration of auto-categorization tools with existing systems will be critical. Look for vendors whose solutions can be integrated, who have that experience, and who have developed partnerships with content-management and portal-solution vendors.
Workflow
One of the latest developments in auto-categorization concerns the value of human involvement in the categorization process. More products are being positioned as hybrids of manual and fully automatic categorization. This is the place to consider your organization's commitment to, and practice of, taxonomy building and maintenance. If resources for these activities are limited, you may seek a more automated approach with limited human intervention. If there is a commitment to the human approach, you will need to decide whether the responsibility will be centralized or distributed across the organization. Some of this depends on the differences between the information created and acquired by different departments and by the enterprise as a whole. If the process is to be collaborative, you will want to look at products that offer workflow capabilities.
Vendor Overview
The products from four auto-categorization vendors (GammaSite, Inxight, Quiver and Semio) are described below. The products were selected because they represent different approaches to auto-categorization and because they represent both established companies in the industry and more recent start-ups. Product literature was reviewed, interviews were conducted with vendor representatives, and, where possible, a demo of the product was viewed.
GammaSite
GammaSite is an Israel-based software company founded in 1999. It has clients in the U.K. and France. The company hopes to penetrate the U.S. market and plans to open a U.S. office in mid-2002. It has clients in the e-content publishing industry, but in its two key target markets, financial services and pharmaceuticals, it has test sites and no signed-up customers. GammaSite would like to be viewed as a scientific leader in the auto-categorization field; three members of its management team are scientists in the field of statistical machine learning. Its marketing focuses on precision and recall performance. GammaSite was the performance "winner" over five larger market players in a test of categorization software run by Encyclopedia Britannica. Unfortunately, at the present time no detailed information about the test is available.
Inxight
Inxight Software has been in business since 1996. It has over 200 customers. The company has several KM products for analyzing, organizing, categorizing, and navigating information on the Internet, intranets, and extranets. Its auto-categorization tool, Inxight Categorizer, is based on research from Xerox PARC and Xerox Research Centre Europe. The best direct market for Categorizer has been e-content publishers, including Factiva. Inxight points to its linguistic technology that understands content and context in 12 languages as a key feature. Inxight is in the process of developing Categorizer 3.0, to be released in March/April. It will include enhancements to the graphical and editorial interface for setup and training. Later in the year, the plan is to include rules-based categorization, integration with Microsoft Exchange and automatic taxonomy generation.
Quiver
Quiver started up in 1999, launching its first product, QKS Classifier, in August 2001. A key product differentiator is its goal to give users more input and control in categorization, not just at the front end but also throughout the process. There are intuitive workflow management tools that are easy to use and can be distributed across the organization. This means each content editor has the ability to create/modify a custom taxonomy, choose training set examples, and view and control retrieved documents. Quiver was named in the December 2001 issue of EContent Magazine's "Content 100: Guide to the Content Companies to Watch." Upcoming initiatives include the rollout of support for multilingual content.
Semio
Founded in 1996, Semio, like Inxight, is one of the established companies in auto-categorization. Both Semio and Inxight have been listed in KM World's "100 Companies That Matter." Semio's clients are large organizations (Eli Lilly, U.S. Postal Service, Mitre Corp.) that have correspondingly large source collections. Semio considers itself the only vendor that can offer scalability and performance. It has installations with 30,000-50,000 categories. An installation to handle hundreds of thousands of documents daily would be a medium-size job for Semio. Another important feature is Semio's library of standard thesauri expanded from industry practice (healthcare, pharmaceuticals, legal, financial services). This year, in response to requests from some of its clients, Semio plans to increase the human element of its product by introducing a feature, called Tag Taxonomy Suggesting, that will give users the option of manually overriding SemioTagger's categorization and categorizing documents themselves. It will also launch Tagger 5.0, which will change the underlying technology from batch processing to incremental processing, so that changes will occur in real time.
The preceding table lists some additional (but not exhaustive) criteria for comparing auto-categorization products. The sample of specific auto-categorization products is intended to illustrate the variations as well as the more universal elements common to these products. The high-level variables (approach and scalability) appear first, followed by criteria listed roughly in the order in which they appear in the auto-categorization process.