Note: This article appeared in Intranet Professional, prior to its re-launch as Intranets (in 2004).
[In Part 1, Tom discussed auto-categorization, what it offers a corporate intranet, and proposed a mixed human/auto-categorization model he named the "cyborg." In Part 2, Tom outlines how to build one.]
There are four distinct but interconnected phases of taxonomy building that help build the cyborg. Phase 1 is most dependent on human intervention. Phase 2 also requires human participation. Phase 3 is where automatic classifiers shine. Phase 4, which is often overlooked, counts the most.
Phase 1: Initial Taxonomy
An initial taxonomy is normally something that starts with humans. It is possible, with some products, to have a program take an initial set of documents and cluster the documents into groups based on statistical or vector space analysis. Even if you start with a machine-generated taxonomy, the results will need heavy human refinement before you can begin to make sense of them.
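To make "vector space analysis" concrete, here is a minimal sketch in Python, not the method of any particular product. It treats each document as a bag of word counts, compares documents by cosine similarity, and groups them greedily. The function names, the sample documents, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(documents, threshold=0.5):
    """Put each document in the first cluster whose seed document it resembles."""
    clusters = []  # list of (seed_vector, member_documents)
    for doc in documents:
        vector = term_vector(doc)
        for seed, members in clusters:
            if cosine_similarity(seed, vector) >= threshold:
                members.append(doc)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append((vector, [doc]))
    return [members for _, members in clusters]


docs = ["expense report form", "expense report deadline", "network firewall rules"]
print(greedy_cluster(docs))
```

The sketch groups documents purely by shared vocabulary, which is why the output of such tools usually needs human refinement.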
Machine-generated taxonomies have obvious drawbacks unless the content is very uniform in size, writing style, and use of vocabulary, and is either restricted to one topic or is very general in nature. The big drawback is that a machine-generated taxonomy does not include the existing context that a human information architect builds into the taxonomy. For example, on a corporate intranet, knowing which department created the document and who the target audience is affects document categorization. The only way a machine-generated taxonomy could find that structure is if there were identifiable statistical patterns in the vocabulary that clustered just right. A category represented by only a small set of documents would not emerge statistically, but a human would know it is an important category of information. That is not to say that taxonomy builders are not useful. They can be applied at lower levels in the taxonomy, providing guidance and some rough initial categorization of those levels.
Unless you have absolutely no idea where to start or you simply want to try an experiment in serendipitous discovery, starting with a machine-generated taxonomy is probably not a good idea.
What are the typical human activities and how can they be supported? The first step is to set a high-level taxonomy, for example, 7-12 categories, and select documents that fit into each category. Once this is done, either create rules (such as "anything in this set of URLs is a member of the category") or select a set of documents representative of the category to use as a training set. Training sets can be as small as 10 documents, although a few hundred may be needed for each category to get good results. Rules can be as simple as a set of URLs and their links or as complex as a 200-term stored query. With some products, rules can also be based on metadata.
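The simplest kind of rule mentioned above, a set of URLs that defines membership in a category, can be sketched in a few lines of Python. The category names, URL prefixes, and the function name categorize_by_url are hypothetical; real products express such rules in their own formats.

```python
from urllib.parse import urlparse

# Hypothetical rules: each category is defined by a set of URL path prefixes.
URL_RULES = {
    "Human Resources": ["/hr/", "/benefits/"],
    "Engineering": ["/eng/", "/specs/"],
}


def categorize_by_url(url, rules=URL_RULES):
    """Return every category whose prefix rule matches the URL's path."""
    path = urlparse(url).path
    return sorted(
        category
        for category, prefixes in rules.items()
        if any(path.startswith(prefix) for prefix in prefixes)
    )


print(categorize_by_url("http://intranet.example.com/hr/policies/leave.html"))
# prints ['Human Resources']
print(categorize_by_url("http://intranet.example.com/news/today.html"))
# prints []
```

A document that matches no rule comes back uncategorized, which is a natural point at which to hand it to a human or to a trained classifier.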
Phase 2: Refining the Taxonomy
Information architects will spend much time here. Usability and categorization software features are significant. Software support includes category or taxonomy building at the second and/or third level. This includes a good, easy-to-use user interface to allow an information architect to try different rules or different sets of documents and quickly see the results. The software should be able to make suggestions such as alternate categories or keywords for individual documents or for the category as a whole.
The software could have the ability to write metadata to documents based on its analysis—with human monitoring. The metadata should be editable by the information architects who are responsible for selecting which metadata gets created and saved. Automatically generated system metadata fields are a plus as long as additions can be made to the set.
Automatic summarization is a critical feature, although it is normally presented as a component of search results. It typically takes important sentences or phrases and builds a summary. For example, the first sentence of the first content paragraph is almost always included along with the last sentence of the first paragraph. Paragraphs in which the search term appears are often included. Such a summary is useful in presenting search results and is better than just a snippet of the first 200 characters. To give a really good idea of what is in a document, more than a few sentences are needed, but if users are given one or two full paragraphs for each result, the results list becomes unwieldy. A good automatic summary also aids the information architect. It can provide enough to check on the appropriateness of an automatic categorization, or to test a preliminarily assigned category for semantic distance from the rest of the documents in the category.
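The sentence-selection heuristic described above can be sketched as follows. This is an illustration under stated assumptions, not how any particular product works: paragraphs are assumed to be separated by blank lines, sentences are split naively on end punctuation, and only one matching sentence is taken from each later paragraph. The function names and the max_sentences parameter are hypothetical.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p.strip() for p in parts if p.strip()]


def summarize(text, search_term, max_sentences=4):
    """Build an extractive summary from selected sentences."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return ""

    # First and last sentence of the first paragraph.
    first = split_sentences(paragraphs[0])
    picked = [first[0]]
    if len(first) > 1:
        picked.append(first[-1])

    # One sentence from each later paragraph that mentions the search term.
    term = search_term.lower()
    for paragraph in paragraphs[1:]:
        for sentence in split_sentences(paragraph):
            if term in sentence.lower() and sentence not in picked:
                picked.append(sentence)
                break

    return " ".join(picked[:max_sentences])


doc = (
    "Travel must be approved in advance. Use the online form. "
    "Late requests are rejected.\n\n"
    "Expense reports are due monthly. Receipts are required.\n\n"
    "Contact the travel desk with questions about expense limits."
)
print(summarize(doc, "expense"))
```

The result is a set of selected sentences, not a paraphrase, which is exactly the limitation discussed next.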
We will have to wait for a real summary, that is, a summary that is a paraphrase, not just selected sentences. A good interim solution for corporate intranets is to have humans enter a short one- to two-line description of a page or document and put it in a description meta tag that appears in the search results. Larger summaries or abstracts could be generated and stored in a summary or abstract metadata field by the authors or subject matter experts.
Another valuable feature in this phase is workflow. Having a good workflow capability built into the software can enable a more flexible approach to categorization by utilizing task distribution among central information architects and a variety of subject matter experts who can provide initial categorization. Rather than looking for more ways to replace human efforts, this approach looks for more ways to support and enhance the human effort. It can provide for collaborative filtering applied before presenting the results to users. Some products not only support collaborative categorization, they support ranking documents as the best of the category so the best always appear at the top of a search list.
A final wish-list item is a facility to present relationships among documents, terms, and keywords not only textually, but also in a variety of visual modes. A hyperbolic tree, for example, can visually present the words and weights used to auto-categorize the document. Another option is to visually present the semantic relationships among terms in the document and between those terms and a semantic framework.
Phase 3: Maintaining the Taxonomy
This is typically where auto-categorization vendors focus their efforts. Their products are useful for categorizing a large number of similar documents and/or categorizing massive amounts of Web pages for a content aggregator. However, the situation is different for intranets. The bad news is there is highly disparate content coming in each day, including short updates to HTML pages, whole Word documents of 200 pages, a spreadsheet or two, and so on. The good news is that humans are already working on the documents, and not only are they a known entity, the humans work for the same company. This means the economic equation is skewed toward more human involvement, especially if it can be intelligently supported within an existing workflow. A human author, supported by an existing taxonomy and writing for a particular department or group Web site for a particular audience, already has most of the context needed for a reasonable first categorization of the author's own documents. In many cases, the author provides a categorization that will be perfectly suited to the audience.
There are still multiple roles for categorization software. For example, an auto-categorization feature could suggest both keywords and an initial categorization for new or changed content. This would be valuable if it is combined with a noun phrase extraction facility and the whole thing takes place within a controlled vocabulary.
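As a rough sketch of keyword suggestion that stays within a controlled vocabulary, the Python below maps variant phrases found in the text to preferred terms and ranks the preferred terms by how often they were matched. The vocabulary entries and the function name suggest_keywords are hypothetical, and simple phrase matching stands in for real noun phrase extraction.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: variant phrase -> preferred term.
CONTROLLED_VOCABULARY = {
    "expense report": "Expense Reporting",
    "expense reports": "Expense Reporting",
    "reimbursement": "Expense Reporting",
    "trip": "Travel",
    "travel": "Travel",
}


def suggest_keywords(text, vocabulary=CONTROLLED_VOCABULARY, limit=3):
    """Suggest preferred terms, most frequently matched first."""
    normalized = re.sub(r"\s+", " ", text.lower())
    counts = Counter()
    for variant, preferred in vocabulary.items():
        # Match the variant only as a whole word or phrase.
        pattern = r"\b" + re.escape(variant) + r"\b"
        counts[preferred] += len(re.findall(pattern, normalized))
    return [term for term, count in counts.most_common(limit) if count > 0]


text = (
    "Submit an expense report after each trip. Reimbursement takes two weeks. "
    "Expense reports need receipts."
)
print(suggest_keywords(text))
# prints ['Expense Reporting', 'Travel']
```

Because every suggestion is a preferred term from the vocabulary, authors never see free-form keywords that an information architect would later have to reconcile.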
The categorization function needs to support provisional categorization suggestions, passing these to a human editor or information architect for review. It also needs to support the reverse: a human suggests a category and runs it through an automatic checker that flags it if something does not fit. The software should be able to learn, that is, get better at suggesting or checking categorization, based on a human tutor.
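The three roles just described (suggesting, checking, and learning from a human tutor) can be illustrated with a toy term-overlap checker. This is a deliberately simple stand-in for a real classifier; the class name, its methods, and the 0.2 margin are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


class CategoryChecker:
    """Toy term-overlap checker; a stand-in for a real classifier."""

    def __init__(self):
        # category -> terms seen in documents a human has confirmed
        self.profiles = defaultdict(Counter)

    @staticmethod
    def _terms(text):
        return re.findall(r"[a-z]+", text.lower())

    def learn(self, text, category):
        """Record a human-confirmed categorization (the tutor step)."""
        self.profiles[category].update(self._terms(text))

    def score(self, text, category):
        """Fraction of the document's terms already known for the category."""
        terms = self._terms(text)
        if not terms:
            return 0.0
        profile = self.profiles.get(category, Counter())
        return sum(1 for t in terms if t in profile) / len(terms)

    def suggest(self, text):
        """Provisional suggestion: the best-scoring category, or None."""
        if not self.profiles:
            return None
        return max(self.profiles, key=lambda c: self.score(text, c))

    def check(self, text, proposed, margin=0.2):
        """Flag the proposal if another category scores higher by more than margin."""
        best = self.suggest(text)
        if best is None or best == proposed:
            return False
        return self.score(text, best) - self.score(text, proposed) > margin


checker = CategoryChecker()
checker.learn("vacation leave policy and benefits enrollment", "HR")
checker.learn("server deployment and network firewall configuration", "IT")

new_doc = "firewall configuration for the new server"
print(checker.suggest(new_doc))      # prints IT
print(checker.check(new_doc, "HR"))  # prints True (flagged for review)
```

Each call to learn with a human-confirmed category enlarges that category's profile, so later suggestions and checks reflect the tutor's decisions.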
As in Phase 2, a distributed workflow model may well be the best overall method for corporate intranets. This is particularly true when using a content-management system in which the categorization piece can be integrated. For example, Interwoven makes it part of the normal publishing process to check if a document needs new or changed metadata. There is also software that can be integrated into the process to suggest values for some fields, such as keywords.
Phase 4: Applying the Taxonomy
Little attention has been given to this area, but it is the most important phase. In this phase, the real value of categorization, whether automatic, humanatic, or cyborgian, is realized. There are a variety of ways in which categorization and search can and should be integrated. As in the other three phases, finding the right balance of automatic and human categorization will be the primary challenge.
The first integration point involves setting up a browse and search facility à la Yahoo!. This provides for drilling down into categories and doing a qualified search at any level. When users search, the results list displays category information and/or lists results by category and allows new browsing from those results.
There are two other areas in which categorization can be utilized. The first is software that clusters or categorizes in real time. Straight out of the box, these kinds of clusters can be useful but are rarely consistent. Instead of just statistical clusters of co-occurring terms, real-time categorization works better if results are grouped by a high-level category, including designated best bets, plus clustering around a controlled vocabulary. The second area of integration involves using categorization to support collaborative filtering. The software learns from user behavior and uses that knowledge to improve its category suggestions. The software monitors search requests, tracks how long people spend on a search, which avenues they try (what is the balance between browse and search), what results are selected as hits within each category, and so on. This can be used to develop better and richer sets of related categories that can be offered as options for browsing from search results.
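Presenting results by high-level category with designated best bets first can be sketched as below. The result records, the best_bets set, and the function name group_results are hypothetical; a real search engine would supply its own result format and ranking.

```python
from collections import defaultdict


def group_results(results, best_bets):
    """Group search hits by category, listing designated best bets first."""
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit["category"]].append(hit)
    # sorted() is stable, so hits that are not best bets keep their rank order.
    return {
        category: sorted(hits, key=lambda h: h["url"] not in best_bets)
        for category, hits in grouped.items()
    }


results = [
    {"url": "/hr/leave-faq.html", "category": "HR"},
    {"url": "/it/vpn.html", "category": "IT"},
    {"url": "/hr/leave-policy.html", "category": "HR"},
]
best_bets = {"/hr/leave-policy.html"}

for category, hits in group_results(results, best_bets).items():
    print(category, [h["url"] for h in hits])
```

In this sketch the best bets are chosen by people, while the grouping is automatic, which is one small instance of the human and machine balance this phase calls for.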
Seven Lessons Learned
1. Out of the box? (out of your mind!)
• While there have been very significant new developments in the auto-categorization arena, the whole field is still young. No one software package has everything and the integration issues are not trivial. You will have to customize the solution to put it on a corporate intranet with the varied content and specialized vocabularies.
2. Needs to learn to play well with others.
• Auto-categorization should be integrated over the four phases of taxonomy building, with other software, particularly search and content management, and with statistics of browse/search behavior, including both explicit and implicit collaborative filtering.
3. Cyborg brain surgery for fun and profit.
• Cyborg categorization is better than either auto-categorization or human categorization, but a cyborg on which brain surgery cannot be performed is only half a cyborg. In other words, the automatic component of the categorization solution should be a white box that can be tuned with more than simply selecting training sets. It needs to be able to learn.
4. The world revolves around you.
• The current trend toward providing a large framework of world knowledge, whether in the form of a semantic or subject-matter framework or using the entire intranet as a training set, is a major addition to categorization. It is on a par with machine learning, and both work because they add new and rich contexts to what used to be a simple, statistically generated bag of words.
5. Quality counts and size matters (but not as much as you think).
• Particularly in a distributed workflow environment where untrained or partially trained authors categorize the content, the quality of automatic or machine-based categorization can be significant, but not as significant as features like ease of use, integration with human categorization, and the ability to adjust settings such as the balance between precision and recall.
6. Let a hundred flowers bloom.
• The best answer to human categorization support is a distributed workflow that is integrated with a content-management system. Authors who work on their own documents and provide an initial categorization, aided by a machine-generated suggestion or machine review, distribute the cost and effort in a way that is manageable and that takes into account the strengths of both machine and human. A final human review, made in a central repository, completes the process. The central group should maintain the repository and be the source of categorization training for authors (or flowers or whatever you want to call them).
7. The end.
• Finally, remember categorization is not an end in itself. No matter how sophisticated the algorithm, no matter how big the training sets, no matter how much world knowledge is brought to bear, and no matter how well librarians and information architects like it, the real value of categorization is how it enhances the user experience by supporting all forms of search behavior and knowledge discovery.