Cyborg Categorization Part 1—The Salvation of Search?
Jan/Feb 2002 Issue. Posted Jan 1, 2002.

Note: This article appeared in Intranet Professional, prior to its re-launch as Intranets (in 2004).


[This begins a two-part series on cyborg categorization. In Part 1, Tom Reamy discusses auto-categorization and what it can offer a corporate intranet, and makes a case for a mix of human and auto-categorization. Part 2, to be published in the March/April issue, will discuss lessons learned from the categorization evaluation and the cyborg approach.]

In the last year there has been an explosion of categorization companies. Two years ago the choice was largely Autonomy or Semio. Now there are products to evaluate from Verity, Inxight, TopicalNet, Mohomine, Simile, H5Technologies, Metagger, Applied Semantics, Sageware, SmartLogik, GammaSite, Quiver, and Purple Yogi. Two things are clear from this large and growing list:

• There is a widespread realization that traditional approaches to search need something more, and one overwhelming need is the ability to categorize the hundreds of thousands of documents showing up on Internet and intranet sites and in bloated search results lists.
• Either all the conventional company names have been taken or there is something about this product space that brings forth the whimsical in naming a company.

The need for categorization is evident in a variety of ways, from anecdotes of frustrated users to research papers, like the one from Forrester subtitled, "Must Search Stink?" Search is failing to deliver either the right set of documents (all the relevant ones and no more) or the answer the user was attempting to find.

A growing consensus has been forming that what is needed is a taxonomy to add structure and intelligence to the huge collections of unstructured content. This means developing a categorization schema and then categorizing that huge store of documents both before and after a search is initiated. It can be very costly to have a large team of humans do that categorization. Given this, companies start to look more closely at software that can lower the cost and labor to develop taxonomies: auto-categorization and/or hybrid automatic and human categorization. According to one vendor, organizations used to focus their need on the search engine with some limited built-in categorization functionality; now the reverse occurs: Customers want the categorization component and vendors throw in a search foundation.

Auto-Categorization: From News Feeds to Corporate Intranets
The first generation of companies to try to enhance search by adding auto-categorization has had limited success. Their first market was news and content providers. In this limited arena, three factors enabled auto-categorization to function reasonably well. First, the content was of a fairly uniform size and structure. Second, professionals wrote the content. Third, either a controlled vocabulary of terms was fairly easy to generate, or the subject was general enough that no specialized vocabulary was needed.

Once these first-generation auto-categorizers were applied to other unstructured content, the results were less than optimal. On a corporate intranet, documents can range in size from a three-line description of a new idea to a 200-page PDF on new regulatory requirements. It is difficult enough to rate the relevance of two such documents in a search, but trying to build categories on the basis of such wildly varying sizes is even more difficult. Also, corporate intranet content can be written by editors, writers, business experts, legal experts, programmers, Web developers, and so on. This means a wide range of skill levels and a wide range of writing styles and structures. Some auto-categorization products can make inferences based on structure, but when faced with this variety of structure and idiosyncratic use of structural elements, the results are disappointing. Finally, intranets not only have specialized vocabularies that can include heavy use of acronyms and other jargon, but they have a wide variety of them, all co-existing in a Tower of Babel harmony. In some cases the different vocabulary dimensions can simply exist side by side, but often there are areas of overlap where term x in a fairly standard HR vocabulary refers to term y in a customer-facing how-to. The dramatic increase in complexity in a corporate intranet over news feeds and other uniform content doesn't necessarily mean that auto-categorization can't be applied, but it does mean that there will be more setup, more customization, and more human involvement.

Approaches to Auto-Categorization
Typically there are three approaches to auto-categorization: rules-based, catalog by example, and statistical clustering. One trend is to combine two or more approaches along with more support for the human component.

• Rules-based categorization is essentially sets of IF-THEN rules established by human editors, information architects, or subject matter experts. Rules-based categorization is rapidly becoming a component of all products. Verity's Intelligent Classifier and Inktomi's CCE are examples of this approach.
• Catalog by example uses training sets to teach the program to recognize whether or not a document belongs to a particular category. Humans select these training sets. The software finds patterns within a "bag of words" that define that category. Examples of this approach are Mohomine's MohoClassifier, Inxight's Categorizer, and Autonomy.
• Statistical clustering, using co-occurrence of terms or neural networks, finds clumps or clusters of more closely related documents and assigns them to a category. This is really the only truly automatic classification approach, since the other two require humans to set up rules or training sets first. However, it can be employed in conjunction with human editors and/or pre-existing taxonomies (and the results are usually better if it is). Examples in this area are Semio, Autonomy, and Mohomine.
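
The three approaches above can be illustrated with a toy Python sketch. Nothing here reflects how any of the named products actually work; the categories, rules, training documents, similarity measure (cosine similarity over word counts), and the 0.3 clustering threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into words (a "bag of words")."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Rules-based: IF-THEN rules written by a human editor (hypothetical rules).
RULES = [
    (lambda words: "benefits" in words or "payroll" in words, "HR"),
    (lambda words: "invoice" in words and "customer" in words, "Billing"),
]


def categorize_by_rules(text):
    """Return every category whose rule fires for this document."""
    words = set(tokenize(text))
    return [category for condition, category in RULES if condition(words)]


# 2. Catalog by example: humans choose training documents for each category;
#    the program builds a word profile per category from them.
def train(examples):
    """Map each category to the combined word counts of its training docs."""
    profiles = {}
    for category, docs in examples.items():
        counts = Counter()
        for doc in docs:
            counts.update(tokenize(doc))
        profiles[category] = counts
    return profiles


def categorize_by_example(text, profiles):
    """Assign the document to the category with the most similar profile."""
    doc = Counter(tokenize(text))
    return max(profiles, key=lambda category: cosine(doc, profiles[category]))


# 3. Statistical clustering: no rules or training sets; documents that share
#    enough terms are grouped together (simple single-pass clustering).
def cluster(docs, threshold=0.3):
    """Return clusters as lists of document indexes."""
    clusters = []
    for index, doc in enumerate(docs):
        vector = Counter(tokenize(doc))
        for c in clusters:
            if cosine(vector, c["centroid"]) >= threshold:
                c["members"].append(index)
                c["centroid"].update(vector)  # fold the doc into the cluster
                break
        else:
            clusters.append({"members": [index], "centroid": Counter(vector)})
    return [c["members"] for c in clusters]
```

Note that the clustering function returns only unnamed groups of documents. Deciding what each group should be called, and whether it belongs in the taxonomy at all, is still left to a person.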

A variation and an advance on the catalog by example method is a technology called support vector machines (SVM), which uses machine learning. A more sophisticated representation of the relationships between words and documents, combined with an ability to learn, seems to yield results that are fast and accurate. Verity and GammaSite use this approach.

Recently a couple of companies have begun offering a new alternative to aid in the creation of an initial taxonomy: providing a rich context of documents and/or meanings, that is, starting with predefined world knowledge rather than a blank slate. For example, TopicalNet has developed a complete starting taxonomy by using the Internet as a training set. Applied Semantics has built a 1.2 million-term hierarchical representation of world knowledge. H5Technologies developed a 400,000-word categorized vocabulary, against which it matches terms from a document and produces a bar code of categories.
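
The idea of matching a document's terms against a categorized vocabulary can be sketched in a few lines of Python. This is only a rough illustration of the general technique, not a description of any vendor's product: the six-term vocabulary, the category names, and the function name are hypothetical, and the resulting category counts are only a loose analogue of a "bar code of categories."

```python
import re
from collections import Counter

# Hypothetical categorized vocabulary (term -> category). A real one would
# hold hundreds of thousands of terms; these entries are made up.
VOCABULARY = {
    "payroll": "Human Resources",
    "benefits": "Human Resources",
    "invoice": "Finance",
    "budget": "Finance",
    "server": "IT",
    "password": "IT",
}


def category_profile(text, vocabulary=VOCABULARY):
    """Count how many of the document's words fall into each category."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(vocabulary[word] for word in words if word in vocabulary)
```

Words missing from the vocabulary are simply ignored here, which hints at why the scale of the vocabulary matters so much for this kind of approach.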

The trend toward more humanlike methods (machine learning and the use of world knowledge) recalls the early days of AI. Back then, it was often claimed that all that was needed was massive speed or a flexible learning approach like neural networks, and a computer could intelligently interact with the world without having the rich set of contexts that constitutes world knowledge.

It's still early, but these approaches will ultimately succeed more than simple statistical clustering or catalog by example. Another point about these new approaches is the massive scale that is needed to succeed. This is an indication of how much world knowledge we all bring to every human task. However, if the question is whether any of these approaches is sufficient by itself, the answer is clearly no. The real question is: what is the best way to combine them with each other and with the one necessary component, the human component? Before reviewing how these existing products combine approaches, let's take a look at the strengths and weaknesses of automatic and human categorization.

Automatic versus Humanatic Categorization
Well, "humanatic" isn't a real word, but it should be. The standard answer to human versus auto-categorization is that humans are precise but slower and more expensive, and auto-categorization is fast and scalable but imprecise, especially in terms of relevancy. It's more complicated than that, but for a generalization it's not bad. I've argued elsewhere ("From Information Architecture to Knowledge Architecture," IP, Sept/Oct. 2001) that knowledge architecture is information architecture plus intellectual, personal, and social contexts. When it comes to categorization, humans bring their knowledge of these contexts to the task, that is, they can and do base decisions on contexts outside the information in the document. They can, at a glance, understand subtle conceptual nuances that escape a program. They can also bring to bear an understanding of the context of the document: the purpose of the document, related ideas from other documents not present, what similar documents are used for, and what that implies for the purpose of this one.

Humans are definitely not as consistent as machines, but they do a much better job than machines in assigning documents to the right general category. Even if humans make mistakes, they tend to be mistakes that are understandable by other humans, whereas automatic categorizers can and do make mistakes that no one can understand. This might not seem important until you factor in such things as user acceptance. If users lose confidence in the browse facility, it will be hard to regain that confidence, and humans remember odd, inexplicable events much more strongly than wrong but understandable events. So, even with the caveat cited above, it is safe to say that humans do produce higher-quality categorization, i.e., categorization that is more accurate and contains richer, multiple contexts of related content.

On the other side of the equation are time and cost. There is no doubt that computers are faster than humans when it comes to most things, and categorization is no exception. In evaluating the costs, the situation is more complex. Even in relation to time, one needs to factor in the user's time. For example, if you use three human categorizers for 40 hours at $80 an hour, you have a cost of $9,600. Let's say on the other side, you have an auto-categorization product that cuts the human effort to one-half person for 4 hours, for a cost of $160. Quite a savings, even if you throw in a total software cost of $200,000 spread over a 2-year period, or about $2,000 a week. However, if the quality of the end result is significantly poorer, the cost goes way up. So, let's take a very conservative hypothetical. We have 20,000 users who each take 60 seconds longer on average to find information using the auto-generated taxonomy (spread out over a week's worth of user sessions). That is about 333 hours of lost time, and at the same $80 an hour the cheap solution has suddenly cost the company $26,667. This doesn't count the cost of not finding information, a cost that can be significantly higher but is very hard to quantify.
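
The arithmetic in this example can be checked with a short Python sketch. The figures are the ones given in the paragraph above; the function and variable names are illustrative, and valuing users' time at the same $80 an hour as the categorizers is an assumption (it is the rate that reproduces the $26,667 figure). The weekly software cost works out to roughly $1,923, which the text rounds to about $2,000.

```python
HOURLY_RATE = 80  # dollars per hour, from the example; also applied to users (assumption)


def labor_cost(people, hours, rate=HOURLY_RATE):
    """Cost of a categorization effort: people x hours x hourly rate."""
    return people * hours * rate


def lost_time_cost(users, extra_seconds, rate=HOURLY_RATE):
    """Cost of every user spending extra_seconds longer finding information."""
    lost_hours = users * extra_seconds / 3600
    return lost_hours * rate


human_cost = labor_cost(3, 40)              # 9600
auto_labor_cost = labor_cost(0.5, 4)        # 160.0
software_per_week = 200_000 / (2 * 52)      # roughly 1923 dollars a week
user_time_cost = lost_time_cost(20_000, 60) # rounds to 26667

# Total for the automatic route once the users' lost time is included.
auto_total = auto_labor_cost + software_per_week + user_time_cost

print(f"Human categorization: ${human_cost:,.0f}")
print(f"Auto-categorization:  ${auto_total:,.0f}")
```

Under these assumptions the automatic route comes to about $28,750 against $9,600 for the human team, and almost all of the difference is the users' lost time. The result is very sensitive to the 60-second figure: at 15 seconds per user, the lost time would cost about $6,667 instead.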

At best, it is an open question whether human categorization or auto-categorization is cheaper. As noted, there is no doubt that auto-categorization is faster, and this has been a major selling point for auto-categorization vendors. However, the cost debate betrays the influence of the first (and still the best) market for auto-categorization: news or content providers who must process thousands of stories or tens of thousands of Web pages a day.

On a corporate intranet, the situation is different and so the cost equation is different. Your customers are part of your company, so you can't pass on the cost. What impacts your customers impacts your bottom line, so unless the various departments or enterprises are virtually at war with each other, it makes no sense for the information architecture team to pass on its costs to the sales force or corporate support.

The Real Question Is, "How Do You Build Your Cyborg?"
There seems to be a growing consensus that it's neither automatic nor humanatic categorization that is the answer. So the question becomes, what is the best way to merge the two? Stay tuned for the answer in the March/April issue.


 
